I am writing a Java text component and am trying to load part of a large text file, starting somewhere in the middle (for speed reasons).
My question: if the text is in a multi-byte encoding such as UTF-8, Big5, or GBK, how can I align the bytes so that I can decode the text correctly?
Best Solution
I can't speak for the other formats, but UTF-8 shouldn't be too hard.
Just look at the first byte of the chunk you grabbed and figure it out from there:
Taken from Wikipedia:
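If the byte is in the 2nd or 3rd group, then you know you missed part of a character. If it's in the 1st, 4th, 5th, or 6th group, then you know you are at the start of a character. Proceed accordingly from there. (In UTF-8, a continuation byte, one that can never start a character, always has the bit pattern 10xxxxxx, i.e. 0x80 to 0xBF.)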
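Here is a minimal sketch of that check in Java, assuming the chunk has already been read into a byte array. The class and method names (`Utf8Align`, `firstCharStart`, `decodeAligned`) are made up for illustration. It only tests for continuation bytes (10xxxxxx) and skips them. It does not validate the rest of the data, and it only handles UTF-8, not Big5 or GBK.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Align {
    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Index of the first byte in the chunk that is not a continuation byte,
     * i.e. the first position where a character can start.
     * Returns chunk.length if every byte is a continuation byte.
     */
    static int firstCharStart(byte[] chunk) {
        int i = 0;
        while (i < chunk.length && isContinuation(chunk[i])) {
            i++;
        }
        return i;
    }

    /** Decodes the chunk as UTF-8, dropping the partial character at the front. */
    static String decodeAligned(byte[] chunk) {
        int start = firstCharStart(chunk);
        return new String(chunk, start, chunk.length - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a\u20ACb" is 61 E2 82 AC 62 in UTF-8; this chunk starts inside
        // the 3-byte character U+20AC, so its lead byte E2 is missing.
        byte[] chunk = {(byte) 0x82, (byte) 0xAC, 0x62};
        System.out.println(firstCharStart(chunk)); // prints 2
        System.out.println(decodeAligned(chunk));  // prints b
    }
}
```

Keep in mind that the end of the chunk can cut a character in half as well. With the `String` constructor used here, an incomplete trailing sequence is decoded as the replacement character U+FFFD, so you may want to trim the tail or read a few extra bytes before decoding.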