Java – Partially load large text file with different encodings


I am writing a Java text component and am trying to load only part of a large text file, starting somewhere in the middle (for speed reasons).

My question: if the text is in a multi-byte encoding such as UTF-8, Big5, or GBK, how can I align the bytes so that I can decode the text correctly?

Best Solution

I can't speak for the other formats, but UTF-8 shouldn't be too hard.

Just look at the first byte of the chunk you grabbed and work out where you are from there.

Taken from Wikipedia:

Binary              Hex     Decimal Meaning
00000000-01111111   00-7F   0-127   US-ASCII (single byte)
10000000-10111111   80-BF   128-191 2nd, 3rd, or 4th byte of a multi-byte sequence
11000000-11000001   C0-C1   192-193 Start of a 2-byte sequence, but code point <= 127 (overlong, so invalid in UTF-8)
11000010-11011111   C2-DF   194-223 Start of a 2-byte sequence
11100000-11101111   E0-EF   224-239 Start of a 3-byte sequence
11110000-11110100   F0-F4   240-244 Start of a 4-byte sequence

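If the byte is in the 2nd group, you know you missed part of a character: skip forward (at most 3 bytes) until you reach a byte outside that group. A byte in the 3rd group never occurs in valid UTF-8, so seeing one means the data is malformed or not UTF-8 at all. If it's in the 1st, 4th, 5th, or 6th group, you know you are at the start of a character. Proceed accordingly from there.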
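
As a rough sketch of the check described above, here is one way it might look in Java. The class name Utf8Align and its methods (isContinuation, sequenceLength, firstCharStart, decodeFrom) are made up for illustration and are not part of any library. It assumes the chunk really is UTF-8 and has already been read into a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not a library class): aligns a chunk to a UTF-8 character boundary.
final class Utf8Align {

    // 10xxxxxx (0x80-0xBF) is the 2nd, 3rd, or 4th byte of a multi-byte sequence.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Length of the sequence that this byte starts, following the table above.
    // Returns -1 for bytes that cannot start a character (continuation or invalid bytes).
    static int sequenceLength(byte b) {
        int v = b & 0xFF; // treat the signed Java byte as 0-255
        if (v <= 0x7F) return 1;
        if (v >= 0xC2 && v <= 0xDF) return 2;
        if (v >= 0xE0 && v <= 0xEF) return 3;
        if (v >= 0xF0 && v <= 0xF4) return 4;
        return -1;
    }

    // Index of the first byte that is not a continuation byte.
    // Only the first 4 bytes are examined, because a valid sequence has at most
    // 3 continuation bytes. Returns -1 if none of the examined bytes qualifies.
    static int firstCharStart(byte[] chunk) {
        int limit = Math.min(chunk.length, 4);
        for (int i = 0; i < limit; i++) {
            if (!isContinuation(chunk[i])) {
                return i;
            }
        }
        return -1;
    }

    // Decodes the chunk from the first character boundary onward.
    static String decodeFrom(byte[] chunk) {
        int start = firstCharStart(chunk);
        if (start < 0) {
            throw new IllegalArgumentException("no UTF-8 character start found");
        }
        return new String(chunk, start, chunk.length - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two 3-byte characters: E6 97 A5 E6 9C AC
        byte[] full = "\u65E5\u672C".getBytes(StandardCharsets.UTF_8);
        // Drop the first byte so the chunk starts in the middle of a character.
        byte[] chunk = Arrays.copyOfRange(full, 1, full.length);
        System.out.println(firstCharStart(chunk)); // prints 2
    }
}
```

Note that the end of the chunk can cut a character in half as well. The String constructor used here substitutes U+FFFD for a truncated trailing sequence, so you would probably want to read a few extra bytes past the end, or use sequenceLength on the last lead byte to see whether its sequence is complete.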