I have a university programming exam coming up, and one section is on Unicode.
I have checked all over for answers to this, and my lecturer is no help, so this is a last resort; I'm hoping you guys can help.
The question will be something like:
The string 'mЖ丽' has these Unicode codepoints: U+006D, U+0416, and U+4E3D. With answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16.
Any help at all will be greatly appreciated as I am trying to get my head round this.
Best Answer
Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage found on many Linux systems:
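Here is the encoding chart from that manpage (the x bits carry the codepoint's own bits):

```
0x00000000 - 0x0000007F:  0xxxxxxx
0x00000080 - 0x000007FF:  110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
```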
It might be easier to remember a 'compressed' version of the chart: initial bytes of mangled codepoints start with a 1 and add padding of 1+0 (one leading 1 per byte in the sequence, then a 0); subsequent bytes all start with 10. You can derive the ranges by taking note of how much space you can fill with the bits allowed in each representation, as the sketch below shows:
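Here is a small Python sketch (my addition, not part of the original answer) that derives those range boundaries from the payload-bit counts: the leading byte of an n-byte sequence keeps 7 - n payload bits (7 for the single-byte form), and each continuation byte contributes 6 more.

```python
# Payload bits: a 1-byte sequence keeps 7; an n-byte sequence (n >= 2)
# keeps (7 - n) bits in the leading byte plus 6 per continuation byte.
for n in range(1, 5):
    lead = 7 if n == 1 else 7 - n
    payload = lead + 6 * (n - 1)
    print(f"{n} byte(s): {payload:2d} payload bits -> "
          f"codepoints up to U+{2**payload - 1:04X}")
```

Running it prints 7, 11, 16, and 21 payload bits, giving upper bounds U+007F, U+07FF, U+FFFF, and U+1FFFFF, which match the chart's ranges.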
I know I could remember the rules for deriving the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
Take U+4E3D (丽), the third codepoint. It fits in the 0x00000800 - 0x0000FFFF range (0x0800 ≤ 0x4E3D ≤ 0xFFFF), so the representation will be of the form 1110xxxx 10xxxxxx 10xxxxxx.

0x4E3D is 100111000111101 in binary. Drop the bits into the x slots above (start from the right; we'll fill in any missing bits at the start with 0):

1110x100 10111000 10111101

There is an x spot left over at the start; fill it in with 0:

11100100 10111000 10111101

Convert from bits to hex: 0xE4 0xB8 0xBD.
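If you want to check your manual work, here's a small Python sketch (my addition, not part of the original answer; the helper utf8_encode is just an illustrative name). It encodes each codepoint by hand following the chart above, compares the result against Python's built-in encoder, and also prints the UTF-16 big-endian bytes the exam question asks for:

```python
def utf8_encode(cp: int) -> bytes:
    """Manually encode one codepoint (up to U+FFFF here) per the chart above."""
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("codepoints above U+FFFF need the 4-byte form")

for ch in 'mЖ丽':
    manual = utf8_encode(ord(ch))
    assert manual == ch.encode('utf-8')  # sanity check against the built-in
    print(f"U+{ord(ch):04X}: UTF-8 {manual.hex(' ').upper()}, "
          f"UTF-16BE {ch.encode('utf-16-be').hex(' ').upper()}")
```

For the string in the question this prints UTF-8 bytes 6D, D0 96, and E4 B8 BD, and UTF-16BE bytes 00 6D, 04 16, and 4E 3D (all three codepoints sit in the Basic Multilingual Plane, so their UTF-16 code units are just the codepoints themselves).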