In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for
My objective is to get a unicode string containing everything in Unicode, which is to say
Русский ек (
Encoding it to UTF-8 yields
>>> a.encode('utf-8') '\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:
>>> a.encode('utf-8').decode('utf-8') u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a) "u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'" >>> eval(repr(a)[1:]) '\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba' >>> s = eval(repr(a)[1:]).decode('utf8') >>> s u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a' # Almost there, the bytes are proper now but the former real-unicode characters # are now escaped with \u's; need to un-escape them. >>> import re >>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s) u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of
repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?