C++ – Unicode std::string class replacement


I'm looking for suggestions regarding unicode aware std::string library replacements. I have a bunch of code that uses std::string, its iterators etc, and would like to now support unicode strings (free or open source implementations preferred, regex capabilities would be great!).

I'm not sure at this point if I require a complete rewrite or if I can get away with dropping in a new string library that supports all of the std::string interfaces. The unicode world seems very complex and I'm just wanting to enable it in my applications not have to learn every single aspect of it.

btw how does the index operator work when it has to pass back a reference to either a 1, 2,3 or 4 structure which could in theory change to either a 1,2,3 or 4 byte structure. if a larger or smaller sized value is passed, does the shifting back and forth of the internal data representation occur insitu?

Best Solution

You don't need a complete rewrite if you make sure about what your std::string contains. For example, you could assume (and convert inputs to be sure) that your std::string contain UTF8 encoded strings (for those that need localization). Don't forget that std::string is only a container of raw data, it's not associated with an encoding (even in C++0x, it's only a possibility, not a requirement).

Then when you pass text to other libraries that require different encodings, you can use libraries like UTF8CPP to convert to the required encoding (but most of the time such libraries will do it themselves).

That way makes it simple. UTF8 with standard std::string in your code, enabling passing unicode string to everything else (with conversion if necessary).

There have been a lot of discussions about this in the boost community mailing list. Maybe reading it (if you have enough time...) can help you understand other possible solutions.