R – How to correctly display Japanese RTF Fonts

delphifontsrtfunicode

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.

Typically, English and Japanese is mixed and is mostly displayed without a problem, for example:

Inventory turns partnerships.  在庫回転率の

(my apologies if the Japanese text is broken incorrectly – I do not speak or read the language).

Quite frequently however, only the Japanese portion of the text will be gibberish, for example:

ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優  先順位と彼らに販売する知識)

From extensive online searching, it appears that the problem is as a result of the fonts saved as part of the RTF. Fonts present on Japanese language version of Windows is not necessarily the same as a US English version. It is possible to programmatically replace the fonts in the RTF file which yields an almost acceptable result, i.e.

-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB

However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:

-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?

Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.

Is there a generic, (relatively) foolproof way to take RTF containing Eastern Languages and reliably displaying it again?


For completeness sake, I updated the RTF font table in the following way:

  • Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
  • Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
  • Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
  • Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
  • Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"

Update: Updating font names alone wont make a difference. The locale seems to be the big problem. I have seen a few site discussing ways around converting the display of Japanese RTF to something most reader would handle, but I haven't found a solution yet, see for example:
here and here.

Best Answer

My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).