R – File names containing non-ascii international language characters


Has anyone had experience generating files that have filenames containing non-ascii international language characters?

Is doing this an easy thing to achieve, or is it fraught with danger?

Is this functionality expected from Japanese/Chinese speaking web users?

Should file extensions also be international language characters?

Info: We currently support multilanguage on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET framework. This would be used in a scenario where international users could choose a common format and name for there files.

Best Solution

Is this functionality expected from Japanese/Chinese speaking web users?


Is doing this an easy thing to achieve, or is it fraught with danger?

There are issues. If you are serving files directly, or otherwise have the filename in the URL (eg.: http://​www.example.com/files/こんにちは.txt -> http://​www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.

But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:

Content-Disposition: attachment;filename="こんにちは.txt"

How do we encode those characters into the filename parameter? Well it would be nice if we could just dump it in in UTF-8. And that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8 and you can't really guess what it's going to be in advance of sending the header.

Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.

So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.

Should file extensions also be international language characters?

In principle yes, file extensions are just part of the filename and can contain any character.

In practice on Windows I know of no application that has ever used a non-ASCII file extension.

One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to allow Asians to type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.

That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘42’ integer or ‘.txt’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to ‘Unicode Normal Form NFKC’ before doing anything with them.