In our application, we receive text files (.txt
, .csv
, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files where created in a different/unknown codepage.
Is there a way to (automatically) detect the codepage of a text file?
The detectEncodingFromByteOrderMarks
, on the StreamReader
constructor, works for UTF8
and other unicode marked files, but I'm looking for a way to detect code pages, like ibm850
, windows1252
.
Thanks for your answers, this is what I've done.
The files we receive are from end-users, they do not have a clue about codepages. The receivers are also end-users, by now this is what they know about codepages: Codepages exist, and are annoying.
Solution:
- Open the received file in Notepad, look at a garbled piece of text. If somebody is called François or something, with your human intelligence you can guess this.
- I've created a small app that the user can use to open the file with, and enter a text that user knows it will appear in the file, when the correct codepage is used.
- Loop through all codepages, and display the ones that give a solution with the user provided text.
- If more as one codepage pops up, ask the user to specify more text.
Best Answer
You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Specifically Joel says: