Read UTF-8 XML with MSXML 4.0


I have a problem with classic ASP / VBScript trying to read a UTF-8 encoded XML file with MSXML. The file is encoded correctly; I can see that with all other tools.

Constructed XML example:

<?xml version="1.0" encoding="UTF-8"?>
<Product Name="Backup gewünscht" />

If I try to do this in ASP…

Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set ts = fso.OpenTextFile("input.xml", FOR_READING)
XML = ts.ReadAll
ts.Close
Set ts = Nothing
Set fso = Nothing

Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.loadXML XML   ' parse the string that was read above
Set DocElement = myXML.documentElement
Set ProductNodes = DocElement.selectNodes("//Product")
Response.Write ProductNodes(0).getAttribute("Name")
' ...

… and Name contains special characters (German umlauts, to be specific), the two bytes of the umlaut's UTF-8 sequence get re-encoded, so I end up with two nonsense characters. What should be "ü" becomes "Ã¼", which is FOUR bytes in my output, not two (correct UTF-8) or one (ISO-8859-#).

What am I doing wrong? Why does MSXML think that the input is ISO-8859-#, so that it tries to convert it to UTF-8?

Best Solution

Set ts = fso.OpenTextFile("input.xml", FOR_READING, False, True)

The last parameter is the "Unicode" flag.

OpenTextFile() has the following signature:

object.OpenTextFile(filename[, iomode[, create[, format]]])

where "format" is defined as

Optional. One of three Tristate values used to indicate the format of the opened file. If omitted, the file is opened as ASCII.

And Tristate is defined as:

TristateUseDefault  -2   Opens the file using the system default.
TristateTrue        -1   Opens the file as Unicode.
TristateFalse        0   Opens the file as ASCII.

And -1 happens to be the numerical value of True.
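As a small sketch of the call above with named constants instead of bare True/False: plain VBScript does not predefine these names, so they are declared by hand here, and the variable names and the use of Server.MapPath are only assumptions for illustration.

```vbscript
' Constants declared by hand; the values come from the Tristate table above.
Const ForReading = 1
Const TristateUseDefault = -2
Const TristateTrue = -1
Const TristateFalse = 0

Dim fso, ts, xmlText
Set fso = Server.CreateObject("Scripting.FileSystemObject")

' Third argument (create) = False: do not create the file if it is missing.
' Fourth argument (format) = TristateTrue: open the file as "Unicode".
' Note: this is a fixed choice made by the caller, not auto-detection.
Set ts = fso.OpenTextFile(Server.MapPath("input.xml"), ForReading, False, TristateTrue)
xmlText = ts.ReadAll
ts.Close

Set ts = Nothing
Set fso = Nothing
```

Passing TristateTrue instead of True behaves the same but makes it clearer which of the three modes was chosen.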

Anyway, this is better:

Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.load("input.xml")

Why should you use a TextStream object to read in a file that MSXML can read perfectly well on its own?

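The TextStream object also has no notion of the actual file encoding. The docs say "Unicode", but there is more than one way of encoding Unicode, and for the FileSystemObject it means UTF-16, not UTF-8. The load() method of the MSXML object works out the encoding from the file itself (byte order mark or XML declaration), so it will be able to deal with all of them.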
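A slightly fuller sketch of letting MSXML read the file directly might look like this. The Server.MapPath call, the variable names, and the error message are assumptions for illustration; async and parseError are standard DOMDocument members.

```vbscript
' Sketch only: assumes input.xml sits in the same folder as the ASP page.
Dim myXML, productNodes
Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.async = False   ' load synchronously so the DOM is complete when load() returns

If myXML.load(Server.MapPath("input.xml")) Then
    Set productNodes = myXML.selectNodes("//Product")
    If productNodes.length > 0 Then
        Response.Write productNodes(0).getAttribute("Name")
    End If
Else
    ' parseError explains why loading failed (missing file, malformed XML, bad encoding)
    Response.Write "Load failed: " & myXML.parseError.reason
End If

Set myXML = Nothing
```

Checking the return value of load() means a parse failure is reported rather than surfacing later as an error on an empty document. Whether the browser then displays the umlaut correctly also depends on the code page and charset of the response, which this sketch does not set.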
