Asp – Parsing UTF-8-encoded XML in MSXML/ASP

asp-classic, msxml, utf-8, xml

I'm at the receiving end of an HTTP POST (x-www-form-urlencoded), where one of the fields contains an XML document. I need to receive that document, look at a couple of elements, and store it in a database (for later use).
The document is in UTF-8 format (and has the appropriate header), and can contain lots of strange characters.

When I receive the data, like this:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.3.0")
xmlDoc.async = False
xmlDoc.loadXML(Request.Form("xml"))

everything I can dig out of the DOM document is still in UTF-8 form.
For example, this document (grossly simplified):

<?xml version="1.0" encoding="UTF-8"?>
<data>
 ä
</data>

always comes out as

<?xml version="1.0" encoding="UTF-8"?>
<data>
 Ã¤
</data>

If I look at xmlDoc.XML, I get this:

<?xml version="1.0"?>
<data>
 Ã¤
</data>

It removes the encoding from the header (since whatever string I'm using in VBScript is "encoding-agnostic", this sort of makes sense), but it's still a sequence of characters representing a UTF-8-encoded document.

It's just as if MSXML didn't care about the encoding info in the header. Is the problem with MSXML, or with the encoding of the POST data? It's a form of "double encoding": first UTF-8 (where certain characters are written with several bytes), then URL-encoding byte by byte ("ä" is actually sent as %C3%A4).
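The double encoding described above can be reproduced outside ASP. The following is a minimal sketch in Python, used here only as a neutral illustration and not part of the original setup. It shows how "ä" becomes two bytes in UTF-8 and how each byte is then percent-encoded.

```python
from urllib.parse import quote

char = "\u00e4"  # the character "ä"

# Step 1: UTF-8 encodes U+00E4 as two bytes.
utf8_bytes = char.encode("utf-8")
print(utf8_bytes.hex())  # c3a4

# Step 2: URL-encoding escapes each non-ASCII byte separately.
encoded = quote(char, encoding="utf-8")
print(encoded)  # %C3%A4
```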

I would not want to hard-code anything, such as assuming that it is always UTF-8 (it could well be UTF-16 sometime in the future). I cannot do a "hard conversion" to any other character set either (such as ISO-8859-1), as the data can contain Cyrillic and Arabic characters. How should I go about fixing this?

Best Answer

Option 1

Before reading any form fields, set the page's Response.CodePage value:

Response.CodePage = 65001

The problem is that the receiving page does not know the form data is UTF-8-encoded, so the %C3%A4 data is decoded as two distinct ANSI characters. Oddly, the page's Response.CodePage influences how form data is decoded when the client sends no character set information.
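The effect of the code page on decoding can be illustrated with a short sketch. It is written in Python rather than ASP, and it assumes Windows-1252 as the default ANSI code page, which may differ on a given server. The same escaped bytes yield two characters under the ANSI code page and one character under UTF-8 (code page 65001).

```python
from urllib.parse import unquote

posted = "%C3%A4"  # what the client sends for "ä"

# Decoded with a single-byte ANSI code page (Windows-1252 assumed),
# each escaped byte becomes its own character: U+00C3 and U+00A4.
as_ansi = unquote(posted, encoding="cp1252")

# Decoded as UTF-8, the two bytes combine into one character: U+00E4.
as_utf8 = unquote(posted, encoding="utf-8")

print([hex(ord(c)) for c in as_ansi])  # ['0xc3', '0xa4']
print([hex(ord(c)) for c in as_utf8])  # ['0xe4']
```

The first result is the "Ã¤" seen in the question; the second is the intended "ä".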

Option 2

Modify the form element on the source page by adding the following attribute to it:

<form accept-charset="UTF-8" ...>

This enforces UTF-8 encoding of the characters in the post, and causes the post to carry data about the chosen charset, which gives the server the info it needs to decode the data correctly.

Option 3

Finally, my personal preference: don't post XML as a field value in a form. Instead, turn it around: add the other form field values to the XML as attributes or elements, then post the XML using XMLHttpRequest. For navigation, have the server return a URL for the client to navigate to, containing a GUID handle to the posted data, so that when the server receives that request it can take the appropriate action. I realize that this is quite a bit more work, in which case one of the other two options should work for you.
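As a rough sketch of Option 3, the snippet below folds a former form field into the XML and serializes the document to bytes for use as the request body. Python stands in for the browser-side XMLHttpRequest call, and the element names request and field, the helper build_payload, and the sample field customer are all hypothetical. It requires Python 3.8 or later for the xml_declaration argument.

```python
import xml.etree.ElementTree as ET


def build_payload(fields, data_text):
    """Wrap former form fields and the document content in one XML body."""
    root = ET.Element("request")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(root, "field", {"name": name}).text = value
    ET.SubElement(root, "data").text = data_text
    # Serialize to bytes; the XML declaration names the encoding used.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


body = build_payload({"customer": "42"}, "\u00e4")
headers = {"Content-Type": "application/xml; charset=utf-8"}

# The receiver parses the raw bytes, so the parser reads the
# encoding declaration itself instead of relying on form decoding.
parsed = ET.fromstring(body)
print(parsed.find("data").text == "\u00e4")  # True
```

On the classic ASP side, the receiving page would typically pass the request body straight to MSXML, for example with xmlDoc.load(Request), rather than going through Request.Form. That call is not part of the original answer and is mentioned here only as a likely counterpart.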