Java – How to tell whether an XML document validates against a DTD or XSD

dtd, java, validation, xml, xsd

In Java, I can validate an XML document against an XSD schema using javax.xml.validation.Validator, or against a DTD by simply parsing the document using org.xml.sax.XMLReader.

What I need, though, is a way of programmatically determining whether the document itself is meant to be validated against a DTD (i.e. it contains a <!DOCTYPE ...> declaration) or an XSD. Ideally, I need to do this without loading the whole XML document into memory. Can anyone please help?

(Alternatively, if there's a single way of validating an XML document in Java that works for both XSDs and DTDs – and allows for custom resolving of resources – that would be even better!)

Many thanks,

A

Best Solution

There is no 100% foolproof process for determining how to validate an arbitrary XML document.

For example, this version 2.4 web application deployment descriptor names a W3C XML Schema to validate the document against:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">

However, this is an equally valid way of expressing the same thing, even though it carries no schema location hint at all:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee">

RELAX NG doesn't seem to have a mechanism that offers any hints in the document that you should use it. Validation mechanisms are determined by document consumers, not producers. If I'm not mistaken, this was one of the impetuses driving the switch from DTD to more modern validation mechanisms.

In my opinion, your best bet is to tailor the mechanism detector to the set of document types you are processing, reading header information and interpreting it as appropriate. A StAX parser is good for this: because it is a pull parser, you can read just the start of the file and stop parsing at the first element.
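As a rough illustration of that idea, here is a minimal sketch of a header sniffer built on StAX. The class name ValidationHintDetector, the Hint enum, and the rule "a DOCTYPE means DTD, otherwise an xsi:schemaLocation or xsi:noNamespaceSchemaLocation attribute on the root means XSD" are my own assumptions for the example, not part of any library. As noted above, a result of NONE does not prove that no schema applies; you would still need to fall back on what you know about your document types.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: guesses the validation mechanism from the document header. */
final class ValidationHintDetector {

    enum Hint { DTD, XSD, NONE }

    private static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Reads events only up to the root element's start tag, then stops. */
    static Hint detect(Reader source) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: we only sniff the header, so DTD processing is switched off
        // to keep the parser from fetching an external DTD subset.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(source);
            boolean sawDoctype = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.DTD) {
                    // A <!DOCTYPE ...> declaration appeared before the root element.
                    sawDoctype = true;
                } else if (event == XMLStreamConstants.START_ELEMENT) {
                    if (sawDoctype) {
                        return Hint.DTD;
                    }
                    // Look for the optional XML Schema instance hints on the root element.
                    boolean hasXsiHint =
                        reader.getAttributeValue(XSI_NS, "schemaLocation") != null
                        || reader.getAttributeValue(XSI_NS, "noNamespaceSchemaLocation") != null;
                    return hasXsiHint ? Hint.XSD : Hint.NONE;
                }
            }
            return Hint.NONE;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(
                "Document is not well-formed up to its root element", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful to do if closing the reader fails.
                }
            }
        }
    }
}
```

You would call it with something like ValidationHintDetector.detect(new FileReader(file)) and then branch to your DTD-validating XMLReader or your javax.xml.validation.Validator accordingly. Since only the prolog and the root start tag are consumed, the rest of the document is never pulled into memory by this step.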

