Java – Parsing XML with REGEX in Java

javaregexxml

Given the below XML snippet I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is:
[{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic and so cannot be expressed literally in the regex. The tag names TargetCenter and Trace are static and could be in the regex but if there is a way to avoid hardcoding that would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed and it has the problem that it erroneously includes {Trace:719879} in the results. Relying on new-lines within the XML or any other apparent formatting is not an option.

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to only include data elements and ignore the rest?

Best Solution

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.