Xml – Tool to find duplicate sections in a text (XML) file

duplicatesfindtextxml

I have an XML file, and I want to find nodes that have duplicate CDATA. Are there any tools that exist that can help me do this?

I'd be fine with a tool that does this generally for text documents.

Best Solution

Here is a first attempt, written in Python and using only standard libraries. You can improve it in many ways (trim leading and ending whitespaces, computing a hash of the text to decrease memory requirments, better displaying of the elements, with their line number, etc):

import xml.etree.ElementTree as ElementTree
import sys

def print_elem(element):
    return "<%s>" % element.tag

if len(sys.argv) != 2:
    print >> sys.stderr, "Usage: %s filename" % sys.argv[0]
    sys.exit(1)
filename = sys.argv[1]    
tree = ElementTree.parse(filename)
root = tree.getroot()
chunks = {}
iter = root.findall('.//*')
for element in iter:
    if element.text in chunks:
        chunks[element.text].append(element)
    else:
        chunks[element.text] = [element,]
for text in chunks:
    if len(chunks[text]) > 1:
        print "\"%s\" is a duplicate: found in %s" % \
              (text, map(print_elem, chunks[text]))

If you give it this XML file:

<foo>
<bar>Hop</bar><quiz>Gaw</quiz>
<sub>
<und>Hop</und>
</sub>

it will output:

"Hop" is a duplicate: found in ['<bar>', '<und>']