I would like to use a language that I am familiar with – Java, C#, Ruby, PHP, C/C++, although examples in any language or pseudocode are more than welcome.

What is the best way of splitting a large XML document into smaller sections that are still valid XML? For my purposes, I need to split them into roughly thirds or fourths, but for the sake of providing examples, splitting them into n components would be good.

Best Solution

Parsing XML documents with DOM doesn't scale: DOM builds the entire document tree in memory, so memory use grows with the size of the file.
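As a rough illustration of the streaming alternative, here is a minimal Java sketch that counts elements with the cursor-style StAX API. It is not part of the script; `StreamingCountSketch` and `countElements` are hypothetical names, and the input is an in-memory string only to keep the example short. The parser hands over one event at a time, so no tree of the whole document is built.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of the original answer.
final class StreamingCountSketch {

    // Counts start tags with the given local name, reading one event at a time.
    static int countElements(String xml, String localName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(localName)) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException ex) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new IllegalStateException(ex);
        }
    }
}
```

For a real file you would pass a `FileInputStream` to `createXMLStreamReader` instead of a `StringReader`.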

This Groovy script uses StAX (Streaming API for XML) to split an XML document between its top-level elements (those that share the same QName as the first child of the root element). It's pretty fast, handles arbitrarily large documents, and is very useful when you want to split a large batch file into smaller pieces.
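The same two-pass idea can be sketched in plain Java. This is an illustration under simplifying assumptions, not a port of the script. `XmlSplitSketch`, `countChildren` and `split` are hypothetical names. It works on in-memory strings so it is easy to try out. It only splits on direct children of the root that share the first child's QName, and it drops whitespace between those children. The chunk size is rounded up, so it may produce fewer than `pieces` documents.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration class; the names are not from the original script.
final class XmlSplitSketch {

    // Pass 1: count the root's direct children that share the first child's QName.
    static int countChildren(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            QName childName = null;
            int depth = 0, count = 0;
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    QName name = e.asStartElement().getName();
                    if (++depth == 2) {
                        if (childName == null) childName = name;
                        if (name.equals(childName)) count++;
                    }
                } else if (e.isEndElement()) {
                    depth--;
                }
            }
            return count;
        } catch (XMLStreamException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Pass 2: copy the children into at most `pieces` documents, each wrapped in the root tag.
    static List<String> split(String xml, int pieces) {
        if (pieces < 1) throw new IllegalArgumentException("pieces must be >= 1");
        int chunkSize = Math.max(1, (countChildren(xml) + pieces - 1) / pieces); // rounded up
        List<String> parts = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            XMLOutputFactory outputs = XMLOutputFactory.newInstance();
            StartElement root = null;
            QName childName = null;
            StringWriter buffer = null;
            XMLEventWriter writer = null;
            int depth = 0, inChunk = 0;
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                boolean inside = depth >= 2; // event lies within a child of the root
                if (e.isStartElement()) {
                    StartElement se = e.asStartElement();
                    depth++;
                    if (depth == 1) root = se;
                    if (depth == 2) {
                        if (childName == null) childName = se.getName();
                        if (se.getName().equals(childName)) {
                            if (inChunk == chunkSize) { // current piece is full
                                parts.add(finish(writer, buffer, root));
                                writer = null;
                            }
                            if (writer == null) { // open a new piece with the root start tag
                                buffer = new StringWriter();
                                writer = outputs.createXMLEventWriter(buffer);
                                writer.add(root);
                                inChunk = 0;
                            }
                            inChunk++;
                        }
                    }
                    inside = depth >= 2;
                } else if (e.isEndElement()) {
                    depth--;
                }
                if (inside) writer.add(e);
            }
            if (writer != null) parts.add(finish(writer, buffer, root));
            return parts;
        } catch (XMLStreamException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Closes the root element of the current piece and returns its text.
    private static String finish(XMLEventWriter writer, StringWriter buffer, StartElement root)
            throws XMLStreamException {
        QName n = root.getName();
        writer.add(XMLEventFactory.newInstance()
                .createEndElement(n.getPrefix(), n.getNamespaceURI(), n.getLocalPart()));
        writer.flush();
        writer.close();
        return buffer.toString();
    }
}
```

Calling `XmlSplitSketch.split(xml, 3)` returns a list of strings, each a well-formed document with the original root element around its share of the children. For large inputs you would replace the strings with file streams, as the script does.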

The script requires Groovy on Java 6, or a StAX API and an implementation such as Woodstox on the CLASSPATH.

import javax.xml.stream.*

pieces = 5
input = "input.xml"
output = "output_%04d.xml"
eventFactory = XMLEventFactory.newInstance()
fileNumber = elementCount = 0

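// Opens a fresh reader and consumes the events up to the first child of the root.
def createEventReader() {
    reader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(input))
    start = reader.next()          // StartDocument event
    root = reader.nextTag()        // start tag of the root element
    firstChild = reader.nextTag()  // start tag of the root's first child
    return reader
}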

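// Opens the next numbered output file and starts it with the document prolog and the root start tag.
def createNextEventWriter () {
    println "Writing to '${filename = String.format(output, ++fileNumber)}'"
    writer = XMLOutputFactory.newInstance().createXMLEventWriter(new FileOutputStream(filename), start.characterEncodingScheme)
    writer.add(start)
    writer.add(root)
    return writer
}

// First pass: count the elements that share the first child's QName.
elements = createEventReader().findAll { it.startElement && it.name == firstChild.name }.size()
println "Splitting ${elements} <${firstChild.name.localPart}> elements into ${pieces} pieces"
chunkSize = elements / pieces
writer = createNextEventWriter()
// createEventReader() has already consumed the first child's start tag, so write it explicitly.
writer.add(firstChild)
// Second pass: start a new file whenever the current chunk is full.
createEventReader().each {
    if (it.startElement && it.name == firstChild.name) {
        if (++elementCount > chunkSize) {
            // Finish the current file before opening the next one.
            writer.add(eventFactory.createEndDocument())
            writer.flush()
            writer = createNextEventWriter()
            elementCount = 0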
