A while ago I wrote a post about avoiding the XmlDocument class when writing large XML files. XmlDocument objects quickly become very large and can easily overwhelm even a fairly powerful machine when used to output large amounts of data.
Reading large XML files presents many of the same problems. XmlDocument does not scale well in terms of memory use when large amounts of XML are loaded into it. The usual answer is to use the XmlTextReader instead.
This works pretty well and uses far less memory, but the XmlTextReader doesn't let me use XPath to pull items out of the XML data, and it imposes an unfamiliar (to me, at least) pseudo-event-driven model in which I have to step through the file one node at a time.
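For anyone who hasn't met that node-at-a-time model before, here is a minimal sketch of what it looks like. The module name ReaderWalkSketch and the function DescribeNodes are names I've made up for illustration, and the XML is fed in from a string rather than a file to keep the example self-contained.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module ReaderWalkSketch
    ' Walks the XML one node at a time and records what the reader reports.
    ' (Hypothetical helper, for illustration only.)
    Public Function DescribeNodes(ByVal xml As String) As List(Of String)
        Dim seen As New List(Of String)
        Dim reader As New XmlTextReader(New StringReader(xml))
        Try
            ' Read() advances to the next node and returns False at end of input
            While reader.Read()
                Select Case reader.NodeType
                    Case XmlNodeType.Element
                        seen.Add("Element: " & reader.Name)
                    Case XmlNodeType.Text
                        seen.Add("Text: " & reader.Value)
                    Case XmlNodeType.EndElement
                        seen.Add("EndElement: " & reader.Name)
                End Select
            End While
        Finally
            reader.Close()
        End Try
        Return seen
    End Function

    Sub Main()
        For Each entry As String In DescribeNodes("<jobs><job><title>Test</title></job></jobs>")
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

The point is that the reader never holds more than the current node, which is why it is cheap on memory and also why there is nothing for XPath to query.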
For certain types of XML file, specifically those made up of many repeated sections (think of a large RSS file with many <item> elements), I prefer to combine the two: an XmlTextReader walks the file as a whole, and each repeated item is read into its own XmlDocument for processing. This ensures that only a small amount of XML is loaded into an XmlDocument at any one time, and still allows me to use XPath within individual elements. It's not as easy or powerful as loading the entire file into an XmlDocument would be, but it's far cheaper in terms of memory usage.
To illustrate, see the code sample below.
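' Requires Imports System.Xml at the top of the file
Dim xtrInput As XmlTextReader
Dim xdItem As XmlDocument
xtrInput = New XmlTextReader("sample.xml")
' Step through the file one node at a time
While xtrInput.Read()
    ' ReadOuterXml moves the reader past </job>, so keep looping
    ' while the node it lands on is another <job> element
    While xtrInput.NodeType = XmlNodeType.Element AndAlso xtrInput.Name.ToLower() = "job"
        ' Load just this <job> and its children into a small document
        xdItem = New XmlDocument()
        xdItem.LoadXml(xtrInput.ReadOuterXml())
        'Process xdItem here
    End While
End While
xtrInput.Close()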
A sample input file to the above code might look like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<jobs>
    <job>
        <title> Test Job Title 1 </title>
        <description> Test Description 1</description>
        <salary> £10,000 pa </salary>
        <location> Test Location 1 </location>
    </job>
    <job>
        <title> Test Job Title 2 </title>
        <description> Test Description 2</description>
        <salary> £10,000 pa </salary>
        <location> Test Location 2 </location>
    </job>
    <job>
        <title> Test Job Title 3 </title>
        <description> Test Description 3</description>
        <salary> £10,000 pa </salary>
        <location> Test Location 3 </location>
    </job>
</jobs>
The outer While loop reads through the XML file one node at a time until it runs out of data, at which point the loop exits and the reader is closed.
Within the outer loop is another While loop, which checks whether the XmlTextReader is positioned on the start of an element with a name I'm interested in, in this case a <job> element. If it is, I read the entire element into an XmlDocument called xdItem by calling ReadOuterXml on the XmlTextReader. As well as returning the XML for the current node and any children, this call moves the XmlTextReader past the closing </job> tag onto whatever node comes next. If the next <job> follows immediately, the reader is already positioned on it, so the inner loop goes around again and processes it straight away. This is why the inner test is a While rather than an If: with an If, the outer loop's Read call would step over that adjacent <job> and skip it. If whitespace or some other node sits between the two elements, the inner loop exits and the outer loop's Read call steps forward until it reaches the next <job>. This continues until I run out of <job> nodes.
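To make the loop behaviour concrete, here is a small self-contained sketch that runs the same nested-loop pattern over two layouts of the same data, one with the <job> elements packed together and one with line breaks and indentation between them. The module name JobCountSketch and the function CountJobs are my own names for illustration, and the input comes from a string instead of sample.xml.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module JobCountSketch
    ' Counts <job> elements using the nested While pattern.
    ' (Hypothetical helper, for illustration only.)
    Public Function CountJobs(ByVal xml As String) As Integer
        Dim count As Integer = 0
        Dim xtrInput As New XmlTextReader(New StringReader(xml))
        Try
            While xtrInput.Read()
                ' Stays in this loop for as long as the reader lands on a <job>
                While xtrInput.NodeType = XmlNodeType.Element AndAlso xtrInput.Name.ToLower() = "job"
                    Dim xdItem As New XmlDocument()
                    xdItem.LoadXml(xtrInput.ReadOuterXml())
                    count += 1
                End While
            End While
        Finally
            xtrInput.Close()
        End Try
        Return count
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim compact As String = "<jobs><job><title>A</title></job><job><title>B</title></job></jobs>"
        Dim spaced As String = "<jobs>" & nl & "  <job><title>A</title></job>" & nl & "  <job><title>B</title></job>" & nl & "</jobs>"
        Console.WriteLine(CountJobs(compact))
        Console.WriteLine(CountJobs(spaced))
    End Sub
End Module
```

Both calls should report the same number of jobs, because the inner loop picks up back-to-back elements and the outer loop steps over anything in between.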
Once the <job> element and its children are loaded into the XmlDocument, I can use all the familiar XmlDocument techniques, XPath included, to extract content from any elements and/or attributes in the <job>.
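As a rough sketch of that step, the snippet below pulls a couple of values out of a single <job> with SelectSingleNode. The helper GetText and the module name JobXPathSketch are invented for this example, and I'm assuming the element names from the sample file. Note that XPath is case-sensitive, so "/job/title" will not match a <Title> element.

```vbnet
Imports System
Imports System.Xml

Module JobXPathSketch
    ' Returns the trimmed text of the first node matching xpath,
    ' or Nothing when no node matches. (Hypothetical helper.)
    Public Function GetText(ByVal xdItem As XmlDocument, ByVal xpath As String) As String
        Dim node As XmlNode = xdItem.SelectSingleNode(xpath)
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Stands in for the xdItem built inside the reader loop
        Dim xdItem As New XmlDocument()
        xdItem.LoadXml("<job><title> Test Job Title 1 </title><location> Test Location 1 </location></job>")
        Console.WriteLine(GetText(xdItem, "/job/title"))    ' Test Job Title 1
        Console.WriteLine(GetText(xdItem, "/job/location")) ' Test Location 1
    End Sub
End Module
```

Because each XmlDocument holds only one <job>, the root of every XPath expression is the <job> element itself rather than <jobs>.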
As an aside, if I were deploying the above code to production I’d wrap the creation of the xdItem and any subsequent processing in a Try…Catch. This means that any unexpected errors in an individual <job> item do not stop the rest of the file being processed.
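Here is one way that might look, offered as a sketch rather than production code. The names SafeJobLoopSketch and ReadTitles, the errors list, and the rule that a <job> without a <title> counts as an error are all assumptions of mine for the sake of the example. I've kept the ReadOuterXml call outside the Try so that the reader always moves past the current <job>, even when processing it fails.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module SafeJobLoopSketch
    ' Returns the title of every <job> that could be processed.
    ' Failures in individual items are recorded in errors and skipped.
    ' (Hypothetical helper, for illustration only.)
    Public Function ReadTitles(ByVal source As TextReader, ByVal errors As List(Of String)) As List(Of String)
        Dim titles As New List(Of String)
        Dim xtrInput As New XmlTextReader(source)
        Try
            While xtrInput.Read()
                While xtrInput.NodeType = XmlNodeType.Element AndAlso xtrInput.Name.ToLower() = "job"
                    ' Read outside the Try so the reader always advances past this <job>
                    Dim jobXml As String = xtrInput.ReadOuterXml()
                    Try
                        Dim xdItem As New XmlDocument()
                        xdItem.LoadXml(jobXml)
                        Dim titleNode As XmlNode = xdItem.SelectSingleNode("/job/title")
                        ' Assumed rule for this sketch: a job with no title is an error
                        If titleNode Is Nothing Then Throw New InvalidOperationException("job has no title")
                        titles.Add(titleNode.InnerText.Trim())
                    Catch ex As Exception
                        ' Note the problem and carry on with the next <job>
                        errors.Add(ex.Message)
                    End Try
                End While
            End While
        Finally
            xtrInput.Close()
        End Try
        Return titles
    End Function

    Sub Main()
        Dim errors As New List(Of String)
        Dim xml As String = "<jobs><job><title>A</title></job><job><salary>x</salary></job><job><title>C</title></job></jobs>"
        For Each title As String In ReadTitles(New StringReader(xml), errors)
            Console.WriteLine(title)
        Next
        Console.WriteLine(errors.Count & " item(s) skipped")
    End Sub
End Module
```

One caveat: this only guards against problems in processing an item. If the file itself is not well-formed, the XmlTextReader will throw from Read or ReadOuterXml, and that error still stops the whole run.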
The code I've outlined here can read and process XML files of practically any size, provided each individual <job> is small enough to load comfortably, and it doesn't suffer from the memory problems associated with loading large files into a single XmlDocument. If it fits what you're intending to do, by all means use it. If you come up with any improvements, please post your thoughts in the comments.