My colleague Uche Ogbuji has written a short article on ElementTree for another publication. One of the tests he ran compared the relative speed and memory consumption of ElementTree to that of DOM. Uche chose to use his own cDomlette for the comparison. Unfortunately, I am unable to install 4Suite 1.0a1 on the Mac OS X machine I use (a workaround is in the works). However, I can use Uche's estimates to guess the likely performance -- he indicates that ElementTree is 30% slower, but 30% more memory-friendly, than cDomlette.
Mostly I was curious how ElementTree compares in speed and memory to gnosis.xml.objectify. I had never actually benchmarked my module very precisely before, since I never had anything concrete to compare it to. I selected two documents that I had used for benchmarking in the past: a 289 KB XML version of Shakespeare's Hamlet and a 3 MB XML Web log. I created scripts that simply parse an XML document into the object models of the various tools, but do not perform any additional manipulation:
Listing 1. Scripts to time XML object models for Python
Creating the program object is quite similar in all three cases, and also with cDomlette. I estimated memory usage by watching the output of top in another window; each test was run three times to make sure the results were consistent, and the median value was used (memory was identical across runs).
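The run-three-times, take-the-median procedure can be sketched in a few lines of Python. This is only an illustration, not the scripts used for the measurements in this article: it assumes a modern Python where ElementTree lives in the standard library as xml.etree.ElementTree, it leaves out gnosis.xml.objectify and cDomlette because they may not be installed, and the names time_parse, run_benchmark, and SAMPLE (a small synthetic stand-in for Hamlet.xml) are hypothetical.

```python
import io
import statistics
import time
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical stand-in document; substitute the bytes of a real file
# such as Hamlet.xml to get meaningful numbers.
SAMPLE = (
    b"<PLAY><TITLE>Hamlet</TITLE>"
    + b"<LINE>To be, or not to be</LINE>" * 1000
    + b"</PLAY>"
)

# Each entry parses a file-like object into that tool's object model
# and does nothing else with the result.
PARSERS = {
    "ElementTree": lambda f: ET.parse(f).getroot(),
    "minidom": minidom.parse,
}


def time_parse(parse, source_bytes, runs=3):
    """Return the median wall-clock seconds over `runs` parses."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(io.BytesIO(source_bytes))
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def run_benchmark(data=SAMPLE):
    """Map each parser name to its median parse time in seconds."""
    return {name: time_parse(parse, data) for name, parse in PARSERS.items()}


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

Taking the median rather than the mean keeps one slow run (a cold disk cache, say) from skewing the figure. Timing inside the process also excludes interpreter startup, so the numbers will not match those from timing a whole script.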
One thing that is clear is that xml.dom.minidom quickly becomes quite impractical for moderately large XML documents. The rest stay (fairly) reasonable. gnosis.xml.objectify is the most memory-friendly, but that is not surprising since it does not preserve all the information in the original XML instance (data content is kept, but not all structural information).
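For readers who want a rough memory comparison without watching top, one option is Python's tracemalloc module. This is a different technique from the one I used: tracemalloc counts only memory allocated through Python's own allocators, not the total process size that top reports, so the figures are not comparable to the ones discussed here. The sketch again assumes the standard-library xml.etree.ElementTree, and peak_bytes and SAMPLE are hypothetical names.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical stand-in document; substitute the bytes of a real file.
SAMPLE = (
    b"<PLAY><TITLE>Hamlet</TITLE>"
    + b"<LINE>To be, or not to be</LINE>" * 1000
    + b"</PLAY>"
)


def peak_bytes(parse, source_bytes):
    """Return the peak bytes traced by tracemalloc during one parse."""
    tracemalloc.start()
    try:
        # Keep a reference so the tree is still alive when we measure.
        doc = parse(io.BytesIO(source_bytes))
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del doc
    return peak


if __name__ == "__main__":
    print("ElementTree:", peak_bytes(lambda f: ET.parse(f).getroot(), SAMPLE))
    print("minidom:    ", peak_bytes(minidom.parse, SAMPLE))
```

Tracing adds overhead of its own, so memory and timing are best measured in separate runs.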
I also ran a test of Ruby's REXML, using the following script:
Listing 2. Ruby REXML parsing script (time_rexml.rb)
REXML proved about as resource intensive as xml.dom.minidom: parsing Hamlet.xml took 10 seconds and used 14 MB; parsing Weblog.xml took 190 seconds and used 150 MB. Obviously, the choice of programming language usually takes precedence over the comparison of libraries.