The Python Web services developer: RSS for Python
By Mike Olson and Uche Ogbuji - 2004-01-14

Mark Pilgrim offers another module for RSS file parsing, rssparser.py. It doesn't provide all the features and options that RSS.py does, but it does offer a very liberal parser, which copes well with the confusing diversity in the world of RSS. To quote from the rssparser.py page:

You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
Then there are feeds, like Aaron's feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there's Jon Udell's feed, with the fullitem element that he just sort of made up.

It's funny to consider this in light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.

Installing rssparser.py is also very easy. You download the Python file (see Resources), rename it to "rssparser.py", and copy it to a directory on your PYTHONPATH. I also suggest getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python and so makes it less likely that fetching an RSS feed will stall the application thread when an error occurs.
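As a rough illustration of the timeout idea, the sketch below uses socket.setdefaulttimeout from the Python standard library (available since Python 2.3) rather than timeoutsocket itself, so it is a stand-in and not the module recommended above. The 10-second value and the helper name set_feed_timeout are assumptions made for this example, and the code is written for Python 3, unlike the listings in this article.

```python
import socket

# Assumed value: 10 seconds is an arbitrary timeout chosen for illustration.
FEED_TIMEOUT_SECONDS = 10.0


def set_feed_timeout(seconds=FEED_TIMEOUT_SECONDS):
    """Apply a default timeout to every socket created after this call.

    Hypothetical helper: a stalled feed server then raises a timeout
    error instead of blocking the calling thread indefinitely.
    """
    socket.setdefaulttimeout(seconds)
    return socket.getdefaulttimeout()


if __name__ == "__main__":
    print("Default socket timeout:", set_feed_timeout())
```

Calling a helper like this once, before any feed is fetched, should be enough, because the default applies to sockets opened afterwards.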

Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py rather than RSS.py.

Listing 3

import rssparser
#Parse the data, returns a tuple: (data for channels, data for items)
channel, items = rssparser.parse("")

for item in items:
    #Each item is a dictionary mapping properties to values
    print "RSS Item:", item.get('link', "(none)")
    print "Title:", item.get('title', "(none)")
    print "Description:", item.get('description', "(none)")

As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features and preserves more syntactic information from the RSS feed. The latter is simpler and has a more forgiving parser (the RSS.py parser accepts only well-formed XML).

The output should be the same as in Listing 2.
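To make the point about well-formedness concrete, here is a minimal sketch of how a strict XML parser reacts to the unescaped ampersands mentioned in the quote above. It uses xml.etree.ElementTree from the Python standard library as a stand-in for a strict parser; it is not the code of either module discussed here. The feed fragments and the function name strict_title are invented for illustration, and the code is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragments, not taken from any real feed.
WELL_FORMED = "<item><title>Q &amp; A</title></item>"
UNESCAPED_AMPERSAND = "<item><title>Q & A</title></item>"


def strict_title(xml_text):
    """Return the item title, or None if the XML is not well-formed."""
    try:
        item = ET.fromstring(xml_text)
    except ET.ParseError:
        # A strict parser rejects the whole document on the first error.
        return None
    return item.findtext("title")


if __name__ == "__main__":
    print(strict_title(WELL_FORMED))
    print(strict_title(UNESCAPED_AMPERSAND))
```

The first fragment yields its title, while the second is rejected outright. A liberal parser aims to recover something useful from input like the second fragment instead of failing.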


First published by IBM developerWorks

Article copyright and all rights retained by the author.