feedreader and hpricot feed parsing
Tue, Jun 5, 2007
After doing a little profiling (not that it was necessary) I found that REXML was the source of the lag in feed parsing. The obvious solution is to switch from REXML to some other xml parser. Having used libxml for c projects, I must admit a deep dislike of the api. The ruby-libxml interface does nothing to hide this api, and was therefore out of the running. So that left me with hpricot, which has a great api and can parse xml/html quickly.
At this point I have a feedtools rewrite/replacement sitting in svn://hasno.info/feedtools/trunk. It’s currently going by the name FeedEater until such a time as it merges in with feedtools. I’ve left the interface almost the same, so the regular duckie rules apply. It is MUCH faster than feedtools, but does not do nearly as much. Feedtools attempts to fix any and all issues with feeds, anything from entity conversion to changing feed:// urls into http:// urls. The simple library I’ve written does none of that. It parses the feed into some ruby objects and spits it back at you. I did, however, provide support for caching and have implemented a matching database cache (it basically uses the feedtools cache table).
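To give a feel for the kind of cleanup that gets skipped, here is a rough sketch of the feed:// to http:// rewrite. This is not feedtools' actual code, and `normalize_feed_url` is a made-up name used only for illustration. It assumes the two common spellings, `feed://host/path` and `feed:http://host/path`, are the only ones worth handling.

```ruby
# Hypothetical helper, not taken from feedtools or FeedEater.
# Turns feed: style urls into plain http/https urls.
def normalize_feed_url(url)
  url = url.strip
  # "feed:http://host/path" wraps a complete url, so just drop the prefix.
  unwrapped = url.sub(/\Afeed:(?=https?:\/\/)/i, '')
  return unwrapped unless unwrapped == url
  # "feed://host/path" is treated as shorthand for plain http.
  url.sub(/\Afeed:\/\//i, 'http://')
end

puts normalize_feed_url("feed://example.com/rss")
```

Anything that is already an http:// or https:// url passes through untouched, so a caller that wants this behaviour could run every url through it before fetching.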
Since all this feed parsing cruft was done in order to make the mephisto feedreader plugin faster, I’ve gone ahead and created a branch that uses the new library. You’ll find the branch at svn://hasno.info/mephisto/plugins/branches/mephisto_feedreader. Give the library and/or plugin a try.
As an aside, after all this hpricot goodness I’ve decided that at some point I want to write a ruby-ish hpricot-ish libxml wrapper.