fastxml progress
September 6th, 2007
My little pet xml parser interface is coming along well. I'm starting to implement all the little detail bits of the Hpricot api, or at least as much of it as can be done using libxml. If anyone wants to test it out or help out, the code is available from svn://hasno.info/fastxml. I've also gone ahead and created a trac instance. Now for some random stats:
(in /Users/segfault/Devel/fastxml)
ruby ./benchmarks/unicode.rb
                      user     system      total        real
fastxml.new       0.040000   0.000000   0.040000 (  0.048961)
fastxml.to_s      0.020000   0.010000   0.030000 (  0.021063)
fastxml.search    0.000000   0.000000   0.000000 (  0.002023)
hpricot.new       0.700000   0.030000   0.730000 (  0.753549)
hpricot.to_s      0.140000   0.010000   0.150000 (  0.154857)
hpricot.search    0.280000   0.000000   0.280000 (  0.294201)
libxml.new        0.040000   0.000000   0.040000 (  0.038258)
libxml.to_s       0.010000   0.010000   0.020000 (  0.024945)
libxml.search     0.010000   0.000000   0.010000 (  0.002113)
REXML.new         1.390000   0.030000   1.420000 (  1.452444)
REXML.to_s        0.440000   0.010000   0.450000 (  0.464809)
REXML.xpath     103.720000   0.500000 104.220000 (107.125149)
xpath expression: //p
fastxml nodes: 10577
libxml nodes: 10577
hpricot nodes: 10577
REXML nodes: 10577
The unicode benchmark just runs three operations (new, to_s and an xpath query) on a well-formed xml file (~900k). It's apparent that everything is faster than REXML. I wonder if anyone's game to add a REXML-compatible wrapper onto one of these libraries in order to speed up existing apps...
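For anyone who wants to try something similar, here's a rough sketch of how one of these measurements could be put together with Ruby's standard Benchmark module. It is not the actual benchmarks/unicode.rb script. It only covers REXML, since that ships with Ruby, and the helper names (sample_xml, benchmark_rexml) and the small generated document are made up for illustration.

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical stand-in for the ~900k test file: a small in-memory document.
def sample_xml(paragraphs)
  body = (1..paragraphs).map { |i| "<p>paragraph #{i}</p>" }.join
  "<doc>#{body}</doc>"
end

# Times parse, serialize and xpath search, then returns the number of
# nodes the xpath expression matched.
def benchmark_rexml(xml, expression)
  doc = nil
  nodes = []
  Benchmark.bm(12) do |bm|
    bm.report('REXML.new')   { doc = REXML::Document.new(xml) }
    bm.report('REXML.to_s')  { doc.to_s }
    bm.report('REXML.xpath') { nodes = REXML::XPath.match(doc, expression) }
  end
  puts "xpath expression: #{expression}"
  puts "REXML nodes: #{nodes.size}"
  nodes.size
end

benchmark_rexml(sample_xml(200), '//p')
```

Printing the node count next to the timings is a cheap sanity check: if the libraries disagree on how many nodes matched, the timings aren't comparing the same work.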
easy fast ruby libxml interface
July 26th, 2007
I've been working on a little project for a while now, an hpricot-styled ruby libxml library. I started the project in order to learn the ruby c extension api and create an easy to use xml library for ruby. Hpricot is great but it is not a full-fledged xml library and isn't intended as such. Libxml has ruby bindings, but they expose the same libxml api, which is very un-ruby (imho). Rexml is just plain old slow. So my current hacking has left me with a library capable of loading xml from strings, arrays or whatever into an object (from what I've seen libxml-ruby doesn't support loading from memory/strings). I can run xpath searches and do xslt. I'm working on cleaning up the api to make it match hpricot and then I'll probably release the parse/read-only version as v0.1 in the next few weeks.
Here's a snippet of benchmark output comparing the different libraries in use (run on a late 2k6 Macbook w/ 2gb of ram):
(in /Users/segfault/Devel/fastxml)
ruby ./benchmarks/speedtest.rb
                      user     system      total        real
fastxml.new       0.000000   0.000000   0.000000 (  0.001102)
fastxml.to_s      0.000000   0.000000   0.000000 (  0.000629)
fastxml.search    0.000000   0.000000   0.000000 (  0.000207)
hpricot.new       0.010000   0.000000   0.010000 (  0.012319)
hpricot.to_s      0.000000   0.000000   0.000000 (  0.003164)
hpricot.search    0.010000   0.000000   0.010000 (  0.000603)
libxml.new        0.000000   0.000000   0.000000 (  0.001287)
libxml.to_s       0.000000   0.000000   0.000000 (  0.000698)
libxml.search     0.000000   0.000000   0.000000 (  0.000073)
REXML.new         0.020000   0.000000   0.020000 (  0.024030)
REXML.to_s        0.010000   0.000000   0.010000 (  0.011971)
REXML.xpath       0.000000   0.000000   0.000000 (  0.001092)
xpath expression: /feed/entry
fastxml nodes: 15
libxml nodes: 0
hpricot nodes: 15
REXML nodes: 15
feedreader and hpricot feed parsing
June 4th, 2007
After doing a little profiling (not that it was necessary) I found that REXML was the source of the lag in feed parsing. The obvious solution is to switch from REXML to some other xml parser. Having used libxml for c projects, I must admit a deep dislike of the API. The ruby-libxml interface does nothing to hide this api, and was therefore out of the running. So that left me with hpricot, which has a great api and can parse xml/html quickly.
At this point I have a feedtools rewrite/replacement sitting in svn://hasno.info/feedtools/trunk. It's currently going by the name FeedEater until such a time as it merges in with feedtools. I've left the interface almost the same, so the regular duckie rules apply. It is MUCH faster than feedtools, but does not do nearly as much. Feedtools attempts to fix any and all issues with feeds, anything from entity conversion to changing feed:// urls into http:// urls. The simple library I've written does none of that. It parses the feed into some ruby objects and spits it back at you. I did, however, provide support for caching and have implemented a matching database cache (it basically uses the feedtools cache table).
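To give a feel for what "parses the feed into some ruby objects" can look like, here's a minimal sketch. It is not FeedEater's code or interface: it uses REXML instead of hpricot purely so that it runs on a stock Ruby install, it assumes a feed without namespaces, and the Feed and Entry structs and the parse_feed method are hypothetical names picked for the example.

```ruby
require 'rexml/document'

# Hypothetical value objects for the example; not the FeedEater classes.
Entry = Struct.new(:title, :link)
Feed  = Struct.new(:title, :entries)

# Parses an Atom-like feed string into plain Ruby objects.
# No clean-up is attempted: missing elements simply come back as nil.
def parse_feed(xml)
  doc = REXML::Document.new(xml)
  entries = REXML::XPath.match(doc, '/feed/entry').map do |entry|
    link = entry.elements['link']
    Entry.new(entry.elements['title']&.text, link && link.attributes['href'])
  end
  Feed.new(doc.root.elements['title']&.text, entries)
end

feed = parse_feed(<<~XML)
  <feed>
    <title>example feed</title>
    <entry><title>first post</title><link href="http://example.com/1"/></entry>
    <entry><title>second post</title></entry>
  </feed>
XML

feed.entries.each { |entry| puts "#{entry.title}: #{entry.link.inspect}" }
```

Because the result is just structs with title, link and entries readers, anything that only calls those methods doesn't care which parser produced them, which is the duck typing point made above.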
Since all this feed parsing cruft was done in order to make the mephisto feedreader plugin faster, I've gone ahead and created a branch that uses the new library. You'll find the branch at svn://hasno.info/mephisto/plugins/branches/mephisto_feedreader. Give the library and/or plugin a try.
As an aside, after all this hpricot goodness I've decided that at some point I want to write a ruby-ish hpricot-ish libxml wrapper.