TagSoup: “Just Keep On Truckin’”

I just discovered John Cowan’s TagSoup, an XML parser for Java (SAX interface) that can parse documents even if they are not well-formed. It can also parse HTML documents (which, unlike XHTML documents, are not necessarily well-formed XML).

It should be very easy to implement screen scraping by combining this parser with XSLT:

  • Parse the HTML web page with TagSoup and convert it to a well-formed XML document.
  • Use XPath Explorer to construct the XPath expressions that select the content you are interested in.
  • Use the XPath expressions to write an XSLT stylesheet that converts the web page into a format you like (e.g. RSS).
  • Hope that the layout of the web site does not change too much in the future…
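
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: for real, messy pages you would pass in TagSoup's `org.ccil.cowan.tagsoup.Parser` (which implements the SAX `XMLReader` interface) and need the TagSoup jar on the classpath. To keep the example runnable with nothing but the JDK, `main` uses the JDK's own SAX parser on a well-formed page as a stand-in. The class name `ScrapeSketch`, the method `scrape`, and the stylesheet that turns `<h2>` headings into RSS-like `<item>` elements are all made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ScrapeSketch {

    // Hypothetical stylesheet: every <h2> on the page becomes an <item>.
    // local-name() is used so the XPath matches whether or not the parser
    // reports elements in the XHTML namespace.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<channel>"
      + "<xsl:for-each select=\"//*[local-name()='h2']\">"
      + "<item><title><xsl:value-of select='normalize-space(.)'/></title></item>"
      + "</xsl:for-each>"
      + "</channel>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Feeds the SAX events produced by `reader` straight into an XSLT
    // transformation and returns the result as a string.
    static String scrape(XMLReader reader, String page, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new SAXSource(reader, new InputSource(new StringReader(page))),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's SAX parser, which needs well-formed input.
        // With TagSoup on the classpath you would instead use:
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // XSLT needs namespace-aware SAX events
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String page = "<html><body><h2>First post</h2><p>text</p>"
                    + "<h2>Second post</h2></body></html>";
        System.out.println(scrape(reader, page, XSLT));
    }
}
```

Because `scrape` only depends on the `XMLReader` interface, swapping the stand-in parser for TagSoup should not require any other change. As far as I know, TagSoup reports elements in the XHTML namespace by default, which is why the stylesheet matches on `local-name()` rather than on plain element names.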

I had already tried the approach above with HTML Tidy, rather than TagSoup, to convert HTML web pages to XHTML, but I ran into the following problems:

  • HTML Tidy aborts when encountering non-HTML tags. TagSoup claims that it never throws an exception, no matter what it finds in the document.
  • It’s a hassle to run native command line tools from Java programs.
  • JTidy, the Java version of HTML Tidy, did not work as well as the C version. The last release of JTidy seems to have been in August 2000 or 2001.