TagSoup: “Just Keep On Truckin’”
I just discovered John Cowan’s TagSoup, an XML parser for Java (SAX interface) that can parse documents even if they are not well-formed. It can also parse HTML documents (which, unlike XHTML documents, are not necessarily well-formed XML).
It should be very easy to implement screen scraping by combining this parser with XSLT:
- Parse the HTML web page with TagSoup to convert it into a well-formed XML document.
- Use XPath Explorer to construct the XPath expressions that select the content you are interested in.
- Use the XPath expressions to write an XSLT stylesheet that converts the web page into a format you like (e.g. RSS).
- Hope that the layout of the web site does not change too much in the future…
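The steps above could be wired together roughly as follows. This is only a minimal sketch, not a tested scraper: the class name `ScrapeSketch` and the method `transform` are made up for illustration, and the SAX parser is passed in as a plain `XMLReader` so that the code itself depends on nothing beyond the JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, named here for illustration only.
public class ScrapeSketch {

    /**
     * Reads the page with the given SAX parser and applies the XSLT
     * stylesheet to the resulting event stream.
     */
    public static String transform(XMLReader reader, String page, String stylesheet)
            throws TransformerException {
        // Compile the stylesheet that contains the XPath expressions.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));

        // A SAXSource makes the transformer read its input through the
        // supplied parser instead of the JDK's default XML parser.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(page)));

        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```

With the TagSoup jar on the classpath, the call would look something like `ScrapeSketch.transform(new org.ccil.cowan.tagsoup.Parser(), html, xslt)`. One thing to watch for: by default TagSoup reports elements in the XHTML namespace (`http://www.w3.org/1999/xhtml`), so the XPath expressions in the stylesheet need a prefix bound to that namespace, e.g. `//h:h1` rather than `//h1`.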
I have done the above before, but with HTML Tidy instead of TagSoup to convert the HTML web pages to XHTML, and I ran into the following problems:
- HTML Tidy aborts when encountering non-HTML tags. TagSoup claims that it never throws an exception, no matter what it finds in the document.
- It’s a hassle to run native command line tools from Java programs.
- JTidy, the Java version of HTML Tidy, did not work as well as the C version. The last release of JTidy seems to have been in August 2000 or 2001.