TagSoup
John Cowan's TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in
Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as "a SAX-compliant parser
written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor,
nasty, and brutish, though quite often far from short. TagSoup is designed for people who have to process this
stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard
XML tools to be applied to even the worst HTML."
TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It's also
straightforward to hook it up to any number of XML tools that accept input from SAX. Once you've done that,
feed in HTML, and out will come well-formed XHTML. For example:
$ java -jar tagsoup.jar index.html
<?xml version="1.0" standalone="yes"?>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing list of Java virtual machines and their capabilities"></meta></head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
...
You can improve TagSoup's output a little bit by adding the --omit-xml-declaration and --nodefaults command-line
options:
$ java -jar tagsoup.jar --omit-xml-declaration --nodefaults index.html
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing list of Java virtual machines and their capabilities"></meta></head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
...
These options drop the XML declaration and the defaulted attribute values, pieces that are likely to confuse one
browser or another.
You can use the --encoding option to specify the character encoding of the input document. For example, if you
know the document is written in Latin-1 (ISO 8859-1), you could run it like so:
$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html
TagSoup's output is always UTF-8.
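To see why the input encoding matters, here is a minimal Java sketch of the same Latin-1 to UTF-8 conversion, using only the JDK. It is not TagSoup's own code; the class name Recode and the method latin1ToUtf8 are made up for illustration. The point is that a byte such as 0xE9 only becomes the right character if it is decoded with the charset the document was actually written in.

```java
import java.nio.charset.StandardCharsets;

public class Recode {

    // Decodes the bytes as ISO 8859-1, then re-encodes the text as UTF-8.
    // Hypothetical helper for illustration; not part of TagSoup.
    public static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" in ISO 8859-1: the e-acute is the single byte 0xE9.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        byte[] utf8 = latin1ToUtf8(latin1);
        // In UTF-8 the e-acute takes two bytes (0xC3 0xA9).
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
        // prints "4 bytes in, 5 bytes out"
    }
}
```

If the same bytes were read as UTF-8 instead, the lone 0xE9 would be invalid and would typically come out as a replacement character, which is the kind of damage the --encoding option is there to prevent.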
Finally, you can use the --files option to write new copies of the input files with the extension .xhtml.
Otherwise, TagSoup prints the output on stdout, from where you can redirect it to any convenient location.
TagSoup cannot change a file in place like Tidy can.
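For the programmatic route mentioned earlier, hooking TagSoup up to a tool that accepts SAX input, the following is a minimal sketch rather than a definitive recipe. It feeds any SAX XMLReader into the JDK's identity Transformer and collects the serialized result. The class name SaxPipeline and the method serialize are hypothetical. So that the sketch runs with the JDK alone, main uses the JDK's built-in XML parser on well-formed input; the assumption is that with tagsoup.jar on the classpath you would pass TagSoup's parser class, org.ccil.cowan.tagsoup.Parser, as the reader instead, and could then feed it real-world HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxPipeline {

    // Pulls SAX events from the given reader and serializes them as XML text.
    // Hypothetical helper for illustration; works with any XMLReader.
    public static String serialize(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Force XML output; otherwise a root element named "html" may
        // switch the serializer to its HTML output method.
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's own XML parser, which needs well-formed input.
        // Assumption: with tagsoup.jar available, replace the next three lines with
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String doc = "<html><head><title>Java Virtual Machines</title></head>"
                   + "<body><h1>Java Virtual Machines</h1></body></html>";
        System.out.println(serialize(reader, new InputSource(new StringReader(doc))));
    }
}
```

Because serialize depends only on the XMLReader interface, the same method should serve any other SAX-based tool in place of the identity Transformer, such as an XSLT stylesheet or a custom ContentHandler.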