Tools - Refactoring HTML: Improving the Design of Existing Web Applications

TagSoup

John Cowan's TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as "a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty, and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML."

TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It's also straightforward to hook it up to any number of XML tools that accept input from SAX. Once you've done that, feed in HTML, and out will come well-formed XHTML. For example:

$ java -jar tagsoup.jar index.html
<?xml version="1.0" standalone="yes"?>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta
name="description" content="A Growing
list of Java virtual machines and their capabilities">
</meta></head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
...

You can improve the output a little by adding the --omit-xml-declaration and --nodefaults command-line options:

$ java -jar tagsoup.jar --omit-xml-declaration --nodefaults index.html

<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta
name="description" content="A Growing
list of Java virtual machines and their capabilities"></meta>
</head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
...

These options remove the XML declaration and default attribute values, pieces that are likely to confuse one browser or another.

You can use the --encoding option to specify the character encoding of the input document. For example, if you know the document is written in Latin-1 (ISO 8859-1), you could run it like so:

$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html

TagSoup's output is always UTF-8.
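To see why the declared input encoding matters, here is a minimal Java sketch of the conversion that takes place when Latin-1 input is rewritten as UTF-8. It is not TagSoup's code; the class name Latin1ToUtf8 and the method transcode are made up for illustration, and only standard JDK classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {

    // Decodes bytes using the declared input encoding, then re-encodes the text as UTF-8.
    // Hypothetical helper for illustration; not part of TagSoup.
    static byte[] transcode(byte[] input, Charset inputCharset) {
        String text = new String(input, inputCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the final byte 0xE9 is e-acute.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};

        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1);
        // e-acute needs two bytes in UTF-8 (0xC3 0xA9), so this prints "4 bytes in, 5 bytes out".
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");

        // Decoding the same bytes as UTF-8 is wrong: a lone 0xE9 is malformed,
        // so the decoder substitutes U+FFFD, the replacement character.
        String misread = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("misread contains U+FFFD: " + (misread.indexOf('\uFFFD') >= 0));
    }
}
```

A parser that guesses the wrong input encoding makes the second kind of mistake silently, which is why it helps to state the encoding when you know it.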

Finally, you can use the --files option to write new copies of the input files with the extension .xhtml. Otherwise, TagSoup prints the output on stdout, from which you can redirect it to any convenient location. TagSoup cannot change a file in place like Tidy can.
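For use from Java code rather than the command line, the SAX hookup mentioned earlier might look roughly like the following sketch. It feeds any SAX XMLReader into an identity transform from the standard javax.xml.transform API, which serializes the events as XML. The class name SaxToXhtml and the method serialize are illustrative. So that the sketch runs on a bare JDK, main uses the JDK's own XML parser on well-formed input as a stand-in. With tagsoup.jar on the classpath, you would pass an instance of org.ccil.cowan.tagsoup.Parser instead, and could then feed in real-world HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToXhtml {

    // Pulls SAX events from the given reader and serializes them as XML text.
    static String serialize(XMLReader reader, InputSource input) throws Exception {
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Force XML serialization; some processors switch to HTML output for an <html> root.
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's XML parser, which only accepts well-formed input.
        // Assumption: with TagSoup available, replace the next three lines with
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String page = "<html><head><title>Java Virtual Machines</title></head>"
                + "<body><h1>Java Virtual Machines</h1></body></html>";
        System.out.println(serialize(reader, new InputSource(new StringReader(page))));
    }
}
```

Because serialize depends only on the XMLReader interface, the same method works with any SAX-compliant parser, and the identity transform can be swapped for an XSLT stylesheet or another SAX-consuming tool.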
