HTML and CSS Reference
However, TagSoup is primarily designed for use as a library. Its output from command-line mode leaves
something to be desired compared to Tidy. In particular:
It does not convert presentational markup to CSS.
It does not include a DOCTYPE declaration, which is needed before some browsers will recognize XHTML.
It does include an XML declaration, which needlessly confuses older browsers.
It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older
browsers.
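Some of these rough edges can be smoothed over after the fact, because TagSoup's output is well-formed and any
XML tool can read it. The following is a minimal sketch, not part of TagSoup itself: it reparses the output with
the JDK's built-in XML parser and writes it back out through an identity transform that omits the XML
declaration and adds a DOCTYPE. The class and method names are hypothetical, and the choice of the XHTML 1.0
Transitional DOCTYPE is an assumption. It does nothing about presentational markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reserializes well-formed XHTML (such as TagSoup's
// command-line output) without an XML declaration and with a DOCTYPE.
class XhtmlReserializer {

    static String reserialize(String wellFormedXhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            // An identity transform copies the document unchanged; only the
            // output properties below alter how it is written.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Assumption: XHTML 1.0 Transitional is the DOCTYPE you want.
            identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                    "-//W3C//DTD XHTML 1.0 Transitional//EN");
            identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be reserialized", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<?xml version=\"1.0\" standalone=\"yes\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>Hi<br></br>there</p></body></html>";
        System.out.println(reserialize(sample));
    }
}
```

The XML output method writes any element that has no content as a single empty-element tag, so the br and hr
pairs collapse. Be aware that it does the same to elements that merely happen to be empty, such as a script
element with only a src attribute, and that can confuse browsers in its own way. Check the result before you
publish it.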
TagSoup does not guarantee absolutely valid XHTML (though it does guarantee well-formedness). There are a
few things it cannot handle. Most important, XHTML requires all img elements to have an alt attribute. If the
alt attribute is empty, the image is purely presentational and should be ignored by screen readers. If the
attribute is not empty, it is used in place of the image by screen readers. TagSoup has no way of knowing
whether any given img with an omitted alt attribute is presentational or not, so it does not insert any such
attributes. Similarly, TagSoup does not add summaries to tables. You'll have to do that by hand, and you'll want
to validate after using TagSoup to make sure you catch all these instances.
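A validator will find these omissions, but so will a few lines of code. As a rough sketch, assuming the TagSoup
output is available as a string, the following lists the src of every img element that has no alt attribute
at all. The class and method names are hypothetical. An empty alt attribute counts as present, because that is
the legitimate way to mark a purely presentational image.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reports images that still need an alt attribute
// written by hand.
class MissingAltFinder {

    static List<String> imagesMissingAlt(String wellFormedXhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            // "*" matches img elements whether or not they are in the XHTML namespace.
            NodeList images = doc.getElementsByTagNameNS("*", "img");
            List<String> missing = new ArrayList<>();
            for (int i = 0; i < images.getLength(); i++) {
                Element img = (Element) images.item(i);
                // alt="" is present and deliberate; only a missing attribute is reported.
                if (!img.hasAttribute("alt")) {
                    missing.add(img.getAttribute("src"));
                }
            }
            return missing;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<img src=\"spacer.gif\" alt=\"\"></img>"
                + "<img src=\"chart.png\"></img>"
                + "</body></html>";
        System.out.println(imagesMissingAlt(sample));
    }
}
```

The code can only tell you where the gaps are. Deciding whether each image is presentational, and writing the
text if it is not, is still a job for a person. The same approach would work for table elements that lack a
summary attribute.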
Despite these limits, TagSoup does a huge amount of work for you at very little cost.
TagSoup versus Tidy
For an end user, the primary difference between TagSoup and Tidy is one of philosophy. Tidy will sometimes
give up and ask for help. There are some things it does not know how to fix and will not try to fix. TagSoup
never gives up. It always produces well-formed XHTML as output, though not always perfectly valid XHTML.
For the same reason, TagSoup does not warn you about what it could not handle so that you can fix it
manually. Its assumption is that you really don't care that much. If that's not true, you might prefer to use
Tidy instead. Tidy is more careful. If it isn't reasonably sure it knows what the document means, it won't
give you anything.
For a programmer, the differences are a little more striking. First, TagSoup is written in Java and Tidy is written
in C. That alone may be enough to help you decide which one to use (though there is a Java port of Tidy, called
JTidy). Another important difference is that TagSoup operates in streaming mode. Rather than working on the
entire document at once, it works on just a little piece of it at a time, moving from start to finish. That makes it
very fast and allows it to process very large documents. However, it can't do things such as add a style rule to
the head that applies to the last paragraph of the document. Because HTML documents are rarely very large
(usually a few hundred kilobytes at most), I think a whole-document approach such as Tidy's is more powerful.
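To make the trade-off concrete, here is a small sketch of the kind of edit that needs the whole document in
memory. It uses the JDK's DOM parser, not Tidy or TagSoup, and the class name, method name, and the
last-paragraph id are all hypothetical. A streaming tool has already written the head by the time it reaches
the last paragraph; a tree-based tool can simply go back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: adds a style rule to the head that targets the
// last paragraph, which requires seeing the end of the document first.
class LastParagraphStyler {

    static Document styleLastParagraph(String wellFormedXhtml, String cssDeclarations) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            NodeList paragraphs = doc.getElementsByTagNameNS("*", "p");
            NodeList heads = doc.getElementsByTagNameNS("*", "head");
            if (paragraphs.getLength() == 0 || heads.getLength() == 0) {
                return doc; // nothing to style, or nowhere to put the rule
            }

            // Node lists are in document order, so the final item is the last paragraph.
            Element last = (Element) paragraphs.item(paragraphs.getLength() - 1);
            if (!last.hasAttribute("id")) {
                last.setAttribute("id", "last-paragraph"); // assumed not to clash with other ids
            }

            // Go back to the head, long after a streaming tool would have emitted it.
            Element head = (Element) heads.item(0);
            Element style = doc.createElementNS(head.getNamespaceURI(), "style");
            style.setAttribute("type", "text/css");
            style.setTextContent("#" + last.getAttribute("id") + " { " + cssDeclarations + " }");
            head.appendChild(style);
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = styleLastParagraph(
                "<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>",
                "color: gray");
        System.out.println(doc.getElementsByTagName("style").item(0).getTextContent());
    }
}
```

The cost is memory: the entire tree has to be built before the first byte of output is written, which is
exactly what a streaming design avoids.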