Using Boilerpipe to extract text from HTML
There are several libraries available for extracting text from HTML documents. We will
demonstrate how to use Boilerpipe (https://code.google.com/p/boilerpipe/) to perform this
operation. This is a flexible API that not only extracts the entire text of an HTML document
but can also extract selected parts of an HTML document such as its title and individual
text blocks.
We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of
Boilerpipe. Part of this page is shown in the following screenshot. In order to use
Boilerpipe, you will need to download the binary for the Xerces Parser found at
ht-