Java Reference
In-Depth Information
Using Boilerpipe to extract text from HTML
Several libraries are available for extracting text from HTML documents. We will
demonstrate how to use Boilerpipe (https://code.google.com/p/boilerpipe/) to perform this
operation. It is a flexible API that can extract not only the entire text of an HTML
document but also selected parts of it, such as its title and individual text blocks.
We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of
Boilerpipe. Part of this page is shown in the following screenshot. In order to use
Boilerpipe, you will need to download the binary for the Xerces Parser found at
http://xerces.apache.org/index.html.
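As a rough preview of the three capabilities mentioned above (full text, title, and individual text blocks), the following is a minimal sketch rather than a definitive implementation. It assumes Boilerpipe 1.2.x is on the classpath together with Xerces (and NekoHTML, which Boilerpipe appears to rely on for HTML parsing). The class name BoilerpipeSketch, the helper parse, and the inline sample HTML are illustrative and do not come from the Boilerpipe distribution; a small inline string is used here instead of the Wikipedia page so that the sketch does not need network access.

```java
import java.io.StringReader;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

public class BoilerpipeSketch {

    // Hypothetical helper: parses an HTML string into Boilerpipe's
    // TextDocument model (title plus a list of text blocks).
    static TextDocument parse(String html) throws Exception {
        InputSource source = new InputSource(new StringReader(html));
        return new BoilerpipeSAXInput(source).getTextDocument();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative sample input, not the actual Wikipedia page.
        String html = "<html><head><title>Berlin</title></head><body>"
                + "<p>Berlin is the capital and largest city of Germany.</p>"
                + "<p>It is also one of the sixteen German states.</p>"
                + "</body></html>";

        // Selected parts: the title and the individual text blocks.
        TextDocument doc = parse(html);
        System.out.println("Title: " + doc.getTitle());
        for (TextBlock block : doc.getTextBlocks()) {
            System.out.println("Block: " + block.getText());
        }

        // Text of the page as selected by one of the bundled extractors.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println("Extracted text: " + text);
    }
}
```

ArticleExtractor is only one of the extractors shipped with the library; which blocks it keeps depends on its heuristics, so the extracted text for a very short document like this one may differ from the raw block list.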