Combined Approaches - Natural Language Processing with Java

Using Boilerpipe to extract text from HTML

There are several libraries available for extracting text from HTML documents. We will demonstrate how to use Boilerpipe (https://code.google.com/p/boilerpipe/) to perform this operation. This is a flexible API that not only extracts the entire text of an HTML document but can also extract selected parts of an HTML document such as its title and individual text blocks.
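To make the idea of "title plus individual text blocks" concrete, the following is a minimal sketch that uses only the JDK's regular expressions. It is not Boilerpipe and does not reproduce its API or its extraction algorithm: the class name `HtmlTextSketch`, its methods, and the sample HTML are all hypothetical, and each `<p>` element is simply assumed to be one text block. Regex-based parsing is fragile on real pages, which is one reason to prefer a dedicated library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: this is NOT Boilerpipe and not its algorithm.
// All names here are hypothetical and exist only for this sketch.
class HtmlTextSketch {

    // Captures the contents of the first <title> element.
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Captures the contents of each <p> element (assumed to be one text block).
    private static final Pattern PARAGRAPH = Pattern.compile(
        "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag left inside a captured fragment, such as <b> or </b>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the page title, or an empty string if there is none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? clean(m.group(1)) : "";
    }

    // Returns the non-empty paragraph texts in document order.
    static List<String> extractTextBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String text = clean(m.group(1));
            if (!text.isEmpty()) {
                blocks.add(text);
            }
        }
        return blocks;
    }

    // Returns all text blocks joined by newlines.
    static String extractText(String html) {
        return String.join("\n", extractTextBlocks(html));
    }

    // Strips nested tags and collapses runs of whitespace.
    private static String clean(String fragment) {
        return TAG.matcher(fragment).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not the actual Wikipedia page.
        String html = "<html><head><title>Berlin</title></head><body>"
            + "<p>Berlin is the <b>capital</b> of Germany.</p>"
            + "<p>It lies on the river Spree.</p>"
            + "</body></html>";
        System.out.println("Title: " + extractTitle(html));
        for (String block : extractTextBlocks(html)) {
            System.out.println("Block: " + block);
        }
    }
}
```

The sketch separates the three kinds of output described above (title, individual blocks, and the full text) into three methods, so each can be requested on its own.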

We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of Boilerpipe. Part of this page is shown in the following screenshot. In order to use Boilerpipe, you will need to download the binary for the Xerces Parser found at ht-
