We start by creating a URL object that represents this page, as shown here. The try-catch
block handles the exceptions that the URL constructor and the later extraction steps can
throw:
try {
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    // The fetching and extraction statements shown later go here
} catch (MalformedURLException ex) {
    // Handle a badly formed URL
} catch (BoilerpipeProcessingException | SAXException
        | IOException ex) {
    // Handle fetch, parse, and extraction errors
}
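As a small illustration of the first catch block, the following sketch uses only the
standard library to show when a MalformedURLException is raised. It is not part of the
original example: the class name UrlCheck and the helper describe are hypothetical names
chosen here. Constructing a URL only parses the string, so no network connection is made.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCheck {
    // Hypothetical helper: reports the host and path of a URL string,
    // or a short note if the string cannot be parsed as a URL.
    static String describe(String spec) {
        try {
            // Parsing only; the constructor does not open a connection
            URL url = new URL(spec);
            return "host=" + url.getHost() + ", path=" + url.getPath();
        } catch (MalformedURLException ex) {
            // Raised, for example, when the protocol is missing or unknown
            return "malformed: " + spec;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("http://en.wikipedia.org/wiki/Berlin"));
        System.out.println(describe("wiki/Berlin")); // no protocol given
    }
}
```

In recent Java releases the URL constructor is marked as deprecated, so the compiler may
print a warning. The sketch keeps the constructor to match the example above.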
We will use two classes to extract text. The first is the HTMLDocument class, which
represents the HTML document. The second is the TextDocument class, which represents the
text within an HTML document. A TextDocument consists of one or more TextBlock objects
that can be accessed individually if needed.
In the following sequence, an HTMLDocument instance is created for the Berlin page and
converted to an InputSource. The BoilerpipeSAXInput class uses this input source to create
a TextDocument instance. We then use the TextDocument class's getText method to retrieve
the text. This method takes two arguments. The first argument specifies whether to include
the TextBlock instances marked as content. The second argument specifies whether
non-content TextBlock instances should be included. In this example, both types of
TextBlock instances are included:
HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
InputSource is = htmlDoc.toInputSource();
TextDocument document =
    new BoilerpipeSAXInput(is).getTextDocument();
System.out.println(document.getText(true, true));
The output of this sequence is quite large since the page is large. A partial listing of the
output is as follows:
Berlin
From Wikipedia, the free encyclopedia
Jump to: navigation , search
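To make the effect of the two getText arguments concrete, the following self-contained
sketch models the behavior described above without using Boilerpipe. The Block class is a
hypothetical stand-in for TextBlock, the getText helper is a simplified assumption about
how the two flags select blocks rather than the library's own code, and the sample strings
are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class GetTextSketch {
    // Hypothetical stand-in for a TextBlock: some text plus a content flag
    static class Block {
        final String text;
        final boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    // Joins the selected blocks, one per line. Content blocks are kept when
    // includeContent is true; all other blocks when includeNonContent is true.
    static String getText(List<Block> blocks, boolean includeContent,
                          boolean includeNonContent) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            boolean wanted = b.isContent ? includeContent : includeNonContent;
            if (wanted) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample blocks: one navigation line and one body sentence
        List<Block> blocks = Arrays.asList(
            new Block("Jump to: navigation, search", false),
            new Block("Berlin is the capital of Germany.", true));

        System.out.print(getText(blocks, true, true));  // both kinds of block
        System.out.print(getText(blocks, true, false)); // content blocks only
    }
}
```

Passing true for the first argument and false for the second keeps only the blocks marked
as content, which is how navigation text such as the "Jump to" line could be left out once
the blocks have been classified.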