We start by creating a URL object that represents this page, as shown here. The try-catch block handles exceptions:
try {
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    …
} catch (MalformedURLException ex) {
    // Handle exceptions
} catch (BoilerpipeProcessingException | SAXException
        | IOException ex) {
    // Handle exceptions
}
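As a minimal, self-contained sketch of the URL step on its own, the following uses only java.net.URL from the standard library. The class name UrlCheck and the helper isWellFormed are hypothetical names introduced for illustration. Constructing a URL object only parses the string, so no network access takes place:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {
    // Hypothetical helper: returns true if the string parses as a URL
    // whose protocol the JVM recognizes, false otherwise.
    public static boolean isWellFormed(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A complete URL with a protocol parses without an exception.
        System.out.println(isWellFormed("http://en.wikipedia.org/wiki/Berlin"));
        // Without a protocol, the constructor throws MalformedURLException.
        System.out.println(isWellFormed("en.wikipedia.org/wiki/Berlin"));
    }
}
```

The second call shows the kind of input that the MalformedURLException catch block guards against: a string that lacks a protocol such as http.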
We will use two classes to extract text. The first is the HTMLDocument class, which represents the HTML document. The second is the TextDocument class, which represents the text within an HTML document. It consists of one or more TextBlock objects that can be accessed individually if needed.
In the following sequence, an HTMLDocument instance is created for the Berlin page. This document is converted to an input source, which the BoilerpipeSAXInput class uses to create a TextDocument instance. The sequence then calls the TextDocument class's getText method to retrieve the text. This method takes two arguments. The first argument specifies whether to include the TextBlock instances marked as content. The second argument specifies whether non-content TextBlock instances should be included. In this example, both types of TextBlock instances are included:
HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
InputSource is = htmlDoc.toInputSource();
TextDocument document =
    new BoilerpipeSAXInput(is).getTextDocument();
System.out.println(document.getText(true, true));
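To make the effect of the two getText arguments concrete, here is a simplified, self-contained model of the filtering described above. It is a sketch only, not Boilerpipe's own code: the GetTextSketch and Block classes and the sample strings are hypothetical stand-ins, and the model assumes that each block carries a single content flag and that selected blocks are joined with newlines:

```java
import java.util.List;

public class GetTextSketch {
    // Hypothetical stand-in for a text block: its text plus a content flag.
    public static final class Block {
        final String text;
        final boolean isContent;

        public Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    // Simplified model of a two-flag getText: a block is kept only if the
    // flag that matches its kind (content or non-content) is true.
    public static String getText(List<Block> blocks,
                                 boolean includeContent,
                                 boolean includeNonContent) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            boolean keep = b.isContent ? includeContent : includeNonContent;
            if (keep) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample blocks: one navigation line, one body sentence.
        List<Block> blocks = List.of(
            new Block("Jump to: navigation, search", false),
            new Block("Berlin is the capital of Germany.", true));
        System.out.print(getText(blocks, true, true));   // both blocks
        System.out.print(getText(blocks, true, false));  // content block only
    }
}
```

In this model, passing true for both arguments keeps every block, which is why navigation text can appear alongside the article text when both kinds are requested.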
Because the Berlin page is large, the call to getText(true, true) produces a long output. A partial listing of the output is as follows:
Berlin
From Wikipedia, the free encyclopedia
Jump to: navigation , search