Combined Approaches - Natural Language Processing with Java


We start by creating a URL object that represents this page, as shown here. The try-catch block handles exceptions:

try {
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    …
} catch (MalformedURLException ex) {
    // Handle exceptions
} catch (BoilerpipeProcessingException | SAXException | IOException ex) {
    // Handle exceptions
}
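As a minimal, JDK-only sketch of the URL step, the following shows how a malformed string surfaces as a MalformedURLException before any network access happens. The class name UrlSketch, the helper parse, and the deliberately invalid scheme are assumptions made for illustration and are not part of the original example.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlSketch {
    // Hypothetical helper: returns the parsed URL, or null if the string is malformed.
    static URL parse(String spec) {
        try {
            // Constructing a URL only parses the string; it does not open a connection.
            return new URL(spec);
        } catch (MalformedURLException ex) {
            System.err.println("Bad URL: " + ex.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        URL ok = parse("http://en.wikipedia.org/wiki/Berlin");
        System.out.println(ok.getHost()); // en.wikipedia.org
        System.out.println(ok.getPath()); // /wiki/Berlin

        // An unknown scheme (made up here) is rejected at construction time.
        System.out.println(parse("nosuchscheme://example") == null); // true
    }
}
```

Catching MalformedURLException separately, as the original does, keeps a bad address distinct from the I/O and parsing failures that can occur later when the page is fetched.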

We will use two classes to extract text. The first is the HTMLDocument class that represents the HTML document. The second is the TextDocument class that represents the text within an HTML document. It consists of one or more TextBlock objects that can be accessed individually if needed.

In the following sequence, an HTMLDocument instance is created for the Berlin page and converted to an InputSource. The BoilerpipeSAXInput class uses this input source to create a TextDocument instance. The TextDocument class's getText method is then used to retrieve the text. This method takes two arguments. The first argument specifies whether to include the TextBlock instances marked as content. The second argument specifies whether non-content TextBlock instances should be included. In this example, both types of TextBlock instances are included:

HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
InputSource is = htmlDoc.toInputSource();
TextDocument document =
    new BoilerpipeSAXInput(is).getTextDocument();
System.out.println(document.getText(true, true));
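To make the effect of the two getText arguments concrete, here is a small self-contained model of the selection logic described above. It is only a sketch: the Block class, the sample strings, and this getText method are hypothetical stand-ins written for illustration, not Boilerpipe's own classes or implementation.

```java
import java.util.Arrays;
import java.util.List;

class GetTextFlagsSketch {
    // Hypothetical stand-in for a text block: some text plus a content flag.
    static final class Block {
        final String text;
        final boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    // Models the meaning of the two flags: a content block is kept when
    // includeContent is true, a non-content block when includeNonContent is true.
    static String getText(List<Block> blocks,
                          boolean includeContent,
                          boolean includeNonContent) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            boolean wanted = b.isContent ? includeContent : includeNonContent;
            if (wanted) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample blocks: one navigation line, one body sentence.
        List<Block> blocks = Arrays.asList(
            new Block("Jump to: navigation, search", false),
            new Block("Sample body sentence about Berlin.", true));

        System.out.print(getText(blocks, true, true));  // both lines
        System.out.print(getText(blocks, true, false)); // body sentence only
    }
}
```

Under this model, passing true for the first argument and false for the second would drop lines such as navigation text, while passing true for both keeps every block, which is why the full listing is so long.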

Because the page is long, the output of this sequence is large. A partial listing follows:

Berlin

From Wikipedia, the free encyclopedia

Jump to: navigation , search
