Combined Approaches - Natural Language Processing with Java


Preparing data

Text extraction is an early step in most NLP tasks. This section briefly covers how to extract text from HTML, Word, and PDF documents. Several APIs support these tasks; we will use:

• Boilerpipe ( https://code.google.com/p/boilerpipe/ ) for HTML

• POI ( http://poi.apache.org/index.html ) for Word

• PDFBox ( http://pdfbox.apache.org/ ) for PDF
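The pairing of document formats and libraries in the list above can be sketched as a small dispatcher that picks a library from a file name. This is only an illustration: the enum name ExtractorChoice, the method forFileName, and the set of recognized file extensions are assumptions made here, not part of Boilerpipe, POI, or PDFBox.

```java
import java.util.Locale;

/**
 * Hypothetical helper: maps a file name to the extraction library
 * named in the list above. The extensions checked are assumptions.
 */
enum ExtractorChoice {
    BOILERPIPE, POI, PDFBOX, UNSUPPORTED;

    static ExtractorChoice forFileName(String fileName) {
        // Compare case-insensitively so "REPORT.PDF" and "report.pdf" match.
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".html") || name.endsWith(".htm")) {
            return BOILERPIPE;   // HTML pages
        }
        if (name.endsWith(".doc") || name.endsWith(".docx")) {
            return POI;          // Word documents
        }
        if (name.endsWith(".pdf")) {
            return PDFBOX;       // PDF documents
        }
        return UNSUPPORTED;      // anything else needs another tool
    }
}
```

A caller could switch on the returned value and hand the file to the matching library's extraction code.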

Some APIs also accept XML as input or produce it as output. For example, the Stanford XMLUtils class supports reading XML files and manipulating XML data, and LingPipe's XMLParser class parses XML text.
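To make the idea of pulling text out of XML concrete, here is a minimal sketch that uses the DOM parser bundled with the JDK rather than the Stanford or LingPipe classes mentioned above. The class name XmlTextSketch and the method extractText are hypothetical names chosen for this illustration, and collapsing whitespace to single spaces is an assumption about what later NLP steps would want.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: extracts the plain text held in an XML string. */
final class XmlTextSketch {

    static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates the text of all descendant nodes,
            // dropping the element tags themselves.
            String raw = doc.getDocumentElement().getTextContent();
            // Collapse runs of whitespace left over from the markup layout.
            return raw.replaceAll("\\s+", " ").trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }
}
```

Calling XmlTextSketch.extractText on a small document returns its text with the tags removed, which can then be passed to a tokenizer or other downstream step.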

Organizations store their data in many forms, and frequently it is not in simple text files. Presentations are stored as PowerPoint slides, specifications are written as Word documents, and companies distribute marketing and other materials as PDF documents. Most organizations have an Internet presence, which means that much useful information is found in HTML documents. Because these data sources are so widespread, we need tools to extract their text for processing.
