Preparing data
Text extraction is an early step in most NLP tasks. Here, we will quickly cover how text
extraction can be performed for HTML, Word, and PDF documents. Although there are
several APIs that support these tasks, we will use:
• Boilerpipe ( https://code.google.com/p/boilerpipe/ ) for HTML
• POI ( http://poi.apache.org/index.html ) for Word
• PDFBox ( http://pdfbox.apache.org/ ) for PDF
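To make concrete what "text extraction" means for HTML, here is a minimal sketch that uses only the JDK. It is not how Boilerpipe works: Boilerpipe also removes navigation and other boilerplate, and a real HTML parser copes with malformed markup far better than regular expressions do. The class name `NaiveHtmlText` and its `extract` method are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a crude, regex-based stand-in for a real HTML text extractor.
public class NaiveHtmlText {
    // Matches script and style elements together with their content.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag, including tags that span several lines.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String extract(String html) {
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; &amp; goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title><style>p{color:red}</style></head>"
                + "<body><h1>Preparing data</h1><p>Text &amp; markup.</p></body></html>";
        System.out.println(extract(html)); // prints "Demo Preparing data Text & markup."
    }
}
```

Note that the page title ends up in the output alongside the body text. Deciding which parts of a page are content and which are clutter is exactly the job a dedicated library takes on.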
Some APIs support the use of XML for input and output. For example, the Stanford
XMLUtils class provides support for reading XML files and manipulating XML data, and
LingPipe's XMLParser class parses XML text.
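As a rough illustration of pulling text out of XML, the following sketch uses the DOM parser that ships with the JDK rather than the Stanford or LingPipe classes mentioned above, whose APIs are not shown here. The class name `XmlTextExtractor` and the sample document are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: extracts the text of an XML document with the JDK's DOM parser.
public class XmlTextExtractor {
    public static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations to guard against external-entity attacks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getTextContent() concatenates all descendant text nodes in document order.
            String raw = doc.getDocumentElement().getTextContent();
            return raw.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc>\n"
                + "  <title>Preparing data</title>\n"
                + "  <p>Text extraction is an early step.</p>\n"
                + "</doc>";
        System.out.println(extractText(xml));
        // prints "Preparing data Text extraction is an early step."
    }
}
```

The whitespace between elements in the sample is what keeps the title and paragraph apart in the output. Adjacent elements with no whitespace between them would run together, so a more careful extractor would walk the tree and insert separators itself.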
Organizations store their data in many forms, and frequently it is not in simple text files.
Presentations are stored in PowerPoint slides, specifications are created using Word
documents, and companies provide marketing and other materials in PDF documents. Most
organizations have an Internet presence, which means that much useful information is found
in HTML documents. Due to the widespread nature of these data sources, we need to use
tools to extract their text for processing.