Chapter 8. Combined Approaches
In this chapter, we will address several issues surrounding the use of combinations of
techniques to solve NLP problems. We start with a brief introduction to the process of
preparing data. This is followed by a discussion of pipelines and their construction. A
pipeline is nothing more than a sequence of tasks integrated to solve some problem. The
chief advantage of a pipeline is the ability to insert and remove individual elements in
order to solve a problem in a slightly different manner.
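To make the idea concrete, the following is a minimal sketch of a pipeline as an ordered list of stages, using only the JDK. The TextPipeline class and its add, remove, and run methods are hypothetical names invented for this illustration; they are not part of the Stanford API or OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration: a pipeline as an ordered, editable list of stages.
public class TextPipeline {
    private final List<UnaryOperator<String>> stages = new ArrayList<>();

    // Appends a stage to the end of the pipeline.
    public TextPipeline add(UnaryOperator<String> stage) {
        stages.add(stage);
        return this;
    }

    // Removes the stage at the given position.
    public TextPipeline remove(int index) {
        stages.remove(index);
        return this;
    }

    // Feeds the input through every stage, in order.
    public String run(String input) {
        String result = input;
        for (UnaryOperator<String> stage : stages) {
            result = stage.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        TextPipeline pipeline = new TextPipeline()
            .add(String::trim)                        // stage 0: strip outer whitespace
            .add(String::toLowerCase)                 // stage 1: normalize case
            .add(s -> s.replaceAll("\\s+", " "));     // stage 2: collapse whitespace runs

        System.out.println(pipeline.run("  The   Quick  Brown Fox "));

        // Removing one stage solves the problem in a slightly different manner.
        pipeline.remove(1);
        System.out.println(pipeline.run("  The   Quick  Brown Fox "));
    }
}
```

With all three stages the program prints "the quick brown fox"; after the case-normalization stage is removed it prints "The Quick Brown Fox". The remaining stages are untouched, which is the flexibility the pipeline approach is meant to provide.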
The Stanford API supports a good pipeline architecture, which we have used repeatedly in
this book. We will expand upon the details of this approach and then show how OpenNLP
can be used to construct a pipeline.
Preparing data for processing is an important first step in solving many NLP problems. We
introduced the data preparation process in Chapter 1, Introduction to NLP, and then
discussed the normalization process in Chapter 2, Finding Parts of Text. In this chapter, we
will focus on extracting text from different data sources, specifically HTML, Word, and PDF
documents.
The StanfordCoreNLP class is a good example of a pipeline that is easy to use. In a
sense, it is preconstructed. The actual tasks performed depend on the annotators that
are added. This works well for many types of problems.
However, other NLP APIs do not support a pipeline architecture as directly as the Stanford
API does. Although pipelines are more difficult to construct with these APIs, the result
can be more flexible for many applications. We demonstrate this construction process using
OpenNLP.
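As a rough sketch of what constructing a pipeline by hand involves, the example below wires two stages together explicitly: the output of a sentence splitter becomes the input of a tokenizer. It deliberately uses JDK stand-ins (java.text.BreakIterator and a whitespace split) rather than OpenNLP itself. In an OpenNLP version, these roles would be filled by components such as SentenceDetectorME and TokenizerME loaded from model files. The ManualPipeline class and its method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of a hand-wired pipeline using JDK stand-ins only.
public class ManualPipeline {

    // Stage 1: split raw text into sentences (BreakIterator as a stand-in).
    static List<String> detectSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Stage 2: split one sentence into tokens (naive whitespace stand-in).
    static String[] tokenize(String sentence) {
        return sentence.split("\\s+");
    }

    // The glue code: each sentence from stage 1 is passed to stage 2.
    static List<String[]> process(String text) {
        List<String[]> tokenized = new ArrayList<>();
        for (String sentence : detectSentences(text)) {
            tokenized.add(tokenize(sentence));
        }
        return tokenized;
    }

    public static void main(String[] args) {
        for (String[] tokens : process("First sentence here. Second one.")) {
            System.out.println(String.join(" | ", tokens));
        }
    }
}
```

Because the glue code is ordinary Java, either stage can be replaced or a new stage inserted between them without changing the others. The cost is that the developer, not the API, is responsible for matching each stage's output to the next stage's input.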