Introduction to NLP - Natural Language Processing with Java

Stanford NLP

The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford
CoreNLP is one of these toolsets. Others include the Stanford Parser, the Stanford POS
Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and
handle basic NLP tasks, including tokenization and named entity recognition.

These tools are released under the full GPL, which does not allow them to be incorporated
into proprietary software that is distributed to others, though a commercial license is
available for that purpose. The API is well organized and supports the core NLP
functionality.

The Stanford group supports several tokenization approaches. We will use the PTBTokenizer
class to illustrate the use of this NLP library. The constructor demonstrated here takes a
Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of
several options to use.

The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and
WordTokenFactory classes. The former retains the beginning and ending character positions
of each token, whereas the latter simply returns the token as a string without any
positional information. The WordTokenFactory class is used by default.
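To make the difference between the two factory styles concrete, here is a minimal sketch that uses only the standard Java library. It is not the Stanford API: the TokenFactorySketch class, its TokenFactory interface, the PositionedToken class, and the whitespace-only tokenize method are all hypothetical names introduced for illustration, and the splitting rule is far simpler than the Penn Treebank rules. The point is only that one factory keeps character offsets while the other discards them.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFactorySketch {

    /** Hypothetical stand-in for a token factory; not the Stanford interface. */
    public interface TokenFactory<T> {
        T makeToken(String text, int begin, int length);
    }

    /** A token that remembers where it started and ended in the input. */
    public static final class PositionedToken {
        public final String text;
        public final int begin; // offset of the first character
        public final int end;   // offset one past the last character

        PositionedToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    /** Keeps character offsets, in the spirit of CoreLabelTokenFactory. */
    public static final TokenFactory<PositionedToken> POSITIONED =
        (text, begin, length) -> new PositionedToken(text, begin, begin + length);

    /** Discards offsets and returns only the token text. */
    public static final TokenFactory<String> PLAIN =
        (text, begin, length) -> text;

    /** Splits on whitespace only (an assumption; not the PTB rules). */
    public static <T> List<T> tokenize(String input, TokenFactory<T> factory) {
        List<T> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++; // skip the separator
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++; // advance to the end of the current token
            }
            tokens.add(factory.makeToken(input.substring(start, i), start, i - start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "He lives at 1511 W. Randolph.";
        System.out.println(tokenize(sentence, PLAIN));
        System.out.println(tokenize(sentence, POSITIONED));
    }
}
```

Because the sketch splits on whitespace alone, the final period stays attached to the last word, which a real tokenizer would normally separate. The same tokenize method serves both cases; only the factory passed in decides what each token carries.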

The CoreLabelTokenFactory class is used in the following example. A StringReader is
created from the string to be tokenized. The last argument is the options parameter, which
is null for this example. Because the PTBTokenizer class implements the Iterator
interface, we can use its hasNext and next methods to display the tokens:

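import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Tokenize the sentence; each CoreLabel keeps its character offsets
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}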

The output is as follows:
