Stanford NLP
The Stanford NLP Group conducts NLP research and provides tools for NLP tasks.
Stanford CoreNLP is one of these toolsets. Others include the Stanford Parser,
the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support
English and Chinese and cover basic NLP tasks, including tokenization and named
entity recognition.
These tools are released under the full GPL, which does not allow them to be
incorporated into proprietary software; a commercial license is available for
that purpose. The API is well organized and supports the core NLP functionality.
The Stanford tools support several tokenization approaches. We will use the
PTBTokenizer class to illustrate the use of this NLP library. The constructor
demonstrated here takes a Reader object, a LexedTokenFactory<T> argument, and a
string that specifies which of the tokenizer's options to use.
The LexedTokenFactory is an interface that is implemented by the
CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the
retention of the beginning and ending character positions of a token, whereas the latter
class simply returns a token as a string without any positional information. The
WordTokenFactory class is used by default.
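The difference between the two kinds of factory can be sketched without the Stanford library. The following is a minimal, standard-library-only illustration of the idea. It is not the Stanford implementation: the names TokenFactorySketch, SketchTokenFactory, PositionedToken, and tokenize are hypothetical, and the tokenizer here splits on whitespace only, which is far simpler than what PTBTokenizer does.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFactorySketch {

    /** Hypothetical stand-in for a token factory; not the Stanford interface. */
    interface SketchTokenFactory<T> {
        T makeToken(String text, int begin, int length);
    }

    /** A token that keeps its begin (inclusive) and end (exclusive) character offsets. */
    static final class PositionedToken {
        final String text;
        final int begin;
        final int end;

        PositionedToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    /** Splits on whitespace only; the factory decides what each token object looks like. */
    static <T> List<T> tokenize(String input, SketchTokenFactory<T> factory) {
        List<T> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip any run of whitespace before the next token.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume characters up to the next whitespace or the end of input.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(factory.makeToken(input.substring(start, i), start, i - start));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "He lives at 1511 W. Randolph.";

        // A factory that discards the offsets and returns plain strings.
        List<String> plain = tokenize(text, (s, begin, length) -> s);

        // A factory that retains the begin and end offsets of each token.
        List<PositionedToken> positioned =
            tokenize(text, (s, begin, length) -> new PositionedToken(s, begin, begin + length));

        System.out.println(plain);
        System.out.println(positioned);
    }
}
```

Because the tokenizer is generic over the token type, the same scanning loop serves both cases; only the factory passed in determines whether positional information survives.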
The CoreLabelTokenFactory class is used in the following example. A
StringReader is created from a string. The last argument is the options
parameter, which is null for this example. The PTBTokenizer class implements the
Iterator interface, allowing us to use the hasNext and next methods to display the
tokens:
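import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
// The null third argument means no tokenizer options are specified
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}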
The output is as follows: