Stanford NLP
The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford CoreNLP is one of these toolsets; others include the Stanford Parser, the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and cover basic NLP tasks, including tokenization and named entity recognition.
These tools are released under the full GPL, which does not allow them to be incorporated into proprietary software that is distributed to others; a commercial license is available for that purpose. The API is well organized and supports the core NLP functionality.
There are several tokenization approaches supported by the Stanford group. We will use the PTBTokenizer class to illustrate the use of this NLP library. The constructor demonstrated here takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of the several tokenizer options to use.
The LexedTokenFactory is an interface that is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default.
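To make the difference between the two kinds of token concrete, the following is a minimal sketch that uses only the standard Java library. It does not use the Stanford classes: OffsetTokenDemo, PositionedToken, tokenizeWithOffsets, and tokenizeAsStrings are hypothetical names introduced for illustration, and the simple regular expression does not reproduce Penn Treebank tokenization rules. It only shows what it means for a token to retain its beginning and ending character positions versus being returned as a bare string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: hypothetical classes, not part of the Stanford API.
class OffsetTokenDemo {

    // A token that remembers where it came from in the source text.
    static final class PositionedToken {
        final String text;
        final int begin; // index of the first character (inclusive)
        final int end;   // index just past the last character (exclusive)

        PositionedToken(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + begin + "," + end + ")";
        }
    }

    // A run of word characters, or any single character that is
    // neither a word character nor whitespace (assumed, simplistic rule).
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Tokens that keep their character offsets.
    static List<PositionedToken> tokenizeWithOffsets(String text) {
        List<PositionedToken> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new PositionedToken(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    // Plain strings only: the positional information is discarded.
    static List<String> tokenizeAsStrings(String text) {
        List<String> tokens = new ArrayList<>();
        for (PositionedToken t : tokenizeWithOffsets(text)) {
            tokens.add(t.text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "He lives at 1511 W. Randolph.";
        System.out.println(tokenizeAsStrings(text));
        for (PositionedToken t : tokenizeWithOffsets(text)) {
            System.out.println(t);
        }
    }
}
```

Offsets of this kind are useful when a later step needs to map a token back to the original text, for example to highlight it or to extract the surrounding characters.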
The CoreLabelTokenFactory class is used in the following example. A StringReader is created from a string. The last argument is the options parameter, which is null for this example. The PTBTokenizer class implements the Iterator interface, allowing us to use the hasNext and next methods to display the tokens:
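import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
// Build a tokenizer that produces CoreLabel tokens; no options are passed (null).
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
// PTBTokenizer is an Iterator, so print each token in turn.
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}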
The output is as follows: