Introduction to NLP - Natural Language Processing with Java


LingPipe

LingPipe consists of a set of tools to perform common NLP tasks. It supports model training and testing. Both royalty-free and licensed versions of the tool are available, and production use of the free version is limited.

To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize text

using the Tokenizer class. Start by declaring two lists, one to hold the tokens and a

second to hold the whitespace:

List<String> tokenList = new ArrayList<>();

List<String> whiteList = new ArrayList<>();

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from

your account at http://www.packtpub.com. If you purchased this book elsewhere, you can

visit http://www.packtpub.com/support and register to have the files e-mailed directly to

you.

Next, declare a string to hold the text to be tokenized:

String text = "A sample sentence processed \nby \tthe " +

"LingPipe tokenizer.";

Now, create an instance of the Tokenizer class. As shown in the following code block, the tokenizer method of the IndoEuropeanTokenizerFactory singleton, reached through its static INSTANCE field, creates a Tokenizer over the characters of the text:

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
        .tokenizer(text.toCharArray(), 0, text.length());

The tokenize method of this class is then used to populate the two lists:

tokenizer.tokenize(tokenList, whiteList);

Use a for-each statement to display the tokens:
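A loop of this kind might look like the following minimal sketch. So that it runs without the LingPipe JAR on the classpath, the two lists are filled by a hypothetical helper, splitOnWhitespace, which is an assumption introduced here for illustration. It only separates runs of whitespace from runs of non-whitespace, so it is a stand-in for the tokenize call and not LingPipe's algorithm. The tokens LingPipe produces may differ, for example in how punctuation is handled.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDisplaySketch {

    // Hypothetical stand-in for tokenizer.tokenize(tokenList, whiteList).
    // Splits the text into alternating runs of whitespace and non-whitespace.
    // This is NOT LingPipe's tokenizer; it only fills the two lists so the
    // display loop below has something to iterate over.
    static void splitOnWhitespace(String text,
                                  List<String> tokenList,
                                  List<String> whiteList) {
        int i = 0;
        int n = text.length();
        while (i < n) {
            int start = i;
            boolean white = Character.isWhitespace(text.charAt(i));
            // Advance to the end of the current run of same-kind characters.
            while (i < n && Character.isWhitespace(text.charAt(i)) == white) {
                i++;
            }
            (white ? whiteList : tokenList).add(text.substring(start, i));
        }
    }

    public static void main(String[] args) {
        List<String> tokenList = new ArrayList<>();
        List<String> whiteList = new ArrayList<>();
        String text = "A sample sentence processed \nby \tthe "
                + "LingPipe tokenizer.";

        splitOnWhitespace(text, tokenList, whiteList);

        // The for-each statement: print one token per line.
        for (String token : tokenList) {
            System.out.println(token);
        }
    }
}
```

With the real library, the call to splitOnWhitespace would be replaced by the tokenizer.tokenize(tokenList, whiteList) call shown earlier; the for-each loop itself stays the same.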
