LingPipe
LingPipe is a set of tools for performing common NLP tasks. It supports model training
and testing. The tool is available in both royalty-free and licensed versions; production
use of the free version is limited.
To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize text
using the Tokenizer class. Start by declaring two lists, one to hold the tokens and a
second to hold the whitespace:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Next, declare a string to hold the text to be tokenized:
String text = "A sample sentence processed \nby \tthe "
    + "LingPipe tokenizer.";
Now, create an instance of the Tokenizer class. As shown in the following code block, the
tokenizer method is called on the shared IndoEuropeanTokenizerFactory.INSTANCE object
to create a Tokenizer for the characters of the text:
Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());
The tokenize method of this class is then used to populate the two lists:
tokenizer.tokenize(tokenList, whiteList);
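To make the roles of the two lists concrete, the following is a minimal sketch that uses only the standard Java library, not LingPipe. The class name TokenWhitespaceSketch, its tokenize helper, and the regular expression are assumptions made for illustration; LingPipe's Indo-European tokenizer applies its own, more elaborate rules, so its output may differ. The sketch only shows the general idea of collecting tokens in one list and the text between them in another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; this is NOT LingPipe code.
public class TokenWhitespaceSketch {

    // Assumed, simplified token rule: a run of word characters,
    // or a single character that is neither a word character nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Fills 'tokens' with the tokens found in 'text' and 'whitespaces'
    // with the text before, between, and after those tokens.
    static void tokenize(String text, List<String> tokens,
                         List<String> whitespaces) {
        Matcher matcher = TOKEN.matcher(text);
        int last = 0;
        while (matcher.find()) {
            // Everything since the previous token is whitespace (possibly empty).
            whitespaces.add(text.substring(last, matcher.start()));
            tokens.add(matcher.group());
            last = matcher.end();
        }
        // Trailing whitespace after the final token (possibly empty).
        whitespaces.add(text.substring(last));
    }

    public static void main(String[] args) {
        List<String> tokenList = new ArrayList<>();
        List<String> whiteList = new ArrayList<>();
        String text = "A sample sentence processed \nby \tthe "
            + "LingPipe tokenizer.";

        tokenize(text, tokenList, whiteList);

        System.out.println("Tokens: " + tokenList);
        System.out.println("Token count: " + tokenList.size());
        System.out.println("Whitespace count: " + whiteList.size());
    }
}
```

In this sketch, the newline and tab in the sample text end up in the whitespace list rather than in any token, and the whitespace list always holds one more entry than the token list because it also records the (possibly empty) text before the first token and after the last one.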
Use a for-each statement to display the tokens: