LingPipe
LingPipe consists of a set of tools to perform common NLP tasks. It supports model
training and testing. There are both royalty-free and licensed versions of the tool.
Production use of the free version is limited.
To demonstrate the use of LingPipe, we will tokenize text using its Tokenizer class.
Start by declaring two lists, one to hold the tokens and a second to hold the
whitespace between them:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Next, declare a string to hold the text to be tokenized:
String text = "A sample sentence processed \nby \tthe " +
    "LingPipe tokenizer.";
Now, create an instance of the Tokenizer class. As shown in the following code block,
the tokenizer method is called on the static INSTANCE field of an Indo-European
tokenizer factory class to create the Tokenizer instance:
Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());
The tokenize method of this class is then used to populate the two lists:
tokenizer.tokenize(tokenList, whiteList);
Use a for-each statement to display the tokens:
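The loop itself is a plain for-each over tokenList. The sketch below shows it in a self-contained form that runs without the LingPipe JAR. The class TokenizeSketch and its regex-based tokenize method are hypothetical stand-ins, not part of LingPipe. They only approximate how a tokenizer fills one list with tokens and a second list with the whitespace around them. With LingPipe on the classpath, the same loop would iterate over the tokenList populated by tokenizer.tokenize(tokenList, whiteList).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; this is NOT LingPipe's tokenizer.
class TokenizeSketch {

    // Assumption: a token is a run of word characters or a single
    // punctuation character. LingPipe's real rules are more detailed.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Fills 'tokens' with the tokens and 'whitespaces' with the text found
    // before the first token, between tokens, and after the last token.
    static void tokenize(String text, List<String> tokens, List<String> whitespaces) {
        Matcher matcher = TOKEN.matcher(text);
        int previousEnd = 0;
        while (matcher.find()) {
            whitespaces.add(text.substring(previousEnd, matcher.start()));
            tokens.add(matcher.group());
            previousEnd = matcher.end();
        }
        whitespaces.add(text.substring(previousEnd));
    }

    public static void main(String[] args) {
        List<String> tokenList = new ArrayList<>();
        List<String> whiteList = new ArrayList<>();
        String text = "A sample sentence processed \nby \tthe " +
            "LingPipe tokenizer.";

        tokenize(text, tokenList, whiteList);

        // Display each token on its own line with a for-each statement.
        for (String element : tokenList) {
            System.out.println(element);
        }
    }
}
```

Run on the sample text, this stand-in prints each of the nine tokens on its own line, from A through the final period. LingPipe's own tokenizer may split some inputs differently.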