Using the WhitespaceTokenizer class
As its name implies, this class uses whitespace as the delimiter. In the following code
sequence, the tokenize method is called on the class's shared INSTANCE field, with
paragraph as input. The for statement then displays the tokens:
String[] tokens =
    WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    System.out.println(token);
}
The output is as follows:
Let's
pause,
and
then
reflect.
Although this does not separate contractions and similar units of text, it can be useful for
some applications. The class also has a tokenizePos method that returns the boundaries
of the tokens as Span objects.
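To make the idea of token boundaries concrete, the following is a minimal plain-Java sketch that computes start and end offsets for whitespace-delimited tokens. It is an illustration only and not OpenNLP's implementation: the class WhitespaceSpanSketch and the method tokenSpans are hypothetical names, and the input string is an assumption reconstructed from the output shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT OpenNLP's implementation.
final class WhitespaceSpanSketch {

    // Hypothetical helper: returns {start, end} offsets (end exclusive)
    // for every run of non-whitespace characters in the text.
    static List<int[]> tokenSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                       // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i}); // a token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed input, reconstructed from the tokens listed in the output.
        String paragraph = "Let's pause, and then reflect.";
        for (int[] span : tokenSpans(paragraph)) {
            System.out.println("[" + span[0] + ".." + span[1] + ") "
                    + paragraph.substring(span[0], span[1]));
        }
    }
}
```

Offsets like these let a caller map each token back to its position in the original text, which the token strings alone do not allow.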
Using the TokenizerME class
The TokenizerME class performs tokenization with a statistical model trained using
Maximum Entropy (maxent). The maxent model is used to learn relationships in the
data, in our case, text. Some text sources, such as social media posts, are not well
formatted and use a lot of slang and special symbols such as emoticons. On such text,
a statistical tokenizer, such as one based on a maxent model, can improve the quality
of the tokenization process.
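As a rough sketch of what a maxent model computes, the example below scores two outcomes by summing the weights of the active features for each outcome, exponentiating the sums, and normalizing them into probabilities. This is the general form of multinomial logistic regression, not OpenNLP's code. The class name MaxentSketch, the feature names, and the weights are all hypothetical, and framing tokenization as a "split" versus "no split" decision at a character position is an assumption made for illustration.

```java
import java.util.List;
import java.util.Map;

// Illustrative maxent (multinomial logistic regression) scoring; NOT OpenNLP code.
final class MaxentSketch {

    // Returns P(outcome | active features) for each outcome, computed as a
    // softmax over the summed weights of the active features.
    static double[] probabilities(List<Map<String, Double>> weightsPerOutcome,
                                  List<String> activeFeatures) {
        double[] scores = new double[weightsPerOutcome.size()];
        double total = 0.0;
        for (int k = 0; k < scores.length; k++) {
            double sum = 0.0;
            for (String feature : activeFeatures) {
                // Features with no learned weight contribute nothing.
                sum += weightsPerOutcome.get(k).getOrDefault(feature, 0.0);
            }
            scores[k] = Math.exp(sum);
            total += scores[k];
        }
        for (int k = 0; k < scores.length; k++) {
            scores[k] /= total; // normalize so the probabilities sum to 1
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical weights; a real model learns these from training data.
        Map<String, Double> split = Map.of("next=apostrophe", 1.2, "prev=letter", 0.3);
        Map<String, Double> noSplit = Map.of("next=letter", 0.9, "prev=letter", 0.1);

        double[] p = probabilities(List.of(split, noSplit),
                                   List.of("next=apostrophe", "prev=letter"));
        System.out.printf("P(split)=%.3f P(no split)=%.3f%n", p[0], p[1]);
    }
}
```

Because the decision depends on learned weights rather than a fixed delimiter rule, the same mechanism can adapt to whatever conventions appear in the training text.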
Note
A detailed discussion of this model is not possible here due to its complexity. A good
starting point for an interested reader can be found at
http://en.wikipedia.org/w/index.php?title=Multinomial_logistic_regression&redirect=no.