Finding Parts of Text - Natural Language Processing with Java - page 45

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, and these delimiters are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method. These characters are listed in the following table.

However, at times a different set of delimiters is needed. For example, different delimiters are useful when splitting on whitespace would obscure text breaks, such as paragraph boundaries, and detecting these breaks is important.
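As a minimal sketch of using a different delimiter, the following example first splits text at paragraph boundaries and only then splits each paragraph on whitespace. It assumes that a paragraph boundary is marked by one or more blank lines; the class name ParagraphTokenizer and the method tokenizeByParagraph are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphTokenizer {

    // Assumption: a blank line (a newline, optional whitespace, then another
    // newline) marks a paragraph boundary. Other texts may need another rule.
    private static final String PARAGRAPH_DELIMITER = "\\n\\s*\\n";

    // Returns one list of whitespace-delimited tokens per paragraph.
    static List<List<String>> tokenizeByParagraph(String text) {
        List<List<String>> paragraphs = new ArrayList<>();
        for (String paragraph : text.trim().split(PARAGRAPH_DELIMITER)) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty input or stray blank segments
            }
            // Inside a paragraph, any run of whitespace separates tokens.
            paragraphs.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "One two.\nThree.\n\nFour five.";
        for (List<String> paragraph : tokenizeByParagraph(text)) {
            System.out.println(paragraph);
        }
    }
}
```

Splitting the same text on whitespace alone would return a single flat list of five tokens, and the paragraph break between "Three." and "Four" would be lost.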

Character                  Meaning
Unicode space character    space_separator, line_separator, or paragraph_separator
\t                         U+0009 horizontal tabulation
\n                         U+000A line feed
\u000B                     U+000B vertical tabulation
\f                         U+000C form feed
\r                         U+000D carriage return
\u001C                     U+001C file separator
\u001D                     U+001D group separator
\u001E                     U+001E record separator
\u001F                     U+001F unit separator
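The following is a minimal sketch of a tokenizer that treats exactly the characters accepted by Character.isWhitespace as delimiters. The class name WhitespaceTokenizer and the method tokenize are hypothetical names used only for illustration; this is one simple way to apply the definition above, not the only one.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenizer {

    // Splits text into tokens, using every code point for which
    // Character.isWhitespace returns true as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the token collected so far, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        // Add the final token when the text does not end with a delimiter.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input mixes a space, a tab, a unit separator, and a line feed.
        String text = "Let's pause,\tand then\u001Freflect.\n";
        System.out.println(tokenize(text)); // [Let's, pause,, and, then, reflect.]
    }
}
```

Note that punctuation stays attached to the neighboring word, because only whitespace acts as a delimiter here. Also, Character.isWhitespace returns false for the non-breaking space (\u00A0), so that character does not split a token.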

The tokenization process is complicated by a large number of factors, such as:

• Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it is not sufficient if we need to work with a language such as Chinese, where whitespace is not used to separate words.
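To illustrate the language issue, the sketch below counts the tokens produced by a plain whitespace split for an English sentence and for a Chinese sentence. The class name WhitespaceSplitDemo, the method countWhitespaceTokens, and both sample sentences are assumptions made for this illustration.

```java
final class WhitespaceSplitDemo {

    // Counts the tokens produced by splitting on runs of whitespace.
    static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String english = "I like natural language processing";
        // A Chinese sentence with roughly the same meaning, written
        // without any whitespace between words.
        String chinese = "我喜欢自然语言处理";

        System.out.println(countWhitespaceTokens(english)); // 5
        System.out.println(countWhitespaceTokens(chinese)); // 1
    }
}
```

The whole Chinese sentence comes back as a single token, so a word-level tokenizer for such text needs a different strategy than splitting on whitespace.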
