Java Reference
What is tokenization?
Tokenization is the process of breaking text down into simpler units. For most text, we are
concerned with isolating words. Tokens are split based on a set of delimiters, which are
frequently whitespace characters. Whitespace in Java is defined by the Character class's
isWhitespace method.
At times, however, a different set of delimiters is needed. For example, when splitting on
whitespace would obscure text breaks such as paragraph boundaries, and detecting those breaks
is important, a different set of delimiters is more useful.
The characters that isWhitespace treats as whitespace are listed in the following table.
Character    Meaning
(various)    Unicode space character (space_separator, line_separator, or paragraph_separator)
\t           U+0009 horizontal tabulation
\n           U+000A line feed
\u000B       U+000B vertical tabulation
\f           U+000C form feed
\r           U+000D carriage return
\u001C       U+001C file separator
\u001D       U+001D group separator
\u001E       U+001E record separator
\u001F       U+001F unit separator
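As a minimal sketch of the ideas above, the following class splits text wherever Character.isWhitespace reports a delimiter, and separately splits text on blank lines to recover paragraph boundaries. The class name WhitespaceTokenizer, the method names tokenize and splitParagraphs, and the blank-line convention for paragraphs are assumptions made for illustration; they are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class WhitespaceTokenizer {

    // Splits text into tokens, treating every code point for which
    // Character.isWhitespace returns true as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                // A delimiter ends the token being built, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Uses a different delimiter: a blank line (assumed here to mark a
    // paragraph boundary) instead of any single whitespace character.
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : text.split("\\n\\s*\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "Tokens are split\ton delimiters.\n\nA new paragraph starts here.";
        System.out.println(tokenize(text));
        System.out.println(splitParagraphs(text).size() + " paragraphs");
    }
}
```

Note that tokenize discards the paragraph break entirely, because the two line feeds are just more whitespace, while splitParagraphs preserves it. Note also that isWhitespace returns false for non-breaking spaces such as U+00A0, so text joined by them stays in one token.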
The tokenization process is complicated by a large number of factors such as:
Language: Different languages present unique challenges. Whitespace is a commonly used
delimiter, but it is not sufficient for languages such as Chinese, where words are not
separated by whitespace.
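The following short sketch illustrates the language problem. It counts the tokens produced by splitting on whitespace for an English sentence and for a Chinese sentence of roughly the same meaning. The class name ChineseWhitespaceDemo, the method countWhitespaceTokens, and the sample sentences are assumptions chosen for illustration.

```java
// Hypothetical demo class, for illustration only.
class ChineseWhitespaceDemo {

    // Counts the tokens obtained by splitting on runs of whitespace.
    static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String english = "I like natural language processing";
        // Sample Chinese sentence with no whitespace between words.
        String chinese = "我喜欢自然语言处理";

        System.out.println(countWhitespaceTokens(english)); // 5
        System.out.println(countWhitespaceTokens(chinese)); // 1
    }
}
```

The Chinese sentence comes back as a single token even though it contains several words, so a whitespace-based tokenizer cannot isolate them.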