Preparing data
An important step in NLP is finding and preparing data for processing. This includes data
for training purposes and the data that needs to be processed. There are several factors that
need to be considered. Here, we will focus on the support Java provides for working with
characters.
We need to consider how characters are represented. Although we will deal primarily with
English text, other languages present unique problems. Not only are there differences in
how a character can be encoded, but the order in which text is read also varies. For example,
traditional Japanese text is written in vertical columns that are read from right to left.
There are also a number of possible encodings. These include ASCII, Latin, and Unicode
to mention a few. A more complete list is found in the following table. Unicode, in particu-
lar, is a complex and extensive encoding scheme:
Encoding   Description
ASCII      A character encoding that uses 128 values (0-127).
Latin      There are several Latin variations that use 256 values. They include characters with diacritical marks, such as the umlaut, as well as other characters. Various versions of Latin have been introduced to address different languages, such as Turkish and Esperanto.
Big5       A two-byte encoding that addresses the Chinese character set.
Unicode    There are three common encoding forms for Unicode: UTF-8, UTF-16, and UTF-32, whose code units are 1, 2, and 4 bytes, respectively. Unicode can represent virtually every written language in use today; even constructed scripts such as Klingon and Elvish have unofficial mappings in its Private Use Areas.
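To see why the choice of encoding matters in Java code, the following minimal sketch (the sample string is arbitrary) encodes a string as UTF-8 bytes and then decodes those bytes with two different charsets; only the matching charset reproduces the original text:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String text = "Straße";   // arbitrary sample containing a non-ASCII character

        // Encode the string as UTF-8 bytes.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the correct charset reproduces the original text.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));      // Straße

        // Decoding the same bytes as ISO-8859-1 (a Latin variant) misinterprets
        // the multi-byte character and produces garbled output.
        System.out.println(new String(utf8Bytes, StandardCharsets.ISO_8859_1)); // StraÃŸe
    }
}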
Java is capable of handling these encoding schemes. The javac executable's -encod-
ing command-line option is used to specify the encoding scheme to use. In the following
command line, the Big5 encoding scheme is specified:
javac -encoding Big5
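The -encoding option only affects how javac reads source files. At runtime, the same concern applies when reading data, where the charset is passed explicitly to an InputStreamReader. The following sketch assumes a hypothetical Big5-encoded text file named data.big5:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Big5Reader {
    public static void main(String[] args) throws IOException {
        // Big5 is looked up by name; not every JVM is guaranteed to support it.
        Charset big5 = Charset.forName("Big5");

        // "data.big5" is a placeholder for a Big5-encoded text file.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(Paths.get("data.big5")), big5))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is decoded from Big5 into Java's internal representation.
                System.out.println(line);
            }
        }
    }
}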
Character processing is supported using the primitive data type char, the Character
class, and several other classes and interfaces as summarized in the following table:
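Before turning to that table, the short sketch below (the sample string is made up) previews a few of the Character methods that are typically used when examining text character by character:

public class CharacterDemo {
    public static void main(String[] args) {
        String sample = "NLP 2.0!";   // arbitrary sample text

        for (char ch : sample.toCharArray()) {
            if (Character.isLetter(ch)) {
                System.out.println(ch + " is a letter");
            } else if (Character.isDigit(ch)) {
                System.out.println(ch + " is a digit");
            } else if (Character.isWhitespace(ch)) {
                System.out.println("whitespace");
            } else {
                System.out.println(ch + " is punctuation or a symbol");
            }
        }

        // Case conversion is available for individual char values as well as for strings.
        System.out.println(Character.toUpperCase('a'));   // prints A
    }
}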