Preparing data
An important step in NLP is finding and preparing data for processing. This includes data
for training purposes and the data that needs to be processed. There are several factors that
need to be considered. Here, we will focus on the support Java provides for working with
characters.
We need to consider how characters are represented. Although we will deal primarily with
English text, other languages present unique problems. Not only are there differences in
how a character can be encoded, but the order in which text is read also varies. For example,
traditional Japanese text is written in vertical columns that are read from right to left.
There are also a number of possible encodings. These include ASCII, Latin, and Unicode
to mention a few. A more complete list is found in the following table. Unicode, in particu-
lar, is a complex and extensive encoding scheme:
Encoding   Description
ASCII      A character encoding that uses 128 values (0-127).
Latin      There are several Latin variations that use 256 values. They include characters with diacritical marks, such as the umlaut, as well as other characters. Various versions of Latin have been introduced to address different languages, such as Turkish and Esperanto.
Big5       A two-byte encoding that addresses the Chinese character set.
Unicode    There are three common encoding forms for Unicode: UTF-8, UTF-16, and UTF-32, whose code units are 1, 2, and 4 bytes, respectively. Unicode can represent virtually every written language in use today; even constructed scripts such as Klingon and Elvish have unofficial mappings in its Private Use Areas.
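To see why the choice of encoding matters in Java code, the following minimal sketch (the sample string is arbitrary) encodes a string as UTF-8 bytes and then decodes those bytes with two different charsets; only the matching charset reproduces the original text:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String text = "Straße";   // arbitrary sample containing a non-ASCII character

        // Encode the string as UTF-8 bytes.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding with the correct charset reproduces the original text.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));      // Straße

        // Decoding the same bytes as ISO-8859-1 (a Latin variant) misinterprets
        // the multi-byte character and produces garbled output.
        System.out.println(new String(utf8Bytes, StandardCharsets.ISO_8859_1)); // StraÃŸe
    }
}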
Java is capable of handling these encoding schemes. The javac executable's -encod-
ing command-line option is used to specify the encoding scheme to use. In the following
command line, the Big5 encoding scheme is specified:
javac -encoding Big5
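The -encoding option only affects how javac reads source files. At runtime, the same concern applies when reading data, where the charset is passed explicitly to an InputStreamReader. The following sketch assumes a hypothetical Big5-encoded text file named data.big5:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Big5Reader {
    public static void main(String[] args) throws IOException {
        // Big5 is looked up by name; not every JVM is guaranteed to support it.
        Charset big5 = Charset.forName("Big5");

        // "data.big5" is a placeholder for a Big5-encoded text file.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(Paths.get("data.big5")), big5))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is decoded from Big5 into Java's internal representation.
                System.out.println(line);
            }
        }
    }
}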
Character processing is supported using the primitive data type char, the Character
class, and several other classes and interfaces as summarized in the following table:
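Before turning to that table, the short sketch below (the sample string is made up) previews a few of the Character methods that are typically used when examining text character by character:

public class CharacterDemo {
    public static void main(String[] args) {
        String sample = "NLP 2.0!";   // arbitrary sample text

        for (char ch : sample.toCharArray()) {
            if (Character.isLetter(ch)) {
                System.out.println(ch + " is a letter");
            } else if (Character.isDigit(ch)) {
                System.out.println(ch + " is a digit");
            } else if (Character.isWhitespace(ch)) {
                System.out.println("whitespace");
            } else {
                System.out.println(ch + " is punctuation or a symbol");
            }
        }

        // Case conversion is available for individual char values as well as for strings.
        System.out.println(Character.toUpperCase('a'));   // prints A
    }
}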