7.1. Lexical Elements
One of the first phases of compilation is the scanning of the lexical elements into tokens. This phase ignores whitespace and comments that appear in the text, so the language must define what form whitespace and comments take. The remaining sequence of characters must then be parsed into tokens.
7.1.1. Character Set
Most programmers are familiar with source code that is prepared using one of two major families of character representations: ASCII and its variants (including Latin-1) and EBCDIC. Both character sets contain characters used in English and several other Western European languages.
The Java programming language, on the other hand, is written in a 16-bit encoding of Unicode. The Unicode standard originally supported a 16-bit character set, but has expanded to allow for up to 21-bit characters with a maximum value of 0x10ffff. The characters above the value 0x00ffff are termed the supplementary characters. Any particular 21-bit value is termed a code point.

To allow all characters to be represented by 16-bit values, Unicode defines an encoding format called UTF-16, and this is how the Java programming language represents text. In UTF-16 all the values between 0x0000 and 0xffff map directly to Unicode characters. The supplementary characters are encoded by a pair of 16-bit values: the first value in the pair comes from the high-surrogates range, and the second comes from the low-surrogates range. Methods that want to work with individual code point values can either accept a UTF-16 encoded char[] of length two, or a single int that holds the code point directly. An individual char in a UTF-16 sequence is termed a code unit.
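The distinction between code points and code units can be seen with the standard library's Character and String methods. The sketch below (class name is illustrative) uses U+1F600, a supplementary character, to show how one code point becomes a surrogate pair of two code units:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F600 lies above 0xffff, so it is a supplementary character.
        int codePoint = 0x1F600;
        assert Character.isSupplementaryCodePoint(codePoint);

        // UTF-16 encodes it as two code units: a high and a low surrogate.
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                  // 2
        System.out.println(Character.isHighSurrogate(units[0]));
        System.out.println(Character.isLowSurrogate(units[1]));

        // A String built from the pair has length 2 (code units)
        // but contains only one code point.
        String s = new String(units);
        System.out.println(s.length());                          // 2
        System.out.println(Character.codePointCount(s, 0, s.length())); // 1
        System.out.println(s.codePointAt(0) == codePoint);       // true
    }
}
```

Note that String.length() counts code units, not characters as a user would perceive them; code that must handle supplementary characters should iterate by code point.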
The first 256 characters of Unicode are the Latin-1 character set, and most of the first 128 characters of Latin-1 are equivalent to the 7-bit ASCII character set. Current environments read ASCII or Latin-1 files, converting them to Unicode on the fly.
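Because the first 256 Unicode characters coincide with Latin-1, that conversion is a direct byte-to-char mapping. A minimal sketch of the idea, decoding Latin-1 bytes into Java's Unicode String with the standard charset API (byte values chosen for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // 0x48 = 'H', 0x69 = 'i', 0xE9 = 'é' in Latin-1.
        byte[] latin1 = { 0x48, 0x69, (byte) 0xE9 };

        // Decoding as ISO-8859-1 (Latin-1) maps each byte b directly
        // to the Unicode code point with the same value.
        String s = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(s.charAt(2) == '\u00E9');   // true
        System.out.println(s.length());                // 3
    }
}
```

Reading a file the same way only requires wrapping the input stream in an InputStreamReader constructed with the appropriate charset.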