7.1. Lexical Elements
One of the first phases of compilation is the scanning of the lexical elements into tokens. This phase ignores whitespace and comments that appear in the text, so the language must define what form whitespace and comments take. The remaining sequence of characters must then be parsed into tokens.
7.1.1. Character Set
Most programmers are familiar with source code that is prepared using one of two major families of character representations: ASCII and its variants (including Latin-1) and EBCDIC. Both character sets contain characters used in English and several other Western European languages.
The Java programming language, on the other hand, is written in a 16-bit encoding of Unicode. The Unicode standard originally supported a 16-bit character set, but has expanded to allow for up to 21-bit characters with a maximum value of 0x10ffff. The characters above the value 0x00ffff are termed the supplementary characters. Any particular 21-bit value is termed a code point. To allow all characters to be represented by 16-bit values, Unicode defines an encoding format called UTF-16, and this is how the Java programming language represents text. In UTF-16 all the values between 0x0000 and 0xffff map directly to Unicode characters. The supplementary characters are encoded by a pair of 16-bit values: The first value in the pair comes from the high-surrogates range, and the second comes from the low-surrogates range. Methods that want to work with individual code point values can either accept a UTF-16 encoded char[] of length two, or a single int that holds the code point directly. An individual char in a UTF-16 sequence is termed a code unit.
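The distinction between code units and code points can be seen directly with the java.lang.Character and String APIs. The sketch below uses U+1F600 as one example of a supplementary code point; any value above 0xffff behaves the same way.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary code point (above 0xffff)

        // Character.toChars encodes a code point as UTF-16; a supplementary
        // character requires a surrogate pair of two code units.
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // A String built from that pair contains two code units (chars)
        // but only one code point.
        String s = new String(units);
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.codePointAt(0) == codePoint);   // true
    }
}
```

Note how length() counts code units while codePointCount counts code points; this is why supplementary characters can make a string's length() larger than the number of characters a user perceives.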
The first 256 characters of Unicode are the Latin-1 character set, and most of the first 128 characters of Latin-1 are equivalent to the 7-bit ASCII character set. Current environments read ASCII or Latin-1 files, converting them to Unicode on the fly. [1]