Some APIs of the Java SE platform, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java SE platform provides methods to convert between 16-bit and 32-bit representations.
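As an illustration of the conversion between the two representations, the following sketch uses methods of the Character class. The class name CodePointDemo and its helper methods are hypothetical, and the code point U+1F600 is an arbitrary choice of a value outside the 16-bit range.

```java
public class CodePointDemo {
    // Encodes a 32-bit code point as one or two UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Combines a surrogate pair back into a single 32-bit code point.
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // supplementary code point, does not fit in 16 bits
        char[] units = toUtf16(codePoint);

        System.out.println(units.length);                         // 2
        System.out.println(Character.charCount(codePoint));       // 2
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
        System.out.println(fromSurrogates(units[0], units[1]) == codePoint); // true

        String s = new String(units);
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
    }
}
```

The last two lines show why the distinction matters: the length of a String counts UTF-16 code units, while codePointCount counts code points.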
This specification uses the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion.
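Except for comments (§3.7), identifiers, and the contents of character and string literals, all input elements in a program are formed only from ASCII characters (or Unicode escapes that result in ASCII characters).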
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The
first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
3.2. Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following
three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
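3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements which, after white space and comments are discarded, comprise the tokens that are the terminal symbols of the syntactic grammar (§2.3).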
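Because Unicode escapes are translated in step 1, before any tokens are formed, an escape can appear anywhere in a program, not only inside literals. The following minimal sketch illustrates this; the class name UnicodeEscapeDemo and its members are hypothetical. Note that the escapes inside the comments are translated as well, which is harmless here because each one denotes an ordinary letter.

```java
public class UnicodeEscapeDemo {
    // The escape \u0041 denotes the letter A, so this literal holds one character.
    static String escaped() {
        return "\u0041";
    }

    // The identifier below is written as the escape \u0062, which denotes the
    // letter b. After step 1 the declaration reads: static int b = 42;
    static int \u0062 = 42;

    public static void main(String[] args) {
        System.out.println(escaped().equals("A")); // true
        System.out.println(b);                     // 42
    }
}
```

The field declared through an escape is referenced in main by its plain ASCII name, since both spellings yield the same identifier once step 1 has run.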
The longest possible translation is used at each step, even if the result does not ultimately
make a correct program while another lexical translation would.
Thus, the input characters a--b are tokenized as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
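The longest-match rule can be illustrated with a toy tokenizer. This is only a sketch under simplifying assumptions, not the algorithm a Java compiler uses: the hypothetical class LongestMatchDemo recognizes nothing but identifiers made of letters and the operators -- and -, and at each position it prefers the longest operator that matches.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatchDemo {
    // Toy tokenizer: handles only letters, white space, "--" and "-".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is discarded
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else if (input.startsWith("--", i)) {
                tokens.add("--"); // the longer operator wins over two "-" tokens
                i += 2;
            } else if (c == '-') {
                tokens.add("-");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }

    // With white space between the minus signs, the expression is a binary
    // minus applied to a unary minus, which is grammatically correct.
    static int subtractNegated(int a, int b) {
        return a - -b;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));      // [a, --, b]
        System.out.println(tokenize("a - -b"));    // [a, -, -, b]
        System.out.println(subtractNegated(5, 2)); // 7
    }
}
```

Writing the expression as a - -b is the usual way to obtain the second tokenization, because white space ends the current token before the next minus sign is read.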