Some APIs of the Java SE platform, primarily in the Character class, use 32-bit integers
to represent code points as individual entities. The Java SE platform provides methods
to convert between 16-bit and 32-bit representations.
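As an illustration of the conversion methods mentioned above, the following sketch uses Character.toChars and Character.toCodePoint from the standard library. The class name CodePointDemo, its helper methods, and the choice of U+1F600 as the sample code point are assumptions made for this example only.

```java
final class CodePointDemo {
    // Encode a 32-bit code point as one or two 16-bit UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Combine a surrogate pair back into a single 32-bit code point.
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // sample supplementary code point, above U+FFFF

        char[] units = toUtf16(codePoint);
        System.out.println(units.length); // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00

        // A String counts UTF-16 code units, not code points.
        String s = new String(units);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(s.codePointAt(0) == codePoint);   // true

        System.out.println(fromSurrogates(units[0], units[1]) == codePoint); // true
    }
}
```

A code point in the range U+0000 to U+FFFF needs only one code unit, so toUtf16(0x41) returns an array of length 1.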
This specification uses the terms code point and UTF-16 code unit where the representation
is relevant, and the generic term character where the representation is irrelevant to the
discussion.
Except for comments (§3.7), identifiers, and the contents of character and string literals
(§3.10.4, §3.10.5), all input elements (§3.5) in a program are formed only from ASCII
characters (or Unicode escapes (§3.3) which result in ASCII characters).
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The
first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
3.2. Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following
three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to
the corresponding Unicode character. A Unicode escape of the form \uxxxx, where
xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is
xxxx. This translation step allows any program to be expressed using only ASCII
characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input
characters and line terminators (§3.4).
3. A translation of the stream of input characters and line terminators resulting from
step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and
comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal
symbols of the syntactic grammar (§2.3).
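Because Unicode escapes are translated in step 1, before any tokens are formed, an escape may appear anywhere in a program, including inside an identifier. The sketch below shows this under the assumption of a standard Java compiler. The class name UnicodeEscapeDemo and its members are hypothetical names chosen for the example.

```java
final class UnicodeEscapeDemo {
    // The escape in the field name is translated to the letter 'a' before
    // tokenization, so this declares an ordinary field named "answer".
    static int \u0061nswer = 42;

    static String greeting() {
        // After step 1 this is the same string literal as "Hi".
        return "\u0048\u0069";
    }

    public static void main(String[] args) {
        System.out.println(answer);                  // 42
        System.out.println(greeting().equals("Hi")); // true
        System.out.println('\u0041' == 'A');         // true
    }
}
```

The same early translation applies inside comments, so an escape that produces a line terminator there ends a single-line comment at that point.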
The longest possible translation is used at each step, even if the result does not ultimately
make a correct program while another lexical translation would.
Thus, the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any
grammatically correct program, even though the tokenization a, -, -, b could be part of
a grammatically correct program.
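The longest-match rule can be sketched with a toy tokenizer. This is not how any particular Java compiler is implemented. The class LongestMatchDemo, the method tokenize, and the four-entry operator table are assumptions for illustration, and the sketch handles only identifiers, white space, and those operators.

```java
import java.util.ArrayList;
import java.util.List;

final class LongestMatchDemo {
    // Hypothetical, tiny operator set; the real Java token set is much larger.
    private static final String[] OPERATORS = {"--", "-", "++", "+"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            // Pick the longest operator that matches at position i.
            String best = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)
                        && (best == null || op.length() > best.length())) {
                    best = op;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("Unexpected character at index " + i);
            }
            tokens.add(best);
            i += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));  // [a, --, b]
        System.out.println(tokenize("a- -b")); // [a, -, -, b]
    }
}
```

Separating the two minus signs with white space is what lets the second tokenization arise, which is why a - -b can be part of a valid expression while a--b cannot.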