Lexical Structure - The Java Language Specification


Some APIs of the Java SE platform, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java SE platform provides methods to convert between 16-bit and 32-bit representations.
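
As an illustration of the conversion methods mentioned above, the following sketch converts one supplementary code point to its UTF-16 code units and back. The class name CodePointDemo, the constant GRINNING_FACE, and the two helper methods are hypothetical names chosen for this example; Character.toChars, Character.toCodePoint, String.codePointAt and String.codePointCount are existing Java SE methods.

```java
public class CodePointDemo {
    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
    // two 16-bit code units (a surrogate pair) to represent it.
    static final int GRINNING_FACE = 0x1F600;

    // 32-bit code point -> one or two 16-bit UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Surrogate pair of 16-bit code units -> 32-bit code point.
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(GRINNING_FACE);
        System.out.println(units.length);                               // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00

        int codePoint = fromSurrogates(units[0], units[1]);
        System.out.println(codePoint == GRINNING_FACE);                 // true

        String s = new String(units);
        System.out.println(s.length());                                 // 2 (code units)
        System.out.println(s.codePointCount(0, s.length()));            // 1 (code point)
        System.out.println(s.codePointAt(0) == GRINNING_FACE);          // true
    }
}
```

The difference between s.length() and s.codePointCount(...) is the distinction the specification draws between UTF-16 code units and code points.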

This specification uses the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion.

Except for comments (§3.7), identifiers, and the contents of character and string literals (§3.10.4, §3.10.5), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters).

ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.

3.2. Lexical Translations

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).
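
Because step 1 runs before any tokens are formed, a Unicode escape can appear anywhere in a program, not only inside literals. The sketch below illustrates this under that reading; the class name UnicodeEscapeDemo and its method names are hypothetical and exist only for this example.

```java
public class UnicodeEscapeDemo {
    // The escape in the declaration is translated to the letter a during
    // step 1, so the identifier seen by the later steps is simply a.
    static int declaredWithEscape() {
        int \u0061 = 42;
        return a;
    }

    // Inside a character literal the escape denotes one UTF-16 code unit.
    static char capitalA() {
        return '\u0041';
    }

    // Two escapes forming a surrogate pair: two code units, one code point.
    static String surrogatePair() {
        return "\uD83D\uDE00";
    }

    public static void main(String[] args) {
        System.out.println(declaredWithEscape());        // 42
        System.out.println(capitalA() == 'A');           // true
        String s = surrogatePair();
        System.out.println(s.length());                  // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

Escapes are also translated inside comments, which is why the comments in this sketch describe the escapes in words instead of spelling them out.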

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

Thus, the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
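
The longest-match rule can be sketched with a toy tokenizer. This is a minimal illustration, not the lexer of any real Java compiler: the class name LongestMatchDemo, the tokenize method, and the small OPERATORS subset are all assumptions made for this example, and only identifiers, white space, and those few operators are handled.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatchDemo {
    // Hypothetical operator subset; the real language has many more.
    private static final String[] OPERATORS = {"-", "--", "-=", "->"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++; // consume the longest possible identifier
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            // Pick the longest operator that matches at position i.
            String best = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)
                        && (best == null || op.length() > best.length())) {
                    best = op;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException(
                        "Unexpected character at index " + i);
            }
            tokens.add(best);
            i += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));   // [a, --, b]
        System.out.println(tokenize("a - -b")); // [a, -, -, b]
    }
}
```

In this sketch, only explicit white space between the two minus signs yields the a, -, -, b tokenization; without it, the tokenizer always prefers -- even though the result cannot be parsed.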
