Lexical Structure - The Java Language Specification

3.3. Unicode Escapes

A compiler for the Java programming language (“Java compiler”) first recognizes Unicode
escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits
to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other
characters unchanged. Representing supplementary characters requires two consecutive
Unicode escapes. This translation step results in a sequence of Unicode input characters.
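
As a small illustration of the two-escape rule for supplementary characters, the following sketch writes U+1F600 as a surrogate pair. The class name and the choice of code point are assumptions made for the example and do not come from the specification.

```java
public class SupplementaryEscape {
    // U+1F600 is outside the Basic Multilingual Plane, so it takes two
    // escapes: the high surrogate D83D followed by the low surrogate DE00.
    static final String FACE = "\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(FACE.length());                            // 2
        System.out.println(FACE.codePointCount(0, FACE.length()));    // 1
        System.out.println(Integer.toHexString(FACE.codePointAt(0))); // 1f600
    }
}
```

Each escape contributes one UTF-16 code unit, so the string has length 2 but holds a single code point.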

UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
    u
    UnicodeMarker u

RawInputCharacter:
    any Unicode character

HexDigit: one of
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.
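
The grammar permits more than one u in the marker and accepts hexadecimal digits in either case. A minimal sketch of both points follows; the class and constant names are hypothetical.

```java
public class UnicodeMarkerDemo {
    static final String ONE_U = "\u0041";     // one u marker
    static final String THREE_U = "\uuu0041"; // three u markers, same code unit
    static final String LOWER_HEX = "\u00e9"; // lowercase hexadecimal digit
    static final String UPPER_HEX = "\u00E9"; // uppercase hexadecimal digit

    public static void main(String[] args) {
        System.out.println(ONE_U.equals("A"));           // true
        System.out.println(THREE_U.equals(ONE_U));       // true
        System.out.println(LOWER_HEX.equals(UPPER_HEX)); // true
    }
}
```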

In addition to the processing implied by the grammar, for each raw input character that is a
backslash \, input processing must consider how many other \ characters contiguously precede
it, separating it from a non-\ character or the start of the input stream. If this number
is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is
not eligible to begin a Unicode escape.

For example, the raw input "\\u2126=\u2126" results in the eleven characters " \ \ u 2 1 2 6
= Ω " (\u2126 is the Unicode encoding of the character Ω).
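
The same input can be tried as a string literal. This is only a sketch, with hypothetical class and field names. The eleven characters counted above include the two quotation marks and both backslashes. When the compiler later processes the string literal's own escape sequences, a step outside this section, the pair of backslashes collapses to one, so the resulting string holds eight characters.

```java
public class BackslashParity {
    // Source text between the quotes: two backslashes, u2126, =, then an escape.
    // The second backslash has one backslash before it (odd), so it cannot
    // begin an escape; the backslash after = has none (even), so it can.
    static final String S = "\\u2126=\u2126";

    public static void main(String[] args) {
        System.out.println(S.length());            // 8
        System.out.println(S.charAt(0) == 92);     // true: a single backslash
        System.out.println(S.charAt(7) == 0x2126); // true: the ohm sign
    }
}
```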

If an eligible \ is not followed by u, then it is treated as a RawInputCharacter and remains
part of the escaped Unicode stream.

If an eligible \ is followed by u, or more than one u, and the last u is not followed by four
hexadecimal digits, then a compile-time error occurs.

The character produced by a Unicode escape does not participate in further Unicode escapes.
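
The rules of this section can be gathered into one translation routine. The following is a minimal sketch under stated assumptions, not the way any particular compiler implements the step. The names UnicodeEscapeSketch, translate, and hexValue are hypothetical, and an IllegalArgumentException stands in for the compile-time error.

```java
public class UnicodeEscapeSketch {

    /** Value of an ASCII hexadecimal digit, or -1 if c is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Translates Unicode escapes in raw input (hypothetical helper, not a JDK API). */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous raw backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean eligible = c == '\\' && run % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // skip one or more u markers
                }
                int value = 0;
                for (int k = 0; k < 4; k++) {
                    int digit = (j + k < raw.length()) ? hexValue(raw.charAt(j + k)) : -1;
                    if (digit < 0) {
                        // Assumption: an exception models the compile-time error.
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value); // emitted as-is and never rescanned
                i = j + 4;
                run = 0; // the preceding raw character is now a hexadecimal digit
            } else {
                out.append(c); // raw character, including an ineligible backslash
                run = (c == '\\') ? run + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bs = "\\"; // a single backslash, kept apart so this source has no stray escapes
        String quoted = "\"" + bs + bs + "u2126=" + bs + "u2126\"";
        System.out.println(translate(quoted).length());             // 11
        System.out.println(translate(bs + "uuu0041").equals("A"));  // true
        System.out.println(translate(bs + "u005cu005a").length());  // 6
        try {
            translate(bs + "u12G4");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The third call shows the last rule: the escape for 005c yields a backslash, but that backslash does not combine with the following u005a, so the result is the six characters \ u 0 0 5 a.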
