        assertThat(s.charAt(2), is('\u6771'));
        assertThat(s.charAt(3), is('\uD801'));
        assertThat(s.charAt(4), is('\uDC00'));
        assertThat(s.codePointAt(0), is(0x0041));
        assertThat(s.codePointAt(1), is(0x00DF));
        assertThat(s.codePointAt(2), is(0x6771));
        assertThat(s.codePointAt(3), is(0x10400));
    }

    @Test
    public void text() {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        assertThat(t.getLength(), is(10));
        assertThat(t.find("\u0041"), is(0));
        assertThat(t.find("\u00DF"), is(1));
        assertThat(t.find("\u6771"), is(3));
        assertThat(t.find("\uD801\uDC00"), is(6));
        assertThat(t.charAt(0), is(0x0041));
        assertThat(t.charAt(1), is(0x00DF));
        assertThat(t.charAt(3), is(0x6771));
        assertThat(t.charAt(6), is(0x10400));
    }
}
The test confirms that the length of a String is the number of char code units it
contains (five, made up of one from each of the first three characters in the string and a
surrogate pair from the last), whereas the length of a Text object is the number of bytes in
its UTF-8 encoding (10 = 1+2+3+4). Similarly, the indexOf() method in String
returns an index in char code units, and find() for Text returns a byte offset.
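The two ways of counting can be reproduced with the JDK alone. The following is a minimal sketch, not code from the test above: the class StringVersusUtf8 and the helpers utf8Length() and utf8Offset() are hypothetical names introduced here, and they only imitate the byte-based view that Text exposes rather than calling Text itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: compares char-based and UTF-8 byte-based measurements.
class StringVersusUtf8 {

    // Number of bytes in the UTF-8 encoding of s, the quantity a byte-based length reports.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte offset, in the UTF-8 encoding, of the character that starts at charIndex.
    // It is computed by encoding the prefix that precedes the character.
    // Assumption: charIndex does not fall in the middle of a surrogate pair.
    static int utf8Offset(String s, int charIndex) {
        return utf8Length(s.substring(0, charIndex));
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";

        System.out.println("char code units: " + s.length());
        System.out.println("code points:     " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes:     " + utf8Length(s));

        // Char indexes at which each of the four characters starts.
        for (int charIndex : new int[] {0, 1, 2, 3}) {
            System.out.println("char index " + charIndex
                + " -> byte offset " + utf8Offset(s, charIndex));
        }
    }
}
```

For the test string, the char indexes 0, 1, 2, and 3 correspond to the byte offsets 0, 1, 3, and 6, which are the values find() returns in the test.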
The charAt() method in String returns the char code unit for the given index,
which in the case of a surrogate pair will not represent a whole Unicode character. The
codePointAt() method, indexed by char code unit, is needed to retrieve a single
Unicode character represented as an int. In fact, the charAt() method in Text is
more like the codePointAt() method than its namesake in String. The only
difference is that it is indexed by byte offset.
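To make the comparison concrete, here is a hedged sketch of a lookup that is indexed by byte offset but returns a whole code point, which is the behavior described for charAt() in Text. It is not Text's implementation: CodePointAtByteOffset, sequenceLength(), and codePointAtByte() are hypothetical names, and the code assumes well-formed UTF-8 and an offset that points at the first byte of a character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: reads one code point from UTF-8 bytes at a byte offset.
class CodePointAtByteOffset {

    // Length in bytes of the UTF-8 sequence introduced by the given lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx: single-byte (ASCII)
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }

    // Decodes the single character that starts at the given byte offset.
    static int codePointAtByte(byte[] utf8, int offset) {
        int length = sequenceLength(utf8[offset]);
        return new String(utf8, offset, length, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // String.charAt() sees only half of the surrogate pair.
        System.out.println(Integer.toHexString(s.charAt(3)));
        // String.codePointAt() combines the pair when indexed at its first half.
        System.out.println(Integer.toHexString(s.codePointAt(3)));
        // The byte-offset lookup also yields the whole code point.
        System.out.println(Integer.toHexString(codePointAtByte(utf8, 6)));
    }
}
```

Note that codePointAt() on a String returns only the low surrogate when it is given the index of the second half of a pair (index 4 here), so the caller is still responsible for landing on a character boundary in both schemes.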
Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for
indexing, since you can't just increment the index. The idiom for iteration is a little ob-