assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));
Unicode
When we start using characters that are encoded with more than a single byte, the
differences between Text and String become clear. Consider the Unicode characters
shown in Table 5-8. [45]
Table 5-8. Unicode characters

Unicode code point   Name                               UTF-8 code units   Java representation
U+0041               LATIN CAPITAL LETTER A             41                 \u0041
U+00DF               LATIN SMALL LETTER SHARP S         c3 9f              \u00DF
U+6771               N/A (a unified Han ideograph)      e6 9d b1           \u6771
U+10400              DESERET CAPITAL LETTER LONG I      f0 90 90 80        \uD801\uDC00
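The UTF-8 column of the table can be reproduced with the standard library alone. The following is a minimal sketch, not part of the original example: the class name Utf8TableCheck and its helper methods are hypothetical, and only java.lang and java.nio.charset calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not from the book) that recomputes the
// "UTF-8 code units" column of the table for a given code point.
public class Utf8TableCheck {

    /** Returns the UTF-8 bytes of a single code point as space-separated hex. */
    static String utf8Hex(int codePoint) {
        // Character.toChars yields one char, or two for a supplementary character.
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Number of Java chars (UTF-16 code units) needed for the code point. */
    static int javaChars(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  utf8=[%s]  chars=%d%n",
                    cp, utf8Hex(cp), javaChars(cp));
        }
    }
}
```

Using StandardCharsets.UTF_8 rather than the charset name "UTF-8" avoids the checked UnsupportedEncodingException; that is a design choice of this sketch only.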
All but the last character in the table, U+10400, can be expressed using a single Java
char. U+10400 is a supplementary character and is represented by two Java chars,
known as a surrogate pair. The tests in Example 5-5 show the differences between
String and Text when processing a string of the four characters from Table 5-8.
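Before turning to those tests, the surrogate-pair arithmetic described above can be sketched with standard java.lang.Character and String methods. This is an illustrative aside under the assumption that only the JDK is available; the class SurrogatePairSketch and its method names are hypothetical and do not come from the original example.

```java
// Hypothetical sketch (not from the book): how U+10400 maps to two Java chars,
// and how char counts differ from code point counts for the four-character string.
public class SurrogatePairSketch {

    // The four characters from the table; the last one is a surrogate pair.
    static final String S = "\u0041\u00DF\u6771\uD801\uDC00";

    /** Length in Java chars (UTF-16 code units). */
    static int charLength() {
        return S.length();
    }

    /** Length in Unicode code points. */
    static int codePointLength() {
        return S.codePointCount(0, S.length());
    }

    /** Code point that starts at char index 3, combining both surrogates. */
    static int codePointAtIndex3() {
        return S.codePointAt(3);
    }

    /** Splits a supplementary code point into its high and low surrogates. */
    static char[] surrogates(int codePoint) {
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    public static void main(String[] args) {
        char[] pair = surrogates(0x10400);
        System.out.printf("high=%04X low=%04X%n", (int) pair[0], (int) pair[1]);
        System.out.println("chars=" + charLength()
                + " codePoints=" + codePointLength());
        System.out.printf("code point at index 3 = U+%X%n", codePointAtIndex3());
    }
}
```

The point of the sketch is that String indexes by char, so the string holds five chars but only four code points.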
Example 5-5. Tests showing the differences between the String and Text classes
public class StringTextComparisonTest {

  @Test
  public void string() throws UnsupportedEncodingException {
    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    assertThat(s.length(), is(5));
    assertThat(s.getBytes("UTF-8").length, is(10));
    assertThat(s.indexOf("\u0041"), is(0));
    assertThat(s.indexOf("\u00DF"), is(1));
    assertThat(s.indexOf("\u6771"), is(2));
    assertThat(s.indexOf("\uD801\uDC00"), is(3));
    assertThat(s.charAt(0), is('\u0041'));
    assertThat(s.charAt(1), is('\u00DF'));