    assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
    assertThat("No match", t.find("pig"), is(-1));
Unicode
When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 5-8.
Table 5-8. Unicode characters

Unicode code point  | U+0041                 | U+00DF                     | U+6771                        | U+10400
Name                | LATIN CAPITAL LETTER A | LATIN SMALL LETTER SHARP S | N/A (a unified Han ideograph) | DESERET CAPITAL LETTER LONG I
UTF-8 code units    | 41                     | c3 9f                      | e6 9d b1                      | f0 90 90 80
Java representation | \u0041                 | \u00DF                     | \u6771                        | \uD801\uDC00
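The UTF-8 byte counts and Java char counts in Table 5-8 can be checked with the standard library alone. The following is a minimal sketch, not code from the book: the class name Utf8Widths and its two helper methods are illustrative names chosen here.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): reports how many UTF-8 bytes and
// how many Java chars each code point from Table 5-8 occupies.
class Utf8Widths {

  // Number of bytes needed to encode a single code point in UTF-8.
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Number of Java chars (UTF-16 code units) needed for a code point.
  static int charCount(int codePoint) {
    return Character.charCount(codePoint);
  }

  public static void main(String[] args) {
    int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
    for (int cp : codePoints) {
      System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
          cp, utf8Length(cp), charCount(cp));
    }
  }
}
```

The four code points need 1, 2, 3, and 4 UTF-8 bytes respectively, which is where the total of 10 bytes for the combined string comes from.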
All but the last character in the table, U+10400, can be expressed using a single Java char. U+10400 is a supplementary character and is represented by two Java chars, known as a surrogate pair. The tests in Example 5-5 show the differences between String and Text.
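To make the surrogate pair concrete, the sketch below splits U+10400 into its two Java chars using only java.lang.Character. The class name SurrogatePairDemo and the helper surrogates are hypothetical names introduced for illustration.

```java
// Illustrative sketch (hypothetical class name): shows how a supplementary
// code point is represented as a high/low surrogate pair in a Java String.
class SurrogatePairDemo {

  // Returns the UTF-16 code units for a code point: one char for code points
  // in the Basic Multilingual Plane, two (a surrogate pair) otherwise.
  static char[] surrogates(int codePoint) {
    return Character.toChars(codePoint);
  }

  public static void main(String[] args) {
    char[] pair = surrogates(0x10400);
    String s = new String(pair);

    // length() counts chars, codePointCount() counts Unicode characters.
    System.out.println("chars:       " + s.length());
    System.out.println("code points: " + s.codePointCount(0, s.length()));

    System.out.printf("high surrogate: \\u%04X (%b)%n",
        (int) pair[0], Character.isHighSurrogate(pair[0]));
    System.out.printf("low surrogate:  \\u%04X (%b)%n",
        (int) pair[1], Character.isLowSurrogate(pair[1]));
  }
}
```

Because String.length() and String.charAt() work in chars rather than code points, a string holding the four characters from Table 5-8 reports a length of 5.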
Example 5-5. Tests showing the differences between the String and Text classes
public class StringTextComparisonTest {

  @Test
  public void string() throws UnsupportedEncodingException {

    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    assertThat(s.length(), is(5));
    assertThat(s.getBytes("UTF-8").length, is(10));

    assertThat(s.indexOf("\u0041"), is(0));
    assertThat(s.indexOf("\u00DF"), is(1));
    assertThat(s.indexOf("\u6771"), is(2));
    assertThat(s.indexOf("\uD801\uDC00"), is(3));

    assertThat(s.charAt(0), is('\u0041'));
    assertThat(s.charAt(1), is('\u00DF'));