    assertThat(s.charAt(2), is('\u6771'));
    assertThat(s.charAt(3), is('\uD801'));
    assertThat(s.charAt(4), is('\uDC00'));

    assertThat(s.codePointAt(0), is(0x0041));
    assertThat(s.codePointAt(1), is(0x00DF));
    assertThat(s.codePointAt(2), is(0x6771));
    assertThat(s.codePointAt(3), is(0x10400));
  }
  @Test
  public void text() {
    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    assertThat(t.getLength(), is(10));

    assertThat(t.find("\u0041"), is(0));
    assertThat(t.find("\u00DF"), is(1));
    assertThat(t.find("\u6771"), is(3));
    assertThat(t.find("\uD801\uDC00"), is(6));

    assertThat(t.charAt(0), is(0x0041));
    assertThat(t.charAt(1), is(0x00DF));
    assertThat(t.charAt(3), is(0x6771));
    assertThat(t.charAt(6), is(0x10400));
  }
}
The test confirms that the length of a String is the number of char code units it contains (five: one for each of the first three characters in the string and two for the surrogate pair that encodes the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1+2+3+4). Similarly, the indexOf() method in String returns an index in char code units, and find() for Text returns a byte offset.
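The byte counts above can be checked with the JDK alone. The following is a minimal sketch that does not use Hadoop's Text class; the class Utf8Lengths and its helper methods are hypothetical names introduced here for illustration. It encodes the same test string as UTF-8 and derives the byte offsets that correspond to the String indexes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; uses only the JDK, not Hadoop's Text.
class Utf8Lengths {
  // Number of bytes in the UTF-8 encoding of s.
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Byte offset in the UTF-8 encoding of s that corresponds to the given
  // char index. Assumes charIndex does not fall inside a surrogate pair.
  static int utf8Offset(String s, int charIndex) {
    return utf8Length(s.substring(0, charIndex));
  }

  public static void main(String[] args) {
    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    System.out.println(s.length());    // 5 char code units
    System.out.println(utf8Length(s)); // 10 bytes (1 + 2 + 3 + 4)
    // indexOf() counts char code units; the byte offset is larger
    // once multibyte characters precede the match.
    int charIndex = s.indexOf("\uD801\uDC00");
    System.out.println(charIndex);                // 3
    System.out.println(utf8Offset(s, charIndex)); // 6
  }
}
```

The offsets 0, 1, 3, and 6 produced this way match the values that the test expects from find().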
The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character. The codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like the codePointAt() method than its namesake in String. The only difference is that it is indexed by byte offset.
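To make the comparison concrete, here is a rough sketch of a code point lookup indexed by byte offset, written against a plain UTF-8 byte array rather than Text. The class CodePointLookup and its methods are hypothetical and are not Hadoop's implementation; the sketch assumes well-formed UTF-8 and an offset that points at the first byte of a character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not Hadoop's implementation of Text.charAt().
class CodePointLookup {
  // Length in bytes of the UTF-8 sequence that starts with lead byte b.
  // Assumes b really is a lead byte of well-formed UTF-8.
  static int sequenceLength(byte b) {
    int lead = b & 0xFF;
    if (lead < 0x80) return 1; // ASCII
    if (lead < 0xE0) return 2; // 110xxxxx
    if (lead < 0xF0) return 3; // 1110xxxx
    return 4;                  // 11110xxx
  }

  // Code point of the character whose encoding starts at the given byte offset.
  static int codePointAtByteOffset(byte[] utf8, int offset) {
    int len = sequenceLength(utf8[offset]);
    return new String(utf8, offset, len, StandardCharsets.UTF_8).codePointAt(0);
  }

  public static void main(String[] args) {
    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    // String.charAt() yields half of the surrogate pair...
    System.out.println(Integer.toHexString(s.charAt(3)));      // d801
    // ...whereas codePointAt() yields the whole character.
    System.out.println(Integer.toHexString(s.codePointAt(3))); // 10400
    // The same character, looked up by byte offset instead of char index.
    System.out.println(Integer.toHexString(codePointAtByteOffset(utf8, 6))); // 10400
  }
}
```

Char index 3 in the String and byte offset 6 in the UTF-8 encoding identify the same character, U+10400.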
Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for
indexing, since you can't just increment the index. The idiom for iteration is a little ob-