Hadoop I/O - Hadoop: The Definitive Guide

    assertThat(s.charAt(2), is('\u6771'));
    assertThat(s.charAt(3), is('\uD801'));
    assertThat(s.charAt(4), is('\uDC00'));

    assertThat(s.codePointAt(0), is(0x0041));
    assertThat(s.codePointAt(1), is(0x00DF));
    assertThat(s.codePointAt(2), is(0x6771));
    assertThat(s.codePointAt(3), is(0x10400));
  }

  @Test
  public void text() {
    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    assertThat(t.getLength(), is(10));

    assertThat(t.find("\u0041"), is(0));
    assertThat(t.find("\u00DF"), is(1));
    assertThat(t.find("\u6771"), is(3));
    assertThat(t.find("\uD801\uDC00"), is(6));

    assertThat(t.charAt(0), is(0x0041));
    assertThat(t.charAt(1), is(0x00DF));
    assertThat(t.charAt(3), is(0x6771));
    assertThat(t.charAt(6), is(0x10400));
  }

The test confirms that the length of a String is the number of char code units it contains (five, made up of one from each of the first three characters in the string and a surrogate pair from the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1+2+3+4). Similarly, the indexOf() method in String returns an index in char code units, and find() for Text returns a byte offset.
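
The arithmetic above can be checked with the JDK alone. The following is a minimal sketch, not Hadoop's implementation: the class Utf8Lengths and its helper methods are hypothetical names introduced for illustration, and the UTF-8 byte array stands in for the bytes that a Text object would hold.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; it uses only the JDK and does not depend on Hadoop.
class Utf8Lengths {

    // Number of UTF-16 char code units, which is what String.length() reports.
    public static int charUnits(String s) {
        return s.length();
    }

    // Number of bytes in the UTF-8 encoding of s (assumed here to mirror
    // the length that the text describes for a Text object).
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte offset of the first occurrence of needle within the UTF-8
    // encoding of s, or -1 if needle does not occur.
    public static int utf8ByteOffsetOf(String s, String needle) {
        int charIndex = s.indexOf(needle);
        if (charIndex < 0) {
            return -1;
        }
        // Encode everything before the match and count its bytes.
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(charUnits(s));                         // 5
        System.out.println(utf8Bytes(s));                         // 10
        System.out.println(s.indexOf("\uD801\uDC00"));            // 3
        System.out.println(utf8ByteOffsetOf(s, "\uD801\uDC00"));  // 6
    }
}
```

The last two lines show the same character located at char index 3 but byte offset 6, which is the distinction between the two indexing schemes that the test exercises.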

The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character. The codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like the codePointAt() method than its namesake in String. The only difference is that it is indexed by byte offset.
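
To make the contrast concrete, here is a rough JDK-only sketch of a lookup that is indexed by byte offset and returns a whole code point. The class CodePointAtByteOffset and the method codePointAtByte() are hypothetical names chosen for illustration; this is a stand-in for the behavior described above and should not be read as how Text is implemented.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; it uses only the JDK and does not depend on Hadoop.
class CodePointAtByteOffset {

    // Decode the Unicode code point whose UTF-8 encoding starts at byteOffset.
    public static int codePointAtByte(byte[] utf8, int byteOffset) {
        int lead = utf8[byteOffset] & 0xFF;
        int length;
        if (lead < 0x80) {
            length = 1;                       // single-byte (ASCII) sequence
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2;                       // lead byte 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;                       // lead byte 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4;                       // lead byte 11110xxx
        } else {
            throw new IllegalArgumentException(
                "offset " + byteOffset + " is not the start of a UTF-8 sequence");
        }
        // Let the JDK decode the sequence, then read it back as one code point.
        return new String(utf8, byteOffset, length, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // String.charAt() yields only the high surrogate of the last character.
        System.out.println(Integer.toHexString(s.charAt(3)));               // d801
        // String.codePointAt() yields the whole character, indexed by char code unit.
        System.out.println(Integer.toHexString(s.codePointAt(3)));          // 10400
        // The byte-offset lookup yields the same character at offset 6.
        System.out.println(Integer.toHexString(codePointAtByte(utf8, 6)));  // 10400
    }
}
```

Offsets that fall in the middle of a multibyte sequence are rejected in this sketch; that choice is an assumption made here for clarity, not a statement about what Text does in that case.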

Iteration

Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index. The idiom for iteration is a little ob-
