Hadoop I/O - Hadoop: The Definitive Guide


assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));

Unicode

When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 5-8. [45]

Table 5-8. Unicode characters

Unicode code point   Name                            UTF-8 code units   Java representation
U+0041               LATIN CAPITAL LETTER A          41                 \u0041
U+00DF               LATIN SMALL LETTER SHARP S      c3 9f              \u00DF
U+6771               N/A (a unified Han ideograph)   e6 9d b1           \u6771
U+10400              DESERET CAPITAL LETTER LONG I   f0 90 90 80        \uD801\uDC00
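The UTF-8 column of Table 5-8 can be reproduced with the JDK alone. The following is a minimal sketch, not part of the book's examples: the class name Utf8CodeUnits and the helper utf8Hex are hypothetical names chosen for illustration. It encodes each code point and prints its bytes in hexadecimal, along with the number of Java chars needed to hold it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not from the book) that prints the UTF-8
// encoding of the code points listed in Table 5-8.
public class Utf8CodeUnits {

    // Encodes a single code point as UTF-8 and returns the bytes as
    // space-separated lowercase hex, e.g. "c3 9f".
    public static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to an unsigned value so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X -> %s (%d Java char(s))%n",
                cp, utf8Hex(cp), Character.charCount(cp));
        }
    }
}
```

Character.charCount returns 2 only for U+10400, the one code point in the table that lies outside the Basic Multilingual Plane.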

All but the last character in the table, U+10400, can be expressed using a single Java char. U+10400 is a supplementary character and is represented by two Java chars, known as a surrogate pair. The tests in Example 5-5 show the differences between String and Text when processing a string of the four characters from Table 5-8.
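To make the surrogate pair concrete, here is a small sketch that uses only java.lang.String and does not depend on Hadoop's Text class. The class name SurrogatePairSketch and its helper methods are hypothetical and exist only for this illustration. It contrasts three ways of measuring the same four-character string: Java char units, Unicode code points, and UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class (not from the book): three different
// "lengths" of a string that contains one supplementary character.
public class SurrogatePairSketch {

    // The four characters from Table 5-8; the last is a surrogate pair.
    public static final String SAMPLE = "\u0041\u00DF\u6771\uD801\uDC00";

    // Number of UTF-16 code units (Java chars).
    public static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("char units:  " + charUnits(SAMPLE));
        System.out.println("code points: " + codePoints(SAMPLE));
        System.out.println("UTF-8 bytes: " + utf8Bytes(SAMPLE));

        // Index 3 holds only the high half of the pair; codePointAt
        // combines it with the low half that follows.
        System.out.println("high surrogate at 3: "
            + Character.isHighSurrogate(SAMPLE.charAt(3)));
        System.out.printf("code point at 3: U+%X%n", SAMPLE.codePointAt(3));
    }
}
```

The char count (5) exceeds the code point count (4) because String indexes by UTF-16 code unit, while the UTF-8 length (10) is the sum of the per-character byte counts in Table 5-8.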

Example 5-5. Tests showing the differences between the String and Text classes

public class StringTextComparisonTest {

  @Test
  public void string() throws UnsupportedEncodingException {

    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    assertThat(s.length(), is(5));
    assertThat(s.getBytes("UTF-8").length, is(10));

    assertThat(s.indexOf("\u0041"), is(0));
    assertThat(s.indexOf("\u00DF"), is(1));
    assertThat(s.indexOf("\u6771"), is(2));
    assertThat(s.indexOf("\uD801\uDC00"), is(3));

    assertThat(s.charAt(0), is('\u0041'));
    assertThat(s.charAt(1), is('\u00DF'));
