Figure 5-1. Writable class hierarchy
How do you choose between a fixed-length and a variable-length encoding? Fixed-length
encodings are good when the distribution of values is fairly uniform across the whole
value space, such as when using a (well-designed) hash function. Most numeric variables
tend to have nonuniform distributions, though, and on average, the variable-length
encoding will save space. Another advantage of variable-length encodings is that you can
switch from VIntWritable to VLongWritable, because their encodings are actually
the same. So, by choosing a variable-length representation, you have room to grow
without committing to an 8-byte long representation from the beginning.
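The space trade-off can be made concrete with a minimal sketch. It assumes a zero-compressed layout in which values from -112 to 127 take a single byte and anything else takes one marker byte plus the fewest bytes that hold the magnitude. The class VarLenSize and the method encodedSize() are hypothetical names for illustration only; the code uses just the Java standard library and is not the Hadoop implementation.

```java
class VarLenSize {
    /** Bytes needed for v under the assumed zero-compressed layout. */
    static int encodedSize(long v) {
        if (v >= -112 && v <= 127) {
            return 1; // small values fit in a single byte
        }
        if (v < 0) {
            v = ~v; // one's complement, so the magnitude is non-negative
        }
        // Number of significant bits in the magnitude.
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(v);
        // Payload bytes (rounded up) plus one marker byte.
        return (dataBits + 7) / 8 + 1;
    }

    public static void main(String[] args) {
        long[] samples = {0, 127, 128, 65_535, 1L << 31, Long.MAX_VALUE, -113};
        for (long v : samples) {
            System.out.println(v + " -> " + encodedSize(v) + " byte(s)");
        }
    }
}
```

Under these assumptions, small values cost one byte instead of four or eight, while the largest values cost one byte more than their fixed-length form. The same function serves both int and long inputs, which is the sense in which a single variable-length scheme leaves room to grow.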
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent
of java.lang.String.
The Text class uses an int (with a variable-length encoding) to store the number of
bytes in the string encoding, so the maximum size of a Text value is 2 GB. Furthermore,
Text uses standard UTF-8, which makes it potentially easier to interoperate with other
tools that understand UTF-8.
Indexing
Because of its emphasis on using standard UTF-8, there are some differences between
Text and the Java String class. Indexing for the Text class is in terms of position in
the encoded byte sequence, not the Unicode character in the string or the Java char code
unit (as it is for String). For ASCII strings, these three concepts of index position
coincide. Here is an example to demonstrate the use of the charAt() method:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));
assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
Notice that charAt() returns an int representing a Unicode code point, unlike the
String variant that returns a char. Text also has a find() method, which is
analogous to String's indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
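The examples above use an ASCII string, where byte offsets, char indexes, and code point indexes all agree. The following sketch shows how the three diverge once a string contains multibyte characters. It uses only java.lang.String and the standard UTF-8 charset; the class IndexUnits and its helper methods are hypothetical names, and the byte offset is computed by hand rather than by calling Text. The helper utf8Offset() assumes the char index does not fall in the middle of a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

class IndexUnits {
    /** Length of s in UTF-8 bytes. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** UTF-8 byte offset of the char index (assumed to be on a code point boundary). */
    static int utf8Offset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Code point index corresponding to the char index. */
    static int codePointIndex(String s, int charIndex) {
        return s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        // U+00DF (2 bytes), U+6771 (3 bytes), U+10400 (4 bytes, 2 chars), 'x' (1 byte)
        String s = "\u00DF\u6771\uD801\uDC00x";
        int charIndex = s.indexOf("x");

        System.out.println("char length:       " + s.length());
        System.out.println("code point length: " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 byte length: " + utf8Length(s));
        System.out.println("'x' char index:       " + charIndex);
        System.out.println("'x' code point index: " + codePointIndex(s, charIndex));
        System.out.println("'x' UTF-8 offset:     " + utf8Offset(s, charIndex));
    }
}
```

For this string the same character 'x' sits at char index 4, code point index 3, and UTF-8 byte offset 9, so an index obtained from a byte-oriented class cannot be used directly with String methods.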