Figure 5-1. Writable class hierarchy
How do you choose between a fixed-length and a variable-length encoding? Fixed-length encodings are good when the distribution of values is fairly uniform across the whole value space, such as when using a (well-designed) hash function. Most numeric variables tend to have nonuniform distributions, though, and on average, the variable-length encoding will save space. Another advantage of variable-length encodings is that you can switch from VIntWritable to VLongWritable, because their encodings are actually the same. So, by choosing a variable-length representation, you have room to grow without committing to an 8-byte long representation from the beginning.
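To make the space argument concrete, here is a minimal sketch of a variable-length integer encoding. It is a generic base-128 varint for non-negative values, chosen for brevity; it is not Hadoop's actual wire format, and the class name VarintSketch and its encode() method are hypothetical. It illustrates the two points above: small values take fewer bytes, and a value encodes to the same bytes whether it started life as an int or a long.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative only: a generic base-128 varint, NOT the encoding used by
// VIntWritable/VLongWritable. Names here are hypothetical.
class VarintSketch {
    // Encodes a non-negative value using 7 payload bits per byte.
    // A set high bit means "more bytes follow".
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value >= 0x80) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int asInt = 163;
        long asLong = 163L;
        System.out.println(encode(asInt).length);                          // 2
        System.out.println(Arrays.equals(encode(asInt), encode(asLong)));  // true
        System.out.println(encode(Long.MAX_VALUE).length);                 // 9
    }
}
```

A fixed-length long would cost 8 bytes for every value, whereas this sketch spends 1 or 2 bytes on small values and 9 only on the very largest, which is the trade-off described above.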
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String.
The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum length of a Text value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
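One way to see why "standard" matters is to compare it with the modified UTF-8 that the JDK's own DataOutputStream.writeUTF() produces. The sketch below uses only JDK classes, not Text itself; the class name Utf8Comparison and its helper methods are hypothetical. It assumes nothing about Hadoop beyond the statement above that Text uses standard UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: contrasts standard UTF-8 with the JDK's modified UTF-8.
class Utf8Comparison {
    // Bytes produced by the JDK's standard UTF-8 encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes produced by writeUTF(): a 2-byte length prefix followed by the
    // string in modified UTF-8.
    static byte[] modifiedUtf8WithPrefix(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String supplementary = "\uD801\uDC00"; // one code point, U+10400
        System.out.println(standardUtf8(supplementary).length);           // 4
        System.out.println(modifiedUtf8WithPrefix(supplementary).length); // 8
    }
}
```

Modified UTF-8 encodes each half of the surrogate pair separately (3 bytes each) and adds a 2-byte length prefix, so a tool expecting standard UTF-8 would not read those bytes back as U+10400. The 2-byte prefix also caps writeUTF() at 65,535 encoded bytes.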
Indexing
Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string or the Java char code unit (as it is for String). For ASCII strings, these three concepts of index position coincide. Here is an example to demonstrate the use of the charAt() method:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));
assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
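The three concepts of index position only diverge once a string leaves ASCII. The following sketch uses plain JDK calls rather than Text to count the same string in UTF-8 bytes, Java char code units, and Unicode code points; the class name IndexUnits, its helper methods, and the sample string are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: measures one string in the three index units.
class IndexUnits {
    // Length in encoded UTF-8 bytes (the unit the text says Text indexes by).
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Java char code units (the unit String indexes by).
    static int charLength(String s) {
        return s.length();
    }

    // Length in Unicode code points.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Byte offset at which the code point with the given index starts.
    static int byteOffsetOfCodePoint(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041 (1 byte), U+00DF (2 bytes), U+6771 (3 bytes), U+10400 (4 bytes)
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(utf8ByteLength(s));           // 10
        System.out.println(charLength(s));               // 5
        System.out.println(codePointLength(s));          // 4
        System.out.println(byteOffsetOfCodePoint(s, 3)); // 6
    }
}
```

For "hadoop" all three lengths are 6, which is why the ASCII example cannot show the difference.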
Notice that Text's charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, which is analogous to String's indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
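Because Text positions are byte offsets, a search result can differ from String's indexOf() as soon as a multibyte character precedes the match. The sketch below is not the implementation of Text.find(); it is a naive byte-level search over UTF-8 bytes, with the hypothetical names ByteFindSketch and findBytes(), meant only to show how the two kinds of offset relate.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a naive search over UTF-8 bytes, not Text.find() itself.
class ByteFindSketch {
    // Returns the byte offset of the first occurrence of needle within the
    // UTF-8 encoding of haystack, or -1 if it does not occur.
    static int findBytes(String haystack, String needle) {
        byte[] h = haystack.getBytes(StandardCharsets.UTF_8);
        byte[] n = needle.getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start + n.length <= h.length; start++) {
            int matched = 0;
            while (matched < n.length && h[start + matched] == n[matched]) {
                matched++;
            }
            if (matched == n.length) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findBytes("hadoop", "do"));       // 2
        System.out.println("\u00DFhadoop".indexOf("do"));    // 3
        System.out.println(findBytes("\u00DFhadoop", "do")); // 4
    }
}
```

The leading U+00DF occupies one char but two UTF-8 bytes, so the byte offset is one greater than the char index. For pure ASCII input such as "hadoop", both searches agree.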