a special entry to mark the sync point every few records as a sequence file is being
written. Such entries are small enough to incur only a modest storage overhead, less than
1%. Sync points always align with record boundaries.
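The placement rule described above can be sketched as a small simulation. This is only a rough model, not Hadoop's writer: the class name ToySyncWriter, the 20-byte sync entry size, and the 2000-byte minimum gap between sync entries are all assumptions made for illustration. The point it shows is that a sync entry is only ever emitted between records, so every sync point falls on a record boundary, and that the overhead stays under 1% with these assumed sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a writer that inserts sync entries; NOT Hadoop's SequenceFile.Writer.
class ToySyncWriter {
    static final int SYNC_ENTRY_BYTES = 20; // assumed size of one sync entry
    static final int SYNC_INTERVAL = 2000;  // assumed minimum gap between sync entries

    long position = 0;   // total bytes written so far
    long lastSync = 0;   // position of the most recent sync entry (0 = none yet)
    long syncBytes = 0;  // bytes spent on sync entries
    final List<Long> syncPoints = new ArrayList<>();

    // Appends one record of the given size, preceded by a sync entry if one is due.
    void append(int recordBytes) {
        if (position - lastSync >= SYNC_INTERVAL) {
            // We are between two records here, so the sync point is a record boundary.
            syncPoints.add(position);
            lastSync = position;
            position += SYNC_ENTRY_BYTES;
            syncBytes += SYNC_ENTRY_BYTES;
        }
        position += recordBytes;
    }

    // Share of the file taken up by sync entries, as a percentage.
    double overheadPercent() {
        return 100.0 * syncBytes / position;
    }

    public static void main(String[] args) {
        ToySyncWriter writer = new ToySyncWriter();
        for (int i = 0; i < 1000; i++) {
            writer.append(46); // hypothetical fixed record size
        }
        System.out.println("first sync point: " + writer.syncPoints.get(0));
        System.out.println("sync entries:     " + writer.syncPoints.size());
        System.out.printf("overhead:         %.2f%%%n", writer.overheadPercent());
    }
}
```

Checking for a due sync entry before each record, rather than in the middle of one, is what keeps sync points aligned with record boundaries in this model.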
Running the program in Example 5-11 shows the sync points in the sequence file as
asterisks. The first one occurs at position 2021 (the second one occurs at position 4075,
but is not shown in the output):
% hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
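The offsets in this listing are enough to estimate how large a sync entry is. The following worked example compares the record at position 173 with the one at position 2021: both hold the same text and an integer key, so any difference in size can be attributed to the sync entry, assuming the entry is stored immediately before the record flagged with an asterisk. The class name SyncEntrySize is hypothetical; the numbers are taken from the output above.

```java
// Arithmetic on offsets copied from the listing; the interpretation is an assumption.
class SyncEntrySize {
    // Bytes between the start of one record and the start of the next.
    static long span(long start, long nextStart) {
        return nextStart - start;
    }

    public static void main(String[] args) {
        long plain = span(173, 220);      // "Three, four, shut the door", no sync entry
        long flagged = span(2021, 2088);  // same text, flagged with an asterisk
        long syncEntry = flagged - plain; // extra bytes attributed to the sync entry
        long spacing = span(2021, 4075);  // distance between the first two sync points

        System.out.println("plain record:   " + plain + " bytes");     // 47
        System.out.println("flagged record: " + flagged + " bytes");   // 67
        System.out.println("sync entry:     " + syncEntry + " bytes"); // 20
        System.out.println("sync spacing:   " + spacing + " bytes");   // 2054
    }
}
```

A 20-byte entry roughly every 2,000 bytes is in line with the overhead of less than 1% quoted earlier.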
There are two ways to seek to a given position in a sequence file. The first is the seek()
method, which positions the reader at the given point in the file. For example, seeking to a
record boundary works as expected:
reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));
But if the position in the file is not at a record boundary, the reader fails when the next()
method is called:
reader.seek(360);
reader.next(key, value); // fails with IOException
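To see why an arbitrary offset causes trouble, here is a self-contained toy in plain Java. It is a sketch under stated assumptions, not Hadoop's on-disk format: ToyRecordFile, readableAt(), and sample() are hypothetical names, and each record is simply a 4-byte length, a 4-byte integer key, and UTF-8 text. When the reader starts in the middle of a record, the bytes it interprets as a length header are really a mix of header and key bytes, so the sanity check on the length fails.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy length-prefixed record file held in memory; NOT the real SequenceFile layout.
class ToyRecordFile {
    final byte[] data;                                  // the "file" contents
    final List<Integer> boundaries = new ArrayList<>(); // start offset of each record
    int lastKey;                                        // key of the record last read
    String lastValue;                                   // value of the record last read
    private int pos;                                    // current read position

    ToyRecordFile(int[] keys, String[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            for (int i = 0; i < keys.length; i++) {
                boundaries.add(out.size());
                byte[] text = values[i].getBytes(StandardCharsets.UTF_8);
                out.writeInt(4 + text.length); // record length: key plus text
                out.writeInt(keys[i]);
                out.write(text);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        data = bytes.toByteArray();
    }

    void seek(int position) {
        pos = position;
    }

    // Reads the record at the current position; returns false at end of file.
    boolean next() throws IOException {
        if (pos == data.length) {
            return false;
        }
        if (pos < 0 || data.length - pos < 4) {
            throw new IOException("truncated record header at " + pos);
        }
        int length = ByteBuffer.wrap(data).getInt(pos);
        if (length < 4 || length > data.length - pos - 4) {
            throw new IOException("bad record length " + length + " at " + pos);
        }
        lastKey = ByteBuffer.wrap(data).getInt(pos + 4);
        lastValue = new String(data, pos + 8, length - 4, StandardCharsets.UTF_8);
        pos += 4 + length;
        return true;
    }

    // Seeks to the position and reports whether a record could be read there.
    boolean readableAt(int position) {
        seek(position);
        try {
            return next();
        } catch (IOException e) {
            return false;
        }
    }

    static ToyRecordFile sample() {
        return new ToyRecordFile(
            new int[] {100, 99, 98},
            new String[] {"One, two, buckle my shoe", "Three, four, shut the door",
                          "Five, six, pick up sticks"});
    }

    public static void main(String[] args) {
        ToyRecordFile file = sample();
        int boundary = file.boundaries.get(1);
        System.out.println(file.readableAt(boundary) + " key=" + file.lastKey); // true key=99
        System.out.println(file.readableAt(boundary + 1));                      // false
    }
}
```

In this toy the misaligned read is caught because the bogus length is larger than the file. A format without such a check could return garbage instead, which is one reason a reader needs a reliable way to find the next record boundary.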