to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume (see Chapter 14) uses row-oriented formats.
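The recovery property described above can be illustrated with a small sketch. This is not Hadoop's actual SequenceFile binary format (which uses a random 16-byte sync value, among other details); it is a toy length-prefixed row format with periodic sync markers, showing why a reader can recover every complete record written before a crash while discarding the torn record at the tail:

```python
import os
import struct

# Toy sync marker; a real SequenceFile uses a random 16-byte value per file.
SYNC = b"\xffSYNC\xff\x00\x01"

def write_records(path, records, sync_every=3):
    """Append length-prefixed records, emitting a sync marker every few records."""
    with open(path, "wb") as f:
        for i, rec in enumerate(records):
            if i % sync_every == 0:
                f.write(SYNC)
                f.flush()  # records before this point are now on disk
            f.write(struct.pack(">I", len(rec)))
            f.write(rec)

def recover(path):
    """Read back complete records, stopping at a truncated tail left by a failed writer."""
    with open(path, "rb") as f:
        data = f.read()
    recs, pos = [], 0
    while pos < len(data):
        if data[pos:pos + len(SYNC)] == SYNC:
            pos += len(SYNC)  # skip sync marker and resynchronize
            continue
        if pos + 4 > len(data):
            break  # not even a full length prefix remains
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        if pos + 4 + n > len(data):
            break  # truncated record: drop it and stop
        recs.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return recs

# Simulate a writer crash: truncate the file mid-record, then recover.
write_records("demo.dat", [b"row1", b"row2", b"row3", b"row4"])
with open("demo.dat", "r+b") as f:
    f.truncate(os.path.getsize("demo.dat") - 2)
print(recover("demo.dat"))  # the three complete rows survive; the torn row is dropped
```

A column-oriented layout has no analogue of this trick: because values from many rows are interleaved into column chunks that are only complete when the file is finalized, a partially written file is unreadable, which is why such formats suit batch rather than streaming writes.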
The first column-oriented file format in Hadoop was Hive's RCFile, short for Record Columnar File. It has since been superseded by Hive's ORCFile (Optimized Record Columnar File) and Parquet (covered in Chapter 13). Parquet is a general-purpose column-oriented file format based on Google's Dremel, and has wide support across Hadoop components. Avro also has a column-oriented format called Trevni.
[44] For a comprehensive set of compression benchmarks, jvm-compressor-benchmark is a good reference for JVM-compatible libraries (including some native libraries).
[45] This example is based on one from Norbert Lindenberg and Masayoshi Okutsu's “Supplementary Characters in the Java Platform,” May 2004.
[46] Twitter's Elephant Bird project includes tools for working with Thrift and Protocol Buffers in Hadoop.
[47] In a similar vein, the blog post “A Million Little Files” by Stuart Sierra includes code for converting a tar file into a SequenceFile.
[
48
]
Full details of the format of these fields may be found in
SequenceFile
's
documentation
and source
code.