Also, it is not usually possible to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume (see Chapter 14) uses row-oriented formats.
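To make the recovery point concrete, the following is a minimal sketch (not from the book; the output path and record layout are invented) of a SequenceFile writer that periodically writes a sync marker and forces the bytes to the filesystem. Everything appended before the last hsync() survives a writer crash and can be read back normally:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RecoverableSequenceFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("events.seq"); // hypothetical output path

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (int i = 0; i < 1000; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
        if (i % 100 == 99) {
          writer.sync();  // write a sync marker into the stream
          writer.hsync(); // push buffered bytes out to the filesystem
        }
      }
    }
    // If the process dies mid-stream, a SequenceFile.Reader can still
    // consume every record up to the last flushed sync point.
  }
}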
The first column-oriented file format in Hadoop was Hive's RCFile, short for Record Columnar File. It has since been superseded by Hive's ORCFile (Optimized Record Columnar File) and by Parquet (covered in Chapter 13). Parquet is a general-purpose column-oriented file format based on Google's Dremel, and has wide support across Hadoop components. Avro also has a column-oriented format called Trevni.
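As an illustration of Parquet's integration across the Hadoop ecosystem, here is a sketch (the schema and path are invented; parquet-avro is just one of several object-model bindings) that writes Avro records into a Parquet file:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical two-field record schema
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"int\"}," +
        "{\"name\":\"msg\",\"type\":\"string\"}]}");

    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
            .withSchema(schema)
            .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1);
      record.put("msg", "hello");
      writer.write(record);
    } // row groups and the file footer are only finalized on close()
  }
}

Note that the writer buffers each row group in memory and writes the file footer only when it is closed, which is exactly why such a file cannot be recovered if the writer process fails mid-write.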