to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume (see Chapter 14) uses row-oriented formats.
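The recovery property described above can be illustrated with a small sketch. This is not Hadoop's actual SequenceFile binary format (which uses a random 16-byte sync value, among other details); it is a toy length-prefixed row format with periodic sync markers, showing why a reader can recover every complete record written before a crash while discarding the torn record at the tail:

```python
import os
import struct

# Toy sync marker; a real SequenceFile uses a random 16-byte value per file.
SYNC = b"\xffSYNC\xff\x00\x01"

def write_records(path, records, sync_every=3):
    """Append length-prefixed records, emitting a sync marker every few records."""
    with open(path, "wb") as f:
        for i, rec in enumerate(records):
            if i % sync_every == 0:
                f.write(SYNC)
                f.flush()  # records before this point are now on disk
            f.write(struct.pack(">I", len(rec)))
            f.write(rec)

def recover(path):
    """Read back complete records, stopping at a truncated tail left by a failed writer."""
    with open(path, "rb") as f:
        data = f.read()
    recs, pos = [], 0
    while pos < len(data):
        if data[pos:pos + len(SYNC)] == SYNC:
            pos += len(SYNC)  # skip sync marker and resynchronize
            continue
        if pos + 4 > len(data):
            break  # not even a full length prefix remains
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        if pos + 4 + n > len(data):
            break  # truncated record: drop it and stop
        recs.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return recs

# Simulate a writer crash: truncate the file mid-record, then recover.
write_records("demo.dat", [b"row1", b"row2", b"row3", b"row4"])
with open("demo.dat", "r+b") as f:
    f.truncate(os.path.getsize("demo.dat") - 2)
print(recover("demo.dat"))  # the three complete rows survive; the torn row is dropped
```

A column-oriented layout has no analogue of this trick: because values from many rows are interleaved into column chunks that are only complete when the file is finalized, a partially written file is unreadable, which is why such formats suit batch rather than streaming writes.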
The first column-oriented file format in Hadoop was Hive's RCFile, short for Record Columnar File. It has since been superseded by Hive's ORCFile (Optimized Record Columnar File) and Parquet (covered in Chapter 13). Parquet is a general-purpose column-oriented file format based on Google's Dremel, and has wide support across Hadoop components. Avro also has a column-oriented format called Trevni.
[44] For a comprehensive set of compression benchmarks, jvm-compressor-benchmark is a good reference for JVM-compatible libraries (including some native libraries).
[45] This example is based on one from Norbert Lindenberg and Masayoshi Okutsu's “Supplementary Characters in the Java Platform,” May 2004.
[46] Twitter's Elephant Bird project includes tools for working with Thrift and Protocol Buffers in Hadoop.
[47] In a similar vein, the blog post “A Million Little Files” by Stuart Sierra includes code for converting a tar file into a SequenceFile.
[
48
]
Full details of the format of these fields may be found in
SequenceFile
's
documentation
and source
code.