MapFile variants
Hadoop comes with a few variants on the general key-value MapFile interface:
SetFile is a specialization of MapFile for storing a set of Writable keys.
The keys must be added in sorted order.
ArrayFile is a MapFile where the key is an integer representing the index of
the element in the array and the value is a Writable value.
BloomMapFile is a MapFile that offers a fast version of the get() method,
especially for sparsely populated files. The implementation uses a dynamic
Bloom filter to test whether a given key is in the map. The test is very fast
because it is in-memory, but it has a nonzero probability of false positives. Only
if the test passes (the key is probably present) is the regular get() method called.
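The Bloom filter guard described above can be sketched in plain Java. This is a minimal illustration, not Hadoop's implementation: the class BloomGuardedMap, its two hash functions, and the in-memory TreeMap standing in for the on-disk MapFile are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.TreeMap;

// Hypothetical sketch: a tiny Bloom filter guarding a sorted-map lookup.
// None of these names come from Hadoop.
class BloomGuardedMap {
    private final BitSet bits;
    private final int numBits;
    // Stands in for the sorted, on-disk MapFile.
    private final TreeMap<String, String> store = new TreeMap<>();
    // Counts how often the "expensive" lookup actually runs.
    int storeLookups = 0;

    BloomGuardedMap(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Two bit positions per key; a real filter would use more, better-mixed hashes.
    private int h1(String key) {
        return Math.floorMod(key.hashCode(), numBits);
    }

    private int h2(String key) {
        return Math.floorMod(Integer.rotateLeft(key.hashCode(), 16) ^ 0x9E3779B9, numBits);
    }

    void put(String key, String value) {
        bits.set(h1(key));
        bits.set(h2(key));
        store.put(key, value);
    }

    // false means "definitely absent"; true means only "possibly present".
    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    String get(String key) {
        if (!mightContain(key)) {
            return null; // filter says absent: skip the expensive lookup
        }
        storeLookups++;
        return store.get(key); // still null if the filter gave a false positive
    }

    public static void main(String[] args) {
        BloomGuardedMap map = new BloomGuardedMap(1024);
        map.put("apple", "1");
        map.put("banana", "2");
        System.out.println(map.get("apple"));
        System.out.println(map.get("cherry"));
        System.out.println("store lookups: " + map.storeLookups);
    }
}
```

A key that was added always passes the filter, so the guard never hides a stored value. An absent key usually fails the filter and costs no lookup, which is why the benefit is greatest for sparsely populated files.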
Other File Formats and Column-Oriented Formats
While sequence files and map files are the oldest binary file formats in Hadoop, they are
not the only ones, and in fact there are better alternatives that should be considered for
new projects.
Avro datafiles (covered in Avro Datafiles) are like sequence files in that they are designed
for large-scale data processing (they are compact and splittable), but they are portable
across different programming languages. Objects stored in Avro datafiles are described by
a schema rather than by the Java code that implements a Writable object. Sequence files
rely on that Java code, which makes them very Java-centric. Avro datafiles are widely
supported across components in the Hadoop ecosystem, so they are a good default choice for
a binary format.
Sequence files, map files, and Avro datafiles are all row-oriented file formats, which
means that the values for each row are stored contiguously in the file. In a column-oriented
format, the rows in a file (or, equivalently, a table in Hive) are broken up into row
splits, then each split is stored in column-oriented fashion: the values for each row in the
first column are stored first, followed by the values for each row in the second column,
and so on. This is shown diagrammatically in Figure 5-4.
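The difference between the two layouts can be sketched with a small in-memory table. This is a hypothetical illustration only: the class ColumnLayoutSketch and the splitSize parameter are assumptions, the table is assumed to be rectangular, and no real file format is being reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the order in which values would be written to a file.
class ColumnLayoutSketch {

    // Row-oriented: all values of a row are contiguous.
    static List<String> rowOriented(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.addAll(Arrays.asList(row));
        }
        return out;
    }

    // Column-oriented: rows are cut into row splits of splitSize rows (assumed > 0),
    // and within each split the values are written column by column.
    static List<String> columnOriented(String[][] rows, int splitSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < rows.length; start += splitSize) {
            int end = Math.min(start + splitSize, rows.length);
            int numCols = rows[start].length;
            for (int c = 0; c < numCols; c++) {
                for (int r = start; r < end; r++) {
                    out.add(rows[r][c]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"a1", "b1", "c1"},
            {"a2", "b2", "c2"},
            {"a3", "b3", "c3"},
            {"a4", "b4", "c4"},
        };
        System.out.println(rowOriented(rows));
        // prints [a1, b1, c1, a2, b2, c2, a3, b3, c3, a4, b4, c4]
        System.out.println(columnOriented(rows, 2));
        // prints [a1, a2, b1, b2, c1, c2, a3, a4, b3, b4, c3, c4]
    }
}
```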
A column-oriented layout permits columns that are not accessed in a query to be skipped.
Consider a query of the table in Figure 5-4 that processes only column 2. With row-oriented
storage, like a sequence file, the whole row (stored in a sequence file record) is loaded
into memory, even though only the second column is actually read. Lazy deserialization
saves some processing cycles by deserializing only the column fields that are accessed,
but it can't avoid the cost of reading each row's bytes from disk.
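A back-of-the-envelope comparison shows why skipping columns matters. The sketch below assumes fixed-width columns and ignores file metadata, compression, and seek costs. The class name ColumnScanCost and the column widths are invented for illustration.

```java
// Hypothetical sketch: bytes that must be read to process a single column.
class ColumnScanCost {

    // Row-oriented storage: every row is read in full, whatever the query needs.
    static long rowOrientedBytes(int[] columnWidths, long numRows) {
        long rowWidth = 0;
        for (int w : columnWidths) {
            rowWidth += w;
        }
        return rowWidth * numRows;
    }

    // Column-oriented storage: only the requested column's values are read.
    static long columnOrientedBytes(int[] columnWidths, long numRows, int column) {
        return (long) columnWidths[column] * numRows;
    }

    public static void main(String[] args) {
        int[] widths = {8, 4, 100}; // assumed byte widths of columns 1, 2, and 3
        long numRows = 1000;
        // Column 2 is index 1.
        System.out.println(rowOrientedBytes(widths, numRows));       // prints 112000
        System.out.println(columnOrientedBytes(widths, numRows, 1)); // prints 4000
    }
}
```

Under these assumptions the query over column 2 touches 4,000 bytes instead of 112,000. Lazy deserialization would reduce the CPU work in the row-oriented case but not the 112,000 bytes read.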