because of the default three-way block-level replication strategy of HDFS that provides
fault tolerance on commodity servers but does not provide any colocation guarantees.
Floratou et al. [54] tackle this challenge by implementing a modified HDFS block place-
ment policy that guarantees that the files corresponding to the different columns of a split
are always co-located across replicas. Hence, when reading a data set, the column input
format can assign one or more split directories to a single split; the column files of a split
directory are then scanned sequentially, and the records are reassembled using
values from corresponding positions in the files. A lazy record construction technique is
used to mitigate the deserialization overhead in Hadoop, as well as eliminate unneces-
sary disk I/O. The basic idea behind lazy record construction is to deserialize only those
columns of a record that are actually accessed in a map function. Each column of the
input data set can be compressed using one of the following compression schemes:
1. Compressed blocks: This scheme uses a standard compression algorithm to
compress a block of contiguous column values. Multiple compressed blocks
may fit into a single HDFS block. A header indicates the number of records
in a compressed block and the block's size, which allows the block to be
skipped entirely if none of its values are accessed. However, when any value in
the block is accessed, the entire block needs to be decompressed (see the
sketch after this list).
2. Dictionary compressed skip list: This scheme is tailored for map-typed
columns. It takes advantage of the fact that the keys used in maps are often
strings drawn from a limited universe. Such strings are well suited for
dictionary compression. A dictionary of keys is built for each block of
map values, and the compressed keys are stored using a skip list format.
The main advantage of this scheme is that a value can be accessed without
having to decompress an entire block of values.
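As an illustration of the first scheme, the following sketch shows how a reader can use the per-block header either to skip a compressed block entirely or to decompress it as a whole. The class name, the exact header layout, and the use of java.util.zip's Inflater (rather than a particular Hadoop codec) are illustrative assumptions, not the actual implementation:

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Reads a stream of compressed column blocks (hypothetical layout). */
public class CompressedBlockReader {

    /**
     * Each block starts with a small header: the number of records it holds
     * and its compressed size in bytes. If none of the block's values are
     * accessed by the map function, the payload is skipped without being
     * decompressed; otherwise the whole block must be inflated.
     */
    public byte[] readOrSkipBlock(DataInputStream in, boolean anyValueAccessed)
            throws IOException {
        int recordCount = in.readInt();     // header field 1: records in block
        int compressedSize = in.readInt();  // header field 2: payload size in bytes

        if (!anyValueAccessed) {
            in.skipBytes(compressedSize);   // cheap skip, no decompression work
            return null;
        }

        byte[] compressed = new byte[compressedSize];
        in.readFully(compressed);

        // Accessing even a single value forces decompressing the entire block.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream(compressedSize * 4);
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break;  // truncated input; stop rather than loop forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException("corrupt compressed block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}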
One advantage of this approach is that adding a column to a data set is not an
expensive operation. This can be done by simply placing an additional file for the
new column in each of the split directories. However, a potential disadvantage of
this approach is that the available parallelism may be limited for smaller data sets.
Maximum parallelism is achieved for a MapReduce job when the number of splits is
at least equal to the number of map tasks.
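As a concrete illustration of the lazy record construction technique described above, the following sketch deserializes a column value only the first time the map function asks for it. The class name and the string-typed columns are simplifying assumptions, not the actual input-format code:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** A record whose columns are deserialized only when they are accessed. */
public class LazyColumnarRecord {

    // Raw, still-serialized bytes of each column at this record's position.
    private final Map<String, byte[]> rawColumns;
    // Columns that have already been deserialized (here simply into Strings).
    private final Map<String, String> materialized = new HashMap<>();

    public LazyColumnarRecord(Map<String, byte[]> rawColumns) {
        this.rawColumns = rawColumns;
    }

    /**
     * Deserialization cost is paid only for the columns a map function
     * actually touches; untouched columns stay as raw bytes.
     */
    public String getColumn(String name) throws IOException {
        String value = materialized.get(name);
        if (value == null) {
            byte[] raw = rawColumns.get(name);
            if (raw == null) {
                return null;  // column not present in this split
            }
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
            value = in.readUTF();            // deserialize on first access only
            materialized.put(name, value);
        }
        return value;
    }
}

A map function that projects only one column therefore never pays the deserialization cost for the remaining columns of the record.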
The Llama system [93] introduced another approach to providing column storage
support for the MapReduce framework. In this approach, each imported table is
transformed into column groups, where each group contains a set of files representing
one or more columns. Llama introduced a column-wise format for Hadoop, called CFile,
where each file can contain multiple data blocks and each block of the file contains a
fixed number of records (Figure 2.7). However, the size of each logical block may vary
since records can be variable-sized. Each file includes a block index, stored after all
data blocks, which records the offset of each block and is used to locate a specific
block. To achieve storage efficiency, Llama applies block-level compression using any
of the well-known compression schemes.
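A minimal sketch of how such a trailing block index can be used to locate a block follows. The class, the footer layout (an array of block offsets followed by a 4-byte block count at the end of the file), and the fixed records-per-block value are all illustrative assumptions rather than the actual CFile implementation:

import java.io.IOException;
import java.io.RandomAccessFile;

/** Simplified CFile-style reader: fixed records per block, trailing block index. */
public class SimpleCFileReader {

    private static final int RECORDS_PER_BLOCK = 1024;  // fixed per file (assumption)

    private final RandomAccessFile file;
    private final long[] blockOffsets;  // offset of each data block in the file

    public SimpleCFileReader(RandomAccessFile file) throws IOException {
        this.file = file;
        // Assumed footer layout: ...data blocks... | block index | blockCount (int)
        file.seek(file.length() - 4);
        int blockCount = file.readInt();
        this.blockOffsets = new long[blockCount];
        file.seek(file.length() - 4 - 8L * blockCount);
        for (int i = 0; i < blockCount; i++) {
            blockOffsets[i] = file.readLong();
        }
    }

    /** Positions the file at the start of the block that holds the given record. */
    public void seekToRecord(long recordNumber) throws IOException {
        int block = (int) (recordNumber / RECORDS_PER_BLOCK);
        file.seek(blockOffsets[block]);
        // From here, at most RECORDS_PER_BLOCK - 1 records are scanned to reach
        // the requested one, since records inside a block are variable-sized.
    }
}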
To improve query processing and the performance of join operations, Llama columns are
formed into correlation groups to provide the basis for the vertical partitioning of
tables. In particular, it creates multiple vertical groups, where each group is defined
by a collection of columns.