- Seamless splitting: Data indexing adds an index overhead for each data split. Therefore, the logical split includes both the data and its index, and the indexed data is automatically split at logical split boundaries.
- Partial index: A Trojan Index need not be built on the entire split; it can be built on any contiguous subset of the split.
- Multiple indexes: Several Trojan Indexes can be built on the same split. However, only one of them can be the primary index. During query processing, an appropriate index can be chosen for data access based on the logical query plan and the cost model; a minimal lookup sketch is given below.
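To make the split-local index idea concrete, the following Java sketch shows one purely illustrative way a split could carry a sparse index over its key-sorted records and use it for lookups; the class and field names are assumptions made for illustration and do not reflect the actual Hadoop++ on-disk format.

import java.util.Arrays;

/**
 * Hypothetical sketch of a data split that co-locates a sparse index
 * with its key-sorted records, in the spirit of a Trojan Index.
 * Names and layout are illustrative, not the Hadoop++ format.
 */
public class IndexedSplit {
    private final long[] keys;        // sorted record keys in this split
    private final String[] records;   // payloads, aligned with keys
    private final long[] indexKeys;   // sparse index: every k-th key
    private final int[] indexOffsets; // position of that key in 'keys'

    public IndexedSplit(long[] keys, String[] records, int sparsity) {
        this.keys = keys;
        this.records = records;
        int entries = (keys.length + sparsity - 1) / sparsity;
        this.indexKeys = new long[entries];
        this.indexOffsets = new int[entries];
        for (int i = 0, e = 0; i < keys.length; i += sparsity, e++) {
            indexKeys[e] = keys[i];
            indexOffsets[e] = i;
        }
    }

    /** Uses the split-local index to narrow the scan, then probes the records. */
    public String lookup(long key) {
        int pos = Arrays.binarySearch(indexKeys, key);
        if (pos < 0) pos = Math.max(0, -pos - 2); // last index entry <= key
        for (int i = indexOffsets[pos]; i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) return records[i];
        }
        return null; // key not present in this split
    }
}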
- Trojan join: Similar to the idea of the Trojan index, the Hadoop++ system assumes that if we know the schema and the expected workload, then we can co-partition the input data during the loading time. In particular, given any two input relations, the same partitioning function is applied to the join attributes of both relations at data loading time, and the co-group pairs having the same join key from the two relations are placed on the same split and hence on the same node. As a result, join operations can then be processed locally within each node at query time. Implementing Trojan joins does not require any changes to the existing implementation of the Hadoop framework; the only changes concern the internal management of the data splitting process. In addition, Trojan indexes can be freely combined with Trojan joins. A minimal sketch of this load-time co-partitioning follows.
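The following Java sketch illustrates the co-partitioning idea under simplifying assumptions: both relations are hashed on their join keys with the same function at load time, so tuples sharing a join key land in the same partition and the join can later be evaluated locally per partition. The CoPartitioner class and its methods are hypothetical illustrations, not part of the Hadoop++ code base.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative load-time co-partitioning of two relations on their join key.
 * Tuples that hash to the same bucket go to the same partition, so the join
 * can later be evaluated locally per partition (and hence per node).
 */
public class CoPartitioner {
    private final int numPartitions;
    private final List<List<String[]>> left;   // partitioned tuples of relation R
    private final List<List<String[]>> right;  // partitioned tuples of relation S

    public CoPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
        this.left = new ArrayList<>();
        this.right = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            left.add(new ArrayList<>());
            right.add(new ArrayList<>());
        }
    }

    /** Same partitioning function for both relations: hash of the join key. */
    private int partitionOf(String joinKey) {
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void loadLeft(String[] tuple, int joinKeyIndex) {
        left.get(partitionOf(tuple[joinKeyIndex])).add(tuple);
    }

    public void loadRight(String[] tuple, int joinKeyIndex) {
        right.get(partitionOf(tuple[joinKeyIndex])).add(tuple);
    }

    /** At query time each partition can be joined independently of the others. */
    public List<String> joinPartition(int p, int leftKey, int rightKey) {
        List<String> out = new ArrayList<>();
        for (String[] r : left.get(p)) {
            for (String[] s : right.get(p)) {
                if (r[leftKey].equals(s[rightKey])) {
                    out.add(String.join(",", r) + " | " + String.join(",", s));
                }
            }
        }
        return out;
    }
}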
The design and implementation of a column-oriented and binary backend storage
format for Hadoop has been presented in [ 132 ]. In general, a straightforward way to
implement a column-oriented storage format for Hadoop is to store each column of
the input dataset in a separate file. However, this raises two main challenges:
- It requires generating roughly equal-sized splits so that a job can be effectively parallelized over the cluster.
- It needs to ensure that the corresponding values from different columns in the dataset are co-located on the same node running the map task.
The first challenge can be tackled by horizontally partitioning the dataset and storing each partition in a separate subdirectory. The second challenge is harder to tackle because the default three-way block-level replication strategy of HDFS provides fault tolerance on commodity servers but gives no co-location guarantees. Floratou et al. [132] address it by implementing a modified HDFS block placement policy which guarantees that the files corresponding to the different columns of a split are always co-located across replicas.
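As an illustration of this layout (with made-up path names and a local-filesystem stand-in for HDFS, not the actual format of [132]), the following Java sketch writes each column of a horizontal partition to its own file inside a per-partition subdirectory; that subdirectory is the unit whose column files the modified placement policy keeps together.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/**
 * Illustrative writer for a column-per-file layout: rows are horizontally
 * partitioned, each partition becomes a subdirectory, and every column of
 * that partition is written to its own file. (Local filesystem stand-in
 * for HDFS; path names are made up.)
 */
public class ColumnLayoutWriter {

    /** Writes 'rows' (all of the same arity) under baseDir/part-NNNNN/colX. */
    public static void write(Path baseDir, List<String[]> rows,
                             int rowsPerPartition) throws IOException {
        for (int start = 0, part = 0; start < rows.size(); start += rowsPerPartition, part++) {
            Path partDir = baseDir.resolve(String.format("part-%05d", part));
            Files.createDirectories(partDir);
            int end = Math.min(start + rowsPerPartition, rows.size());
            int numCols = rows.get(start).length;
            for (int c = 0; c < numCols; c++) {
                StringBuilder column = new StringBuilder();
                for (int r = start; r < end; r++) {
                    column.append(rows.get(r)[c]).append('\n');
                }
                // e.g. baseDir/part-00000/col2 holds column 2 of partition 0
                Files.write(partDir.resolve("col" + c),
                            column.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = List.of(
                new String[] {"1", "alice", "paris"},
                new String[] {"2", "bob", "tokyo"},
                new String[] {"3", "carol", "oslo"});
        write(Paths.get("dataset"), rows, 2); // 2 rows per horizontal partition
    }
}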
When reading a dataset, the column input format can assign one or more split-directories to a single split; the column files of a split-directory are scanned sequentially, and the records are reassembled using values from corresponding positions in the files. A lazy record construction technique is used to mitigate the deserialization overhead in Hadoop, as well as to eliminate unnecessary disk I/O: only those columns of a record that are actually accessed in a map function are deserialized.
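The following Java sketch illustrates lazy record construction under simplifying assumptions (in-memory raw column bytes and a hypothetical LazyRecord class, not the actual reader of [132]): a column value is deserialized only the first time the map function accesses it.

/**
 * Illustrative lazy record: each field is materialized from its column's
 * raw bytes only when it is first accessed, so columns a map function
 * never touches are never deserialized.
 */
public class LazyRecord {
    private final byte[][] rawColumns;   // raw (serialized) value per column
    private final Object[] materialized; // cache of deserialized values
    private final boolean[] loaded;

    public LazyRecord(byte[][] rawColumns) {
        this.rawColumns = rawColumns;
        this.materialized = new Object[rawColumns.length];
        this.loaded = new boolean[rawColumns.length];
    }

    /** Deserializes column 'i' on first access, then serves it from the cache. */
    public Object get(int i) {
        if (!loaded[i]) {
            materialized[i] = deserialize(rawColumns[i]);
            loaded[i] = true;
        }
        return materialized[i];
    }

    // Placeholder for the per-column decoding logic (UTF-8 strings here).
    private Object deserialize(byte[] raw) {
        return new String(raw, java.nio.charset.StandardCharsets.UTF_8);
    }
}

A map function that touches only a few columns of a wide record therefore never pays the CPU cost of deserializing the others; in the actual system, skipping untouched columns also avoids the corresponding disk reads.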
Each column of the input dataset can be compressed using one of the following compression schemes: