- Seamless splitting: Data indexing adds an index overhead for each data split. Therefore, the logical split includes both the data and its index, and the indexed data is automatically split at logical split boundaries.
- Partial index: A Trojan Index need not be built on the entire split; it can be built on any contiguous subset of the split.
- Multiple indexes: Several Trojan Indexes can be built on the same split. However, only one of them can be the primary index. During query processing, an appropriate index can be chosen for data access based on the logical query plan and the cost model; a minimal lookup sketch is given below.
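To make the split-local index idea concrete, the following Java sketch shows one purely illustrative way a split could carry a sparse index over its key-sorted records and use it for lookups; the class and field names are assumptions made for illustration and do not reflect the actual Hadoop++ on-disk format.

import java.util.Arrays;

/**
 * Hypothetical sketch of a data split that co-locates a sparse index
 * with its key-sorted records, in the spirit of a Trojan Index.
 * Names and layout are illustrative, not the Hadoop++ format.
 */
public class IndexedSplit {
    private final long[] keys;        // sorted record keys in this split
    private final String[] records;   // payloads, aligned with keys
    private final long[] indexKeys;   // sparse index: every k-th key
    private final int[] indexOffsets; // position of that key in 'keys'

    public IndexedSplit(long[] keys, String[] records, int sparsity) {
        this.keys = keys;
        this.records = records;
        int entries = (keys.length + sparsity - 1) / sparsity;
        this.indexKeys = new long[entries];
        this.indexOffsets = new int[entries];
        for (int i = 0, e = 0; i < keys.length; i += sparsity, e++) {
            indexKeys[e] = keys[i];
            indexOffsets[e] = i;
        }
    }

    /** Uses the split-local index to narrow the scan, then probes the records. */
    public String lookup(long key) {
        int pos = Arrays.binarySearch(indexKeys, key);
        if (pos < 0) pos = Math.max(0, -pos - 2); // last index entry <= key
        for (int i = indexOffsets[pos]; i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) return records[i];
        }
        return null; // key not present in this split
    }
}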
- Trojan join: Similar to the idea of the Trojan index, the Hadoop++ system assumes that if we know the schema and the expected workload, then we can co-partition the input data during the loading time. In particular, given any two input relations, the same partitioning function is applied to the join attributes of both relations at data loading time, and the co-group pairs having the same join key from the two relations are placed on the same split and hence on the same node. As a result, join operations can then be processed locally within each node at query time. Implementing Trojan joins does not require any changes to the existing implementation of the Hadoop framework; the only changes concern the internal management of the data splitting process. In addition, Trojan indexes can be freely combined with Trojan joins. A minimal sketch of this load-time co-partitioning follows.
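The following Java sketch illustrates the co-partitioning idea under simplifying assumptions: both relations are hashed on their join keys with the same function at load time, so tuples sharing a join key land in the same partition and the join can later be evaluated locally per partition. The CoPartitioner class and its methods are hypothetical illustrations, not part of the Hadoop++ code base.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative load-time co-partitioning of two relations on their join key.
 * Tuples that hash to the same bucket go to the same partition, so the join
 * can later be evaluated locally per partition (and hence per node).
 */
public class CoPartitioner {
    private final int numPartitions;
    private final List<List<String[]>> left;   // partitioned tuples of relation R
    private final List<List<String[]>> right;  // partitioned tuples of relation S

    public CoPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
        this.left = new ArrayList<>();
        this.right = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            left.add(new ArrayList<>());
            right.add(new ArrayList<>());
        }
    }

    /** Same partitioning function for both relations: hash of the join key. */
    private int partitionOf(String joinKey) {
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void loadLeft(String[] tuple, int joinKeyIndex) {
        left.get(partitionOf(tuple[joinKeyIndex])).add(tuple);
    }

    public void loadRight(String[] tuple, int joinKeyIndex) {
        right.get(partitionOf(tuple[joinKeyIndex])).add(tuple);
    }

    /** At query time each partition can be joined independently of the others. */
    public List<String> joinPartition(int p, int leftKey, int rightKey) {
        List<String> out = new ArrayList<>();
        for (String[] r : left.get(p)) {
            for (String[] s : right.get(p)) {
                if (r[leftKey].equals(s[rightKey])) {
                    out.add(String.join(",", r) + " | " + String.join(",", s));
                }
            }
        }
        return out;
    }
}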
The design and implementation of a column-oriented and binary backend storage
format for Hadoop has been presented in [ 132 ]. In general, a straightforward way to
implement a column-oriented storage format for Hadoop is to store each column of
the input dataset in a separate file. However, this raises two main challenges:
- It requires generating roughly equal-sized splits so that a job can be effectively parallelized over the cluster.
- It needs to ensure that the corresponding values from different columns in the dataset are co-located on the same node running the map task.
The first challenge can be tackled by horizontally partitioning the dataset and storing each partition in a separate subdirectory. The second challenge is harder to tackle because the default three-way block-level replication strategy of HDFS provides fault tolerance on commodity servers but gives no co-location guarantees. Floratou et al. [132] address it by implementing a modified HDFS block placement policy which guarantees that the files corresponding to the different columns of a split are always co-located across replicas.
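As an illustration of this layout (with made-up path names and a local-filesystem stand-in for HDFS, not the actual format of [132]), the following Java sketch writes each column of a horizontal partition to its own file inside a per-partition subdirectory; that subdirectory is the unit whose column files the modified placement policy keeps together.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/**
 * Illustrative writer for a column-per-file layout: rows are horizontally
 * partitioned, each partition becomes a subdirectory, and every column of
 * that partition is written to its own file. (Local filesystem stand-in
 * for HDFS; path names are made up.)
 */
public class ColumnLayoutWriter {

    /** Writes 'rows' (all of the same arity) under baseDir/part-NNNNN/colX. */
    public static void write(Path baseDir, List<String[]> rows,
                             int rowsPerPartition) throws IOException {
        for (int start = 0, part = 0; start < rows.size(); start += rowsPerPartition, part++) {
            Path partDir = baseDir.resolve(String.format("part-%05d", part));
            Files.createDirectories(partDir);
            int end = Math.min(start + rowsPerPartition, rows.size());
            int numCols = rows.get(start).length;
            for (int c = 0; c < numCols; c++) {
                StringBuilder column = new StringBuilder();
                for (int r = start; r < end; r++) {
                    column.append(rows.get(r)[c]).append('\n');
                }
                // e.g. baseDir/part-00000/col2 holds column 2 of partition 0
                Files.write(partDir.resolve("col" + c),
                            column.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = List.of(
                new String[] {"1", "alice", "paris"},
                new String[] {"2", "bob", "tokyo"},
                new String[] {"3", "carol", "oslo"});
        write(Paths.get("dataset"), rows, 2); // 2 rows per horizontal partition
    }
}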
When reading a dataset, the column input format can assign one or more split-directories to a single split; the column files of a split-directory are scanned sequentially, and the records are reassembled using values from corresponding positions in the files. A lazy record construction technique is used to mitigate the deserialization overhead in Hadoop, as well as to eliminate unnecessary disk I/O: only those columns of a record that are actually accessed in a map function are deserialized.
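The following Java sketch illustrates lazy record construction under simplifying assumptions (in-memory raw column bytes and a hypothetical LazyRecord class, not the actual reader of [132]): a column value is deserialized only the first time the map function accesses it.

/**
 * Illustrative lazy record: each field is materialized from its column's
 * raw bytes only when it is first accessed, so columns a map function
 * never touches are never deserialized.
 */
public class LazyRecord {
    private final byte[][] rawColumns;   // raw (serialized) value per column
    private final Object[] materialized; // cache of deserialized values
    private final boolean[] loaded;

    public LazyRecord(byte[][] rawColumns) {
        this.rawColumns = rawColumns;
        this.materialized = new Object[rawColumns.length];
        this.loaded = new boolean[rawColumns.length];
    }

    /** Deserializes column 'i' on first access, then serves it from the cache. */
    public Object get(int i) {
        if (!loaded[i]) {
            materialized[i] = deserialize(rawColumns[i]);
            loaded[i] = true;
        }
        return materialized[i];
    }

    // Placeholder for the per-column decoding logic (UTF-8 strings here).
    private Object deserialize(byte[] raw) {
        return new String(raw, java.nio.charset.StandardCharsets.UTF_8);
    }
}

A map function that touches only a few columns of a wide record therefore never pays the CPU cost of deserializing the others; in the actual system, skipping untouched columns also avoids the corresponding disk reads.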
Each column of the input dataset can be compressed using one of the following compression schemes: