Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


index access path, which can be used for selective MapReduce jobs. The scan access path can still be used for other MapReduce jobs. These indices are created by injecting appropriate UDFs inside the Hadoop implementation. Specifically, the main features of Trojan indices can be summarized as follows:

• No external library or engine: Trojan indices integrate indexing capability natively into the Hadoop framework without imposing a distributed SQL query engine on top of it.

• Noninvasive: They do not change the existing Hadoop framework. The index structure is implemented by providing the right UDFs.

• Optional access path: They provide an optional index access path that can be used for selective MapReduce jobs. However, the scan access path can still be used for other MapReduce jobs.

• Seamless splitting: Indexing adds an index overhead to each data split. The logical split therefore includes the index as well as the data, and the indexed data is split automatically at logical split boundaries.

• Partial index: A Trojan index need not be built on the entire split; it can be built on any contiguous subset of the split.

• Multiple indices: Several Trojan indices can be built on the same split, but only one of them can be the primary index. During query processing, an appropriate index is chosen for data access based on the logical query plan and the cost model.
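The difference between the two access paths can be illustrated with a small, self-contained sketch. It is not the Hadoop++ implementation: the class `IndexedSplit`, its methods, and the use of a sorted key list as the "index" are assumptions made only for illustration. A selective range query served through the index touches only the matching records, whereas the scan path reads the whole split.

```python
from bisect import bisect_left, bisect_right


class IndexedSplit:
    """Toy stand-in for one logical split: the data plus an index over it.

    Hypothetical structure for illustration only; it does not mirror the
    on-disk layout used by Hadoop++.
    """

    def __init__(self, records, key):
        self.key = key
        # Keeping the records sorted on the key plays the role of the
        # primary (clustered) index order.
        self.records = sorted(records, key=lambda r: r[key])
        # The "index" is simply the sorted key column; a real index
        # would be far more compact.
        self.index = [r[key] for r in self.records]

    def scan(self, low, high):
        """Scan access path: read every record and filter."""
        touched = len(self.records)
        hits = [r for r in self.records if low <= r[self.key] <= high]
        return hits, touched

    def index_lookup(self, low, high):
        """Index access path: binary-search the range, read only that slice."""
        start = bisect_left(self.index, low)
        end = bisect_right(self.index, high)
        hits = self.records[start:end]
        return hits, len(hits)


# Build a split of 1000 records, inserted in reverse key order.
split = IndexedSplit(
    [{"id": i, "payload": f"row-{i}"} for i in range(1000, 0, -1)],
    key="id",
)

scan_hits, scan_touched = split.scan(10, 12)
index_hits, index_touched = split.index_lookup(10, 12)
print(scan_touched, index_touched)
```

Both paths return the same records; only the number of records read differs, which is why the index path pays off for selective jobs while the scan path remains adequate for jobs that read most of the split.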

• Trojan Join: Similar to the idea of the Trojan index, the Hadoop++ system assumes that if the schema and the expected workload are known, then the input data can be co-partitioned at loading time. In particular, given any two input relations, the system applies the same partitioning function to the join attributes of both relations at data loading time and places the co-group pairs, which have the same join key in the two relations, on the same split and, hence, on the same node. As a result, join operations can then be processed locally within each node at query time. Implementing Trojan joins does not require any changes to the existing implementation of the Hadoop framework. The only changes are made to the internal management of the data splitting process. In addition, Trojan indices can be freely combined with Trojan joins.
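The co-partitioning idea can be sketched in a few lines. The example below is a simplified, single-process illustration, not the Hadoop++ code: the function names, the CRC32-based partitioning function, the number of splits, and the sample relations are all assumptions. It applies the same partitioning function to the join attribute of both relations at "load time", so that each split can later be joined on its own.

```python
import zlib
from collections import defaultdict

NUM_SPLITS = 4  # hypothetical number of splits


def partition(key, num_splits=NUM_SPLITS):
    """Deterministic partitioning function shared by both relations."""
    return zlib.crc32(str(key).encode("utf-8")) % num_splits


def co_partition(left, right, left_key, right_key):
    """Load time: route rows of both relations to splits by join key."""
    splits = defaultdict(lambda: {"left": [], "right": []})
    for row in left:
        splits[partition(row[left_key])]["left"].append(row)
    for row in right:
        splits[partition(row[right_key])]["right"].append(row)
    return splits


def local_join(split, left_key, right_key):
    """Query time: hash join that only looks at rows inside one split."""
    by_key = defaultdict(list)
    for row in split["left"]:
        by_key[row[left_key]].append(row)
    return [
        {**l, **r}
        for r in split["right"]
        for l in by_key.get(r[right_key], [])
    ]


# Illustrative sample relations.
customers = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Lin"},
    {"cust_id": 3, "name": "Sam"},
]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 3},
    {"order_id": 12, "cust_id": 1},
]

splits = co_partition(customers, orders, "cust_id", "cust_id")
joined = [
    row
    for split in splits.values()
    for row in local_join(split, "cust_id", "cust_id")
]
print(sorted(joined, key=lambda r: r["order_id"]))
```

Because rows with equal join keys always land in the same split, no row needs to leave its split at query time, which corresponds to processing the join locally on each node.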

The design and implementation of a column-oriented and binary backend storage format for Hadoop has been presented in [54]. In general, a straightforward way to implement a column-oriented storage format for Hadoop is to store each column of the input data set in a separate file. However, this raises two main challenges:

• It requires generating roughly equal-sized splits so that a job can be effectively parallelized over the cluster.

• It needs to ensure that the corresponding values from different columns in the data set are co-located on the same node running the map task.
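To make the two requirements concrete, the following sketch writes a small data set with one file per column, grouped into row-range partitions of a fixed size. It is only an illustration on a local file system: the function names, the `part-NNNNN` directory naming, the `.col` extension, and the plain-text encoding are assumptions, and it says nothing about how HDFS would place the files on nodes. It does show that fixed row counts yield partitions of similar size, and that the column files within a partition stay row-aligned so that records can be rebuilt by position.

```python
import os
import tempfile


def write_column_partitions(rows, columns, rows_per_partition, root):
    """Write rows as <root>/part-NNNNN/<column>.col, one value per line."""
    partition_dirs = []
    for start in range(0, len(rows), rows_per_partition):
        chunk = rows[start:start + rows_per_partition]
        part_dir = os.path.join(root, f"part-{start // rows_per_partition:05d}")
        os.makedirs(part_dir)
        for col in columns:
            path = os.path.join(part_dir, f"{col}.col")
            with open(path, "w", encoding="utf-8") as f:
                for row in chunk:
                    f.write(f"{row[col]}\n")
        partition_dirs.append(part_dir)
    return partition_dirs


def read_partition(part_dir, columns):
    """Rebuild rows by zipping the column files of one partition by position."""
    cols = []
    for col in columns:
        path = os.path.join(part_dir, f"{col}.col")
        with open(path, encoding="utf-8") as f:
            cols.append(f.read().splitlines())
    # Values come back as strings because the files are plain text.
    return [dict(zip(columns, values)) for values in zip(*cols)]


# Illustrative data set: 10 rows, two columns, 4 rows per partition.
rows = [{"id": i, "url": f"http://example.org/{i}"} for i in range(10)]
root = tempfile.mkdtemp()
parts = write_column_partitions(rows, ["id", "url"], rows_per_partition=4, root=root)
first = read_partition(parts[0], ["id", "url"])
print(len(parts), first[0])
```

A task that needs only the `id` column can open `id.col` alone, while a task that needs both columns must find all column files of the same partition available together, which is the co-location requirement stated above.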

The first challenge can be tackled by horizontally partitioning the data set and storing each partition in a separate subdirectory. The second challenge is harder to tackle
