Database Reference
In-Depth Information
index access path, which can be used for selective MapReduce jobs. The
scan access path can still be used for other MapReduce jobs. These indices
are created by injecting appropriate UDFs inside the Hadoop implementation.
Specifically, the main features of Trojan indices can be summarized as
follows:
No external library or engine: Trojan indices integrate indexing capability
natively into the Hadoop framework without imposing a distributed SQL query
engine on top of it.
Noninvasive: They do not change the existing Hadoop framework. The index
structure is implemented by providing the right UDFs.
Optional access path: They provide an optional index access path that can
be used for selective MapReduce jobs. The scan access path can still be
used for other MapReduce jobs.
Seamless splitting: Indexing adds an index overhead to each data split. The
logical split therefore includes the index as well as the data, and the
indexed data is automatically split at logical split boundaries.
Partial index: A Trojan index need not be built on the entire split. It can
be built on any contiguous subset of the split.
Multiple indexes: Several Trojan indexes can be built on the same split.
However, only one of them can be the primary index. During query
processing, an appropriate index can be chosen for data access based on
the logical query plan and the cost model.
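The difference between the two access paths can be illustrated with a small
sketch. It is not the Hadoop++ implementation: the split is an in-memory
dictionary, the "index" is simply a sorted list of keys, and the names
build_split, scan_access, and index_access are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_split(records, key):
    """Sort a split's records by the index key and keep the sorted keys.

    Assumption: the sorted key list stands in for the per-split index;
    the real on-disk layout is not modelled here.
    """
    data = sorted(records, key=key)
    index = [key(r) for r in data]
    return {"data": data, "index": index}

def scan_access(split, predicate):
    """Scan access path: inspect every record in the split."""
    return [r for r in split["data"] if predicate(r)]

def index_access(split, low, high):
    """Index access path: binary-search the keys for the range [low, high]."""
    start = bisect_left(split["index"], low)
    end = bisect_right(split["index"], high)
    return split["data"][start:end]

records = [(17, "q"), (3, "a"), (42, "z"), (8, "c"), (23, "m")]
split = build_split(records, key=lambda r: r[0])

# A selective job can use the index; any other job can still scan.
selective = index_access(split, 5, 20)
full = scan_access(split, lambda r: 5 <= r[0] <= 20)
```

Both calls return the same records here; the index path only touches the
qualifying range, while the scan path reads the whole split.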
Trojan Join: Similar to the idea of the Trojan index, the Hadoop++ system
assumes that if the schema and the expected workload are known, the input data
can be co-partitioned at load time. In particular, given any two input
relations, the system applies the same partitioning function to the join
attributes of both relations at data loading time and places the co-group
pairs, which have the same join key in the two relations, on the same split
and, hence, on the same node. As a result, join operations can be processed
locally within each node at query time. Implementing Trojan joins does not
require any changes to the existing implementation of the Hadoop framework.
The only changes are made to the internal management of the data splitting
process. In addition, Trojan indices can be freely combined with Trojan joins.
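The co-partitioning idea can be sketched as follows. This is a minimal
illustration under stated assumptions, not the Hadoop++ code: splits are
plain Python lists, the partitioning function is hash(key) modulo the number
of splits, and co_partition and local_join are hypothetical names.

```python
from collections import defaultdict

def co_partition(left, right, left_key, right_key, num_splits):
    """Load time: apply the same partitioning function to both relations."""
    splits = [{"left": [], "right": []} for _ in range(num_splits)]
    for row in left:
        splits[hash(left_key(row)) % num_splits]["left"].append(row)
    for row in right:
        splits[hash(right_key(row)) % num_splits]["right"].append(row)
    return splits

def local_join(split, left_key, right_key):
    """Query time: join inside one split, with no data from other splits."""
    by_key = defaultdict(list)
    for row in split["left"]:
        by_key[left_key(row)].append(row)
    return [(l, r) for r in split["right"] for l in by_key.get(right_key(r), [])]

# Hypothetical relations: customers(id, name) and orders(order_id, customer_id).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(100, 1), (101, 3), (102, 1), (103, 4)]
cust_key = lambda c: c[0]
order_key = lambda o: o[1]

splits = co_partition(customers, orders, cust_key, order_key, num_splits=2)
joined = sorted(
    pair for s in splits for pair in local_join(s, cust_key, order_key)
)
```

Because rows with the same join key always land in the same split, the union
of the per-split joins equals the full join of the two relations.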
The design and implementation of a column-oriented and binary backend storage
format for Hadoop has been presented in [54]. In general, a straightforward way to
implement a column-oriented storage format for Hadoop is to store each column of
the input data set in a separate file. However, this raises two main challenges:
It requires generating roughly equal-sized splits so that a job can be
effectively parallelized over the cluster.
It needs to ensure that the corresponding values from different columns in
the data set are co-located on the same node running the map task.
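The straightforward one-file-per-column layout can be sketched as below. The
sketch makes several assumptions that are not from [54]: it writes plain text
rather than a binary format, it uses the local file system instead of HDFS,
and the ".col" suffix and the function names are hypothetical.

```python
import os
import tempfile

def write_columns(rows, column_names, directory):
    """Write each column to its own file, one value per line."""
    for i, name in enumerate(column_names):
        with open(os.path.join(directory, name + ".col"), "w") as f:
            for row in rows:
                f.write(str(row[i]) + "\n")

def read_columns(directory, wanted):
    """Read only the requested columns and rebuild rows by position."""
    columns = []
    for name in wanted:
        with open(os.path.join(directory, name + ".col")) as f:
            columns.append([line.rstrip("\n") for line in f])
    # Values at the same position in each file belong to the same record.
    return list(zip(*columns))

rows = [(1, "Ann", 30), (2, "Bob", 25), (3, "Cy", 41)]
with tempfile.TemporaryDirectory() as d:
    write_columns(rows, ["id", "name", "age"], d)
    projected = read_columns(d, ["name", "age"])
```

Rebuilding a record relies on reading the same position from every column
file, which is why the corresponding files must be available on the node
that runs the map task.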
The first challenge can be tackled by horizontally partitioning the data set and
storing each partition in a separate subdirectory. The second challenge is harder to tackle