intermediate sorting step. Tenzing also implements a block-based shuffle
mechanism that combines many small rows into compressed blocks, each of
which is treated as one row. This avoids reducer-side sorting and some of the
overhead associated with row serialization and deserialization in the
underlying MapReduce framework code.
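
The block-based shuffle idea can be illustrated with a minimal sketch. This is not Tenzing's implementation: the function names `pack_rows` and `unpack_block`, the block size, and the choice of JSON plus zlib are all assumptions made for illustration. The point is only that many small rows travel through the shuffle as one compressed record.

```python
import json
import zlib


def pack_rows(rows, rows_per_block=1000):
    """Group small rows into compressed blocks (hypothetical helper).

    Each returned block would be handed to the shuffle as a single record,
    so per-row serialization cost is paid once per block instead.
    """
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        payload = json.dumps(chunk).encode("utf-8")
        blocks.append(zlib.compress(payload))
    return blocks


def unpack_block(block):
    """Decompress one block back into its list of rows (hypothetical helper)."""
    return json.loads(zlib.decompress(block).decode("utf-8"))


# Illustrative data: 2500 small rows become 3 shuffle records.
rows = [[i, "user%d" % (i % 7)] for i in range(2500)]
blocks = pack_rows(rows)
restored = [row for block in blocks for row in unpack_block(block)]
```

Row order inside each block is preserved, so a consumer that does not need sorted input can read the rows back directly.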
2.4.5 Cheetah
The Cheetah system [35] has been introduced as a custom data warehouse solution
that has been built on top of the MapReduce framework. In particular, it defines a
virtual view on top of the common star or snowflake data warehouse schema and
applies a stack of optimization techniques on top of the MapReduce framework
including: data compression, optimized access methods, multiquery optimization,
and the exploitation of materialized views. Cheetah provides an SQL-like and a non-SQL
interface for applications to directly access the raw data, which enables seamless
integration of MapReduce and Data Warehouse tools so that the developers can take
full advantage of the power of both worlds. For example, it has a JDBC interface
such that a user program can submit a query and iterate through the output results. If
the query results are too big for a single program to consume, the user can write a
MapReduce job to analyze the query output files that are stored on HDFS.
Cheetah stores data in a compressed columnar format. The choice of compression
type for each column set is determined dynamically based on the data in each
cell. During the ETL (extract-transform-load) phase of a data warehousing project, the
statistics of each column are maintained and the best compression method is chosen.
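
As a rough illustration of statistics-driven compression selection, the sketch below gathers simple per-column statistics and picks an encoding from them. The encoding names (`run_length`, `dictionary`, `plain`), the statistics collected, and the thresholds are assumptions chosen for the example; the text does not specify which statistics or rules Cheetah actually uses.

```python
def column_stats(values):
    """Collect simple statistics for one column (illustrative choice of stats)."""
    if values:
        # A run is a maximal stretch of equal adjacent values.
        runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    else:
        runs = 0
    return {"count": len(values), "distinct": len(set(values)), "runs": runs}


def choose_encoding(stats):
    """Pick an encoding from column statistics (thresholds are assumptions)."""
    if stats["count"] == 0:
        return "plain"
    if stats["runs"] <= stats["count"] // 4:
        return "run_length"   # long runs of repeated values
    if stats["distinct"] <= stats["count"] // 2:
        return "dictionary"   # few distinct values, but not in long runs
    return "plain"            # mostly unique values


country = ["US"] * 8 + ["DE"] * 8
flag = ["a", "b", "a", "b", "a", "b", "a", "b"]
order_id = list(range(8))

chosen = {
    "country": choose_encoding(column_stats(country)),
    "flag": choose_encoding(column_stats(flag)),
    "order_id": choose_encoding(column_stats(order_id)),
}
```
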
During the query execution, Cheetah applies different optimization techniques. For
example, the map phase uses a shared scanner that shares the scan of the fact tables
and joins to the dimension tables where a selection pushup approach is applied to
share the joins among multiple queries. Each scanner attaches a query ID to each
output row, indicating which query this row qualifies for. The reduce phase splits the
input rows based on their query IDs and then sends them to the corresponding query
operators. Cheetah also makes use of materialized views and applies a
straightforward view-matching and query-rewriting process in which the query must refer to the
virtual view that corresponds to the same fact table upon which the materialized
view is defined. The nonaggregate columns referred to in the SELECT and WHERE
clauses of the query must be a subset of the materialized view's GROUP BY columns.
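
The two view-matching conditions stated above can be expressed as a small check. This is a sketch only: the `Query` and `MaterializedView` classes, their field names, and the sample table and column names are hypothetical, and the check covers just the two conditions given in the text, not any further rules a real matcher might apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializedView:
    fact_table: str          # fact table the view is defined on
    group_by: frozenset      # the view's GROUP BY columns


@dataclass(frozen=True)
class Query:
    fact_table: str          # fact table behind the virtual view the query refers to
    select_columns: frozenset  # nonaggregate columns in SELECT
    where_columns: frozenset   # columns referred to in WHERE


def can_use_view(query, view):
    """Return True if the query satisfies both stated matching conditions."""
    # Condition 1: same underlying fact table.
    if query.fact_table != view.fact_table:
        return False
    # Condition 2: nonaggregate SELECT and WHERE columns are a subset
    # of the view's GROUP BY columns.
    return (query.select_columns | query.where_columns) <= view.group_by


view = MaterializedView("impressions", frozenset({"advertiser_id", "day"}))
q_ok = Query("impressions", frozenset({"advertiser_id"}), frozenset({"day"}))
q_extra_col = Query("impressions", frozenset({"advertiser_id"}), frozenset({"country"}))
q_other_table = Query("clicks", frozenset({"advertiser_id"}), frozenset({"day"}))
```

Here `q_ok` can be rewritten against the view, while `q_extra_col` filters on a column the view does not group by and `q_other_table` targets a different fact table.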
2.4.6 YSmart
The YSmart system [88] has been presented as a correlation-aware SQL-to-MapReduce
translator that attempts to optimize complex queries without modification
to the MapReduce framework and the underlying system. It applies a set of
rules to use the minimal number of MapReduce jobs to execute multiple correlated
operations in a complex query with the aim of reducing redundant computations,
I/O operations, and overhead of network transfers. The YSmart translator is used in
the Facebook production environment to achieve these goals. In particular, YSmart
batch-processes multiple correlated query operations within a query and applies a set