Database Reference
small rows into compressed blocks, each of which is treated as one row, in order to
avoid reducer-side sorting and some of the overhead associated with row
serialization and deserialization in the underlying MapReduce framework code.
Cheetah
The Cheetah system [101] is a custom data warehouse solution built on top of the
MapReduce framework. In particular, it defines a virtual view on top of the
common star or snowflake data warehouse schema and applies a stack of
optimization techniques on top of the MapReduce framework, including data
compression, optimized access methods, multi-query optimization, and the
exploitation of materialized views. Cheetah provides an SQL-like and a non-SQL
interface for applications to directly access the raw data, which enables seamless
integration of MapReduce and data warehouse tools so that developers can take
full advantage of both. For example, it has a JDBC interface through which a user
program can submit a query and iterate through the output results. If the query
results are too big for a single program to consume, the user can write a
MapReduce job to analyze the query output files, which are stored on HDFS.
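The "virtual view over a star schema" idea and the submit-then-iterate access pattern can be sketched in a few lines. This is only an illustration: Python's built-in sqlite3 module stands in for Cheetah and its JDBC driver, and the table names, view name, and data are hypothetical.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny star schema (one fact table, one dimension table)
    and a view that hides the join. All names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_advertiser (advertiser_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_impressions (advertiser_id INTEGER, day TEXT, clicks INTEGER);
        INSERT INTO dim_advertiser VALUES (1, 'acme'), (2, 'globex');
        INSERT INTO fact_impressions VALUES
            (1, '2024-01-01', 3), (1, '2024-01-02', 5), (2, '2024-01-01', 7);
        -- The view plays the role of the virtual view: queries never
        -- mention the underlying fact/dimension join.
        CREATE VIEW impressions_view AS
            SELECT d.name AS advertiser, f.day AS day, f.clicks AS clicks
            FROM fact_impressions f JOIN dim_advertiser d USING (advertiser_id);
        """
    )
    return conn


def clicks_per_advertiser(conn: sqlite3.Connection) -> dict:
    """Submit one query against the view and iterate over the result rows."""
    cursor = conn.execute(
        "SELECT advertiser, SUM(clicks) FROM impressions_view "
        "GROUP BY advertiser ORDER BY advertiser"
    )
    totals = {}
    for advertiser, total in cursor:  # row-by-row iteration, as with a JDBC ResultSet
        totals[advertiser] = total
    return totals


if __name__ == "__main__":
    print(clicks_per_advertiser(build_demo_warehouse()))
```

The query is written against the view only, so the physical layout of the fact and dimension tables can change without affecting the client program.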
Cheetah stores data in a compressed columnar format. The choice of compression
type for each column set is determined dynamically based on the data in each
cell. During the ETL (extract-transform-load) phase of a data warehousing project, the
statistics of each column are maintained and the best compression method is chosen.
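A minimal sketch of statistics-driven compression selection follows. The text does not say which statistics or which compression methods Cheetah uses, so the statistics (row count, distinct values, run count), the candidate methods, and the thresholds below are all assumptions chosen for illustration.

```python
def column_stats(values):
    """Collect simple per-column statistics in one pass over the data.
    A 'run' is a maximal stretch of equal adjacent values."""
    runs = 1 if values else 0
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return {"rows": len(values), "distinct": len(set(values)), "runs": runs}


def choose_compression(stats):
    """Pick a compression method from the statistics.
    The thresholds are hypothetical, not Cheetah's actual rules."""
    rows = stats["rows"]
    if rows == 0:
        return "none"
    if stats["runs"] <= rows // 4:      # long runs of repeated values
        return "run_length"
    if stats["distinct"] <= rows // 2:  # few distinct values, no long runs
        return "dictionary"
    return "none"                       # mostly unique values


if __name__ == "__main__":
    for column in (["us"] * 8, ["a", "b"] * 4, list(range(8))):
        print(choose_compression(column_stats(column)))
```

Because the decision is made per column from data observed at load time, two columns of the same table can end up with different encodings.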
During query execution, Cheetah applies several optimization techniques. For
example, the map phase uses a shared scanner, which shares the scan of the fact
tables and the joins to the dimension tables; a selection pushup approach is
applied in order to share the joins among multiple queries. Each scanner attaches
a query ID to each output row, indicating which query this row qualifies for. The
reduce phase splits the input rows based on their query IDs and then sends them
to the corresponding query operators. Cheetah also makes use of materialized views
and applies a straightforward view matching and query rewriting process, in which
the query must refer to the virtual view that corresponds to the same fact table upon
which the materialized view is defined. The non-aggregate columns referred to in the
SELECT and WHERE clauses of the query must be a subset of the materialized
view's group by columns.
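The shared scan and the view-matching rule described above can be sketched as plain Python functions. This is a simplified illustration, not Cheetah's implementation: it runs in one process rather than as a MapReduce job, it omits the dimension joins and the selection pushup, and all function names, parameters, and sample data are hypothetical.

```python
from collections import defaultdict


def shared_scan(fact_rows, queries):
    """Map side: read each fact row once and emit (query_id, row)
    for every query whose predicate the row satisfies."""
    for row in fact_rows:
        for query_id, predicate in queries.items():
            if predicate(row):
                yield query_id, row


def split_by_query(tagged_rows):
    """Reduce side: route rows to per-query buckets by their query ID."""
    per_query = defaultdict(list)
    for query_id, row in tagged_rows:
        per_query[query_id].append(row)
    return dict(per_query)


def view_matches(query_view, query_columns, mv_view, mv_group_by):
    """View matching: the query must target the same virtual view as the
    materialized view, and its non-aggregate SELECT/WHERE columns must be
    a subset of the materialized view's group by columns."""
    return query_view == mv_view and set(query_columns) <= set(mv_group_by)


if __name__ == "__main__":
    rows = [
        {"country": "us", "clicks": 3},
        {"country": "de", "clicks": 0},
        {"country": "us", "clicks": 0},
    ]
    queries = {
        "q1": lambda r: r["country"] == "us",
        "q2": lambda r: r["clicks"] > 0,
    }
    print(split_by_query(shared_scan(rows, queries)))
    print(view_matches("impressions", ["country", "day"],
                       "impressions", ["country", "day", "advertiser"]))
```

A row that satisfies several queries is emitted once per query, so the fact table is read only once no matter how many queries share the scan.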
SQL/MapReduce
In general, a user-defined function (UDF) is a powerful database feature that
allows users to customize database functionality. Friedman et al. [134] introduced
the SQL/MapReduce (SQL/MR) UDF framework, which is designed to facilitate
parallel computation of procedural functions across hundreds of servers working