Big Data Processing Systems - Cloud Data Management

small rows into compressed blocks, each of which is treated as one row, in order to avoid reducer-side sorting and some of the overhead associated with row serialization and deserialization in the underlying MapReduce framework code.

Cheetah

The Cheetah system [101] is a custom data warehouse solution built on top of the MapReduce framework. In particular, it defines a virtual view on top of the common star or snowflake data warehouse schema and applies a stack of optimization techniques on top of the MapReduce framework, including data compression, optimized access methods, multi-query optimization, and the exploitation of materialized views. Cheetah provides both an SQL-like and a non-SQL interface for applications to access the raw data directly, which enables seamless integration of MapReduce and data warehouse tools so that developers can take full advantage of both worlds. For example, it has a JDBC interface through which a user program can submit a query and iterate through the output results. If the query results are too big for a single program to consume, the user can write a MapReduce job to analyze the query output files, which are stored on HDFS.

Cheetah stores data in a compressed columnar format. The compression type for each column set is determined dynamically based on the data in each cell. During the ETL (extract-transform-load) phase of a data warehousing project, statistics for each column are maintained and the best compression method is chosen.
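The idea of choosing a compression method from per-column statistics can be illustrated with a minimal Python sketch. The statistics collected here (row count, distinct values, number of runs), the candidate encodings ("run_length", "dictionary", "plain") and the thresholds are all assumptions made for illustration; the text does not specify which statistics or methods Cheetah actually uses.

```python
from dataclasses import dataclass
from typing import Hashable, Sequence


@dataclass
class ColumnStats:
    """Statistics gathered for one column while a batch of rows is loaded."""
    count: int = 0     # number of values seen
    distinct: int = 0  # number of distinct values
    runs: int = 0      # number of maximal runs of equal adjacent values


def collect_stats(values: Sequence[Hashable]) -> ColumnStats:
    """Compute the statistics for one column in a single pass."""
    stats = ColumnStats(count=len(values), distinct=len(set(values)))
    previous = object()  # sentinel that compares unequal to every real value
    for value in values:
        if value != previous:
            stats.runs += 1
            previous = value
    return stats


def choose_encoding(stats: ColumnStats) -> str:
    """Pick an encoding name from the statistics (hypothetical thresholds)."""
    if stats.count == 0:
        return "plain"
    if stats.runs * 2 <= stats.count:        # average run length of at least 2
        return "run_length"
    if stats.distinct * 10 <= stats.count:   # few distinct values, many repeats
        return "dictionary"
    return "plain"
```

For instance, a column holding long runs of the same country code would be assigned "run_length" under these assumed thresholds, while a column of unique identifiers would fall back to "plain".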

During query execution, Cheetah applies several optimization techniques. For example, the map phase uses a shared scanner that shares the scan of the fact tables and the joins to the dimension tables, where a selection pushup approach is applied in order to share the joins among multiple queries. Each scanner attaches a query ID to each output row, indicating which query the row qualifies for. The reduce phase splits the input rows based on their query IDs and then sends them to the corresponding query operators. Cheetah also makes use of materialized views and applies a straightforward view matching and query rewriting process: the query must refer to the virtual view that corresponds to the same fact table upon which the materialized view is defined, and the non-aggregate columns referred to in the SELECT and WHERE clauses of the query must be a subset of the materialized view's group-by columns.
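The shared scan with query-ID tagging described above can be sketched in plain Python, outside any MapReduce runtime. This is only an illustration under stated assumptions: the function names, the in-memory dictionary standing in for a dimension table, and the representation of each query as a predicate are all hypothetical and not taken from Cheetah.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


def shared_scan_map(
    fact_rows: Iterable[Row],
    dimension: Dict[Hashable, Row],
    join_key: str,
    queries: Dict[str, Predicate],
) -> Iterator[Tuple[str, Row]]:
    """Map side: scan and join each fact row once, then tag it per query.

    Each query's selection is evaluated after the shared join (selection
    pushup), so the join work is done once for all queries.
    """
    for fact in fact_rows:
        dim = dimension.get(fact[join_key])
        if dim is None:
            continue  # no matching dimension row (inner join semantics)
        joined = {**fact, **dim}
        for query_id, predicate in queries.items():
            if predicate(joined):
                yield query_id, joined


def split_by_query(tagged_rows: Iterable[Tuple[str, Row]]) -> Dict[str, List[Row]]:
    """Reduce side: route each tagged row to the operator of its query."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for query_id, row in tagged_rows:
        groups[query_id].append(row)
    return dict(groups)
```

A row that satisfies several queries is emitted once per query ID, which is what lets the reduce side separate the results without rescanning the fact table.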
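The two view matching conditions stated above amount to an equality test and a subset test. The following sketch checks only those two conditions; the class and field names are hypothetical, and a complete matcher would likely need further checks that the text does not describe.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class MaterializedView:
    fact_table: str                   # fact table the view is defined on
    group_by_columns: FrozenSet[str]  # the view's group-by columns


@dataclass(frozen=True)
class Query:
    fact_table: str                        # fact table behind the query's virtual view
    non_aggregate_columns: FrozenSet[str]  # non-aggregate columns in SELECT and WHERE


def view_matches(query: Query, view: MaterializedView) -> bool:
    """Return True if the query may be rewritten to read from the view."""
    same_fact_table = query.fact_table == view.fact_table
    columns_covered = query.non_aggregate_columns <= view.group_by_columns
    return same_fact_table and columns_covered
```

Under this rule, a query that filters on a column missing from the view's group-by columns cannot be answered from the view and has to run against the base data.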

SQL/MapReduce

In general, a user-defined function (UDF) is a powerful database feature that allows users to customize database functionality. Friedman et al. [134] introduced the SQL/MapReduce (SQL/MR) UDF framework, which is designed to facilitate parallel computation of procedural functions across hundreds of servers working
