Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


intermediate sorting step. Tenzing also implements a block-based shuffle mechanism that combines many small rows into compressed blocks, each of which is treated as one row. This avoids reducer-side sorting and some of the overhead associated with row serialization and deserialization in the underlying MapReduce framework code.
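The block-based shuffle idea can be illustrated with a minimal sketch. The source does not describe Tenzing's actual block encoding, so the function names, the use of `pickle` and `zlib`, and the block size below are all assumptions chosen for illustration only.

```python
import pickle
import zlib


def pack_blocks(rows, rows_per_block=1000):
    """Group rows into compressed blocks; each block is shuffled as one record."""
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        # One serialization and one compression call per block, not per row.
        yield zlib.compress(pickle.dumps(chunk))


def unpack_block(block):
    """Recover the original rows from a single compressed block."""
    return pickle.loads(zlib.decompress(block))


# Hypothetical input: 2500 small rows packed into blocks of 1000 rows.
rows = [(i % 7, f"value-{i}") for i in range(2500)]
blocks = list(pack_blocks(rows, rows_per_block=1000))
restored = [row for block in blocks for row in unpack_block(block)]
```

In this sketch the shuffle layer would see only three records instead of 2500, which is where the saving in per-row serialization overhead would come from.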

2.4.5 Cheetah

The Cheetah system [35] is a custom data warehouse solution built on top of the MapReduce framework. In particular, it defines a virtual view on top of the common star or snowflake data warehouse schema and applies a stack of optimization techniques on top of the MapReduce framework, including data compression, optimized access methods, multiquery optimization, and the exploitation of materialized views. Cheetah provides an SQL-like interface and a non-SQL interface for applications to directly access the raw data, which enables seamless integration of MapReduce and data warehouse tools so that developers can take full advantage of both worlds. For example, it has a JDBC interface through which a user program can submit a query and iterate through the output results. If the query results are too big for a single program to consume, the user can write a MapReduce job to analyze the query output files, which are stored on HDFS.

Cheetah stores data in a compressed columnar format. The choice of compression type for each column set is determined dynamically based on the data in each cell. During the ETL (extract-transform-load) phase of a data warehousing project, the statistics of each column are maintained and the best compression method is chosen.
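A statistics-driven choice of compression method might look like the following sketch. The text does not say which statistics Cheetah keeps or which methods it chooses among, so the statistics, the method names, and the thresholds here are hypothetical.

```python
from itertools import groupby


def column_stats(values):
    """Collect simple statistics of one column, as might be done during ETL."""
    runs = sum(1 for _ in groupby(values))  # number of runs of equal values
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "runs": runs,
    }


def choose_compression(stats):
    """Pick a method from column statistics (hypothetical thresholds)."""
    if stats["count"] == 0:
        return "none"
    if stats["count"] / stats["runs"] >= 4:        # long runs of repeats
        return "run-length"
    if stats["distinct"] / stats["count"] <= 0.1:  # few distinct values
        return "dictionary"
    return "general-purpose"


# Hypothetical columns with different value distributions.
country = ["US"] * 8 + ["DE"] * 8
status = ["a", "b", "c"] * 20
row_id = list(range(50))
choices = {
    name: choose_compression(column_stats(col))
    for name, col in [("country", country), ("status", status), ("row_id", row_id)]
}
```

The point of the sketch is only that the decision is made per column from statistics gathered at load time, rather than fixed for the whole table.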

During query execution, Cheetah applies different optimization techniques. For example, the map phase uses a shared scanner that shares the scan of the fact tables and the joins to the dimension tables, where a selection pushup approach is applied to share the joins among multiple queries. Each scanner attaches a query ID to each output row, indicating which query the row qualifies for. The reduce phase splits the input rows based on their query IDs and then sends them to the corresponding query operators. Cheetah also makes use of materialized views and applies a straightforward view-matching and query-rewriting process in which the query must refer to the virtual view that corresponds to the same fact table upon which the materialized view is defined. The nonaggregate columns referred to in the SELECT and WHERE clauses of the query must be a subset of the materialized view's group-by columns.
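The shared scan with query-ID tagging described in this paragraph can be sketched as follows. This is a simplified illustration rather than Cheetah's implementation: the queries, the row layout, and the function names are hypothetical, and the joins to the dimension tables are omitted.

```python
from collections import defaultdict

# Hypothetical queries over one fact table: query ID -> row predicate.
QUERIES = {
    "q1": lambda row: row["country"] == "US",
    "q2": lambda row: row["clicks"] > 10,
}


def shared_scan(fact_rows, queries):
    """Map side: scan the fact table once and tag each qualifying row."""
    for row in fact_rows:
        for query_id, predicate in queries.items():
            if predicate(row):
                # A row that qualifies for several queries is emitted once per query.
                yield query_id, row


def split_by_query(tagged_rows):
    """Reduce side: route rows to per-query operators by their query ID."""
    per_query = defaultdict(list)
    for query_id, row in tagged_rows:
        per_query[query_id].append(row)
    return per_query


fact_table = [
    {"country": "US", "clicks": 3},
    {"country": "DE", "clicks": 25},
    {"country": "US", "clicks": 40},
]
routed = split_by_query(shared_scan(fact_table, QUERIES))
```

Here the fact table is read a single time for both queries, and the per-query lists stand in for the inputs to the corresponding query operators.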

2.4.6 YSmart

The YSmart system [88] is a correlation-aware SQL-to-MapReduce translator that attempts to optimize complex queries without modification to the MapReduce framework or the underlying system. It applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query, with the aim of reducing redundant computations, I/O operations, and the overhead of network transfers. The YSmart translator is used on the Facebook production environment to achieve these goals. In particular, YSmart batch-processes multiple correlated query operations within a query and applies a set
