Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

stores (e.g. MySQL database), column stores, Bigtable (Google's built in key-value

store) [ 99 ], GFS (Google File System) [ 137 ], text and protocol buffers. In particular,

the Tenzing system has four major components:

The distributed worker pool : Represents the execution system which takes a

query execution plan and executes the MapReduce jobs. The pool consists of

master and worker nodes plus an overall gatekeeper called the master watcher.

The workers manipulate the data for all the tables defined in the metadata layer.

The query server : Serves as the gateway between the client and the pool. The

query server parses the query, applies different optimization mechanisms and

sends the plan to the master for execution. In principle, the Tenzing optimizer

applies some basic rule and cost-based optimizations to create an optimal

execution plan.

Client interfaces : Tenzing has several client interfaces including a command line

client (CLI) and a Web UI. The CLI is a more powerful interface that supports

complex scripting while the Web UI supports easier-to-use features such as query

and table browsers tools. There is also an API to directly execute queries on the

pool and a standalone binary which does not need any server side components

but rather can launch its own MapReduce jobs.

The metadata server : Provides an API to store and fetch metadata such as table

names and schemas and pointers to the underlying data.

A typical Tenzing query is submitted to the query server (through the Web

UI, CLI or API) which is responsible for parsing the query into an intermediate

parse tree and fetching the required metadata from the metadata server. The query

optimizer goes through the intermediate format, applies various optimizations and

generates a query execution plan that consists of one or more MapReduce jobs.

For each MapReduce, the query server finds an available master using the master

watcher and submits the query to it. At this stage, the execution is physically

partitioned into multiple units of work where idle workers poll the masters for

available work. The query server monitors the generated intermediate results,

gathers them as they arrive and streams the output back to the client. In order to

increase throughput, decrease latency and execute SQL operators more efficiently,

Tenzing has enhanced the MapReduce implementation with some main changes:

Streaming and in-memory chaining : The implementation of Tenzing does not

serialize the intermediate results of MapReduce jobs to GFS. Instead, it streams

the intermediate results between the Map and Reduce tasks using the network

and uses GFS only for backup purposes. In addition, it uses a memory chaining

mechanism where the reducer and the mapper of the same intermediate results

are co-located in the same process.

Sort avoidance : Certain operators such as hash join and hash aggregation require

shuffling but not sorting. The MapReduce API was enhanced to automatically

turn off sorting for these operations, when possible, so that the mapper feeds

data to the reducer which automatically bypasses the intermediate sorting step.

Tenzing also implements a block-based shuffle mechanism that combines many

Cloud Data Management

Search WWH ::

Custom Search

Home