Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

key-value store) [31,32], GFS (Google File System) [58], text, and protocol buffers. In particular, the Tenzing system has four major components:

• The distributed worker pool: the execution system that takes a query execution plan and executes the MapReduce jobs. The pool consists of master and worker nodes plus an overall gatekeeper called the master watcher. The workers manipulate the data for all the tables defined in the metadata layer.

• The query server: the gateway between the client and the pool. The query server parses the query, applies different optimization mechanisms, and sends the plan to the master for execution. The Tenzing optimizer applies basic rule-based and cost-based optimizations to produce an efficient execution plan.

• Client interfaces: Tenzing has several client interfaces, including a command line client (CLI) and a Web UI. The CLI is the more powerful interface and supports complex scripting, while the Web UI offers easier-to-use features such as query and table browser tools. There is also an API to execute queries directly on the pool, and a standalone binary that needs no server-side components and instead launches its own MapReduce jobs.

• The metadata server: provides an API to store and fetch metadata such as table names, schemas, and pointers to the underlying data.
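As a rough illustration of how these four components relate, the following Python sketch models them as minimal classes. Every class, method, and field name here is hypothetical and does not reflect Tenzing's actual interfaces; parsing and optimization are reduced to building a one-element list of job descriptions, and the master runs the jobs inline instead of distributing them to workers.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    """Metadata entry: a schema plus a pointer to the underlying data."""
    schema: dict      # column name -> type name (illustrative)
    location: str     # hypothetical pointer to where the data lives


class MetadataServer:
    """Stores and fetches table metadata by table name."""

    def __init__(self):
        self._tables = {}

    def store(self, name, info):
        self._tables[name] = info

    def fetch(self, name):
        return self._tables[name]


class Master:
    """Receives a plan; a real master would hand units of work to workers."""

    def __init__(self):
        self.busy = False

    def execute(self, plan):
        # Stand-in for running the MapReduce jobs on the worker pool.
        return [f"ran {job}" for job in plan]


class MasterWatcher:
    """Gatekeeper that knows which masters are currently available."""

    def __init__(self, masters):
        self._masters = masters

    def find_available(self):
        for master in self._masters:
            if not master.busy:
                return master
        raise RuntimeError("no master available")


class QueryServer:
    """Gateway between the client and the pool."""

    def __init__(self, metadata, watcher):
        self.metadata = metadata
        self.watcher = watcher

    def run(self, table):
        info = self.metadata.fetch(table)            # schema and data pointer
        plan = [f"mapreduce over {info.location}"]   # stand-in for parse + optimize
        return self.watcher.find_available().execute(plan)


if __name__ == "__main__":
    metadata = MetadataServer()
    metadata.store("logs", TableInfo({"url": "string"}, "/data/logs"))
    server = QueryServer(metadata, MasterWatcher([Master()]))
    print(server.run("logs"))
```

The sketch keeps the metadata server separate from the query server so that any client interface (CLI, Web UI, or API) could go through the same gateway without knowing where table data is stored.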

A typical Tenzing query is submitted to the query server (through the Web UI, CLI, or API), which parses the query into an intermediate parse tree and fetches the required metadata from the metadata server. The query optimizer goes through the intermediate format, applies various optimizations, and generates a query execution plan that consists of one or more MapReduce jobs. For each MapReduce job, the query server finds an available master using the master watcher and submits the query to it. At this stage, the execution is physically partitioned into multiple units of work, and idle workers poll the masters for available work. The query server monitors the generated intermediate results, gathers them as they arrive, and streams the output back to the client. To increase throughput, decrease latency, and execute SQL operators more efficiently, Tenzing enhances the MapReduce implementation with the following main changes:

• Streaming and in-memory chaining: The implementation of Tenzing does not serialize the intermediate results of MapReduce jobs to GFS. Instead, it streams the intermediate results between the map and reduce tasks using the network and uses GFS only for backup purposes. In addition, it uses a memory chaining mechanism where the reducer and the mapper of the same intermediate results are colocated in the same process.

• Sort avoidance: Certain operators such as hash join and hash aggregation require shuffling but not sorting. The MapReduce API was enhanced to automatically turn off sorting for these operations, when possible, so that the mapper feeds data to the reducer, which automatically bypasses the
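The difference between shuffling and sorting can be illustrated with a small Python sketch. It is not Tenzing's implementation: the function names `shuffle` and `hash_aggregate` are hypothetical, and the example only shows why a hash aggregation gives the right answer when records reach the reducer grouped by partition but in arbitrary order.

```python
from collections import defaultdict


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to a reducer by key hash; no sorting."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append((key, value))
    return partitions


def hash_aggregate(partition):
    """Sum values per key with a hash table, consuming pairs in arrival order."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
    result = {}
    for partition in shuffle(pairs, num_reducers=3):
        # Each key lands in exactly one partition, so results never overlap.
        result.update(hash_aggregate(partition))
    print(result)
```

Because all pairs with the same key reach the same reducer, the per-key sums are correct even though no partition is ever sorted; a sort would only be needed by an operator that depends on key order.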
