Database Reference
In-Depth Information
stores (e.g. MySQL database), column stores, Bigtable (Google's built in key-value
store) [ 99 ], GFS (Google File System) [ 137 ], text and protocol buffers. In particular,
the Tenzing system has four major components:
￿
The distributed worker pool : Represents the execution system which takes a
query execution plan and executes the MapReduce jobs. The pool consists of
master and worker nodes plus an overall gatekeeper called the master watcher.
The workers manipulate the data for all the tables defined in the metadata layer.
￿
The query server : Serves as the gateway between the client and the pool. The
query server parses the query, applies different optimization mechanisms and
sends the plan to the master for execution. In principle, the Tenzing optimizer
applies some basic rule and cost-based optimizations to create an optimal
execution plan.
￿
Client interfaces : Tenzing has several client interfaces including a command line
client (CLI) and a Web UI. The CLI is a more powerful interface that supports
complex scripting while the Web UI supports easier-to-use features such as query
and table browsers tools. There is also an API to directly execute queries on the
pool and a standalone binary which does not need any server side components
but rather can launch its own MapReduce jobs.
￿
The metadata server : Provides an API to store and fetch metadata such as table
names and schemas and pointers to the underlying data.
A typical Tenzing query is submitted to the query server (through the Web
UI, CLI or API) which is responsible for parsing the query into an intermediate
parse tree and fetching the required metadata from the metadata server. The query
optimizer goes through the intermediate format, applies various optimizations and
generates a query execution plan that consists of one or more MapReduce jobs.
For each MapReduce, the query server finds an available master using the master
watcher and submits the query to it. At this stage, the execution is physically
partitioned into multiple units of work where idle workers poll the masters for
available work. The query server monitors the generated intermediate results,
gathers them as they arrive and streams the output back to the client. In order to
increase throughput, decrease latency and execute SQL operators more efficiently,
Tenzing has enhanced the MapReduce implementation with some main changes:
￿
Streaming and in-memory chaining : The implementation of Tenzing does not
serialize the intermediate results of MapReduce jobs to GFS. Instead, it streams
the intermediate results between the Map and Reduce tasks using the network
and uses GFS only for backup purposes. In addition, it uses a memory chaining
mechanism where the reducer and the mapper of the same intermediate results
are co-located in the same process.
￿
Sort avoidance : Certain operators such as hash join and hash aggregation require
shuffling but not sorting. The MapReduce API was enhanced to automatically
turn off sorting for these operations, when possible, so that the mapper feeds
data to the reducer which automatically bypasses the intermediate sorting step.
Tenzing also implements a block-based shuffle mechanism that combines many
Search WWH ::




Custom Search