key-value store) [31,32], GFS (Google File System) [58], text, and protocol buffers. In
particular, the Tenzing system has four major components:
The distributed worker pool: the execution system that takes a
query execution plan and executes the MapReduce jobs. The pool consists
of master and worker nodes plus an overall gatekeeper called the master
watcher. The workers manipulate the data for all the tables defined in the
metadata layer.
The query server: serves as the gateway between the client and the pool.
The query server parses the query, applies different optimization
mechanisms, and sends the plan to the master for execution. In principle, the
Tenzing optimizer applies some basic rule-based and cost-based optimizations
to create an optimal execution plan.
Client interfaces: Tenzing has several client interfaces, including a
command-line client (CLI) and a Web UI. The CLI is the more powerful interface
and supports complex scripting, while the Web UI offers easier-to-use
features such as query and table browser tools. There is also an API to
directly execute queries on the pool and a standalone binary, which does not
need any server-side components but instead launches its own MapReduce
jobs.
The metadata server: provides an API to store and fetch metadata such as
table names and schemas and pointers to the underlying data.
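The store-and-fetch role of the metadata server described above can be sketched roughly as follows. This is only an illustrative in-memory stand-in, not Tenzing's actual API: the names TableMetadata, MetadataStore, store, fetch, and table_names are hypothetical, as is the way the schema and data pointers are represented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical metadata record: table name, schema, pointers to data."""
    name: str
    schema: dict[str, str]           # column name -> type name (assumed layout)
    data_locations: tuple[str, ...]  # pointers to the underlying data


class MetadataStore:
    """In-memory stand-in for a metadata service with store and fetch calls."""

    def __init__(self) -> None:
        self._tables: dict[str, TableMetadata] = {}

    def store(self, table: TableMetadata) -> None:
        # Register a table, replacing any earlier entry with the same name.
        self._tables[table.name] = table

    def fetch(self, name: str) -> TableMetadata:
        # Return the record for a table, or fail loudly if it is unknown.
        try:
            return self._tables[name]
        except KeyError:
            raise LookupError(f"unknown table: {name}") from None

    def table_names(self) -> list[str]:
        # List the registered table names in sorted order.
        return sorted(self._tables)
```

In this sketch, a query server would call fetch for each table referenced in a parsed query before handing the query to the optimizer.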
A typical Tenzing query is submitted to the query server (through the Web UI,
CLI, or API), which is responsible for parsing the query into an intermediate parse
tree and fetching the required metadata from the metadata server. The query
optimizer goes through the intermediate format, applies various optimizations, and
generates a query execution plan that consists of one or more MapReduce jobs. For each
MapReduce job, the query server finds an available master using the master watcher
and submits the query to it. At this stage, the execution is physically partitioned into
multiple units of work, and idle workers poll the masters for available work. The
query server monitors the generated intermediate results, gathers them as they arrive,
and streams the output back to the client. To increase throughput, decrease latency,
and execute SQL operators more efficiently, Tenzing has enhanced the MapReduce
implementation with the following main changes:
Streaming and in-memory chaining: The implementation of Tenzing does
not serialize the intermediate results of MapReduce jobs to GFS. Instead,
it streams the intermediate results between the map and reduce tasks using
the network and uses GFS only for backup purposes. In addition, it uses
a memory chaining mechanism where the reducer and the mapper of the
same intermediate results are colocated in the same process.
Sort avoidance: Certain operators such as hash join and hash aggregation
require shuffling but not sorting. The MapReduce API was enhanced to
automatically turn off sorting for these operations, when possible, so that
the mapper feeds data to the reducer, which automatically bypasses the