2. Using dynamic instrumentation to collect run-time monitoring information from
unmodified MapReduce programs. The dynamic nature means that monitoring
can be turned on or off on demand.
The What-if Engine's accuracy comes from its use of a mix of simulation
and model-based estimation at the phase level of the MapReduce job execution
[147, 149, 150]. For a given MapReduce program, the role of the cost-based
optimizer component is to enumerate and search efficiently through the high
dimensional space of configuration parameter settings, making appropriate calls to
the What-if Engine. To find a good configuration setting efficiently, the optimizer
clusters parameters into lower-dimensional subspaces such that the globally optimal
parameter setting in the high-dimensional space can be generated by composing
the optimal settings found for the subspaces. Stubby [176] has been presented as a
cost-based optimizer for MapReduce workflows that searches through the subspace
of the full plan space that can be enumerated correctly and costed based on the
information available in any given setting. Stubby enumerates the plan space based
on plan-to-plan transformations and an efficient search algorithm.
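The subspace idea described above can be sketched in a few lines. The cost function, cluster names, and parameter domains below are toy assumptions (not the real What-if Engine); the sketch only shows why the decomposition helps: each small subspace is enumerated independently, and the per-cluster optima are composed into a global setting, instead of enumerating the full cross-product of all parameters.

```python
from itertools import product

# Toy per-cluster "what-if" cost tables (hypothetical numbers, for illustration).
COSTS = {
    "memory": {("512m",): 3.0, ("1g",): 1.5, ("2g",): 2.0},
    "io":     {(4, True): 2.5, (4, False): 4.0, (16, True): 1.0, (16, False): 3.5},
}

def cluster_cost(cluster_name, setting):
    """Estimated cost of one setting within one parameter cluster."""
    return COSTS[cluster_name][setting]

def optimize_subspace(cluster_name, domains):
    """Enumerate only this cluster's (small) subspace and keep the best setting."""
    return min(product(*domains), key=lambda s: cluster_cost(cluster_name, s))

# Parameters clustered into two independent subspaces (illustrative names):
clusters = {
    "memory": [["512m", "1g", "2g"]],    # e.g. task heap size
    "io":     [[4, 16], [True, False]],  # e.g. sort buffer MB, compression on/off
}

# Optimize each subspace separately, then compose the global setting.
best = {name: optimize_subspace(name, doms) for name, doms in clusters.items()}
# → {'memory': ('1g',), 'io': (16, True)}
```

Here 3 + 4 = 7 what-if calls replace the 3 × 4 = 12 needed for the full cross-product; the gap widens rapidly as more parameter clusters are added, which is the motivation for clustering in the first place.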
The Manimal system [92, 153] is designed as a static analysis-style mechanism
for detecting opportunities to apply relational-style optimizations in MapReduce
programs. Like most programming-language optimizers, it is a best-effort system:
it does not guarantee that it will find every possible optimization, and it
indicates an optimization only when it is entirely safe to do so. In particular, the
analyzer component of the system examines the MapReduce
program and sends the resulting optimization descriptor to the optimizer component.
In addition, the analyzer also emits an index generation program that can yield
a B+Tree of the input file. The optimizer uses the optimization descriptor, plus a
catalog of pre-computed indexes, to choose an optimized execution plan, called
an execution descriptor. This descriptor, plus a potentially-modified copy of the
user's original program, is then sent for execution on the Hadoop cluster. These
steps are performed transparently to the user: the submitted program does
not need to be modified by the programmer in any way. In particular, the main
task of the analyzer is to produce a set of optimization descriptors which enable
the system to carry out a phase roughly akin to logical rewriting of query plans in
a relational database. The descriptors characterize a set of potential modifications
that remain logically identical to the original plan. The catalog is a simple mapping
from a filename to zero or more (X, O) pairs, where X is an index file and O is
an optimization descriptor. The optimizer examines the catalog to see if there is
any entry for the input file. If not, it simply indicates that Manimal should run
the unchanged user program without any optimization. If there is at least one entry
for the input file, and a catalog-associated optimization descriptor is compatible
with the analyzer's output, then the optimizer can choose an execution plan that takes
advantage of the associated index file.
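The catalog lookup just described can be sketched as follows. The filenames, descriptor contents, and the equality-based compatibility check are all hypothetical simplifications of Manimal's actual logic; the sketch only mirrors the decision rule from the text: use an indexed plan when a catalog entry's descriptor is compatible with the analyzer's output, otherwise run the unchanged user program.

```python
# Hypothetical catalog: filename -> list of (index_file, optimization_descriptor)
# pairs, as in the (X, O) mapping described above. Entries are illustrative.
catalog = {
    "logs.txt": [("logs.btree", {"type": "selection", "field": "ts"})],
}

def choose_plan(input_file, analyzer_descriptor):
    """Return an execution descriptor: an indexed plan if a compatible catalog
    entry exists for the input file, else the original unoptimized plan."""
    for index_file, opt_descriptor in catalog.get(input_file, []):
        # Simplistic compatibility check (real compatibility is richer).
        if opt_descriptor == analyzer_descriptor:
            return {"plan": "indexed-scan", "index": index_file}
    return {"plan": "original"}  # no entry or no match: run unchanged program

print(choose_plan("logs.txt", {"type": "selection", "field": "ts"}))
# → {'plan': 'indexed-scan', 'index': 'logs.btree'}
print(choose_plan("other.txt", {"type": "selection", "field": "ts"}))
# → {'plan': 'original'}
```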
A key feature of MapReduce is that it automatically handles failures, hiding the
complexity of fault-tolerance from the programmer. In particular, if a node crashes,
MapReduce automatically restarts the execution of its tasks. In addition, if a node