Perform secondary sorting as necessary.
Manage overrides specified by users for grouping and partitioning.
Reporter—is used to report progress, set application-level status messages, update user-set counters, and indicate that long-running tasks or jobs are still alive.
Combiner—an optional performance booster that can be specified to perform local aggregation of the intermediate outputs, reducing the amount of data transferred from the Mapper to the Reducer.
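The effect of local aggregation can be sketched in plain Java (the class and method names below are illustrative, not Hadoop's actual Combiner API): a mapper's intermediate (word, 1) pairs are collapsed into (word, partialCount) pairs before the shuffle, so fewer records cross the network.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Collapse repeated keys locally, summing their values.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            local.merge(e.getKey(), e.getValue(), Integer::sum); // local aggregation
        }
        return new ArrayList<>(local.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            new SimpleEntry<>("the", 1), new SimpleEntry<>("cat", 1),
            new SimpleEntry<>("the", 1), new SimpleEntry<>("the", 1));
        // Four intermediate records shrink to two before the shuffle.
        System.out.println(combine(mapOutput));
    }
}
```

Because the combiner may run zero, one, or many times, it should only perform operations (such as summing) that are associative and commutative.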
Partitioner—controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, and default partitions are created by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
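The default hash-based rule can be sketched as follows. This mirrors the behavior of Hadoop's default `HashPartitioner`; masking with `Integer.MAX_VALUE` keeps the result non-negative even when `hashCode()` is negative.

```java
public class HashPartitionSketch {
    // Derive the partition (i.e., the target reduce task) from the key's hash.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // one partition per reduce task
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + partitionFor(key, reducers));
        }
    }
}
```

Because the same key always hashes to the same partition, all values for a given key arrive at a single reducer.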
Output collector—collects the output of Mappers and Reducers.
Job configuration—is the primary user interface to manage MapReduce jobs.
It is typically used to specify the Mapper, Combiner, Partitioner, Reducer, InputFormat,
OutputFormat, and OutputCommitter for every job.
It also indicates the set of input files and where the output files should be written.
Optionally used to specify other advanced options for the job such as the comparator to be
used, files to be put in the DistributedCache, and compression on intermediate and/or final job
outputs.
It also controls debugging via user-provided scripts, whether job tasks can be executed speculatively, the maximum number of attempts per task in case of failure, and the percentage of task failures the job as a whole can tolerate.
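A typical driver that wires these pieces together might look like the sketch below, using Hadoop's `org.apache.hadoop.mapreduce` API. `WordCountDriver`, `WordMapper`, and `WordReducer` are hypothetical user-defined classes, not part of Hadoop; running this requires a Hadoop installation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordMapper.class);     // the Mapper
        job.setCombinerClass(WordReducer.class);  // optional Combiner
        job.setReducerClass(WordReducer.class);   // the Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```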
Output committer—is used to manage the commit for jobs and tasks in MapReduce. Key tasks
executed are:
Set up the job during initialization; for example, create the job's intermediate directory.
Clean up after job completion; for example, remove the temporary output directory once the job finishes.
Set up any task temporary output.
Check whether a task needs a commit, avoiding the overhead of unnecessary commits.
Commit the task output on completion.
On failure, discard the task commit, clean up all intermediate results, release memory, and run any other user-specified cleanup.
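The commit idea behind these steps can be sketched in plain Java (names here are illustrative, not Hadoop's `OutputCommitter` API): a task writes to a temporary attempt directory, and "commit" is a rename into the final location, so failed or speculative attempts never pollute the job's output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitSketch {
    // Commit a task by moving its attempt directory into the final location.
    static boolean commitTask(Path attemptDir, Path committedDir) throws IOException {
        if (!Files.exists(attemptDir)) {
            return false; // nothing to commit
        }
        Files.move(attemptDir, committedDir); // a single rename publishes the output
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path jobDir = Files.createTempDirectory("job-output");
        Path attemptDir = jobDir.resolve("attempt_0"); // task setup
        Files.createDirectories(attemptDir);
        Files.writeString(attemptDir.resolve("part-00000"), "result");

        commitTask(attemptDir, jobDir.resolve("task_0"));
        // The output file is now visible only under the committed path.
        System.out.println(Files.exists(jobDir.resolve("task_0/part-00000")));
    }
}
```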
Job input:
Specify the input format for a Map/Reduce job.
Validate the input specification of the job.
Split the input file(s) into logical splits, each assigned to an individual Mapper.
Provide input records from the logical splits for processing by the Mapper.
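The splitting step can be sketched as byte-range bookkeeping (names are illustrative, not Hadoop's `InputFormat`/`InputSplit` API): a file of a given length is divided into ranges of at most `splitSize` bytes, and each range is handed to one Mapper. Only offsets are computed; the file itself is never copied.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // A logical split: a half-open byte range [offset, offset + length).
    static final class Split {
        final long offset, length;
        Split(long offset, long length) { this.offset = offset; this.length = length; }
        @Override public String toString() { return "[" + offset + "," + (offset + length) + ")"; }
    }

    static List<Split> splits(long fileLength, long splitSize) {
        List<Split> result = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            result.add(new Split(off, Math.min(splitSize, fileLength - off)));
        }
        return result;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte split size yields three logical splits.
        System.out.println(splits(250, 100));
    }
}
```

A real record reader must also handle records that straddle a split boundary, which is why splits are called logical rather than physical.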
Memory management, JVM reuse, and compression are managed with the job configuration set of
classes.
MapReduce program design
MapReduce programming is based on functional programming, where the dataflow from one module to another is not driven by explicit control flow but behaves like a directed acyclic graph (DAG): when values change in the currently executing step, its successors recompute their values.
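This dataflow style can be seen in miniature with Java streams, where each stage feeds the next without explicit control flow, much as map output feeds the shuffle and reduce stages in a MapReduce pipeline. The example below is a conceptual analogy, not Hadoop code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DataflowSketch {
    // "Map" each word to itself, then group and count: a one-machine
    // analogue of the map -> shuffle -> reduce dataflow.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.split(" "))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCounts("the cat sat on the mat"));
    }
}
```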