Perform secondary sorting as necessary.
Manage overrides specified by users for grouping and partitioning.
Reporter—is used to report progress, set application-level status messages, update user-set counters, and indicate that long-running tasks or jobs are still alive.
Combiner—an optional performance booster that can be specified to perform local aggregation of the intermediate outputs, reducing the amount of data transferred from the Mapper to the Reducer.
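The effect of local aggregation can be sketched in plain Java (the class and method names below are illustrative, not Hadoop's actual Combiner API): a mapper's intermediate (word, 1) pairs are collapsed into (word, partialCount) pairs before the shuffle, so fewer records cross the network.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Collapse repeated keys locally, summing their values.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            local.merge(e.getKey(), e.getValue(), Integer::sum); // local aggregation
        }
        return new ArrayList<>(local.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            new SimpleEntry<>("the", 1), new SimpleEntry<>("cat", 1),
            new SimpleEntry<>("the", 1), new SimpleEntry<>("the", 1));
        // Four intermediate records shrink to two before the shuffle.
        System.out.println(combine(mapOutput));
    }
}
```

Because the combiner may run zero, one, or many times, it should only perform operations (such as summing) that are associative and commutative.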
Partitioner—controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, and default partitions are created by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
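The default hash-based rule can be sketched as follows. This mirrors the behavior of Hadoop's default `HashPartitioner`; masking with `Integer.MAX_VALUE` keeps the result non-negative even when `hashCode()` is negative.

```java
public class HashPartitionSketch {
    // Derive the partition (i.e., the target reduce task) from the key's hash.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // one partition per reduce task
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + partitionFor(key, reducers));
        }
    }
}
```

Because the same key always hashes to the same partition, all values for a given key arrive at a single reducer.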
Output collector—collects the output of Mappers and Reducers.
Job configuration—is the primary user interface to manage MapReduce jobs.
It is typically used to specify the Mapper, Combiner, Partitioner, Reducer, InputFormat,
OutputFormat, and OutputCommitter for every job.
It also indicates the set of input files and where the output files should be written.
Optionally used to specify other advanced options for the job such as the comparator to be
used, files to be put in the DistributedCache, and compression on intermediate and/or final job
outputs.
It also controls debugging via user-provided scripts, whether job tasks can be executed speculatively, the maximum number of attempts per task in case of failure, and the percentage of task failures the job as a whole can tolerate.
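A typical driver that wires these pieces together might look like the sketch below, using Hadoop's `org.apache.hadoop.mapreduce` API. `WordCountDriver`, `WordMapper`, and `WordReducer` are hypothetical user-defined classes, not part of Hadoop; running this requires a Hadoop installation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordMapper.class);     // the Mapper
        job.setCombinerClass(WordReducer.class);  // optional Combiner
        job.setReducerClass(WordReducer.class);   // the Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```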
Output committer—is used to manage the commit for jobs and tasks in MapReduce. Key tasks
executed are:
Set up the job during initialization; for example, create the job's intermediate directory.
Clean up after job completion; for example, remove the temporary output directory once the job finishes.
Set up any task temporary output.
Check whether a task needs a commit, avoiding the overhead of unnecessary commits.
Commit the task output on completion.
On failure, discard the task commit, clean up all intermediate results, release memory, and run any other user-specified cleanup.
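The commit idea behind these steps can be sketched in plain Java (names here are illustrative, not Hadoop's `OutputCommitter` API): a task writes to a temporary attempt directory, and "commit" is a rename into the final location, so failed or speculative attempts never pollute the job's output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitSketch {
    // Commit a task by moving its attempt directory into the final location.
    static boolean commitTask(Path attemptDir, Path committedDir) throws IOException {
        if (!Files.exists(attemptDir)) {
            return false; // nothing to commit
        }
        Files.move(attemptDir, committedDir); // a single rename publishes the output
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path jobDir = Files.createTempDirectory("job-output");
        Path attemptDir = jobDir.resolve("attempt_0"); // task setup
        Files.createDirectories(attemptDir);
        Files.writeString(attemptDir.resolve("part-00000"), "result");

        commitTask(attemptDir, jobDir.resolve("task_0"));
        // The output file is now visible only under the committed path.
        System.out.println(Files.exists(jobDir.resolve("task_0/part-00000")));
    }
}
```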
Job input:
Specify the input format for a Map/Reduce job.
Validate the input specification of the job.
Split the input file(s) into logical splits, each assigned to an individual Mapper.
Provide input records from the logical splits for processing by the Mapper.
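The splitting step can be sketched as byte-range bookkeeping (names are illustrative, not Hadoop's `InputFormat`/`InputSplit` API): a file of a given length is divided into ranges of at most `splitSize` bytes, and each range is handed to one Mapper. Only offsets are computed; the file itself is never copied.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // A logical split: a half-open byte range [offset, offset + length).
    static final class Split {
        final long offset, length;
        Split(long offset, long length) { this.offset = offset; this.length = length; }
        @Override public String toString() { return "[" + offset + "," + (offset + length) + ")"; }
    }

    static List<Split> splits(long fileLength, long splitSize) {
        List<Split> result = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            result.add(new Split(off, Math.min(splitSize, fileLength - off)));
        }
        return result;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte split size yields three logical splits.
        System.out.println(splits(250, 100));
    }
}
```

A real record reader must also handle records that straddle a split boundary, which is why splits are called logical rather than physical.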
Memory management, JVM reuse, and compression are managed with the job configuration set of
classes.
MapReduce program design
MapReduce programming is based on functional programming, where the dataflow from one module to another is not driven by explicit control flow but behaves like a directed acyclic graph (DAG): when values change in the currently executing step, its successors recompute their values.
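This dataflow style can be seen in miniature with Java streams, where each stage feeds the next without explicit control flow, much as map output feeds the shuffle and reduce stages in a MapReduce pipeline. The example below is a conceptual analogy, not Hadoop code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DataflowSketch {
    // "Map" each word to itself, then group and count: a one-machine
    // analogue of the map -> shuffle -> reduce dataflow.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.split(" "))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCounts("the cat sat on the mat"));
    }
}
```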