Databases Reference
In-Depth Information
The examples we have discussed above are complete map-reduce computations,
where we start from raw input data and create a final output. Many map-reduce computations
take a while to run, even on a cluster of nodes. As new data keeps coming in, you will
have to re-run the map-reduce computations to stay up to date.
The map stage of a map-reduce computation is easy to handle incrementally: you re-run
your mapper function only on input data that has changed. Since map functions are isolated
from each other, handling incremental updates is straightforward. The more complex
case is the reduce step, since it pulls together the outputs from many maps, and any
change in the map outputs necessitates a re-run of the reduce function. How much work this
costs depends on how parallel the reduce step is. If we are partitioning data
for reduction, then any partition whose map outputs remain unchanged does not need the
reduce function to be re-run on it.
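The partition-level reuse described above can be sketched as follows. This is a minimal
illustration in Python, not the API of any real framework: the class IncrementalWordCount,
the helper partition_of, and the choice of word counting as the job are all assumptions
made for the example, and document deletion is deliberately left out.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical stable partitioner. Python's built-in hash() is salted per
    # process for strings, so a simple byte sum is used instead.
    return sum(key.encode("utf-8")) % num_partitions


class IncrementalWordCount:
    """Word count that re-runs map and reduce only where inputs changed."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self.map_cache = {}     # doc_id -> (text, list of (term, 1) pairs)
        self.reduce_cache = {}  # partition number -> {term: count}
        self.reduce_runs = 0    # counts reduce invocations, for illustration

    def _map(self, text):
        return [(term, 1) for term in text.split()]

    def update(self, docs):
        """Apply new or changed documents; return the partitions re-reduced."""
        dirty = set()
        for doc_id, text in docs.items():
            cached = self.map_cache.get(doc_id)
            if cached is not None and cached[0] == text:
                continue  # input unchanged: the mapper is not re-run
            old_pairs = cached[1] if cached else []
            new_pairs = self._map(text)
            self.map_cache[doc_id] = (text, new_pairs)
            # Both the old and the new keys touch partitions that must be redone.
            for term, _ in old_pairs + new_pairs:
                dirty.add(partition_of(term, self.num_partitions))
        for partition in dirty:
            self._reduce_partition(partition)
        return dirty

    def _reduce_partition(self, partition):
        self.reduce_runs += 1
        totals = defaultdict(int)
        for _, pairs in self.map_cache.values():
            for term, count in pairs:
                if partition_of(term, self.num_partitions) == partition:
                    totals[term] += count
        self.reduce_cache[partition] = dict(totals)

    def counts(self):
        merged = {}
        for part in self.reduce_cache.values():
            merged.update(part)
        return merged
```

Calling update a second time with identical documents returns an empty set and leaves
reduce_runs unchanged; changing one document re-reduces only the partitions that hold
its old or new terms.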
Basic Map-Reduce Patterns
Counting and Summing
Problem Statement: There are a number of documents, where each document is a set
of terms. It is required to calculate the total number of occurrences of each term across all
documents. Alternatively, the result can be an arbitrary function of the terms. For instance,
there is a log file where each record contains a response time, and it is required to
calculate the average response time.
Applications:
Log Analysis, Data Querying
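One possible way to express the counting and averaging problems above is shown below.
It is a single-process sketch in Python under stated assumptions: run_map_reduce is a
hypothetical stand-in for a real framework's shuffle phase, and the mapper and reducer
names, as well as the constant key "all", are invented for the example.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    # Hypothetical in-memory driver: group mapper output by key, then reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(document):
    # A document is an iterable of terms; emit (term, 1) for each occurrence.
    for term in document:
        yield term, 1


def sum_reducer(term, counts):
    return sum(counts)


def response_time_mapper(response_time):
    # Emit a (sum, count) pair under one constant key so that partial
    # results could be combined before the final division.
    yield "all", (response_time, 1)


def mean_reducer(key, pairs):
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return total / count
```

For example, run_map_reduce([["a", "b", "a"], ["b", "c"]], count_mapper, sum_reducer)
yields a count per term, and run_map_reduce([120, 80, 100], response_time_mapper,
mean_reducer) yields the mean under the key "all".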
Collating
Problem Statement: There is a set of items and some function of each item. It is required
to save all items that have the same function value into one file, or to perform some other
computation that requires all such items to be processed as a group. The most typical
example is the building of inverted indexes.
Solution: The solution is straightforward. The mapper computes the given function for
each item and emits the value of the function as the key and the item itself as the value.
The reducer obtains all items grouped by function value and processes or saves them. In the
case of inverted indexes, the items are (term, document ID) occurrences and the function
returns the term, so the reducer receives every document ID in which a given term was
found.
Applications:
Inverted Indexes, ETL
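The collating pattern can be sketched in Python as below. The example keys each
(document ID, term) occurrence by its term, so that the reduce step sees all document IDs
for one term. The names collate and build_inverted_index, and the whitespace-based
tokenization, are assumptions made for the illustration.

```python
from collections import defaultdict


def collate(items, key_fn):
    # Map + shuffle: emit (key_fn(item), item) and group items by key.
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return groups


def build_inverted_index(documents):
    # documents: {doc_id: text}. Each item is one (doc_id, term) occurrence.
    occurrences = [
        (doc_id, term)
        for doc_id, text in documents.items()
        for term in text.lower().split()
    ]
    grouped = collate(occurrences, key_fn=lambda occurrence: occurrence[1])
    # Reduce: for each term, keep the sorted set of documents containing it.
    return {
        term: sorted({doc_id for doc_id, _ in group})
        for term, group in grouped.items()
    }
```

The same collate helper would serve other grouping tasks, such as routing records to
per-key output files in an ETL job, by swapping in a different key function.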
Filtering (“Grepping”), Parsing, and Validation
Problem Statement: There is a set of records, and it is required to collect all records
that meet some condition or to transform each record (independently of other records)
into another representation. The latter case includes such tasks as text parsing, value
extraction, and conversion from one format to another.
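Because each record is handled independently, this kind of task can plausibly be written
as a map-only job with no reduce step. The Python sketch below illustrates that idea; the
log line format, the LOG_LINE pattern, and the function names are hypothetical and chosen
only for the example.

```python
import re

# Assumed record format for the example: "<LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<level>[A-Z]+) (?P<message>.*)$")


def grep_mapper(line, needle):
    # Filtering: emit the record unchanged if it meets the condition.
    if needle in line:
        yield line


def parse_mapper(line):
    # Parsing and validation: emit a structured record, drop malformed lines.
    match = LOG_LINE.match(line)
    if match:
        yield match.groupdict()


def run_map_only(records, mapper, *args):
    # Hypothetical driver: apply the mapper to every record independently.
    output = []
    for record in records:
        output.extend(mapper(record, *args))
    return output
```

With the lines "ERROR disk full", "INFO started" and "garbage line" as input,
grep_mapper with the needle "ERROR" keeps only the first line, while parse_mapper
turns the first two into dictionaries and discards the third.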
 