Data Modeling Approaches for Big Data and Analytics Solutions - Big Data Imperatives

The examples discussed above are complete map-reduce computations: we start from raw input data and produce a final output. Many map-reduce computations take a while to run, even on a cluster of nodes. As new data keeps coming in, you have to re-run them to stay up to date.

The map stage of a map-reduce computation is easy to handle incrementally: because map functions are isolated from each other, you re-run the mapper only on input data that has changed. The more complex case is the reduce step, since it pulls together the outputs of many maps, and any change in the map outputs necessitates a re-run of the reduce function. How much of that work can be avoided depends on how parallel the reduce step is. If the data is partitioned for reduction, then any partition whose input remains unchanged does not need the reduce function to be re-run on it.
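
The partition-level reuse described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular framework does it: the partition count, the content fingerprint used to detect changes, the in-memory cache, and all function names are hypothetical.

```python
import hashlib
import json
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Assign a key to a partition with a stable hash (unlike built-in hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def fingerprint(pairs):
    """Content hash of one partition's map output, used to detect changes."""
    payload = json.dumps(sorted(pairs)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def reduce_partition(pairs):
    """Example reduce function: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def incremental_reduce(map_output, cache):
    """Re-run the reduce only for partitions whose map output changed.

    cache maps partition id -> (fingerprint, reduced result) from the last run.
    Returns the merged result and the list of partitions that were recomputed.
    """
    partitions = defaultdict(list)
    for key, value in map_output:
        partitions[partition_of(key)].append((key, value))

    result, recomputed = {}, []
    for pid in range(NUM_PARTITIONS):
        pairs = partitions.get(pid, [])
        fp = fingerprint(pairs)
        if pid not in cache or cache[pid][0] != fp:
            # Input for this partition is new or changed: reduce it again.
            cache[pid] = (fp, reduce_partition(pairs))
            recomputed.append(pid)
        # Keys never span partitions, so merging cached results is safe.
        result.update(cache[pid][1])
    return result, recomputed
```

On the first call every partition is reduced. On later calls with the same `cache`, only the partitions that received new or changed map output are reduced again, and the rest are served from the cache.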

Basic Map-Reduce Patterns

Counting and Summing

Problem Statement: There are a number of documents, where each document is a set of terms. It is required to calculate the total number of occurrences of each term across all documents. Alternatively, the result can be an arbitrary function of the terms. For instance, given a log file in which each record contains a response time, it is required to calculate the average response time.

Applications:

Log Analysis, Data Querying
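
One possible way to express this pattern is sketched below in plain Python, with the framework's shuffle step simulated by a dictionary. The function names and the `response_time_ms` field are assumptions made for illustration. The mapper emits a 1 per term occurrence and the reducer sums them. For the average, the mapper emits (sum, count) parts under a single key so that the division happens only once, in the reducer.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def map_terms(document):
    """Mapper: emit (term, 1) for every term occurrence in one document."""
    for term in document.split():
        yield term, 1


def reduce_sum(term, counts):
    """Reducer: total occurrences of one term across all documents."""
    return term, sum(counts)


def count_terms(documents):
    pairs = (pair for doc in documents for pair in map_terms(doc))
    return dict(reduce_sum(t, c) for t, c in shuffle(pairs).items())


def map_response(record):
    """Mapper: emit a (sum, count) part under one shared key."""
    yield "avg", (record["response_time_ms"], 1)  # hypothetical field name


def reduce_average(key, parts):
    """Reducer: add up the parts, then divide once."""
    parts = list(parts)
    total = sum(t for t, _ in parts)
    count = sum(c for _, c in parts)
    return key, total / count


def average_response_time(records):
    pairs = (pair for rec in records for pair in map_response(rec))
    grouped = shuffle(pairs)
    return reduce_average("avg", grouped["avg"])[1]
```

For example, `count_terms(["a b a", "b c"])` yields a count of 2 for `a`, 2 for `b`, and 1 for `c`.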

Collating

Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building inverted indexes.

Solution: The solution is straightforward. The mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the key is a term (word) and the grouped items are the IDs of the documents where the term was found.

Applications:

Inverted Indexes, ETL
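
As a rough illustration of collating, the sketch below builds a small inverted index in plain Python. It keys on the term and collects document IDs, which is the usual shape of an inverted index. The whitespace tokenizer, the lower-casing, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Mapper: emit (term, doc_id) once per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reducer: collect the sorted list of documents containing the term."""
    return term, sorted(doc_ids)


def build_inverted_index(documents):
    """Run the map, group by term (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(t, ids) for t, ids in grouped.items())
```

Here `documents` is assumed to be a dictionary from document ID to text. The result maps each term to the IDs of the documents that contain it.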

Filtering (“Grepping”), Parsing, and Validation

Problem Statement: There is a set of records, and it is required to collect all records that meet some condition or to transform each record (independently of other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and conversion from one format to another.
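
Because each record is handled independently, one plausible way to approach this problem is a map-only job with no reduce step. The Python sketch below assumes a hypothetical log line format (`METHOD PATH STATUS`); the regular expression and function names are illustrative only.

```python
import re

# Hypothetical record format: "GET /index.html 200"
LOG_LINE = re.compile(r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$")


def map_filter(line, predicate):
    """Filtering mapper: emit the record unchanged if it meets the condition."""
    if predicate(line):
        yield line


def map_parse(line):
    """Parsing/validation mapper: emit a structured record, drop invalid lines."""
    match = LOG_LINE.match(line.strip())
    if match:
        yield {
            "method": match["method"],
            "path": match["path"],
            "status": int(match["status"]),
        }


def run_map_only(lines, mapper):
    """Apply the mapper to each record independently; there is no reducer."""
    return [out for line in lines for out in mapper(line)]
```

A filter is run by wrapping the condition, for example `run_map_only(lines, lambda l: map_filter(l, lambda s: s.endswith("500")))`, while `run_map_only(lines, map_parse)` converts every valid line into a dictionary and silently drops the rest.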
