Databases Reference
In-Depth Information
Solution: The solution is straightforward: the mapper takes records one by one
and emits the accepted items or their transformed versions.
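As a minimal sketch of such a map-only job, the following plain Python simulation filters and transforms records without any reducer. The comma-separated record layout, the `parse_record` helper, and the "status >= 500" acceptance rule are assumptions made for illustration only; they do not come from the text above.

```python
from typing import Dict, Iterable, Iterator, Optional


def parse_record(line: str) -> Optional[Dict[str, object]]:
    """Hypothetical parser: return a dict for a well-formed line, else None."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    user, status, size = parts
    if not status.isdigit() or not size.isdigit():
        return None
    return {"user": user, "status": int(status), "bytes": int(size)}


def filter_mapper(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Take records one by one and emit only the accepted, transformed ones."""
    for line in lines:
        record = parse_record(line)
        # Assumed acceptance rule: keep only server-error records.
        if record is not None and record["status"] >= 500:
            yield record


def run_example():
    lines = ["alice,200,512", "bob,503,128", "garbage line", "carol,500,64"]
    return list(filter_mapper(lines))
```

Malformed lines are silently dropped here; a validation job could instead emit them under a separate key so that they can be inspected later.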
Applications:
Log Analysis, Data Querying, ETL, Data Validation
Distributed Task Execution
Problem Statement: There is a large computational problem that can be divided into
multiple parts, and the results from all parts can be combined to obtain a final result.
Solution: The problem description is split into a set of specifications, which are
stored as input data for the mappers. Each mapper takes a specification, performs the
corresponding computations, and emits the results. The reducer combines all emitted parts
into the final result.
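The scheme above can be sketched with a small in-memory simulation. The example problem (numerical integration by the midpoint rule), the `Spec` layout, and all function names are assumptions chosen for illustration; a real job would distribute the specifications across mapper processes instead of looping over them.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical specification: (interval start, interval end, number of steps).
Spec = Tuple[float, float, int]


def make_specs(a: float, b: float, parts: int, steps_per_part: int) -> List[Spec]:
    """Split the problem description into independent specifications."""
    width = (b - a) / parts
    return [(a + i * width, a + (i + 1) * width, steps_per_part)
            for i in range(parts)]


def task_mapper(spec: Spec, f: Callable[[float], float]) -> Iterator[Tuple[str, float]]:
    """Perform the computation described by one specification and emit its result."""
    start, end, steps = spec
    h = (end - start) / steps
    # Midpoint rule on this sub-interval.
    partial = sum(f(start + (k + 0.5) * h) for k in range(steps)) * h
    yield ("integral", partial)


def combine_reducer(key: str, values: Iterable[float]) -> Tuple[str, float]:
    """Combine all emitted parts into the final result."""
    return (key, sum(values))


def run_job(f: Callable[[float], float], a: float, b: float,
            parts: int = 4, steps_per_part: int = 1000) -> float:
    specs = make_specs(a, b, parts, steps_per_part)
    emitted = [pair for spec in specs for pair in task_mapper(spec, f)]
    values = [value for _, value in emitted]
    return combine_reducer("integral", values)[1]
```

For instance, `run_job(lambda x: x * x, 0.0, 3.0)` approximates the integral of x² over [0, 3], whose exact value is 9.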
Applications:
Physical and Engineering Simulations, Numerical Analysis,
Performance Testing
Sorting
Problem Statement: There is a set of records, and it is required to sort these records by
some rule or process these records in a certain order.
Solution: Simple sorting is straightforward: mappers emit all items as values associated
with sorting keys that are computed as a function of the items. In practice, however,
sorting is often used in less obvious ways, which is why it is said to be the heart of
map-reduce (and Hadoop). In particular, it is very common to use composite keys to achieve
secondary sorting and grouping. Sorting in map-reduce was originally intended for ordering
the emitted key-value pairs by key, but there are techniques that leverage Hadoop
implementation specifics to achieve sorting by values.
It is worth noting that if map-reduce is used for sorting the original (not
intermediate) data, it is often a good idea to continuously maintain the data in a sorted
state using BigTable concepts. In other words, it can be more efficient to sort the data
once during insertion than to sort it for each map-reduce query.
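The composite-key technique for secondary sorting mentioned above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `(user, timestamp, action)` record layout and the function names are hypothetical, and the `shuffle_and_group` function stands in for the framework's shuffle phase. In Hadoop itself the same effect is typically obtained with a custom partitioner and grouping comparator rather than with code like this.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int, str]            # hypothetical: (user, timestamp, action)
Pair = Tuple[Tuple[str, int], str]       # ((natural key, secondary key), value)


def composite_key_mapper(record: Record) -> Iterator[Pair]:
    """Emit the value under a composite key: natural key plus sort field."""
    user, timestamp, action = record
    yield ((user, timestamp), action)


def shuffle_and_group(pairs: Iterable[Pair]) -> Iterator[Tuple[str, List[str]]]:
    """Simulated shuffle: sort by the full composite key, group by the natural key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for natural_key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        # Values arrive already ordered by the secondary key (timestamp).
        yield natural_key, [value for _, value in group]


def run_example():
    records = [
        ("bob", 3, "logout"),
        ("alice", 2, "click"),
        ("bob", 1, "login"),
        ("alice", 1, "login"),
    ]
    pairs = [pair for record in records for pair in composite_key_mapper(record)]
    return list(shuffle_and_group(pairs))
```

Because the ordering is done on the key, the reducer for each user receives its values in timestamp order without having to buffer and sort them itself.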
Applications:
ETL, Data Analysis
Advanced Map-Reduce Patterns
Iterative Message Passing (Graph Processing)
Problem Statement: There is a network of entities and relationships between them. It is
required to calculate the state of each entity on the basis of the properties of the other
entities in its neighborhood. This state can represent a distance to other nodes, an
indication that there is a neighbor with certain properties, a characteristic of the
neighborhood density, and so on.
Solution: A network is stored as a set of nodes, and each node contains a list of
adjacent node IDs. Conceptually, map-reduce jobs are performed iteratively, and
at each iteration each node sends messages to its neighbors. Each neighbor updates its
 