FIGURE 2.19
BSP programming model. [Figure labels: processors, local computation, superstep,
communication, barrier synchronization.]
model is well suited for distributed implementations as it does not expose any
mechanism for detecting the order of execution within a superstep, and all
communication is from superstep S to superstep S + 1. The ideas of Pregel have
been cloned by many open-source projects such as GoldenOrb,* Apache Hama, and
Apache Giraph. Both Hama and Giraph are implemented to be launched as a typical
Hadoop job that can leverage the Hadoop infrastructure. Other large-scale graph
processing systems that neither follow the MapReduce model nor leverage the
Hadoop infrastructure include GRACE [130], GraphLab [96,97], and
Signal/Collect [122].
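The superstep discipline described above can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions, not the API of Pregel, Hama, or Giraph: the function name bsp_max_propagation, the dictionary-based graph, and the choice of "propagate the maximum value" as the vertex program are all hypothetical and chosen for brevity. The point it shows is that messages produced in superstep S are only visible in superstep S + 1, so the order in which vertices are processed inside a superstep does not affect the result.

```python
from collections import defaultdict


def bsp_max_propagation(graph, values):
    """Spread the largest vertex value through a graph, one superstep at a time.

    graph:  dict mapping each vertex to a list of its neighbours (hypothetical format)
    values: dict mapping each vertex to its initial numeric value
    Returns (final_values, number_of_supersteps_executed).
    """
    values = dict(values)  # do not mutate the caller's dict

    # Superstep 0: every vertex sends its own value to all neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Keep running supersteps while any message is in flight.
    while inbox:
        outbox = defaultdict(list)  # messages to be delivered in superstep S + 1
        # Local computation: each vertex sees only its own state and its inbox,
        # so the iteration order inside this loop is irrelevant.
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        # Barrier synchronization: only now do the new messages become visible.
        inbox = outbox
        supersteps += 1

    return values, supersteps


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    final, steps = bsp_max_propagation(chain, {"a": 3, "b": 1, "c": 2})
    print(final, steps)
```

In a real distributed system the inner loop would run in parallel across workers and the swap of inbox and outbox would be a network-wide barrier; the simulation keeps only the message-visibility rule.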
The Dedoop system (Deduplication with Hadoop) [82,83] has been presented as
an entity resolution framework based on MapReduce. It supports the definition of
complex entity resolution workflows that can include different matching steps
and/or apply machine learning mechanisms for the automatic generation of match
classifiers. The defined workflows are then automatically translated into
MapReduce jobs for parallel execution on Hadoop clusters. MapDupReducer [129] is
another MapReduce-based system, developed to address the problem of
near-duplicate detection over massive data sets using
the PPJoin (Positional and Prefix filtering) algorithm [132].
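To give a feel for the prefix-filtering idea that PPJoin builds on, the following is a minimal single-machine sketch. It is not PPJoin itself: it omits positional filtering. It is also not the MapDupReducer implementation, whose internals are not described here. The function names, the use of Jaccard similarity, and the frequency-based token ordering are assumptions made for illustration. The sketch relies on the standard observation that two sets with Jaccard similarity of at least t must share a token within their first |x| - ceil(t * |x|) + 1 tokens under a common global ordering.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return (i, j, similarity) for all pairs i < j with Jaccard >= threshold.

    records:   list of sets of string tokens (hypothetical input format)
    threshold: Jaccard threshold in (0, 1]
    """
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = defaultdict(int)
    for rec in records:
        for tok in rec:
            freq[tok] += 1

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for i, rec in enumerate(records):
        tokens = sorted(rec, key=lambda t: (freq[t], t))
        # Small epsilon guards against float error inflating the ceiling,
        # which would shorten the prefix and could miss valid pairs.
        min_overlap = math.ceil(threshold * len(tokens) - 1e-9)
        prefix_len = len(tokens) - min_overlap + 1

        # Candidate generation: only records sharing a prefix token are compared.
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)

        # Verification: compute the exact similarity for each candidate.
        for j in sorted(candidates):
            sim = jaccard(records[j], rec)
            if sim >= threshold:
                results.append((j, i, sim))
    return results


if __name__ == "__main__":
    docs = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y", "z"}]
    print(prefix_filter_join(docs, 0.5))
```

In a MapReduce setting, one plausible arrangement is to emit each record once per prefix token in the map phase and verify candidate pairs in the reduce phase; that mapping is an assumption here rather than a description of any particular system.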
An approach to efficiently perform set-similarity joins in parallel using the
MapReduce framework has been proposed by Vernica et al. [128]. In particular, they
propose a three-stage approach for end-to-end set-similarity joins. The approach
takes as input a set of records and outputs a set of joined records based on a set-
similarity condition. It partitions the data across nodes to balance the workload
and minimize the need for replication. J. Lin [92] has presented three MapReduce
algorithms for computing pairwise similarity on document collections. The first
* http://goldenorbos.org/.
http://hama.apache.org/.
http://giraph.apache.org/.