algorithm is based on brute force, the second algorithm treats the problem as a large-scale ad hoc retrieval task, and the third algorithm is based on the Cartesian product of postings lists.
V-SMART-Join [101] is a MapReduce-based framework for discovering all pairs of similar entities that is applicable to sets, multisets, and vectors. It presents a family of two-stage algorithms in which the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. Afrati et al. [5] have provided a theoretical analysis of various MapReduce-based similarity join algorithms in terms of parameters including map and reduce costs, number of reducers, and communication cost.
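The two-stage pattern described above (join partial results to generate candidates, then verify each candidate exactly) can be illustrated with a minimal in-process sketch. This is not the actual V-SMART-Join implementation; the function names and the use of Jaccard similarity over an inverted index are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative sketch of a two-stage similarity join, NOT V-SMART-Join itself.

def stage1_map(record):
    """Emit (token, record_id) pairs so records sharing a token meet."""
    rid, tokens = record
    for tok in tokens:
        yield tok, rid

def stage1_reduce(rids):
    """Join partial results: every pair sharing this token is a candidate."""
    for a, b in combinations(sorted(set(rids)), 2):
        yield (a, b)

def stage2_verify(pair, data, threshold):
    """Compute the exact Jaccard similarity for one candidate pair."""
    sa, sb = data[pair[0]], data[pair[1]]
    sim = len(sa & sb) / len(sa | sb)
    return sim if sim >= threshold else None

def similarity_join(records, threshold=0.5):
    data = {rid: set(toks) for rid, toks in records}
    # Stage 1: candidate generation via an inverted index.
    index = defaultdict(list)
    for rec in records:
        for tok, rid in stage1_map(rec):
            index[tok].append(rid)
    candidates = set()
    for rids in index.values():
        candidates.update(stage1_reduce(rids))
    # Stage 2: exact similarity for all candidate pairs.
    return {p: s for p in candidates
            if (s := stage2_verify(p, data, threshold)) is not None}

pairs = similarity_join([("r1", ["a", "b", "c"]),
                         ("r2", ["a", "b", "d"]),
                         ("r3", ["x", "y"])], threshold=0.4)
# pairs == {("r1", "r2"): 0.5}
```

In a real MapReduce deployment, each stage would be a separate job and the inverted index would be materialized by the shuffle phase rather than built in memory.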
The DisCo (Distributed Co-clustering) framework [111] has been introduced as an approach for distributed data preprocessing and co-clustering, taking the raw data through to the end clusters using the MapReduce framework. Cordeiro et al. [41] have presented an approach for finding subspace clusters in very large moderate-to-high dimensional data, typically with more than five axes. Ene et al. [52] described the design and MapReduce-based implementations of k-median and k-center clustering algorithms.
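To make the k-center objective concrete, the following is a sequential sketch of the classic greedy farthest-first heuristic (a 2-approximation for k-center). This illustrates the objective that such MapReduce implementations parallelize; it is not the distributed algorithm of Ene et al.

```python
# Sequential farthest-first sketch for k-center (illustrative only).

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_center(points, k):
    """Greedily pick k centers so the max point-to-center distance stays small."""
    centers = [points[0]]               # arbitrary first center
    while len(centers) < k:
        # Add the point farthest from its nearest current center.
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers

centers = k_center([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
# centers == [(0, 0), (10, 11)]
```

A MapReduce variant typically replaces the global farthest-point search with per-partition candidate selection followed by a reduce-side refinement.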
PLANET (Parallel Learner for Assembling Numerous Ensemble Trees) is a distributed framework for learning tree models over large data sets. It defines tree learning as a series of distributed computations and implements each one using the MapReduce model [110].
SystemML [60] provides a framework for expressing machine learning algorithms in a declarative higher-level language. The algorithms expressed in SystemML are then automatically compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines.
NIMBLE [59] provides an infrastructure that has been specifically designed to enable the rapid implementation of parallel machine learning and data mining algorithms. The infrastructure allows its users to compose parallel machine learning algorithms from reusable (serial and parallel) building blocks that can be efficiently executed using the MapReduce framework.
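The building-block idea can be sketched as follows. The `Task` and `Pipeline` classes here are illustrative assumptions, not NIMBLE's actual API; they only show how reusable blocks compose into a dataflow that a runtime could schedule as jobs.

```python
# Hypothetical sketch of composable building blocks (not NIMBLE's real API).

class Task:
    """A reusable building block wrapping one data transformation."""
    def __init__(self, fn):
        self.fn = fn
    def run(self, data):
        return self.fn(data)

class Pipeline:
    """Compose building blocks; a runtime could schedule each as a job."""
    def __init__(self, *tasks):
        self.tasks = tasks
    def run(self, data):
        for task in self.tasks:
            data = task.run(data)
        return data

# Compose a tiny word count from two reusable blocks.
tokenize = Task(lambda docs: [w for d in docs for w in d.split()])

def count(words):
    out = {}
    for w in words:
        out[w] = out.get(w, 0) + 1
    return out

wordcount = Pipeline(tokenize, Task(count))
counts = wordcount.run(["a b a", "b c"])
# counts == {"a": 2, "b": 2, "c": 1}
```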
Mahout* is an Apache project with the aim of building scalable machine learning libraries using the MapReduce framework.
Ricardo [42] is presented as a scalable platform for applying sophisticated statistical methods over huge data repositories. It is designed to facilitate the trading between R (a popular statistical software package†) and Hadoop, where each trading partner performs the tasks that it does best. In particular, this trading is done in a way where R sends aggregation-processing queries to Hadoop while Hadoop sends aggregated data to R
for advanced statistical processing or visualization. Cary et al. [28] presented an
approach for applying the MapReduce model in the domain of spatial data management. In particular, they focus on the bulk construction of R-Trees and aerial image quality computation, which involves vector and raster data. Morales et al. [102] have presented two matching algorithms, GreedyMR and StackMR, which are geared for the MapReduce paradigm with the aim of distributing content from information suppliers to information consumers on social media applications. In particular, they seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity.
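The R/Hadoop trading pattern described for Ricardo can be sketched in plain Python, simulating both partners in-process: the data layer answers aggregation queries over the large raw data, and the analysis layer consumes only the small aggregates. The function names and division of labor here are illustrative assumptions, not Ricardo's actual interfaces.

```python
from collections import defaultdict

# Illustrative sketch of the Ricardo-style "trading" pattern (not its real API).

def data_layer_aggregate(records, key_fn, value_fn):
    """Hadoop's role: scan the (large) raw data once, return small aggregates."""
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        k = key_fn(rec)
        sums[k] += value_fn(rec)
        counts[k] += 1
    return {k: (sums[k], counts[k]) for k in sums}

def analysis_layer_means(aggregates):
    """R's role: statistical processing over the aggregates only."""
    return {k: s / n for k, (s, n) in aggregates.items()}

sales = [("east", 10.0), ("east", 30.0), ("west", 5.0)]
agg = data_layer_aggregate(sales, key_fn=lambda r: r[0], value_fn=lambda r: r[1])
means = analysis_layer_means(agg)
# means == {"east": 20.0, "west": 5.0}
```

The key design point is that only the compact aggregates cross the boundary between the two systems, so the statistical side never has to hold the raw data.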
* http://mahout.apache.org/.
† http://www.r-project.org/.