Database Reference
algorithm is based on brute force, the second treats the problem as a large-scale
ad hoc retrieval task, and the third is based on the Cartesian product of
postings lists. V-SMART-Join [101] is a MapReduce-based framework for discovering
all pairs of similar entities, which is applicable to sets, multisets, and vectors.
It presents a family of two-stage algorithms where the first stage computes and
joins the partial results, and the second stage computes the similarity exactly for
all candidate pairs. Afrati et al. [5] have provided a theoretical analysis of various
MapReduce-based similarity join algorithms in terms of various parameters including
map and reduce costs, number of reducers, and communication cost.
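The two ideas mentioned above, pairing entities through the Cartesian product of postings lists and then verifying candidate pairs exactly, can be illustrated with a small in-memory sketch. This is not the V-SMART-Join algorithm or any of the cited methods; it is a generic candidate-then-verify join that assumes set-valued records and Jaccard similarity, and the names `candidate_pairs`, `jaccard`, and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Stage 1: build postings lists, then pair up ids that share an element."""
    postings = defaultdict(set)  # "map" step: element -> ids of sets containing it
    for rid, items in records.items():
        for item in items:
            postings[item].add(rid)

    pairs = set()  # "reduce" step: all id pairs within each postings list
    for ids in postings.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(records, threshold):
    """Stage 2: verify every candidate pair exactly and keep those at or above threshold."""
    result = {}
    for x, y in candidate_pairs(records):
        sim = jaccard(records[x], records[y])
        if sim >= threshold:
            result[(x, y)] = sim
    return result


# Illustrative data only (not from the cited papers).
records = {
    "d1": {"map", "reduce", "join"},
    "d2": {"map", "reduce", "sort"},
    "d3": {"graph", "rank"},
}
print(similarity_join(records, threshold=0.5))  # {('d1', 'd2'): 0.5}
```

Only pairs that share at least one element are ever compared, which is the point of generating candidates from postings lists. In a real MapReduce job each stage would be a separate map and reduce pass over distributed data rather than a Python loop.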
The DisCo (Distributed Co-clustering) framework [111] has been introduced as
an approach for distributed data preprocessing and co-clustering, from the raw data to
the end clusters, using the MapReduce framework. Cordeiro et al. [41] have presented
an approach for finding subspace clusters in very large, moderate-to-high dimensional
data sets, typically with more than 5 axes. Ene et al. [52] described the design
and the MapReduce-based implementations of the k-median and k-center clustering
algorithms. PLANET (Parallel Learner for Assembling Numerous Ensemble Trees)
is a distributed framework for learning tree models over large data sets. It defines
tree learning as a series of distributed computations and implements each one using
the MapReduce model [110]. SystemML [60] provides a framework for expressing
machine learning algorithms using a declarative higher-level language. The
algorithms expressed in SystemML are then automatically compiled and optimized
into a set of MapReduce jobs that can run on a cluster of machines. NIMBLE [59]
provides an infrastructure that has been specifically designed to enable the rapid
implementation of parallel machine learning and data mining algorithms. The infrastructure
allows its users to compose parallel machine learning algorithms using
reusable (serial and parallel) building blocks that can be efficiently executed using
the MapReduce framework. Mahout is an Apache project with the aim of building
scalable machine learning libraries using the MapReduce framework. Ricardo [42]
is presented as a scalable platform for applying sophisticated statistical methods over
huge data repositories. It is designed to facilitate trading between R (a well-known
statistical software package) and Hadoop, where each trading partner performs the
tasks that it does best. In particular, R sends
aggregation-processing queries to Hadoop, while Hadoop sends aggregated data to
R for advanced statistical processing or visualization. Cary et al. [28] presented an
approach for applying the MapReduce model in the domain of spatial data management.
In particular, they focus on the bulk construction of R-Trees and aerial image
quality computation, which involve vector and raster data, respectively. Morales et al. [102] have
presented two matching algorithms, GreedyMR and StackMR, which are geared toward
the MapReduce paradigm, with the aim of distributing content from information suppliers
to information consumers in social media applications. In particular, they
seek to maximize the overall relevance of the matched content from suppliers to
consumers while regulating the overall activity.
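The matching objective described above, maximizing total relevance while limiting how much each party takes part, can be made concrete with a small sequential sketch. The text does not describe the internals of GreedyMR or StackMR, so this is not either algorithm. It is a plain greedy heuristic that assumes each supplier and consumer has a fixed capacity (a hypothetical stand-in for "regulating the overall activity"), and the function name, data layout, and sample values are all illustrative.

```python
def greedy_matching(edges, capacity):
    """Pick edges by decreasing relevance while respecting per-node capacities.

    edges    -- iterable of (relevance, supplier, consumer) tuples
    capacity -- dict mapping each node to the number of matches it may take part in
    """
    remaining = dict(capacity)  # copy so the caller's dict is left untouched
    chosen = []
    for relevance, supplier, consumer in sorted(edges, reverse=True):
        if remaining[supplier] > 0 and remaining[consumer] > 0:
            chosen.append((supplier, consumer, relevance))
            remaining[supplier] -= 1
            remaining[consumer] -= 1
    return chosen


# Illustrative data only: two content items, two users, one match allowed per node.
edges = [
    (0.9, "post_a", "alice"),
    (0.8, "post_a", "bob"),
    (0.7, "post_b", "alice"),
    (0.4, "post_b", "bob"),
]
capacity = {"post_a": 1, "post_b": 1, "alice": 1, "bob": 1}
print(greedy_matching(edges, capacity))
# [('post_a', 'alice', 0.9), ('post_b', 'bob', 0.4)]
```

In this example the greedy choice reaches a total relevance of 1.3, whereas matching post_a to bob and post_b to alice would reach 1.5. A greedy pass is simple but not guaranteed to be optimal.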