algorithm is based on brute force, the second treats the problem as large-scale ad hoc retrieval, and the third is based on the Cartesian product of postings lists. V-SMART-Join [101] is a MapReduce-based framework for discovering all pairs of similar entities, applicable to sets, multisets, and vectors. It presents a family of two-stage algorithms in which the first stage computes and joins partial results, and the second stage computes the similarity exactly for all candidate pairs. Afrati et al. [5] have provided a theoretical analysis of various MapReduce-based similarity join algorithms in terms of parameters such as map and reduce costs, number of reducers, and communication cost.
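To make the flavor of these approaches concrete, the following single-machine Python sketch emulates the common two-stage pattern: a first map/reduce pass builds an inverted index and derives candidate pairs from each postings list (the Cartesian-product idea), and a second stage verifies every candidate exactly. The function names and the Jaccard measure are illustrative assumptions, not the interface of any of the cited systems.

```python
from collections import defaultdict
from itertools import combinations

def map_tokens(doc_id, tokens):
    # Stage 1 map: emit a (token, doc_id) posting for each distinct token.
    for tok in set(tokens):
        yield tok, doc_id

def reduce_candidates(postings):
    # Stage 1 reduce: every pair of documents sharing this token is a
    # candidate -- the "Cartesian product of postings lists" idea.
    for pair in combinations(sorted(postings), 2):
        yield pair

def jaccard(a, b):
    return len(a & b) / len(a | b)

def similarity_join(docs, threshold):
    # Emulate the shuffle that MapReduce performs between map and reduce.
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for tok, d in map_tokens(doc_id, tokens):
            index[tok].append(d)
    candidates = set()
    for postings in index.values():
        candidates.update(reduce_candidates(postings))
    # Stage 2: verify each candidate pair exactly.
    sets = {d: set(t) for d, t in docs.items()}
    return {(a, b): jaccard(sets[a], sets[b])
            for a, b in candidates if jaccard(sets[a], sets[b]) >= threshold}

docs = {"d1": ["a", "b", "c"], "d2": ["a", "b", "d"], "d3": ["x", "y"]}
print(similarity_join(docs, threshold=0.4))   # {('d1', 'd2'): 0.5}
```

The inverted index prunes pairs that share no tokens at all (d3 above never becomes a candidate), which is what makes this family of algorithms cheaper than the brute-force alternative.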
The DisCo (Distributed Co-clustering) framework [111] has been introduced as an approach for distributed data preprocessing and co-clustering, going from the raw data to the end clusters using the MapReduce framework. Cordeiro et al. [41] have presented an approach for finding subspace clusters in very large moderate-to-high dimensional data, typically with more than five axes. Ene et al. [52] described the design and MapReduce-based implementations of the k-median and k-center clustering algorithms.
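As a rough illustration of how a center-based clustering step maps onto MapReduce, the sketch below emulates one synchronous round in which mappers assign points to their nearest center and reducers recompute the centers. It is a deliberately simplified, one-dimensional stand-in, not the actual algorithm of Ene et al. [52] (which relies on careful sampling) or of the other systems above.

```python
from collections import defaultdict

def map_assign(point, centers):
    # Map: assign the point to the nearest current center.
    cid = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
    yield cid, point

def reduce_recenter(points):
    # Reduce: recompute one cluster's center; the median echoes the
    # k-median objective (the mean would give a k-means-style step).
    pts = sorted(points)
    return pts[len(pts) // 2]

def clustering_round(points, centers):
    clusters = defaultdict(list)
    for p in points:
        for cid, pt in map_assign(p, centers):
            clusters[cid].append(pt)
    return [reduce_recenter(pts) for pts in clusters.values()]

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]   # one-dimensional for brevity
centers = [0.0, 8.0]
for _ in range(3):                          # a few synchronous rounds
    centers = clustering_round(points, centers)
print(centers)                              # [1.5, 9.5]
```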
PLANET (Parallel Learner for Assembling Numerous Ensemble Trees) is a distributed framework for learning tree models over large data sets. It defines tree learning as a series of distributed computations and implements each one using the MapReduce model [110].
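The recurring distributed computation in this style of tree learning is split finding: mappers emit sufficient statistics for every candidate split, and a reducer aggregates them to select the best one. The sketch below emulates that single step for a one-feature regression node; the helper names and statistics layout are illustrative, and PLANET's actual job structure is considerably more elaborate.

```python
from collections import defaultdict

def map_split_stats(row, thresholds):
    # Map: for every candidate threshold, emit which side the row falls
    # on plus the sufficient statistics (sum, sum of squares, count).
    x, y = row
    for t in thresholds:
        side = "left" if x <= t else "right"
        yield (t, side), (y, y * y, 1)

def reduce_best_split(emitted):
    # Reduce: aggregate the statistics and pick the threshold that
    # minimizes the total squared error of the two children.
    agg = defaultdict(lambda: [0.0, 0.0, 0])
    for key, (s, sq, n) in emitted:
        agg[key][0] += s
        agg[key][1] += sq
        agg[key][2] += n

    def sse(s, sq, n):                      # sum of squared errors
        return sq - s * s / n if n else 0.0

    best = None
    for t in {t for t, _ in agg}:
        cost = sum(sse(*agg[(t, side)])
                   for side in ("left", "right") if (t, side) in agg)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

rows = [(1.0, 2.0), (2.0, 2.2), (8.0, 10.0), (9.0, 9.6)]
thresholds = [1.5, 5.0, 8.5]
emitted = [kv for row in rows for kv in map_split_stats(row, thresholds)]
print(reduce_best_split(emitted))   # (5.0, ...): 5.0 separates the targets best
```

Because only compact statistics cross the network, the expensive scan over the training rows stays local to the mappers.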
SystemML [60] provides a framework for expressing machine learning algorithms using a declarative higher-level language. The algorithms expressed in SystemML are then automatically compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. NIMBLE [59] provides an infrastructure that has been specifically designed to enable the rapid implementation of parallel machine learning and data mining algorithms. The infrastructure allows its users to compose parallel machine learning algorithms from reusable (serial and parallel) building blocks that can be efficiently executed using the MapReduce framework. Mahout* is an Apache project with the aim of building scalable machine learning libraries using the MapReduce framework. Ricardo [42] is presented as a scalable platform for applying sophisticated statistical methods over huge data repositories. It is designed to facilitate trading between R† (a well-known statistical software package) and Hadoop, where each trading partner performs the tasks that it does best. In particular, R sends aggregation-processing queries to Hadoop, while Hadoop sends aggregated data to R for advanced statistical processing or visualization.
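The division of labor can be illustrated in a few lines: one side scans the large data set and hands back only small per-group aggregates, while the other side applies statistics to those aggregates. In the sketch below, Python stands in for both R and Hadoop purely for illustration; the function names are hypothetical.

```python
from collections import defaultdict

def cluster_side_aggregate(records, key, value):
    # Plays the Hadoop role: scan the (potentially huge) data set and
    # return only small per-group aggregates (sum, count).
    groups = defaultdict(lambda: [0.0, 0])
    for rec in records:
        g = groups[rec[key]]
        g[0] += rec[value]
        g[1] += 1
    return dict(groups)

def driver_side_statistics(aggregates):
    # Plays the R role: statistical post-processing over the small
    # aggregated result, never over the raw records.
    return {k: s / n for k, (s, n) in aggregates.items()}

records = [{"user": "u1", "rating": 4}, {"user": "u1", "rating": 5},
           {"user": "u2", "rating": 2}]
aggs = cluster_side_aggregate(records, "user", "rating")
print(driver_side_statistics(aggs))   # {'u1': 4.5, 'u2': 2.0}
```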
Cary et al. [28] presented an approach for applying the MapReduce model in the domain of spatial data management. In particular, they focus on the bulk construction of R-trees and on aerial image quality computation, which involve vector and raster data, respectively.
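A common MapReduce recipe for bulk-loading spatial indexes in this spirit is to linearize objects along a space-filling curve, route them to partitions in the map phase, and let each reducer pack its sorted partition into bounding-box leaf nodes. The sketch below shows only this leaf level using a Z-order (Morton) curve; the partitioning scheme and node capacity are illustrative assumptions rather than the authors' exact method.

```python
def z_order(x, y, bits=8):
    # Interleave the bits of the two coordinates (Morton / Z-order key),
    # a common way to linearize spatial data before partitioning.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def map_partition(point, num_partitions, bits=8):
    # Map: route each point to a partition by its Z-order key range.
    key = z_order(point[0], point[1], bits)
    yield (key * num_partitions) >> (2 * bits), point

def reduce_pack_leaves(points, capacity=2):
    # Reduce: sort one partition by Z-order and pack consecutive points
    # into leaf nodes, recording each leaf's bounding box (MBR).
    pts = sorted(points, key=lambda p: z_order(*p))
    leaves = []
    for i in range(0, len(pts), capacity):
        chunk = pts[i:i + capacity]
        xs, ys = [p[0] for p in chunk], [p[1] for p in chunk]
        leaves.append(((min(xs), min(ys), max(xs), max(ys)), chunk))
    return leaves

points = [(1, 1), (2, 3), (200, 180), (220, 210), (5, 4)]
parts = {}
for p in points:
    for pid, pt in map_partition(p, num_partitions=4):
        parts.setdefault(pid, []).append(pt)
for pid, pts in sorted(parts.items()):
    print(pid, reduce_pack_leaves(pts))
```

Because nearby points receive nearby Z-order keys, each reducer's leaves end up with tight bounding boxes, which is the property bulk-loading aims for.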
Morales et al. [102] have presented two matching algorithms, GreedyMR and StackMR, which are geared toward the MapReduce paradigm with the aim of distributing content from information suppliers to information consumers in social media applications. In particular, they seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity.
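The underlying problem can be read as capacitated (b-)matching on a bipartite supplier-consumer graph. The sequential sketch below shows only the greedy rule that an algorithm like GreedyMR distributes across MapReduce rounds: edges are taken in decreasing relevance while both endpoints have spare capacity, the capacities playing the role of the regulated activity. All names and limits are illustrative.

```python
def greedy_b_matching(edges, supplier_cap, consumer_cap):
    # Consider (supplier, consumer, relevance) edges in decreasing
    # relevance and keep an edge while both endpoints still have spare
    # capacity; the capacities regulate the overall activity.
    matched, s_load, c_load = [], {}, {}
    for s, c, w in sorted(edges, key=lambda e: -e[2]):
        if (s_load.get(s, 0) < supplier_cap[s]
                and c_load.get(c, 0) < consumer_cap[c]):
            matched.append((s, c, w))
            s_load[s] = s_load.get(s, 0) + 1
            c_load[c] = c_load.get(c, 0) + 1
    return matched

edges = [("s1", "c1", 0.9), ("s1", "c2", 0.8),
         ("s2", "c1", 0.7), ("s2", "c2", 0.6)]
caps_s = {"s1": 1, "s2": 1}   # per-supplier activity limits
caps_c = {"c1": 1, "c2": 1}   # per-consumer attention limits
print(greedy_b_matching(edges, caps_s, caps_c))
# [('s1', 'c1', 0.9), ('s2', 'c2', 0.6)]
```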
* http://mahout.apache.org/.
† http://www.r-project.org/.