algorithm is based on brute force, the second algorithm treats the problem as a large-scale ad hoc retrieval task, and the third algorithm is based on the Cartesian product of postings lists.
V-SMART-Join [101] is a MapReduce-based framework for discovering all pairs of similar entities that is applicable to sets, multisets, and vectors. It presents a family of two-stage algorithms in which the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. Afrati et al. [5] have provided a theoretical analysis of various MapReduce-based similarity join algorithms in terms of parameters including map and reduce costs, number of reducers, and communication cost.
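The two-stage pattern described above (join partial results to generate candidates, then verify each candidate exactly) can be illustrated with a minimal in-process sketch. This is not the actual V-SMART-Join implementation; the function names and the use of Jaccard similarity over an inverted index are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative sketch of a two-stage similarity join, NOT V-SMART-Join itself.

def stage1_map(record):
    """Emit (token, record_id) pairs so records sharing a token meet."""
    rid, tokens = record
    for tok in tokens:
        yield tok, rid

def stage1_reduce(rids):
    """Join partial results: every pair sharing this token is a candidate."""
    for a, b in combinations(sorted(set(rids)), 2):
        yield (a, b)

def stage2_verify(pair, data, threshold):
    """Compute the exact Jaccard similarity for one candidate pair."""
    sa, sb = data[pair[0]], data[pair[1]]
    sim = len(sa & sb) / len(sa | sb)
    return sim if sim >= threshold else None

def similarity_join(records, threshold=0.5):
    data = {rid: set(toks) for rid, toks in records}
    # Stage 1: candidate generation via an inverted index.
    index = defaultdict(list)
    for rec in records:
        for tok, rid in stage1_map(rec):
            index[tok].append(rid)
    candidates = set()
    for rids in index.values():
        candidates.update(stage1_reduce(rids))
    # Stage 2: exact similarity for all candidate pairs.
    return {p: s for p in candidates
            if (s := stage2_verify(p, data, threshold)) is not None}

pairs = similarity_join([("r1", ["a", "b", "c"]),
                         ("r2", ["a", "b", "d"]),
                         ("r3", ["x", "y"])], threshold=0.4)
# pairs == {("r1", "r2"): 0.5}
```

In a real MapReduce deployment, each stage would be a separate job and the inverted index would be materialized by the shuffle phase rather than built in memory.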
The DisCo (Distributed Co-clustering) framework [111] has been introduced as an approach for distributed data preprocessing and co-clustering, taking the raw data through to the end clusters using the MapReduce framework. Cordeiro et al. [41] have presented an approach for finding subspace clusters in very large moderate-to-high dimensional data, typically with more than five axes. Ene et al. [52] described the design and MapReduce-based implementations of k-median and k-center clustering algorithms.
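To make the k-center objective concrete, the following is a sequential sketch of the classic greedy farthest-first heuristic (a 2-approximation for k-center). This illustrates the objective that such MapReduce implementations parallelize; it is not the distributed algorithm of Ene et al.

```python
# Sequential farthest-first sketch for k-center (illustrative only).

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_center(points, k):
    """Greedily pick k centers so the max point-to-center distance stays small."""
    centers = [points[0]]               # arbitrary first center
    while len(centers) < k:
        # Add the point farthest from its nearest current center.
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers

centers = k_center([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
# centers == [(0, 0), (10, 11)]
```

A MapReduce variant typically replaces the global farthest-point search with per-partition candidate selection followed by a reduce-side refinement.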
PLANET (Parallel Learner for Assembling Numerous Ensemble Trees) is a distributed framework for learning tree models over large data sets. It defines tree learning as a series of distributed computations and implements each one using the MapReduce model [110].
SystemML [60] provides a framework for expressing machine learning algorithms in a declarative higher-level language. The algorithms expressed in SystemML are then automatically compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines.
NIMBLE [59] provides an infrastructure that has been specifically designed to enable the rapid implementation of parallel machine learning and data mining algorithms. The infrastructure allows its users to compose parallel machine learning algorithms from reusable (serial and parallel) building blocks that can be efficiently executed using the MapReduce framework.
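The building-block idea can be sketched as follows. The `Task` and `Pipeline` classes here are illustrative assumptions, not NIMBLE's actual API; they only show how reusable blocks compose into a dataflow that a runtime could schedule as jobs.

```python
# Hypothetical sketch of composable building blocks (not NIMBLE's real API).

class Task:
    """A reusable building block wrapping one data transformation."""
    def __init__(self, fn):
        self.fn = fn
    def run(self, data):
        return self.fn(data)

class Pipeline:
    """Compose building blocks; a runtime could schedule each as a job."""
    def __init__(self, *tasks):
        self.tasks = tasks
    def run(self, data):
        for task in self.tasks:
            data = task.run(data)
        return data

# Compose a tiny word count from two reusable blocks.
tokenize = Task(lambda docs: [w for d in docs for w in d.split()])

def count(words):
    out = {}
    for w in words:
        out[w] = out.get(w, 0) + 1
    return out

wordcount = Pipeline(tokenize, Task(count))
counts = wordcount.run(["a b a", "b c"])
# counts == {"a": 2, "b": 2, "c": 1}
```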
Mahout* is an Apache project with the aim of building scalable machine learning libraries using the MapReduce framework.
Ricardo [42] is presented as a scalable platform for applying sophisticated statistical methods over huge data repositories. It is designed to facilitate the trading between R (a popular statistical software package†) and Hadoop, where each trading partner performs the tasks that it does best. In particular, this trading is done in a way where R sends aggregation-processing queries to Hadoop while Hadoop sends aggregated data to R
for advanced statistical processing or visualization. Cary et al. [28] presented an
approach for applying the MapReduce model in the domain of spatial data management. In particular, they focus on the bulk construction of R-Trees and aerial image quality computation, which involves vector and raster data. Morales et al. [102] have presented two matching algorithms, GreedyMR and StackMR, which are geared for the MapReduce paradigm with the aim of distributing content from information suppliers to information consumers on social media applications. In particular, they seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity.
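The R/Hadoop trading pattern described for Ricardo can be sketched in plain Python, simulating both partners in-process: the data layer answers aggregation queries over the large raw data, and the analysis layer consumes only the small aggregates. The function names and division of labor here are illustrative assumptions, not Ricardo's actual interfaces.

```python
from collections import defaultdict

# Illustrative sketch of the Ricardo-style "trading" pattern (not its real API).

def data_layer_aggregate(records, key_fn, value_fn):
    """Hadoop's role: scan the (large) raw data once, return small aggregates."""
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        k = key_fn(rec)
        sums[k] += value_fn(rec)
        counts[k] += 1
    return {k: (sums[k], counts[k]) for k in sums}

def analysis_layer_means(aggregates):
    """R's role: statistical processing over the aggregates only."""
    return {k: s / n for k, (s, n) in aggregates.items()}

sales = [("east", 10.0), ("east", 30.0), ("west", 5.0)]
agg = data_layer_aggregate(sales, key_fn=lambda r: r[0], value_fn=lambda r: r[1])
means = analysis_layer_means(agg)
# means == {"east": 20.0, "west": 5.0}
```

The key design point is that only the compact aggregates cross the boundary between the two systems, so the statistical side never has to hold the raw data.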
* http://mahout.apache.org/.
† http://www.r-project.org/.