processing on a large cluster. It realizes the join of the static data and the state data by
explicitly specifying an additional MapReduce job, and relies on the task scheduler
and caching techniques to maintain local access to static data, while iMapReduce
relies on persistent tasks to manage static data and to avoid task initialization.
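The per-iteration join of static data with iterated state data described above can be sketched in a few lines. The sketch below is a plain, single-machine illustration with names of our choosing (not any framework's API): a map-like phase tags static records (e.g., adjacency lists) and state records (e.g., current scores) by a shared key, and a reduce-like phase brings the two sides together so each key sees both parts.

```python
from collections import defaultdict

def join_static_and_state(static_data, state_data):
    """One MapReduce-style pass joining static data (e.g. a node's
    adjacency list) with iterated state data (e.g. its current score)
    on a shared key, as the frameworks above must do each iteration."""
    # Map phase: emit every record under its key, tagged by origin.
    grouped = defaultdict(dict)
    for key, links in static_data.items():
        grouped[key]["links"] = links
    for key, score in state_data.items():
        grouped[key]["score"] = score
    # Reduce phase: each key now sees both its static and state parts.
    return {k: (v.get("links", []), v.get("score", 0.0))
            for k, v in grouped.items()}

# Example: a static link structure joined with uniform initial state.
static = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
state = {"a": 1.0, "b": 1.0, "c": 1.0}
joined = join_static_and_state(static, state)
```

Because the static side never changes across iterations, re-shuffling it every round is pure overhead; this is exactly the cost that caching (in the scheduler-based approach) or persistent tasks (in iMapReduce) are designed to avoid.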
Some studies accelerate iterative algorithms by maintaining the iterated state data
in memory. Spark [18] was developed to optimize iterative and interactive computation.
It uses caching techniques to dramatically improve the performance of repeated
operations. The main idea in Spark is the resilient distributed dataset (RDD), a
read-only collection of objects that is maintained in memory across iterations and
supports fault recovery. Logothetis et al. [12] present a generalized architecture
for continuous bulk processing (CBP), which performs iterative
computations in an incremental fashion by unifying stateful programming with a
data-parallel operator. CIEL [15] supports data-dependent iterative or recursive algo-
rithms by building an abstract dynamic task graph. Piccolo [17] allows computation
running on different machines to share distributed, mutable state via a key-value
table interface. This enables one to implement iterative algorithms that access in-
memory distributed tables without worrying about the consistency of the data. PrIter
[22] enables prioritized iteration, which exploits the dominant property of some portion
of the data and schedules that portion for computation first, rather than blindly
performing computations on all data. This is realized by maintaining a state table and
a priority queue in memory. Our iMapReduce framework is built on Hadoop; the
iterated state data as well as the static data are maintained in files rather than in memory.
Therefore, it is more scalable and more resilient to failures.
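The prioritized iteration that PrIter performs can be sketched as follows. This is a minimal single-machine illustration in the spirit of PrIter, not its actual implementation: a state table holds each record's accumulated value and pending delta, and a priority queue always schedules the record with the largest pending delta next, shown here for a PageRank-style computation with an assumed damping factor of 0.8.

```python
import heapq

def prioritized_iterate(graph, num_updates=100, damping=0.8):
    """Prioritized delta-based iteration: instead of sweeping all
    nodes each round, pop the node with the largest pending delta,
    fold the delta into its state, and push shares of it downstream."""
    state = {v: 0.0 for v in graph}              # converged mass per node
    delta = {v: 1.0 - damping for v in graph}    # pending change per node
    heap = [(-d, v) for v, d in delta.items()]   # max-priority via negation
    heapq.heapify(heap)
    for _ in range(num_updates):
        if not heap:
            break
        _, v = heapq.heappop(heap)
        d = delta[v]
        if d <= 1e-12:
            continue                 # stale heap entry; nothing pending
        state[v] += d
        delta[v] = 0.0
        out = graph[v]
        if out:
            share = damping * d / len(out)
            for w in out:
                delta[w] += share
                heapq.heappush(heap, (-delta[w], w))
    return state

# Example: a symmetric two-node cycle converges to equal ranks.
ranks = prioritized_iterate({"a": ["b"], "b": ["a"]}, num_updates=1000)
```

Any scheduling order reaches the same fixed point here; the priority queue only changes how quickly the dominant mass is absorbed, which is the property PrIter exploits.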
Some other efforts focus on graph-based iterative algorithms, an important class of
iterative algorithms. PEGASUS [9] models those seemingly different graph iterative
algorithms as a generalization of matrix-vector multiplication (GIM-V). By exploring
matrix property, such as block multiplication, clustered edges, and diagonal block
iteration, it can achieve a fivefold speedup over regular jobs. Pregel [13] adopts
a pure message-passing model to process graphs. In each iteration, a vertex can,
independently of other vertices, receive messages sent to it in the previous iteration,
send messages to other vertices, modify its own state and the states of its outgoing
edges, and mutate the graph's topology. This model makes processing large graphs
expressive and easy to program. iMapReduce exploits the property that the map and
reduce functions operate
on the same type of keys, that is, node id, to accelerate graph-based iterative algorithms.
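The vertex-centric superstep model that Pregel uses can be illustrated with a minimal single-machine sketch (function and variable names are ours, not Pregel's API): in each superstep, every vertex consumes only the messages sent to it in the previous superstep, updates its value, and emits messages along its outgoing edges; iteration stops when no value changes.

```python
def pregel_supersteps(graph, init, compute, max_steps=30):
    """Run a vertex program in synchronized supersteps.
    graph: vertex -> list of out-neighbors; init: vertex -> value;
    compute(value, messages, neighbors) -> (new_value, [(dest, msg)])."""
    values = dict(init)
    # Superstep 0: every vertex announces its value to its out-neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    for _ in range(max_steps):
        outbox = {v: [] for v in graph}
        changed = False
        for v in graph:
            new_value, out_msgs = compute(values[v], inbox[v], graph[v])
            if new_value != values[v]:
                changed = True
                values[v] = new_value
            for w, m in out_msgs:
                outbox[w].append(m)
        inbox = outbox                # messages arrive next superstep
        if not changed:
            break
    return values

# Example vertex program: propagate the maximum value through the graph.
def max_compute(value, messages, neighbors):
    new_value = max([value] + messages)
    if new_value > value:
        return new_value, [(w, new_value) for w in neighbors]
    return value, []

largest = pregel_supersteps({"a": ["b"], "b": ["c"], "c": ["a"]},
                            {"a": 3, "b": 1, "c": 2}, max_compute)
```

The vertex id plays the role of the key throughout, which is the same property iMapReduce exploits when the map and reduce functions both operate on node ids.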
The most relevant work is that of Ekanayake et al., who proposed Twister [5,6],
a stream-based MapReduce implementation that supports iterative applications.
Twister loads the input data only once, in the initialization stage, and performs
iterative map-reduce processing with long-running map/reduce daemons. iMapReduce
differs from Twister mainly in that Twister
stores intermediate data in memory, while iMapReduce stores intermediate data in
files. Twister loads all data in distributed memory for fast data access and rests on
the assumption that data sets and intermediate data can fit into the distributed mem-
ory of the computing infrastructure. iMapReduce aims at providing a MapReduce-
based iterative computing framework running on a cluster of commodity machines
where each node has limited memory resources. In iMapReduce, the intermediate
data, including the intermediate results, the shuffled data between map and reduce,