processing on a large cluster. It realizes the join of the static data and the state data by
explicitly specifying an additional MapReduce job, and relies on the task scheduler
and caching techniques to maintain local access to static data, while iMapReduce
relies on persistent tasks to manage static data and to avoid task initialization.
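The per-iteration join of static data with iterated state data described above can be sketched in a few lines. The sketch below is a plain, single-machine illustration with names of our choosing (not any framework's API): a map-like phase tags static records (e.g., adjacency lists) and state records (e.g., current scores) by a shared key, and a reduce-like phase brings the two sides together so each key sees both parts.

```python
from collections import defaultdict

def join_static_and_state(static_data, state_data):
    """One MapReduce-style pass joining static data (e.g. a node's
    adjacency list) with iterated state data (e.g. its current score)
    on a shared key, as the frameworks above must do each iteration."""
    # Map phase: emit every record under its key, tagged by origin.
    grouped = defaultdict(dict)
    for key, links in static_data.items():
        grouped[key]["links"] = links
    for key, score in state_data.items():
        grouped[key]["score"] = score
    # Reduce phase: each key now sees both its static and state parts.
    return {k: (v.get("links", []), v.get("score", 0.0))
            for k, v in grouped.items()}

# Example: a static link structure joined with uniform initial state.
static = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
state = {"a": 1.0, "b": 1.0, "c": 1.0}
joined = join_static_and_state(static, state)
```

Because the static side never changes across iterations, re-shuffling it every round is pure overhead; this is exactly the cost that caching (in the scheduler-based approach) or persistent tasks (in iMapReduce) are designed to avoid.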
Some studies accelerate iterative algorithms by maintaining the iterated state data
in memory. Spark [18] was developed to optimize iterative and interactive computation.
It uses caching techniques to dramatically improve the performance of repeated
operations. The main idea in Spark is the resilient distributed dataset (RDD), a
read-only collection of objects that is maintained in memory across iterations and
supports fault recovery. Logothetis et al. [12] present a generalized architecture
for continuous bulk processing (CBP), which performs iterative
computations in an incremental fashion by unifying stateful programming with a
data-parallel operator. CIEL [15] supports data-dependent iterative or recursive algo-
rithms by building an abstract dynamic task graph. Piccolo [17] allows computation
running on different machines to share distributed, mutable state via a key-value
table interface. This enables one to implement iterative algorithms that access in-
memory distributed tables without worrying about the consistency of the data. PrIter
[22] enables prioritized iteration, which exploits the dominant property of some portion
of the data and schedules that portion for computation first, rather than blindly
performing computations on all data. This is realized by maintaining a state table and
a priority queue in memory. Our iMapReduce framework is built on Hadoop; the
iterated state data as well as the static data are maintained in files rather than in memory.
Therefore, it is more scalable and more resilient to failures.
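The prioritized iteration that PrIter performs can be sketched as follows. This is a minimal single-machine illustration in the spirit of PrIter, not its actual implementation: a state table holds each record's accumulated value and pending delta, and a priority queue always schedules the record with the largest pending delta next, shown here for a PageRank-style computation with an assumed damping factor of 0.8.

```python
import heapq

def prioritized_iterate(graph, num_updates=100, damping=0.8):
    """Prioritized delta-based iteration: instead of sweeping all
    nodes each round, pop the node with the largest pending delta,
    fold the delta into its state, and push shares of it downstream."""
    state = {v: 0.0 for v in graph}              # converged mass per node
    delta = {v: 1.0 - damping for v in graph}    # pending change per node
    heap = [(-d, v) for v, d in delta.items()]   # max-priority via negation
    heapq.heapify(heap)
    for _ in range(num_updates):
        if not heap:
            break
        _, v = heapq.heappop(heap)
        d = delta[v]
        if d <= 1e-12:
            continue                 # stale heap entry; nothing pending
        state[v] += d
        delta[v] = 0.0
        out = graph[v]
        if out:
            share = damping * d / len(out)
            for w in out:
                delta[w] += share
                heapq.heappush(heap, (-delta[w], w))
    return state

# Example: a symmetric two-node cycle converges to equal ranks.
ranks = prioritized_iterate({"a": ["b"], "b": ["a"]}, num_updates=1000)
```

Any scheduling order reaches the same fixed point here; the priority queue only changes how quickly the dominant mass is absorbed, which is the property PrIter exploits.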
Some other efforts focus on graph-based iterative algorithms, an important class of
iterative algorithms. PEGASUS [9] models those seemingly different graph iterative
algorithms as a generalization of matrix-vector multiplication (GIM-V). By exploring
matrix property, such as block multiplication, clustered edges, and diagonal block
iteration, it can achieve a fivefold speedup over regular jobs. Pregel [13] adopts
a pure message-passing model to process graphs. In each iteration, a vertex can,
independently of other vertices, receive messages sent to it in the previous iteration,
send messages to other vertices, modify its own state and the states of its outgoing
edges, and mutate the graph's topology. This model makes processing large graphs
expressive and easy to program. iMapReduce exploits the property that the map and
reduce functions operate
on the same type of keys, that is, node id, to accelerate graph-based iterative algorithms.
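The vertex-centric superstep model that Pregel uses can be illustrated with a minimal single-machine sketch (function and variable names are ours, not Pregel's API): in each superstep, every vertex consumes only the messages sent to it in the previous superstep, updates its value, and emits messages along its outgoing edges; iteration stops when no value changes.

```python
def pregel_supersteps(graph, init, compute, max_steps=30):
    """Run a vertex program in synchronized supersteps.
    graph: vertex -> list of out-neighbors; init: vertex -> value;
    compute(value, messages, neighbors) -> (new_value, [(dest, msg)])."""
    values = dict(init)
    # Superstep 0: every vertex announces its value to its out-neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    for _ in range(max_steps):
        outbox = {v: [] for v in graph}
        changed = False
        for v in graph:
            new_value, out_msgs = compute(values[v], inbox[v], graph[v])
            if new_value != values[v]:
                changed = True
                values[v] = new_value
            for w, m in out_msgs:
                outbox[w].append(m)
        inbox = outbox                # messages arrive next superstep
        if not changed:
            break
    return values

# Example vertex program: propagate the maximum value through the graph.
def max_compute(value, messages, neighbors):
    new_value = max([value] + messages)
    if new_value > value:
        return new_value, [(w, new_value) for w in neighbors]
    return value, []

largest = pregel_supersteps({"a": ["b"], "b": ["c"], "c": ["a"]},
                            {"a": 3, "b": 1, "c": 2}, max_compute)
```

The vertex id plays the role of the key throughout, which is the same property iMapReduce exploits when the map and reduce functions both operate on node ids.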
The most relevant work is that of Ekanayake et al., who proposed Twister [5,6],
a stream-based MapReduce implementation that supports iterative applications.
Twister loads the input data only once, in the initialization stage, and performs
iterative map-reduce processing with long-running map/reduce daemons. iMapReduce
differs from Twister mainly in that Twister
stores intermediate data in memory, while iMapReduce stores intermediate data in
files. Twister loads all data in distributed memory for fast data access and rests on
the assumption that data sets and intermediate data can fit into the distributed mem-
ory of the computing infrastructure. iMapReduce aims at providing a MapReduce-
based iterative computing framework running on a cluster of commodity machines
where each node has limited memory resources. In iMapReduce, the intermediate
data, including the intermediate results, the shuffled data between map and reduce,