BIGS adopts a data-partition iterative computing model (such as the one described in
Section 2.1), through which workers can exploit locality-aware computation if the
user so desires. Under this model, distributed data analysis jobs are struc-
tured into a repeatable sequence of INIT-STATE, MAP, and AGGREGATE-
STATE operations, as shown in Figure 1.
Fig. 1. BIGS algorithm model
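The repeatable sequence in Figure 1 can be viewed as a dependency graph of operations in which an operation becomes runnable only once its predecessors have finished. The following is a minimal sketch of that scheduling idea; all class, method, and operation names here are illustrative assumptions, not BIGS internals:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy job dependency graph: an operation may start only when all of
// its predecessors have finished. Names are illustrative assumptions.
public class JobGraph {

    // operation -> the operations it depends on
    static final Map<String, List<String>> DEPS = Map.of(
            "INIT-STATE", List.of(),
            "MAP-0", List.of("INIT-STATE"),
            "MAP-1", List.of("INIT-STATE"),
            "AGGREGATE-STATE", List.of("MAP-0", "MAP-1"));

    // A worker scans the graph (stored in the reference NoSQL database
    // in BIGS) and takes over any operation whose predecessors are done.
    static String takeOver(Set<String> finished) {
        for (Map.Entry<String, List<String>> e : DEPS.entrySet()) {
            if (!finished.contains(e.getKey())
                    && finished.containsAll(e.getValue())) {
                return e.getKey();
            }
        }
        return null; // nothing is ready to run
    }

    public static void main(String[] args) {
        Set<String> finished = new HashSet<>();
        String op;
        while ((op = takeOver(finished)) != null) {
            System.out.println("running " + op);
            finished.add(op);
        }
    }
}
```

With an empty `finished` set only INIT-STATE is runnable; the two MAPs become runnable next, and AGGREGATE-STATE last, mirroring the ordering constraints of Figure 1.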
All operations receive a State Object as input and may produce another
State Object as output. No operation can start until the preceding ones have
finished, according to the job dependency graph in Figure 1, which is stored in the
reference NoSQL database. Input datasets to be processed are split into a user-
definable number of partitions, and there is one MAP operation per partition.
Each MAP operation can loop over the elements of the partition it is processing
and may produce elements for an output dataset. For instance, an image feature
extraction MAP operation would produce one or more feature vectors for each
input image. Workers take over operations by inspecting the job dependency
graph stored in the reference database. Developers program their algorithms by
providing implementations for the Java Process API methods. When a BIGS
worker takes over an operation, it creates the appropriate programming context
and makes the corresponding data available (through data caches) to the
implementation being invoked. As can be seen in Figure 2, AGGREGATE-STATE
operations use all the output states of the preceding MAP operations to create the
resulting state of the iteration or of the whole process.
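The cycle above can be sketched in Java using the image feature extraction example. The `State`, `initState`, `map`, and `aggregate` names are assumptions made for illustration; they are not the actual BIGS Process API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the INIT-STATE / MAP / AGGREGATE-STATE cycle.
// All type and method names are assumptions, not the BIGS Process API.
public class BigsSketch {

    // A State Object carries data between operations; here it simply
    // accumulates feature vectors.
    static class State {
        final List<double[]> featureVectors = new ArrayList<>();
    }

    // INIT-STATE: produce the initial (empty) state of the job.
    static State initState() {
        return new State();
    }

    // MAP: one operation per partition; it loops over the partition's
    // elements and emits output elements (a toy "feature extraction"
    // that returns the length of each image identifier).
    static State map(List<String> partition) {
        State out = new State();
        for (String image : partition) {
            out.featureVectors.add(new double[] { image.length() });
        }
        return out;
    }

    // AGGREGATE-STATE: merge the output states of all preceding MAP
    // operations into the resulting state of the iteration.
    static State aggregate(State initial, List<State> mapOutputs) {
        for (State s : mapOutputs) {
            initial.featureVectors.addAll(s.featureVectors);
        }
        return initial;
    }

    public static void main(String[] args) {
        // The input dataset, split into two partitions.
        List<List<String>> partitions = List.of(
                List.of("img-a.png", "img-bb.png"),
                List.of("img-ccc.png"));

        List<State> mapped = new ArrayList<>();
        for (List<String> p : partitions) {
            mapped.add(map(p)); // in BIGS, workers run these concurrently
        }
        State result = aggregate(initState(), mapped);
        System.out.println(result.featureVectors.size()); // 3 feature vectors
    }
}
```

Because each MAP touches only its own partition, the MAP calls are independent and can be assigned to different workers; only the final aggregation needs all of their outputs.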
2.2 Data Access and Caching Strategies
BIGS takes advantage of a data-parallel approach. Given a large amount of
data whose elements can be processed independently, BIGS can distribute them in data