iterative optimization of a loss function which is evaluated (several times) on a
training data set. For this, machine learning algorithms need to be redesigned
to efficiently access and reuse data in a synchronized manner to preserve their
behavior (i.e., how the prediction accuracy improves) whilst scaling horizontally to
analyze larger datasets.
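As a minimal illustration of this access pattern (our own Python sketch, not taken from the paper), consider plain batch gradient descent for least-squares regression: every iteration re-reads the entire training set, which is precisely why efficient data reuse and caching matter at scale.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, iterations=100):
    """Least-squares regression via batch gradient descent.

    Each iteration scans the whole training set (X, y), so the same
    data is reused many times over the course of the optimization.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        residual = X @ w - y                # full pass over the data
        gradient = X.T @ residual / len(y)
        w -= lr * gradient                  # synchronized update per iteration
    return w

# Toy usage: 10,000 points, 5 features (synthetic data for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)
w = batch_gradient_descent(X, y)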
Different complementary approaches exist today to deal with this problem,
such as using the map-reduce computing model [11], engineering stochastic or
online versions of existing algorithms [1,5], reformulating the algorithms in a
distributed manner [3,4], etc. Furthermore, several software frameworks have
emerged in recent years to support these processes in a scalable manner to
different extents [12,13,6]. Notably, Spark proposes a memory-only caching
architecture which has recently been gaining popularity.
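For context, the following sketch (ours, not part of the cited works; the file path and loss function are placeholders) shows Spark's standard RDD persistence API, which pins a dataset in memory so that iterative passes reuse it; the MEMORY_AND_DISK storage level is the closest built-in analogue to the combined disk/memory strategy explored in this work.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="caching-example")

# Load a (hypothetical) training set and pin it in memory only,
# which corresponds to Spark's memory-only caching behaviour.
points = sc.textFile("hdfs:///data/train.csv") \
           .map(lambda line: [float(v) for v in line.split(",")])
points.persist(StorageLevel.MEMORY_ONLY)    # memory-only cache

# MEMORY_AND_DISK spills partitions to local disk instead of recomputing
# them, closer in spirit to a combined memory/disk strategy.
# points.persist(StorageLevel.MEMORY_AND_DISK)

for _ in range(10):                          # iterative passes reuse the cache
    loss = points.map(lambda p: sum(v * v for v in p)).sum()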
This work complements that architecture by combining disk and memory caches
and enabling a finer-grained configuration for deciding which portions of the data
reside in the caches of each computing node. We focus on ML algorithms that fit
the Statistical Query Model [2] (see Section 2.1), which represents a vast majority
of the algorithms. This way, we obtain parallel versions with relatively low effort
and combine them with distributed cache strategies [8] to understand their behav-
ior and settle on suitable strategies for scalable machine learning methods. Our
experiments are based on BIGS (Big Image Data Analysis Toolkit), a framework
which enables programming such algorithms in a distributed and opportunistic
manner; they were run on a virtualized computing cluster using the OpenNeb-
ula stack at our universities' computing facilities. BIGS was extended by
implementing the cache strategies described and benchmarked in this work.
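To make the combined strategy concrete, the following is a hypothetical Python sketch, not the actual BIGS interface: a two-tier cache that keeps a bounded number of partitions in memory and spills the rest to local disk, with a per-partition flag standing in for the finer-grained placement configuration described above. The class name, spill directory, and policy are our own assumptions.

import os
import pickle

class TwoTierPartitionCache:
    """Hypothetical combined memory + disk cache for data partitions.

    This is only an illustration of per-partition placement decisions;
    it is NOT the BIGS API.
    """

    def __init__(self, memory_budget, spill_dir="/tmp/partition-cache"):
        self.memory_budget = memory_budget   # max partitions kept in RAM
        self.spill_dir = spill_dir
        self.memory = {}                     # partition_id -> data
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, partition_id, data, prefer_memory=True):
        # Per-partition policy: keep hot partitions in RAM, spill the rest.
        if prefer_memory and len(self.memory) < self.memory_budget:
            self.memory[partition_id] = data
        else:
            with open(self._path(partition_id), "wb") as f:
                pickle.dump(data, f)

    def get(self, partition_id):
        if partition_id in self.memory:      # memory hit
            return self.memory[partition_id]
        with open(self._path(partition_id), "rb") as f:   # disk hit
            return pickle.load(f)

    def _path(self, partition_id):
        return os.path.join(self.spill_dir, f"part-{partition_id}.pkl")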
Our results strongly favour strategies where (1) datasets are partitioned and
preloaded throughout the distributed memory of the different cluster nodes,
and (2) algorithms use data locality information to maximize data reuse and
synchronize computations at each iteration through the data. This supports the
convergence towards models where “computing goes to data”, as observed in other
Big Data related contexts.
The rest of this paper is structured as follows. Section 2 describes our imple-
mentation of parallel machine learning classification algorithms with a distributed
cache. Section 3 describes our experimental setup. Section 4 discusses the experi-
mental results and, finally, Section 5 presents our concluding remarks.
2 Distributed Machine Learning
A majority of ML algorithms adhere to the Statistical Query Model [2], through
which algorithms trying to learn a function f(x, y) of the data (such as one mea-
suring the prediction or classification error) are only required to return an estimate
of the expectation of f(x, y) using test and train data. Algorithms falling in this
class approach the problem through mathematical optimization (such as minimiz-
ing the prediction error) and use gradients or sufficient statistics for this. These
computations are typically expressible as a sum over data points and therefore
are partitionable and distributable. If we have m data points, a sum of a gradient
over them can be split into partial sums over disjoint subsets, each computed on a
different node and then aggregated, as sketched below.
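As an illustration of this summation form (again our own sketch, not the BIGS implementation), the partial gradient sums of a least-squares model can be computed independently per partition and then reduced:

import numpy as np

def partial_gradient(X_part, y_part, w):
    """Un-normalized least-squares gradient contribution of one partition."""
    residual = X_part @ w - y_part
    return X_part.T @ residual

def distributed_gradient(partitions, w, m):
    """Reduce the per-partition contributions and normalize by m.

    `partitions` is an iterable of (X_part, y_part) chunks whose sizes add
    up to m; in a real deployment each chunk would live in the cache of a
    different node and its partial sum would be computed there.
    """
    total = sum(partial_gradient(X, y, w) for X, y in partitions)
    return total / m

# Toy usage: split m = 1000 points into 4 partitions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5])
parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
g = distributed_gradient(parts, np.zeros(3), m=1000)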
 