iterative optimization of a loss function which is evaluated (several times) on a
training data set. For this, machine learning algorithms need to be redesigned
to efficiently access and reuse data in a synchronized manner to preserve their
behavior (i.e., how the prediction accuracy improves) whilst scaling horizontally to
analyze larger datasets.
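As a minimal illustration of this access pattern (our own Python sketch, not taken from the paper), consider plain batch gradient descent for least-squares regression: every iteration re-reads the entire training set, which is precisely why efficient data reuse and caching matter at scale.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, iterations=100):
    """Least-squares regression via batch gradient descent.

    Each iteration scans the whole training set (X, y), so the same
    data is reused many times over the course of the optimization.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        residual = X @ w - y                # full pass over the data
        gradient = X.T @ residual / len(y)
        w -= lr * gradient                  # synchronized update per iteration
    return w

# Toy usage: 10,000 points, 5 features (synthetic data for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)
w = batch_gradient_descent(X, y)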
Different complementary approaches exist today to deal with this problem,
such as using the map-reduce computing model [11], engineering stochastic or
online versions of existing algorithms [1,5], reformulating the algorithms in a
distributed manner [3,4], etc. Furthermore, several software frameworks have
emerged in recent years to support these processes in a scalable manner to
different extents [12,13,6]. Notably, Spark proposes a memory-only caching
architecture which has recently been gaining popularity.
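For context, the following sketch (ours, not part of the cited works; the file path and loss function are placeholders) shows Spark's standard RDD persistence API, which pins a dataset in memory so that iterative passes reuse it; the MEMORY_AND_DISK storage level is the closest built-in analogue to the combined disk/memory strategy explored in this work.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="caching-example")

# Load a (hypothetical) training set and pin it in memory only,
# which corresponds to Spark's memory-only caching behaviour.
points = sc.textFile("hdfs:///data/train.csv") \
           .map(lambda line: [float(v) for v in line.split(",")])
points.persist(StorageLevel.MEMORY_ONLY)    # memory-only cache

# MEMORY_AND_DISK spills partitions to local disk instead of recomputing
# them, closer in spirit to a combined memory/disk strategy.
# points.persist(StorageLevel.MEMORY_AND_DISK)

for _ in range(10):                          # iterative passes reuse the cache
    loss = points.map(lambda p: sum(v * v for v in p)).sum()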
This work complements that architecture by combining disk and memory caches
and enabling a finer-grained configuration for deciding which portions of the data
reside in the caches of each computing node. We focus on ML algorithms that fit
the Statistical Query Model [2] (see Section 2.1), which represents a vast majority
of the algorithms. This way, we obtain parallel versions with relatively low effort
and combine them with distributed cache strategies [8] to understand their behav-
ior and settle on suitable strategies for scalable machine learning methods. Our
experiments are based on BIGS (Big Image Data Analysis Toolkit), a framework
which enables programming such algorithms in a distributed and opportunistic
manner; they were run on a virtualized computing cluster using the OpenNeb-
ula stack at our universities' computing facilities. BIGS was extended by
implementing the cache strategies described and benchmarked in this work.
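To make the combined strategy concrete, the following is a hypothetical Python sketch, not the actual BIGS interface: a two-tier cache that keeps a bounded number of partitions in memory and spills the rest to local disk, with a per-partition flag standing in for the finer-grained placement configuration described above. The class name, spill directory, and policy are our own assumptions.

import os
import pickle

class TwoTierPartitionCache:
    """Hypothetical combined memory + disk cache for data partitions.

    This is only an illustration of per-partition placement decisions;
    it is NOT the BIGS API.
    """

    def __init__(self, memory_budget, spill_dir="/tmp/partition-cache"):
        self.memory_budget = memory_budget   # max partitions kept in RAM
        self.spill_dir = spill_dir
        self.memory = {}                     # partition_id -> data
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, partition_id, data, prefer_memory=True):
        # Per-partition policy: keep hot partitions in RAM, spill the rest.
        if prefer_memory and len(self.memory) < self.memory_budget:
            self.memory[partition_id] = data
        else:
            with open(self._path(partition_id), "wb") as f:
                pickle.dump(data, f)

    def get(self, partition_id):
        if partition_id in self.memory:      # memory hit
            return self.memory[partition_id]
        with open(self._path(partition_id), "rb") as f:   # disk hit
            return pickle.load(f)

    def _path(self, partition_id):
        return os.path.join(self.spill_dir, f"part-{partition_id}.pkl")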
Our results strongly favour strategies where (1) datasets are partitioned and
preloaded throughout the distributed memory of the different cluster nodes,
and (2) algorithms use data locality information to maximize data reuse and
synchronize computations at each iteration through the data. This supports the
convergence towards models where “computing goes to data”, as observed in other
Big Data related contexts.
The rest of this paper is structured as follows. Section 2 describes our imple-
mentation of parallel machine learning classification algorithms with a distributed
cache. Section 3 describes our experimental setup. Section 4 discusses the experi-
mental results and, finally, Section 5 presents our concluding remarks.
2 Distributed Machine Learning
A majority of ML algorithms adhere to the Statistical Query Model [2], through
which algorithms trying to learn a function f(x, y) of the data (such as one mea-
suring the prediction or classification error) are only required to return an estimate
of the expectation of f(x, y) using test and train data. Algorithms falling in this
class approach the problem through mathematical optimization (such as minimiz-
ing the prediction error) and use gradients or sufficient statistics for this. These
computations are typically expressible as a sum over data points and therefore
are partitionable and distributable. If we have m data points, a sum of a gradient
over them can be split into partial sums over disjoint subsets, each computed on a
different node and then aggregated, as sketched below.
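As an illustration of this summation form (again our own sketch, not the BIGS implementation), the partial gradient sums of a least-squares model can be computed independently per partition and then reduced:

import numpy as np

def partial_gradient(X_part, y_part, w):
    """Un-normalized least-squares gradient contribution of one partition."""
    residual = X_part @ w - y_part
    return X_part.T @ residual

def distributed_gradient(partitions, w, m):
    """Reduce the per-partition contributions and normalize by m.

    `partitions` is an iterable of (X_part, y_part) chunks whose sizes add
    up to m; in a real deployment each chunk would live in the cache of a
    different node and its partial sum would be computed there.
    """
    total = sum(partial_gradient(X, y, w) for X, y in partitions)
    return total / m

# Toy usage: split m = 1000 points into 4 partitions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5])
parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
g = distributed_gradient(parts, np.zeros(3), m=1000)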
 