Distributed Cache Strategies
for Machine Learning Classification Tasks
over Cluster Computing Resources
John Edilson Arévalo Ovalle 1, Raúl Ramos-Pollan 2,
and Fabio A. González 1
1 Universidad Nacional de Colombia
{jearevaloo,fagonzalezo}@unal.edu.co
2 Unidad de Supercómputo y Cálculo Científico,
Universidad Industrial de Santander, Colombia
rramosp@uis.edu.co
Abstract. Scaling machine learning (ML) methods to learn from large datasets requires devising distributed data architectures and algorithms to support their iterative nature, in which the same data records are processed several times. Data caching becomes key to minimizing data transmission across iterations at each node and thus to the overall scalability. In this work we propose a two-level caching architecture (disk and memory) and benchmark different caching strategies in a distributed machine learning setup over a cluster with no shared RAM. Our results strongly favour strategies where (1) datasets are partitioned and preloaded throughout the distributed memory of the cluster nodes and (2) algorithms use data locality information to synchronize computations at each iteration. This supports the convergence towards models where "computing goes to data", as observed in other Big Data contexts, and allows us both to align strategies for parallelizing ML algorithms and to configure computing infrastructures appropriately.
1 Introduction
Data caching strategies have become a key issue in scaling machine learning methods, which typically iterate several times over a given dataset aiming to reduce some error measure on their predictions. As dataset sizes increase, we need to adapt or even redesign the algorithms and devise the appropriate software and hardware architectures to support them. This is especially true if we want to endow our systems with horizontal scalability, where increased performance is achieved not by upgrading the existing computing resources (faster machines with more memory) but by adding more computing resources of a similar (commodity) kind.
In this sense, machine learning methods are particularly sensitive to sub-optimal architectures, as they typically need to iterate over and process large amounts of data that necessarily live on distributed storage. Despite the rich variety of machine learning methods, many of them follow a common pattern: the full dataset is traversed repeatedly, and at each pass the model is updated so that some error measure decreases.
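To make this pattern concrete, the following Python sketch (an illustration of the generic pattern, not the system evaluated in this paper) shows a gradient-style learner that sweeps repeatedly over disk-resident data partitions and keeps each partition in an in-memory cache after its first use. The partition file layout, the loader and the update rule are hypothetical placeholders.

```python
import numpy as np

def load_partition_from_disk(path):
    # Hypothetical loader: each partition is an .npz file holding a feature
    # matrix X and a target vector y. In a cluster, a node would read the
    # split stored on its local disk.
    data = np.load(path)
    return data["X"], data["y"]

def train(partition_paths, n_features, epochs=10, lr=0.01):
    model = np.zeros(n_features)
    memory_cache = {}  # second cache level: node RAM, filled on first access

    for epoch in range(epochs):
        total_error = 0.0
        for path in partition_paths:
            # First access hits the disk level; later iterations reuse the
            # in-memory copy, which is where the caching strategy dominates
            # the overall run time.
            if path not in memory_cache:
                memory_cache[path] = load_partition_from_disk(path)
            X, y = memory_cache[path]

            predictions = X @ model
            errors = predictions - y
            model -= lr * X.T @ errors / len(y)  # gradient-style update
            total_error += float(np.mean(errors ** 2))

        print(f"epoch {epoch}: mean squared error "
              f"{total_error / len(partition_paths):.4f}")
    return model
```

In this sketch the first epoch pays the disk cost while every subsequent epoch is served from memory; which data ends up cached where, and how the per-partition computations are synchronized across nodes, is precisely what the caching strategies benchmarked in this work control.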