Distributed Cache Strategies
for Machine Learning Classification Tasks
over Cluster Computing Resources
John Edilson Arévalo Ovalle 1, Raúl Ramos-Pollan 2,
and Fabio A. González 1
1 Universidad Nacional de Colombia
{jearevaloo,fagonzalezo}@unal.edu.co
2 Unidad de Supercómputo y Cálculo Científico,
Universidad Industrial de Santander, Colombia
rramosp@uis.edu.co
Abstract. Scaling machine learning (ML) methods to learn from large datasets requires devising distributed data architectures and algorithms to support their iterative nature, in which the same data records are processed several times. Data caching becomes key to minimizing data transmission across iterations at each node and thus to the overall scalability. In this work we propose a two-level caching architecture (disk and memory) and benchmark different caching strategies in a distributed machine learning setup over a cluster with no shared RAM. Our results strongly favour strategies where (1) datasets are partitioned and preloaded throughout the distributed memory of the cluster nodes and (2) algorithms use data locality information to synchronize computations at each iteration. This supports the convergence towards models where "computing goes to data", as observed in other Big Data contexts, and allows us both to align strategies for parallelizing ML algorithms and to configure computing infrastructures appropriately.
1 Introduction
Data caching strategies have become a key issue in scaling machine learning methods, which typically iterate several times over a given dataset aiming to reduce some error measure on their predictions. As dataset sizes increase, we need to adapt or even redesign the algorithms and devise the appropriate software and hardware architectures to support them. This is especially true if we want to endow our systems with horizontal scalability, where increased performance is achieved not by upgrading the existing computing resources (faster machines with more memory) but by adding more computing resources of a similar (commodity) kind.
In this sense, machine learning methods are particularly sensitive to sub-optimal architectures, as they typically need to iterate over and process large amounts of data that necessarily live on distributed storage. Despite the rich variety of machine learning methods, many of them follow a common pattern: the full dataset is traversed repeatedly, and at each pass the model is updated so that some error measure decreases.
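To make this pattern concrete, the following Python sketch (an illustration of the generic pattern, not the system evaluated in this paper) shows a gradient-style learner that sweeps repeatedly over disk-resident data partitions and keeps each partition in an in-memory cache after its first use. The partition file layout, the loader and the update rule are hypothetical placeholders.

```python
import numpy as np

def load_partition_from_disk(path):
    # Hypothetical loader: each partition is an .npz file holding a feature
    # matrix X and a target vector y. In a cluster, a node would read the
    # split stored on its local disk.
    data = np.load(path)
    return data["X"], data["y"]

def train(partition_paths, n_features, epochs=10, lr=0.01):
    model = np.zeros(n_features)
    memory_cache = {}  # second cache level: node RAM, filled on first access

    for epoch in range(epochs):
        total_error = 0.0
        for path in partition_paths:
            # First access hits the disk level; later iterations reuse the
            # in-memory copy, which is where the caching strategy dominates
            # the overall run time.
            if path not in memory_cache:
                memory_cache[path] = load_partition_from_disk(path)
            X, y = memory_cache[path]

            predictions = X @ model
            errors = predictions - y
            model -= lr * X.T @ errors / len(y)  # gradient-style update
            total_error += float(np.mean(errors ** 2))

        print(f"epoch {epoch}: mean squared error "
              f"{total_error / len(partition_paths):.4f}")
    return model
```

In this sketch the first epoch pays the disk cost while every subsequent epoch is served from memory; which data ends up cached where, and how the per-partition computations are synchronized across nodes, is precisely what the caching strategies benchmarked in this work control.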