\[
W_{t+1} = W_t - \eta \, \nabla L(W_t, X),
\]
where \(\eta\) is the learning rate.
This update is performed iteratively until some convergence criterion is met.
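As a concrete illustration, the following minimal Python sketch implements this iterative update, assuming a generic gradient function `grad_loss`, an initial weight vector `W0`, and a fixed learning rate `eta` (all illustrative names, not taken from the paper); convergence is checked on the change in the weights.

```python
import numpy as np

def gradient_descent(grad_loss, W0, X, y, eta=0.1, max_iters=100, tol=1e-6):
    """Iterate W_{t+1} = W_t - eta * grad L(W_t, X) until the weights stop changing."""
    W = W0
    for _ in range(max_iters):
        W_next = W - eta * grad_loss(W, X, y)   # the update rule shown above
        if np.linalg.norm(W_next - W) < tol:    # simple convergence criterion
            return W_next
        W = W_next
    return W
```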
An interesting characteristic of the gradient of the LR loss function is that its evaluation over the whole dataset is equivalent to the sum of the individual evaluations over each data sample, i.e.:
\[
\nabla L(W_t, X) = \sum_{i=1}^{m} \nabla L(W_t, x_i).
\]
This gives rise to a gradient function of the form described in Section 2 above, which can be computed in a distributed fashion by splitting the training dataset into several groups, independently computing the gradient for each group, and then summing the results to obtain the overall gradient.
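For concreteness, the following Python sketch (an illustration, not the BIGS implementation) shows this decomposition for a logistic-regression gradient: each partition's gradient is computed independently and the results are summed, matching the single-pass gradient over the full dataset.

```python
import numpy as np

def lr_gradient(W, X, y):
    """Gradient of the (unregularized) logistic-regression loss on a block of data."""
    p = 1.0 / (1.0 + np.exp(-X @ W))   # predicted probabilities
    return X.T @ (p - y)               # sum of per-sample gradients in this block

def distributed_gradient(W, partitions):
    """Sum the per-partition gradients to get the gradient over the full dataset."""
    return sum(lr_gradient(W, Xp, yp) for Xp, yp in partitions)

# Illustrative check: split a synthetic dataset into 10 partitions and compare
# the summed per-partition gradients against the single-pass gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
y = rng.integers(0, 2, size=1000)
W = np.zeros(784)
partitions = list(zip(np.array_split(X, 10), np.array_split(y, 10)))
assert np.allclose(distributed_gradient(W, partitions), lr_gradient(W, X, y))
```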
3 Experimental Setup
3.1 Dataset
The goal of our experiments was to measure the parallelization capabilities of a gradient descent based method over a fixed number of computing resources as the dataset size and the number of iterations over the data increased, using three different caching strategies: (1) no cache, (2) default caching, and (3) local-only caching.
For this, we used the MNIST dataset [7], containing about 60,000 images of handwritten digits (from 0 to 9), which is commonly used as a benchmark for computer vision algorithms. Each digit is contained in a 28×28 grayscale image and represented by a vector of 784 components holding the gray intensity of each pixel. Given a digit image, the machine learning task is to classify it as the digit it represents. Figure 2 shows a sample of the dataset.
Fig. 2. MNIST dataset sample
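As a small illustration of this representation (a sketch with placeholder data, not the actual MNIST loading code used in the experiments), each 28×28 image can be flattened into its 784-component intensity vector as follows:

```python
import numpy as np

# Placeholder batch standing in for MNIST: 60,000 grayscale images of 28x28 pixels.
images = np.zeros((60000, 28, 28), dtype=np.uint8)

# Flatten each image into the 784-component vector of pixel intensities described above.
vectors = images.reshape(len(images), 28 * 28)
print(vectors.shape)  # (60000, 784)
```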
3.2 Experimental Runs
We performed the evaluation on three dataset versions containing 30,000, 60,000, and 120,000 images. To build the 120k dataset the original dataset was duplicated; conversely, the 30k dataset was built by subsampling the original dataset at 50%. From each dataset, 80% was taken as the training set and the remaining 20% as the test set. Each dataset was split into 10 data partitions and loaded into the HBASE database to be processed by the BIGS workers. The gradient descent hyperparameters were fixed across all runs. Each run used 5 workers. We evaluated runs with 25, 40, and 100 iterations over the data. Each configuration was executed twice and the average of the two runs is reported. Table 1 shows the list of evaluated configurations.
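A sketch of how such dataset variants, splits, and partitions could be prepared is shown below; the loading into HBASE and the execution by the BIGS workers are not reproduced, and all function names here are illustrative assumptions.

```python
import numpy as np

def make_variants(X, y, seed=0):
    """Build the 120k (duplicated), 60k (original) and 30k (50% subsample) variants."""
    rng = np.random.default_rng(seed)
    half = rng.choice(len(X), size=len(X) // 2, replace=False)
    return {
        "120k": (np.concatenate([X, X]), np.concatenate([y, y])),
        "60k": (X, y),
        "30k": (X[half], y[half]),
    }

def split_and_partition(X, y, train_frac=0.8, n_partitions=10, seed=0):
    """80/20 train/test split, then cut the training set into 10 data partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    partitions = list(zip(np.array_split(X[train], n_partitions),
                          np.array_split(y[train], n_partitions)))
    return partitions, (X[test], y[test])
```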
 