\[
W_{t+1} = W_t - \eta \, \nabla L(W_t, X),
\]
where \(\eta\) is the learning rate.
This update is performed iteratively until some convergence criterion is met.
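As a concrete illustration, the following minimal Python sketch implements this iterative update, assuming a generic gradient function `grad_loss`, an initial weight vector `W0`, and a fixed learning rate `eta` (all illustrative names, not taken from the paper); convergence is checked on the change in the weights.

```python
import numpy as np

def gradient_descent(grad_loss, W0, X, y, eta=0.1, max_iters=100, tol=1e-6):
    """Iterate W_{t+1} = W_t - eta * grad L(W_t, X) until the weights stop changing."""
    W = W0
    for _ in range(max_iters):
        W_next = W - eta * grad_loss(W, X, y)   # the update rule shown above
        if np.linalg.norm(W_next - W) < tol:    # simple convergence criterion
            return W_next
        W = W_next
    return W
```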
An interesting characteristic of the gradient of the LR loss function is that its evaluation over the whole dataset is equivalent to the sum of the individual evaluations over each data sample, i.e.:
\[
\nabla L(W_t, X) = \sum_{i=1}^{m} \nabla L(W_t, x_i).
\]
This gives rise to a gradient function of the form described in Section 2 above, which can be computed in a distributed fashion by splitting the training dataset into several groups, independently computing the gradient for each group, and then summing the results to obtain the overall gradient.
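For concreteness, the following Python sketch (an illustration, not the BIGS implementation) shows this decomposition for a logistic-regression gradient: each partition's gradient is computed independently and the results are summed, matching the single-pass gradient over the full dataset.

```python
import numpy as np

def lr_gradient(W, X, y):
    """Gradient of the (unregularized) logistic-regression loss on a block of data."""
    p = 1.0 / (1.0 + np.exp(-X @ W))   # predicted probabilities
    return X.T @ (p - y)               # sum of per-sample gradients in this block

def distributed_gradient(W, partitions):
    """Sum the per-partition gradients to get the gradient over the full dataset."""
    return sum(lr_gradient(W, Xp, yp) for Xp, yp in partitions)

# Illustrative check: split a synthetic dataset into 10 partitions and compare
# the summed per-partition gradients against the single-pass gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
y = rng.integers(0, 2, size=1000)
W = np.zeros(784)
partitions = list(zip(np.array_split(X, 10), np.array_split(y, 10)))
assert np.allclose(distributed_gradient(W, partitions), lr_gradient(W, X, y))
```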
3 Experimental Setup
3.1 Dataset
The goal of our experiments was to measure the parallelization capabilities of a gradient descent based method over a fixed number of computing resources as the dataset size and the number of iterations over the data increased, using three different caching strategies: (1) no cache, (2) default caching, and (3) local-only caching.
For this, we used the MNIST dataset [7], containing about 60,000 images of handwritten digits (from 0 to 9), which is commonly used as a benchmark for computer vision algorithms. Each digit is contained in a 28×28 grayscale image and represented by a vector of 784 components holding the gray intensity of each pixel. Given a digit image, the machine learning task is to classify it as the digit it represents. Figure 2 shows a sample of the dataset.
Fig. 2. MNIST dataset sample
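As a small illustration of this representation (a sketch with placeholder data, not the actual MNIST loading code used in the experiments), each 28×28 image can be flattened into its 784-component intensity vector as follows:

```python
import numpy as np

# Placeholder batch standing in for MNIST: 60,000 grayscale images of 28x28 pixels.
images = np.zeros((60000, 28, 28), dtype=np.uint8)

# Flatten each image into the 784-component vector of pixel intensities described above.
vectors = images.reshape(len(images), 28 * 28)
print(vectors.shape)  # (60000, 784)
```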
3.2 Experimental Runs
We performed the evaluation on three dataset versions containing 30,000, 60,000, and 120,000 images. To build the 120k dataset the original dataset was duplicated; conversely, the 30k dataset was built by subsampling the original dataset at 50%. From each dataset, 80% was taken as the training set and the remaining 20% as the test set. Each dataset was split into 10 data partitions and loaded into the HBASE database to be processed by the BIGS workers. The gradient descent hyperparameters were fixed across all runs. Each run used 5 workers. We evaluated runs with 25, 40, and 100 iterations over the data. Each configuration was executed twice and the average of the two runs is reported. Table 1 shows the list of evaluated configurations.
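A sketch of how such dataset variants, splits, and partitions could be prepared is shown below; the loading into HBASE and the execution by the BIGS workers are not reproduced, and all function names here are illustrative assumptions.

```python
import numpy as np

def make_variants(X, y, seed=0):
    """Build the 120k (duplicated), 60k (original) and 30k (50% subsample) variants."""
    rng = np.random.default_rng(seed)
    half = rng.choice(len(X), size=len(X) // 2, replace=False)
    return {
        "120k": (np.concatenate([X, X]), np.concatenate([y, y])),
        "60k": (X, y),
        "30k": (X[half], y[half]),
    }

def split_and_partition(X, y, train_frac=0.8, n_partitions=10, seed=0):
    """80/20 train/test split, then cut the training set into 10 data partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    partitions = list(zip(np.array_split(X[train], n_partitions),
                          np.array_split(y[train], n_partitions)))
    return partitions, (X[test], y[test])
```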
 