4 Results
Metrics collected in the experiments focused on understanding (1) how different
cache configurations affected the total elapsed time of the experiments and (2)
how the processing time per data item (or per 1 million data items) evolved
across the different run configurations. Elapsed time is understood as the wall
time of each experiment from beginning to end. As each experiment was run twice,
results show averages of the two runs; deviations between the two runs were
insignificant.
Figure 4 (left) shows the experimental elapsed time as a function of the number
of data items processed. Here we averaged runs processing the same total number
of data items, regardless of the number of iterations and the dataset size. For
instance, a dataset with 30K data items run through 40 iterations processes a
total of 1.2M data items, whereas a dataset with 60K data items run through 25
iterations processes a total of 1.5M data items.
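As a rough illustration of this grouping, the following sketch (in Java, since the experiments ran on a Java virtual machine) keys each run by its total number of processed data items and averages elapsed times per key. The Run class, its fields, and the helper names are hypothetical and are not taken from the experimental code.

    import java.util.*;

    class Run {
        long datasetSize;      // e.g. 30_000 or 60_000 data items
        int iterations;        // e.g. 25 or 40 iterations
        double elapsedSeconds; // wall time of the whole experiment

        Run(long datasetSize, int iterations, double elapsedSeconds) {
            this.datasetSize = datasetSize;
            this.iterations = iterations;
            this.elapsedSeconds = elapsedSeconds;
        }

        // Total data items processed: 30K items x 40 iterations = 1.2M
        long totalItems() { return datasetSize * iterations; }
    }

    class GroupByTotalItems {
        // Average elapsed time of all runs that process the same total number of items
        static Map<Long, Double> averageByTotal(List<Run> runs) {
            Map<Long, List<Double>> buckets = new TreeMap<>();
            for (Run r : runs) {
                buckets.computeIfAbsent(r.totalItems(), k -> new ArrayList<>())
                       .add(r.elapsedSeconds);
            }
            Map<Long, Double> averages = new TreeMap<>();
            buckets.forEach((total, times) -> averages.put(total,
                    times.stream().mapToDouble(Double::doubleValue).average().orElse(0)));
            return averages;
        }
    }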
As can be seen, any caching strategy largely outperforms the absence of a cache
and, furthermore, caching strategies that use only local data tend to perform even
better than caching strategies that apply no criteria to select the data to work on.
Figure 5 shows the average time to process 1 million data items as a function
of the number of iterations (left) and of the dataset size (right). In all cases, as
the amount of data processed grows (either through iterations or through dataset
size), the time to process 1 million data items tends to decrease, probably because
the first data items are loaded into other low-level caches (processor, operating
system, Java virtual machine, etc.). In any case, we again see that exploiting local
data outperforms all other strategies.
Comparing the two caching strategies, we can observe in Figure 6 that, as
expected, using only local data results in a reduced number of writes (PUTs) to
the cache, whereas the number of reads is similar. This signals data reuse within
each cache, and we interpret it as the root cause of the observed improvement of
caching strategies that use data locality information.
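A minimal sketch of this behaviour, assuming a simple key-value cache interface: a locality-aware strategy writes only data it owns locally, which reduces PUTs while leaving GETs largely unaffected, and the HITs-per-PUT ratio discussed next can be derived from the same counters. The CountingCache class and its isLocal check are hypothetical and not part of the paper's implementation.

    import java.util.*;
    import java.util.function.Predicate;

    class CountingCache<K, V> {
        private final Map<K, V> store = new HashMap<>();
        private final Predicate<K> isLocal; // true if the key's data is owned locally
        long puts = 0, gets = 0, hits = 0;

        CountingCache(Predicate<K> isLocal) { this.isLocal = isLocal; }

        void put(K key, V value) {
            if (!isLocal.test(key)) return; // locality-aware: skip writes for non-local data
            puts++;
            store.put(key, value);
        }

        V get(K key) {
            gets++;                         // reads are issued regardless of locality
            V v = store.get(key);
            if (v != null) hits++;
            return v;
        }

        // HITs per PUT: an indicator of how often each cached item is re-used
        double hitsPerPut() { return puts == 0 ? 0 : (double) hits / puts; }
    }

A strategy with no selection criteria corresponds to passing a predicate that always returns true, so every processed item is written back to the cache.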
Figure 4 (right) shows the number of cache HITs per PUT with respect to
dataset size, which gives an indication of the degree of data reuse in each
configuration. In general, each data item is re-used about 5 times more often
Fig. 4. Total elapsed time of experiments (left) and data reuse (right)
 