Fig. 5. Time to process 1 million data points vs. the dataset size and the number of iterations
Fig. 6. Number of cache PUTS (writes) and HITS (reads) per dataset size
using cache with local only partitions but, interestingly enough, as we increase the dataset size, this reuse drops, probably due to RAM exhaustion on each computing node. This behavior remains a subject for further experimentation and analysis.
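To illustrate this effect, the following sketch models a memory-bounded, LRU-evicting cache of local partitions and counts PUTS (writes) and HITS (reads) as reported in Fig. 6. The class and parameter names are hypothetical and the sketch is not part of the experimental setup; it only shows how, once the partitioned working set exceeds the memory budget, eviction forces partitions to be re-loaded on the next pass and reuse collapses.

from collections import OrderedDict

class BoundedPartitionCache:
    """Hypothetical LRU cache of data partitions bounded by a memory budget."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()   # partition_id -> (data, size)
        self.puts = 0                  # cache writes, as in Fig. 6
        self.hits = 0                  # cache reads served from memory

    def get(self, partition_id, load_fn):
        if partition_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(partition_id)   # mark as recently used
            return self.entries[partition_id][0]
        data = load_fn(partition_id)                  # miss: load from disk or a remote node
        size = len(data)                              # simplification: size taken as byte length
        while self.entries and self.used_bytes + size > self.capacity_bytes:
            _, (_, evicted_size) = self.entries.popitem(last=False)   # evict LRU partition
            self.used_bytes -= evicted_size
        self.entries[partition_id] = (data, size)
        self.used_bytes += size
        self.puts += 1
        return data

# Two passes over 10 partitions of 1 MB each with an 8 MB budget: the second
# pass keeps evicting exactly the partitions it is about to need again, so no
# hits are recorded at all (puts = 20, hits = 0).
cache = BoundedPartitionCache(capacity_bytes=8 * 1024 * 1024)
for _ in range(2):
    for pid in range(10):
        cache.get(pid, load_fn=lambda p: bytes(1024 * 1024))
print(cache.puts, cache.hits)

With a budget larger than the dataset, the same two passes yield puts = 10 and hits = 10, which matches the high reuse observed for the smaller datasets.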
Finally, it is worth mentioning that our experiments produced an average training accuracy of 86.77% in digit recognition (with a standard deviation of 0.59), and an average accuracy of 86.91% on the test data (with a standard deviation of 0.76). These figures fall within the accuracy reported for similar methods on the MNIST dataset, both in terms of stability (very low standard deviation, under 1%) and generalization (very small gap between training and test accuracy), and confirm the sound behavior of the algorithms throughout all experiments.
5 Conclusions
Caching strategies are key to scaling iterative machine learning methods. This work shows that different strategies yield different scalability properties and, thus, that caching architectures for distributed data must be taken into account when devising scalable algorithms. Our results show that strategies favoring cache reuse across the iterations over the data outperform simpler strategies, but this requires the algorithms (or the frameworks used) to keep track of and exploit data locality, combining different levels of caching (disk and memory), as sketched below. This supports the convergence towards models where “computing
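A minimal sketch of what combining memory and disk caching could look like is given below; the function and parameter names are hypothetical, and this is not the caching architecture used in our experiments, only an illustration of a lookup order that keeps partitions local to a node across iterations.

import os
import pickle

def tiered_get(key, mem_cache, spill_dir, compute_fn):
    """Hypothetical two-level lookup: RAM first, then local disk, then recompute."""
    os.makedirs(spill_dir, exist_ok=True)
    if key in mem_cache:                      # level 1: in-memory hit
        return mem_cache[key]
    spill_path = os.path.join(spill_dir, f"{key}.pkl")
    if os.path.exists(spill_path):            # level 2: local-disk hit
        with open(spill_path, "rb") as f:
            value = pickle.load(f)
        mem_cache[key] = value                # promote back to memory for the next iteration
        return value
    value = compute_fn(key)                   # miss: compute or fetch from a remote node
    mem_cache[key] = value                    # no eviction here, kept simple on purpose
    with open(spill_path, "wb") as f:         # spill a copy so later iterations stay local
        pickle.dump(value, f)
    return value

Because the disk tier lives on the same node, iterations after the first can be served without re-reading or re-shuffling data across the cluster, which is the kind of locality-aware, multi-level caching our results favor.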