=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9206
Accuracy                                      100%
Reliability                               66.6667%
Reliability (standard deviation)            0.5774
Running a simple Bayesian classifier with the binaries included in the Apache
Mahout distribution only hints at the kinds of applications that can be built with
this software. Far more sophisticated machine learning projects are possible using
the underlying Java interface. Thanks to an active developer and user community,
new features are added to Mahout's core libraries regularly.
MLbase: Distributed Machine Learning Framework
Mahout is not the only distributed machine learning system, but its integration with
Hadoop is a compelling reason to consider it for building applications. One criticism
of MapReduce-based approaches to data analysis is that performance is often
suboptimal. For some batch-processing jobs on data much larger than available
memory, MapReduce is still often the best way to solve the problem. Nevertheless,
MapReduce relies heavily on disk access, which slows down workloads that pass
over the same data repeatedly.
The AMPLab, a group of researchers at UC Berkeley, has been approaching
data challenges by building new open-source software designed with performance
in mind from the start. One of its core projects, Spark, is an in-memory system
for cluster computing. Spark aims to rethink distributed systems by avoiding disk
access as much as possible. It is built around the idea of reusable, memory-resident
chunks of data that can be processed without resorting to reads from disk. For
machine learning tasks, this can be very beneficial, as some predictive or clustering
models may change only incrementally as new data is added. To take advantage of
the Spark distributed environment, the AMPLab has sponsored a project called
MLbase. MLbase consists of several parts. The first component is a
general-purpose machine learning library, called MLlib, that is similar to Mahout in
many ways. It provides a low-level, Spark-compatible interface to machine learning
algorithms. MLI is an API layer that sits on top of MLlib, providing a higher-level
interface to the underlying system. Perhaps the most exciting tool in this stack is the
MLOptimizer, which attempts to choose the correct algorithm based on the data and
task provided. The MLbase platform, although newer than the Mahout project, may
prove to be a viable option for working with large-scale machine learning tasks.
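
The benefit of keeping data in memory for iterative machine learning work can be
sketched with a toy example. The code below is plain Python and does not use the
Spark or MLlib APIs. The DiskBackedSource class, the fit_slope function, and the
sample points are all hypothetical names and values invented for illustration. The
sketch fits a one-parameter model by gradient descent twice: once rescanning the
source on every pass, as a disk-bound job would, and once reusing rows that were
read a single time and kept in memory.

```python
# Toy illustration only: plain Python, not the Spark or MLlib API.

class DiskBackedSource:
    """Stands in for a dataset stored on disk and counts full scans."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def read_all(self):
        self.reads += 1  # each call represents one full scan from disk
        return list(self._rows)


def fit_slope(get_rows, iterations=50, learning_rate=0.01):
    """Fit y = w * x by batch gradient descent, fetching the rows on every pass."""
    w = 0.0
    for _ in range(iterations):
        rows = get_rows()
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in rows) / len(rows)
        w -= learning_rate * grad
    return w


# Hypothetical sample data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Disk-bound style: every iteration rescans the source.
disk_source = DiskBackedSource(points)
w_disk = fit_slope(disk_source.read_all)

# In-memory style: scan once, then reuse the cached rows on every iteration.
cached_source = DiskBackedSource(points)
cached_rows = cached_source.read_all()
w_cached = fit_slope(lambda: cached_rows)

print("scans when rescanning each pass:", disk_source.reads)
print("scans when reusing cached rows:", cached_source.reads)
print("fitted slopes:", w_disk, w_cached)
```

Both runs arrive at the same slope, but the first scans the source once per
iteration while the second scans it only once. On a real cluster the saving would
depend on data size, available memory, and the algorithm, so the scan counts here
show only the direction of the effect, not a measurement.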