Building a Data Classification System with Mahout - Data Just Right: Introduction to Large-Scale Data and Analytics

=======================================================

Statistics

-------------------------------------------------------

Kappa 0.9206

Accuracy 100%

Reliability 66.6667%

Reliability (standard deviation) 0.5774
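
The Kappa and Accuracy figures in a summary like this are derived from a confusion matrix. As a rough illustration, the following plain-Java sketch computes accuracy and the textbook form of Cohen's kappa from a small, made-up confusion matrix. The class name KappaSketch and the sample counts are assumptions for illustration only; this is not Mahout's code, Mahout's own calculation may differ, and the sketch is not expected to reproduce the figures above.

```java
// Illustrative only: textbook accuracy and Cohen's kappa, not Mahout's implementation.
class KappaSketch {

    // confusion[i][j] = number of items whose true class is i and predicted class is j.
    static double accuracy(long[][] confusion) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < confusion.length; i++) {
            for (int j = 0; j < confusion[i].length; j++) {
                total += confusion[i][j];
                if (i == j) {
                    correct += confusion[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    // kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    // Undefined (division by zero) when chance agreement is exactly 1.
    static double kappa(long[][] confusion) {
        int n = confusion.length;
        long total = 0;
        long[] rowSums = new long[n];
        long[] colSums = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSums[i] += confusion[i][j];
                colSums[j] += confusion[i][j];
                total += confusion[i][j];
            }
        }
        double observed = accuracy(confusion);
        double expected = 0.0;
        for (int k = 0; k < n; k++) {
            expected += ((double) rowSums[k] / total) * ((double) colSums[k] / total);
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Hypothetical two-class results: rows are true labels, columns are predictions.
        long[][] confusion = { {20, 5}, {10, 15} };
        System.out.println("accuracy = " + accuracy(confusion));
        System.out.println("kappa    = " + kappa(confusion));
    }
}
```

Kappa discounts the agreement that would be expected by chance, which is why it can sit well below the raw accuracy when one class dominates.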

Running a simple Bayesian classifier using the included binaries in the Apache

Mahout distribution is just the tip of the iceberg for the kinds of applications that

can be built with this software. Indeed, it is possible to build far more

complicated machine learning projects using the underlying Java interface. Thanks to a

vibrant developer and user community, new features are being added to Mahout's core

libraries every day.

Framework

Mahout is not the only distributed machine learning system, but its integration with

Hadoop is a very compelling reason to consider it for building applications. One of the

criticisms of MapReduce-based approaches to data analysis is that performance is not

optimal. For some batch-processing jobs on data sizes that are much larger than available

memory, MapReduce is still often the best way to solve a problem. Nevertheless,

MapReduce is heavily reliant on disk access.

The AMPLab, a group of researchers from UC Berkeley, has been approaching

data challenges by building new open-source software applications that have performance

in mind from the start. One of the core projects, Spark, is an in-memory

implementation for cluster computing. Spark aims to rethink distributed systems by

avoiding disk access as much as possible. Spark is also built around the idea of reusable

memory-based chunks of data that can be processed without having to resort to reads

from disk. For machine learning tasks, this can be very beneficial, as some predictive

or clustering models may change only incrementally as new data is added. In order

to take advantage of the Spark distributed environment, the AMPLab has sponsored

a project called MLbase. MLbase consists of several parts. The first component is a

general-purpose machine learning library, called MLlib, that is similar to Mahout in

many ways. It provides a low-level, Spark-compatible interface to machine learning

algorithms. The MLI is an API layer that sits on top of MLlib, providing a higher-level

interface to the underlying system. Perhaps the most exciting tool in this stack is the

MLOptimizer, which attempts to choose the correct algorithm based on the data and

task provided. The MLbase platform, although newer than the Mahout project, may

prove to be a viable option for working with large-scale machine learning tasks.
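
To make the point about incrementally changing models concrete, here is a minimal plain-Java sketch of a cluster centroid that stays in memory and is updated as each new data point arrives, without rereading earlier points. The class IncrementalCentroid and its methods are hypothetical names chosen for illustration; this is not the Spark or MLlib API, only a single-machine picture of the idea.

```java
// Illustrative only: a running mean held in memory, not Spark or MLlib code.
class IncrementalCentroid {
    private final double[] mean;
    private long count;

    IncrementalCentroid(int dimensions) {
        this.mean = new double[dimensions];
    }

    // Fold one new point into the centroid without revisiting earlier points.
    void add(double[] point) {
        if (point.length != mean.length) {
            throw new IllegalArgumentException("point has the wrong number of dimensions");
        }
        count++;
        for (int d = 0; d < mean.length; d++) {
            // Running-mean update: move the mean toward the new value by 1/count.
            mean[d] += (point[d] - mean[d]) / count;
        }
    }

    // Returns a copy so callers cannot modify the internal state.
    double[] centroid() {
        return mean.clone();
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        IncrementalCentroid c = new IncrementalCentroid(2);
        c.add(new double[] {1.0, 2.0});
        c.add(new double[] {3.0, 4.0});
        c.add(new double[] {5.0, 6.0});
        double[] m = c.centroid();
        System.out.println("points seen: " + c.count());
        System.out.println("centroid: (" + m[0] + ", " + m[1] + ")");
    }
}
```

Each update touches only the state already in memory, so the cost of adding a point does not grow with the amount of data seen so far. A disk-bound batch job would instead have to reread the full dataset to recompute the same value.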
