Building a Data Classification System with Mahout - Data Just Right: Introduction to Large-Scale Data and Analytics

Database Reference

In-Depth Information

Summary

Although machines can't yet match the learning ability of humans, it is possible to cre-

ate systems that can modify their output based on new data. These systems are part of

the large and growing field of machine learning tools. Machine learning encompasses

a huge variety of use cases, from computer vision to text classification to biological

modeling. For projects that collect large amounts of Web-scale data, machine learn-

ing techniques are often the only viable way to provide predictive value to customers.

Common successful use cases for machine learning approaches include product-

recommendation systems, demographic grouping, and sorting out spam from inboxes.

There are many potential pitfalls to applying machine learning techniques to

large-scale data analysis problems. Sometimes there is little requirement to use entire

datasets, as models and analysis with samples of data can often be just as effective.

Overfitting a predictive model to existing data can produce false results. A similar

concern is that of ignoring the trade-off between sample bias and sample variance.

Using a more general, biased model when creating predictive models may provide

greater predictive use at the expense of highly variant results. Creating models that

tightly match existing data may prevent f lexibility when new data is collected. In sum-

mary, it is important to thoroughly understand the problem space and to consider the

trade-offs inherent in algorithmic approaches to solving machine learning problems.

Machine learning approaches have been effectively used for several classes of

Internet-scale data-collection problems. Classification systems use models built from

existing data to classify incoming data. A familiar implementation of a classification

can be found in the spam-detection systems common in Web mail systems. Cluster

analysis attempts to take unclassified data and form groups according to certain param-

eters. Cluster analysis can be used to create logical groupings of individuals, such as

customer demographics, that are based on a statistical metric. Recommendation sys-

tems are also a very popular use of machine learning algorithms. Many Web services,

from media to shopping to online-dating Web sites, use recommendation algorithms

to provide value to customers.

Machine learning algorithms have been developed for many systems, but as data

sizes grow, the need for heavy processing in a timely fashion can overwhelm a single

machine. Apache Mahout is an open-source set of libraries that help scale machine

learning tasks over clusters of machines, using distributed frameworks such as Apache

Hadoop. Mahout focuses on a collection of well-understood use cases, including clus-

tering, classification, and recommendation. Although Mahout is primarily a library

used to build other software packages, it also provides a great deal of useful binary

tools for running distributed machine learning tasks from the command line. Appli-

cations built using the Mahout library can be easily integrated with the rest of the

Hadoop ecosystem.

Mahout is just one of several tools that can be used for distributed machine learn-

ing. Integration with the Hadoop ecosystem can be a compelling reason to use

Search WWH ::

Custom Search

Home