Database Reference
In-Depth Information
Summary
Although machines can't yet match the learning ability of humans, it is possible to cre-
ate systems that can modify their output based on new data. These systems are part of
the large and growing field of machine learning tools. Machine learning encompasses
a huge variety of use cases, from computer vision to text classification to biological
modeling. For projects that collect large amounts of Web-scale data, machine learn-
ing techniques are often the only viable way to provide predictive value to customers.
Common successful use cases for machine learning approaches include product-
recommendation systems, demographic grouping, and sorting out spam from inboxes.
There are many potential pitfalls to applying machine learning techniques to
large-scale data analysis problems. Sometimes there is little requirement to use entire
datasets, as models and analysis with samples of data can often be just as effective.
Overfitting a predictive model to existing data can produce false results. A similar
concern is that of ignoring the trade-off between sample bias and sample variance.
Using a more general, biased model when creating predictive models may provide
greater predictive use at the expense of highly variant results. Creating models that
tightly match existing data may prevent f lexibility when new data is collected. In sum-
mary, it is important to thoroughly understand the problem space and to consider the
trade-offs inherent in algorithmic approaches to solving machine learning problems.
Machine learning approaches have been effectively used for several classes of
Internet-scale data-collection problems. Classification systems use models built from
existing data to classify incoming data. A familiar implementation of a classification
can be found in the spam-detection systems common in Web mail systems. Cluster
analysis attempts to take unclassified data and form groups according to certain param-
eters. Cluster analysis can be used to create logical groupings of individuals, such as
customer demographics, that are based on a statistical metric. Recommendation sys-
tems are also a very popular use of machine learning algorithms. Many Web services,
from media to shopping to online-dating Web sites, use recommendation algorithms
to provide value to customers.
Machine learning algorithms have been developed for many systems, but as data
sizes grow, the need for heavy processing in a timely fashion can overwhelm a single
machine. Apache Mahout is an open-source set of libraries that help scale machine
learning tasks over clusters of machines, using distributed frameworks such as Apache
Hadoop. Mahout focuses on a collection of well-understood use cases, including clus-
tering, classification, and recommendation. Although Mahout is primarily a library
used to build other software packages, it also provides a great deal of useful binary
tools for running distributed machine learning tasks from the command line. Appli-
cations built using the Mahout library can be easily integrated with the rest of the
Hadoop ecosystem.
Mahout is just one of several tools that can be used for distributed machine learn-
ing. Integration with the Hadoop ecosystem can be a compelling reason to use
 
 
Search WWH ::




Custom Search