researchers were able to determine the identities of individuals from data on public
movie-ratings sites.
Of the many classes of problems that can be usefully solved with machine learning
tools, recommendation algorithms seem to be the most human. Every day, we depend
on our social circles for advice on what to buy, watch, and vote on. With the ubiquity
of online commerce, this problem space is becoming more valuable by the day.
There are many ways to build recommendation engines. One method is to take
existing customer choices and use these to try to predict future choices. Another
approach is to look at content itself. Are some movies inherently similar to others?
Action movies, for example, tend to share similar features: they are noisy, fast-paced, and colorful.
A computer may be able to identify traits in a particular type of media and build a
classification system accordingly.
Apache Mahout: Scalable Machine Learning
Many technologies introduced in this topic are related in some way to the Hadoop
MapReduce framework. MapReduce is an algorithmic approach to breaking up large
data problems—those that cannot be easily tackled by a single computer—into smaller
ones that can be distributed across a number of separate machines.
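As a rough illustration of that idea, the sketch below imitates the map, shuffle, and reduce phases in a single Python process, computing an average rating per movie. The record layout and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical. A real framework such as Hadoop would run the map and reduce steps on separate machines and handle the grouping itself.

```python
from collections import defaultdict

# Hypothetical input: each record is (user, movie, rating).
records = [
    ("ann", "Alien", 5), ("bob", "Alien", 4),
    ("cat", "Alien", 1), ("bob", "Heat", 5), ("ann", "Heat", 4),
]

def map_phase(record):
    """Emit (key, value) pairs; here, one (movie, rating) pair per record."""
    _user, movie, rating = record
    yield movie, rating

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result (the mean)."""
    return key, sum(values) / len(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(run_job(records))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps could be spread across many workers without changing the result.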
Implementing the algorithms used in many machine learning tasks can be challenging
enough, and it is even more difficult to parallelize these across a number of different
machines. In 2006, a group of computer scientists from Stanford (including Andrew Ng,
a cofounder of Coursera) published a paper called “Map-Reduce for Machine
Learning on Multicore.” This paper described how a MapReduce framework could be
applied to a wide variety of machine learning problems, including clustering, Bayesian
classification, and regression.
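To give a feel for why regression fits this pattern, the sketch below fits a straight line by having each data chunk compute a handful of partial sums (the map step) and then adding those sums together (the reduce step). It is written in the spirit of that approach rather than taken from the paper. The function names `partial_sums`, `combine`, and `fit_line` are hypothetical, and the example is restricted to a single input feature to keep it short.

```python
from functools import reduce

def partial_sums(chunk):
    """Map step: sufficient statistics for one chunk of (x, y) points."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: add two tuples of statistics element-wise."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(chunks):
    """Least-squares slope and intercept from the combined statistics."""
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_sums, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two chunks of points lying on y = 2x + 1, as if stored on two machines.
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
print(fit_line(chunks))
```

Only the five summary numbers per chunk need to travel between machines, not the raw data points, which is what makes the computation straightforward to distribute.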
Meanwhile, open-source developers working on the Apache Lucene search index
project (which was started by Hadoop creator Doug Cutting) began to explore adding
machine learning features to the software. This work eventually grew into its own
project, which became Apache Mahout. As Mahout came into its own, it grew to include
MapReduce-related features similar to those explored in the paper by Andrew Ng and
others.
Apache Mahout has become a very popular project, not least because it is
designed to solve very practical machine learning problems. Mahout is essentially a
set of Java libraries designed to make machine learning applications easier to build.
Although one of the goals of the project is to be somewhat agnostic as to which
platform it is used with, it is well integrated with Apache Hadoop, the most commonly
used open-source MapReduce framework. Mahout also allows new users to get started
with common use cases quickly. Like Apache Hive (which provides an SQL-like
interface for querying data in Hadoop's distributed filesystem), Mahout translates machine
learning tasks expressed in Java into MapReduce jobs.
 
 