Building a Data Classification System with Mahout - Data Just Right: Introduction to Large-Scale Data and Analytics

researchers were able to determine the identities of individuals based on data from public
movie-ratings sites.

Of the many classes of problems that can be usefully solved with machine learning
tools, recommendation algorithms seem to be the most human. Every day, we depend
on our social circles for advice on what to buy, watch, and vote on. With the ubiquity
of online commerce, this problem space is becoming more valuable by the day.

There are many ways to build recommendation engines. One method is to take
existing customer choices and use them to predict future choices. Another
approach is to look at the content itself. Are some movies inherently similar to others?
Action movies tend to share similar features; they are noisy, fast-paced, and colorful.
A computer may be able to identify traits in a particular type of media and build a
classification system accordingly.
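
The first approach, predicting future choices from existing ones, can be sketched in a few
lines. This is a minimal illustration only, not how Mahout implements it: the ratings, the
user and movie names, and the helper functions `cosine_similarity` and `recommend` are all
hypothetical, and cosine similarity over shared movies is just one simple choice of
similarity measure.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
RATINGS = {
    "ana":  {"Alien": 5, "Heat": 4, "Amelie": 1},
    "ben":  {"Alien": 4, "Heat": 5, "Speed": 5},
    "cara": {"Amelie": 5, "Heat": 1, "Chocolat": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=1):
    """Rank movies the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ana", RATINGS))
```

Here "ana" rates movies much as "ben" does, so the movie that only "ben" has rated ranks
ahead of the one that only "cara" has rated.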

Many technologies introduced in this topic are related in some way to the Hadoop
MapReduce framework. MapReduce is an algorithmic approach to breaking up large
data problems (those that cannot be easily tackled by a single computer) into smaller
ones that can be distributed across a number of separate machines.
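
The map, group, and reduce steps can be simulated on a single machine to show the shape of
the idea. The sketch below is an assumption-laden toy, not Hadoop's API: the function names
(`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up, and
each inner list merely stands in for the slice of data one machine would hold.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (movie, 1)."""
    user, movie, rating = record
    yield movie, 1

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)

def run_job(chunks):
    # Each chunk is mapped independently; only the grouped pairs are combined.
    mapped = [pair
              for chunk in chunks
              for record in chunk
              for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical (user, movie, rating) records split across two "machines".
chunks = [
    [("ana", "Alien", 5), ("ben", "Alien", 4)],
    [("ben", "Heat", 5), ("cara", "Heat", 1), ("cara", "Amelie", 5)],
]
print(run_job(chunks))
```

Because the map step looks at one record at a time, the chunks can be processed in any order
and on any machine before the results are grouped and reduced.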

Implementing the algorithms used in many machine learning tasks can be challenging
enough, and it is even more difficult to parallelize them across a number of different
machines. In 2006, a group of computer scientists from Stanford (including Andrew Ng,
a co-founder of Coursera) published a paper called “Map-Reduce for Machine
Learning on Multicore.” This paper described how a MapReduce framework could be
applied to a wide variety of machine learning problems, including clustering, Bayesian
classification, and regression.
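
One way to see why regression fits this model: a least-squares line depends on the data only
through a handful of sums, so each machine can compute partial sums over its own chunk and
the partial sums can then be added together. The sketch below illustrates that idea for a
single input variable. It is not code from the paper, and the function names and sample
points are hypothetical.

```python
from functools import reduce

def map_partial_sums(chunk):
    """Per-chunk sums needed for a least-squares line: n, Σx, Σy, Σx², Σxy."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def add_sums(a, b):
    """Reduce step: add two tuples of partial sums element by element."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(chunks):
    """Fit y = slope * x + intercept from the combined sums."""
    n, sx, sy, sxx, sxy = reduce(add_sums, map(map_partial_sums, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical points lying on y = 2x + 1, split across two chunks.
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
slope, intercept = fit_line(chunks)
print(slope, intercept)
```

Only five numbers per chunk need to travel between machines, no matter how many data points
each chunk contains.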

Meanwhile, open-source developers working on the Apache Lucene search index
project (which was started by Hadoop creator Doug Cutting) began to explore adding
machine learning features to the software. This work eventually grew into its own
project, which became Apache Mahout. As Mahout came into its own, it grew to include
MapReduce-related features similar to those explored in the paper by Andrew Ng and
others.

Apache Mahout has become a very popular project, not least because it is
designed to solve very practical machine learning problems. Mahout is essentially a
set of Java libraries designed to make machine learning applications easier to build.
Although one of the goals of the project is to be somewhat agnostic as to which platform
it is used with, it is well integrated with Apache Hadoop, the most commonly
used open-source MapReduce framework. Mahout also allows new users to get started
with common use cases quickly. Like Apache Hive (which provides a SQL-like interface
for querying data in Hadoop's distributed filesystem), Mahout translates machine
learning tasks expressed in Java into MapReduce jobs.
