Analytic Helpers - Field Guide to Hadoop


You have a bunch of data in your Hadoop cluster. What are you going to do with it? You might want to do some analytics, or data science, or machine learning. Much of this can be done in some of the tools that come with the standard Apache distribution, such as Pig, MapReduce, or Hive. But more sophisticated uses will involve algorithms that you will not want to code yourself. So you turn to Mahout. What is Mahout? Mahout is a collection of scalable machine-learning algorithms that run on Hadoop. Why is it called Mahout? Mahout is the Hindi word for an elephant handler, as you can see from the logo. The list of algorithms is constantly growing, but as of March 2014, it includes the ones listed in Table 5-1.

Table 5-1. Mahout MapReduce algorithms

Mahout algorithm | Brief description
k-means/fuzzy k-means clustering | Clustering divides a set of observations into groups such that elements within a group are similar and the groups are distinct
Latent Dirichlet allocation | LDA is a modeling technique often used for classifying documents based on the use of specific topic terms in the document
Singular value decomposition | SVD is difficult to explain succinctly without a lot of linear algebra and eigenvalue background
Logistic-regression-based classifier | Logistic regression is used to predict variables that have a zero-one value, such as presence or absence of a disease, or membership in a group
Complementary naive Bayes classifier | Another classification scheme making use of Bayes' theorem (which you may remember from Statistics 101)
Random forest decision tree-based classifier | Yet another classifier based on decision trees
Collaborative filtering | Used in recommendation systems (if you like X, may we suggest Y)
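
To make the clustering entry in Table 5-1 a little more concrete, here is a minimal sketch of k-means in plain Python. It is not Mahout code and does not use Mahout's API. The function name kmeans, the sample points, and the starting centroids are all made up for illustration. The sketch runs on a single machine, whereas Mahout's MapReduce version spreads roughly the same two steps (assign points, then recompute centroids) across a cluster.

```python
# Illustrative only: plain-Python k-means on 2-D points, not Mahout's API.
from math import dist


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of k-means iterations from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignments


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
print(centroids)
```

Each point ends up labeled with the index of its nearest centroid, which is the "similar within a group, distinct between groups" idea from the table. Real workloads also need a strategy for choosing the number of clusters and the starting centroids, which this sketch sidesteps by hard-coding both.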

A fuller discussion of all these is well beyond the scope of this topic. There are many good introductions to machine learning available. Google is your friend here.

In April 2014, the Mahout community announced that it was moving away from MapReduce toward a Scala-based domain-specific language (DSL) and a Spark implementation (described here). Existing MapReduce algorithms would continue to be supported, but new additions to the code base could no longer be MapReduce-based. In fact, in the latest release, the Mahout community has dropped support for some infrequently used routines.
