You have a bunch of data in your Hadoop cluster. What are you going to do with it? You
might want to do some analytics, or data science, or machine learning. Much of this can be
done in some of the tools that come with the standard Apache distribution, such as Pig,
MapReduce, or Hive. But more sophisticated uses will involve algorithms that you will not
want to code yourself. So you turn to Mahout. What is Mahout? Mahout is a collection of
scalable machine-learning algorithms that run on Hadoop. Why is it called Mahout? Mahout
is the Hindi word for an elephant handler, as you can see from the logo. The list of
algorithms is constantly growing, but as of March 2014, it includes the ones listed in Table 5-1.
Table 5-1. Mahout MapReduce algorithms
Mahout algorithm | Brief description
k-means/fuzzy k-means clustering | Clustering divides a set of observations into groups such that elements within a group are similar and the groups are distinct
Latent Dirichlet allocation | LDA is a modeling technique often used to classify documents based on the specific topic terms used in each document
Singular value decomposition | SVD is difficult to explain succinctly without a lot of linear algebra and eigenvalue background
Logistic-regression-based classifier | Logistic regression is used to predict variables that have a zero-one value, such as the presence or absence of a disease, or membership in a group
Complementary naive Bayes classifier | Another classification scheme, making use of Bayes' theorem (which you may remember from Statistics 101)
Random forest decision tree-based classifier | Yet another classifier, based on decision trees
Collaborative filtering | Used in recommendation systems (if you like X, may we suggest Y)
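To make the first entry in the table a little more concrete, here is a minimal single-machine sketch of k-means in plain Python. It is not Mahout's implementation, which runs as distributed jobs on Hadoop; the function name kmeans, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Toy k-means over a list of equal-length numeric tuples."""
    # Assumption: seed centroids with the first k points so the run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Each point ends up labeled with the index of its nearest centroid, so points in the same group are similar and the groups stay distinct. A scalable implementation has to distribute the same two steps (assign, then update) across the cluster, which is the part you would not want to code yourself.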
A fuller discussion of all these is well beyond the scope of this topic. There are many good
introductions to machine learning available. Google is your friend here.
In April 2014, the Mahout community announced that it was moving away from MapReduce
toward a Scala-based domain-specific language (DSL) with a Spark implementation (described
here). Existing MapReduce algorithms would continue to be supported, but new additions to the
code base could not be MapReduce-based. In fact, in the latest release, the Mahout community
dropped support for some infrequently used routines.
 