Introduction to Mahout
The preceding section introduced you at a high level to both data mining
and predictive analytics and how they apply to big data. If at this point you
are worried that you don't possess the skills or background to successfully
build and deliver this type of intelligence within your HDInsight platform,
fear not!
The remainder of this chapter introduces you to the Mahout machine
learning library and explains how you can use it to deliver meaningful big
data analytical solutions without a PhD in statistics or mathematics. So,
what is this Mahout thing?
Mahout is an open source, top-level Apache project that packages
multiple machine learning algorithms into a single library. Like Hadoop,
Mahout has a vibrant and active community that has continually expanded
and improved the library.
For a historical perspective, the Mahout project grew out of two separate
projects: Apache Lucene (an open source text indexing project) and
Taste (an open source Java library of collaborative filtering algorithms).
Mahout supports two basic implementations. The first is a non-distributed,
or real-time, implementation that involves native, non-Hadoop Java calls
directly to the Mahout library. The second is the one we are focused on,
and it runs in a distributed, or batch processing, manner using Hadoop.
Both of these scenarios abstract away the complexity of machine learning
algorithms.
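To make the difference between the two modes concrete, here is a minimal sketch in plain Python. It is not Mahout or Hadoop code, and the basket data and function names are invented for illustration. It counts how often pairs of items appear together, first as a single in-memory pass (the non-distributed style) and then as separate map, shuffle, and reduce steps, which is roughly the shape a Hadoop batch job takes.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical shopping baskets, invented for this illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
]

def count_pairs_in_memory(baskets):
    """Non-distributed style: one process walks all of the data."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return dict(counts)

def map_phase(basket):
    """Mapper: emit (pair, 1) for every item pair in a single basket."""
    return [(pair, 1) for pair in combinations(sorted(set(basket)), 2)]

def shuffle(mapped):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one pair."""
    return key, sum(values)

def count_pairs_map_reduce(baskets):
    """Batch style: each basket could be mapped on a different machine."""
    mapped = [kv for basket in baskets for kv in map_phase(basket)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    print(count_pairs_in_memory(baskets))
    print(count_pairs_map_reduce(baskets))
```

Both functions return the same counts. The point of the second form is that each call to map_phase and reduce_phase depends only on its own input, so a framework is free to spread those calls across many nodes.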
Within the context of big data, Mahout is built around four primary use
cases:
• Collaborative filtering (recommendation mining based on user
behavior)
• Clustering (grouping similar documents)
• Classification (assigning uncategorized documents to predefined
categories)
• Frequent item set mining (market basket analysis)
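As a rough illustration of the first use case, the following sketch shows user-based collaborative filtering in plain Python. It does not use Mahout. The ratings, the choice of cosine similarity, and the function names are all assumptions made for this example. The idea is to score items a user has not yet rated by the ratings of other users, weighted by how similar those users are.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}), invented for illustration.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 5.0, "B": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "E": 4.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=1):
    """Rank unseen items by the similarity-weighted average of other users' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # skip items the user has already rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

if __name__ == "__main__":
    print(recommend("alice", ratings, top_n=2))
```

In this toy data, bob's ratings match alice's on the items they share, so the item bob rated highly ranks ahead of the item carol rated. A library such as Mahout wraps the same steps (a data model, a similarity measure, a recommender) behind its own classes and can run them at a much larger scale.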
To get started with the Apache Mahout library, you first need to download
the project distribution; it is not included by default with HDInsight on