Analytic Helpers - Field Guide to Hadoop

Table 5-2. MLLib algorithms

MLLib algorithm                              Brief description
-------------------------------------------  ------------------------------------------------------------
Linear SVM and logistic regression           Prediction using continuous and binary variables
Classification and regression tree           Methods to classify data based on binary decisions
k-means clustering                           Dividing a set of observations into groups where elements in the group are similar and the groups are distinct
Recommendation via alternating least squares Used in recommendation systems (if you like X, you might like Y)
Multinomial naive Bayes                      Classification based upon Bayes' Theorem
Basic statistics                             Summary statistics, random data generation, correlations
Feature extraction and transformation        A number of routines often used in text analytics
Dimensionality reduction                     Reducing the number of variables in an analytic problem, often used when they are highly correlated
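
To make the k-means entry in Table 5-2 a little more concrete, here is a minimal sketch of the textbook procedure (Lloyd's algorithm) in plain Python. It is not MLLib's implementation and it does not use Spark; the function name kmeans, its parameters, and the sample observations are all made up for illustration.

```python
import math
import random


def kmeans(points, k, max_iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k observations chosen at random (hypothetical init strategy).
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed group, so the clustering has converged.
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # Leave an empty cluster's centroid where it is.
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )

    return centroids, assignments


# Illustrative data: two tight groups, one near (1, 1) and one near (8, 8).
observations = [
    (1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
    (8.0, 8.2), (7.9, 8.1), (8.3, 7.7),
]
centroids, assignments = kmeans(observations, k=2)
print(assignments)  # one group label per observation
```

The first three observations should end up sharing one label and the last three the other, which is the "similar within a group, distinct between groups" idea from the table. A distributed library such as MLLib applies the same idea to data that is spread across a cluster.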

Again, as MLLib lives on Spark, you would be wise to know Scala, Python, or Java to do anything sophisticated with it.

You may wonder whether to choose MLLib or Mahout. In the short run, Mahout is more mature and has a larger set of routines, but the current version of Mahout uses MapReduce and is generally slower (though likely more stable). If the algorithms you need exist today only in Mahout, that settles the question. Mahout currently has a much larger user community, so if you're looking for online help with problems, you're more likely to find it for Mahout. On the other hand, Mahout v2 will move to Spark and Scala, so in the long run, MLLib may well replace Mahout, or the two projects may merge efforts.

Tutorial Links

“MLLib: Scalable Machine Learning on Spark” is a thorough but rather technical tutorial that you may find useful.
