Table 5-2. MLlib algorithms

MLlib algorithm                               Brief description
Linear SVM and logistic regression            Prediction using continuous and binary variables
Classification and regression tree            Methods to classify data based on binary decisions
k-means clustering                            Divides a set of observations into groups whose members are similar to each other and distinct from other groups
Recommendation via alternating least squares  Used in recommendation systems (if you like X, you might like Y)
Multinomial naive Bayes                       Classification based upon Bayes' Theorem
Basic statistics                              Summary statistics, random data generation, correlations
Feature extraction and transformation         A number of routines often used in text analytics
Dimensionality reduction                      Reducing the number of variables in an analytic problem, often used when the variables are highly correlated

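
As a rough illustration of the k-means idea described in Table 5-2, the following plain-Python sketch groups a handful of two-dimensional observations. It is a conceptual sketch only, not how the library on Spark implements the algorithm: the function names (kmeans, squared_distance, mean_point), the sample observations, and the choice of starting centroids are all assumptions made for this example.

```python
# Conceptual sketch of k-means (Lloyd's algorithm) in plain Python.
# NOT the Spark library's implementation; every name here is hypothetical.

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) after iterating until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean_point(members)
    return centroids, assignments


# Made-up sample data: two visually separate groups of observations.
observations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(
    observations,
    initial_centroids=[observations[0], observations[3]],
)
print(labels)
```

Starting centroids are passed in explicitly so that the run is repeatable; production implementations typically pick them at random or with a seeding heuristic, and distribute the two steps across a cluster.
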
Again, because MLlib runs on Spark, you would be wise to know Scala, Python, or Java to do
anything sophisticated with it.
You may wonder whether to choose MLlib or Mahout. In the short run, Mahout is more mature
and has a larger set of routines, but the current version of Mahout runs on MapReduce and
is generally slower (though likely more stable). If the algorithms you need exist today only
in Mahout, the choice is made for you. Mahout also currently has a much larger user community,
so you are more likely to find online help for Mahout problems. On the other hand, Mahout v2
will move to Spark and Scala, so in the long run MLlib may well replace Mahout, or the two
projects may merge efforts.
Tutorial Links
“MLlib: Scalable Machine Learning on Spark” is a thorough but rather technical tutorial
that you may find useful.
 