K-means
MLlib includes the popular K-means algorithm for clustering, as well as a variant
called K-means|| that provides better initialization in parallel environments.[5]
K-means|| is similar to the K-means++ initialization procedure often used in
single-node settings.
The most important parameter in K-means is the target number of clusters to generate,
K. In practice, you rarely know the “true” number of clusters in advance, so the best
practice is to try several values of K until the average intracluster distance stops
decreasing dramatically. The algorithm takes only one K at a time, however, so each
candidate value requires its own training run. Apart from K, K-means in MLlib takes
the following parameters:
initializationMode
The method to initialize cluster centers, which can be either “k-means||” or
“random”; k-means|| (the default) generally leads to better results but is slightly
more expensive.
maxIterations
Maximum number of iterations to run (default: 100).
runs
Number of concurrent runs of the algorithm to execute. MLlib's K-means supports
running from multiple starting positions concurrently and picking the best
result, which is a good way to get a better overall model (as K-means runs can
stop in local minima).
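As a rough illustration of two ideas above (trying several values of K, and keeping the best of several runs from different starting positions), here is a minimal pure-Python sketch. It does not use MLlib. The helper names (lloyd, best_of), the toy points, and the use of mean squared distance to the nearest center as the cost are assumptions made for this example, and it uses plain random initialization rather than k-means||.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def lloyd(points, k, max_iterations, rng):
    """One K-means run from a random start; returns (centers, average cost)."""
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(squared_distance(p, centers[closest(p, centers)]) for p in points)
    return centers, cost / len(points)


def best_of(points, k, runs=10, max_iterations=100, seed=0):
    """Repeat lloyd() from several random starts and keep the cheapest result."""
    rng = random.Random(seed)
    results = [lloyd(points, k, max_iterations, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[1])


# Hypothetical toy data: three tight groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

# One training pass per candidate K; look for where the cost stops dropping sharply.
costs = {k: best_of(points, k)[1] for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 3))
```

Because each run can settle in a different local minimum, the printed costs depend on the seed and the number of runs; the point of the sketch is the shape of the curve, not the exact values.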
Like other algorithms, you invoke K-means by creating an mllib.clustering.KMeans
object (in Java/Scala) or calling KMeans.train (in Python). It takes an RDD of
Vectors. K-means returns a KMeansModel that lets you access the clusterCenters (as an
array of vectors) or call predict() on a new vector to return its cluster. Note that
predict() always returns the closest center to a point, even if the point is far from all
clusters.
Collaborative Filtering and Recommendation
Collaborative filtering is a technique for recommender systems wherein users' ratings
and interactions with various products are used to recommend new ones. Collaborative
filtering is attractive because it only needs to take in a list of user/product
interactions: either “explicit” interactions (e.g., ratings on a shopping site) or
“implicit” ones (e.g., a user browsed a product page but did not rate the product). Based solely on
these interactions, collaborative filtering algorithms learn which products are similar
[5] K-means|| was introduced in Bahmani et al., “Scalable K-Means++,” VLDB 2012.
 