Machine Learning with MLlib - Learning Spark

K-means

MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means|| that provides better initialization in parallel environments.[5] K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.

The most important parameter in K-means is a target number of clusters to generate, K. In practice, you rarely know the “true” number of clusters in advance, so the best practice is to try several values of K until the average intracluster distance stops decreasing dramatically. However, the algorithm takes only one K at a time, so each candidate value requires a separate training run. Apart from K, K-means in MLlib takes the following parameters:

initializationMode

The method to initialize cluster centers, which can be either “k-means||” or “random”; k-means|| (the default) generally leads to better results but is slightly more expensive.

maxIterations

Maximum number of iterations to run (default: 100).

runs

Number of concurrent runs of the algorithm to execute. MLlib's K-means supports running from multiple starting positions concurrently and picking the best result, which is a good way to get a better overall model (as K-means runs can stop in local minima).
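The advice to try several values of K, and the idea of keeping the best of several runs, can be sketched in plain Python. This is only an illustration of the technique, not MLlib's implementation: it uses random initialization rather than k-means||, runs the attempts one after another rather than concurrently, and all function names and the toy data are made up for the example.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def lloyd(points, centers, max_iterations):
    """Run Lloyd's iterations from the given starting centers."""
    for _ in range(max_iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Each center moves to the mean of its group; an empty group
        # keeps its previous center.
        updated = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def average_distance(points, centers):
    """Average distance from each point to its nearest center."""
    total = sum(math.sqrt(squared_distance(p, centers[closest(p, centers)]))
                for p in points)
    return total / len(points)


def kmeans(points, k, max_iterations=100, runs=5, seed=0):
    """Return (cost, centers) for the best of `runs` random starts."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centers = lloyd(points, rng.sample(points, k), max_iterations)
        cost = average_distance(points, centers)
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best


# Toy data (hypothetical): three well-separated groups of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
          for dx, dy in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]]

# One training call per candidate K; look for where the cost levels off.
costs = {k: kmeans(points, k)[0] for k in range(1, 7)}
for k in sorted(costs):
    print(k, round(costs[k], 3))
```

On data like this, the printed cost should fall sharply up to the number of real groups and only slowly afterward; that bend is the point at which adding clusters stops paying off.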

Like other algorithms, you invoke K-means by creating an mllib.clustering.KMeans object (in Java/Scala) or calling KMeans.train (in Python). It takes an RDD of Vectors. K-means returns a KMeansModel that lets you access the clusterCenters (as an array of vectors) or call predict() on a new vector to return its cluster. Note that predict() always returns the closest center to a point, even if the point is far from all clusters.
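A minimal PySpark sketch of the calls described above follows. It assumes an existing SparkContext passed in as sc and a small in-memory list of numeric rows; the function names, the max_distance threshold, and the idea of flagging distant points are illustrative assumptions, not part of MLlib. Because predict() always returns some cluster, the sketch measures the distance to the assigned center separately.

```python
import math


def distance_to_center(point, center):
    """Euclidean distance between a point and a cluster center."""
    return math.sqrt(sum((float(a) - float(b)) ** 2
                         for a, b in zip(point, center)))


def cluster_and_flag(sc, rows, k, max_distance):
    """Train K-means on `rows` and flag points far from their center.

    `sc` is an existing SparkContext; `rows` is a list of numeric lists.
    Returns a list of (row, cluster_index, is_far) tuples.
    """
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # KMeans.train takes an RDD of vectors.
    data = sc.parallelize(rows).map(lambda r: Vectors.dense(r)).cache()
    model = KMeans.train(data, k, maxIterations=100,
                         initializationMode="k-means||")

    centers = model.clusterCenters  # one center per cluster
    results = []
    for row in rows:
        # predict() returns the index of the closest center, however far.
        cluster = model.predict(Vectors.dense(row))
        dist = distance_to_center(row, centers[cluster])
        results.append((row, cluster, dist > max_distance))
    return results
```

A call such as cluster_and_flag(sc, rows, 3, 5.0) would return the cluster index for every row along with a flag marking rows that lie more than 5.0 units from their assigned center; a suitable threshold depends on the scale of the data.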

Collaborative Filtering and Recommendation

Collaborative filtering is a technique for recommender systems wherein users' ratings and interactions with various products are used to recommend new ones. Collaborative filtering is attractive because it only needs to take in a list of user/product interactions: either “explicit” interactions (e.g., ratings on a shopping site) or “implicit” ones (e.g., a user browsed a product page but did not rate the product). Based solely on these interactions, collaborative filtering algorithms learn which products are similar

[5] K-means|| was introduced in Bahmani et al., “Scalable K-Means++,” VLDB 2012.
