Building a Clustering Model with Spark - Machine Learning with Spark


Extracting the right features from your data

Like most of the machine learning models we have encountered so far, K-means clustering requires numerical vectors as input. The same feature extraction and transformation approaches that we have seen for classification and regression are applicable to clustering. Because K-means, like least squares regression, uses a squared error function as its optimization objective, it tends to be sensitive to outliers and to features with large variance.
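To make the effect of the squared error objective concrete, here is a minimal sketch in plain Python. The data points and the helper names `centroid` and `wcss` are invented for illustration and do not come from the book; the sketch treats a single cluster in one dimension, where the K-means center is simply the mean.

```python
# Illustrative data only: four nearby points plus one assumed outlier.
def centroid(points):
    """Mean of the points, which minimizes the sum of squared errors."""
    return sum(points) / len(points)

def wcss(points, center):
    """Within-cluster sum of squared errors around a given center."""
    return sum((p - center) ** 2 for p in points)

inliers = [1.0, 2.0, 3.0, 4.0]
with_outlier = inliers + [100.0]

center_clean = centroid(inliers)
center_skewed = centroid(with_outlier)

# Share of the total squared error contributed by the outlier alone.
outlier_share = (100.0 - center_skewed) ** 2 / wcss(with_outlier, center_skewed)

print("center without outlier:", center_clean)
print("center with outlier:   ", center_skewed)
print("outlier share of error: %.2f" % outlier_share)
```

Because each deviation is squared, the single distant point pulls the center far from the other four points and accounts for most of the total error.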

As in the regression and classification cases, input data can be normalized and standardized to overcome this, which might improve accuracy. In some cases, however, it might be desirable not to standardize data if, for example, the objective is to find segmentations according to certain specific features.
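As a rough illustration of standardization, the following plain-Python sketch rescales each feature to zero mean and unit variance before distances are compared. The rows (an age-like and an income-like feature) and the helper names `standardize` and `sq_dist` are hypothetical, chosen only to show a large-variance feature dominating the squared distance.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    # Constant columns (std == 0) are mapped to 0.0 to avoid division by zero.
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(dims)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance, the quantity K-means sums over points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical rows: [age, income]. Income has a far larger variance.
rows = [[25.0, 30000.0], [45.0, 31000.0], [26.0, 60000.0]]
scaled = standardize(rows)

# On raw data the income column decides almost everything.
print("raw:   ", sq_dist(rows[0], rows[1]), sq_dist(rows[0], rows[2]))
# After scaling, both features contribute on a comparable scale.
print("scaled:", sq_dist(scaled[0], scaled[1]), sq_dist(scaled[0], scaled[2]))
```

On the raw rows, the 20-year age gap between the first two points barely registers next to the income differences; after scaling, both gaps carry similar weight. Skipping this step is the right choice when one feature is meant to drive the segmentation. In Spark itself, MLlib's `StandardScaler` plays this role.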
