Extracting the right features from your data
You might recall from Chapter 3, Obtaining, Processing, and Preparing Data with Spark,
that the majority of machine learning models operate on numerical data in the form of
feature vectors. In addition, for supervised learning methods such as classification and
regression, we need to provide the target variable (or variables, in multilabel
situations) together with the feature vector.
Classification models in MLlib operate on instances of LabeledPoint, which is a
wrapper around the target variable (called the label) and the feature vector:
case class LabeledPoint(label: Double, features: Vector)
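As a minimal sketch of how such an instance might be built, the snippet below wraps a label and a feature vector using the Vectors factory from MLlib. The label and feature values are invented for illustration and are not taken from any dataset in the text; the snippet assumes the MLlib classes are on the classpath, for example in spark-shell.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical example values: a positive-class label and three numeric features.
val label = 1.0
val features = Vectors.dense(0.5, 12.0, 3.0)

// Wrap the label and the feature vector in a single training instance.
val point = LabeledPoint(label, features)

// A sparse vector suits data where most features are zero:
// here, a vector of size 5 with a single non-zero entry at index 2.
val sparsePoint = LabeledPoint(0.0, Vectors.sparse(5, Array(2), Array(1.0)))
```

Note that the label is always a Double, so class labels need to be mapped to numeric values before a LabeledPoint is constructed.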
Most classification examples you come across use existing datasets that are already in
vector format. In practice, however, you will usually start with raw data that needs
to be transformed into features. As we have already seen, this can involve preprocessing
and transformation, such as binning numerical features, scaling and normalizing features,
and using 1-of-k encodings for categorical features.
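The transformations named above can be sketched in plain Scala, without any Spark dependency. The records, field names, bucket edges, and helper functions (oneOfK, ageBin, scale) are all hypothetical and chosen only for illustration; min-max scaling is used here as one possible form of scaling, not necessarily the one a given pipeline should use.

```scala
// Hypothetical raw records: (age, income, occupation). Values are invented.
val raw = Seq(
  (23.0, 30000.0, "student"),
  (45.0, 82000.0, "engineer"),
  (61.0, 54000.0, "teacher")
)

// 1-of-k encoding: assign every category an index, then set that index to 1.0.
val categories = raw.map(_._3).distinct.sorted
val categoryIndex = categories.zipWithIndex.toMap
def oneOfK(category: String): Array[Double] = {
  val encoded = Array.fill(categories.size)(0.0)
  encoded(categoryIndex(category)) = 1.0
  encoded
}

// Binning: map a numerical age to a coarse bucket (bucket edges are arbitrary).
def ageBin(age: Double): Double =
  if (age < 30) 0.0 else if (age < 50) 1.0 else 2.0

// Min-max scaling: rescale income to the range [0, 1].
val incomes = raw.map(_._2)
val (minIncome, maxIncome) = (incomes.min, incomes.max)
def scale(x: Double): Double = (x - minIncome) / (maxIncome - minIncome)

// Concatenate the transformed pieces into one flat feature array per record.
val featureArrays = raw.map { case (age, income, occupation) =>
  Array(ageBin(age), scale(income)) ++ oneOfK(occupation)
}
```

Each resulting array could then be passed to Vectors.dense and paired with a label to form a LabeledPoint. In a real job, the category index and the minimum and maximum values would be computed over the whole dataset before being applied to each record.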