Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Extracting useful features from your data

Once we have completed the initial exploration, processing, and cleaning of our data, we are ready to extract the actual features with which our machine learning model can be trained.

Features are the variables that we use to train our model. Each row of data contains various pieces of information that we would like to extract into a training example. Almost all machine learning models ultimately work on numerical representations in the form of a vector; hence, we need to convert raw data into numbers.
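As a minimal sketch of what converting raw data into numbers can look like, the plain-Python snippet below turns one raw text record into a list of floats. The pipe-delimited layout, the sample record, and the function name to_feature_vector are assumptions made for illustration; they are not taken from the text, and the snippet does not use Spark.

```python
# Hypothetical raw record in an assumed pipe-delimited layout:
# id|age|gender|occupation
raw_record = "1|24|M|technician"


def to_feature_vector(line):
    """Turn one raw text record into a list of floats (illustrative only)."""
    user_id, age, gender, occupation = line.split("|")
    # Numerical field: the value can be used directly once parsed.
    age_feature = float(age)
    # Two-state categorical field: encoded here as a binary flag.
    gender_feature = 1.0 if gender == "M" else 0.0
    return [age_feature, gender_feature]


vector = to_feature_vector(raw_record)
print(vector)  # [24.0, 1.0]
```

The identifier and the occupation are left out of the vector here: an identifier carries no numerical meaning, and a field with many possible states needs a different encoding than a single flag.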

Features broadly fall into a few categories, which are as follows:

• Numerical features: These features are typically real or integer numbers, for example, the user age that we used in an example earlier.

• Categorical features: These features refer to variables that can take one of a set of possible states at any given time. Examples from our dataset might include a user's gender or occupation, or movie categories.

• Text features: These are features derived from the text content in the data, for example, movie titles, descriptions, or reviews.

• Other features: Most other types of features are ultimately represented numerically. For example, images, video, and audio can be represented as sets of numerical data. Geographical locations can be represented as latitude and longitude or as geohash data.

Here we will cover numerical, categorical, and text features.
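The three feature types covered here can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the occupation set, the term list, the sample values, and the helper names one_of_k and term_counts are hypothetical, and the one-of-k and term-count encodings are common choices picked for this sketch rather than the specific method the text goes on to use.

```python
import re
from collections import Counter


def one_of_k(value, vocabulary):
    """Encode a categorical value as a vector with a single 1.0 entry."""
    # Sort the vocabulary so that each state always maps to the same position.
    index = {state: i for i, state in enumerate(sorted(vocabulary))}
    vector = [0.0] * len(index)
    vector[index[value]] = 1.0
    return vector


def term_counts(text, terms):
    """Count how often each term of a fixed term list occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [float(counts[term]) for term in terms]


# Hypothetical inputs, chosen only for illustration.
occupations = {"artist", "doctor", "technician"}
terms = ["story", "toy"]

age = [24.0]                                      # numerical feature
occupation = one_of_k("technician", occupations)  # categorical feature
title = term_counts("Toy Story (1995)", terms)    # text feature

# Concatenate the pieces into one numerical feature vector.
features = age + occupation + title
print(features)  # [24.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Each feature type contributes a fixed number of positions, so every training example ends up as a vector of the same length.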
