Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Database Reference

In-Depth Information

Derived features

As we mentioned earlier, it is often useful to compute a derived feature from one or more

available variables. We hope that the derived feature can add more information than only

using the variable in its raw form.

For instance, we can compute the average rating given by each user to all the movies they

rated. This would be a feature that could provide a user-specific intercept in our model (in

fact, this is a commonly used approach in recommendation models). We have taken the raw

rating data and created a new feature that can allow us to learn a better model.

Examples of features derived from raw data include computing average values, median val-

ues, variances, sums, differences, maximums or minimums, and counts. We have already

seen a case of this when we created a new movie age feature from the year of release of

the movie and the current year. Often, the idea behind using these transformations is to

summarize the numerical data in some way that might make it easier for a model to learn.

It is also common to transform numerical features into categorical features, for example, by

binning features. Common examples of this include variables such as age, geolocation, and

time.

Transforming timestamps into categorical features

To illustrate how to derive categorical features from numerical data, we will use the times

of the ratings given by users to movies. These are in the form of Unix timestamps. We can

use Python's datetime module to extract the date and time from the timestamp and, in

turn, extract the hour of the day. This will result in an RDD of the hour of the day for each

rating.

We will need a function to extract a datetime representation of the rating timestamp (in

seconds); we will create this function now:

def extract_datetime(ts):

import datetime

return datetime.datetime.fromtimestamp(ts)

We will again use the rating_data RDD that we computed in the earlier examples as

our starting point.

Search WWH ::

Custom Search

Home