Derived features
As we mentioned earlier, it is often useful to compute a derived feature from one or more
available variables. The hope is that the derived feature adds information beyond what the
variable provides in its raw form.
For instance, we can compute the average rating given by each user to all the movies they
rated. This would be a feature that could provide a user-specific intercept in our model (in
fact, this is a commonly used approach in recommendation models). We have taken the raw
rating data and created a new feature that can allow us to learn a better model.
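As a minimal sketch of this idea, the following plain-Python example computes the mean rating per user. The sample tuples and the function name user_mean_ratings are hypothetical and exist only for illustration; they are not taken from the dataset used in this text.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) tuples, made up for illustration.
ratings = [
    (1, 10, 4.0),
    (1, 11, 2.0),
    (2, 10, 5.0),
    (2, 12, 3.0),
    (2, 13, 4.0),
]

def user_mean_ratings(rows):
    """Return a dict mapping each user_id to the mean of that user's ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, _movie_id, rating in rows:
        totals[user_id] += rating
        counts[user_id] += 1
    return {user: totals[user] / counts[user] for user in totals}

user_means = user_mean_ratings(ratings)
```

Each user's mean could then be attached to that user's rating records as an extra feature, which is one way to give a model a user-specific baseline.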
Examples of features derived from raw data include computing average values, median
values, variances, sums, differences, maximums or minimums, and counts. We have already
seen a case of this when we created a new movie age feature from the year of release of
the movie and the current year. Often, the idea behind using these transformations is to
summarize the numerical data in some way that might make it easier for a model to learn.
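The summaries listed above can be sketched with Python's standard statistics module. The helper name summarize and the sample values are assumptions made for illustration, and the population variance is used here as one possible choice of variance.

```python
import statistics

def summarize(values):
    """Return common summary statistics for a non-empty sequence of numbers."""
    values = list(values)
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Population variance; statistics.variance would give the sample variance.
        "variance": statistics.pvariance(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical ratings for a single movie.
summary = summarize([4.0, 2.0, 5.0, 3.0])
```

Any of these values could serve as a derived feature, depending on what the model is expected to learn.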
It is also common to transform numerical features into categorical features, for example, by
binning features. Common examples of this include variables such as age, geolocation, and
time.
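Binning can be sketched as follows for an age variable. The bin edges and labels are arbitrary assumptions chosen for illustration; in practice they would depend on the data and the problem.

```python
import bisect

# Hypothetical bin edges: each edge is the lower bound of the next bin.
AGE_BIN_EDGES = [18, 30, 45, 60]
AGE_BIN_LABELS = ["under_18", "18_29", "30_44", "45_59", "60_plus"]

def bin_age(age):
    """Map a numerical age to a categorical label using the edges above."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return AGE_BIN_LABELS[bisect.bisect_right(AGE_BIN_EDGES, age)]

binned = [bin_age(a) for a in [12, 18, 29, 30, 59, 60, 85]]
```

The resulting labels can then be encoded like any other categorical feature.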
Transforming timestamps into categorical features
To illustrate how to derive categorical features from numerical data, we will use the times
of the ratings given by users to movies. These are in the form of Unix timestamps. We can
use Python's datetime module to extract the date and time from the timestamp and, in
turn, extract the hour of the day. This will result in an RDD of the hour of the day for each
rating.
We will need a function to extract a datetime representation of the rating timestamp (in
seconds); we will create this function now:
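    def extract_datetime(ts):
        # Imported inside the function so it is available wherever the function runs
        import datetime
        # Converts a Unix timestamp (seconds) to a datetime in the local time zone
        return datetime.datetime.fromtimestamp(ts)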
We will again use the rating_data RDD that we computed in the earlier examples as
our starting point.
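As a rough sketch of the hour extraction described above, the following plain-Python example maps Unix timestamps to the hour of the day. The sample timestamps and the function name extract_hour_utc are hypothetical. Unlike extract_datetime, this variant pins the time zone to UTC so that the result does not depend on the machine's local time zone, which is an assumption of this sketch rather than the approach used in the text.

```python
import datetime

def extract_hour_utc(ts):
    """Return the hour of the day (0-23) in UTC for a Unix timestamp in seconds."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).hour

# Hypothetical rating timestamps, in seconds since the Unix epoch.
timestamps = [0, 46800, 86399]

# With a Spark RDD of timestamps, the same function could be passed to map;
# here a list comprehension stands in for that step.
hours = [extract_hour_utc(ts) for ts in timestamps]
```

The hour values can be used directly as a categorical feature or grouped further into coarser bins.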