First, we will use a map transformation to extract the timestamp field, converting it to a
Python int datatype. We will then apply our extract_datetime function to each
timestamp and extract the hour from the resulting datetime object:
timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
hour_of_day.take(5)
If we take the first five records of the resulting RDD, we will see the following output:
[17, 21, 9, 7, 7]
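The extract_datetime function is not defined in this section. The sketch below shows one plausible definition and the same two-step extraction on a plain Python list instead of an RDD. The function body and the sample rows are assumptions made for illustration, and the hours it produces depend on the local timezone of the machine running it.

```python
import datetime

def extract_datetime(ts):
    # Assumed definition: treat ts as seconds since the Unix epoch and
    # convert it to a datetime in the machine's local timezone.
    return datetime.datetime.fromtimestamp(ts)

# Hypothetical rows shaped like the split rating records:
# user ID, item ID, rating, timestamp (all as strings).
sample_rows = [
    ['196', '242', '3', '881250949'],
    ['186', '302', '3', '891717742'],
]

# Step 1: pull out the timestamp field and convert it to an int.
timestamps = [int(fields[3]) for fields in sample_rows]

# Step 2: convert each timestamp to a datetime and keep only the hour.
hours = [extract_datetime(ts).hour for ts in timestamps]
print(hours)
```

On an RDD, the two list comprehensions correspond to the two map calls shown earlier.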
We have transformed the raw time data into a categorical feature that represents the hour
of the day in which the rating was given.
Now, say that we decide this is too fine-grained a representation. Perhaps we want to further refine the transformation. We can assign each hour-of-the-day value to a defined bucket that represents a time of day.
For example, we can say that morning is from 7 a.m. to noon, while lunch is from noon to 2 p.m., and so on. Using these buckets, we can create a function to assign a time of day, given the hour of the day as input:
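def assign_tod(hr):
    # Map each time-of-day label to the hours (0-23) that it covers.
    times_of_day = {
        'morning': range(7, 12),
        'lunch': range(12, 14),
        'afternoon': range(14, 18),
        'evening': range(18, 23),
        # Night wraps past midnight, so it is built from two ranges.
        'night': list(range(23, 24)) + list(range(0, 7))
    }
    for k, v in times_of_day.items():
        if hr in v:
            return k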
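A quick sanity check for a bucketing scheme like this is to confirm that every hour from 0 to 23 falls into exactly one bucket. In Python, a range whose start exceeds its stop, such as range(23, 7), is empty, so a bucket that wraps past midnight has to be assembled from two pieces. The following is a minimal sketch of such a check; the standalone bucket table and the names coverage and uncovered are introduced here only for illustration.

```python
# Illustrative bucket table. 'night' wraps past midnight, so it is
# built from two ranges rather than a single range(23, 7).
times_of_day = {
    'morning': list(range(7, 12)),
    'lunch': list(range(12, 14)),
    'afternoon': list(range(14, 18)),
    'evening': list(range(18, 23)),
    'night': list(range(23, 24)) + list(range(0, 7)),
}

# A range with start > stop contains no values.
wrapped = list(range(23, 7))
print(wrapped)  # []

# For each hour of the day, list the buckets that claim it.
coverage = {
    hr: [name for name, hrs in times_of_day.items() if hr in hrs]
    for hr in range(24)
}

# Hours claimed by zero buckets or by more than one bucket.
uncovered = [hr for hr, names in coverage.items() if len(names) != 1]
print(uncovered)  # []
```

An empty uncovered list means the function can never fall through its loop and return None for a valid hour.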
Now, we will apply the assign_tod function to the hour of each rating event contained
in the hour_of_day RDD:
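As a rough illustration of this step, the sketch below uses a plain Python list in place of the hour_of_day RDD, filled with the five hours shown earlier. The compact assign_tod variant and the variable names are assumptions for the example; on an actual RDD the analogous step would presumably be a map call such as hour_of_day.map(assign_tod).

```python
def assign_tod(hr):
    # Compact variant for illustration: label -> hours (0-23) covered.
    times_of_day = {
        'morning': range(7, 12),
        'lunch': range(12, 14),
        'afternoon': range(14, 18),
        'evening': range(18, 23),
        # Night wraps past midnight, so it combines two ranges.
        'night': list(range(23, 24)) + list(range(0, 7)),
    }
    for label, hrs in times_of_day.items():
        if hr in hrs:
            return label

# Stand-in for the first five records of the hour_of_day RDD.
hour_sample = [17, 21, 9, 7, 7]

# Plain-Python equivalent of mapping assign_tod over each element.
time_of_day = [assign_tod(hr) for hr in hour_sample]
print(time_of_day)
# ['afternoon', 'evening', 'morning', 'morning', 'morning']
```

The result is a categorical feature with five possible values rather than twenty-four.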