Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

First, we will use a map transformation to extract the timestamp field, converting it to a Python int datatype. We will then apply our extract_datetime function to each timestamp and extract the hour from the resulting datetime object:

timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
hour_of_day.take(5)

If we take the first five records of the resulting RDD, we will see the following output:

[17, 21, 9, 7, 7]

We have transformed the raw time data into a categorical feature that represents the hour of the day in which the rating was given.
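As a rough illustration of the hour-extraction step outside Spark, the sketch below converts a few Unix timestamps to hours using only the standard library. The extract_datetime function is not shown in this section, so the helper name extract_hour_utc and the sample timestamps are assumptions made for illustration. The sketch pins the conversion to UTC; datetime.fromtimestamp without a tz argument uses the machine's local time zone, so hours computed that way will vary with where the code runs.

```python
from datetime import datetime, timezone

def extract_hour_utc(ts):
    """Return the hour of day (0-23) for a Unix timestamp, interpreted in UTC.

    Hypothetical helper for illustration; not the chapter's extract_datetime.
    """
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

# Hand-picked timestamps (seconds since the epoch), not values from the dataset.
sample_timestamps = [0, 17 * 3600, 23 * 3600 + 59 * 60]
sample_hours = [extract_hour_utc(ts) for ts in sample_timestamps]
print(sample_hours)  # [0, 17, 23]
```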

Now, say that we decide this is too fine-grained a representation. Perhaps we want to further refine the transformation. We can assign each hour-of-the-day value to a defined bucket that represents a time of day.

For example, we can say that morning is from 7 a.m. to 11 a.m., while lunch is from 11 a.m. to 1 p.m., and so on. Using these buckets, we can create a function to assign a time of day, given the hour of the day as input:

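def assign_tod(hr):
    # Map each time-of-day label to the hours (0-23) it covers.
    times_of_day = {
        'morning': range(7, 12),
        'lunch': range(12, 14),
        'afternoon': range(14, 18),
        'evening': range(18, 23),
        # Night wraps past midnight; range(23, 7) would be empty.
        'night': [23] + list(range(0, 7))
    }
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only).
    for k, v in times_of_day.items():
        if hr in v:
            return k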

Now, we will apply the assign_tod function to the hour of each rating event contained in the hour_of_day RDD:
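A minimal sketch of this step, runnable without a Spark cluster, might look like the following. It restates assign_tod with the night bucket listed explicitly, so that hour 23 and hours 0 to 6 are covered, and uses a plain Python list as a stand-in for the hour_of_day RDD. The names sample_hours and time_of_day are assumptions for illustration.

```python
def assign_tod(hr):
    """Map an hour of day (0-23) to a time-of-day label."""
    times_of_day = {
        'morning': range(7, 12),
        'lunch': range(12, 14),
        'afternoon': range(14, 18),
        'evening': range(18, 23),
        # Night wraps past midnight, so list hour 23 and hours 0-6 explicitly.
        'night': [23] + list(range(0, 7)),
    }
    for label, hours in times_of_day.items():
        if hr in hours:
            return label
    return None  # hour outside 0-23

# Stand-in for the hour_of_day RDD: the five sample hours shown earlier.
sample_hours = [17, 21, 9, 7, 7]
time_of_day = [assign_tod(hr) for hr in sample_hours]
print(time_of_day)
# ['afternoon', 'evening', 'morning', 'morning', 'morning']

# Every hour of the day should land in some bucket.
assert all(assign_tod(hr) is not None for hr in range(24))
```

On the RDD itself, the equivalent call would presumably be hour_of_day.map(assign_tod), followed by take(5) to inspect the first few labels.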
