Building a Regression Model with Spark - Machine Learning with Spark


To convert each categorical feature into a binary vector, we need a mapping from each feature value to the index of the nonzero entry in that vector. Let's define a function that extracts this mapping from our dataset for a given column:

def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

Our function first extracts the field at the given column index and keeps only its distinct values. It then uses the zipWithIndex transformation to pair each value with a unique index, forming a key-value RDD in which the key is the feature value and the value is the index. This index will be the index of the nonzero entry in the binary vector representation of the feature. Finally, we collect this RDD back to the driver as a Python dictionary.
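To make the steps concrete without a Spark cluster, here is a minimal plain-Python sketch of the same idea. The helper get_mapping_local and the toy_rows data are hypothetical and not part of the original code. The sketch sorts the distinct values so that the result is reproducible, whereas Spark's distinct makes no promise about ordering, so the indices assigned on a real RDD may differ.

```python
def get_mapping_local(rows, idx):
    """Plain-Python analogue of get_mapping for a list of records (hypothetical helper)."""
    # Keep only the distinct values in column idx, like rdd.map(...).distinct().
    # Sorting is an assumption made here for reproducibility; Spark does not sort.
    distinct_values = sorted(set(row[idx] for row in rows))
    # Pair each value with a running index, like zipWithIndex().collectAsMap().
    return {value: index for index, value in enumerate(distinct_values)}


# Hypothetical toy records; the third column (index 2) is categorical.
toy_rows = [
    ["1", "2011-01-01", "1"],
    ["2", "2011-04-01", "2"],
    ["3", "2011-07-01", "3"],
    ["4", "2011-07-02", "3"],
]

toy_mapping = get_mapping_local(toy_rows, 2)
print(toy_mapping)
```

Each distinct value appears exactly once as a key, and the values cover the range 0 to the number of distinct values minus one, which is what we need for indexing into a binary vector.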

We can test our function on the third variable column (index 2):

print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)

The preceding line of code will give us the following output:

Mapping of first categorical feature column: {u'1': 0, u'3': 2, u'2': 1, u'4': 3}
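As an illustration of how such a mapping translates into a binary vector, the following sketch encodes a single value. The one_hot function and the season_mapping name are assumptions introduced for this example, and the mapping simply copies the values printed above with plain string keys.

```python
def one_hot(value, mapping):
    """Return a binary vector with a single 1.0 at the index mapped to value."""
    vec = [0.0] * len(mapping)   # one slot per distinct category
    vec[mapping[value]] = 1.0    # switch on the slot for this category
    return vec


# Mapping copied from the printed output above (hypothetical variable name).
season_mapping = {"1": 0, "2": 1, "3": 2, "4": 3}

encoded = one_hot("3", season_mapping)
print(encoded)
```

The value '3' maps to index 2, so only the third entry of the four-element vector is nonzero.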

Now, we can apply this function to each categorical column (that is, for variable indices 2

to 9):

mappings = [get_mapping(records, i) for i in range(2,10)]

cat_len = sum(map(len, mappings))

num_len = len(records.first()[11:15])

total_len = num_len + cat_len
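The reason for summing the mapping sizes is that each categorical column needs its own block of slots in the combined vector. The sketch below shows one way this could work, using a running offset per column. The function encode_categoricals and all toy_ names are hypothetical, and the data is invented purely for illustration.

```python
def encode_categoricals(values, mappings):
    """Concatenate one binary block per categorical column (hypothetical helper)."""
    cat_len = sum(len(m) for m in mappings)   # total slots across all columns
    vec = [0.0] * cat_len
    offset = 0                                # start of the current column's block
    for value, mapping in zip(values, mappings):
        vec[offset + mapping[value]] = 1.0
        offset += len(mapping)                # move past this column's block
    return vec


# Two toy categorical columns with 4 and 2 distinct values respectively.
toy_mappings = [{"1": 0, "2": 1, "3": 2, "4": 3}, {"0": 0, "1": 1}]
toy_num = [0.24, 0.81]                        # two toy numerical features

toy_cat_vec = encode_categoricals(["2", "1"], toy_mappings)
toy_total_len = len(toy_cat_vec) + len(toy_num)
print(toy_cat_vec)
print(toy_total_len)
```

Here the categorical part has 4 + 2 = 6 entries with exactly one nonzero entry per column, and adding the two numerical features gives a total length of 8.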

We now have the mappings for each variable, and we can see how many values in total we

need for our binary vector representation:

print "Feature vector length for categorical features: %d" % cat_len

print "Feature vector length for numerical features: %d" %
