To encode each categorical feature as a binary vector, we need a mapping from each
feature value to the index of the nonzero entry in that vector. Let's define a function
that extracts this mapping from our dataset for a given column:
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]) \
        .distinct().zipWithIndex().collectAsMap()
Our function first extracts the given column from each record and reduces it to its
distinct values. It then uses the zipWithIndex transformation to pair each value with a
unique index, forming a key-value RDD in which the key is the feature value and the
value is its index. This index will be the index of the nonzero entry in the binary vector
representation of the feature. Finally, we collect this RDD back to the driver as a
Python dictionary.
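To make the steps concrete without a Spark cluster, here is a minimal plain-Python sketch of the same idea. The name get_mapping_local and the toy records are hypothetical and not part of the original dataset. Sorting the distinct values is an assumption added here so that the result is reproducible; the Spark function above does not sort.

```python
def get_mapping_local(rows, idx):
    """Map each distinct value in column idx to a unique integer index."""
    # set() plays the role of distinct(); enumerate() plays the role of zipWithIndex().
    distinct_values = sorted(set(fields[idx] for fields in rows))
    # The resulting dict mirrors what collectAsMap() returns on the driver.
    return dict((value, i) for i, value in enumerate(distinct_values))

# Hypothetical toy records; column 2 stands in for a categorical feature.
toy_rows = [
    ["1", "2011-01-01", "1", "0"],
    ["2", "2011-04-01", "2", "0"],
    ["3", "2011-07-01", "3", "1"],
    ["4", "2011-10-01", "4", "1"],
    ["5", "2011-01-02", "1", "0"],
]

toy_mapping = get_mapping_local(toy_rows, 2)
print(toy_mapping)
```

Each distinct value receives exactly one index, and the number of entries in the dictionary equals the length of the binary vector needed for that column.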
We can test our function on the third variable column (index 2):
print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
The preceding line of code will give us the following output:
Mapping of first categorical feature column: {u'1': 0, u'3': 2, u'2': 1, u'4': 3}
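As an illustration of how such a mapping is used, the following sketch turns a single feature value into its binary vector. The helper one_hot is a hypothetical name introduced here, and the dictionary is simply copied from the output above.

```python
def one_hot(value, mapping):
    """Return a binary vector with a single 1.0 at the index assigned to value."""
    vec = [0.0] * len(mapping)
    vec[mapping[value]] = 1.0
    return vec

# Mapping copied from the output above (plain strings instead of unicode literals).
example_mapping = {"1": 0, "3": 2, "2": 1, "4": 3}

encoded = one_hot("3", example_mapping)
print(encoded)  # [0.0, 0.0, 1.0, 0.0]
```

The vector has one slot per distinct value, so a column with four categories yields a vector of length four with exactly one nonzero entry.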
Now, we can apply this function to each categorical column (that is, for variable indices 2
to 9):
mappings = [get_mapping(records, i) for i in range(2,10)]
cat_len = sum(map(len, mappings))
num_len = len(records.first()[11:15])
total_len = num_len + cat_len
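The lengths computed above determine the layout of the final feature vector: one block of binary slots per categorical column, followed by the numerical values. The sketch below shows one possible way to lay out such a vector. The function encode and the toy mappings and values are assumptions made for illustration, not the dataset's actual contents.

```python
def encode(cat_values, numeric_values, mappings):
    """Concatenate one binary block per categorical column with the numeric values."""
    cat_vec = [0.0] * sum(len(m) for m in mappings)
    offset = 0
    for value, mapping in zip(cat_values, mappings):
        # Each column's block starts where the previous column's block ended.
        cat_vec[offset + mapping[value]] = 1.0
        offset += len(mapping)
    return cat_vec + [float(x) for x in numeric_values]

# Hypothetical mappings for two categorical columns with 4 and 2 distinct values.
toy_mappings = [{"1": 0, "2": 1, "3": 2, "4": 3}, {"0": 0, "1": 1}]
toy_numeric = [0.24, 0.81]

toy_cat_len = sum(map(len, toy_mappings))        # same formula as cat_len above
toy_total_len = toy_cat_len + len(toy_numeric)   # same formula as total_len above

toy_vector = encode(["3", "1"], toy_numeric, toy_mappings)
print(toy_vector)
```

The running offset is what keeps the blocks from overlapping, which is why the total categorical length is the sum of the individual mapping sizes.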
We now have the mappings for each variable, and we can see how many values in total we
need for our binary vector representation:
print "Feature vector length for categorical features: %d" % cat_len
print "Feature vector length for numerical features: %d" %