total_len = num_len + cat_len
print "Feature vector length for categorical features: %d" % cat_len
print "Feature vector length for numerical features: %d" % num_len
print "Total feature vector length: %d" % total_len
The output of the preceding code is as follows:
Feature vector length for categorical features: 57
Feature vector length for numerical features: 4
Total feature vector length: 61
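Recall that each entry in mappings (used in the code that follows) is a dictionary mapping a raw category value to a zero-based index for that variable. These mappings were built earlier in the chapter; as a reminder, a minimal sketch of that step might look like the following, assuming records is the RDD of split rows (the exact helper defined earlier may differ slightly):
# sketch of the assumed mapping-extraction step: map each distinct
# value in column idx to a unique zero-based index
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

# columns 2 to 9 of the hour dataset are the eight categorical variables
mappings = [get_mapping(records, i) for i in range(2, 10)]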
Creating feature vectors for the linear model
The next step is to use our extracted mappings to convert the categorical features to
binary-encoded features. Again, it will be helpful to create a function that we can apply to
each record in our dataset for this purpose. We will also create a function to extract the
target variable from each record. We will need to import numpy for linear algebra utilities
and MLlib's LabeledPoint class to wrap our feature vectors and target variables:
from pyspark.mllib.regression import LabeledPoint
import numpy as np

def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # binary-encode each of the eight categorical columns (2 to 9),
    # using the mapping extracted earlier for that column
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        # step offsets idx into this variable's block of the full vector
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    # the four numerical columns (temp, atemp, hum, windspeed) are
    # converted to floats and used directly
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # the target variable, the count, is the last column
    return float(record[-1])
In the preceding extract_features function, we ran through each categorical column in the row of data. We extracted the binary encoding for each variable in turn from the mappings we created previously. The step variable ensures that the nonzero feature index for each variable is located correctly within the full feature vector.
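With these two functions in place, building the input for the linear model is a single map over the dataset. The following sketch, assuming records is the same RDD of split rows used above, wraps the label and feature vector of each row in a LabeledPoint and inspects the first transformed record:
# sketch, assuming `records` is the RDD of split rows from earlier
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))

first_point = data.first()
print "Label: " + str(first_point.label)
print "Feature vector length: " + str(len(first_point.features))
The printed length should match the total of 61 computed earlier.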