total_len = num_len + cat_len
print "Feature vector length for categorical features: %d" % cat_len
print "Feature vector length for numerical features: %d" % num_len
print "Total feature vector length: %d" % total_len
The output of the preceding code is as follows:
Feature vector length for categorical features: 57
Feature vector length for numerical features: 4
Total feature vector length: 61
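Recall that each entry in mappings (used in the code that follows) is a dictionary mapping a raw category value to a zero-based index for that variable. These mappings were built earlier in the chapter; as a reminder, a minimal sketch of that step might look like the following, assuming records is the RDD of split rows (the exact helper defined earlier may differ slightly):
# sketch of the assumed mapping-extraction step: map each distinct
# value in column idx to a unique zero-based index
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

# columns 2 to 9 of the hour dataset are the eight categorical variables
mappings = [get_mapping(records, i) for i in range(2, 10)]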
Creating feature vectors for the linear model
The next step is to use our extracted mappings to convert the categorical features to
binary-encoded features. Again, it will be helpful to create a function that we can apply to
each record in our dataset for this purpose. We will also create a function to extract the
target variable from each record. We will need to import numpy for linear algebra utilities
and MLlib's LabeledPoint class to wrap our feature vectors and target variables:
from pyspark.mllib.regression import LabeledPoint
import numpy as np

def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # binary-encode each of the eight categorical columns (2 to 9),
    # using the mapping extracted earlier for that column
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        # step offsets idx into this variable's block of the full vector
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    # the four numerical columns (temp, atemp, hum, windspeed) are
    # converted to floats and used directly
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # the target variable, the count, is the last column
    return float(record[-1])
In the preceding extract_features function, we ran through each categorical column in the row of data. We extracted the binary encoding for each variable in turn from the mappings we created previously. The step variable ensures that the nonzero feature index for each variable is located correctly within the full feature vector.
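With these two functions in place, building the input for the linear model is a single map over the dataset. The following sketch, assuming records is the same RDD of split rows used above, wraps the label and feature vector of each row in a LabeledPoint and inspects the first transformed record:
# sketch, assuming `records` is the RDD of split rows from earlier
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))

first_point = data.first()
print "Label: " + str(first_point.label)
print "Feature vector length: " + str(len(first_point.features))
The printed length should match the total of 61 computed earlier.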