feature vector is correct (and is somewhat more efficient than, say, creating many smaller binary vectors and concatenating them). The numeric vector is created directly by first converting the data to floating-point numbers and wrapping these in a numpy array. The resulting two vectors are then concatenated. The extract_label function simply converts the last column variable (the count) into a float.
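For reference, the two utility functions described above might look something like the following sketch. Here, mappings (the list of category-to-index dictionaries), cat_len (the combined length of all the binary blocks), and the exact field positions are assumptions based on the record layout shown in the output further below:
import numpy as np

def extract_features(record):
    # Fill one binary vector for all categorical fields at once;
    # 'mappings' (a list of category-to-index dicts) and 'cat_len'
    # (the total length of the binary block) are assumed to have been
    # built earlier from the data
    cat_vec = np.zeros(cat_len)
    step = 0
    for i, field in enumerate(record[2:10]):
        m = mappings[i]
        cat_vec[step + m[field]] = 1
        step = step + len(m)
    # Convert the real-valued fields directly to floats, wrap them in
    # a numpy array, and join the two parts
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # The count is the last column in each record
    return float(record[-1])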
With our utility functions defined, we can proceed with extracting feature vectors and labels from our data records:
from pyspark.mllib.regression import LabeledPoint

data = records.map(lambda r: LabeledPoint(extract_label(r),
  extract_features(r)))
Let's inspect the first record in the extracted feature RDD:
first = records.first()  # the first raw record, for comparison
first_point = data.first()
print "Raw data: " + str(first[2:])
print "Label: " + str(first_point.label)
print "Linear Model feature vector:\n" + str(first_point.features)
print "Linear Model feature vector length: " + str(len(first_point.features))
You should see output similar to the following:
Raw data: [u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1',
u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
Label: 16.0
Linear Model feature vector:
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
Linear Model feature vector length: 61
As we can see, we converted the raw data into a feature vector made up of the binary categorical and real numeric features, and we indeed have a total vector length of 61.
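As a quick sanity check, assuming the standard bike sharing fields, the eight categorical variables (season, yr, mnth, hr, holiday, weekday, workingday, and weathersit) expand to 4 + 2 + 12 + 24 + 2 + 7 + 2 + 4 = 57 binary entries, and adding the 4 numeric features gives the 61 we see here.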
Creating feature vectors for the decision tree
As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector.
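A minimal sketch of such a function, using the same field positions as before (the name extract_features_dt and the assumption that every feature field parses as a float are ours):
def extract_features_dt(record):
    # No binary encoding needed: the tree can split on the raw values,
    # so simply convert each feature field to a float
    return np.array([float(field) for field in record[2:14]])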