print "Index of term 'Dead': %d" % all_terms_dict2['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']
The output is as follows:
Index of term 'Dead': 147
Index of term 'Rooms': 1963
The final step is to create a function that converts a set of terms into a sparse vector
representation. To do this, we will create an empty sparse matrix with one row and a number
of columns equal to the total number of terms in our dictionary. We will then step through
each term in the input list of terms and check whether this term is in our term dictionary.
If it is, we assign a value of 1 to the vector at the index that corresponds to the term in our
dictionary mapping:
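# this function takes a list of terms and encodes it as a
# scipy sparse vector using an approach
# similar to the 1-of-k encoding
def create_vector(terms, term_dict):
    # imported inside the function so that the import runs wherever
    # the function is executed
    from scipy import sparse as sp
    num_terms = len(term_dict)
    # empty sparse matrix: one row, one column per term in the dictionary
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        # terms that are not in the dictionary are skipped
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x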
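As a quick check of the encoding described above, the following minimal sketch builds a vector for a toy dictionary. The names `create_vector_from_indices` and `toy_dict`, and the term list, are hypothetical and not part of the original data set. The sketch also swaps in a different construction: it collects the column indices first and builds the matrix in a single call, rather than assigning elements one by one (SciPy warns that changing the sparsity structure of a CSC matrix element by element is expensive).

```python
import numpy as np
from scipy import sparse as sp

def create_vector_from_indices(terms, term_dict):
    # Hypothetical variant of create_vector: gather the column index of every
    # known term, then build the 1 x num_terms matrix in one constructor call.
    num_terms = len(term_dict)
    # A set removes repeated terms; duplicate (row, col) pairs would be summed.
    cols = np.array(sorted({term_dict[t] for t in terms if t in term_dict}), dtype=int)
    rows = np.zeros(len(cols), dtype=int)
    data = np.ones(len(cols))
    return sp.csc_matrix((data, (rows, cols)), shape=(1, num_terms))

# Toy dictionary (an assumption for illustration only).
toy_dict = {"Dead": 0, "Man": 1, "Rooms": 2, "Walking": 3}

# "Unknown" is not in the dictionary, so it is ignored.
x = create_vector_from_indices(["Dead", "Man", "Unknown"], toy_dict)
print(x.toarray())
```

Only the two known terms produce stored entries; the unknown term contributes nothing, which matches the behavior of the loop-based version.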
Once we have our function, we will apply it to each record in our RDD of extracted terms:
all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms:
    create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)
We can then inspect the first few records of our new RDD of sparse vectors:
[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column
format>,
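To sanity-check the mapping step without a Spark cluster, the same per-record transformation can be emulated on a single machine. This is only a sketch under stated assumptions: `title_terms_local` and `term_dict_local` are hypothetical stand-ins for the RDD and the broadcast dictionary, and the vector is built in LIL format (which supports cheap item assignment) before being converted to CSC.

```python
from scipy import sparse as sp

def create_vector(terms, term_dict):
    # Same logic as the function in the text; LIL is an assumption of this
    # sketch, chosen because it handles element-by-element assignment well.
    x = sp.lil_matrix((1, len(term_dict)))
    for t in terms:
        if t in term_dict:
            x[0, term_dict[t]] = 1
    return x.tocsc()

# Hypothetical stand-ins for the title_terms RDD and the term dictionary.
title_terms_local = [["Toy", "Story"], ["Dead", "Man", "Walking"], ["Four", "Rooms"]]
all_terms = sorted({t for terms in title_terms_local for t in terms})
term_dict_local = {term: idx for idx, term in enumerate(all_terms)}

# A list comprehension plays the role of RDD.map on a single machine.
term_vectors_local = [create_vector(terms, term_dict_local)
                      for terms in title_terms_local]

for terms, v in zip(title_terms_local, term_vectors_local):
    # nonzero() returns (row indices, column indices) of the stored entries.
    print(terms, v.shape, v.nnz, sorted(v.nonzero()[1].tolist()))
```

Each vector has one stored element per title term found in the dictionary, which is what the "stored elements" count in the output above reports.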