Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark


print "Index of term 'Dead': %d" % all_terms_dict2['Dead']

print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']

The output is as follows:

Index of term 'Dead': 147

Index of term 'Rooms': 1963

The final step is to create a function that converts a set of terms into a sparse vector representation. To do this, we will create an empty sparse matrix with one row and a number of columns equal to the total number of terms in our dictionary. We will then step through each term in the input list of terms and check whether this term is in our term dictionary. If it is, we assign a value of 1 to the vector at the index that corresponds to the term in our dictionary mapping:

# this function takes a list of terms and encodes it as a scipy sparse
# vector using an approach similar to the 1-of-k encoding
def create_vector(terms, term_dict):
    from scipy import sparse as sp
    num_terms = len(term_dict)
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x
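To make the encoding easier to see, here is a minimal, dependency-free sketch of the same idea. It is not the function above: instead of a SciPy sparse matrix it returns the list of column indices that would be set to 1. The dictionary `toy_term_dict` and the helpers `active_indices` and `to_dense` are hypothetical names used only for illustration.

```python
# Toy term dictionary (hypothetical values, not the real movie-title terms):
# each known term maps to a column index.
toy_term_dict = {"Dead": 0, "Man": 1, "Walking": 2, "Rooms": 3}


def active_indices(terms, term_dict):
    """Return the sorted column indices that would hold a 1."""
    # Terms missing from the dictionary are skipped. Duplicates collapse
    # because a set is used, which mirrors assigning 1 to the same cell twice.
    return sorted({term_dict[t] for t in terms if t in term_dict})


def to_dense(indices, num_terms):
    """Expand the active indices into a dense 0/1 list for inspection."""
    dense = [0.0] * num_terms
    for idx in indices:
        dense[idx] = 1.0
    return dense


# "Unknown" is not in the dictionary and "Dead" appears twice.
indices = active_indices(["Dead", "Man", "Unknown", "Dead"], toy_term_dict)
dense = to_dense(indices, len(toy_term_dict))
print(indices)
print(dense)
```

Only the two known terms contribute, so the result has two active positions out of four. With a real dictionary of a few thousand terms, almost every position is zero, which is why a sparse representation is the natural choice.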

Once we have our function, we will apply it to each record in our RDD of extracted terms:

all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)

We can then inspect the first few records of our new RDD of sparse vectors:

[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Column format>,
