First, we will use Spark's flatMap function (highlighted in the following code snippet)
to expand the list of strings in each record of the title_terms RDD into a new RDD
of strings called all_terms, where each record is a single term.
We can then collect all the unique terms and assign indexes in exactly the same way that
we did for the 1-of-k encoding of user occupations earlier:
# next we would like to collect all the possible terms, in
# order to build our dictionary of term <-> index mappings
all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# create a new dictionary to hold the terms, and assign the
# "1-of-k" indexes
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1
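To illustrate what flatMap is doing here, the following is a minimal sketch on a toy
RDD with the same shape as title_terms (the sample data and the local SparkContext
sc are assumptions for illustration):
# each record is a list of terms, as in title_terms
sample_terms = sc.parallelize([["dead", "man"], ["dead", "rooms"]])
# flatMap flattens the per-record lists into a single RDD of
# terms; distinct then removes the duplicates
print sample_terms.flatMap(lambda x: x).distinct().collect()
# output (ordering may vary): ['dead', 'man', 'rooms']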
We can print out the total number of unique terms and test out our term mapping on a few
terms:
print "Total number of terms: %d" % len(all_terms_dict)
print "Index of term 'Dead': %d" % all_terms_dict['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']
This will result in the following output:
Total number of terms: 2645
Index of term 'Dead': 147
Index of term 'Rooms': 1963
We can also achieve the same result more efficiently using Spark's zipWithIndex
function. This function takes an RDD of values and zips each value together with an
index to create a new RDD of key-value pairs, where the key is the term and the value
is its index in the term dictionary. We will use collectAsMap to collect this
key-value RDD to the driver as a Python dict:
all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct() \
    .zipWithIndex().collectAsMap()
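Since the ordering produced by distinct is not guaranteed to be deterministic, the
indexes assigned by the two approaches are not guaranteed to be identical, but both
yield a valid term-to-index mapping. As a quick sanity check (a sketch, assuming both
dictionaries built above are in scope):
# both dictionaries should contain exactly the same terms,
# even if the assigned indexes happen to differ
print len(all_terms_dict2) == len(all_terms_dict)
print sorted(all_terms_dict2.keys()) == sorted(all_terms_dict.keys())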