First, we will use Spark's flatMap function (highlighted in the following code snippet)
to expand the list of strings in each record of the title_terms RDD into a new RDD
of strings called all_terms, where each record is a single term.
We can then collect all the unique terms and assign indexes in exactly the same way that
we did for the 1-of-k encoding of user occupations earlier:
# next we would like to collect all the possible terms, in
# order to build our dictionary of term <-> index mappings
all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# create a new dictionary to hold the terms, and assign the
# "1-of-k" indexes
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1
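To illustrate what flatMap is doing here, the following is a minimal sketch on a toy
RDD with the same shape as title_terms (the sample data and the local SparkContext
sc are assumptions for illustration):
# each record is a list of terms, as in title_terms
sample_terms = sc.parallelize([["dead", "man"], ["dead", "rooms"]])
# flatMap flattens the per-record lists into a single RDD of
# terms; distinct then removes the duplicates
print sample_terms.flatMap(lambda x: x).distinct().collect()
# output (ordering may vary): ['dead', 'man', 'rooms']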
We can print out the total number of unique terms and test out our term mapping on a few
terms:
print "Total number of terms: %d" % len(all_terms_dict)
print "Index of term 'Dead': %d" % all_terms_dict['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']
This will result in the following output:
Total number of terms: 2645
Index of term 'Dead': 147
Index of term 'Rooms': 1963
We can also achieve the same result more efficiently using Spark's zipWithIndex
function. This function takes an RDD of values and zips each value together with an
index to create a new RDD of key-value pairs, where the key is the term and the value
is its index in the term dictionary. We will use collectAsMap to collect this
key-value RDD to the driver as a Python dict:
all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct() \
    .zipWithIndex().collectAsMap()
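Since the ordering produced by distinct is not guaranteed to be deterministic, the
indexes assigned by the two approaches are not guaranteed to be identical, but both
yield a valid term-to-index mapping. As a quick sanity check (a sketch, assuming both
dictionaries built above are in scope):
# both dictionaries should contain exactly the same terms,
# even if the assigned indexes happen to differ
print len(all_terms_dict2) == len(all_terms_dict)
print sorted(all_terms_dict2.keys()) == sorted(all_terms_dict.keys())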