We can verify that our function works by inspecting the results, which should look like
this:
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat
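For reference, the extract_title function was defined earlier. A minimal sketch of such a function, assuming the MovieLens convention of a parenthesized release year appended to each raw title (for example, Toy Story (1995)), might look like this:
import re

def extract_title(raw):
    # strip the trailing "(year)" portion, if present, keeping only the title
    match = re.search(r"\((\d+)\)", raw)
    if match:
        return raw[:match.start()].strip()
    else:
        return raw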
We would then like to apply our function to the raw titles and tokenize the extracted titles to convert them to terms. We will use the simple whitespace tokenization we covered earlier:
movie_titles = raw_titles.map(lambda m: extract_title(m))
# next, we tokenize the titles into terms using simple whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)
Applying this simple tokenization gives the following result:
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'],
[u'Get', u'Shorty'], [u'Copycat']]
We can see that we have split each title on spaces so that each word becomes a token.
Tip
Here, we do not cover details such as converting text to lowercase, removing non-word or non-numerical characters such as punctuation and special characters, removing stop words, and stemming. These steps will be important in a real-world application. We will cover many of these topics in Chapter 9, Advanced Text Processing with Spark.
This additional processing can be done fairly simply using string functions, regular expressions, and the Spark API (apart from stemming). Perhaps you would like to give it a try!
In order to assign each term to an index in our vector, we need to create the term dictionary, which maps each term to an integer index.
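A minimal sketch of building such a dictionary from the title_terms RDD above could collect the distinct terms and enumerate them:
# collect all unique terms across the tokenized titles
all_terms = title_terms.flatMap(lambda terms: terms).distinct().collect()
# assign each term a unique integer index
all_terms_dict = dict((term, idx) for idx, term in enumerate(all_terms))
# look up the index assigned to a term from our sample output
print all_terms_dict[u'Toy']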