We can verify that our function works by inspecting the results, which should look like this:
Toy Story
Four Rooms
Get Shorty
Next, we apply our function to the raw titles and then tokenize the extracted titles to
convert them to terms. We will use the simple whitespace tokenization we covered earlier:
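# apply extract_title to each raw title
# (raw_titles is assumed to be the RDD of raw title strings)
movie_titles = raw_titles.map(lambda m: extract_title(m))
# next we tokenize the titles into terms. We'll use simple
# whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)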
Applying this simple tokenization gives the following result:
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'],
[u'Get', u'Shorty'], [u'Copycat']]
We can see that we have split each title on spaces so that each word becomes a token.
Here, we do not cover details such as converting text to lowercase, removing non-word or
non-numerical characters such as punctuation and special characters, removing stop
words, and stemming. These steps will be important in a real-world application. We will
cover many of these topics in Chapter 9, Advanced Text Processing with Spark.
This additional processing can be done fairly simply using string functions, regular
expressions, and the Spark API (apart from stemming). Perhaps you would like to give it a
try.
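As a rough sketch of what this extra processing might look like, the following plain-Python function lowercases a title, replaces non-alphanumeric characters using a regular expression, and drops stop words. The function name clean_tokenize and the tiny STOP_WORDS set are illustrative assumptions and not part of the original example; a real application would need a much fuller stop word list, and stemming is not attempted here.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to"}

def clean_tokenize(title):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    lowered = title.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace
    # with a space, so punctuation does not glue words together.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # split() with no argument collapses runs of whitespace.
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]
```

Under these assumptions, the function could be used in place of the simple split, for example movie_titles.map(clean_tokenize).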
In order to assign each term to an index in our vector, we need to create the term
dictionary, which maps each term to an integer index.
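One way such a term dictionary might be built is sketched below in plain Python, using the five tokenized titles shown earlier as sample input. The helper name build_term_dictionary is a hypothetical choice for illustration, and indices are assigned in order of first appearance, which is only one of several reasonable conventions. With Spark, the distinct terms could first be gathered with something like title_terms.flatMap(lambda x: x).distinct().collect(), in which case the ordering of indices would generally differ.

```python
def build_term_dictionary(tokenized_titles):
    """Map each distinct term to an integer index, in order of first appearance."""
    term_to_index = {}
    for terms in tokenized_titles:
        for term in terms:
            if term not in term_to_index:
                # The next free index equals the current dictionary size.
                term_to_index[term] = len(term_to_index)
    return term_to_index

# Sample input: the tokenized titles from the output shown earlier.
sample_titles = [["Toy", "Story"], ["GoldenEye"], ["Four", "Rooms"],
                 ["Get", "Shorty"], ["Copycat"]]
term_dict = build_term_dictionary(sample_titles)
```

Each title can then be turned into a list of indices by looking up its terms in term_dict.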