We can verify that our function works by inspecting the results, which should look like
this:
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat
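For reference, the extract_title function was defined earlier. A minimal sketch of such a function, assuming the MovieLens convention of a parenthesized release year appended to each raw title (for example, Toy Story (1995)), might look like this:
import re

def extract_title(raw):
    # strip the trailing "(year)" portion, if present, keeping only the title
    match = re.search(r"\((\d+)\)", raw)
    if match:
        return raw[:match.start()].strip()
    else:
        return raw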
We would then like to apply our function to the raw titles and tokenize the extracted titles to convert them to terms. We will use the simple whitespace tokenization we covered earlier:
movie_titles = raw_titles.map(lambda m: extract_title(m))
# next, we tokenize the titles into terms using simple whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)
Applying this simple tokenization gives the following result:
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'],
[u'Get', u'Shorty'], [u'Copycat']]
We can see that we have split each title on spaces so that each word becomes a token.
Tip
Here, we do not cover details such as converting text to lowercase, removing non-word or non-numerical characters such as punctuation and special characters, removing stop words, and stemming. These steps will be important in a real-world application. We will cover many of these topics in Chapter 9, Advanced Text Processing with Spark.
This additional processing can be done fairly simply using string functions, regular expressions, and the Spark API (apart from stemming). Perhaps you would like to give it a try!
In order to assign each term to an index in our vector, we need to create the term dictionary, which maps each term to an integer index.
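A minimal sketch of building such a dictionary from the title_terms RDD above could collect the distinct terms and enumerate them:
# collect all unique terms across the tokenized titles
all_terms = title_terms.flatMap(lambda terms: terms).distinct().collect()
# assign each term a unique integer index
all_terms_dict = dict((term, idx) for idx, term in enumerate(all_terms))
# look up the index assigned to a term from our sample output
print all_terms_dict[u'Toy']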