Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Note

In Chapter 9, Advanced Text Processing with Spark, we will cover more complex text processing and feature extraction, including methods to weight terms; these methods go beyond the basic binary encoding we saw earlier.

Simple text feature extraction

To show an example of extracting textual features in the binary vector representation, we can use the movie titles that we have available.

First, we will create a function to strip away the year of release for each movie, if the year is present, leaving only the title of the movie.

We will use Python's regular expression module, re, to search for the year between parentheses in the movie titles. If the regular expression matches, we keep only the part of the title that comes before the match, that is, everything up to the index of the opening parenthesis in the title string. The slice raw[:grps.start()] does this in the following code snippet:

def extract_title(raw):
    import re
    # this regular expression finds a run of word characters (here, the
    # release year) between parentheses
    grps = re.search(r"\((\w+)\)", raw)
    if grps:
        # we take only the title part, and strip the trailing
        # whitespace from the remaining text, below
        return raw[:grps.start()].strip()
    else:
        return raw
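As a rough illustration of the slicing step, the sketch below runs the same pattern against two strings outside of Spark. The sample strings are assumptions written in the "Title (Year)" style for illustration only; they are not read from the dataset.

```python
import re

# Same pattern as in extract_title: word characters enclosed in parentheses
pattern = r"\((\w+)\)"

# Hypothetical inputs, for illustration only
with_year = "Toy Story (1995)"
without_year = "Untitled Film"

match = re.search(pattern, with_year)
# match.start() is the index of the opening parenthesis
paren_index = match.start()
title = with_year[:paren_index].strip()
year = match.group(1)

# re.search returns None when nothing matches, so the raw string is kept
no_match = re.search(pattern, without_year)
fallback = without_year if no_match is None else without_year[:no_match.start()].strip()

print(paren_index, repr(title), year, repr(fallback))
```

Note that \w+ matches any run of letters, digits, or underscores, so a single parenthesised word earlier in a title would also match. A stricter pattern such as \((\d{4})\) is one possible alternative if only four-digit years should be stripped.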

Next, we will extract the raw movie titles from the movie_fields RDD:

raw_titles = movie_fields.map(lambda fields: fields[1])
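For readers without a Spark shell at hand, the following sketch mimics this map step on a plain Python list. It assumes that each element of movie_fields is a list of string fields with the title at index 1; the records below and their pipe-delimited layout are made up for illustration, and the names ending in _local are hypothetical.

```python
# Hypothetical pipe-delimited records; the layout is an assumption
lines = [
    "1|First Example (1995)|01-Jan-1995",
    "2|Second Example (1996)|01-Jan-1996",
]

# Local stand-in for the movie_fields RDD: one list of fields per record
movie_fields_local = [line.split("|") for line in lines]

# Equivalent of movie_fields.map(lambda fields: fields[1]) on a list
raw_titles_local = [fields[1] for fields in movie_fields_local]

print(raw_titles_local)
```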

We can test out our extract_title function on the first five raw titles as follows:

for raw_title in raw_titles.take(5):
    print(extract_title(raw_title))
