Note
In Chapter 9, Advanced Text Processing with Spark, we will cover more complex text
processing and feature extraction, including methods to weight terms; these methods go
beyond the basic binary encoding we saw earlier.
Simple text feature extraction
To show an example of extracting textual features in the binary vector representation, we
can use the movie titles that we have available.
First, we will create a function to strip away the year of release for each movie, if the year
is present, leaving only the title of the movie.
We will use Python's regular expression module, re, to search for the year between
parentheses in the movie titles. If the regular expression matches, we keep only the part
of the title before the start of the match (that is, before the index in the title string
of the opening parenthesis). The slice raw[:grps.start()] in the following code snippet
does this:
import re

def extract_title(raw):
    # this regular expression finds a run of word characters
    # (here, the digits of the year) between parentheses
    grps = re.search(r"\((\w+)\)", raw)
    if grps:
        # we take only the part before the opening parenthesis and
        # strip the surrounding whitespace from that text
        return raw[:grps.start()].strip()
    else:
        # no parenthesized group found, so return the title unchanged
        return raw
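As a quick check outside Spark, the function can be tried on a few hand-written titles. The sketch below repeats extract_title so that it runs on its own. The sample strings are made up for illustration, and extract_title_year_only is a hypothetical stricter variant (not part of the original code) that removes only a trailing four-digit year. It is included because \w+ matches any word characters, so a title with an earlier parenthesized word would be cut at that point.

```python
import re

def extract_title(raw):
    # Same logic as the function above: cut at the first "(word)" group.
    grps = re.search(r"\((\w+)\)", raw)
    if grps:
        return raw[:grps.start()].strip()
    return raw

def extract_title_year_only(raw):
    # Hypothetical variant: remove only a four-digit year at the end of the title.
    grps = re.search(r"\s*\((\d{4})\)\s*$", raw)
    if grps:
        return raw[:grps.start()].strip()
    return raw

# Made-up sample titles, including one without a year and one with
# an extra parenthesized word before the year.
samples = [
    "Toy Story (1995)",
    "unknown",
    "Movie (Remix) Two (1999)",
]

for s in samples:
    print("%r -> %r | %r" % (s, extract_title(s), extract_title_year_only(s)))
```

For the third sample, the original function keeps only the text before "(Remix)", while the stricter variant keeps everything up to the year. Which behavior is preferable depends on the titles in the dataset.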
Next, we will extract the raw movie titles from the movie_fields RDD:
raw_titles = movie_fields.map(lambda fields: fields[1])
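To see what this map does without a Spark context, the same selection can be sketched with plain Python lists. The sample lines and the pipe-delimited layout below are assumptions made for illustration; the only detail taken from the text is that the title is the second field (index 1) of each record. The names sample_lines, local_movie_fields, and local_raw_titles are hypothetical stand-ins.

```python
# Hypothetical sample lines, assuming a pipe-delimited layout: id|title|release date
sample_lines = [
    "1|Toy Story (1995)|01-Jan-1995",
    "2|GoldenEye (1995)|01-Jan-1995",
]

# Stand-in for the movie_fields RDD: one list of string fields per line.
local_movie_fields = [line.split("|") for line in sample_lines]

# Same selection as the RDD map above: keep the field at index 1 (the title).
local_raw_titles = [fields[1] for fields in local_movie_fields]

print(local_raw_titles)
```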
We can test out our extract_title function on the first five raw titles as follows:
for raw_title in raw_titles.take(5):
    print(extract_title(raw_title))