51801
As we can see, by applying all the filtering steps in our tokenization pipeline, we have reduced the feature dimension from 402,978 to 51,801.
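Note that the combined function we define next refers to the regex, stopwords, and rareTokens values built during the preceding filtering steps. If you are following along in a fresh shell, a minimal sketch of such definitions might look like the following (the stop word list is abbreviated for illustration, and tokenCounts is assumed to be the (token, count) RDD computed earlier):
// a minimal sketch, not the exact definitions from the earlier steps
// the regex keeps tokens that contain no digits:
val regex = """[^0-9]*""".r
// an abbreviated stop word set (the earlier step used a longer list):
val stopwords = Set("the", "a", "an", "of", "or", "in", "for", "by", "on")
// rare tokens are those occurring only once across the corpus;
// tokenCounts is assumed to be the (token, count) RDD from earlier:
val rareTokens = tokenCounts.filter { case (_, c) => c < 2 }
  .map { case (token, _) => token }
  .collect
  .toSet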
We can now combine all our filtering logic into one function, which we can apply to each
document in our RDD:
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    // keep only tokens that pass the regex filter (those without digits)
    .filter(token => regex.pattern.matcher(token).matches)
    // remove stop words
    .filterNot(token => stopwords.contains(token))
    // remove tokens that occur only once across the corpus
    .filterNot(token => rareTokens.contains(token))
    // remove single-character tokens
    .filter(token => token.size >= 2)
    .toSeq
}
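As a quick sanity check, we can apply tokenize to a single sample line locally (the exact output depends on the stop word and rare token sets in use):
tokenize("The quick brown fox ran 42 times")
// WrappedArray(quick, brown, fox, ran, times)
// "the" is removed as a stop word, while "42" fails the digit-free regex filter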
We can check whether this function gives us the same result with the following code snippet:
println(text.flatMap(doc => tokenize(doc)).distinct.count)
This will output 51801, giving us the same unique token count as our step-by-step pipeline.
We can tokenize each document in our RDD as follows:
val tokens = text.map(doc => tokenize(doc))
println(tokens.first.take(20))
You will see output similar to the following, showing the first part of the tokenized version of our first document:
WrappedArray(mathew, mathew, mantis, co, uk, subject, alt,
atheism, faq, atheist, resources, summary, books,
addresses, music, anything, related, atheism, keywords, faq)
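Since tokens will be reused by the feature extraction steps that follow, it can be worth caching the RDD so the tokenization work is not repeated on each action (an optional optimization, not part of the pipeline itself):
// optional: keep the tokenized documents in memory for later stages
tokens.cache()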