51801
As we can see, by applying all the filtering steps in our tokenization pipeline, we have reduced the feature dimension from 402,978 to 51,801.
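Note that the combined function we define next refers to the regex, stopwords, and rareTokens values built during the preceding filtering steps. If you are following along in a fresh shell, a minimal sketch of such definitions might look like the following (the stop word list is abbreviated for illustration, and tokenCounts is assumed to be the (token, count) RDD computed earlier):
// a minimal sketch, not the exact definitions from the earlier steps
// the regex keeps tokens that contain no digits:
val regex = """[^0-9]*""".r
// an abbreviated stop word set (the earlier step used a longer list):
val stopwords = Set("the", "a", "an", "of", "or", "in", "for", "by", "on")
// rare tokens are those occurring only once across the corpus;
// tokenCounts is assumed to be the (token, count) RDD from earlier:
val rareTokens = tokenCounts.filter { case (_, c) => c < 2 }
  .map { case (token, _) => token }
  .collect
  .toSet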
We can now combine all our filtering logic into one function, which we can apply to each
document in our RDD:
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    // keep only tokens that pass the regex filter (those without digits)
    .filter(token => regex.pattern.matcher(token).matches)
    // remove stop words
    .filterNot(token => stopwords.contains(token))
    // remove tokens that occur only once across the corpus
    .filterNot(token => rareTokens.contains(token))
    // remove single-character tokens
    .filter(token => token.size >= 2)
    .toSeq
}
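As a quick sanity check, we can apply tokenize to a single sample line locally (the exact output depends on the stop word and rare token sets in use):
tokenize("The quick brown fox ran 42 times")
// WrappedArray(quick, brown, fox, ran, times)
// "the" is removed as a stop word, while "42" fails the digit-free regex filter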
We can check whether this function gives us the same result with the following code snippet:
println(text.flatMap(doc => tokenize(doc)).distinct.count)
This will output 51801, giving us the same unique token count as our step-by-step pipeline.
We can tokenize each document in our RDD as follows:
val tokens = text.map(doc => tokenize(doc))
println(tokens.first.take(20))
You will see output similar to the following, showing the first part of the tokenized version of our first document:
WrappedArray(mathew, mathew, mantis, co, uk, subject, alt,
atheism, faq, atheist, resources, summary, books,
addresses, music, anything, related, atheism, keywords, faq)
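Since tokens will be reused by the feature extraction steps that follow, it can be worth caching the RDD so the tokenization work is not repeated on each action (an optional optimization, not part of the pipeline itself):
// optional: keep the tokenized documents in memory for later stages
tokens.cache()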