(you,26682)
(from,22670)
(s,22337)
(edu,21321)
(on,20493)
(this,20121)
(be,19285)
(t,18728)
As we might expect, many of the words in this list are common words that we could
label as stop words. Let's create a set of stop words from some of these, along with other
common words, and then look at the tokens that remain after filtering them out:
val stopwords = Set(
  "the", "a", "an", "of", "or", "in", "for", "by", "on", "but",
  "is", "not", "with", "as", "was", "if",
  "they", "are", "this", "and", "it", "have", "from", "at",
  "my", "be", "that", "to"
)
// keep only the (token, count) pairs whose token is not a stop word
val tokenCountsFilteredStopwords = tokenCounts.filter {
  case (k, v) => !stopwords.contains(k)
}
println(tokenCountsFilteredStopwords.top(20)(oreringDesc).mkString("\n"))
You will see the following output:
(ax,62406)
(i,53036)
(you,26682)
(s,22337)
(edu,21321)
(t,18728)
(m,12756)
(subject,12264)
(com,12133)
(lines,11835)
(can,11355)
(organization,11233)
(re,10534)
(what,9861)
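
To see the filtering logic in isolation, here is a minimal sketch that applies the same idea to an ordinary Scala collection rather than a Spark RDD. The object name StopwordFilterSketch and the helpers filterStopwords and topN are hypothetical names introduced only for this illustration. The sample counts are copied from the first listing above, and the stop word set is a small subset of the one defined earlier.

```scala
// Plain-Scala sketch (no Spark required): filter stop words, then take the top N by count.
// StopwordFilterSketch, filterStopwords and topN are illustrative names, not part of Spark.
object StopwordFilterSketch {
  // A small subset of the stop word set used in the text.
  val stopwords: Set[String] = Set("the", "a", "this", "from", "on", "be")

  // Keep only (token, count) pairs whose token is not a stop word.
  def filterStopwords(counts: Seq[(String, Int)], stop: Set[String]): Seq[(String, Int)] =
    counts.filter { case (token, _) => !stop.contains(token) }

  // Return the n pairs with the highest counts, largest first.
  def topN(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    counts.sortBy { case (_, count) => -count }.take(n)

  def main(args: Array[String]): Unit = {
    // Sample counts taken from the unfiltered listing shown earlier.
    val tokenCounts = Seq(
      ("you", 26682), ("from", 22670), ("s", 22337), ("edu", 21321),
      ("on", 20493), ("this", 20121), ("be", 19285), ("t", 18728)
    )
    println(topN(filterStopwords(tokenCounts, stopwords), 3).mkString("\n"))
  }
}
```

With this sample, "from", "on", "this", and "be" are dropped, so the three tokens printed are you, s, and edu, in the same relative order as in the filtered output above. On an RDD, the combination of filter and top plays the role that filter, sortBy, and take play here.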