(there,9689)
(x,9332)
(all,9310)
(will,9279)
(we,9227)
(one,9008)
You might notice that there are still quite a few common words in this top list. In practice,
we might have a much larger set of stop words. However, we will keep a few (partly to
illustrate the impact of common words when using TF-IDF weighting a little later).
One other filtering step that we will use is removing any tokens that are only one character
in length. The reasoning behind this is similar to removing stop words: these single-character
tokens are unlikely to be informative in our text model, and removing them further reduces
the feature dimension and model size. We will do this with another filtering step:
// Keep only tokens that are at least two characters long
val tokenCountsFilteredSize =
  tokenCountsFilteredStopwords.filter { case (k, v) => k.size >= 2 }
println(tokenCountsFilteredSize.top(20)(oreringDesc).mkString("\n"))
Again, we will examine the tokens remaining after this filtering step:
(ax,62406)
(you,26682)
(edu,21321)
(subject,12264)
(com,12133)
(lines,11835)
(can,11355)
(organization,11233)
(re,10534)
(what,9861)
(there,9689)
(all,9310)
(will,9279)
(we,9227)
(one,9008)
(would,8905)
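To see the effect of the length filter in isolation, the following is a minimal sketch that applies the same predicate to an ordinary Scala collection rather than a Spark RDD. The object name TokenLengthFilter, the helper filterByLength, and the minLength parameter are hypothetical names introduced only for this illustration; the four sample pairs are copied from the listings above.

```scala
object TokenLengthFilter {
  // A small sample of (token, count) pairs taken from the listings above.
  // "x" is the single-character token that the filter should drop.
  val sampleCounts: Seq[(String, Int)] = Seq(
    ("ax", 62406),
    ("you", 26682),
    ("x", 9332),
    ("one", 9008)
  )

  // Hypothetical helper: keep only tokens with at least minLength characters.
  // With the default of 2, this mirrors the k.size >= 2 predicate used above.
  def filterByLength(
      counts: Seq[(String, Int)],
      minLength: Int = 2
  ): Seq[(String, Int)] =
    counts.filter { case (token, _) => token.length >= minLength }

  def main(args: Array[String]): Unit = {
    val filtered = filterByLength(sampleCounts)
    // Sort by count in descending order before printing
    println(filtered.sortBy { case (_, count) => -count }.mkString("\n"))
  }
}
```

Running this sketch prints the three multi-character tokens in descending order of count, while (x,9332) no longer appears. The RDD version used in the text behaves the same way, except that the filter is evaluated in a distributed fashion across partitions.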