As we can see, many terms occur only once in the entire corpus. We typically
want to use the extracted features for other tasks, such as document similarity
or machine learning models. Tokens that occur only once are not useful to learn
from, because there is not enough training data for them. We can apply another
filter to exclude these rare tokens:
// Collect the set of tokens that occur only once in the corpus
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.
  map { case (k, v) => k }.collect.toSet
// Drop those rare tokens from the previously filtered counts
val tokenCountsFilteredAll = tokenCountsFilteredSize.
  filter { case (k, v) => !rareTokens.contains(k) }
println(tokenCountsFilteredAll.top(20)(oreringAsc).mkString("\n"))
We can see that we are left with tokens that occur at least twice in the corpus:
(sina,2)
(akachhy,2)
(mvd,2)
(hizbolah,2)
(wendel_clark,2)
(sarkis,2)
(purposeful,2)
(feagans,2)
(wout,2)
(uneven,2)
(senna,2)
(multimeters,2)
(bushy,2)
(subdivided,2)
(coretest,2)
(oww,2)
(historicity,2)
(mmg,2)
(margitan,2)
(defiance,2)
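To make the rare-token filter easier to try without a cluster, here is a minimal sketch of the same idea using plain Scala collections rather than RDDs. The toy documents, the object name RareTokenFilter, and the helpers tokenCounts and dropRare are assumptions made for illustration; they are not part of the original pipeline.

```scala
// Sketch only: plain Scala collections stand in for the Spark RDDs.
// All names and data below are hypothetical.
object RareTokenFilter {

  // Count how often each token occurs across all documents.
  def tokenCounts(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.groupBy(identity).map {
      case (token, occurrences) => (token, occurrences.size)
    }

  // Remove every token whose count is below minCount.
  def dropRare(counts: Map[String, Int], minCount: Int): Map[String, Int] = {
    val rareTokens = counts.filter { case (_, v) => v < minCount }.keySet
    counts.filter { case (k, _) => !rareTokens.contains(k) }
  }

  def main(args: Array[String]): Unit = {
    // Three tiny, already tokenized documents.
    val docs = Seq(
      Seq("spark", "rdd", "token"),
      Seq("spark", "token", "hizbolah"),
      Seq("rdd", "spark", "senna")
    )
    val filtered = dropRare(tokenCounts(docs), minCount = 2)
    // Print the surviving tokens, lowest count first, then their number.
    filtered.toSeq.sortBy { case (k, v) => (v, k) }.foreach(println)
    println(filtered.size)
  }
}
```

In this toy corpus, "hizbolah" and "senna" each occur once and are dropped, while "spark", "rdd", and "token" survive. Building the set of rare tokens first mirrors the approach above; filtering directly on the count would give the same result here.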
Now, let's count the number of unique tokens:
println(tokenCountsFilteredAll.count)
You will see the following output: