(do,8674)
(he,8441)
(about,8336)
(writes,7844)
Apart from some of the common words that we have not excluded, we see that a few
potentially more informative words are starting to appear.
Excluding terms based on frequency
It is also common practice to exclude terms during tokenization when their overall
occurrence in the corpus is very low. For example, let's examine the least frequent terms in
the corpus (notice the different ordering we use here, which returns the results sorted in
ascending order of count):
// Negating the count inverts the ordering, so top(20) returns the 20 least frequent tokens
val orderingAsc = Ordering.by[(String, Int), Int](-_._2)
println(tokenCountsFilteredSize.top(20)(orderingAsc).mkString("\n"))
You will get the following results:
(lennips,1)
(bluffing,1)
(preload,1)
(altina,1)
(dan_jacobson,1)
(vno,1)
(actu,1)
(donnalyn,1)
(ydag,1)
(mirosoft,1)
(xiconfiywindow,1)
(harger,1)
(feh,1)
(bankruptcies,1)
(uncompression,1)
(d_nibby,1)
(bunuel,1)
(odf,1)
(swith,1)
(lantastic,1)
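The inverted ordering and the idea of dropping rare terms can be tried out without a cluster. The following is a minimal sketch that uses plain Scala collections in place of an RDD and is meant to be pasted into a Scala REPL. The sample data and the names sampleCounts, leastFrequent, rareTokens, and tokensKept are hypothetical, and the threshold minCount = 2 is an assumption chosen for illustration, not a recommended value.

```scala
// Hypothetical sample of (token, count) pairs standing in for the RDD of token counts.
val sampleCounts: Seq[(String, Int)] = Seq(
  ("do", 8674), ("he", 8441), ("about", 8336), ("writes", 7844),
  ("lennips", 1), ("bluffing", 1), ("preload", 1)
)

// Negating the count reverses the natural ordering, so the "largest" elements
// under this ordering are the tokens with the smallest counts.
val orderingAsc = Ordering.by[(String, Int), Int](-_._2)

// Plain-collection stand-in for RDD.top(n)(ord):
// sort from largest to smallest under ord, then keep the first n.
val leastFrequent = sampleCounts.sorted(orderingAsc.reverse).take(3)

// Assumed threshold: tokens seen fewer than minCount times are treated as rare.
val minCount = 2
val rareTokens: Set[String] =
  sampleCounts.collect { case (token, count) if count < minCount => token }.toSet

// Keep only the tokens that are not in the rare set.
val tokensKept = sampleCounts.filterNot { case (token, _) => rareTokens.contains(token) }

println(leastFrequent.mkString("\n"))
println(tokensKept.mkString("\n"))
```

Collecting the rare tokens into a Set keeps the membership check cheap when the same filter is later applied to every token of every document.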