Advanced Text Processing with Spark - Machine Learning with Spark


As we can see, there are many terms that occur only once in the entire corpus. Since we typically want to use our extracted features for other tasks, such as document similarity or machine learning models, tokens that occur only once are not useful to learn from, as we will not have enough training data relative to these tokens. We can apply another filter to exclude these rare tokens:

val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }
  .map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter {
  case (k, v) => !rareTokens.contains(k)
}
println(tokenCountsFilteredAll.top(20)(oreringAsc).mkString("\n"))

We can see that we are left with tokens that occur at least twice in the corpus:

(sina,2)
(akachhy,2)
(mvd,2)
(hizbolah,2)
(wendel_clark,2)
(sarkis,2)
(purposeful,2)
(feagans,2)
(wout,2)
(uneven,2)
(senna,2)
(multimeters,2)
(bushy,2)
(subdivided,2)
(coretest,2)
(oww,2)
(historicity,2)
(mmg,2)
(margitan,2)
(defiance,2)
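To make the logic of the rare-token filter easier to follow, here is a minimal sketch of the same two steps on plain Scala collections rather than RDDs. The toy data and the names toyTokenCounts, toyRareTokens, toyFiltered, and leastFrequent are hypothetical and introduced only for illustration; the sketch assumes the counts are (token, count) pairs, as in the code above.

```scala
// Toy (token, count) pairs standing in for the corpus-wide token counts (hypothetical data).
val toyTokenCounts: Seq[(String, Int)] = Seq(
  ("spark", 5), ("hockey", 3), ("uneven", 2), ("defiance", 2), ("xqzt", 1), ("aaargh", 1)
)

// Step 1: gather the tokens that occur fewer than twice into a set.
val toyRareTokens: Set[String] =
  toyTokenCounts.filter { case (_, v) => v < 2 }.map { case (k, _) => k }.toSet

// Step 2: drop every (token, count) pair whose token is in the rare set.
val toyFiltered: Seq[(String, Int)] =
  toyTokenCounts.filter { case (k, _) => !toyRareTokens.contains(k) }

// Sort ascending by count to inspect the least frequent surviving tokens,
// then print how many unique tokens remain.
val leastFrequent = toyFiltered.sortBy { case (_, v) => v }.take(3)
println(leastFrequent.mkString("\n"))
println(toyFiltered.size)
```

In this sketch, every surviving token has a count of at least 2, which mirrors the output shown above. On an RDD, collecting the rare tokens to the driver as a set works only as long as that set fits comfortably in memory.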

Now, let's count the number of unique tokens:

println(tokenCountsFilteredAll.count)

You will see the following output:
