Advanced Text Processing with Spark - Machine Learning with Spark


(there,9689)

(x,9332)

(all,9310)

(will,9279)

(we,9227)

(one,9008)

You might notice that there are still quite a few common words in this top list. In practice, we might use a much larger set of stop words. However, we will keep a few of these common words, partly to illustrate their impact when using TF-IDF weighting a little later.
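
As a rough sketch of what a larger stop word set would do to this list, the snippet below applies the same kind of filter to a small local collection in plain Scala rather than to an RDD. The token counts are copied from the listing above. The names `StopWordSketch`, `baseStopwords`, `extraStopwords`, and `removeStopwords` are hypothetical, and both word sets are chosen only for illustration; they are not the stop word list used earlier in the chapter.

```scala
object StopWordSketch {
  // Hypothetical stop word sets, for illustration only.
  val baseStopwords: Set[String] = Set("the", "a", "an", "of")
  val extraStopwords: Set[String] = Set("there", "all", "will", "we", "one")

  // Keep only the (token, count) pairs whose token is not a stop word.
  def removeStopwords(counts: Seq[(String, Int)], stopwords: Set[String]): Seq[(String, Int)] =
    counts.filterNot { case (token, _) => stopwords.contains(token) }

  def main(args: Array[String]): Unit = {
    // Counts copied from the listing above.
    val counts = Seq(("there", 9689), ("x", 9332), ("all", 9310),
                     ("will", 9279), ("we", 9227), ("one", 9008))
    // With the small set, every token in this sample survives.
    println(removeStopwords(counts, baseStopwords).mkString("\n"))
    // With the larger set, only the single-character token "x" remains.
    println(removeStopwords(counts, baseStopwords ++ extraStopwords).mkString("\n"))
  }
}
```

With the larger set, the only survivor in this sample is the single-character token x, which a stop word list does not catch.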

One other filtering step that we will use is removing any tokens that are only one character in length. The reasoning behind this is similar to removing stop words: these single-character tokens are unlikely to be informative in our text model, and removing them further reduces the feature dimension and model size. We will do this with another filtering step:

val tokenCountsFilteredSize =
  tokenCountsFilteredStopwords.filter { case (k, v) => k.size >= 2 }

println(tokenCountsFilteredSize.top(20)(oreringDesc).mkString("\n"))

Again, we will examine the tokens remaining after this filtering step:

(ax,62406)

(you,26682)

(edu,21321)

(subject,12264)

(com,12133)

(lines,11835)

(can,11355)

(organization,11233)

(re,10534)

(what,9861)

(there,9689)

(all,9310)

(will,9279)

(we,9227)

(one,9008)

(would,8905)
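
The length filter and the top-N listing can also be tried without a Spark context. The following is a minimal sketch on a local Scala collection, not the chapter's RDD code. The names `TokenLengthSketch`, `byCount`, `filterBySize`, and `topN` are hypothetical, and `byCount` assumes that the ordering passed to `top` above compares pairs by their count.

```scala
object TokenLengthSketch {
  // Assumed ordering: compare (token, count) pairs by count only.
  val byCount: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

  // Drop tokens shorter than minLength characters.
  def filterBySize(counts: Seq[(String, Int)], minLength: Int = 2): Seq[(String, Int)] =
    counts.filter { case (token, _) => token.length >= minLength }

  // Return the n pairs with the highest counts, largest first.
  def topN(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    counts.sorted(byCount.reverse).take(n)

  def main(args: Array[String]): Unit = {
    // Counts copied from the listing shown before the length filter.
    val counts = Seq(("there", 9689), ("x", 9332), ("all", 9310),
                     ("will", 9279), ("we", 9227), ("one", 9008))
    println(topN(filterBySize(counts), 3).mkString("\n"))
  }
}
```

Here the single-character token x is dropped before ranking, so it no longer appears among the top entries, which matches the change between the two listings above.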
