but, the, of, and so on. It is a standard practice in text feature extraction to exclude stop
words from the extracted tokens.
When using TF-IDF weighting, the weighting scheme actually takes care of this for us: because stop words occur in almost every document, they have a very low IDF score, and will therefore tend to have very low TF-IDF weightings and thus little importance. In some cases, such as information retrieval and search tasks, it might be desirable to include stop words. However, it can still be beneficial to exclude stop words during feature extraction, as doing so reduces the dimensionality of the final feature vectors as well as the size of the training data.
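As a quick illustration, stop-word removal amounts to filtering the token stream against a set of known stop words. The sketch below uses plain Scala collections and a tiny, hypothetical stop-word set so that it runs without a Spark context; on an RDD of tokens, the equivalent would be `tokens.filter(t => !stopwords.contains(t))`:

```scala
// A tiny, hypothetical stop-word set; a real one would be much larger
val stopwords = Set("the", "a", "an", "of", "to", "and", "in", "is", "that")

val tokens = Seq("the", "quick", "brown", "fox", "is", "in", "the", "yard")

// Keep only tokens that are not stop words
val filtered = tokens.filterNot(stopwords.contains)
// filtered: Seq(quick, brown, fox, yard)
```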
We can take a look at some of the tokens in our corpus that have the highest occurrence
across all documents to get an idea about some other stop words to exclude:
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val orderingDesc = Ordering.by[(String, Int), Int](_._2)
println(tokenCounts.top(20)(orderingDesc).mkString("\n"))
In the preceding code, we took the tokens after filtering out numeric characters and generated a count of the occurrence of each token across the corpus. We can now use Spark's top function to retrieve the top 20 tokens by count. Notice that we need to provide the top function with an ordering that tells Spark how to order the elements of our RDD. In this case, we want to order by the count, so we specify the second element of our key-value pair.
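The same ordering can be exercised outside Spark. The sketch below (an illustration on a plain Scala list with made-up counts, not the chapter's RDD) shows how Ordering.by on the second element of each pair selects the highest-count entries, which mirrors what top(20) does in the code above:

```scala
// Hypothetical (token, count) pairs for illustration
val counts = Seq(("the", 146532), ("quick", 12), ("fox", 7), ("to", 75064))

// Order pairs by their count (the second element of the tuple)
val orderingDesc = Ordering.by[(String, Int), Int](_._2)

// Take the two pairs with the highest counts, analogous to top(2) on an RDD
val top2 = counts.sorted(orderingDesc.reverse).take(2)
// top2: Seq((the,146532), (to,75064))
```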
Running the preceding code snippet will result in the following top tokens:
(the,146532)
(to,75064)
(of,69034)
(a,64195)
(ax,62406)
(and,57957)
(i,53036)
(in,49402)
(is,43480)
(that,39264)
(it,33638)
(for,28600)