but, the, of, and so on. It is a standard practice in text feature extraction to exclude stop
words from the extracted tokens.
When using TF-IDF weighting, the weighting scheme actually takes care of this for us: because stop words occur in almost every document, they have a very low IDF score, and will therefore tend to have very low TF-IDF weightings and thus little importance. In some cases, such as information retrieval and search tasks, it might be desirable to include stop words. However, it can still be beneficial to exclude stop words during feature extraction, as doing so reduces the dimensionality of the final feature vectors as well as the size of the training data.
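As a quick illustration, stop-word removal amounts to filtering the token stream against a set of known stop words. The sketch below uses plain Scala collections and a tiny, hypothetical stop-word set so that it runs without a Spark context; on an RDD of tokens, the equivalent would be `tokens.filter(t => !stopwords.contains(t))`:

```scala
// A tiny, hypothetical stop-word set; a real one would be much larger
val stopwords = Set("the", "a", "an", "of", "to", "and", "in", "is", "that")

val tokens = Seq("the", "quick", "brown", "fox", "is", "in", "the", "yard")

// Keep only tokens that are not stop words
val filtered = tokens.filterNot(stopwords.contains)
// filtered: Seq(quick, brown, fox, yard)
```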
We can take a look at some of the tokens in our corpus that have the highest occurrence
across all documents to get an idea about some other stop words to exclude:
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val orderingDesc = Ordering.by[(String, Int), Int](_._2)
println(tokenCounts.top(20)(orderingDesc).mkString("\n"))
In the preceding code, we took the tokens after filtering out numeric characters and generated a count of the occurrence of each token across the corpus. We can now use Spark's top function to retrieve the top 20 tokens by count. Notice that we need to provide the top function with an ordering that tells Spark how to order the elements of our RDD. In this case, we want to order by the count, so we specify the second element of our key-value pair.
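The same ordering can be exercised outside Spark. The sketch below (an illustration on a plain Scala list with made-up counts, not the chapter's RDD) shows how Ordering.by on the second element of each pair selects the highest-count entries, which mirrors what top(20) does in the code above:

```scala
// Hypothetical (token, count) pairs for illustration
val counts = Seq(("the", 146532), ("quick", 12), ("fox", 7), ("to", 75064))

// Order pairs by their count (the second element of the tuple)
val orderingDesc = Ordering.by[(String, Int), Int](_._2)

// Take the two pairs with the highest counts, analogous to top(2) on an RDD
val top2 = counts.sorted(orderingDesc.reverse).take(2)
// top2: Seq((the,146532), (to,75064))
```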
Running the preceding code snippet will result in the following top tokens:
(the,146532)
(to,75064)
(of,69034)
(a,64195)
(ax,62406)
(and,57957)
(i,53036)
(in,49402)
(is,43480)
(that,39264)
(it,33638)
(for,28600)