}
val globalMinMax = minMaxVals.reduce { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}
println(globalMinMax)
As we can see, the minimum TF-IDF is zero, while the maximum is significantly larger:
(0.0,66155.39470409753)
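The zero minimum is expected: MLlib's IDF implementation computes the inverse document frequency as log((m + 1) / (d + 1)), where m is the total number of documents and d is the number of documents containing the term, so a term that occurs in every single document receives a TF-IDF weight of log(1) = 0.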
We will now explore the TF-IDF weight attached to various terms. In the previous section on stop words, we filtered out many common terms that occur frequently. Recall that we did not remove all such potential stop words; instead, we kept a few in the corpus so that we could illustrate the impact of applying the TF-IDF weighting scheme to them. TF-IDF weighting tends to assign a lower weight to common terms. To see this, we can compute the TF-IDF representation for a few of the terms that appear in the list of top occurrences that we previously computed, such as you, do, and we:
val common = sc.parallelize(Seq(Seq("you", "do", "we")))
val tfCommon = hashingTF.transform(common)
val tfidfCommon = idf.transform(tfCommon)
val commonVector = tfidfCommon.first.asInstanceOf[SV]
println(commonVector.values.toSeq)
Forming the TF-IDF vector representation of this document, we see the following values assigned to each term. Note that because of feature hashing, we cannot be sure exactly which value corresponds to which term. However, the values illustrate that the weighting applied to these terms is relatively low:
WrappedArray(0.9965359935704624, 1.3348773448236835, 0.5457486182039175)
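If we want to pin down which weight belongs to which term, one option is to query the hashing function directly. The following is a minimal sketch that reuses the hashingTF and commonVector instances from the preceding code; HashingTF provides an indexOf method that returns the hashed feature index for a term, which we can then look up in the sparse vector's indices array:
val commonTerms = Seq("you", "do", "we")
commonTerms.foreach { term =>
  // indexOf hashes the term to its feature index
  val idx = hashingTF.indexOf(term)
  // find the position of that index in the sparse vector, if present
  val pos = commonVector.indices.indexOf(idx)
  if (pos >= 0) println(s"$term -> index $idx, weight ${commonVector.values(pos)}")
}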
Now, let's apply the same transformation to a few less common terms that we might intuitively associate with being more linked to specific topics or concepts:
val uncommon = sc.parallelize(Seq(Seq("telescope", "legislation", "investment")))
val tfUncommon = hashingTF.transform(uncommon)
val tfidfUncommon = idf.transform(tfUncommon)
val uncommonVector = tfidfUncommon.first.asInstanceOf[SV]
println(uncommonVector.values.toSeq)