}
val globalMinMax = minMaxVals.reduce { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}
println(globalMinMax)
As we can see, the minimum TF-IDF is zero, while the maximum is significantly larger:
(0.0,66155.39470409753)
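The zero minimum is expected: MLlib's IDF implementation computes the inverse document frequency as log((m + 1) / (d + 1)), where m is the total number of documents and d is the number of documents containing the term, so a term that occurs in every single document receives a TF-IDF weight of log(1) = 0.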
We will now explore the TF-IDF weight attached to various terms. In the previous section on stop words, we filtered out many common terms that occur frequently. Recall that we did not remove all such potential stop words; instead, we kept a few in the corpus so that we could illustrate the impact of applying the TF-IDF weighting scheme to them. TF-IDF weighting tends to assign a lower weight to common terms. To see this, we can compute the TF-IDF representation for a few of the terms that appear in the list of top occurrences that we previously computed, such as you, do, and we:
val common = sc.parallelize(Seq(Seq("you", "do", "we")))
val tfCommon = hashingTF.transform(common)
val tfidfCommon = idf.transform(tfCommon)
val commonVector = tfidfCommon.first.asInstanceOf[SV]
println(commonVector.values.toSeq)
Forming the TF-IDF vector representation of this document, we see the following values assigned to each term. Note that because of feature hashing, we cannot be sure exactly which value corresponds to which term. However, the values illustrate that the weighting applied to these terms is relatively low:
WrappedArray(0.9965359935704624, 1.3348773448236835, 0.5457486182039175)
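If we want to pin down which weight belongs to which term, one option is to query the hashing function directly. The following is a minimal sketch that reuses the hashingTF and commonVector instances from the preceding code; HashingTF provides an indexOf method that returns the hashed feature index for a term, which we can then look up in the sparse vector's indices array:
val commonTerms = Seq("you", "do", "we")
commonTerms.foreach { term =>
  // indexOf hashes the term to its feature index
  val idx = hashingTF.indexOf(term)
  // find the position of that index in the sparse vector, if present
  val pos = commonVector.indices.indexOf(idx)
  if (pos >= 0) println(s"$term -> index $idx, weight ${commonVector.values(pos)}")
}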
Now, let's apply the same transformation to a few less common terms that we might intuitively associate with being more linked to specific topics or concepts:
val uncommon = sc.parallelize(Seq(Seq("telescope", "legislation", "investment")))
val tfUncommon = hashingTF.transform(uncommon)
val tfidfUncommon = idf.transform(tfUncommon)
val uncommonVector = tfidfUncommon.first.asInstanceOf[SV]
println(uncommonVector.values.toSeq)