(2) Plagiarism: Let the items be documents and the baskets be sentences. An
item/document is “in” a basket/sentence if the sentence is in the document. This arrangement
appears backwards, but it is exactly what we need, and we should remember that the
relationship between items and baskets is an arbitrary many-many relationship. That
is, “in” need not have its conventional meaning: “part of.” In this application, we look
for pairs of items that appear together in several baskets. If we find such a pair, then
we have two documents that share several sentences. In practice, even one
or two sentences in common is a good indicator of plagiarism.
(3) Biomarkers: Let the items be of two types: biomarkers, such as genes or blood
proteins, and diseases. Each basket is the set of data about a patient: their genome and
blood-chemistry analysis, as well as their medical history of disease. A frequent
itemset that consists of one disease and one or more biomarkers suggests a test for the
disease.
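The plagiarism arrangement in item (2) can be illustrated with a minimal Python sketch. It assumes each document is already split into sentences; the function name `shared_sentence_counts` and the toy `documents` data are hypothetical, chosen only for illustration. The sketch builds one basket per sentence (the set of documents containing it) and then counts, for every pair of documents, the number of baskets in which both appear.

```python
from collections import defaultdict
from itertools import combinations


def shared_sentence_counts(documents):
    """Count, for each pair of documents, how many sentences they share.

    `documents` maps a document id to its list of sentences.
    Items are documents; each distinct sentence is a basket.
    """
    # Basket for a sentence = set of documents that contain that sentence.
    baskets = defaultdict(set)
    for doc_id, sentences in documents.items():
        for sentence in sentences:
            baskets[sentence].add(doc_id)

    # Count the baskets in which each pair of documents appears together.
    pair_counts = defaultdict(int)
    for docs in baskets.values():
        for pair in combinations(sorted(docs), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)


# Hypothetical toy data, not taken from the text.
documents = {
    "A": ["the cat sat", "it rained", "we left"],
    "B": ["the cat sat", "it rained", "hello"],
    "C": ["hello", "goodbye"],
}
counts = shared_sentence_counts(documents)
print(counts)
```

Pairs whose count reaches a chosen support threshold are the frequent pairs. This brute-force pair counting is only suitable for small inputs; it is not one of the scalable algorithms that a large collection would require.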
6.1.3 Association Rules
While the subject of this chapter is extracting frequent sets of items from data, this
information is often presented as a collection of if-then rules, called association rules. The form
of an association rule is I → j, where I is a set of items and j is an item. The implication of
this association rule is that if all of the items in I appear in some basket, then j is “likely” to
appear in that basket as well.
We formalize the notion of “likely” by defining the confidence of the rule I → j to be the
ratio of the support for I ∪ {j} to the support for I. That is, the confidence of the rule is the
fraction of the baskets with all of I that also contain j.
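As a minimal sketch of this definition, the following Python code computes support and confidence directly by scanning the baskets. The function names `support` and `confidence` and the toy `baskets` are assumptions made for illustration; they are not the baskets of Fig. 6.1.

```python
from fractions import Fraction


def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(I, j, baskets):
    """Confidence of the rule I -> j: support(I ∪ {j}) / support(I)."""
    support_I = support(I, baskets)
    if support_I == 0:
        raise ValueError("confidence is undefined when no basket contains I")
    return Fraction(support(set(I) | {j}, baskets), support_I)


# Hypothetical toy baskets, not taken from the text.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]

conf_bread_milk = confidence({"bread"}, "milk", baskets)
print(conf_bread_milk)  # 2/3
```

Here "bread" appears in three baskets and two of those also contain "milk", so the confidence of {bread} → milk is 2/3. Returning a `Fraction` keeps the ratio exact; a float would serve equally well for thresholding.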
EXAMPLE 6.2 Consider the baskets of Fig. 6.1. The confidence of the rule {cat, dog} → and
is 3/5. The words “cat” and “dog” appear together in five baskets: (1), (2), (3), (6), and (7). Of
these, “and” appears in (1), (2), and (7), or 3/5 of the baskets.
For another illustration, the confidence of {cat} → kitten is 1/6. The word “cat” appears
in six baskets: (1), (2), (3), (5), (6), and (7). Of these, only (5) has the word “kitten.”
Confidence alone can be useful, provided the support for the left side of the rule is fairly
large. For example, we don't need to know that people are unusually likely to buy mustard
when they buy hot dogs, as long as we know that many people buy hot dogs, and many
people buy both hot dogs and mustard. We can still use the sale-on-hot-dogs trick discussed
in Section 6.1.2. However, there is often more value to an association rule if it reflects a
true relationship, where the item or items on the left somehow affect the item on the right.