However, applications of frequent-itemset analysis are not limited to market
baskets. The same model can be used to mine many other kinds of data. Some
examples are:
1. Related concepts: Let items be words, and let baskets be documents
(e.g., Web pages, blogs, tweets). A basket/document contains those
items/words that are present in the document. If we look for sets of
words that appear together in many documents, the sets will be dominated
by the most common words (stop words), as we saw in Example 6.1.
There, even though the intent was to find snippets that talked about cats
and dogs, the stop words “and” and “a” were prominent among the frequent
itemsets. However, if we ignore all the most common words, then
we would hope to find among the frequent pairs some pairs of words
that represent a joint concept. For example, we would expect a pair like
{Brad, Angelina} to appear with surprising frequency.
2. Plagiarism: Let the items be documents and the baskets be sentences.
An item/document is “in” a basket/sentence if the sentence is in the
document. This arrangement appears backwards, but it is exactly what
we need, and we should remember that the relationship between items
and baskets is an arbitrary many-many relationship. That is, “in” need
not have its conventional meaning: “part of.” In this application, we
look for pairs of items that appear together in several baskets. If we find
such a pair, then we have two documents that share several sentences in
common. In practice, even one or two sentences in common is a good
indicator of plagiarism.
3. Biomarkers: Let the items be of two types: biomarkers, such as genes
or blood proteins, and diseases. Each basket is the set of data about
a patient: their genome and blood-chemistry analysis, as well as their
medical history of disease. A frequent itemset that consists of one disease
and one or more biomarkers suggests a test for the disease.
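
The plagiarism arrangement in item 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name shared_sentence_counts, the input format (a dict from document id to a list of sentences), and the sample documents are all hypothetical, and sentences are compared by exact string equality.

```python
from collections import defaultdict
from itertools import combinations


def shared_sentence_counts(docs):
    """Count, for each pair of documents, the sentences they share.

    docs is a hypothetical input format: a dict mapping a document id
    to an iterable of sentences.
    """
    # Invert the data: each sentence is a basket, and the items "in" it
    # are the documents that contain that sentence.
    baskets = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in set(sentences):
            baskets[sentence].add(doc_id)

    # The support of a pair of items (documents) is the number of
    # baskets (sentences) that contain both of them.
    pair_counts = defaultdict(int)
    for doc_ids in baskets.values():
        for pair in combinations(sorted(doc_ids), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)


# Made-up sample data, for illustration only.
docs = {
    "d1": ["the cat sat", "dogs bark loudly", "rain fell"],
    "d2": ["the cat sat", "dogs bark loudly", "sun rose"],
    "d3": ["rain fell", "wind blew"],
}
counts = shared_sentence_counts(docs)
```

Here the pair ("d1", "d2") is contained in two baskets, so with a support threshold of 2 it would be the only frequent pair.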
6.1.3 Association Rules
While the subject of this chapter is extracting frequent sets of items from data,
this information is often presented as a collection of if-then rules, called
association rules. The form of an association rule is I→j, where I is a set of items
and j is an item. The implication of this association rule is that if all of the
items in I appear in some basket, then j is “likely” to appear in that basket as
well.
We formalize the notion of “likely” by defining the confidence of the rule
I→j to be the ratio of the support for I∪{j} to the support for I. That is,
the confidence of the rule is the fraction of the baskets with all of I that also
contain j.
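
As a minimal sketch of this definition, the following Python computes support and confidence directly by scanning the baskets. The function names support and confidence, the choice to return None when I has zero support, and the sample baskets are assumptions made for illustration; they are not taken from the text.

```python
def support(baskets, itemset):
    """Number of baskets that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(baskets, I, j):
    """Confidence of the rule I -> j: support(I ∪ {j}) / support(I).

    Returns None when no basket contains all of I, since the ratio
    is undefined in that case (an assumption of this sketch).
    """
    support_I = support(baskets, I)
    if support_I == 0:
        return None
    return support(baskets, set(I) | {j}) / support_I


# Made-up baskets, for illustration only.
baskets = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "butter"}),
]
rule_confidence = confidence(baskets, {"milk", "bread"}, "butter")
```

In this sample, two baskets contain both milk and bread, and one of those also contains butter, so the confidence of {milk, bread}→butter is 1/2.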