Frequent Itemsets - Mining of Massive Datasets


However, applications of frequent-itemset analysis are not limited to market

baskets. The same model can be used to mine many other kinds of data. Some

examples are:

1. Related concepts: Let items be words, and let baskets be documents

(e.g., Web pages, blogs, tweets). A basket/document contains those

items/words that are present in the document. If we look for sets of

words that appear together in many documents, the sets will be dominated

by the most common words (stop words), as we saw in Example 6.1.

There, even though the intent was to find snippets that talked about cats

and dogs, the stop words “and” and “a” were prominent among the frequent

itemsets. However, if we ignore all the most common words, then

we would hope to find among the frequent pairs some pairs of words

that represent a joint concept. For example, we would expect a pair like

{Brad, Angelina} to appear with surprising frequency.

2. Plagiarism: Let the items be documents and the baskets be sentences.

An item/document is “in” a basket/sentence if the sentence is in the

document. This arrangement appears backwards, but it is exactly what

we need, and we should remember that the relationship between items

and baskets is an arbitrary many-many relationship. That is, “in” need

not have its conventional meaning: “part of.” In this application, we

look for pairs of items that appear together in several baskets. If we find

such a pair, then we have two documents that share several sentences in

common. In practice, even one or two sentences in common is a good

indicator of plagiarism.

3. Biomarkers: Let the items be of two types: biomarkers, such as genes

or blood proteins, and diseases. Each basket is the set of data about

a patient: their genome and blood-chemistry analysis, as well as their

medical history of disease. A frequent itemset that consists of one disease

and one or more biomarkers suggests a test for the disease.
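
The plagiarism arrangement in item 2 can be sketched in a few lines of Python. This is a minimal illustration, not a method given in the text: the function name shared_sentence_pairs, the input layout (a mapping from document ID to a list of sentences), and the lowercase/strip normalization used to decide that two sentences are "the same" are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_sentence_pairs(docs, min_support):
    """Return pairs of documents that share at least min_support sentences.

    docs is a hypothetical mapping: document ID -> list of sentence strings.
    """
    # Each distinct sentence is a basket; its items are the documents
    # that contain that sentence.
    baskets = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            # Assumed normalization: ignore case and surrounding whitespace.
            baskets[sentence.strip().lower()].add(doc_id)

    # Count, for every pair of documents, the number of baskets
    # (sentences) in which both appear.
    pair_counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1

    # Keep only the frequent pairs.
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}
```

Counting every pair directly, as here, is only practical when the baskets are small; it is meant to show the item/basket roles, not to scale to large collections.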

6.1.3 Association Rules

While the subject of this chapter is extracting frequent sets of items from data,

this information is often presented as a collection of if-then rules, called

association rules. The form of an association rule is I→j, where I is a set of items

and j is an item. The implication of this association rule is that if all of the

items in I appear in some basket, then j is “likely” to appear in that basket as

well.

We formalize the notion of “likely” by defining the confidence of the rule

I→j to be the ratio of the support for I∪{j} to the support for I. That is,

the confidence of the rule is the fraction of the baskets with all of I that also

contain j.
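
As a small illustration of this definition, the sketch below computes support and confidence over a list of baskets. The function names support and confidence, the representation of baskets as Python sets, and the choice to return None when no basket contains I (where the ratio is undefined) are assumptions made for the example.

```python
def support(baskets, itemset):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket))


def confidence(baskets, I, j):
    """Confidence of the rule I -> j: support(I ∪ {j}) / support(I).

    Returns None when no basket contains all of I (an assumed convention,
    since the ratio is undefined in that case).
    """
    supp_I = support(baskets, I)
    if supp_I == 0:
        return None
    return support(baskets, set(I) | {j}) / supp_I
```

For example, with the four baskets {milk, bread}, {milk, bread, beer}, {milk}, and {bread, beer}, two baskets contain both milk and bread and one of those also contains beer, so the rule {milk, bread}→beer has confidence 1/2.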
