Data Mining - Mining of Massive Datasets


1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

(1) Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

(2) Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating each customer of an online store such as Amazon as the set of items that customer has bought. To recommend something else a customer might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded (that is, they bought only one kind of thing), then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
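Both kinds of feature can be illustrated on toy data. The following is a minimal brute-force sketch, not the method developed in the later chapters: it counts itemsets of a fixed size that appear in at least a threshold number of baskets, and it ranks customers by the fraction of purchased items they share (the Jaccard similarity of their purchase sets, used here as one reasonable reading of "a relatively large fraction of their elements in common"). The function names, the `min_support` threshold, and all data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, size, min_support):
    """Return itemsets of the given size found in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the de-duplicated items so each itemset has one canonical tuple form.
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Fraction of elements shared by two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(purchases, customer, k):
    """Return the k other customers whose purchase sets are most similar."""
    target = purchases[customer]
    scored = [
        (jaccard(target, items), other)
        for other, items in purchases.items()
        if other != customer
    ]
    # Highest similarity first; ties broken by customer name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:k]]


# Hypothetical toy data, invented for illustration only.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "fries"},
    {"hamburger", "buns"},
]
purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "cat": {"lamp", "rug"},
}

frequent_pairs = frequent_itemsets(baskets, size=2, min_support=2)
neighbors = most_similar(purchases, "ann", k=1)
```

Enumerating every itemset in every basket and comparing every pair of customers is only practical for small inputs; the sketch is meant to make the two definitions concrete rather than to scale to massive datasets.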

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni's Principle,” a warning against overzealous use of data mining.
