1.1.5 Feature Extraction
The typical feature-based model looks for the most extreme examples of a phenomenon
and represents the data by these examples. If you are familiar with Bayes nets, a branch of
machine learning and a topic we do not cover in this book, you know how a complex
relationship between objects is represented by finding the strongest statistical dependencies
among these objects and using only those in representing all statistical connections. Some
of the important kinds of feature extraction from large-scale data that we shall study are:
(1) Frequent Itemsets. This model makes sense for data that consists of “baskets” of small
sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We
look for small sets of items that appear together in many baskets, and these “frequent
itemsets” are the characterization of the data that we seek. The original application of
this sort of mining was true market baskets: the sets of items, such as hamburger and
ketchup, that people tend to buy together when checking out at the cash register of a
store or supermarket.
(2) Similar Items. Often, your data looks like a collection of sets, and the objective is to
find pairs of sets that have a relatively large fraction of their elements in common.
An example is treating customers at an on-line store like Amazon as the set of items
they have bought. In order for Amazon to recommend something else they might like,
Amazon can look for “similar” customers and recommend something many of these
customers have bought. This process is called “collaborative filtering.” If customers
were single-minded - that is, they bought only one kind of thing - then clustering
customers might work. However, since customers tend to have interests in many different
things, it is more useful to find, for each customer, a small number of other customers
who are similar in their tastes, and represent the data by these connections. We discuss
similarity in Chapter 3.
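The two kinds of feature extraction above can be illustrated with a small sketch. It is not the method the later chapters develop: it counts item pairs by brute force, which would not scale to large data, and it assumes Jaccard similarity (size of the intersection divided by size of the union) as one reasonable reading of "a relatively large fraction of their elements in common." The function names, the support threshold, and the toy baskets and customers are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each pair has one canonical key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Assumed similarity measure: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(customers, name, k=1):
    """Return the k other customers whose purchase sets are closest to name's."""
    target = customers[name]
    scored = [
        (jaccard(target, items), other)
        for other, items in customers.items()
        if other != name
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [other for _, other in scored[:k]]


# Hypothetical toy data, for illustration only.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "mustard"},
    {"hamburger", "buns"},
]
customers = {
    "alice": {"novel", "tent", "stove"},
    "bob": {"tent", "stove", "lantern"},
    "carol": {"novel", "cookbook"},
}

print(frequent_pairs(baskets, min_support=2))
print(most_similar(customers, "alice"))
```

With this toy data, the hamburger and ketchup pair reaches the support threshold, and "bob" is the customer most similar to "alice" because they share two of four distinct items. Representing each customer by a few nearest neighbors, rather than by one cluster, matches the point made above about customers with many different interests.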
1.2 Statistical Limits on Data Mining
A common sort of data-mining problem involves discovering unusual events hidden within
massive amounts of data. This section is a discussion of the problem, including
“Bonferroni's Principle,” a warning against overzealous use of data mining.