Data Mining - Mining of Massive Datasets

Database Reference

In-Depth Information

ments in common. The key techniques of minhashing and locality-sensitive hashing are ex-

plained. These techniques have numerous applications and often give surprisingly efficient

solutions to problems that appear impossible for massive data sets.

In Chapter 4 , we consider data in the form of a stream. The difference between a stream

and a database is that the data in a stream is lost if you do not do something about it imme-

diately. Important examples of streams are the streams of search queries at a search engine

or clicks at a popular Web site. In this chapter, we see several of the surprising applications

of hashing that make management of stream data feasible.

Chapter 5 is devoted to a single application: the computation of PageRank. This com-

putation is the idea that made Google stand out from other search engines, and it is still

an essential part of how search engines know what pages the user is likely to want to see.

Extensions of PageRank are also essential in the fight against spam (euphemistically called

“search engine optimization”), and we shall examine the latest extensions of the idea for

the purpose of combating spam.

Then, Chapter 6 introduces the market-basket model of data, and its canonical problems

of association rules and finding frequent itemsets. In the market-basket model, data consists

of a large collection of baskets, each of which contains a small set of items. We give a se-

quence of algorithms capable of finding all frequent pairs of items, that is pairs of items

that appear together in many baskets. Another sequence of algorithms are useful for finding

most of the frequent itemsets larger than pairs, with high efficiency.

Chapter 7 examines the problem of clustering. We assume a set of items with a distance

measure defining how close or far one item is from another. The goal is to examine a large

amount of data and partition it into subsets (clusters), each cluster consisting of items that

are all close to one another, yet far from items in the other clusters.

Chapter 8 is devoted to on-line advertising and the computational problems it engenders.

We introduce the notion of an on-line algorithm - one where a good response must be given

immediately, rather than waiting until we have seen the entire dataset. The idea of compet-

itive ratio is another important concept covered in this chapter; it is the ratio of the guar-

anteed performance of an online algorithm compared with the performance of the optimal

algorithm that is allowed to see all the data before making any decisions. These ideas are

used to give good algorithms that match bids by advertisers for the right to display their ad

in response to a query against the search queries arriving at a search engine.

Finally, Chapter 9 is devoted to recommendation systems. Many Web applications in-

volve advising users on what they might like. The Netflix challenge is one example, where

it is desired to predict what movies a user would like, or Amazon's problem of pitching a

product to a customer based on information about what they might be interested in buying.

There are two basic approaches to recommendation. We can characterize items by features,

e.g., the stars of a movie, and recommend items with the same features as those the user is

Search WWH ::

Custom Search

Home