Database Reference
In-Depth Information
ments in common. The key techniques of minhashing and locality-sensitive hashing are ex-
plained. These techniques have numerous applications and often give surprisingly efficient
solutions to problems that appear impossible for massive data sets.
In Chapter 4 , we consider data in the form of a stream. The difference between a stream
and a database is that the data in a stream is lost if you do not do something about it imme-
diately. Important examples of streams are the streams of search queries at a search engine
or clicks at a popular Web site. In this chapter, we see several of the surprising applications
of hashing that make management of stream data feasible.
Chapter 5 is devoted to a single application: the computation of PageRank. This com-
putation is the idea that made Google stand out from other search engines, and it is still
an essential part of how search engines know what pages the user is likely to want to see.
Extensions of PageRank are also essential in the fight against spam (euphemistically called
“search engine optimization”), and we shall examine the latest extensions of the idea for
the purpose of combating spam.
Then, Chapter 6 introduces the market-basket model of data, and its canonical problems
of association rules and finding frequent itemsets. In the market-basket model, data consists
of a large collection of baskets, each of which contains a small set of items. We give a se-
quence of algorithms capable of finding all frequent pairs of items, that is pairs of items
that appear together in many baskets. Another sequence of algorithms are useful for finding
most of the frequent itemsets larger than pairs, with high efficiency.
Chapter 7 examines the problem of clustering. We assume a set of items with a distance
measure defining how close or far one item is from another. The goal is to examine a large
amount of data and partition it into subsets (clusters), each cluster consisting of items that
are all close to one another, yet far from items in the other clusters.
Chapter 8 is devoted to on-line advertising and the computational problems it engenders.
We introduce the notion of an on-line algorithm - one where a good response must be given
immediately, rather than waiting until we have seen the entire dataset. The idea of compet-
itive ratio is another important concept covered in this chapter; it is the ratio of the guar-
anteed performance of an online algorithm compared with the performance of the optimal
algorithm that is allowed to see all the data before making any decisions. These ideas are
used to give good algorithms that match bids by advertisers for the right to display their ad
in response to a query against the search queries arriving at a search engine.
Finally, Chapter 9 is devoted to recommendation systems. Many Web applications in-
volve advising users on what they might like. The Netflix challenge is one example, where
it is desired to predict what movies a user would like, or Amazon's problem of pitching a
product to a customer based on information about what they might be interested in buying.
There are two basic approaches to recommendation. We can characterize items by features,
e.g., the stars of a movie, and recommend items with the same features as those the user is
Search WWH ::




Custom Search