4. Zipf's Law: This power law originally referred to the frequency of words in a collection of documents. If you order words by frequency, and let y be the number of times the xth word in the order appears, then you get a power law, although with a much shallower slope than that of Fig. 1.3. Zipf's observation was that y = cx^(-1/2). Interestingly, a number of other kinds of data follow this particular power law. For example, if we order states in the US by population and let y be the population of the xth most populous state, then x and y obey Zipf's law approximately.
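For a concrete feel for this law, here is a minimal Python sketch (the counts are made up purely for illustration) that compares ranked counts against the Zipf prediction y = cx^(-1/2), estimating c from the rank-1 count:

# Sketch: compare ranked counts with Zipf's law y = c * x**(-1/2).
# The counts below are hypothetical, chosen only for illustration.
counts = [1000, 720, 580, 510, 450]   # y values for ranks x = 1, 2, ...

c = counts[0]                          # at x = 1, y = c * 1**(-1/2) = c
for x, y in enumerate(counts, start=1):
    predicted = c * x ** -0.5
    print(f"rank {x}: observed {y}, Zipf predicts {predicted:.0f}")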
1.3.7 Exercises for Section 1.3
Exercise 1.3.1: Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?
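As a way to check an answer, here is a small Python sketch, assuming the base-2 definition IDF = log2(N/n), where N is the repository size and n is the number of documents containing the word:

import math

# Sketch of the IDF computation, assuming IDF = log2(N / n).
N = 10_000_000                      # repository size

for n in (40, 10_000):              # documents containing the word
    idf = math.log2(N / n)
    print(f"appears in {n} documents: IDF = log2({N}/{n}) = {idf:.2f}")
# (a) log2(250,000) is about 17.9, i.e., 18   (b) log2(1,000) is about 10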
Exercise 1.3.2: Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once (b) five times?
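Again as a check, a sketch assuming the usual definitions: TF = f / max_f, where max_f is the count of the document's most frequent word, and the TF.IDF score is the product TF x IDF:

import math

# Sketch of the TF.IDF computation: TF = f / max_f, score = TF * IDF,
# assuming the base-2 IDF definition as in the previous exercise.
N, n, max_f = 10_000_000, 320, 15
idf = math.log2(N / n)              # log2(31,250), about 14.93

for f in (1, 5):                    # occurrences of w in document d
    tf = f / max_f
    print(f"{f} occurrence(s): TF.IDF = {tf * idf:.2f}")
# (a) about 1   (b) about 5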
! Exercise 1.3.3: Suppose hash-keys are drawn from the population of all nonnegative integers that are multiples of some constant c, and hash function h(x) is x mod 15. For what values of c will h be a suitable hash function, i.e., a large random choice of hash-keys will be divided roughly equally into buckets?
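One way to explore the question empirically, as a sketch rather than a solution: hash a random sample of multiples of each candidate c with h and count how many of the 15 buckets are ever used:

import random
from collections import Counter

# Sketch: for each candidate c, hash a random sample of multiples of c
# with h(x) = x mod 15 and count how many distinct buckets appear.
def bucket_counts(c, n_keys=15_000):
    keys = (c * random.randrange(1_000_000) for _ in range(n_keys))
    return Counter(k % 15 for k in keys)

for c in (1, 2, 3, 5, 7):
    used = bucket_counts(c)
    print(f"c = {c}: {len(used)} of 15 buckets used")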
Exercise 1.3.4: In terms of e, give approximations to (a) (1.01)^500 (b) (1.05)^1000 (c) (0.9)^40.
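These follow from the standard small-a approximation (1 + a)^b ≈ e^(ab). A brief Python sketch comparing the approximation against the exact values:

import math

# Sketch: compare (1 + a)**b with the approximation e**(a*b) for small a.
cases = [(0.01, 500), (0.05, 1000), (-0.1, 40)]

for a, b in cases:
    exact = (1 + a) ** b
    approx = math.exp(a * b)
    print(f"(1{a:+g})^{b}: exact {exact:.4g}, e^{a*b:g} gives {approx:.4g}")
# Answers in terms of e: (a) e^5   (b) e^50   (c) e^(-4)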
Exercise 1.3.5: Use the Taylor expansion of e^x to compute, to three decimal places: (a) e^(1/10) (b) e^(-1/10) (c) e^2.
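A sketch of the computation, summing the series e^x = 1 + x + x^2/2! + x^3/3! + ... until the remaining terms are too small to affect the third decimal place:

import math

# Sketch: sum the Taylor series for e^x term by term until the terms
# become negligible, then compare with math.exp.
def taylor_exp(x, tol=1e-6):
    total, term, n = 0.0, 1.0, 0    # term holds x**n / n!
    while abs(term) > tol:
        total += term
        n += 1
        term *= x / n
    return total

for x in (0.1, -0.1, 2):
    print(f"e^{x:g} = {taylor_exp(x):.3f} (math.exp gives {math.exp(x):.3f})")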
1.4 Outline of the Book
This section gives brief summaries of the remaining chapters of the book.
Chapter 2 is not about data mining per se. Rather, it introduces us to the
map-reduce methodology for exploiting parallelism in computing clouds (racks
of interconnected processors). There is reason to believe that cloud computing,
and map-reduce in particular, will become the normal way to compute when
analysis of very large amounts of data is involved. A pervasive issue in later
chapters will be the exploitation of the map-reduce methodology to implement
the algorithms we cover.
Chapter 3 is about finding similar items. Our starting point is that items
can be represented by sets of elements, and similar sets are those that have a
large fraction of their elements in common. The key techniques of minhashing
and locality-sensitive hashing are explained. These techniques have numerous applications.