4. Zipf's Law: This power law originally referred to the frequency of words in a collection of documents. If you order words by frequency, and let y be the number of times the xth word in the order appears, then you get a power law, although with a much shallower slope than that of Fig. 1.3. Zipf's observation was that y = cx^(-1/2). Interestingly, a number of other kinds of data follow this particular power law. For example, if we order states in the US by population and let y be the population of the xth most populous state, then x and y obey Zipf's law approximately.
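For a concrete feel for this law, here is a minimal Python sketch (the counts are made up purely for illustration) that compares ranked counts against the Zipf prediction y = cx^(-1/2), estimating c from the rank-1 count:

# Sketch: compare ranked counts with Zipf's law y = c * x**(-1/2).
# The counts below are hypothetical, chosen only for illustration.
counts = [1000, 720, 580, 510, 450]   # y values for ranks x = 1, 2, ...

c = counts[0]                          # at x = 1, y = c * 1**(-1/2) = c
for x, y in enumerate(counts, start=1):
    predicted = c * x ** -0.5
    print(f"rank {x}: observed {y}, Zipf predicts {predicted:.0f}")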
1.3.7 Exercises for Section 1.3
Exercise 1.3.1: Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?
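As a way to check an answer, here is a small Python sketch, assuming the base-2 definition IDF = log2(N/n), where N is the repository size and n is the number of documents containing the word:

import math

# Sketch of the IDF computation, assuming IDF = log2(N / n).
N = 10_000_000                      # repository size

for n in (40, 10_000):              # documents containing the word
    idf = math.log2(N / n)
    print(f"appears in {n} documents: IDF = log2({N}/{n}) = {idf:.2f}")
# (a) log2(250,000) is about 17.9, i.e., 18   (b) log2(1,000) is about 10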
Exercise 1.3.2: Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once (b) five times?
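Again as a check, a sketch assuming the usual definitions: TF = f / max_f, where max_f is the count of the document's most frequent word, and the TF.IDF score is the product TF x IDF:

import math

# Sketch of the TF.IDF computation: TF = f / max_f, score = TF * IDF,
# assuming the base-2 IDF definition as in the previous exercise.
N, n, max_f = 10_000_000, 320, 15
idf = math.log2(N / n)              # log2(31,250), about 14.93

for f in (1, 5):                    # occurrences of w in document d
    tf = f / max_f
    print(f"{f} occurrence(s): TF.IDF = {tf * idf:.2f}")
# (a) about 1   (b) about 5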
! Exercise 1.3.3: Suppose hash-keys are drawn from the population of all nonnegative integers that are multiples of some constant c, and hash function h(x) is x mod 15. For what values of c will h be a suitable hash function, i.e., a large random choice of hash-keys will be divided roughly equally into buckets?
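One way to explore the question empirically, as a sketch rather than a solution: hash a random sample of multiples of each candidate c with h and count how many of the 15 buckets are ever used:

import random
from collections import Counter

# Sketch: for each candidate c, hash a random sample of multiples of c
# with h(x) = x mod 15 and count how many distinct buckets appear.
def bucket_counts(c, n_keys=15_000):
    keys = (c * random.randrange(1_000_000) for _ in range(n_keys))
    return Counter(k % 15 for k in keys)

for c in (1, 2, 3, 5, 7):
    used = bucket_counts(c)
    print(f"c = {c}: {len(used)} of 15 buckets used")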
Exercise 1.3.4: In terms of e, give approximations to (a) (1.01)^500 (b) (1.05)^1000 (c) (0.9)^40.
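These follow from the standard small-a approximation (1 + a)^b ≈ e^(ab). A brief Python sketch comparing the approximation against the exact values:

import math

# Sketch: compare (1 + a)**b with the approximation e**(a*b) for small a.
cases = [(0.01, 500), (0.05, 1000), (-0.1, 40)]

for a, b in cases:
    exact = (1 + a) ** b
    approx = math.exp(a * b)
    print(f"(1{a:+g})^{b}: exact {exact:.4g}, e^{a*b:g} gives {approx:.4g}")
# Answers in terms of e: (a) e^5   (b) e^50   (c) e^(-4)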
Exercise 1.3.5: Use the Taylor expansion of e^x to compute, to three decimal places: (a) e^(1/10) (b) e^(-1/10) (c) e^2.
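A sketch of the computation, summing the series e^x = 1 + x + x^2/2! + x^3/3! + ... until the remaining terms are too small to affect the third decimal place:

import math

# Sketch: sum the Taylor series for e^x term by term until the terms
# become negligible, then compare with math.exp.
def taylor_exp(x, tol=1e-6):
    total, term, n = 0.0, 1.0, 0    # term holds x**n / n!
    while abs(term) > tol:
        total += term
        n += 1
        term *= x / n
    return total

for x in (0.1, -0.1, 2):
    print(f"e^{x:g} = {taylor_exp(x):.3f} (math.exp gives {math.exp(x):.3f})")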
1.4 Outline of the Book
This section gives brief summaries of the remaining chapters of the book.
Chapter 2 is not about data mining per se. Rather, it introduces us to the
map-reduce methodology for exploiting parallelism in computing clouds (racks
of interconnected processors). There is reason to believe that cloud computing,
and map-reduce in particular, will become the normal way to compute when
analysis of very large amounts of data is involved. A pervasive issue in later
chapters will be the exploitation of the map-reduce methodology to implement
the algorithms we cover.
Chapter 3 is about finding similar items. Our starting point is that items
can be represented by sets of elements, and similar sets are those that have a
large fraction of their elements in common. The key techniques of minhashing
and locality-sensitive hashing are explained. These techniques have numerous applications.