well. As another example, if a book is selling well on Amazon, then it is likely to be advertised when customers go to
the Amazon site. Some of these people will choose to buy the book as well, thus increasing the sales of this book.
1.3.7 Exercises for Section 1.3
EXERCISE 1.3.1 Suppose there is a repository of ten million documents. What (to the
nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000
documents?
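One way to check an answer to this exercise is a short script. The sketch below assumes the usual definition IDF = log2(N/n), where N is the number of documents in the repository and n is the number of documents containing the word; the function name `idf` is illustrative, not taken from the text.

```python
import math


def idf(total_docs: int, docs_with_word: int) -> float:
    """Inverse document frequency, assuming IDF = log2(N / n)."""
    if not 0 < docs_with_word <= total_docs:
        raise ValueError("docs_with_word must be between 1 and total_docs")
    return math.log2(total_docs / docs_with_word)


# Repository size from the exercise; the loop prints the rounded IDF
# for each document count so it can be compared with a hand calculation.
N = 10_000_000
for n in (40, 10_000):
    print(f"n = {n}: IDF rounds to {round(idf(N, n))}")
```

Rounding is left to the caller so the unrounded value can also be inspected.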
EXERCISE 1.3.2 Suppose there is a repository of ten million documents, and word w
appears in 320 of them. In a particular document d, the maximum number of occurrences of
a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once
(b) five times?
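A hedged sketch for checking this exercise numerically follows. It assumes TF is the word's count divided by the largest count of any word in the same document, that IDF = log2(N/n), and that the TF.IDF score is their product; the function and parameter names are illustrative.

```python
import math


def tf_idf(count: int, max_count: int, total_docs: int, docs_with_word: int) -> float:
    """TF.IDF score, assuming TF = count / max_count and IDF = log2(N / n)."""
    tf = count / max_count
    idf = math.log2(total_docs / docs_with_word)
    return tf * idf


# Values from the exercise: ten million documents, w appears in 320 of them,
# and the most frequent word in document d occurs 15 times.
for occurrences in (1, 5):
    score = tf_idf(occurrences, 15, 10_000_000, 320)
    print(f"{occurrences} occurrence(s): TF.IDF = {score:.3f}")
```

Since the exercise asks for an approximation, it may help to compare the printed values with what results from rounding N/n to the nearest power of 2 before taking the logarithm.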
! EXERCISE 1.3.3 Suppose hash-keys are drawn from the population of all non-negative
integers that are multiples of some constant c, and hash function h(x) is x mod 15. For what
values of c will h be a suitable hash function, i.e., a large random choice of hash-keys will
be divided roughly equally into buckets?
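The behavior of h can be explored empirically before reasoning about it. The sketch below draws random multiples of c, hashes them with x mod 15, and reports which buckets receive any keys. The sample size, the key range, the seed, and the name `bucket_counts` are all assumptions made for illustration.

```python
import random
from collections import Counter


def bucket_counts(c: int, num_keys: int = 30_000, num_buckets: int = 15,
                  seed: int = 0) -> Counter:
    """Hash random multiples of c with h(x) = x mod num_buckets and count hits."""
    rng = random.Random(seed)
    keys = (c * rng.randrange(10**9) for _ in range(num_keys))
    return Counter(key % num_buckets for key in keys)


# Try a few values of c and list the buckets that are actually used.
for c in (1, 3, 5, 7, 15):
    counts = bucket_counts(c)
    print(f"c = {c:2d}: {len(counts):2d} buckets used -> {sorted(counts)}")
```

Comparing the number of buckets used with the factors that c shares with 15 suggests the general pattern the exercise asks for.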
EXERCISE 1.3.4 In terms of e, give approximations to
(a) $(1.01)^{500}$ (b) $(1.05)^{1000}$ (c) $(0.9)^{40}$
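These approximations rest on the identity that (1 + a)^b is close to e^(ab) when a is small. A minimal sketch for comparing the approximation against the exact power is shown below; the helper name `e_exponent` and the choice of test cases are illustrative assumptions.

```python
import math


def e_exponent(a: float, b: float) -> float:
    """Exponent k such that (1 + a)**b is approximately e**k for small |a|."""
    return a * b


# Each case is written as (a, b) so that the quantity of interest is (1 + a)**b.
cases = [(0.01, 500), (0.05, 1000), (-0.1, 40)]
for a, b in cases:
    k = e_exponent(a, b)
    exact = (1 + a) ** b
    approx = math.exp(k)
    print(f"(1 + {a})^{b}: exact = {exact:.4g}, e^{k:g} = {approx:.4g}")
```

The gap between the two printed values grows as |a| gets larger, which shows how rough the approximation becomes when a is not small.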
EXERCISE 1.3.5 Use the Taylor expansion of $e^x$ to compute, to three decimal places: (a)
$e^{1/10}$ (b) $e^{-1/10}$ (c) $e^2$.
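The Taylor expansion in question is e^x = 1 + x + x^2/2! + x^3/3! + ..., and a partial sum can be evaluated in a few lines. The sketch below is one possible implementation; the function name `exp_taylor`, the default number of terms, and the sample inputs are assumptions chosen for illustration.

```python
import math


def exp_taylor(x: float, terms: int = 20) -> float:
    """Partial sum of the Taylor series for e**x using the first `terms` terms."""
    total = 0.0
    term = 1.0  # the n = 0 term, x**0 / 0!
    for n in range(terms):
        total += term
        term *= x / (n + 1)  # turn x**n / n! into x**(n+1) / (n+1)!
    return total


# Sample inputs; compare each partial sum with math.exp as a sanity check.
for x in (0.1, -0.1, 2.0):
    print(f"x = {x}: series = {exp_taylor(x):.3f}, math.exp = {math.exp(x):.3f}")
```

Calling `exp_taylor` with a small value of `terms` shows how many terms are needed before the third decimal place stops changing.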
1.4 Outline of the Book
This section gives brief summaries of the remaining chapters of the book.
Chapter 2 is not about data mining per se. Rather, it introduces us to the MapReduce
methodology for exploiting parallelism in computing clouds (racks of interconnected
processors). There is reason to believe that cloud computing, and MapReduce in particular,
will become the normal way to compute when analysis of very large amounts of data is
involved. A pervasive issue in later chapters will be the exploitation of the MapReduce
methodology to implement the algorithms we cover.
Chapter 3 is about finding similar items. Our starting point is that items can be
represented by sets of elements, and similar sets are those that have a large fraction of their ele-