Data Mining - Mining of Massive Datasets


to that of the user in question, and see what they liked (a technique known as collaborative filtering).

1.5 Summary of Chapter 1

✦ Data Mining: This term refers to the process of extracting useful models of data. Sometimes, a model can be a summary of the data, or it can be the set of most extreme features of the data.

✦ Bonferroni's Principle: If we are willing to view as an interesting feature of data something of which many can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
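
The arithmetic behind this principle can be sketched in a few lines. The function name and all the numbers below are hypothetical, chosen only for illustration: if each of many candidates shows the feature purely by chance with some small probability, the expected number of chance occurrences is the product of the two, and it can swamp the number of real occurrences we hope to find.

```python
def expected_chance_fraction(num_candidates, prob_by_chance, expected_real):
    """Estimate what fraction of detected features arise purely by chance.

    num_candidates: how many places the feature could show up (assumed)
    prob_by_chance: probability that one candidate shows it in random data (assumed)
    expected_real:  how many genuine occurrences we believe exist (assumed)
    """
    # Expected number of occurrences in purely random data.
    by_chance = num_candidates * prob_by_chance
    # Share of all detections that are expected to be spurious.
    return by_chance / (by_chance + expected_real)


# Hypothetical numbers: one million candidates, each with a 1-in-10,000
# chance of showing the feature at random, and 10 genuine occurrences.
fraction = expected_chance_fraction(1_000_000, 1e-4, 10)
print(f"{fraction:.0%} of detections are expected to be spurious")
```

With these assumed numbers, about 100 occurrences are expected by chance against 10 genuine ones, so the feature is not rare enough to be a useful signal.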

✦ TF.IDF: The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has a high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
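
As a minimal sketch of how such a score might be computed, the code below uses one common formulation, which is an assumption here since the summary gives no formula: term frequency is a word's count divided by the count of the most frequent word in the same document, and inverse document frequency is log2(N / n), where N is the number of documents and n the number that contain the word. The function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document.

    docs is a list of documents, each given as a list of word tokens.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # doc_freq[w] = number of documents in which w appears at least once.
    doc_freq = Counter(word for c in counts for word in c)

    scores = []
    for c in counts:
        # Normalize by the most frequent word in this document.
        max_count = max(c.values()) if c else 1
        scores.append({
            word: (freq / max_count) * math.log2(n_docs / doc_freq[word])
            for word, freq in c.items()
        })
    return scores


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the pitcher threw the ball to the catcher".split(),
]
scores = tf_idf(docs)
# Print the highest-scoring words of the third document.
print(sorted(scores[2].items(), key=lambda kv: -kv[1])[:3])
```

Under this formulation a word such as "the", which occurs in every document, scores zero, while words confined to a single document score highest there.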

✦ Hash Functions: A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
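
The mapping from hash-keys to bucket numbers can be illustrated with a short sketch. It assumes string keys and uses the CRC-32 checksum from Python's standard `zlib` module as the underlying hash, which is a choice made here for determinism and not something the text prescribes; `bucket_of` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def bucket_of(key, num_buckets):
    """Map a string hash-key to a bucket number in range(num_buckets)."""
    # CRC-32 turns the key's bytes into a 32-bit unsigned integer;
    # taking the remainder reduces it to a valid bucket number.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


num_buckets = 8
keys = [f"key{i}" for i in range(10_000)]
# Count how many keys land in each bucket to inspect the spread.
counts = Counter(bucket_of(k, num_buckets) for k in keys)
for bucket in range(num_buckets):
    print(bucket, counts[bucket])
```

Printing the per-bucket counts gives a quick check of how evenly a given function spreads a particular set of keys. Keys of other data types can be handled the same way by first converting them to bytes.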

✦ Indexes: An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
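
A hash-based index on a single field might look like the following sketch. It relies on Python's built-in dictionary, which is itself a hash table; the record layout, field names, and the helpers `build_index` and `lookup` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each value of `field` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


def lookup(index, records, value):
    """Retrieve all records whose indexed field equals `value`."""
    # .get avoids creating an empty entry for values that are absent.
    return [records[i] for i in index.get(value, [])]


records = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": "Tokyo"},
    {"name": "Carol", "city": "Paris"},
]
city_index = build_index(records, "city")
print([r["name"] for r in lookup(city_index, records, "Paris")])
```

Once the index is built, a lookup touches only the matching records instead of scanning the whole collection.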

✦ Storage on Disk: When data must be stored on disk (secondary memory), it takes much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.

✦ Power Laws: Many phenomena obey a law that can be expressed as $y = cx^a$ for some power $a$, often around $-2$. Such phenomena include the sales of the xth most popular topic, or the number of in-links to the xth