to that of the user in question, and see what they liked (a technique known as
collaborative filtering).
1.5 Summary of Chapter 1
✦ Data Mining: This term refers to the process of extracting useful models
of data. Sometimes, a model can be a summary of the data, or it can be
the set of most extreme features of the data.
✦ Bonferroni's Principle: If we are willing to view as an interesting feature
of data something of which many can be expected to exist in random data,
then we cannot rely on such features being significant. This observation
limits our ability to mine data for features that are not sufficiently rare
in practice.
✦ TF.IDF: The measure called TF.IDF lets us identify words in a collection
of documents that are useful for determining the topic of each document.
A word has a high TF.IDF score in a document if it appears in relatively few
documents, but appears in this one, and when it appears in a document
it tends to appear many times.
✦ Hash Functions: A hash function maps hash-keys of some data type to
integer bucket numbers. A good hash function distributes the possible
hash-key values approximately evenly among buckets. Any data type can
be the domain of a hash function.
✦ Indexes: An index is a data structure that allows us to store and retrieve
data records efficiently, given the value in one or more of the fields of the
record. Hashing is one way to build an index.
✦ Storage on Disk: When data must be stored on disk (secondary memory),
it takes much more time to access a desired data item than if the same
data were stored in main memory. When data is large, it is important
that algorithms strive to keep needed data in main memory.
✦ Power Laws: Many phenomena obey a law that can be expressed as
$y = cx^a$ for some power $a$, often around $-2$. Such phenomena include the
sales of the $x$th most popular book, or the number of in-links to the $x$th
most popular page.
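Bonferroni's principle can be made concrete by computing how many occurrences of a "suspicious" pattern to expect if the data were purely random. The sketch below uses a made-up scenario (10,000 people each flipping a fair coin 10 times, with "all heads" as the pattern). The scenario and the function names are assumptions for illustration and do not come from the text.

```python
import random

def expected_false_hits(n_trials, p_event):
    """Expected number of times the event occurs in purely random data."""
    return n_trials * p_event

def simulate_all_heads(n_people, n_flips, seed=0):
    """Count people whose fair coin flips all come up heads."""
    rng = random.Random(seed)
    return sum(
        all(rng.random() < 0.5 for _ in range(n_flips))
        for _ in range(n_people)
    )

# Hypothetical scenario: 10,000 people, 10 fair coin flips each.
n_people, n_flips = 10_000, 10
expected = expected_false_hits(n_people, 0.5 ** n_flips)  # 10000 / 1024
observed = simulate_all_heads(n_people, n_flips)
```

Roughly ten people are expected to flip all heads by chance alone, so finding a handful of such people in real data would not, by itself, be evidence of anything unusual.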
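As a rough illustration of the TF.IDF idea summarized above, the sketch below scores words in a tiny made-up corpus. It assumes one common formulation: term frequency normalized by the count of the most frequent word in the document, multiplied by the base-2 logarithm of (number of documents / number of documents containing the word). The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    Assumed formulation (one of several in common use):
      TF  = count of word in doc / count of the most frequent word in doc
      IDF = log2(number of docs / number of docs containing the word)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        max_count = max(counts.values()) if counts else 1
        scores.append({
            word: (count / max_count) * math.log2(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Made-up sample documents.
docs = [
    "the pitcher threw the ball and the batter hit the ball",
    "the senate passed the bill",
    "the court heard the case",
]
scores = tf_idf(docs)
```

In this sample, "the" scores 0 in every document because it occurs in all of them, while "ball" scores highest in the first document because it is rare across the collection and repeated within that document.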
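The hash-function and index items above can be tied together in a small sketch: a hash function that maps string hash-keys to bucket numbers, and an array of buckets used as an index on one field of a record. The names `bucket_of` and `HashIndex`, the choice of summing character codes modulo the number of buckets, and the sample records are assumptions for illustration only.

```python
def bucket_of(key, n_buckets):
    """Map a string hash-key to a bucket number in range(n_buckets).

    Illustrative choice: sum the character codes, then take the remainder.
    """
    return sum(ord(ch) for ch in key) % n_buckets


class HashIndex:
    """Index records (dicts) on one field using an array of buckets."""

    def __init__(self, field, n_buckets=11):
        self.field = field
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, record):
        b = bucket_of(record[self.field], len(self.buckets))
        self.buckets[b].append(record)

    def lookup(self, value):
        # Only one bucket is scanned, not the whole collection.
        b = bucket_of(value, len(self.buckets))
        return [r for r in self.buckets[b] if r[self.field] == value]


# Made-up sample records.
index = HashIndex("name")
for rec in [
    {"name": "Sally", "phone": "555-0101"},
    {"name": "Bob", "phone": "555-0102"},
    {"name": "Sally", "phone": "555-0103"},
]:
    index.insert(rec)

matches = index.lookup("Sally")
```

A sum of character codes is easy to follow but sends anagrams to the same bucket, so it does not spread keys as evenly as a good hash function should; it is used here only for readability.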
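Because $y = cx^a$ implies $\log y = \log c + a \log x$, the exponent of a power law can be estimated as the slope of a straight-line fit on log-log axes. The sketch below does this with an ordinary least-squares fit. The function name `fit_power_law` and the synthetic data (generated with c = 1000 and a = -2) are assumptions for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares on (log x, log y); return (c, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    a = sxy / sxx                          # slope = exponent
    c = math.exp(mean_y - a * mean_x)      # intercept = log c
    return c, a

# Synthetic data that follows y = 1000 * x**-2 exactly.
xs = list(range(1, 21))
ys = [1000 * x ** -2 for x in xs]
c, a = fit_power_law(xs, ys)
```

On real data the points will not lie exactly on a line, so the fitted exponent is only an estimate.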
1.6 References for Chapter 1
[7] is a clear introduction to the basics of data mining. [2] covers data mining
principally from the point of view of machine learning and statistics.
For construction of hash functions and hash tables, see [4]. Details of the
TF.IDF measure and other matters regarding document processing can be