known to like. Or, we can look at other users with preferences similar to those of the user in question, and see what they liked (a technique known as collaborative filtering).
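For illustration, here is a minimal sketch of user-based collaborative filtering in Python; the user names, movie titles, and ratings are invented for the example, and cosine similarity is just one common choice of similarity measure.

# A minimal sketch of user-based collaborative filtering.
# All ratings data and names here are hypothetical.
from math import sqrt

ratings = {
    "alice": {"Movie A": 5, "Movie B": 3, "Movie C": 4},
    "bob":   {"Movie A": 5, "Movie B": 3, "Movie D": 5},
    "carol": {"Movie B": 1, "Movie C": 5, "Movie D": 2},
}

def cosine_sim(u, v):
    """Cosine similarity, with the dot product taken over commonly rated items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items the most similar other user liked but `user` has not rated."""
    others = [(cosine_sim(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return [item for item, r in ratings[nearest].items()
            if item not in ratings[user] and r >= 4]

print(recommend("alice"))   # ['Movie D']: bob is most similar and liked Movie D

Real recommender systems refine this idea considerably, but the core step, finding similar users and borrowing their preferences, is the same.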
In Chapter 10, we study social networks and algorithms for their analysis. The canonical example of a social network is the graph of Facebook friends, where the nodes are people, and edges connect two people if they are friends. Directed graphs, such as followers on Twitter, can also be viewed as social networks. A common example of a problem to be addressed is identifying “communities”; that is, small sets of nodes with an unusually large number of edges among them. Other questions about social networks are general questions about graphs, such as computing the transitive closure or diameter of a graph, but are made more difficult by the size of typical networks.
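To make the diameter question concrete, here is a small illustrative sketch (not from the text) that computes the diameter of a connected, undirected graph by running breadth-first search from every node; the toy edge list is invented. On a network with billions of nodes this exact approach is hopelessly slow, which is precisely the difficulty the chapter addresses.

# Illustrative only: diameter of a small, connected, undirected graph
# via repeated breadth-first search. The edge list is hypothetical.
from collections import deque

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def eccentricity(src):
    """Distance from src to the farthest reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

# The diameter is the largest eccentricity over all nodes.
print(max(eccentricity(n) for n in adj))   # 3 for this toy graph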
Chapter 11 looks at dimensionality reduction. We are given a very large matrix, typically sparse. Think of the matrix as representing a relationship between two kinds of entities, e.g., ratings of movies by viewers. Intuitively, there are a small number of concepts, many fewer concepts than there are movies or viewers, that explain why certain viewers like certain movies. We offer several algorithms that simplify matrices by decomposing them into a product of matrices that are much smaller in one of the two dimensions. One matrix relates entities of one kind to the small number of concepts, and another relates the concepts to the other kind of entity. If done correctly, the product of the smaller matrices will be very close to the original matrix.
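As a quick, hypothetical illustration of the idea, the sketch below uses a truncated singular-value decomposition, one common such decomposition among the several the chapter covers; the ratings matrix and the choice of two "concepts" are invented.

# Sketch: approximate a ratings-like matrix by a product of two thin matrices.
# The matrix values and the choice of k = 2 concepts are hypothetical.
import numpy as np

M = np.array([[5, 5, 0, 1],     # rows: viewers, columns: movies
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
A = U[:, :k] * s[:k]      # viewers-to-concepts matrix (4 x 2)
B = Vt[:k, :]             # concepts-to-movies matrix  (2 x 4)

approx = A @ B            # the product of the thin factors
print(np.round(approx, 1))
print("error:", round(np.linalg.norm(M - approx), 2))

Here A relates viewers to the two concepts and B relates the concepts to movies, exactly the two-factor structure described above.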
Finally, Chapter 12 discusses algorithms for machine learning from very large datasets.
Techniques covered include perceptrons, support-vector machines, finding models by
gradient descent, nearest-neighbor models, and decision trees.
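As a taste of what is to come, here is a minimal perceptron sketch, one of the techniques listed above; the tiny training set is invented, and the perceptron update is only guaranteed to converge when the data are linearly separable, as they are here.

# Minimal perceptron sketch on an invented, linearly separable dataset.
# Each example is ((x1, x2), label) with labels +1 / -1.
data = [((2.0, 1.0), 1), ((3.0, 3.0), 1),
        ((-1.0, -1.0), -1), ((-2.0, 1.0), -1)]

w = [0.0, 0.0]
b = 0.0
eta = 0.5                  # learning rate

for _ in range(20):        # a few passes suffice for this toy set
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified
            w[0] += eta * y * x1
            w[1] += eta * y * x2
            b += eta * y

print(w, b)                # a separating hyperplane w·x + b = 0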
1.5 Summary of Chapter 1
Data Mining: This term refers to the process of extracting useful models of data. Sometimes, a model can be a summary of the data, or it can be the set of most extreme features of the data.
Bonferroni's Principle: If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
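A back-of-the-envelope calculation in the spirit of this principle, with entirely hypothetical numbers: if we examine a million candidate pairs for agreement on 20 independent yes/no attributes, roughly one "suspicious" pair is expected even in purely random data, so finding one such pair proves nothing.

# Bonferroni-style sanity check (hypothetical numbers): how many
# "suspicious" pairs do we expect in purely random data?
pairs = 1_000_000          # candidate pairs examined
p_match = 0.5 ** 20        # chance a random pair agrees on 20 binary attributes
print(pairs * p_match)     # ~0.95: about one "discovery" by chance alone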
TF.IDF: The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
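Concretely, one standard formulation scores term i in document j as TF times IDF, where TF is the term's count normalized by the count of the most frequent term in that document, and IDF_i = log2(N / n_i) with n_i the number of documents containing term i. A minimal sketch on an invented three-document corpus:

# TF.IDF sketch on a hypothetical corpus.
from collections import Counter
from math import log2

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

N = len(docs)
doc_freq = Counter(w for d in docs for w in set(d))   # n_i per term

def tf_idf(word, doc):
    counts = Counter(doc)
    tf = counts[word] / max(counts.values())   # normalized term frequency
    idf = log2(N / doc_freq[word])
    return tf * idf

print(tf_idf("cat", docs[0]))   # in only one document: higher score
print(tf_idf("the", docs[0]))   # in two of three documents: lower score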
Hash Functions: A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
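For instance, a minimal sketch using Python's built-in hash; the key list is made up, and the exact bucket counts will vary from run to run because Python randomizes its string hash.

# Sketch: hashing string keys into B buckets and checking the spread.
from collections import Counter

B = 7
keys = ["apple", "banana", "cherry", "date", "elderberry",
        "fig", "grape", "honeydew", "kiwi", "lemon"]

buckets = Counter(hash(k) % B for k in keys)
print(buckets)   # a good hash function spreads keys roughly evenly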
Indexes: An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
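A hash-based index can be sketched in a few lines; the record layout and names below are hypothetical. Lookup touches only the one bucket the key hashes to, rather than scanning all records.

# Sketch of a hash index: records are grouped into buckets by the hash
# of one field. Records here are hypothetical (name, phone) pairs.
B = 101
index = [[] for _ in range(B)]

def insert(record):
    index[hash(record[0]) % B].append(record)

def lookup(name):
    return [r for r in index[hash(name) % B] if r[0] == name]

insert(("Sally", "555-1212"))
insert(("Sam", "555-3434"))
print(lookup("Sally"))   # [('Sally', '555-1212')]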
Storage on Disk: When data must be stored on disk (secondary memory), it takes very much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.