Data Mining - Mining of Massive Datasets

Database Reference

In-Depth Information

parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution

completely characterize the distribution and would become the model of the data.

□

1.1.2

Machine Learning

There are some who regard data mining as synonymous with machine learning. There is

no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set, to train an algorithm of one of

the many types used by machine-learning practitioners, such as Bayes nets, support-vector

machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where

machine learning is a good approach is when we have little idea of what we are looking

for in the data. For example, it is rather unclear what it is about movies that makes certain

movie-goers like or dislike it. Thus, in answering the “Netflix challenge” to devise an al-

gorithm that predicts the ratings of movies by users, based on a sample of their responses,

machine-learning algorithms have proved quite successful. We shall discuss a simple form

of this type of algorithm in Section 9.4 .

On the other hand, machine learning has not proved successful in situations where we

can describe the goals of the mining more directly. An interesting case in point is the at-

tempt by WhizBang! Labs 1 to use machine learning to locate people's resumes on the Web.

It was not able to do better than algorithms designed by hand to look for some of the obvi-

ous words and phrases that appear in the typical resume. Since everyone who has looked at

or written a resume has a pretty good idea of what resumes contain, there was no mystery

about what makes a Web page a resume. Thus, there was no advantage to machine-learning

over the direct design of an algorithm to discover resumes.

1.1.3

Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem.

In this case, the model of the data is simply the answer to a complex query about it. For

instance, given the set of numbers of Example 1.1 , we might compute their average and

standard deviation. Note that these values might not be the parameters of the Gaussian that

best fits the data, although they will almost certainly be very close if the size of the data is

large.

There are many different approaches to modeling data. We have already mentioned the

possibility of constructing a statistical process whereby the data could have been generated.

Most other approaches to modeling can be described as either

Search WWH ::

Custom Search

Home