Introduction - Data Mining: Concepts and Techniques

models of target classes can be built. In other words, such statistical models can be the

outcome of a data mining task. Alternatively, data mining tasks can be built on top of

statistical models. For example, we can use statistics to model noise and missing data

values. Then, when mining patterns in a large data set, the data mining process can use

the model to help identify and handle noisy or missing values in the data.
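
The idea of using a statistical model to handle noisy or missing values can be sketched as follows. This is a minimal illustration, not the book's method: it assumes the readings are roughly bell-shaped, fills gaps with the sample mean, and flags values more than `z_threshold` standard deviations from the mean. The function name `clean_readings`, the threshold of 2.0, and the sample data are all hypothetical.

```python
import statistics


def clean_readings(values, z_threshold=2.0):
    """Impute missing values (None) with the mean and flag likely noise.

    Assumption: values far from the mean, measured in standard
    deviations (a z-score), are treated as noisy.
    """
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    stdev = statistics.stdev(observed)

    cleaned, flags = [], []
    for v in values:
        if v is None:
            cleaned.append(mean)      # fill the gap with the model's estimate
            flags.append("imputed")
        elif stdev > 0 and abs(v - mean) / stdev > z_threshold:
            cleaned.append(v)         # keep the value but mark it as suspect
            flags.append("noisy")
        else:
            cleaned.append(v)
            flags.append("ok")
    return cleaned, flags


# Hypothetical sensor readings with one gap and one suspicious spike.
readings = [10.0, 11.0, None, 9.0, 10.0, 50.0, 10.0, 11.0]
cleaned, flags = clean_readings(readings)
print(flags)
```

A single extreme value also inflates the mean and standard deviation it is judged against, so a more robust sketch might use the median and the median absolute deviation instead.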

Statistics research develops tools for prediction and forecasting using data and

statistical models. Statistical methods can be used to summarize or describe a collection

of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is

useful for mining various patterns from data as well as for understanding the underlying

mechanisms generating and affecting the patterns. Inferential statistics (or predictive

statistics) models data in a way that accounts for randomness and uncertainty in the

observations and is used to draw inferences about the process or population under

investigation.
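
As a small illustration of drawing an inference about a population from a sample, the sketch below computes an approximate confidence interval for a mean. It assumes a normal approximation, which is rough for small samples, and the function name `mean_confidence_interval` and the sample values are hypothetical.

```python
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumption: the sample mean is close to normally distributed,
    so a z-based interval is used rather than a t-based one.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / n ** 0.5
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_confidence_interval(sample)
print(low, high)
```

The interval expresses the uncertainty in the estimate: a higher confidence level or a smaller sample yields a wider interval.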

Statistical methods can also be used to verify data mining results. For example, after

a classification or prediction model is mined, the model should be verified by

statistical hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data

analysis) makes statistical decisions using experimental data. A result is called statistically

significant if it is unlikely to have occurred by chance. If the classification or prediction

model holds true, then the descriptive statistics of the model increase the soundness of

the model.
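
One way such a verification might look in practice is a one-sided binomial test of whether a classifier's accuracy on held-out data is better than guessing. The sketch below is an assumption-laden example rather than a prescribed procedure: it assumes two balanced classes (so chance accuracy is 0.5) and independent test examples, and the counts used are hypothetical.

```python
from math import comb


def binomial_p_value(correct, total, chance=0.5):
    """One-sided p-value: P(X >= correct) for X ~ Binomial(total, chance).

    This is the probability of seeing at least `correct` right answers
    if the model were only guessing with success rate `chance`.
    """
    return sum(
        comb(total, k) * chance ** k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical result: 16 of 20 held-out examples classified correctly.
p = binomial_p_value(correct=16, total=20)
print(p < 0.05)
```

A small p-value means the observed accuracy is unlikely to have occurred by chance, which is the sense of "statistically significant" used above.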

Applying statistical methods in data mining is far from trivial. Often, a serious

challenge is how to scale up a statistical method over a large data set. Many statistical

methods have high computational complexity. When such methods are applied to

large data sets that are also distributed on multiple logical or physical sites, algorithms

should be carefully designed and tuned to reduce the computational cost. This challenge

becomes even tougher for online applications, such as online query suggestions in

search engines, where data mining is required to continuously handle fast, real-time

data streams.
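
To make the streaming constraint concrete, the sketch below computes a running mean and variance in a single pass with constant memory, using Welford's online algorithm. It is offered only as one example of adapting a statistic to a data stream; the class name `RunningStats` and the input values are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from both the old and the updated mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than 2 points."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Each update costs constant time and the object never stores past values, so the same code works whether the stream has eight items or billions.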

1.5.2 Machine Learning

Machine learning investigates how computers can learn (or improve their performance)

based on data. A main research area is for computer programs to automatically learn to

recognize complex patterns and make intelligent decisions based on data. For example, a

typical machine learning problem is to program a computer so that it can automatically

recognize handwritten postal codes on mail after learning from a set of examples.

Machine learning is a fast-growing discipline. Here, we illustrate classic problems in

machine learning that are highly related to data mining.

Supervised learning is basically a synonym for classification. The supervision in the

learning comes from the labeled examples in the training data set. For example, in

the postal code recognition problem, a set of handwritten postal code images and

their corresponding machine-readable translations are used as the training examples,

which supervise the learning of the classification model.
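
The role of labeled examples can be illustrated with a very small classifier. The sketch below uses a 1-nearest-neighbor rule, chosen here only for brevity and not taken from the text; the two-number feature vectors stand in for features extracted from digit images and are entirely hypothetical.

```python
def predict_1nn(training, query):
    """Return the label of the training example closest to `query`.

    `training` is a list of (features, label) pairs; the labels are the
    supervision that tells the learner what each example represents.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda example: sq_dist(example[0], query))
    return label


# Hypothetical labeled training set: (features, digit label).
training = [
    ((0.9, 0.1), "1"),
    ((0.8, 0.2), "1"),
    ((0.2, 0.9), "7"),
    ((0.1, 0.8), "7"),
]

print(predict_1nn(training, (0.85, 0.15)))
print(predict_1nn(training, (0.15, 0.85)))
```

Without the labels in `training`, the function would have no way to name the class of a new example, which is what distinguishes this setting as supervised.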
