Oracle Analytics: Business Intelligence and Analytic Applications - The CIO's Guide to Oracle Products and Solutions


domains as well. The pattern-recognition step is usually independent of

the domain or application.

Data mining starts with the raw data, which usually takes the form of

simulation data, observed signals, or images. These data are preprocessed

using various techniques such as sampling, multiresolution analysis,

denoising, feature extraction, and normalization.
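
Two of these preprocessing steps, denoising and normalization, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which denoising or normalization method is meant, so a centered moving average and z-score scaling are used here as common stand-ins, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def moving_average(values, window=3):
    """Denoise by replacing each value with the mean of its neighborhood.

    Assumption: a centered window that is truncated at the edges.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def zscore_normalize(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw_signal = [2, 4, 4, 4, 5, 5, 7, 9]
prepared = zscore_normalize(moving_average(raw_signal))
```

Chaining the two functions, as in the last line, mirrors the idea that raw data pass through several preprocessing techniques before any pattern recognition takes place.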

Sampling is a widely accepted technique to reduce the size of the data

set and make it easier to handle. However, in some cases, such as when

looking for something that appears infrequently in the set, sampling may

not be viable. Multiresolution analysis is another technique to reduce the

size of the data set. With multiresolution analysis, data at a fine resolution

can be coarsened, which shrinks the data set by removing some of the

detail. Feature extraction then pulls the relevant features out of the raw data set. In credit card

fraud, for instance, an important feature might be the location where a

card is used: If a credit card is suddenly used in a country where it's never

been used before, fraudulent use seems likely. Thus the key to effective

data mining is reducing the number of features used to mine data, retaining

only those features that provide the best discrimination among the

relevant data items.
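
The ideas in this paragraph can be made concrete with a small Python sketch. It is a simplified illustration, not a prescribed method: taking every k-th record stands in for sampling, block averaging stands in for coarsening (real multiresolution analysis typically uses more elaborate transforms), and all function names and data are hypothetical.

```python
def systematic_sample(records, step):
    """Keep every `step`-th record, starting with the first."""
    return records[::step]


def coarsen(signal, factor):
    """Reduce resolution by averaging consecutive blocks of `factor` values.

    A trailing partial block is averaged over its own length.
    """
    blocks = [signal[i:i + factor] for i in range(0, len(signal), factor)]
    return [sum(block) / len(block) for block in blocks]


def is_first_use_in_country(past_countries, country):
    """Feature from the fraud example: has the card been used here before?"""
    return country not in past_countries


# Made-up data: 100 readings with a single rare event at position 37.
readings = [0] * 100
readings[37] = 1

sample = systematic_sample(readings, 10)  # positions 0, 10, ..., 90
rare_event_survives = 1 in sample         # False: the sample skipped position 37

coarse = coarsen([1, 3, 5, 7, 9, 11], 2)  # three values instead of six

suspicious = is_first_use_in_country({"US", "CA"}, "BR")
```

The sample is one-tenth the size of the original but no longer contains the rare event, which is the situation in which sampling is not viable. The coarsened signal keeps the overall trend while discarding detail.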

Once the data are preprocessed or transformed, pattern-recognition software

is used to look for patterns. A pattern is defined as an ordering that contains

some underlying structure. The results are processed back into a format

familiar to the experts, who then can examine and interpret the results.

To be truly useful, data-mining techniques must be scalable. In other

words, when the problem increases in size, we don't want the mining time

to increase proportionally. Making the end-to-end process scalable can be

very challenging, because it's not just a matter of scaling each step, but of

scaling the process as a whole.
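
To see why scalability depends on how a step is implemented, consider the following Python sketch. It is an assumption-laden illustration rather than anything taken from the text: duplicate detection is chosen as an arbitrary example step, and the two hypothetical functions count their own basic operations so that growth can be compared without timing anything.

```python
def find_duplicates_pairwise(items):
    """Compare every pair of items; work grows with the square of the input."""
    comparisons = 0
    duplicates = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                duplicates.add(items[i])
    return duplicates, comparisons


def find_duplicates_hashed(items):
    """Single pass with a set of seen items; work grows linearly with the input."""
    lookups = 0
    seen = set()
    duplicates = set()
    for item in items:
        lookups += 1
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates, lookups


items = list(range(100)) + [5, 42]
dups_slow, comparisons = find_duplicates_pairwise(items)
dups_fast, lookups = find_duplicates_hashed(items)
```

Both functions find the same duplicates, but for these 102 items the pairwise version performs 5,151 comparisons while the hashed version performs 102 lookups. Making one step scale this way is only part of the job; the surrounding steps must keep pace as well.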

Large-scale data mining is a field very much in its infancy, making it a

source of several open research problems. In order to extend data-mining

techniques to large-scale data, several barriers must be overcome. The

extraction of key features from large, multidimensional, complex data is a

critical issue that must be addressed first, prior to the application of the

pattern-recognition algorithms. The features extracted must be relevant to

the problem, insensitive to small changes in the data, and invariant to scaling,

rotation, and translation. In addition, we need to select discriminating

features through appropriate dimension-reduction techniques. The pattern-

recognition step poses several challenges as well. For example, is it possible

to modify existing algorithms, or design new ones, that are scalable, robust,
