domains as well. The pattern-recognition step is usually independent of
the domain or application.
Data mining starts with the raw data, which usually takes the form of
simulation data, observed signals, or images. These data are preprocessed
using various techniques such as sampling, multiresolution analysis,
denoising, feature extraction, and normalization.
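As a rough illustration of two of these preprocessing steps, the sketch below denoises a short signal with a moving average and then normalizes it to zero mean and unit variance. The function names, the window size, and the sample values are assumptions made for this example; the text does not prescribe any particular denoising or normalization method.

```python
from statistics import mean, pstdev

def moving_average(signal, window=3):
    """Denoise by replacing each point with the mean of a centered window.

    The window is truncated at the ends of the signal.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(mean(signal[lo:hi]))
    return smoothed

def z_score(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal carries no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical observed signal with one spike.
raw = [1.0, 2.0, 9.0, 2.0, 1.0]
smoothed = moving_average(raw)
normalized = z_score(smoothed)
```

Denoising before normalizing keeps a single spike from dominating the scale of the normalized values, although the right order depends on the data.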
Sampling is a widely accepted technique to reduce the size of the data
set and make it easier to handle. However, in some cases, such as when
looking for something that appears infrequently in the set, sampling may
not be viable. Multiresolution analysis is another technique to reduce the
size of the data set. With multiresolution analysis, data at a fine resolution
can be coarsened, which shrinks the data set by removing some of the
detail. Feature extraction, in turn, pulls relevant features out of the raw
data set. In credit card fraud, for instance, an important feature might be
the location where a card is used: if a credit card is suddenly used in a
country where it has never been used before, fraudulent use seems likely.
The key to effective data mining is thus reducing the number of features
used to mine data, retaining only those features that provide the best
discrimination among the relevant data items.
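The three ideas in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production method: the function names are hypothetical, coarsening is done by simple block averaging (one of several ways to build a coarser resolution level), and transactions are assumed to arrive as (card, country) pairs in time order.

```python
import random

def sample(records, fraction, seed=0):
    """Uniform random sample without replacement.

    Rare items are easily missed, which is why sampling is a poor fit
    when looking for something that appears infrequently.
    """
    rng = random.Random(seed)
    k = min(len(records), max(1, int(len(records) * fraction)))
    return rng.sample(records, k)

def coarsen(signal, factor=2):
    """Average non-overlapping blocks of `factor` points.

    Each call produces one coarser resolution level; a short final
    block is averaged over the points it actually contains.
    """
    coarse = []
    for i in range(0, len(signal), factor):
        block = signal[i:i + factor]
        coarse.append(sum(block) / len(block))
    return coarse

def first_use_in_country(transactions):
    """Flag transactions made in a country that is new for that card.

    The first transaction of a card is not flagged, because there is
    no usage history to compare against.
    """
    seen = {}
    flags = []
    for card, country in transactions:
        countries = seen.setdefault(card, set())
        flags.append(bool(countries) and country not in countries)
        countries.add(country)
    return flags
```

For example, `coarsen([1, 3, 5, 7, 9])` halves the resolution of the signal, and `first_use_in_country` turns a raw transaction log into a single boolean feature per transaction that a fraud detector could use.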
Once the data are preprocessed or transformed, pattern-recognition software
is used to look for patterns. A pattern is defined as an ordering that contains
some underlying structure. The results are processed back into a format
familiar to the experts, who can then examine and interpret them.
To be truly useful, data-mining techniques must be scalable. In other
words, when the problem increases in size, we don't want the mining time
to increase proportionally. Making the end-to-end process scalable can be
very challenging, because it's not just a matter of scaling each step, but of
scaling the process as a whole.
Large-scale data mining is a field very much in its infancy, making it a
source of several open research problems. In order to extend data-mining
techniques to large-scale data, several barriers must be overcome. The
extraction of key features from large, multidimensional, complex data is a
critical issue that must be addressed first, prior to the application of the
pattern-recognition algorithms. The features extracted must be relevant to
the problem, insensitive to small changes in the data, and invariant to
scaling, rotation, and translation. In addition, we need to select discriminating
features through appropriate dimension-reduction techniques. The
pattern-recognition step poses several challenges as well. For example, is it possible
to modify existing algorithms, or design new ones, that are scalable, robust,