Most machine learning algorithms, including those underlying the
data mining process, assume that the data to be learned (the training
data) is a random sample drawn from a stationary distribution.
Unfortunately, this assumption is violated by the majority of databases
and data streams available for mining today. These databases accumulate
over long periods of time, and the underlying processes generating them
change accordingly, at times quite drastically. This phenomenon is known
as concept drift. According to Hulten et al. (2001), "in many cases . . .
it is more accurate to assume that data was generated by . . . a concept
function with time-varying parameters." Traditional data mining algorithms
learn incorrect models when they mistakenly assume that the underlying
concept is stationary when it is, in fact, drifting. This can degrade the
predictive performance of the resulting models.
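As a rough illustration of a concept function with time-varying parameters (a hypothetical sketch, not code from this chapter), the following Python fragment generates a labeled stream whose true concept is a linear threshold that drifts over time; a model frozen at the first time step gradually loses accuracy on later batches. The drift schedule and all names are invented for illustration.

```python
# Hypothetical illustration of concept drift: the true concept is
# y = 1 if w(t) * x > b(t), where w and b change with time t.
import random

def drifting_concept(t):
    """Return the (weight, threshold) of the true concept at time t.
    The slow linear drift is an arbitrary choice for illustration."""
    w = 1.0 + 0.01 * t      # weight drifts upward
    b = 0.5 - 0.005 * t     # threshold drifts downward
    return w, b

def generate_batch(t, size=100):
    """Draw a batch of (x, label) pairs from the concept active at time t."""
    w, b = drifting_concept(t)
    return [(x, 1 if w * x > b else 0)
            for x in (random.random() for _ in range(size))]

# A model that simply memorizes the concept at t = 0 degrades as t grows.
w0, b0 = drifting_concept(0)
for t in (0, 100, 200):
    batch = generate_batch(t)
    correct = sum(1 for x, y in batch if (1 if w0 * x > b0 else 0) == y)
    print(f"t={t:3d}  accuracy of the frozen t=0 model: {correct / len(batch):.2f}")
```

The frozen model is perfectly accurate at t = 0 but drops toward chance-level performance as the parameters drift, which is the degradation described above.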
Prime examples of systems that must deal with these problems are
on-line learning systems, which use a continuous stream of incoming
batches of training examples to induce rules for a classification task.
Two domains in which such systems are currently employed are credit card
fraud detection and real-time monitoring of manufacturing processes.
As a result of these problems, it is now common practice to mine a
sub-sample of the available data or to mine for a model drastically
simpler than the data could support. Ideally, KDD systems would operate
continuously, constantly processing the data received so that potentially
valuable information is never lost. To achieve this goal, many methods,
termed incremental (on-line) learning methods, have been developed; they
aim to extract patterns from changing streams of data, as sketched below.
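As a minimal sketch of the incremental-learning pattern just described (a hypothetical example, not code from this chapter), the following Python fragment updates a simple perceptron-style classifier one batch at a time, so each batch is processed once and then discarded rather than accumulated. The class, method names, and toy data are assumptions made for illustration.

```python
# Minimal, hypothetical sketch of incremental (on-line) learning:
# the model is updated batch by batch and never revisits old data.
class OnlinePerceptron:
    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def partial_fit(self, batch):
        """Update the model with one batch of (features, label) pairs."""
        for x, y in batch:
            error = y - self.predict(x)          # -1, 0, or +1
            if error != 0:
                self.w = [wi + self.lr * error * xi
                          for wi, xi in zip(self.w, x)]
                self.b += self.lr * error

# Toy demonstration with two hand-made batches (hypothetical data);
# in practice the batches would arrive from an unbounded stream,
# e.g. a credit card transaction feed.
batch1 = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], 0)]
batch2 = [([1.0, 0.5, 0.0], 1), ([0.0, 0.8, 0.2], 0)]
model = OnlinePerceptron(n_features=3)
for batch in (batch1, batch2):
    model.partial_fit(batch)           # each batch is seen once, then dropped
print(model.predict([1.0, 0.0, 0.0]))  # expected: 1
```

Because only the current batch and the model parameters are held in memory, the memory footprint stays bounded no matter how long the stream runs, which is what allows such a system to operate continuously.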
11.7.2 The Inefficiency Challenge
According to Hulten et al. (2001), incremental learning algorithms suffer
from numerous inadequacies from the KDD point of view. Although some of
these algorithms are relatively efficient, they do not guarantee that the
resulting model will be similar to the one obtained by learning on the
same data with non-incremental (batch) methods. Other incremental learning
algorithms produce the same model as the batch version, but at a higher
cost in efficiency, which may mean longer training times.