Segmentation of Continuous Data Streams Based on a Change Detection Methodology - Advanced Techniques in Knowledge Discovery and Data Mining

commonly performed by estimating the validation error rate of the examined

model. This is calculated by the following equation:

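$$\text{validation error rate} = \frac{\sum_{X_{val}} \left[\, T(\text{predicted}) \neq T(\text{actual}) \,\right]}{N_{X_{val}}} \qquad (4.1)$$

where the sum counts the records of the validation set $X_{val}$ on which $H(G)$ is wrong, and $N_{X_{val}}$ is the number of records in the validation set.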
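As a minimal sketch of Equation (4.1), the validation error rate can be computed as the fraction of validation records whose predicted target value differs from the actual one. The function name `validation_error_rate` and the five-record sample are hypothetical and serve only as illustration.

```python
def validation_error_rate(predicted, actual):
    """Fraction of validation records whose predicted target differs from the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("validation set is empty")
    # Count the records on which the hypothesis is wrong: T(predicted) != T(actual)
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    # Divide by the number of records in the validation set
    return wrong / len(actual)


# Hypothetical validation set of five records (illustrative values only)
predicted = ["pass", "fail", "pass", "pass", "fail"]
actual = ["pass", "pass", "pass", "fail", "fail"]
error = validation_error_rate(predicted, actual)
print(error)  # 2 wrong predictions out of 5 records
```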

When the database D is not fixed but is accumulated over time, the data-mining task should be altered: in every period K, a new set of records X_K is added to the database; d_K is the set of records X_K that was added at the start of period K, and D_K is the accumulated database, $D_K = \bigcup_{k=1}^{K} d_k$. The task therefore becomes: given a database D_K containing a complete set of records X_K, generate the best set of hypotheses H_K to describe the accumulated model M. At the end of time period K + 1, a new question is encountered: is $M_{K,G} = M_{K+1,G}$ for every K?
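The accumulation scheme can be sketched as follows, under the reading that the accumulated database D_K is the union of the period batches d_1 to d_K. The function name `accumulate` and the record tuples are hypothetical, and the records are assumed to be distinct, so a simple concatenation stands in for the union.

```python
def accumulate(periods):
    """Return [D_1, D_2, ...], where D_K concatenates the batches d_1 .. d_K."""
    accumulated = []
    current = []
    for d_k in periods:
        # Build a new list so that earlier snapshots D_1 .. D_{K-1} stay unchanged
        current = current + list(d_k)
        accumulated.append(current)
    return accumulated


# Hypothetical batches of (attribute, target) records, one batch per period
periods = [
    [("male", "pass"), ("female", "fail")],
    [("female", "pass")],
    [("male", "fail"), ("male", "pass")],
]
D = accumulate(periods)
sizes = [len(D_K) for D_K in D]
print(sizes)  # sizes of D_1, D_2, D_3
```

A model induced by G from each snapshot D_K could then be compared with the model induced from D_{K+1}, which is the question the text raises.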

As noted before, several methods dealing with this problem have already been

proposed by researchers. Most methods have dealt with “how the model M can be

updated efficiently when a new period K is encountered” or “how we can adapt to

the time factor,” rather than asking the following questions:

1. Was the model significantly changed during the period K?
2. What was the nature of the change?
3. Should we consider several of the past periods as redundant or not required for an algorithm G to generate a better model M?

Hence, the objective is to define and evaluate a change-detection methodology for identifying a significant change that happened during period K in a data-mining model, which was incrementally built in periods 1 to K - 1, based on the data that was accumulated during period K.

4.2.2 Variety of Changes

Various significant changes might occur when building the model M with the algorithm G. A significant change in a data-mining classification model has several possible causes:

1. A change in the distribution of one or more of the candidate attributes (A). For example, the database in periods 1 to K - 1 consists of 45% male and 55% female examples, whereas in period K all records are of male samples.

2. A change in the distribution of the target variable (T). For example, consider the case of examining the rate of failures in a seminar examination based on the characteristics of the students in the course in past consecutive years. If the average failure rate was 20% in 1999 and 40% in 2000, then a change in the target distribution has occurred.
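One standard way to check whether such a shift in distribution is statistically significant is Pearson's chi-square goodness-of-fit test. It is used here as an assumed illustration, not necessarily as the estimator this chapter adopts. The sketch below compares the counts observed in period K with the proportions seen in earlier periods. The function name, the sample size of 100 records per period, and the 5% significance level are all assumptions.

```python
def chi_square_statistic(observed_counts, expected_proportions):
    """Pearson chi-square statistic of observed counts against expected proportions."""
    unknown = set(observed_counts) - set(expected_proportions)
    if unknown:
        raise ValueError(f"no expected proportion for categories: {sorted(unknown)}")
    n = sum(observed_counts.values())
    statistic = 0.0
    for category, proportion in expected_proportions.items():
        expected = n * proportion
        if expected <= 0:
            raise ValueError("expected counts must be positive")
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Critical value of the chi-square distribution, 1 degree of freedom, 5% level
CRITICAL_VALUE_DF1 = 3.841

# Target variable: failure rate of 20% in the past, 40 failures out of an
# assumed 100 students in the new period
target_stat = chi_square_statistic({"fail": 40, "pass": 60},
                                   {"fail": 0.20, "pass": 0.80})

# Candidate attribute: 45% male and 55% female in the past, an assumed 100
# records in period K that are all male
attribute_stat = chi_square_statistic({"male": 100, "female": 0},
                                      {"male": 0.45, "female": 0.55})

print(target_stat > CRITICAL_VALUE_DF1)     # change in the target distribution?
print(attribute_stat > CRITICAL_VALUE_DF1)  # change in the attribute distribution?
```

With these assumed sample sizes, both statistics exceed the critical value, so both examples would be flagged as significant changes. With much smaller samples the same percentages might not be.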
