Segmentation of Continuous Data Streams Based on a Change Detection Methodology - Advanced Techniques in Knowledge Discovery and Data Mining

validation records and is not an equation), then it will be revealed as a significant

change.

The procedure has three major stages. The first stage initializes the procedure. The second stage is designed to detect a significant change in the “patterns” (rules) of the prebuilt data-mining model, as described in the previous section. The third stage is designed to evaluate whether one or more of the candidate attributes or target variables (A and T) show a significant change between periods.
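The third stage can be illustrated with a minimal sketch. It assumes that a change in a single categorical variable between two periods is judged with Pearson's chi-square statistic for homogeneity; the text above does not name the test, so this choice, along with the function name `chi_square_between_periods`, is an assumption made for illustration only.

```python
from collections import Counter


def chi_square_between_periods(prev_values, curr_values):
    """Pearson chi-square statistic comparing the distribution of one
    categorical variable in two periods (hypothetical helper).

    Returns a (statistic, degrees_of_freedom) pair.
    """
    if not prev_values or not curr_values:
        raise ValueError("both periods must contain at least one record")

    prev_counts = Counter(prev_values)
    curr_counts = Counter(curr_values)
    categories = set(prev_counts) | set(curr_counts)

    n_prev, n_curr = len(prev_values), len(curr_values)
    total = n_prev + n_curr

    statistic = 0.0
    for category in categories:
        pooled = prev_counts[category] + curr_counts[category]
        for observed, n_period in (
            (prev_counts[category], n_prev),
            (curr_counts[category], n_curr),
        ):
            # Expected count if the variable had the same distribution
            # in both periods.
            expected = pooled * n_period / total
            statistic += (observed - expected) ** 2 / expected

    return statistic, len(categories) - 1


# Toy values, not taken from the "Dropout" data set.
previous = ["high"] * 10 + ["low"] * 10
current = ["high"] * 20
stat, dof = chi_square_between_periods(previous, current)
print(stat, dof)
```

The returned statistic would then be compared with a chi-square critical value for the given degrees of freedom (3.841 for one degree of freedom at the 5% level); a larger statistic suggests the variable's distribution changed between the periods.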

The basic assumption for using the procedure is the availability of sufficient

data for each run of the algorithm on every period. If this assumption is not valid,

it is necessary to merge two or more periods to obtain statistically significant

outcomes.
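One possible way to merge periods is sketched below: a greedy pass that pools consecutive periods until each pooled period reaches a minimum record count. The threshold `min_records` and the function name are hypothetical; the text does not say how much data counts as sufficient or how the periods to merge are chosen.

```python
def merge_sparse_periods(periods, min_records):
    """Greedily merge consecutive periods until every merged period holds
    at least `min_records` records (hypothetical threshold).

    `periods` is a list of lists of records, in time order.
    """
    merged = []
    buffer = []
    for records in periods:
        buffer.extend(records)
        if len(buffer) >= min_records:
            merged.append(buffer)
            buffer = []

    if buffer:
        if merged:
            # Fold a too-small trailing remainder into the last merged period.
            merged[-1].extend(buffer)
        else:
            # Not enough data overall: return everything as one period.
            merged.append(buffer)
    return merged


# Toy example: four periods with 5, 3, 4 and 2 records, threshold of 6.
toy_periods = [[1] * 5, [2] * 3, [3] * 4, [4] * 2]
print([len(p) for p in merge_sparse_periods(toy_periods, 6)])
```

Merging only adjacent periods keeps the time order intact, which matters because the procedure compares each period with the ones before it.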

4.3 Application Evaluation

4.3.1 Data Set Description

The method proved useful when run on artificially generated data sets. It was also evaluated on several benchmark data sets (Zeira et al. [39]).

An example of the implementation of the change-detection methodology is illustrated in the first set of experiments, which were performed on a database obtained from a network of colleges in Israel. This data set describes yearly dropouts (i.e., each time period is one year) of students from technician and technical engineering colleges (we refer to this data set as “Dropout”). The candidate attributes are: regional area of the colleges (REGION), a discrete categorical variable; number of divisions of studies in the institute (DIVISIONS), a discrete variable; number of students in the institute (SUMP), a discretized variable where each value X describes the interval [40(X − 1), 40X]; average number of students in class (AVEP), discretized to two intervals (low and high); percent of technological reserve students in the institute (TR_PER), discretized to two intervals (high and low); and class of students (CLASS), a discrete categorical variable (technicians studies and technical engineers studies). The target factor (DROPOUT) describes dropout percentage in the institute (high, low, negative). Dropout represents students who have not finished their studies according to the pre-defined curriculum of their class.
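The SUMP discretization can be sketched as follows, assuming each value X stands for the interval [40(X − 1), 40X] of student counts. The function names are hypothetical. How boundary counts are assigned is also an assumption: the closed intervals overlap at multiples of 40, and this sketch places such counts in the lower interval.

```python
import math

INTERVAL_WIDTH = 40  # assumed width of each SUMP interval


def sump_value(num_students):
    """Map a raw student count to its discretized SUMP value X.

    Assumption: a count on a boundary (40, 80, ...) goes to the lower
    interval, and a count of 0 is placed in the first interval.
    """
    if num_students < 0:
        raise ValueError("number of students cannot be negative")
    return max(1, math.ceil(num_students / INTERVAL_WIDTH))


def sump_interval(x):
    """Return the (lower, upper) bounds of the interval for SUMP value x."""
    if x < 1:
        raise ValueError("SUMP values start at 1")
    return (INTERVAL_WIDTH * (x - 1), INTERVAL_WIDTH * x)


# Toy counts, not taken from the "Dropout" data set.
for count in (40, 41, 100):
    x = sump_value(count)
    print(count, x, sump_interval(x))
```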

The “Dropout” database represents data for a five-year period. Because of organizational and social trends, some changes in the data-mining model are commonly expected after the model becomes stable. Therefore, the basic assumption for this data set is that significant changes will be observed over time.

The second set of experiments was performed on a stock market data set,

initially used in Last et al. [24] for evaluation of the IFN algorithm. The raw data

represent the daily stock prices of 373 companies from the Standard & Poor's 500

