a hypothesis is evaluated on sample data. One can distinguish between two
classes of hypothesis tests: (i) tests on a single population (single-sample tests)
and (ii) tests on two populations (two-sample tests). Another classification of
hypothesis tests concerns the dimensionality of each data element in
a sample. Tests dealing with scalar data elements are called univariate tests,
while those dealing with vector data elements are called multivariate tests.
For our problem, two-sample univariate and multivariate tests are appropriate.
The dataset $D$ of feature values can be considered as a time series, as depicted
in Fig. 3. Each $d_i \in D$ corresponds to a feature value for a trace (or sub-log)
and can be a scalar or a vector. The basic idea is to consider a series of
successive populations of values (of size $w$) and investigate if there is a significant
difference between the two populations. The premise is that differences are
expected to be perceived at change points, provided appropriate characteristics
of the change are captured as features. A moving window of size $w$ is used
to generate the populations. Fig. 3 depicts a scenario where two populations
$P_1 = \langle d_1, d_2, \ldots, d_w \rangle$ and
$P_2 = \langle d_{w+1}, d_{w+2}, \ldots, d_{2w} \rangle$ of size $w$ are considered.
In the next iteration, the populations correspond to
$P_1 = \langle d_2, d_3, \ldots, d_{w+1} \rangle$ and
$P_2 = \langle d_{w+2}, d_{w+3}, \ldots, d_{2w+1} \rangle$. Given a dataset of $m$ values,
the number of population pairs will be $m - 2w + 1$.
Fig. 3. Dataset of feature values $d_1, d_2, \ldots, d_m$ considered as a time series for
hypothesis tests. $P_1$ and $P_2$ are two populations of size $w$; the figure shows their
positions in iteration 1 and, shifted by one element, in iteration 2.
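The moving-window scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `population_pairs` and the sample values are hypothetical and do not come from the paper; only the window size `w`, the dataset length `m`, and the pair count `m - 2w + 1` are taken from the text.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def population_pairs(
    data: Sequence[T], w: int
) -> Iterator[Tuple[Sequence[T], Sequence[T]]]:
    """Yield successive (P1, P2) pairs of adjacent windows of size w.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if w <= 0:
        raise ValueError("window size w must be positive")
    m = len(data)
    # The last valid start index must leave room for two full windows,
    # so there are m - 2w + 1 pairs in total (none if m < 2w).
    for start in range(m - 2 * w + 1):
        p1 = data[start:start + w]            # d_{start+1} .. d_{start+w}
        p2 = data[start + w:start + 2 * w]    # d_{start+w+1} .. d_{start+2w}
        yield p1, p2


if __name__ == "__main__":
    d = [1, 2, 3, 4, 5, 6, 7]  # m = 7 made-up scalar feature values
    pairs = list(population_pairs(d, w=2))
    print(len(pairs))  # m - 2w + 1 = 7 - 4 + 1 = 4
    print(pairs[0])    # ([1, 2], [3, 4])
    print(pairs[1])    # ([2, 3], [4, 5])
```

Each element of `data` may equally be a vector (for example a tuple of feature values), since the helper only slices the sequence and never inspects the elements.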
We will use the two-sample Kolmogorov-Smirnov test (KS test) and the
Mann-Whitney U test (MW test) as hypothesis tests for univariate data, and
the two-sample Hotelling $T^2$ test for multivariate data. The KS test evaluates
the hypothesis “Do the two independent samples (populations $P_1$ and $P_2$)
represent two different cumulative frequency distributions?” while the MW test
evaluates the hypothesis “Do the two independent samples have different
distributions with respect to the rank-ordering of the values?”. The multivariate
Hotelling $T^2$ test is a generalization of the $t$-test and evaluates the hypothesis
“Do the two samples have the same mean pattern?”. All of these tests yield a
significance probability assessing the validity of the hypothesis on the samples.
We refer the reader to [13] for a classic introduction to various hypothesis tests.
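To make the two univariate tests concrete, the following is a minimal pure-Python sketch of how a significance probability could be computed for one population pair. It is not the implementation used in the paper. The function names are hypothetical, both p-values rely on large-sample approximations (the asymptotic Kolmogorov distribution for the KS test, a normal approximation without tie correction for the MW test), and the example series is made up. For small windows an exact method would be preferable; libraries such as SciPy provide `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu` for this purpose.

```python
import math
from typing import Sequence, Tuple


def ks_two_sample(p1: Sequence[float], p2: Sequence[float]) -> Tuple[float, float]:
    """Return the two-sample KS statistic D and an asymptotic p-value."""
    n1, n2 = len(p1), len(p2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(p1), sorted(p2)
    # D is the largest gap between the two empirical CDFs,
    # evaluated at every observed value.
    d, i, j = 0.0, 0, 0
    for x in sorted(set(a) | set(b)):
        while i < n1 and a[i] <= x:
            i += 1
        while j < n2 and b[j] <= x:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.2:
        # The series below converges slowly here; the p-value is ~1.
        return d, 1.0
    # Asymptotic Kolmogorov series: p = 2 * sum_k (-1)^(k-1) exp(-2 k^2 lam^2).
    total = 0.0
    for k in range(1, 101):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term
        if term < 1e-12:
            break
    return d, min(1.0, max(0.0, 2.0 * total))


def mann_whitney_u(p1: Sequence[float], p2: Sequence[float]) -> Tuple[float, float]:
    """Return U for the first sample and a two-sided normal-approximation p-value."""
    n1, n2 = len(p1), len(p2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(x, 0) for x in p1] + [(x, 1) for x in p2])
    rank_sum1, k, n = 0.0, 0, n1 + n2
    while k < n:
        e = k
        while e < n and pooled[e][0] == pooled[k][0]:
            e += 1
        midrank = (k + 1 + e) / 2.0  # average of ranks k+1 .. e for tied values
        rank_sum1 += midrank * sum(1 for t in range(k, e) if pooled[t][1] == 0)
        k = e
    u1 = rank_sum1 - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    # Simplification: the variance ignores the correction for ties.
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean) / sd
    return u1, math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Made-up feature values with a shift in level after the sixth element.
    series = [0.0, 1.0, 0.5, 0.2, 0.8, 0.4, 5.1, 5.9, 5.3, 5.6, 5.0, 5.7]
    w = 3
    for start in range(len(series) - 2 * w + 1):
        p1, p2 = series[start:start + w], series[start + w:start + 2 * w]
        _, p_ks = ks_two_sample(p1, p2)
        _, p_mw = mann_whitney_u(p1, p2)
        print(f"pair {start + 1}: KS p={p_ks:.3f}  MW p={p_mw:.3f}")
```

Under these assumptions, population pairs that straddle the shift should yield lower significance probabilities than pairs drawn entirely from one side of it, which is the signal the approach looks for at change points.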
5 Case Study and Discussion
We illustrate the concepts presented in this paper with an example process. The
process corresponds to the handling of health insurance claims in a travel agency.
Upon registration of a claim, a general questionnaire is sent to the claimant. In
parallel, a registered claim is classified into a high or low claim. For low claims,
 