testing period size of 200 emails. We thus collected a total of 7,947 p-values, and
their histogram is shown in Fig. 1.
[Figure: p-values histogram]
Fig. 1. P-value plot
Under the hypothesis that the frequencies are constant, the p-values are approximately
uniformly distributed, so the histogram is expected to be flat. On the contrary, this
histogram shows a very large concentration of p-values between 0 and 5%, and a smaller
but still large concentration between 95 and 100%, while p-values in the range of 5 to
95% are under-represented. Our intuitive explanation of this histogram (also based on
our domain knowledge) is the following:
Most of the time, frequencies change significantly (in a statistical sense) between
two consecutive time frames; this is why 60% of the p-values are below 5% (a low
p-value is strong evidence that the frequencies have changed between two time
frames). Email users tend to modify their recipient frequencies quite often. On the
other hand, there are non-negligible periods when those frequencies stay very stable
(13% of the p-values are above 95%, indicating strong stability). Because the
frequencies are so variable under normal circumstances, the Chi-square test by itself
cannot be used to detect an intrusion. Instead, we explore a related metric that is
more useful for that purpose.
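The reading of the histogram above comes down to comparing the share of p-values in each tail with the 5% that a flat histogram would put there. A minimal Python sketch of that summary follows. The function name `tail_shares`, its thresholds, and the sample values are assumptions made for illustration; they are not the data collected in the study.

```python
def tail_shares(p_values, low=0.05, high=0.95):
    """Return the fractions of p-values below `low` and above `high`.

    Under constant frequencies, each fraction should be close to 5%.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("p_values must not be empty")
    below = sum(1 for p in p_values if p < low)
    above = sum(1 for p in p_values if p > high)
    return below / n, above / n


# Hypothetical p-values, for illustration only (not the study's data).
example = [0.01, 0.02, 0.03, 0.50, 0.97, 0.99, 0.04, 0.001, 0.30, 0.98]
low_share, high_share = tail_shares(example)
print(f"below 5%: {low_share:.0%}, above 95%: {high_share:.0%}")
```

Shares far above 5% in either tail, such as the 60% and 13% reported above, indicate that the constant-frequency hypothesis does not hold for most pairs of time frames.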
2.4.3 Hellinger Distance
Our first tests using the Chi-square statistic revealed that the frequencies cannot be
assumed constant between two consecutive time frames for a given user. What is
specific to every user, though, is how variable the frequencies are over time. We try
to assess this by calculating a distance measure between the two frequency tables.
We use the Hellinger distance for this purpose. Its formula is:
$$HD(f_1[], f_2[]) = \sum_{i=0}^{n-1} \left( \sqrt{f_1[i]} - \sqrt{f_2[i]} \right)^2 \qquad (10)$$
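To make the computation concrete, here is a minimal Python sketch of the Hellinger distance of Eq. (10), read as the sum, over every recipient seen in either period, of the squared difference between the square roots of the two frequencies. The function names `relative_frequencies` and `hellinger_distance`, the use of dictionaries keyed by recipient, and the sample addresses are assumptions made for illustration; they are not taken from the original system.

```python
import math
from collections import Counter


def relative_frequencies(recipients):
    """Turn a list of recipient addresses into a {recipient: frequency} table."""
    if not recipients:
        raise ValueError("recipients must not be empty")
    counts = Counter(recipients)
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}


def hellinger_distance(f1, f2):
    """Sum of (sqrt(f1[r]) - sqrt(f2[r]))**2 over all recipients in f1 or f2.

    A recipient missing from one table is treated as having frequency 0 there.
    """
    recipients = set(f1) | set(f2)
    return sum(
        (math.sqrt(f1.get(r, 0.0)) - math.sqrt(f2.get(r, 0.0))) ** 2
        for r in recipients
    )


# Hypothetical recipient lists for one user (illustration only).
training = ["alice@example.com", "alice@example.com", "bob@example.com", "carol@example.com"]
testing = ["alice@example.com", "bob@example.com", "bob@example.com", "dave@example.com"]

distance = hellinger_distance(relative_frequencies(training), relative_frequencies(testing))
print(f"Hellinger distance: {distance:.4f}")
```

With this form, two identical frequency tables give a distance of 0, and two tables with no recipient in common give the maximum value of 2.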
In Eq. (10), f_1[] is the array of frequencies for the training set, f_2[] is the array
for the testing set, and n is the total number of distinct recipients during both
periods. “Hellinger distance size” is a text field set to 100 by default. It represents
the length of the testing window and can