frames. Obviously, recipient frequencies are not constant over a long time horizon, as
users will add new recipients and drop old ones. For behavioral modeling, however, it
can be informative to analyze how much the frequencies vary between two nearby time frames.
The window is composed of two histograms and two tables. They are constructed
in the same manner as in the Profile window, but here two time periods
of activity for the same user are compared. The idea is to treat the first period, the
“Training range” at the top, as the true distribution corresponding to the user under
normal behavior, while the second time period, the “Testing range” at the bottom, is
used to evaluate whether frequencies have changed, and whether any malicious activity
is taking place.
By default, the 200 most recent emails are selected as the Testing range, while the
previous 800 form the Training range, following the usual 1/5 - 4/5 ratio between
testing and training sets, with the past 1000 messages as the total set. These ranges are
modifiable; a live version would use a much shorter testing period in order to provide
fast alerts.
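
The default split described above can be sketched as follows. This is only an illustration: the function name `split_ranges` and its parameters are hypothetical, and the sketch assumes the messages are held in a list ordered from oldest to newest.

```python
def split_ranges(messages, total=1000, testing=200):
    """Return (training, testing) ranges taken from the most recent messages.

    Assumption: `messages` is ordered oldest to newest.
    """
    # Keep only the most recent `total` messages (all of them if fewer exist).
    recent = messages[-total:]
    # The last `testing` messages form the Testing range,
    # everything before them forms the Training range.
    cut = max(len(recent) - testing, 0)
    return recent[:cut], recent[cut:]


# Example with 1500 numbered messages: the split covers messages 500..1499.
training, testing = split_ranges(list(range(1500)))
print(len(training), len(testing))  # 800 200
```

A live version with a shorter testing period would only need a smaller `testing` value.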
The scales on the x-axis for both Recipient Frequency histograms are the same, and
are based on the sorted frequencies from the Training range. This implies that the
addresses appearing only in the testing range (but not in the training range) are
nevertheless present in both histograms, but with frequency zero in the training range
histogram. Conversely, they have a non-zero frequency in the lower histogram and are
located on the extreme right side. (One can see a jump in frequency on this side, as
the sorting is based on the top histogram, where they had zero frequency.) As each
recipient address is at the same location on the x-axis in both histograms, the lower
one does not appear to be sorted, as the order has changed between the training and
testing ranges. This shows how some frequencies drop while others spike between the
two periods.
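
The shared x-axis can be illustrated with a short sketch. The function name `aligned_frequencies` is hypothetical, the input is assumed to be one recipient address per sent message, and the order of ties and of testing-only addresses among themselves (alphabetical here) is an assumption, since the text does not specify it.

```python
from collections import Counter


def aligned_frequencies(training_recipients, testing_recipients):
    """Return (address, training_count, testing_count) rows on a shared x-axis."""
    train = Counter(training_recipients)
    test = Counter(testing_recipients)
    # Order by descending training count; ties are broken alphabetically.
    order = sorted(train, key=lambda r: (-train[r], r))
    # Addresses seen only in the testing range have training count zero,
    # so they end up on the extreme right.
    order += sorted(r for r in test if r not in train)
    # Counter returns 0 for an address that is missing from a range.
    return [(r, train[r], test[r]) for r in order]


rows = aligned_frequencies(
    ["a", "a", "a", "b", "b", "c"],  # training range
    ["c", "c", "d", "a"],            # testing range
)
for row in rows:
    print(row)
# ('a', 3, 1)
# ('b', 2, 0)
# ('c', 1, 2)
# ('d', 0, 1)
```

The third column is not sorted, which mirrors the unsorted look of the lower histogram.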
Finally, at the bottom of the window, a Chi Square statistic is displayed in blue,
together with its p-value and degrees of freedom. The Chi Square is a statistic that
can be used, among other things, to compare two frequency tables. Assuming that the
observed frequencies corresponding to the first, longer time frame window are the true
underlying frequencies, the Chi Square statistic enables us to evaluate how likely the
observed frequencies from the second time frame are to be coming from that same
distribution [18]. The Chi Square formula is:
$$Q = \sum_{i=1}^{k} \frac{\left(X(i) - n\,p(i)\right)^2}{n\,p(i)} \qquad (9)$$

where X(i) is the number of observations for recipient (i) in the testing range,
p(i) is the true frequency calculated from the training range, n is the number of
observations in the testing range, and k is the number of recipients. There are (k-1)
degrees of freedom.
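
As a minimal sketch of this calculation, the function below computes Q and the degrees of freedom from two lists of recipient addresses. The name `chi_square` is hypothetical. One point is an assumption of this sketch rather than something stated in the text: recipients that appear only in the testing range have p(i) = 0, for which the term is undefined, so the sum here runs only over recipients seen in the training range.

```python
from collections import Counter


def chi_square(training_recipients, testing_recipients):
    """Return (Q, degrees_of_freedom) for the testing range against the training range."""
    train = Counter(training_recipients)
    test = Counter(testing_recipients)
    n_train = sum(train.values())
    n = sum(test.values())  # number of observations in the testing range
    q = 0.0
    # Assumption: only recipients with p(i) > 0 (seen in training) contribute.
    for recipient, count in train.items():
        p = count / n_train          # p(i), estimated from the training range
        expected = n * p             # n p(i)
        observed = test[recipient]   # X(i); 0 if absent from the testing range
        q += (observed - expected) ** 2 / expected
    return q, len(train) - 1


# Training: p(a) = 0.75, p(b) = 0.25. Testing: n = 4, X(a) = 2, X(b) = 2.
q, df = chi_square(["a"] * 6 + ["b"] * 2, ["a", "a", "b", "b"])
print(round(q, 4), df)  # 1.3333 1
```

In the example, Q = (2 - 3)^2 / 3 + (2 - 1)^2 / 1 = 4/3 with one degree of freedom. If SciPy is available, the corresponding p-value could be obtained from the Chi Square survival function, for instance `scipy.stats.chi2.sf(q, df)`.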
The p-value is the probability, assuming that the frequencies in both time frames
come from the same multinomial distribution, of observing a Chi Square value at least
as large as the one obtained; a small p-value therefore indicates that the frequencies
have changed. In order to get an idea of the variability
of the frequencies under real conditions, we used a sample of 37,556 emails from 8
users. We ran two batches of calculations. First, we used a training period size of 400
emails and a testing period size of 100 emails; for each user, we started at the first
record, calculated the p-value, then translated the two windows by steps of 10 records
until the end of the log was reached, each time calculating the p-value. Secondly, we
reproduced the same experiment, but with a training period size of 800 emails, and a