A Behavior-Based Approach to Securing Email Systems - Computer Network Security

frames. Obviously, recipient frequencies are not constant over a long time horizon, as

users will add new recipients and drop old ones. It can be informative for behavioral

modeling, though, to analyze the variability of frequencies over two nearby time frames.

The window is composed of two histograms and two tables. They are constructed

in the same manner as what is done in the Profile window, but here two time periods

of activity for the same user are compared. The idea is to treat the first period, the

“Training range” at the top, as the true distribution corresponding to the user under

normal behavior, while the second time period, the “Testing range” at the bottom, is

used to evaluate if frequencies have changed, and if any malicious activity is taking

place.

By default, the 200 most recent emails are selected as the Testing range, while the

previous 800 are the Training range, thus operating under the usual 1/5 - 4/5 ratio between

testing and training sets, using the past 1000 messages as total set. These ranges are

modifiable; a live version would use a much shorter testing period in order to provide

fast alerts.
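
The default ranges described above can be sketched as a simple slice over a message log. This is a minimal illustration, assuming the log is a Python list ordered from oldest to newest; the function name `split_ranges` and its parameters are hypothetical and do not come from the tool itself.

```python
def split_ranges(messages, test_size=200, train_size=800):
    """Split a chronologically ordered log into (training, testing) ranges.

    Hypothetical helper: the most recent `test_size` messages form the
    Testing range, and the `train_size` messages before them form the
    Training range.
    """
    if test_size <= 0 or train_size <= 0:
        raise ValueError("range sizes must be positive")
    # Keep only the most recent test_size + train_size messages.
    window = messages[-(test_size + train_size):]
    testing = window[-test_size:]
    training = window[:-test_size]
    return training, testing


# Example with 1500 dummy messages numbered 0..1499 (oldest first).
training, testing = split_ranges(list(range(1500)))
print(len(training), len(testing))  # 800 200
```

A live deployment, as noted above, would presumably pass a much smaller `test_size` so that alerts are raised sooner.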

The scales on the x-axis for both Recipient Frequency histograms are the same, and

are based on the sorted frequencies from the Training range. This implies that addresses

appearing only in the testing range (but not in the training range) are nevertheless

present in both histograms, but with frequency zero in the training range histogram.

Conversely, they have a non-zero frequency in the lower histogram and are

located on the extreme right side. (One can see a jump in frequency on this side, as

the sorting is based on the top histogram, where they had zero frequency.) As each

recipient address is at the same location on the x axis on both histograms, the lower

one does not appear to be sorted, as the order has changed between training and test-

ing ranges. This shows how some frequencies drop while others spike between the

two periods.
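
The shared x-axis described above can be illustrated with a short sketch. It assumes each range is given as a list of recipient addresses (one entry per message sent), uses raw counts rather than normalized frequencies, and breaks ties alphabetically; the function name `aligned_histograms` and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def aligned_histograms(train_recipients, test_recipients):
    """Return (order, train_counts, test_counts) sharing one x-axis.

    The x-axis order is by descending count in the training range, so
    recipients seen only in the testing range (training count zero)
    end up on the extreme right.
    """
    train = Counter(train_recipients)
    test = Counter(test_recipients)
    # Sort by training count (descending); ties broken alphabetically
    # (an assumption, the text does not specify tie handling).
    order = sorted(set(train) | set(test), key=lambda r: (-train[r], r))
    return order, [train[r] for r in order], [test[r] for r in order]


order, top, bottom = aligned_histograms(
    ["a", "a", "a", "b", "b", "c"],  # training range
    ["c", "c", "d", "a"],            # testing range
)
print(order)   # ['a', 'b', 'c', 'd']
print(top)     # [3, 2, 1, 0]
print(bottom)  # [1, 0, 2, 1]
```

In this toy example the lower histogram is no longer sorted, and the recipient `d`, who is absent from the training range, appears at the far right with a non-zero testing count.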

Finally, at the bottom of the window, a Chi Square statistic, together with its p-value and

degrees of freedom, is displayed in blue. The Chi Square is a statistic that can be

used, among other things, to compare two frequency tables. Assuming that the ob-

served frequencies corresponding to the first, longer time frame window are the true

underlying frequencies, the Chi Square statistic enables us to evaluate how likely the

observed frequencies from the second time frame are to be coming from that same

distribution [18]. The Chi Square formula is:

$$Q = \sum_{i=1}^{k} \frac{(X(i) - np(i))^2}{np(i)} \qquad (9)$$

where X(i) is the number of observations for recipient (i) in the testing range, p(i) is the true frequency calculated from the training range, n is the number of observations in the testing range, and k is the number of recipients. There are (k-1) degrees of freedom.
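
As a rough illustration of Eq. (9), the sketch below computes the statistic and its degrees of freedom from two lists of recipient addresses. The function name `chi_square` and the additive-smoothing parameter `alpha` are assumptions introduced here: the text does not say how recipients with zero training frequency are handled, and without some adjustment their expected count would be zero and the corresponding term undefined.

```python
from collections import Counter


def chi_square(train_recipients, test_recipients, alpha=0.5):
    """Return (Q, degrees_of_freedom) following Eq. (9).

    `alpha` is a hypothetical additive-smoothing constant (not from the
    text) that keeps p(i) > 0 for recipients unseen in training. With
    alpha=0 the formula is applied as written, and a recipient with zero
    training frequency raises ZeroDivisionError.
    """
    train = Counter(train_recipients)
    test = Counter(test_recipients)
    recipients = set(train) | set(test)
    k = len(recipients)              # number of recipients
    n = len(test_recipients)         # observations in the testing range
    total = len(train_recipients) + alpha * k
    q = 0.0
    for r in recipients:
        p = (train[r] + alpha) / total   # estimated "true" frequency p(i)
        expected = n * p                 # n p(i)
        q += (test[r] - expected) ** 2 / expected
    return q, k - 1


# Worked example without smoothing: p(a) = 0.75, p(b) = 0.25, n = 4,
# so Q = (2 - 3)^2 / 3 + (2 - 1)^2 / 1 = 4/3 with 1 degree of freedom.
q, dof = chi_square(["a"] * 6 + ["b"] * 2, ["a", "a", "b", "b"], alpha=0)
print(round(q, 4), dof)  # 1.3333 1
```

A p-value can then be read from the upper tail of the chi-square distribution with k-1 degrees of freedom, for example with `scipy.stats.chi2.sf(q, dof)` if SciPy is available.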

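The p-value is the probability, assuming that the frequencies in both time frames

come from the same multinomial distribution, of observing a Chi Square statistic at least as large as the one computed; a small p-value therefore suggests that the frequencies have changed. In order to get an idea of the variability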

of the frequencies under real conditions, we used a sample of 37,556 emails from 8

users. We ran two batches of calculations. First, we used a training period size of 400

emails and a testing period size of 100 emails; for each user, we started at the first

record, calculated the p-value, then translated the two windows by steps of 10 records

until the end of the log was reached, each time calculating the p-value. Secondly, we

reproduced the same experiment, but with a training period size of 800 emails, and a

