A Behavior-Based Approach to Securing Email Systems - Computer Network Security

second degree function to the first degree function; then we assign a weight to each

bin so that the larger bins will contribute more to the final distance computation:

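$$D(h_1, h_2) = \sum_{i=0}^{n-1} \mathrm{Weight}(i)\,\bigl(h_1[i] - h_2[i]\bigr) \tag{7}$$

$$\mathrm{Weight}(i) = \frac{h_1[i]}{\sum_{j=0}^{n-1} h_1[j]} \tag{8}$$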

When the distance between the histogram of the selected recent period and that of the longer term profile is larger than a threshold, an alert will be generated to warn the analyst that the behavior “might be abnormal” or is deemed “abnormal”. The alert is also put into the alert log of EMT.
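The weighting and threshold check described above can be sketched roughly as follows. This is a minimal illustration, not EMT's actual code: the function names, the alert-log format, and the threshold value are hypothetical, each bin's weight is assumed to be its share of the long-term profile, and absolute differences are used so that deviations in opposite directions do not cancel (also an assumption of this sketch).

```python
def normalize(hist):
    """Scale a histogram so its bins sum to 1 (all zeros if it is empty)."""
    total = sum(hist)
    return [x / total for x in hist] if total else [0.0] * len(hist)


def weighted_distance(profile, recent):
    """First-degree distance in which larger profile bins contribute more.

    Assumption: the weight of bin i is its share of the long-term profile.
    """
    p = normalize(profile)
    r = normalize(recent)
    weights = p  # p already sums to 1, so it doubles as the weight vector
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, r))


def check_account(account, profile, recent, threshold, alert_log):
    """Append an alert to alert_log and return True if the distance exceeds threshold."""
    d = weighted_distance(profile, recent)
    if d > threshold:
        alert_log.append((account, d, "might be abnormal"))
        return True
    return False


# Long-term profile: mail sent mostly between 9am and 11am.
profile = [0] * 24
profile[9], profile[10] = 10, 10

# Recent period: all mail sent around 3am.
recent = [0] * 24
recent[3] = 5

alerts = []
check_account("user@example.com", profile, recent, threshold=0.2, alert_log=alerts)
print(alerts)
```

A suitable threshold depends on the distance function and on the data; the value 0.2 here is arbitrary.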

2.3.3 Similar Users

User accounts that may behave similarly may be identified by computing the pairwise distances of their histograms (e.g., a set of SPAM accounts may be inferred given a known or suspect SPAM account as a model). Intuitively, most users will have a pattern of use over time, which spamming accounts will likely not follow. (SPAMbots don't sleep or eat and hence may operate at times that are highly unusual.)

The histogram distance functions were modified for this detection task. First, we balance and weigh the information in the histogram representing hourly behavior with the information provided by the histogram representing behavior over different aggregate periods of a day. This is done because hourly behavior may be too fine a level of resolution to find proper groupings of similar accounts. For example, an account that sends most of its email between 9am and 10am should be considered similar to another that sends email between 10am and 11am, but perhaps not to an account that emails at 5pm. Given two histograms, one for a heavy 9am user and another for a heavy 10am user, a straightforward application of any of the histogram distance functions will produce erroneous results, since the two accounts' activity falls into different bins.

Thus, we divide a day into four periods: morning (7am-1pm), afternoon (1pm-7pm), night (7pm-1am), and late night (1am-7am). The final distance computed is the average of the distance of the 24-hour histogram and that of the 4-bin histogram, which is obtained by regrouping the bins in the 24-hour histogram.
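The regrouping and averaging step can be illustrated with a short sketch. It assumes that bin i of the 24-hour histogram covers the hour starting at i o'clock, and it uses a normalized L1 distance as a stand-in for whichever histogram distance function is chosen; the function names are hypothetical.

```python
def regroup_four_periods(hist24):
    """Collapse a 24-bin hourly histogram into the four periods of the day."""
    morning = sum(hist24[7:13])                # 7am-1pm
    afternoon = sum(hist24[13:19])             # 1pm-7pm
    night = sum(hist24[19:24]) + hist24[0]     # 7pm-1am (wraps past midnight)
    late_night = sum(hist24[1:7])              # 1am-7am
    return [morning, afternoon, night, late_night]


def l1_distance(h1, h2):
    """L1 distance between two histograms after normalizing each to sum to 1."""
    t1, t2 = sum(h1) or 1, sum(h2) or 1
    return sum(abs(a / t1 - b / t2) for a, b in zip(h1, h2))


def combined_distance(h1, h2):
    """Average of the 24-bin distance and the 4-bin distance."""
    fine = l1_distance(h1, h2)
    coarse = l1_distance(regroup_four_periods(h1), regroup_four_periods(h2))
    return (fine + coarse) / 2


def one_hour_user(hour, emails=10):
    """Hypothetical account that sends all of its email in a single hour."""
    hist = [0] * 24
    hist[hour] = emails
    return hist


nine, ten, five_pm = one_hour_user(9), one_hour_user(10), one_hour_user(17)
print(l1_distance(nine, ten), l1_distance(nine, five_pm))              # 2.0 2.0
print(combined_distance(nine, ten), combined_distance(nine, five_pm))  # 1.0 2.0
```

On the hourly histogram alone the 9am account is as far from the 10am account as from the 5pm account; averaging in the 4-bin distance pulls the two morning accounts closer together.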

Second, because some of the distance functions require normalizing the histograms before computing the distance, we also take into account the volume of emails. Even with exactly the same distribution after normalization, an account sending 20 emails per day should be considered quite different from an account sending 200 emails per day.
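A small example shows why normalization alone hides volume. The text does not say how EMT folds volume into the distance, so the `volume_gap` term below is purely a hypothetical illustration of one possible volume measure, not the method used by EMT.

```python
def normalized_l1(h1, h2):
    """L1 distance after scaling each histogram to sum to 1."""
    t1, t2 = sum(h1) or 1, sum(h2) or 1
    return sum(abs(a / t1 - b / t2) for a, b in zip(h1, h2))


def volume_gap(h1, h2):
    """Hypothetical volume term: relative difference in total emails, in [0, 1]."""
    v1, v2 = sum(h1), sum(h2)
    largest = max(v1, v2)
    return abs(v1 - v2) / largest if largest else 0.0


# Two accounts with the same shape of activity but very different volume.
light = [0] * 24
light[9] = 20
heavy = [0] * 24
heavy[9] = 200

print(normalized_l1(light, heavy))  # 0.0: the shapes are identical
print(volume_gap(light, heavy))     # 0.9: the volumes are far apart
```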

In addition to finding users similar to one specific user, EMT computes distances pairwise over all user account profiles and clusters sets of accounts according to the similarity of their behavior profiles. To reduce the complexity of this analysis, we use an approximation: we randomly choose some user account profile as a “centroid” base model and then compare all others to this account. Those account profiles that are deemed within a small neighborhood of each other (using their distance to the centroid as the metric) are treated as one clustered group. The cluster so produced and its centroid are then stored and removed, and the process is repeated until all profiles have been assigned to a particular cluster.
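The approximate clustering procedure might look like the following sketch. It simplifies the "small neighborhood" test to "within a fixed radius of the centroid"; the `radius` parameter, the function names, and the use of a normalized L1 distance are all assumptions made for illustration.

```python
import random


def l1_distance(h1, h2):
    """L1 distance between two histograms after normalizing each to sum to 1."""
    t1, t2 = sum(h1) or 1, sum(h2) or 1
    return sum(abs(a / t1 - b / t2) for a, b in zip(h1, h2))


def cluster_profiles(profiles, radius, distance=l1_distance, rng=None):
    """Greedy clustering around randomly chosen centroid profiles.

    profiles maps an account name to its histogram. Returns a list of
    (centroid_name, member_names) pairs; every account lands in exactly one.
    """
    rng = rng or random.Random()
    remaining = dict(profiles)
    clusters = []
    while remaining:
        # Pick a random remaining account as the centroid base model.
        centroid = rng.choice(sorted(remaining))
        base = remaining[centroid]
        # Compare every remaining profile (centroid included) to the centroid.
        members = [name for name, hist in remaining.items()
                   if distance(base, hist) <= radius]
        clusters.append((centroid, members))
        # Store the cluster, remove its members, and repeat on the rest.
        for name in members:
            del remaining[name]
    return clusters


def one_hour_user(hour, emails):
    """Hypothetical account that sends all of its email in a single hour."""
    hist = [0] * 24
    hist[hour] = emails
    return hist


accounts = {
    "a": one_hour_user(9, 10),
    "b": one_hour_user(9, 30),
    "c": one_hour_user(17, 10),
}
print(cluster_profiles(accounts, radius=0.5, rng=random.Random(0)))
```

Each pass compares the remaining profiles only to the chosen centroid rather than to each other, which is what makes this cheaper than a full pairwise comparison. The resulting grouping can depend on which centroids happen to be drawn.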

The histograms described here are stationary models; they represent statistics over time frames. Other, non-stationary account profiles are provided by EMT, where behavior
