second degree function to the first degree function; then we assign a weight to each
bin so that the larger bins will contribute more to the final distance computation:
$$D_4(h_1, h_2) = \sum_{i=0}^{n-1} w_i \, (h_1[i] - h_2[i]) / \sigma_i \qquad (7)$$

$$w_i = h_1[i] \Big/ \sum_{j=0}^{n-1} h_1[j] \qquad \text{(Weight)} \qquad (8)$$
When the distance between the histogram of the selected recent period and that of
the longer term profile is larger than a threshold, an alert will be generated to warn the
analyst that the behavior “might be abnormal” or is deemed “abnormal”. The alert is
also put into the alert log of EMT.
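The weighted distance and the threshold test described above can be sketched in Python as follows. This is a minimal illustration, not EMT's actual code. It assumes that h_1 is the longer-term profile (the weights are taken from it) and h_2 is the recent period, that the per-bin values sigma_i are supplied by the caller (this section does not say how they are estimated), and that the per-bin difference is taken in absolute value so that positive and negative deviations cannot cancel. The function names, the `eps` guard, and the uniform fallback for an empty profile are all hypothetical.

```python
from typing import List, Sequence


def bin_weights(profile: Sequence[float]) -> List[float]:
    """Eq. (8): each bin's share of the profile's total count."""
    total = sum(profile)
    if total == 0:
        # Assumption: an empty profile falls back to uniform weights.
        return [1.0 / len(profile)] * len(profile)
    return [count / total for count in profile]


def weighted_distance(
    profile: Sequence[float],
    recent: Sequence[float],
    sigma: Sequence[float],
    eps: float = 1e-9,
) -> float:
    """Eq. (7), with the absolute difference per bin (an assumption)."""
    if not (len(profile) == len(recent) == len(sigma)):
        raise ValueError("all histograms must have the same number of bins")
    weights = bin_weights(profile)
    return sum(
        w * abs(p - r) / max(s, eps)  # eps guards against sigma_i == 0
        for w, p, r, s in zip(weights, profile, recent, sigma)
    )


def is_abnormal(
    profile: Sequence[float],
    recent: Sequence[float],
    sigma: Sequence[float],
    threshold: float,
) -> bool:
    """True when the recent period is farther from the profile than the threshold."""
    return weighted_distance(profile, recent, sigma) > threshold
```

Because the weights come from the profile histogram, a deviation in a bin where the account is normally busy counts for more than the same deviation in a normally quiet bin.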
2.3.3 Similar Users
User accounts that behave similarly can be identified by computing the pairwise
distances of their histograms (e.g., a set of SPAM accounts may be inferred given
a known or suspect SPAM account as a model). Intuitively, most users will have a
pattern of use over time, which spamming accounts will likely not follow. (SPAMbots
don't sleep or eat and hence may operate at times that are highly unusual.)
The histogram distance functions were modified for this detection task. First, we
balance and weigh the information in the histogram representing hourly behavior with
the information provided by the histogram representing behavior over different
aggregate periods of a day. This is done because hourly bins may be too fine a
resolution to find proper groupings of similar accounts. For example, an account that
sends most of its email between 9am and 10am should be considered similar to another
that sends most of its email between 10am and 11am, but perhaps not to an account that
emails at 5pm. Given two histograms, one for a heavy 9am user and one for a heavy 10am
user, a straightforward application of any of the histogram distance functions will
produce erroneous results, since a bin-by-bin comparison treats adjacent hours as
unrelated.
Thus, we divide a day into four periods: morning (7am-1pm), afternoon (1pm-
7pm), night (7pm-1am), and late night (1am-7am). The final distance computed is the
average of the distance of the 24-hour histogram and that of the 4-bin histogram,
which is obtained by regrouping the bins in the 24-hour histogram.
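A minimal Python sketch of this regrouping and averaging is given below. It assumes that bin h of the 24-hour histogram covers the hour starting at h:00, so the night period wraps around midnight to include hour 0. The L1 distance is only a stand-in for whichever histogram distance function is in use, and the names `PERIODS`, `regroup`, and `combined_distance` are hypothetical.

```python
from typing import Callable, List, Sequence

# The four periods from the text, as hour indices of the 24-bin histogram.
PERIODS = [
    list(range(7, 13)),         # morning, 7am-1pm
    list(range(13, 19)),        # afternoon, 1pm-7pm
    list(range(19, 24)) + [0],  # night, 7pm-1am (wraps past midnight)
    list(range(1, 7)),          # late night, 1am-7am
]


def regroup(hourly: Sequence[float]) -> List[float]:
    """Collapse a 24-bin hourly histogram into the 4-bin period histogram."""
    if len(hourly) != 24:
        raise ValueError("expected a 24-bin hourly histogram")
    return [sum(hourly[h] for h in hours) for hours in PERIODS]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in distance: sum of absolute bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def combined_distance(
    h1: Sequence[float],
    h2: Sequence[float],
    distance: Callable[[Sequence[float], Sequence[float]], float] = l1_distance,
) -> float:
    """Average of the 24-bin distance and the regrouped 4-bin distance."""
    return 0.5 * (distance(h1, h2) + distance(regroup(h1), regroup(h2)))
```

With this combination, a heavy 9am account and a heavy 10am account differ in the hourly histogram but coincide in the morning bin, so their combined distance is smaller than that between the 9am account and a 5pm account.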
Second, because some of the distance functions require normalizing the histograms
before computing the distance, we also take into account the volume of emails. Even
with exactly the same distribution after normalization, an account that sends 20
emails per day should be considered quite different from one that sends 200 emails
per day.
In addition to finding users similar to one specific user, EMT computes pairwise
distances over all user account profiles and clusters sets of accounts according to
the similarity of their behavior profiles. To reduce the complexity of this analysis,
we use an approximation: we randomly choose some user account profile as a “centroid”
base model and then compare all others to this account. Those account profiles that
are deemed within a small neighborhood of each other (using their distance to the
centroid as the metric) are treated as one clustered group. The cluster so produced
and its centroid are then stored and removed, and the process is repeated until all
profiles have been assigned to a particular cluster.
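The approximate clustering procedure can be sketched as follows. This is an illustration under stated assumptions rather than EMT's implementation: "within a small neighborhood" is read here as "within a fixed radius of the centroid", the distance function is passed in by the caller, and the names `cluster_profiles`, `radius`, and `seed` are hypothetical. Account identifiers are assumed to be strings.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Histogram = Sequence[float]


def cluster_profiles(
    profiles: Dict[str, Histogram],
    distance: Callable[[Histogram, Histogram], float],
    radius: float,
    seed: Optional[int] = None,
) -> List[Tuple[str, List[str]]]:
    """Repeatedly pick a random centroid and peel off the profiles near it.

    Returns a list of (centroid account, member accounts) pairs; every
    account ends up in exactly one cluster.
    """
    rng = random.Random(seed)
    remaining = dict(profiles)  # work on a copy; the input is not modified
    clusters = []
    while remaining:
        centroid_id = rng.choice(sorted(remaining))
        centroid = remaining[centroid_id]
        # The centroid always belongs to its own cluster, so the loop terminates.
        members = [centroid_id] + [
            account
            for account, histogram in remaining.items()
            if account != centroid_id and distance(centroid, histogram) <= radius
        ]
        for account in members:
            del remaining[account]  # stored clusters are removed from the pool
        clusters.append((centroid_id, members))
    return clusters
```

Each pass compares the chosen centroid only against the profiles still unassigned, which avoids computing the full pairwise distance matrix. The result depends on which centroids happen to be drawn, which is the price of the approximation.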
The histograms described here are stationary models; they represent statistics over
fixed time frames. Other non-stationary account profiles are provided by EMT, where behavior