is modeled over sequences of emails irrespective of time. These models are described
next.
2.4 Non-stationary User Profiles
Another type of modeling considers the changing conditions of an email account over
sequences of email transmissions. Most email accounts follow certain trends, which
can be modeled by some underlying distribution. For example, many people email a few
addresses very frequently and many others only infrequently. Day-to-day interaction
with a limited number of peers usually results in some predefined groups of emails
being sent. Contacts reached less often than daily show a more infrequent email
exchange behavior. These patterns can be learned by analyzing a user's email archive
over a bulk set of sequential emails. For some users, 500 emails may span months; for
others, days.
Every user of an email system develops a unique pattern of sending email to a
specific list of recipients, each with its own frequency. Modeling every user's
idiosyncrasies enables the EMT system to detect malicious or anomalous activity in
the account. This is similar to credit card fraud detection, where fraud is flagged
when current behavior violates past behavior patterns.
Figs. 5 and 6 are screenshots of the non-stationary model features in EMT. We will
illustrate the ideas of this model with reference to specific details in the screenshots.
2.4.1 Profile of a User
The Profile tab in Fig. 11 provides a snapshot of the account's activity in terms of
recipient frequency. It contains three charts and one table.
The “Recipient Frequency Histogram” chart is the bar chart in the upper left corner.
It displays the frequency at which the user sends emails to each of the recipients
communicated with in the past. Each point on the x-axis represents one recipient, and
the height of the bar at that point measures the frequency of emails sent to this
recipient, as a percentage.
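A minimal sketch of how such a histogram could be computed, assuming the sent mail is available as a flat list with one recipient address per delivery. The function name and the sample addresses are hypothetical and are not part of EMT:

```python
from collections import Counter


def recipient_frequencies(recipients):
    """Return (recipient, percent) pairs sorted by decreasing frequency.

    `recipients` is assumed to hold one entry per (email, recipient) pair
    sent by the account being profiled.
    """
    counts = Counter(recipients)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() orders by decreasing count, which matches the sorted bar chart
    return [(addr, 100.0 * n / total) for addr, n in counts.most_common()]


# Hypothetical sent-mail log: 10 deliveries to 3 recipients
sent = ["alice@example.com"] * 6 + ["bob@example.com"] * 3 + ["carol@example.com"]
profile = recipient_frequencies(sent)
for addr, pct in profile:
    print(f"{addr:20s} {pct:5.1f}%")
```

Each output row corresponds to one bar of the chart: the position in the list is the recipient's rank on the x-axis, and the percentage is the bar height.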
This bar chart is sorted in decreasing order and usually appears as a smooth convex
curve with strong skewness: a very thin spike at the start on the left side and a long,
low tail on the right side. This frequency bar chart can be modeled with either a Zipf
function or a DGX (Discrete Gaussian Exponential) function, which is a
generalized version of the Zipf distribution. This distribution characterizes some
specific human behavioral patterns, such as word frequencies in written texts or URL
frequencies in Internet browsing [2]. In brief, its main trait is that a few objects receive
a large part of the flow, while many objects receive a very small part of the flow.
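For reference, the DGX distribution is usually written in the following form. The symbols $k$, $\mu$, $\sigma$ and the normalizing constant $A(\mu,\sigma)$ do not appear in this text; they follow the common presentation of the distribution and should be read as assumptions here:

```latex
p(x = k) \;=\; \frac{A(\mu,\sigma)}{k}\,
  \exp\!\left[-\frac{(\ln k - \mu)^{2}}{2\sigma^{2}}\right],
\qquad k = 1, 2, \ldots,
\qquad
A(\mu,\sigma) \;=\; \left\{\sum_{k=1}^{\infty} \frac{1}{k}
  \exp\!\left[-\frac{(\ln k - \mu)^{2}}{2\sigma^{2}}\right]\right\}^{-1}
```

Taking logarithms gives $\ln p(k) = \ln A - \ln k - (\ln k - \mu)^2 / (2\sigma^2)$, which is quadratic in $\ln k$. On logarithmic axes the curve is therefore a downward-opening parabola rather than a straight line.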
The rank-frequency version of Zipf's law states that $f(r) \propto 1/r$, where $f(r)$ is
the occurrence frequency versus the rank $r$, in logarithmic-logarithmic scales. The
generalized Zipf distribution is defined as $f(r) \propto (1/r)^{\theta}$, where the log-log plot
can be linear with any slope. Our tests indicate that the log-log plots are concave, and
thus require the usage of the DGX distribution for a better fit [2].
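The log-log reasoning can be sketched in code. Writing the generalized law as f(r) ∝ (1/r)^θ (θ is the label used here for the exponent), taking logarithms gives log f = c − θ log r, so θ is the negated slope of a straight-line fit on log-log axes. The sketch below uses ordinary least squares for that fit and a simple head-versus-tail slope comparison as a concavity heuristic. Both are illustrative choices, not the fitting procedure of [2] or of EMT, and the function names and the tolerance `tol` are hypothetical:

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def log_log(freqs):
    """Map frequencies sorted in decreasing order to (log rank, log frequency).

    All frequencies must be strictly positive, since log(0) is undefined.
    """
    log_r = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_f = [math.log(f) for f in freqs]
    return log_r, log_f


def zipf_exponent(freqs):
    """Estimate theta in f(r) ∝ (1/r)^theta as the negated log-log slope."""
    log_r, log_f = log_log(freqs)
    _, slope = fit_line(log_r, log_f)
    return -slope


def looks_concave(freqs, tol=0.05):
    """Heuristic: the tail slope is steeper (more negative) than the head slope.

    Needs at least four ranks so that each half has two points to fit.
    `tol` is an arbitrary margin that absorbs noise and rounding error.
    """
    log_r, log_f = log_log(freqs)
    mid = len(freqs) // 2
    _, head = fit_line(log_r[:mid], log_f[:mid])
    _, tail = fit_line(log_r[mid:], log_f[mid:])
    return tail < head - tol


# Synthetic data: an exact power law, and a curve that bends downward on log-log axes
power = [r ** -1.5 for r in range(1, 51)]
curved = [math.exp(-math.log(r) ** 2) for r in range(1, 51)]
print(zipf_exponent(power), looks_concave(power), looks_concave(curved))
```

On a pure power law the head and tail slopes agree, so the heuristic reports no concavity. A profile whose log-log plot bends downward fails the straight-line assumption, which is the situation the text describes as calling for the DGX distribution.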
The “Recipient List” is the table in the upper right corner in Fig. 5. This table is
directly related to the previous histogram. It lists in decreasing order every recipient's