For any connection between the selected user and a given recipient belonging to a
clique, the algorithm implemented in EMT allocates 60% of the email flow from user
to recipient to the clique, and the remaining 40% to the recipient. This number was
chosen to reflect the assumption that if a user and a recipient belong to the same
clique, most of the email flow between the two belongs to the clique.
In some cases, two or more cliques may share the same connection between a user
and a recipient. For example, if A, B and C belong to clique 1, and A, B and D belong
to clique 2, the connection between A and B is shared between the two cliques. In that
case, the 60% allocated to cliques is split evenly between clique 1 and clique 2 when
calculating the frequency from A to B, say. If 100 messages were sent from A
to B, 40 are assigned to B, 30 to clique 1 and 30 to clique 2.
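The split for a single connection can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the EMT implementation; the function name split_flow and its parameters are hypothetical.

```python
def split_flow(messages, shared_cliques, clique_percent=60):
    """Split the messages sent over one user-to-recipient connection.

    messages       -- number of messages sent from user to recipient
    shared_cliques -- names of the cliques containing both user and recipient
    clique_percent -- share of the flow given to cliques (60 in the text)

    Returns (count kept by the recipient, {clique name: count}).
    """
    if not shared_cliques:
        # No common clique: the recipient keeps the whole flow.
        return float(messages), {}
    clique_total = messages * clique_percent / 100
    # The clique share is divided evenly among the cliques on this connection.
    per_clique = clique_total / len(shared_cliques)
    return messages - clique_total, {name: per_clique for name in shared_cliques}


# The A-to-B example: 100 messages, connection shared by two cliques.
to_b, to_cliques = split_flow(100, ["clique 1", "clique 2"])
```

Here to_b is the recipient's share and to_cliques holds the per-clique shares for the A-to-B example.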
Cliques tend to have high ranks in the frequency table, as the number of emails
corresponding to a clique is the aggregate total over a few recipients. For example,
assume that clique 1 = {A, B, C, D}, and that clique 1 shares no connection with other
cliques. If A sent 200 messages to B, 100 to C and 100 to D, the number of messages
allocated to B is 80, to C is 40, to D is 40, and to clique 1 is 240. Thus, the
clique gets a large share of the flow, and this is expected, as cliques model small
groups of tightly connected users with heavy email traffic.
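To see how the aggregation puts a clique at the top of the table, the following sketch builds a ranked frequency table for one user. It is a minimal illustration under the 60% rule described above; the function frequency_table, its arguments and the data layout are assumptions made for this example, not EMT's actual interface.

```python
from collections import defaultdict


def frequency_table(user, sent, cliques, clique_percent=60):
    """Build a ranked frequency table for one user.

    user    -- the selected account
    sent    -- {recipient: number of messages sent by user}
    cliques -- {clique name: set of member accounts}

    Returns a list of (entry, count) pairs, highest count first.
    """
    table = defaultdict(float)
    for recipient, count in sent.items():
        # Cliques that contain both ends of this connection.
        shared = [name for name, members in cliques.items()
                  if user in members and recipient in members]
        if not shared:
            table[recipient] += count
            continue
        clique_total = count * clique_percent / 100
        table[recipient] += count - clique_total
        for name in shared:
            # Each clique accumulates its share over all of its recipients.
            table[name] += clique_total / len(shared)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Example from the text: clique 1 = {A, B, C, D}, A sends 200/100/100 messages.
ranked = frequency_table(
    "A",
    {"B": 200, "C": 100, "D": 100},
    {"clique 1": {"A", "B", "C", "D"}},
)
```

With these inputs the clique entry accumulates 120 + 60 + 60 = 240 messages and ranks first, ahead of B with 80.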
2.5.2 Enclave Cliques vs. User Cliques
Conceptually, two types of cliques can be formulated. The one described in the previous
section can be called enclave cliques, because these cliques are inferred by looking
at the email exchange patterns of an enclave of accounts. In this regard, no account is
treated specially, and we are interested in email flow patterns at the enclave level. Any
flow violation or new flow pattern pertains to the entire enclave. On the other hand,
it is possible to look at email traffic patterns from a different viewpoint altogether.
Suppose we focus on a specific account and have access to its outbound
traffic log. As an email can have multiple recipients, these recipients can be viewed
as a clique associated with this account. Since one clique can subsume another,
we define a user clique as one that is not a subset of any other clique. In other
words, the user cliques of an account are its recipient lists that are not subsets of other
recipient lists.
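The definition of user cliques amounts to keeping only the maximal recipient lists of an account. A minimal sketch, assuming the outbound log is available as a list of recipient lists (the function name user_cliques and the sample addresses are hypothetical):

```python
def user_cliques(recipient_lists):
    """Return the recipient sets that are not proper subsets of another set.

    recipient_lists -- iterable of recipient lists, one per outbound email
    """
    # Identical recipient lists collapse into a single candidate set.
    unique = {frozenset(recipients) for recipients in recipient_lists if recipients}
    # "<" on sets tests for a proper subset.
    return [s for s in unique if not any(s < other for other in unique)]


outbound_log = [
    ["B", "C"],
    ["B"],
    ["B", "C", "D"],
    ["E"],
    ["B", "C"],
]
cliques_of_account = user_cliques(outbound_log)
```

In this sample log, {B} and {B, C} are subsumed by {B, C, D}, so the account's user cliques are {B, C, D} and {E}.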
To illustrate the idea of both types of cliques and show how they might be used in
a spam detection task, two simulations are run. In both cases, various attack strategies
are simulated. Detection is attempted based on examining a single attack email.
Final results are based on how well such detection performs statistically.
In the case of enclave cliques, the following simulation is performed. An enclave
of 10 accounts is created, with each account sending 500 emails. Each email has a
recipient list of at most 5 recipients whose actual size follows a Zipf distribution,
where list sizes are ranked so that single-recipient emails have a rank of 1 and
5-recipient emails have a rank of 5. Furthermore, for each account, a random rank is
assigned to its potential recipients, and this rank is constant across all emails sent.
Once the recipient list size of an email is determined, the actual recipients of that
email are generated based on a generalized Zipf distribution with theta = 2. Finally,
a threshold of 50 is used to qualify any pair of accounts to be in the same clique.
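The simulation setup can be sketched as follows. This is an illustrative reconstruction rather than the original experiment code, and it rests on several assumptions the text does not spell out: list sizes use a Zipf exponent of 1, recipients of one email are drawn without replacement, and the threshold of 50 is read as the number of messages exchanged between a pair of accounts in either direction. All function names, account names and the seed are hypothetical.

```python
import random
from collections import Counter


def zipf_weights(n, theta=1.0):
    """Unnormalized Zipf weights for ranks 1..n: weight(rank) = 1 / rank**theta."""
    return [1.0 / rank ** theta for rank in range(1, n + 1)]


def simulate_account(sender, accounts, rng, n_emails=500, max_size=5, theta=2.0):
    """Generate the recipient lists of all emails sent by one account."""
    # Random ranking of potential recipients, fixed for every email of this sender.
    ranked = [a for a in accounts if a != sender]
    rng.shuffle(ranked)
    size_weights = zipf_weights(max_size)  # exponent 1 is an assumption
    emails = []
    for _ in range(n_emails):
        size = rng.choices(range(1, max_size + 1), weights=size_weights)[0]
        pool = list(ranked)
        weights = zipf_weights(len(pool), theta)
        chosen = []
        for _ in range(size):
            # Draw by rank weight without replacement (an assumption).
            i = rng.choices(range(len(pool)), weights=weights)[0]
            chosen.append(pool.pop(i))
            weights.pop(i)
        emails.append(chosen)
    return emails


def simulate_enclave(n_accounts=10, threshold=50, seed=0):
    """Simulate the enclave and return (log, pairs meeting the threshold)."""
    rng = random.Random(seed)
    accounts = [f"acct{i}" for i in range(n_accounts)]
    log = {}
    pair_counts = Counter()
    for sender in accounts:
        log[sender] = simulate_account(sender, accounts, rng)
        for recipients in log[sender]:
            for recipient in recipients:
                pair_counts[frozenset((sender, recipient))] += 1
    # Assumed reading of the threshold: at least 50 messages between the pair.
    qualifying = {pair for pair, n in pair_counts.items() if n >= threshold}
    return log, qualifying


log, qualifying_pairs = simulate_enclave()
```

The qualifying pairs form the edges from which enclave cliques would then be derived; that clique-finding step is not shown here.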
In terms of attack strategies used, 5 different ones are tested. The first is to send to
all potential recipient addresses, one at a time. The second, third and fourth attack