We seek to identify clusters, or groups, of related email accounts that frequently
communicate with each other, and then use this information to identify unusual email
behavior that violates typical group behavior. For example, it is intuitively doubtful
that a user would send the same email message to his spouse, his boss, his “drinking
buddies” and his church elders, all appearing together as recipients of the same
message. A virus attacking his address book would surely not know the social
relationships and typical communication patterns of the victim, and hence would
violate the user's group behavior profile if it propagated itself in violation of the
user's “social cliques”.
Clique violations may also indicate internal email security policy violations. For
example, members of the legal department of a company might be expected to
exchange many Word attachments containing patent applications. It would be highly
unusual if members of the marketing department and HR services likewise received
these attachments. EMT can infer the composition of related groups by analyzing
normal email flows and computing cliques (see Fig. 5), and use the learned cliques
to alert when emails violate clique behavior (see Fig. 7).
EMT implements clique finding using the branch and bound algorithm described in
[0]. We treat each email account as a node, and establish an edge between two nodes
if the number of emails exchanged between them is greater than a user-defined
threshold, which is taken as a parameter (Fig. 7 is displayed with a setting of 100).
The cliques found are the fully connected subgraphs. For every clique, EMT computes
the most frequently occurring words in the subjects of the emails in question, which
often reveals the clique's typical subject matter. (The reader is cautioned not to
confuse the computation of cliques with the maximal clique finding problem, which is
NP-complete. Here we are computing the set of all cliques in an email archive, which
has near-linear time complexity.)
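The graph construction and clique computation described above can be sketched as follows. This is only an illustrative sketch, not EMT's implementation: it enumerates maximal cliques with the basic Bron-Kerbosch recursion rather than the branch and bound algorithm of [0], and the function names (build_graph, maximal_cliques, top_subject_words) and the (sender, recipient, subject) message format are assumptions made for the example.

```python
from collections import Counter


def build_graph(messages, threshold):
    """Build an undirected graph from (sender, recipient, subject) triples.

    An edge joins two accounts when the number of emails exchanged
    between them (in either direction) is greater than `threshold`.
    """
    counts = Counter()
    for sender, recipient, _subject in messages:
        if sender != recipient:
            counts[frozenset((sender, recipient))] += 1
    graph = {}
    for pair, n in counts.items():
        if n > threshold:
            a, b = tuple(pair)
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Yield each maximal clique as a frozenset (Bron-Kerbosch, no pivoting).

    Assumption: this stands in for the branch and bound method of [0].
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(graph), set())


def top_subject_words(messages, clique, k=3):
    """Return the k most frequent subject words among emails inside a clique."""
    words = Counter()
    for sender, recipient, subject in messages:
        if sender in clique and recipient in clique:
            words.update(subject.lower().split())
    return [word for word, _count in words.most_common(k)]
```

Given a list of message triples, build_graph(messages, 100) would correspond to the threshold setting mentioned for Fig. 7, and top_subject_words can then be called once per clique returned by maximal_cliques.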
2.5.1 Chi Square + Cliques
The Chi Square + cliques (CS + cliques) feature in EMT is the same as the Chi
Square window described in section 3.4.2, with the addition of the calculation of
clique frequencies.
In summary, the clique algorithm is based on graph theory. It finds the largest
cliques (groups of users) that are fully connected, with the number of emails per
connection at least equal to the threshold (set at 50 by default). For example, if
clique 1 is a clique of three users A, B and C, then A and B have exchanged at least
50 emails, and likewise B and C, and A and C. The Clique Threshold field can be
changed from this window; doing so recalculates the list of all cliques for the entire
database, and the metrics in the window are readjusted automatically.
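The A, B, C example can be checked with a few lines of code. The sketch below is hypothetical (the is_clique function and the pairwise counts dictionary are not part of EMT); it only illustrates the rule that every pair in a clique must have exchanged at least the threshold number of emails, and how raising the threshold can dissolve a clique.

```python
from itertools import combinations


def is_clique(counts, members, threshold=50):
    """Return True if every pair of members exchanged at least `threshold` emails.

    `counts` maps a frozenset of two accounts to the number of emails
    exchanged between them; missing pairs count as zero.
    """
    return all(
        counts.get(frozenset(pair), 0) >= threshold
        for pair in combinations(members, 2)
    )


# Hypothetical pairwise email counts for three users.
counts = {
    frozenset(("A", "B")): 60,
    frozenset(("B", "C")): 50,
    frozenset(("A", "C")): 75,
}
print(is_clique(counts, ["A", "B", "C"]))                # default threshold of 50
print(is_clique(counts, ["A", "B", "C"], threshold=60))  # stricter threshold
```

With the default threshold the three users form a clique; at a threshold of 60 the B and C pair falls short, so the group is no longer fully connected.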
In this window, each clique is treated as if it were a single recipient, so that each
clique has a frequency associated with it. Only the cliques to which the selected user
belongs will be displayed. Some users don't belong to any clique, and for those, this
window is identical to the normal Chi Square window.
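One way to picture a clique being treated as a single recipient is sketched below. The text does not specify how EMT attributes a message to a clique, so this sketch makes two assumptions: a message counts toward a clique when its recipient list contains every other member of that clique, and individual recipients are still counted separately. The recipient_frequencies function and its arguments are hypothetical.

```python
from collections import Counter


def recipient_frequencies(user, outgoing, cliques):
    """Count recipients of `user`'s emails, with each clique as one recipient.

    `outgoing` is a list of recipient collections, one per sent email.
    `cliques` is a list of frozensets of accounts. Only cliques that
    contain `user` are considered, and they are labelled "clique 1",
    "clique 2", ... in the order given.
    """
    own_cliques = [c for c in cliques if user in c]
    freq = Counter()
    for recipients in outgoing:
        recipients = set(recipients)
        for i, clique in enumerate(own_cliques, start=1):
            # Assumption: the email reaches every other clique member.
            if clique - {user} <= recipients:
                freq[f"clique {i}"] += 1
        for account in recipients:
            freq[account] += 1
    return freq
```

If the user belongs to no clique, own_cliques is empty and the result reduces to plain per-recipient counts, which matches the remark that the window is then identical to the normal Chi Square window.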
If the selected user belongs to one or more cliques, each clique appears under the
name clique i (i = 1, 2, ...) and is displayed in a green cell so that it can be
distinguished from individual email account recipients. (Double-clicking a clique's
green cell opens a pop-up window listing the members of the clique.)