69,157 terms were parsed from the 53,733 messages using a master dictionary
of 121,393 terms created by the General Text Parser (GTP) software
environment (in C++) maintained at the University of Tennessee (17). This
larger set of terms was previously obtained when GTP was used to parse
289,695 of the 517,431 emails defining the Cohen distribution at CMU (see
Section 7.1). To be accepted into the dictionary, a term had to occur in more
than one email and more than 10 times among the 289,695 emails.
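The acceptance rule above can be illustrated with a minimal sketch. It is not GTP's implementation: the function name `build_dictionary`, its parameters, and the assumption that each email has already been split into a list of term strings are all hypothetical choices made for illustration.

```python
from collections import Counter


def build_dictionary(emails, min_emails=2, min_occurrences=11):
    """Return the set of terms that pass the acceptance rule.

    `emails` is an iterable of token lists (tokenization is assumed to
    have happened already). A term is kept if it occurs in more than one
    email (at least `min_emails`) and more than 10 times overall
    (at least `min_occurrences`).
    """
    total_count = Counter()   # occurrences of each term over all emails
    email_count = Counter()   # number of distinct emails containing each term
    for tokens in emails:
        total_count.update(tokens)
        email_count.update(set(tokens))
    return {
        term
        for term in total_count
        if email_count[term] >= min_emails
        and total_count[term] >= min_occurrences
    }
```

For example, a term appearing 20 times in a single email is rejected, while a term appearing 11 times across two emails is accepted.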
The four-way data correspond to a sparse array $\mathcal{Y}$ of size
$39573 \times 197 \times 197 \times 357$ with 639,179 nonzeros. The 39,573 terms were parsed from the
email messages in the same manner as for the three-way data. There are fewer
terms because we are restricting the set of messages to be only those between
the same 197 individuals. In the three-way set, there are more messages
because many are sent to individuals outside of the set of 197.
We scaled the nonzero entries of $\mathcal{X}$ and $\mathcal{Y}$ according to a weighted frequency:

$$x_{ijk} = w_{ijk} \, g_i \, a_j,$$
$$y_{ijkl} = w_{ijkl} \, g_i \, a_j \, r_k,$$

where $w_{ijkl}$ is the local weight for term $i$ sent to recipient $k$ by author $j$ in day
$l$, $g_i$ is the global weight for term $i$, $a_j$ is an author normalization factor, and
$r_k$ is a recipient normalization factor. While some scaling and normalization
are necessary to properly balance the arrays, many schemes are possible.
For the three-way data, we used the scaling from a previous study in (5)
for consistency. Let $f_{ijk}$ be the number of times term $i$ is written by author $j$
in day $k$, and define $h_{ij} = \sum_{k} f_{ijk} \big/ \sum_{j,k} f_{ijk}$. The specific components of each nonzero
are listed below:

$$
\begin{aligned}
w_{ijk} &= \log(1 + f_{ijk}) && \text{(log local weight)} \\
g_i &= 1 + \sum_{j=1}^{n} \frac{h_{ij} \log h_{ij}}{\log n} && \text{(entropy global weight)} \\
a_j &= \frac{1}{\sqrt{\sum_{i,k} \left( w_{ijk} \, g_i \right)^2}} && \text{(author normalization)}
\end{aligned}
$$
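The three-way scaling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nonzero counts are held in a dictionary keyed by (term, author, day), natural logarithms are used, $n$ is taken to be the number of authors (the range of the sum over $j$), and the author normalization is taken to be the reciprocal Euclidean norm of the weighted entries for that author. The function name `scale_three_way` is hypothetical.

```python
import math
from collections import defaultdict


def scale_three_way(f, n_authors):
    """Return scaled nonzeros x_ijk = w_ijk * g_i * a_j.

    `f` maps (i, j, k) -> positive count f_ijk; `n_authors` must exceed 1
    so that log(n) is nonzero.
    """
    # Row sums needed for h_ij = sum_k f_ijk / sum_{j,k} f_ijk.
    term_author = defaultdict(float)
    term_total = defaultdict(float)
    for (i, j, k), count in f.items():
        term_author[(i, j)] += count
        term_total[i] += count

    # Entropy global weight: g_i = 1 + sum_j h_ij log(h_ij) / log(n).
    g = {i: 1.0 for i in term_total}
    for (i, j), s in term_author.items():
        h = s / term_total[i]
        g[i] += h * math.log(h) / math.log(n_authors)

    # Log local weight times global weight: w_ijk * g_i.
    wg = {key: math.log(1.0 + count) * g[key[0]] for key, count in f.items()}

    # Author normalization (assumed): a_j = 1 / sqrt(sum_{i,k} (w_ijk g_i)^2).
    norm_sq = defaultdict(float)
    for (i, j, k), value in wg.items():
        norm_sq[j] += value * value

    x = {}
    for (i, j, k), value in wg.items():
        # Guard against an author whose weighted entries are all zero.
        x[(i, j, k)] = value / math.sqrt(norm_sq[j]) if norm_sq[j] > 0 else 0.0
    return x
```

Under this scheme a term used by a single author gets global weight 1, a term spread evenly over all authors gets global weight 0, and each author's scaled entries have unit Euclidean norm.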
For the four-way data, we followed a different scheme. Let $f_{ijkl}$ be the
number of times term $i$ is sent to recipient $k$ by author $j$ in day $l$. Define the
entropy of term $i$ by

$$e_i = \sum_{j,k,l} f_{ijkl} \log f_{ijkl}.$$
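As a small illustration of this definition, the sum can be accumulated directly over the nonzero counts. The sketch assumes the counts are stored in a dictionary keyed by (term, author, recipient, day) and that the logarithm is natural; neither detail is specified in the text, and the name `term_entropy` is hypothetical.

```python
import math
from collections import defaultdict


def term_entropy(f):
    """Return e_i = sum over (j, k, l) of f_ijkl * log(f_ijkl) for each term i.

    `f` maps (i, j, k, l) -> positive count f_ijkl. Zero counts are simply
    absent from the mapping, so they contribute nothing to the sum.
    """
    e = defaultdict(float)
    for (i, j, k, l), count in f.items():
        e[i] += count * math.log(count)
    return dict(e)
```

Entries with a count of 1 contribute zero, so a term that never repeats within any (author, recipient, day) triple has $e_i = 0$.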
The specific components of each nonzero are listed below: