69,157 terms were parsed from the 53,733 messages using a master dictionary of 121,393 terms created by the General Text Parser (GTP) software environment (in C++) maintained at the University of Tennessee (17). This larger set of terms was previously obtained when GTP was used to parse 289,695 of the 517,431 emails defining the Cohen distribution at CMU (see Section 7.1). To be accepted into the dictionary, a term had to occur in more than one email and more than 10 times among the 289,695 emails.
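The acceptance rule above can be sketched in a few lines of Python. This is only an illustration of the two thresholds stated in the text, not GTP's implementation: the function name `build_dictionary` and the representation of each email as a list of already-tokenized terms are assumptions made for the example.

```python
from collections import Counter


def build_dictionary(emails):
    """Return the set of terms that pass the acceptance rule.

    `emails` is an iterable of token lists (hypothetical input format;
    tokenization itself is not shown here).
    """
    total_count = Counter()  # occurrences of each term over all emails
    email_count = Counter()  # number of distinct emails containing each term
    for tokens in emails:
        total_count.update(tokens)
        email_count.update(set(tokens))
    # Keep a term only if it occurs in more than one email
    # and more than 10 times overall.
    return {
        term
        for term in total_count
        if email_count[term] > 1 and total_count[term] > 10
    }
```

A term that is very frequent but confined to a single email is rejected, as is a term that appears in several emails but 10 times or fewer in total.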
The four-way data correspond to a sparse array Y of size 39573 × 197 × 197 × 357 with 639,179 nonzeros. The 39,573 terms were parsed from the email messages in the same manner as for the three-way data. There are fewer terms because we are restricting the set of messages to be only those between the same 197 individuals. In the three-way set, there are more messages because many are sent to individuals outside of the set of 197.

We scaled the nonzero entries of X and Y according to a weighted frequency:

$$x_{ijk} = w_{ijk}\, g_i\, a_j, \qquad y_{ijkl} = w_{ijkl}\, g_i\, a_j\, r_k,$$
where $w_{ijkl}$ is the local weight for term $i$ sent to recipient $k$ by author $j$ in day $l$, $g_i$ is the global weight for term $i$, $a_j$ is an author normalization factor, and $r_k$ is a recipient normalization factor. While some scaling and normalization are necessary to properly balance the arrays, many schemes are possible.
For the three-way data, we used the scaling from a previous study in (5) for consistency. Let $f_{ijk}$ be the number of times term $i$ is written by author $j$ in day $k$, and define $h_{ij} = \sum_{k} f_{ijk} / \sum_{j,k} f_{ijk}$. The specific components of each nonzero are listed below:
Log local weight: $w_{ijk} = \log(1 + f_{ijk})$
Entropy global weight: $g_i = 1 + \sum_{j=1}^{n} \frac{h_{ij} \log h_{ij}}{\log n}$
Author normalization: $a_j = 1 / \sqrt{\sum_{i,k} (w_{ijk}\, g_i)^2}$
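As a rough illustration of how the three components combine into $x_{ijk} = w_{ijk} g_i a_j$, the following NumPy sketch works on a small dense count array. Several points are assumptions made for the example: $n$ is taken to be the number of authors, $0 \log 0$ is treated as 0, the author factor is taken to be the reciprocal Euclidean norm of each author's weighted slice, and every term and every author is assumed to have at least one nonzero weighted entry. The function name `scale_three_way` is hypothetical, and a real array of this size would be stored in a sparse format.

```python
import numpy as np


def scale_three_way(f):
    """Scale a dense (terms, authors, days) count array f.

    Returns (x, w, g, a). Illustrative sketch only.
    """
    f = np.asarray(f, dtype=float)
    n = f.shape[1]  # assumption: n is the number of authors

    # Log local weight: w_ijk = log(1 + f_ijk)
    w = np.log1p(f)

    # h_ij = sum_k f_ijk / sum_{j,k} f_ijk
    h = f.sum(axis=2) / f.sum(axis=(1, 2))[:, None]

    # Entropy global weight, using the convention 0 * log 0 = 0
    h_log_h = np.zeros_like(h)
    positive = h > 0
    h_log_h[positive] = h[positive] * np.log(h[positive])
    g = 1.0 + h_log_h.sum(axis=1) / np.log(n)

    # Author normalization: reciprocal 2-norm of each author's slice
    wg = w * g[:, None, None]
    a = 1.0 / np.sqrt((wg ** 2).sum(axis=(0, 2)))

    x = wg * a[None, :, None]
    return x, w, g, a
```

With this choice of $a_j$, each author's slice of the scaled array has unit Euclidean norm, and a term spread evenly over all authors receives a global weight of 0.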
For the four-way data, we followed a different scheme. Let $f_{ijkl}$ be the number of times term $i$ is sent to recipient $k$ by author $j$ in day $l$. Define the entropy of term $i$ by

$$e_i = -\sum_{j,k,l} f_{ijkl} \log f_{ijkl}.$$
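The entropy definition can be evaluated directly on a count array. The sketch below is a minimal NumPy version that assumes a dense (terms, authors, recipients, days) array and the convention that zero counts contribute nothing to the sum; the function name `term_entropy` is hypothetical.

```python
import numpy as np


def term_entropy(f):
    """Compute e_i = -sum_{j,k,l} f_ijkl * log(f_ijkl) for each term i.

    f is a dense (terms, authors, recipients, days) count array.
    Zero counts are skipped (assumed convention: 0 * log 0 = 0).
    """
    f = np.asarray(f, dtype=float)
    f_log_f = np.zeros_like(f)
    positive = f > 0
    f_log_f[positive] = f[positive] * np.log(f[positive])
    return -f_log_f.sum(axis=(1, 2, 3))
```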
The specific components of each nonzero are listed below: