In many practical scenarios, sparse event data can be represented as a network of entities and links. Evidence pertaining to the individual entities can then be shared among their neighbors in the graph. It is shown in Section 5 that such an approach can boost the predictive power of models built from sparse event data in practical applications (Sarkar et al. 2008, Dubrawski et al. 2009b, Dubrawski et al. 2007b).
9.2 Scalable Aggregation of Evidence in Multidimensional Event Data
Many data sets encountered in the practice of biosurveillance have the form of a record of transactions. Each entry in such data typically includes the date/time of an event (such as a pharmacy sales transaction or an admission of a patient to a hospital) and a number of descriptors characterizing it (e.g., the brand name, dose, and quantity of the pharmaceutical sold, or the age, gender, symptoms, and test results of the admitted patient). In order to understand and monitor the processes taking place in environments producing such data, one needs to track frequencies of events of various categories over time. This requires computing numerous different aggregations of the data. For instance, when monitoring hospital records for indications of a possible local outbreak of a gastrointestinal ailment, public health analysts may want to check daily counts of children recently reporting with bloody stools to hospitals in Pittsburgh, and compare these counts against the expectation derived from the analogous numbers observed over, for instance, the past 12 months. The number of possible count queries involving multiple multi-valued attributes that can be asked of such data can be very high. Given the large number of possibilities, answering all possible queries up to a certain level of specificity, as in exhaustive screening scenarios, poses a serious computational challenge.
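As an illustration, the sketch below expresses one such count query over a table of transaction records. The record layout, the column names (date, city, age_group, symptom), and the use of pandas are assumptions made for the example, not details taken from the text.

```python
import pandas as pd

# Hypothetical admission records: one row per event, with a date and
# several categorical descriptors (all names are illustrative).
records = pd.DataFrame({
    "date":      ["2009-03-01", "2009-03-01", "2009-03-02", "2009-03-02"],
    "city":      ["Pittsburgh", "Pittsburgh", "Pittsburgh", "Cleveland"],
    "age_group": ["child", "adult", "child", "child"],
    "symptom":   ["bloody stool", "fever", "bloody stool", "bloody stool"],
})

# One example count query: daily counts of children reporting with
# bloody stools to hospitals in Pittsburgh.
mask = (
    (records["city"] == "Pittsburgh")
    & (records["age_group"] == "child")
    & (records["symptom"] == "bloody stool")
)
daily_counts = records[mask].groupby("date").size()
print(daily_counts)
```

Each distinct choice of attributes and values defines another such query, which is what makes the space of possible aggregations so large.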
One way of addressing that challenge is to precompute all counts of interest ahead of the time of analysis, for instance, upon loading the data. Database administrators do similar things routinely. They monitor the frequencies of queries issued by the database users and cache the responses to the most popular ones in the server's operating memory, so that they are handy when the next predictable request comes along. This approach lessens the burden on the database system and often significantly boosts its throughput. Similarly, statisticians routinely construct contingency tables to represent distributions of multivariate discrete data. The cells in these tables store counts of events corresponding to all unique combinations of values of all involved dimensions. These counts, as soon as they are extracted from the raw data and stored in the table, are readily available.
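A minimal sketch of such precomputation, under the same assumptions as the previous example: counts are tabulated at load time for every combination of values of every subset of the chosen attributes, so that later count queries reduce to dictionary lookups. The function name precompute_counts and the enumeration of attribute subsets with itertools are illustrative choices, not the specific method described here.

```python
from itertools import combinations
import pandas as pd

def precompute_counts(records, attributes):
    """Cache counts for every value combination of every attribute subset."""
    cache = {}
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            counts = records.groupby(list(subset)).size()
            for values, count in counts.items():
                key = values if isinstance(values, tuple) else (values,)
                cache[subset, key] = count
    return cache

# Hypothetical admission records, as in the earlier sketch.
records = pd.DataFrame({
    "city":      ["Pittsburgh", "Pittsburgh", "Pittsburgh", "Cleveland"],
    "age_group": ["child", "adult", "child", "child"],
    "symptom":   ["bloody stool", "fever", "bloody stool", "bloody stool"],
})
cache = precompute_counts(records, ["city", "age_group", "symptom"])

# A later count query becomes a dictionary lookup:
print(cache[("city", "age_group"), ("Pittsburgh", "child")])  # -> 2
```

The memory cost of storing all such counts grows quickly with the number of attributes and their cardinalities, which is why scalable structures for this kind of caching matter.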