In many practical scenarios, sparse event data can be represented as a network of entities and links. Evidence pertaining to the individual entities can then be shared among their neighbors in the graph. It is shown in Section 5 that such an approach can boost the predictive power of models built from sparse event data in practical applications (Sarkar et al. 2008, Dubrawski et al. 2009b, Dubrawski et al. 2007b).
9.2 Scalable Aggregation of Evidence in Multidimensional Event Data
Many data sets encountered in the practice of biosurveillance have the form of a record of transactions. Each entry in such data typically includes the date/time of an event (such as a pharmacy sales transaction or an admission of a patient to a hospital) and a number of descriptors characterizing it (e.g., the brand name, dose, and quantity of the pharmaceutical sold, or the age, gender, symptoms, and test results of the admitted patient). In order to understand and monitor the processes taking place in environments producing such data, one needs to track frequencies of events of various categories over time. This requires computing numerous different aggregations of the data. For instance, when monitoring hospital records for indications of a possible local outbreak of a gastrointestinal ailment, public health analysts may want to check daily counts of children recently reporting with bloody stools to hospitals in Pittsburgh, and compare these counts against the expectation derived from the analogous numbers observed over, for instance, the past 12 months. The number of possible count queries involving multiple multi-valued attributes that can be asked of such data can be very high. Given the large number of possibilities, answering all possible queries up to a certain level of specificity, as in exhaustive screening scenarios, poses a serious computational challenge.
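As an illustration, the sketch below expresses one such count query over a table of transaction records. The record layout, the column names (date, city, age_group, symptom), and the use of pandas are assumptions made for the example, not details taken from the text.

```python
import pandas as pd

# Hypothetical admission records: one row per event, with a date and
# several categorical descriptors (all names are illustrative).
records = pd.DataFrame({
    "date":      ["2009-03-01", "2009-03-01", "2009-03-02", "2009-03-02"],
    "city":      ["Pittsburgh", "Pittsburgh", "Pittsburgh", "Cleveland"],
    "age_group": ["child", "adult", "child", "child"],
    "symptom":   ["bloody stool", "fever", "bloody stool", "bloody stool"],
})

# One example count query: daily counts of children reporting with
# bloody stools to hospitals in Pittsburgh.
mask = (
    (records["city"] == "Pittsburgh")
    & (records["age_group"] == "child")
    & (records["symptom"] == "bloody stool")
)
daily_counts = records[mask].groupby("date").size()
print(daily_counts)
```

Each distinct choice of attributes and values defines another such query, which is what makes the space of possible aggregations so large.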
One way of addressing that challenge is to precompute all counts of interest ahead of the time of analysis, for instance, upon loading the data. Database administrators do similar things routinely. They monitor the frequencies of queries issued by the database users and cache the responses to the most popular ones in the server's operating memory, so that they are handy when the next predictable request comes along. This approach lessens the burden on the database system and often significantly boosts its throughput. Similarly, statisticians routinely construct contingency tables to represent distributions of multivariate discrete data. The cells in these tables store counts of events corresponding to all unique combinations of values of all involved dimensions. These counts, as soon as they are extracted from the raw data and stored in the table, are readily available.
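A minimal sketch of such precomputation, under the same assumptions as the previous example: counts are tabulated at load time for every combination of values of every subset of the chosen attributes, so that later count queries reduce to dictionary lookups. The function name precompute_counts and the enumeration of attribute subsets with itertools are illustrative choices, not the specific method described here.

```python
from itertools import combinations
import pandas as pd

def precompute_counts(records, attributes):
    """Cache counts for every value combination of every attribute subset."""
    cache = {}
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            counts = records.groupby(list(subset)).size()
            for values, count in counts.items():
                key = values if isinstance(values, tuple) else (values,)
                cache[subset, key] = count
    return cache

# Hypothetical admission records, as in the earlier sketch.
records = pd.DataFrame({
    "city":      ["Pittsburgh", "Pittsburgh", "Pittsburgh", "Cleveland"],
    "age_group": ["child", "adult", "child", "child"],
    "symptom":   ["bloody stool", "fever", "bloody stool", "bloody stool"],
})
cache = precompute_counts(records, ["city", "age_group", "symptom"])

# A later count query becomes a dictionary lookup:
print(cache[("city", "age_group"), ("Pittsburgh", "child")])  # -> 2
```

The memory cost of storing all such counts grows quickly with the number of attributes and their cardinalities, which is why scalable structures for this kind of caching matter.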