The Role of Data Aggregation in Public Health and Food Safety Surveillance - Biosurveillance: Methods and Case Studies

for very fast computations of estimates of all conceivable probabilities that

can be derived from the data at hand. For instance, to compute today's

estimate of the conditional probability of seeing a report of bloody stools

given the young age of the triaged patient, and the recent admissions data,

one needs to aggregate current counts from all cells at the intersection of

symptom = "bloody stools" and patients_age = "child" and divide the result

by the sum of the recent counts of patient visits retrieved from all cells

matching patients_age = "child". The contingency table approach is very

useful in facilitating data-intensive statistical mining. In a nutshell, it is

more comprehensive than the standard database query caching strategy

mentioned above in that it pays equal attention to precomputing the numbers

needed to derive answers to all possible queries. Unfortunately, contingency

tables may consume large amounts of memory, and they become

infeasible to use in practice when the number of dimensions of data is

not trivially small, and when the number of unique combinations of their

values becomes very large.
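
The conditional probability estimate described above can be sketched in a few lines of Python. This is only an illustration: the records, the `table` layout, and the `conditional_probability` helper are assumptions made for the example, not part of the system discussed in the text.

```python
from collections import Counter

# Hypothetical recent triage records as (symptom, patients_age) pairs.
# The values are invented for illustration only.
records = [
    ("bloody stools", "child"),
    ("fever", "child"),
    ("bloody stools", "adult"),
    ("cough", "child"),
    ("bloody stools", "child"),
    ("fever", "adult"),
]

# A two-dimensional contingency table: one count per (symptom, age) cell.
table = Counter(records)


def conditional_probability(table, symptom, age):
    """Estimate P(symptom | patients_age) from contingency table cells."""
    # Numerator: cells at the intersection of the symptom and the age group.
    joint = sum(n for (s, a), n in table.items() if s == symptom and a == age)
    # Denominator: all cells matching the age group.
    marginal = sum(n for (s, a), n in table.items() if a == age)
    return joint / marginal if marginal else float("nan")


p = conditional_probability(table, "bloody stools", "child")
print(p)
```

With more dimensions, the same table needs one cell per combination of attribute values, which is the memory growth the paragraph above warns about.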

T-Cube is an alternative data cache structure that addresses that challenge

by memorizing a limited and controllable amount of information about data,

which is sufficient for either direct retrieval or for rapid reconstruction of

all conceivable aggregations (Sabhnani et al. 2007). It extends the idea of

AD-tree (Moore and Lee 1998) to represent time series of counts. AD-tree is

an in-memory data cache that leverages redundancies in the raw event data

to represent it in a compact form. It enables rapid responses to all conceiv-

able queries for counts of occurrences of events, and in that sense, it mimics

the functionality of contingency tables. Its query-response time is indepen-

dent of the number of records in the raw data, and it is typically orders of

magnitude shorter than that attainable with state-of-the-art database systems.

AD-tree achieves that at substantially lower memory requirements than typi-

cally seen from contingency tables in multidimensional multi-valued data

scenarios.

Figure 9.1 conveys the basic idea of T-Cube data representation and query

retrieval applied to a simple public health data set. The top node of the tree

represents the most general query, and it stores the cumulative time series of

counts of all categories of events characterized by gender and two types of

symptoms reported by patients (in this example, we focus on gastrointesti-

nal and respiratory symptoms). Nodes at deeper levels of the tree store time

series corresponding to increasingly more specific queries. Once the T-Cube

is built, time series for any query can be retrieved in time independent of the

number of records in the raw dataset. One example of such a query is “get me

the time series by day for all males reporting with gastrointestinal but with

no respiratory symptoms.” The reply can be produced easily by navigating

to the nodes of the tree that represent exactly the conjunctive components

of the query, and by adding and/or subtracting the time series stored in

them as needed.
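
The add-and-subtract retrieval can be illustrated with a simplified sketch. It uses a flat dictionary of cached time series rather than the tree shown in Figure 9.1, so it is not the T-Cube data structure itself; the event data and the `build_cache` and `cached_series` helpers are assumptions made for the example.

```python
from itertools import combinations

N_DAYS = 3

# Hypothetical events: (day, gender, gastrointestinal, respiratory).
events = [
    (0, "M", True, False),
    (0, "M", True, True),
    (0, "F", True, False),
    (1, "M", True, False),
    (1, "M", False, True),
    (2, "M", True, True),
    (2, "F", False, True),
]


def build_cache(events, n_days):
    """Cache a daily count series for every conjunction of positive facts."""
    cache = {}
    for day, gender, gi, resp in events:
        facts = [("gender", gender)]
        if gi:
            facts.append(("gi", True))
        if resp:
            facts.append(("resp", True))
        # Every subset of an event's facts is a conjunctive query it matches.
        for r in range(len(facts) + 1):
            for combo in combinations(facts, r):
                series = cache.setdefault(frozenset(combo), [0] * n_days)
                series[day] += 1
    return cache


def cached_series(cache, n_days, **conditions):
    """Look up the stored series for a conjunction; zeros if never seen."""
    return cache.get(frozenset(conditions.items()), [0] * n_days)


cache = build_cache(events, N_DAYS)

# "Males with gastrointestinal but no respiratory symptoms" is obtained as
# (male AND gi) minus (male AND gi AND resp), without rescanning the events.
male_gi = cached_series(cache, N_DAYS, gender="M", gi=True)
male_gi_resp = cached_series(cache, N_DAYS, gender="M", gi=True, resp=True)
male_gi_only = [a - b for a, b in zip(male_gi, male_gi_resp)]
print(male_gi_only)
```

Once the cache is built, answering the query touches only two stored series, so its cost depends on the length of the time series and not on the number of raw records.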
