for very fast computations of estimates of all conceivable probabilities that
can be derived from the data at hand. For instance, to compute today's
estimate of the conditional probability of seeing a report of bloody stools
given the young age of the triaged patient, and the recent admissions data,
one needs to aggregate current counts from all cells at the intersection of
symptom = "bloody stools" and patients_age = "child" and divide the result
by the sum of the recent counts of patient visits retrieved from all cells
matching patients_age = "child". The contingency table approach is very
useful in facilitating data-intensive statistical mining. In a nutshell, it is
more comprehensive than the standard database query caching strategy
mentioned above in that it pays equal attention to precomputing the numbers
needed to derive answers to all possible queries. Unfortunately, contingency
tables may consume large amounts of memory, and they become
infeasible to use in practice when the number of dimensions of the data is
not trivially small and the number of unique combinations of their
values becomes very large.
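
The conditional-probability computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the visit records, the third dimension (gender), and the helper names total and conditional are hypothetical and do not come from the text.

```python
from collections import Counter

# Hypothetical recent visits: (symptom, patients_age, gender).
DIMS = ("symptom", "patients_age", "gender")
visits = [
    ("bloody stools", "child", "F"),
    ("fever", "child", "M"),
    ("bloody stools", "adult", "M"),
    ("bloody stools", "child", "M"),
    ("cough", "child", "F"),
]

# Contingency table: one cell (count) per unique combination of values.
table = Counter(visits)

def total(table, **conditions):
    """Sum the counts of all cells that match the given attribute values."""
    wanted = {DIMS.index(name): value for name, value in conditions.items()}
    return sum(
        n for cell, n in table.items()
        if all(cell[i] == v for i, v in wanted.items())
    )

def conditional(table, target, given):
    """Estimate P(target | given) as a ratio of aggregated cell counts."""
    denom = total(table, **given)
    return total(table, **target, **given) / denom if denom else float("nan")

p = conditional(table, {"symptom": "bloody stools"}, {"patients_age": "child"})
print(p)
```

Note that the table holds one cell per observed combination of values, which is what makes memory use grow quickly as dimensions are added.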
T-Cube is an alternative data cache structure that addresses that challenge
by memorizing a limited and controllable amount of information about data,
which is sufficient for either direct retrieval or for rapid reconstruction of
all conceivable aggregations (Sabhnani et al. 2007). It extends the idea of
the AD-tree (Moore and Lee 1998) to represent time series of counts. The AD-tree is
an in-memory data cache that leverages redundancies in the raw event data
to represent it in a compact form. It enables rapid responses to all conceivable
queries for counts of occurrences of events, and in that sense it mimics
the functionality of contingency tables. Its query-response time is independent
of the number of records in the raw data, and it is typically orders of
magnitude shorter than that attainable with state-of-the-art database systems.
The AD-tree achieves this with substantially lower memory requirements than are
typically seen with contingency tables in multidimensional, multi-valued data
scenarios.
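
As a rough illustration of how a count cache can exploit redundancy, here is a much simplified Python sketch of the AD-tree idea from Moore and Lee (1998). It keeps only one of their devices: the subtree for the most common value of each attribute is not stored, and its counts are reconstructed by subtraction. The class name ADNode, the function count, and the sample records are assumptions made for this sketch; other optimisations from the original paper are omitted.

```python
from collections import defaultdict

class ADNode:
    """Stores the count of records matching the path from the root (simplified)."""

    def __init__(self, records, start_attr, n_attrs):
        self.count = len(records)
        self.vary = {}  # attribute index -> (most common value, {value: ADNode})
        if not records:
            return
        for attr in range(start_attr, n_attrs):
            groups = defaultdict(list)
            for rec in records:
                groups[rec[attr]].append(rec)
            mcv = max(groups, key=lambda v: len(groups[v]))
            # The subtree of the most common value is deliberately not stored.
            children = {
                v: ADNode(g, attr + 1, n_attrs)
                for v, g in groups.items() if v != mcv
            }
            self.vary[attr] = (mcv, children)

def count(node, query):
    """query: list of (attr_index, value) pairs sorted by attr_index."""
    if not query:
        return node.count
    (attr, value), rest = query[0], query[1:]
    mcv, children = node.vary[attr]
    if value == mcv:
        # Reconstruct the missing subtree: everything minus the stored siblings.
        return count(node, rest) - sum(count(c, rest) for c in children.values())
    child = children.get(value)
    return count(child, rest) if child is not None else 0

# Hypothetical records: (gender, symptom).
records = [("male", "gi"), ("male", "resp"), ("female", "gi"), ("male", "gi")]
tree = ADNode(records, 0, 2)
print(count(tree, [(0, "male"), (1, "gi")]))
```

Once the tree is built, answering a query touches only tree nodes, never the raw records, so the response time does not depend on how many records were loaded.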
Figure 9.1 conveys the basic idea of T-Cube data representation and query
retrieval applied to a simple public health data set. The top node of the tree
represents the most general query, and it stores the cumulative time series of
counts of all categories of events characterized by gender and two types of
symptoms reported by patients (in this example, we focus on gastrointestinal
and respiratory symptoms). Nodes at deeper levels of the tree store time
series corresponding to increasingly more specific queries. Once the T-Cube
is built, time series for any query can be retrieved in time independent of the
number of records in the raw dataset. One example of such a query is “get me
the time series by day for all males reporting with gastrointestinal but with
no respiratory symptoms.” The reply can be produced easily by navigating
to the nodes of the tree that represent exactly the conjunctive components
of the query and by aggregating and/or subtracting the time series stored in
them as needed.
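
The example query above can be sketched in Python as follows. This is not the T-Cube implementation: a plain dictionary of cached time series stands in for the tree nodes, and the event data, attribute names, and the helper series_for are hypothetical. The sketch only shows how a query with a negated condition can be answered by subtracting one stored series from another.

```python
from itertools import combinations

DAYS = 3
ATTRS = ("gender", "gi", "resp")
# Hypothetical events: (day, gender, has_gi_symptoms, has_resp_symptoms).
events = [
    (0, "M", True, False),
    (0, "M", True, True),
    (1, "F", True, False),
    (1, "M", False, True),
    (2, "M", True, False),
    (2, "M", True, False),
]

# Stand-in for the tree: one daily time series per conjunction of values.
# The empty conjunction plays the role of the top node (all events).
cache = {}
for day, *values in events:
    pairs = list(zip(ATTRS, values))
    for r in range(len(pairs) + 1):
        for combo in combinations(pairs, r):
            series = cache.setdefault(frozenset(combo), [0] * DAYS)
            series[day] += 1

def series_for(**conditions):
    """Look up the stored time series for a conjunctive query."""
    return cache.get(frozenset(conditions.items()), [0] * DAYS)

# Males with gastrointestinal but no respiratory symptoms, by subtraction:
male_gi = series_for(gender="M", gi=True)
male_gi_resp = series_for(gender="M", gi=True, resp=True)
male_gi_no_resp = [a - b for a, b in zip(male_gi, male_gi_resp)]
print(male_gi_no_resp)
```

Because this toy cache stores every conjunction, the same series could also be read directly with resp=False; the subtraction illustrates how a structure that stores fewer nodes can still reconstruct it.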