Figures 2.7, 2.8, and 2.9 show the pairs of authentic (black) and mimicked
(gray) series on different temporal scales: daily, weekly, and monthly.
Although the authentic and mimicked datasets appear very similar, they are
far from identical: because the mimicked dataset is generated stochastically,
it is a different realization of the same process. To see how the daily
counts differ between the authentic and mimicked data, see
Figure 2.10, which displays the differences between the daily counts of each
authentic series and its mimicked counterpart. The differences
are on the order of tens of counts. There are also several days
with extreme deviations between the authentic and mimicked series. These
fall mostly on days that are either non-federal holidays (e.g., Christmas Eve
and New Year's Eve) or federal holidays on which it is "business as usual" in
many areas (e.g., Columbus Day). This emphasizes the importance of specifying
all relevant holidays in the particular area where the data are collected
or simulated.
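The day-by-day comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy counts, the start date, and the threshold of 50 are all assumptions made for the example and are not taken from the datasets discussed here.

```python
from datetime import date, timedelta


def daily_differences(authentic, mimicked):
    """Return authentic minus mimicked, day by day."""
    if len(authentic) != len(mimicked):
        raise ValueError("both series must cover the same days")
    return [a - m for a, m in zip(authentic, mimicked)]


def extreme_days(start, diffs, threshold):
    """Return (date, difference) pairs whose absolute difference exceeds threshold."""
    return [
        (start + timedelta(days=i), d)
        for i, d in enumerate(diffs)
        if abs(d) > threshold
    ]


# Toy counts (hypothetical): the fourth day mimics an unlisted holiday
# on which the authentic series drops but the mimicked series does not.
authentic = [120, 131, 118, 40, 125]
mimicked = [115, 127, 124, 119, 121]

diffs = daily_differences(authentic, mimicked)
flagged = extreme_days(date(2005, 12, 21), diffs, threshold=50)
print(diffs)
print(flagged)
```

Dates flagged this way are natural candidates to check against the list of holidays supplied to the simulation.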
2.6.2 Distribution Testing
We now consider the tests of distributional equivalence. The multivariate
nearest-neighbor test gives a raw statistic of 0.536, which after
standardization provides a Z-score of 3.62, with a p-value of 0.000293.
This p-value should be viewed cautiously: with a sample size of n = 1400,
the test is very sensitive to any difference in distribution. Compared with
an earlier simulation method using different DOW variances, which gave a
standardized Z-score of 33.3, this is an improvement. Still, the p-value is
quite low, leading us to consider the univariate χ² tests.
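As a check on the arithmetic, the following sketch converts a Z-score into a p-value. It assumes a two-sided test against a standard normal reference distribution; the text does not state which tail convention was used, and the function name is ours.

```python
import math


def two_sided_p_value(z):
    """Two-sided tail probability of a standard normal at |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))


p_current = two_sided_p_value(3.62)   # Z-score of the current method
p_earlier = two_sided_p_value(33.3)   # Z-score of the earlier method
print(f"{p_current:.6f}")
print(p_earlier)
```

Under this assumption the result for Z = 3.62 is close to the reported 0.000293, with any small discrepancy attributable to rounding of the Z-score, while Z = 33.3 gives a p-value that is zero for all practical purposes.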
When the individual DOW scores are considered for each series, we find
significant deviations in four categories: giMilVisit on Sun (p-value = 0.000915);
giMilVisit on Sat (p-value = 0.000225); giPrescrip on Sun (p-value = 0.000045); and
giCivVisit on Sun (p-value = 0.000060).
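One common form of a univariate χ² comparison between two binned samples is sketched below. It is not necessarily the exact test used here: the sketch assumes that the authentic and mimicked counts for a given series and day of week have been placed into the same bins and that both samples have the same number of observations. The bin counts are invented for illustration.

```python
def two_sample_chi_square(hist_a, hist_b):
    """Chi-square statistic for two binned samples of equal total size.

    Bins that are empty in both samples are skipped.
    Returns (statistic, number_of_bins_used).
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same bins")
    if sum(hist_a) != sum(hist_b):
        raise ValueError("this form of the statistic assumes equal sample sizes")
    statistic = 0.0
    bins_used = 0
    for a, b in zip(hist_a, hist_b):
        if a + b == 0:
            continue
        statistic += (a - b) ** 2 / (a + b)
        bins_used += 1
    return statistic, bins_used


# Hypothetical bin counts for one series on one day of the week.
authentic_bins = [12, 30, 40, 18]
mimicked_bins = [5, 38, 47, 10]

stat, bins_used = two_sample_chi_square(authentic_bins, mimicked_bins)
print(stat, bins_used)
```

With equal sample sizes the statistic is usually referred to a χ² distribution with one fewer degree of freedom than the number of bins used. Note that the toy authentic histogram has more mass in the outer bins, which is the pattern one would expect if the mimics had too little variance.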
Examining individual bin comparisons, we see that the mimics have less
variance on weekends than the original, suggesting that a negative binomial
with increased variance might improve the simulation method. Figure 2.11
shows differences in Sundays for GI Civilian visits.
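To illustrate why a negative binomial can add variance, the sketch below draws counts from a negative binomial expressed as a gamma-Poisson mixture and compares them with Poisson counts of the same mean. The Poisson baseline, the mean of 20, and the dispersion parameter k = 5 are assumptions chosen for the example; the text does not specify these values.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method (fine for moderate lam)."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def negative_binomial_draw(rng, mu, k):
    """Draw one count with mean mu and variance mu + mu**2 / k.

    The Poisson rate is itself drawn from a gamma distribution with
    shape k and scale mu / k, which yields a negative binomial count.
    """
    rate = rng.gammavariate(k, mu / k)
    return poisson_draw(rng, rate)


rng = random.Random(0)   # fixed seed so the comparison is repeatable
mu, k, n = 20.0, 5.0, 5000

poisson_counts = [poisson_draw(rng, mu) for _ in range(n)]
nb_counts = [negative_binomial_draw(rng, mu, k) for _ in range(n)]

print(mean(poisson_counts), variance(poisson_counts))
print(mean(nb_counts), variance(nb_counts))
```

Both samples share roughly the same mean, but the negative binomial counts are far more dispersed; a smaller k gives a larger variance, so k could be tuned separately for weekend days.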
Outbreaks can then be inserted into this simulated dataset, to provide
labeled semiauthentic health data. Such data are necessary in order to apply
many detection or classification algorithms.
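A minimal sketch of how an inserted outbreak yields labeled data follows. The function name, the baseline counts, and the outbreak signature are hypothetical; the only idea taken from the text is that outbreak cases are added to the simulated series and the affected days are recorded as labels.

```python
def insert_outbreak(series, signature, start):
    """Add outbreak counts to a copy of series, beginning at index start.

    Returns (new_series, labels), where labels[i] is 1 on days that
    received at least one outbreak case and 0 otherwise.
    """
    if start < 0 or start + len(signature) > len(series):
        raise ValueError("outbreak does not fit inside the series")
    new_series = list(series)
    labels = [0] * len(series)
    for offset, extra in enumerate(signature):
        new_series[start + offset] += extra
        if extra > 0:
            labels[start + offset] = 1
    return new_series, labels


# Hypothetical mimicked daily counts and outbreak signature.
baseline = [50, 52, 47, 49, 51, 48, 50]
signature = [2, 6, 4, 1]

with_outbreak, labels = insert_outbreak(baseline, signature, start=2)
print(with_outbreak)
print(labels)
```

The label vector is what allows a detection or classification algorithm to be scored, since the true outbreak days are known by construction.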
2.6.3 Outbreak Insertion
In the next step, we simulate an outbreak signature and then insert it into the
mimicked data. For illustration, we simulated a lognormal outbreak signature
with parameters μ = 0, σ = 1, and n_outbreak ~ N(2σ, 2). Figure 2.12 displays