Simulating and Evaluating Biosurveillance Datasets - Biosurveillance: Methods and Case Studies

to mimic a multivariate authentic dataset, thereby producing a new semiauthentic dataset. The authentic dataset and its semiauthentic mimic have the same statistical structure, yet they differ in their actual daily counts. This combination means that the resulting semiauthentic datasets are useful for purposes of research (e.g., algorithm development and evaluation), yet avoid data disclosure concerns such as privacy and confidentiality.

Another important use of simulated datasets that are “copies” of the same authentic dataset is for purposes of randomization and Monte Carlo testing. The ability to test an algorithm on multiple versions of the same data structure helps avoid overfitting and gives more accurate estimates of model performance.

In the following, we describe how the different statistical components of the authentic data are estimated. These are then used to create the mimicked dataset.

Estimating DOW patterns. Given a dataset, the method of ratio to moving averages (RMA) is used to estimate DOW indices. A vector of seven indices is created separately for each series, in order to capture the weekly pattern for that series.
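As a rough illustration of the ratio-to-moving-average idea, the sketch below estimates seven DOW indices for a single series. It is not the authors' implementation: the function name `dow_indices`, the `first_dow` convention, the use of a centered 7-day window, the averaging of ratios by their mean, and the rescaling of the indices so that they average to 1 are all assumptions made for the example.

```python
from statistics import mean


def dow_indices(counts, first_dow=0, period=7):
    """Estimate day-of-week indices by ratio to moving average (sketch).

    counts    -- daily counts for one series
    first_dow -- position (0..6) of counts[0] within the week (assumed convention)
    """
    half = period // 2
    ratios = [[] for _ in range(period)]
    # Ratio of each daily count to its centered 7-day moving average.
    for t in range(half, len(counts) - half):
        window = counts[t - half:t + half + 1]
        moving_avg = sum(window) / period
        if moving_avg > 0:  # skip windows with no activity
            ratios[(first_dow + t) % period].append(counts[t] / moving_avg)
    if any(not r for r in ratios):
        raise ValueError("not enough data to estimate every day of the week")
    # Average the ratios per weekday, then rescale so the indices average to 1.
    raw = [mean(r) for r in ratios]
    scale = period / sum(raw)
    return [value * scale for value in raw]
```

Calling `dow_indices` once per series gives the separate vector of seven indices that the text describes. A median could be substituted for the mean if outlying days are a concern.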

Estimating seasonality. The data are smoothed using a 7-day moving average. A smoothing spline is fit to the smoothed data, and this spline is then evaluated at each daily point. These daily points are then used as the seasonality components. For more details on the smoothing spline, see Chambers and Hastie (1992).

Estimating holiday effects. Holiday dates are copied from the original dataset, if any are present. This vector is used to identify which days should have holiday effects in the mimicked dataset.

Estimating series means. The mean for each simulated series is determined from the mean of the corresponding original series, excluding holidays (if any).

Estimating series variances and autocorrelations. To determine the variance and autocorrelation of the authentic series devoid of seasonal patterns, we use a Holt-Winters exponential smoother on each series separately and obtain a series of residuals (actual daily counts minus predicted daily counts). This residual series should not contain trends, DOW, or seasonal effects. We then compute the autocorrelation, variance, and correlation matrix of the residual series, which are later used as input for the simulator.
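The residual step can be sketched as follows, under stated assumptions. The text does not say which Holt-Winters variant or smoothing constants were used, so the additive form with a 7-day period, the initialization from the first two weeks, and the default values of `alpha`, `beta`, and `gamma` below are illustrative only, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def holt_winters_residuals(y, period=7, alpha=0.4, beta=0.05, gamma=0.15):
    """One-step-ahead residuals (actual minus predicted) from an additive
    Holt-Winters smoother. Smoothing constants are illustrative assumptions."""
    if len(y) < 2 * period:
        raise ValueError("need at least two full periods of data")
    # Simple initialization from the first two weeks.
    level = mean(y[:period])
    trend = (mean(y[period:2 * period]) - level) / period
    season = [y[i] - level for i in range(period)]
    residuals = []
    for t in range(period, len(y)):
        s = season[t % period]
        residuals.append(y[t] - (level + trend + s))  # actual minus predicted
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return residuals


def autocorrelation(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(len(x) - lag))
    return num / denom


def correlation_matrix(series):
    """Pearson correlations between equal-length residual series."""
    def corr(a, b):
        ma, mb = mean(a), mean(b)
        cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
        var_a = sum((u - ma) ** 2 for u in a)
        var_b = sum((v - mb) ** 2 for v in b)
        return cov / sqrt(var_a * var_b)
    return [[corr(a, b) for b in series] for a in series]
```

For each series, `pvariance(holt_winters_residuals(y))` and `autocorrelation(...)` would give the per-series variance and autocorrelation, and `correlation_matrix` applied to the list of residual series would give the cross-series correlations. In practice the smoothing constants would normally be fitted to the data rather than fixed in advance.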

Some or all of the previously estimated parameters are then fed into the data simulator, thereby yielding a simulated mimicked version of the authentic multivariate data. The original dataset and its mimic share the same statistical characteristics but are different stochastic realizations (i.e., the counts are not identical).
