Biology Reference
In-Depth Information
to mimic a multivariate authentic dataset, thereby producing a new semiau-
thentic dataset. The authentic dataset and its semiauthentic mimic have the
same statistical structure, yet they differ in their actual daily counts. This
combination means that the resulting semiauthentic datasets are useful for
purposes of research (e.g., algorithm development and evaluation), yet avoid
data disclosure concerns such as privacy and confidentiality.
Another important use of simulated datasets that are “copies” of the same
authentic dataset is for purposes of randomization and Monte Carlo test-
ing. The ability to test an algorithm on multiple versions of the same data
structure helps avoid overfitting and gives more accurate estimates of model
In the following, we describe how the different statistical components of
the authentic data are estimated. These are then used to create the mim-
icked dataset.
Estimating DOW patterns. Given a dataset, the method of ratio to mov-
ing averages (RMA) is used to estimate DOW indices. A vector of
seven indices is created separately for each series, in order to capture
the weekly pattern for that series.
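As a rough illustration of the ratio-to-moving-average idea for a single series, the following minimal sketch divides each day's count by a centered 7-day moving average, averages those ratios by weekday, and rescales the seven averages so that they sum to 7. The function name `dow_indices`, the `first_dow` parameter, and the rescaling step are assumptions made for this example and are not taken from the text.

```python
from statistics import mean


def dow_indices(counts, first_dow=0):
    """Estimate seven day-of-week indices by ratio to moving average.

    counts    -- daily counts for one series (at least 13 days, so that
                 every weekday gets at least one ratio)
    first_dow -- weekday of counts[0], coded 0-6 (hypothetical convention)
    """
    half = 3  # a centered 7-day window spans 3 days on each side
    ratios = [[] for _ in range(7)]
    for t in range(half, len(counts) - half):
        ma = mean(counts[t - half:t + half + 1])
        if ma > 0:  # skip windows with no activity to avoid dividing by zero
            ratios[(first_dow + t) % 7].append(counts[t] / ma)
    raw = [mean(r) for r in ratios]
    scale = 7 / sum(raw)  # assumption: normalize so the indices average to 1
    return [r * scale for r in raw]


# Toy series: weekdays at 10 counts, weekend days at 20 counts.
toy_counts = [10, 10, 10, 10, 10, 20, 20] * 4
toy_indices = dow_indices(toy_counts)
```

In the multivariate setting described above, this function would be called once per series, giving one vector of seven indices for each.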
Estimating seasonality. The data are smoothed using a 7-day moving
average. A smoothing spline is fit to the smoothed data, and this
spline is then evaluated at each daily point. These daily points are
then used as the seasonality components. For more details on the
smoothing spline, see Chambers and Hastie (1992).
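The first stage of this step, the 7-day moving average, can be sketched as follows. The text does not say whether the window is centered or trailing, or how the ends of the series are handled, so the centered window and the shrinking edge windows here are assumptions. The spline fit itself is not shown; it would normally be delegated to a statistics library (SciPy's `UnivariateSpline` is one possible choice, though the text does not name a specific implementation).

```python
def moving_average_7(counts):
    """Centered 7-day moving average of a daily series.

    Near the two ends the window is truncated to the days that exist
    (an assumption; the source does not specify edge handling).
    """
    n = len(counts)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - 3), min(n, t + 4)
        window = counts[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A purely weekly pattern is flattened in the interior of the series.
toy_counts = [10, 10, 10, 10, 10, 20, 20] * 3
toy_smoothed = moving_average_7(toy_counts)
```

Because any seven consecutive days contain each weekday exactly once, the weekly pattern cancels out in the interior, leaving the slower seasonal movement for the spline to capture.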
Estimating holiday effects. Holiday dates are copied from the original
dataset, if any are present. This vector is used to identify which days
should have holiday effects in the mimicked dataset.
Estimating series means. The mean for each simulated series is deter-
mined from the mean of the corresponding original series, exclud-
ing holidays (if any).
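A minimal sketch of the holiday-excluded mean, assuming the holiday vector is stored as one Boolean flag per day (the names `mean_excluding_holidays` and `holiday_flags` are hypothetical):

```python
def mean_excluding_holidays(counts, holiday_flags):
    """Mean of a daily series over non-holiday days only.

    counts        -- daily counts for one series
    holiday_flags -- same length as counts; True marks a holiday
    """
    kept = [c for c, is_holiday in zip(counts, holiday_flags) if not is_holiday]
    if not kept:
        raise ValueError("series contains no non-holiday days")
    return sum(kept) / len(kept)


# The low holiday count (2) does not pull the mean down.
toy_mean = mean_excluding_holidays([10, 12, 2, 14], [False, False, True, False])
```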
Estimating series variances and autocorrelations. To determine the
variance and autocorrelation of the authentic series devoid of seasonal
patterns, we use a Holt-Winters exponential smoother on each series
separately and obtain a series of residuals (actual daily counts minus
predicted daily counts). This residual series should not contain
trends, DOW, or seasonal effects. We then compute the autocorrelation,
variance, and correlation matrix of the residual series, which are
later used as inputs for the simulator.
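The summary statistics computed from the residuals can be sketched as below. The sketch assumes the Holt-Winters residuals have already been obtained elsewhere and are passed in as plain lists, one per series. It also assumes lag-1 autocorrelation and the sample variance, since the text does not state the lag or the estimator. All function names are hypothetical.

```python
from statistics import mean, variance


def lag_autocorr(x, lag=1):
    """Sample autocorrelation of one residual series at the given lag."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, len(x)))
    return num / denom


def correlation(x, y):
    """Pearson correlation between two equal-length residual series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def residual_summary(residuals):
    """Variance, lag-1 autocorrelation, and cross-series correlation matrix.

    residuals -- list of equal-length residual series, one per data series
    """
    return {
        "variance": [variance(r) for r in residuals],
        "lag1_autocorr": [lag_autocorr(r) for r in residuals],
        "correlation": [[correlation(a, b) for b in residuals] for a in residuals],
    }


# Two toy residual series that move in exactly opposite directions.
toy_summary = residual_summary([[1, -1, 1, -1, 1, -1], [-1, 1, -1, 1, -1, 1]])
```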
Some or all of the previously estimated parameters are then fed into the data
simulator, thereby yielding a simulated mimic of the authentic
multivariate data. The original dataset and its mimic share the same
statistical characteristics but are different stochastic realizations (i.e., the counts
are not identical).