of how many weeks of OSN and ILI data should be used to predict the ILI activity in
the current week. Within the ranges examined, m = 0 or n = 0 represents a model with
no CDC data terms, y, or no OSN data terms, u, respectively. Also, if m = 0 and n = 1, we
have a linear regression between OSN data and CDC data. If n = 0, we have standard
auto-regressive (AR) models. Since the AR models utilize past CDC data, they serve
as baselines to validate whether OSN data provides additional predictive power beyond
historical CDC data.
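The role of m and n can be illustrated with a minimal sketch that lists the regressor terms for week t. It assumes the logit transform of the CDC data and the log transform of the OSN data that appear in Eqs. (2) and (3); the function name `regressors` and the dictionary-based storage are hypothetical, and any constant or noise term in Eq. (1) is left out.

```python
import math

def regressors(y, u, m, n, t):
    """Regressor values for week t (hypothetical helper, not the authors' code).

    y: dict mapping week -> CDC ILI value as a fraction in (0, 1)
    u: dict mapping week -> positive OSN activity value
    m: number of past CDC terms, n: number of OSN terms
    """
    # m logit-transformed CDC terms: y(t-1), ..., y(t-m)
    cdc_terms = [math.log(y[t - i] / (1.0 - y[t - i])) for i in range(1, m + 1)]
    # n log-transformed OSN terms: u(t), ..., u(t-n+1)
    osn_terms = [math.log(u[t - j]) for j in range(n)]
    return cdc_terms + osn_terms
```

With m = 0 and n = 1 the list contains only log u(t), which corresponds to the regression between OSN data and CDC data; with n = 0 it contains only past CDC terms, which corresponds to the AR baseline.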
Prediction with Logistic ARX Model. To predict the flu cases in week t using the
Logistic ARX model in Eq. (1) based on the CDC data with 2 weeks of delay and/or
the up-to-date OSN data, we apply the following relationship:
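$$
\log\frac{\hat{y}(t)}{1-\hat{y}(t)} = a_1 \log\frac{\hat{y}(t-1)}{1-\hat{y}(t-1)} + \sum_{i=2}^{m} a_i \log\frac{y(t-i)}{1-y(t-i)} + \sum_{j=0}^{n-1} b_j \log\bigl(u(t-j)\bigr) \tag{2}
$$

$$
\log\frac{\hat{y}(t-1)}{1-\hat{y}(t-1)} = \sum_{i=1}^{m} a_i \log\frac{y(t-i-1)}{1-y(t-i-1)} + \sum_{j=0}^{n-1} b_j \log\bigl(u(t-j-1)\bigr) \tag{3}
$$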
where ŷ(t) represents the predicted CDC data in week t. It can be verified from the above
equations that to predict the CDC data in week t, the most recent CDC data used is from
week t − 2. If the CDC data lag is more or less than two weeks, the above equations
can be easily adjusted accordingly.
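The two-step prediction in Eqs. (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: the coefficient lists `a` and `b` are taken as already estimated, the CDC values are fractions in (0, 1), no constant term is included, and the names `predict_week`, `logit` and `inv_logit` are hypothetical.

```python
import math

def logit(p):
    """Log-odds transform used on the CDC data."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit, mapping a real value back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_week(y, u, a, b, t):
    """Predict the CDC value for week t with a two-week CDC data lag.

    y: dict week -> observed CDC fraction, available up to week t-2
    u: dict week -> positive OSN value, available up to week t
    a: [a_1, ..., a_m], b: [b_0, ..., b_{n-1}] (assumed already fitted)
    """
    m, n = len(a), len(b)
    # OSN terms of Eq. (2): u(t), ..., u(t-n+1)
    z = sum(b[j] * math.log(u[t - j]) for j in range(n))
    if m >= 1:
        # Eq. (3): predicted log-odds for week t-1 from CDC data up to t-2
        z_prev = sum(a[i - 1] * logit(y[t - 1 - i]) for i in range(1, m + 1))
        z_prev += sum(b[j] * math.log(u[t - 1 - j]) for j in range(n))
        # Eq. (2): the week t-1 term uses the prediction, older terms use data
        z += a[0] * z_prev
        z += sum(a[i - 1] * logit(y[t - i]) for i in range(2, m + 1))
    return inv_logit(z)
```

Note that the function never reads y(t − 1) or y(t): the only CDC observations it touches are from week t − 2 and earlier, which is the property stated above.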
5.2 Cross Validation Test Description
Based on the ARX model structure in Eq. (1), we conducted tests using different combinations
of m and n values. We currently have 33 weeks with both Twitter activity and
CDC data available (10/3/2010 to 05/15/2011). Due to the limited number of data samples, we adopted
the K-fold cross validation approach to test the prediction performance of the models.
In a typical K-fold cross validation scheme, the dataset is divided into K (approximately)
equally sized subsets. At each step in the scheme, one such subset is used as the
test set while all other subsets are used as training samples in order to estimate the model
coefficients. Therefore, in a simple case of a 30-sample dataset, 10-fold cross-validation
would involve testing 3 samples in each step, while using the other 27 samples to estimate
the model parameters.
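The generic scheme described above can be sketched in a few lines. The sketch assumes contiguous folds over sample indices, which is one common choice and not necessarily how the subsets were formed in the experiment; the function names are hypothetical.

```python
def k_fold_indices(num_samples, k):
    """Split indices 0..num_samples-1 into k contiguous, near-equal folds."""
    base, extra = divmod(num_samples, k)
    folds, start = [], 0
    for f in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_splits(num_samples, k):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(num_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# The 30-sample, 10-fold case from the text: 3 test and 27 training samples per step
for train, test in train_test_splits(30, 10):
    assert len(test) == 3 and len(train) == 27
```

When the number of samples is not a multiple of K, the fold sizes differ by at most one, which matches the "approximately equally sized" wording.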
In our case, the cross-validation scheme is somewhat complicated by the dependency
of the sample y(t) on the previous samples, y(t − 1), ..., y(t − m) and u(t), ...,
u(t − n + 1) (see Eq. (1)). Therefore, the first sample that can be predicted is y(max(m + 1, n)),
not y(1). In fact, since we are predicting “two weeks ahead” of the available
CDC data, the first sample that can be estimated is actually y(max(m + 2, n + 1)).
Since prediction equations cannot be formed for y(1), ..., y(max(m + 2, n + 1) − 1),
those samples were not considered in any of the K subsets during our experiment to be
 