Biomedical Engineering Reference
In-Depth Information
evaluated for prediction performance. However, they were still used in the training set
to estimate the values of the coefficients $a_i$ and $b_j$ in Eq. (1).
Considering the above constraints, our K-fold validation testing procedure is as follows:
1. For each $(m, n)$ pair with $m = 0, 1, 2$ and $n = 0, 1, 2, 3$, repeat the following:
(a) Identify $F$, the index of the first data sample that can actually be predicted:
$F = \max(m + 1, n)$.
(b) Represent the available data indices as $t = 1, \ldots, T$. Then divide the dataset
into $K$ approximately equally sized subsets $\{S_1, S_2, \ldots, S_K\}$, with each subset
comprising members that are separated by an approximately equal time interval. For example,
the first set would be $S_1 = \{y(F), y(F+K), y(F+2K), \ldots\}$, the second would be
$S_2 = \{y(F+1), y(F+K+1), y(F+2K+1), \ldots\}$, and so on.
(c) For each $S_k$, $k = 1, \ldots, K$, obtain the values of the model parameters $a_i$ and
$b_j$ using all the other subsets with the least squares estimation technique. Based on
the estimated model parameter values and the associated prediction equations
in Eq. (2), predict the value of each member of $S_k$.
2. For each $(m, n)$ pair, we have obtained a prediction of the CDC time-series, $\hat{y}(t)$,
for $t = F_{mn}, \ldots, T$. Note that $F$ still represents the first time index that can be
predicted. However, we use the subscript $mn$ to emphasize the fact that $F$ varies
depending on the values of $m$ and $n$. By comparing the prediction with the true
CDC data, we calculate the root mean-squared error (RMSE) as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{T - F_{\max} + 1} \sum_{t = F_{\max}}^{T} \bigl( y(t) - \hat{y}(t) \bigr)^2} \qquad (4)$$
The RMSE is computed over $t = F_{\max}, \ldots, T$, regardless of technique and model
order, to ensure a fair comparison.
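The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: Eq. (1) and Eq. (2) are not reproduced in this section, so the sketch assumes a linear ARX form, y(t) = a_1 y(t-1) + ... + a_m y(t-m) + b_0 u(t) + ... + b_{n-1} u(t-n+1), where u(t) denotes the OSN series. That form is consistent with F = max(m+1, n) but remains an assumption. The function names (`interleaved_folds`, `regressors`, `solve_least_squares`, `cross_validate`, `rmse`) are hypothetical, and the least squares step is solved through the normal equations only to keep the example free of external dependencies.

```python
import math


def interleaved_folds(F, T, K):
    """Split time indices F..T (1-based) into K interleaved subsets,
    S_k = {F+k-1, F+k-1+K, F+k-1+2K, ...}, as in step 1(b)."""
    return [list(range(F + k, T + 1, K)) for k in range(K)]


def regressors(y, u, t, m, n):
    """Regressor row for time t (1-based) under the ASSUMED ARX form:
    m past CDC values followed by n current/past OSN values."""
    past_y = [y[t - i - 1] for i in range(1, m + 1)]  # y(t-1), ..., y(t-m)
    osn = [u[t - j - 1] for j in range(n)]            # u(t), ..., u(t-n+1)
    return past_y + osn


def solve_least_squares(X, z):
    """Least squares via the normal equations (X^T X) w = X^T z,
    solved with Gauss-Jordan elimination and partial pivoting.
    Assumes at least one regressor and a non-singular X^T X."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T z]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * zi for row, zi in zip(X, z))]
         for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def cross_validate(y, u, m, n, K):
    """Step 1(c): predict every y(t), t = F..T, from a model fitted on
    the other K-1 subsets. Returns a dict {t: predicted y(t)}."""
    T = len(y)
    F = max(m + 1, n)
    preds = {}
    for held_out in interleaved_folds(F, T, K):
        held = set(held_out)
        train = [t for t in range(F, T + 1) if t not in held]
        X = [regressors(y, u, t, m, n) for t in train]
        w = solve_least_squares(X, [y[t - 1] for t in train])
        for t in held_out:
            x = regressors(y, u, t, m, n)
            preds[t] = sum(wi * xi for wi, xi in zip(w, x))
    return preds


def rmse(y, preds, F_max):
    """Eq. (4): RMSE over t = F_max..T (1-based indices)."""
    T = len(y)
    sq = sum((y[t - 1] - preds[t]) ** 2 for t in range(F_max, T + 1))
    return math.sqrt(sq / (T - F_max + 1))
```

For instance, `preds = cross_validate(y, u, m=1, n=2, K=10)` followed by `rmse(y, preds, F_max=3)` would score one model order. The value `F_max = 3` assumes that F_max is the largest F over the orders listed in step 1, so that every model is scored on the same time range.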
5.3 Cross Validation Results
We fit our model with Twitter data, Facebook data, and the combination of Twitter
and Facebook data. According to the cross validation results in Table 3¹, the models
corresponding to $m = 2$ and $n = 0$ have the lowest RMSE for both Twitter and Facebook.
This indicates that the two most recent data points are required to perform accurate
prediction of influenza rates using Twitter or Facebook data. However, the model
corresponding to $m = 1$ and $n = 2$ for the combination of Twitter and Facebook data
has the lowest RMSE among all models. Thus, the model corresponding to $m = 1$ and
$n = 2$ is used for accurate prediction of influenza rates; it uses the most recent CDC
ILI data in addition to the two most recent OSN data points. In general, the addition
of OSN data improves on prediction with past CDC data alone. For the 10-fold cross
validation results presented in Table 3, for example, the AR model ($m = 1$, $n = 0$)
¹ The cross validation results presented for the Twitter dataset differ from our previous
work [2] because we disregard the scaling effect caused by the creation of new Twitter
accounts over time.
 