Online Social Networks Flu Trend Tracker: A Novel Sensory Approach to Predict Flu Trends - Biomedical Engineering Systems and Technologies

Biomedical Engineering Reference

In-Depth Information

of how many weeks of OSN and ILI data should be used to predict the ILI activity in

the current week. Within the ranges examined, m =0or n = 0 represent models where

there are no CDC data, y, or OSN data, u, terms present. Also, if m =0and n =1,we

have a linear regression between OSN data and CDC data. If n = 0, we have standard

auto-regressive (AR) models. Since the AR models utilize past CDC data, they serve

as baselines to validate whether OSN data provides additional predictive power beyond

historical CDC data.

Prediction with Logistic ARX Model. To predict the flu cases in week t using the

Logistic ARX model in Eq. (1) based on the CDC data with 2 weeks of delay and/or

the up-to-date OSN data, we apply the following relationship:

log y ( t )

= a i log y ( t

a i log y ( t

−

i )

−

y ( t )

−

y ( t

−

y ( t

−

i )

i =2

n− 1

b j log( u ( t

−

j ))

(2)

j =0

log y ( t

a i log y ( t

n− 1

−

b j log( u ( t

−

1))

−

y ( t

−

y ( t

−

i =1

j =0

(3)

where y ( t ) represents predicted CDC data in week t . It can be verified from the above

equations that to predict the CDC data in week t , the most recent CDC data is from

week t

2 . If the CDC data lag is more or less than two weeks, the above equations

can be easily adjusted accordingly.

−

5.2

Cross Validation Test Description

Based on ARX model structure in Eq. (1), we conducted tests using different combi-

nations of m and n values. We currently have 33 weeks with both Twitter activity and

CDC data available (10/3/2010-05/15/2011). Due to limited data samples, we adopted

the K -fold cross validation approach to test the prediction performance of the models.

In a typical K -fold cross validation scheme, the dataset is divided into K (approxi-

mately) equally sized subsets. At each step in the scheme, one such subset is used as the

test set while all other subsets are used as training samples in order to estimate the model

coefficients. Therefore, in a simple case of a 30-sample dataset, 10-fold cross-validation

would involve testing 3-samples in each step, while using the other 27 samples to esti-

mate the model parameters.

In our case, the cross-validation scheme is somewhat complicated by the dependency

of the sample y ( t ) on the previous samples, y ( t

−

1) ,... , y ( t

−

m ) and u ( t ) ,... ,

u ( t

n +1) (see Eq. (1) ). Therefore, the first sample that can be predicted is y (max( m +

1 ,n )) not y (1) . In fact, since we are predicting “two weeks ahead” of the available

CDC data, the first sample that can be estimated is actually y (max( m +2 ,n +1)) .

Since, prediction equations cannot be formed for y (1) ,... , y (max( m +2 ,n +1)

−

1) ,

those samples were not considered in any of the K subsets during our experiment to be

−

Biomedical Engineering Systems and Technologies

Search WWH ::

Custom Search

Home