Biomedical Engineering Reference
In-Depth Information
using current and past OSN activity, and CDC data from previous weeks. The prediction
of current ILI activity using ILI activity from previous weeks forms the autoregressive
component of the model, while the OSN data from previous weeks serve as exogenous
inputs. By CDC data, we refer to the percentage of visits to a physician for Influenza-
Like Illness (also called ILI rate).
5.1 Influenza Model Structure
The percentage of physician visits lies between 0% and 100%, whereas the number of
OSN users is bounded only from below, by 0. A simple linear ARX model neglects these
bounds in its structure. Therefore, we introduce a logit link function for the CDC data
and a logarithmic transformation of the OSN data as follows:
Logistic ARX Model

\[
\log\frac{y(t)}{1-y(t)} = \sum_{i=1}^{m} a_i \log\frac{y(t-i)}{1-y(t-i)} + \sum_{j=0}^{n-1} b_j \log\bigl(u(t-j)\bigr) + c + e(t) \tag{1}
\]

where t indexes weeks, y(t) denotes the percentage of physician visits due to ILI in
week t, u(t) represents the number of unique Twitter/Facebook users with flu-related
tweets in week t, and e(t) is a sequence of independent random variables. The constant
term c accounts for the offset. In our tests, the number of unique OSN users u(t) is
defined as Twitter users without retweets and having no tweets from the same user within
a syndrome elapsed time of 0 weeks, or Facebook users having no posts from the same user
within a syndrome elapsed time of 0 weeks. Flu-related messages are defined as posts
containing the keywords “flu”, “H1N1” and “swine flu”. The rationale for the model
structure in Eq. (1) is that OSN data provides a real-time assessment of the flu epidemic.
However, the OSN data may be disturbed at times by events related to flu, such as news
reports of flu in other parts of the world, that do not necessarily reflect local people
actually getting sick due to ILI. On the other hand, the CDC data provides a true, albeit
delayed, assessment of a flu epidemic. Hence, by using the CDC data along with the OSN
data, we may be able to take advantage of the timeliness of the OSN data while overcoming
the disturbance that may be present in the OSN data.
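To make the structure of Eq. (1) concrete, here is a minimal sketch of how its coefficients could be fitted. It assumes ordinary least squares on the transformed variables, which the text above does not name as the estimator, and it assumes y is supplied as a proportion in (0, 1), that is, the percentage divided by 100. The function names (logit, inv_logit, design_matrix, fit_logistic_arx, predict) are hypothetical and chosen only for illustration.

```python
import numpy as np


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def inv_logit(z):
    """Map a log-odds value back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def design_matrix(y, u, m, n):
    """Regressors and targets for Eq. (1), one row per usable week t.

    Assumption: y holds ILI rates as proportions, u holds positive counts.
    Columns: logit(y(t-1)), ..., logit(y(t-m)), log(u(t)), ...,
    log(u(t-n+1)), and a constant 1 for the offset c.
    """
    y = np.asarray(y, dtype=float)
    u = np.asarray(u, dtype=float)
    start = max(m, n - 1, 0)  # first week for which every lag exists
    rows, targets = [], []
    for t in range(start, len(y)):
        ar_terms = [logit(y[t - i]) for i in range(1, m + 1)]
        osn_terms = [np.log(u[t - j]) for j in range(n)]
        rows.append(ar_terms + osn_terms + [1.0])
        targets.append(logit(y[t]))
    return np.array(rows), np.array(targets)


def fit_logistic_arx(y, u, m, n):
    """Least-squares estimates (a, b, c) for Eq. (1); the estimator is assumed."""
    X, z = design_matrix(y, u, m, n)
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return theta[:m], theta[m:m + n], theta[m + n]


def predict(y, u, t, a, b, c):
    """Prediction of y(t) from Eq. (1) with the noise term e(t) set to zero."""
    z = c
    z += sum(a[i - 1] * logit(y[t - i]) for i in range(1, len(a) + 1))
    z += sum(b[j] * np.log(u[t - j]) for j in range(len(b)))
    return inv_logit(z)
```

Because both transforms are applied before fitting, the regression itself stays linear in a_i, b_j and c, while the inverse logit keeps every prediction strictly between 0 and 1.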
The objective of the model is to provide timely updates of the percentage of physician
visits. To predict this percentage in week t, we assume that only CDC data with at least
2 weeks of lag is available for the prediction, if past CDC data is present in a model.
The 2-week lag simulates the typical delay in CDC data reporting and aggregation. For the
OSN data, we assume that the most recent data is always available, if a model includes
the OSN data terms. In other words, the most current data that can be used to predict the
percentage of physician visits in week t is week t−2 for the CDC data and week t for the
OSN data.
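The availability rule can be illustrated with a small helper that lists which weeks a prediction for week t may draw on. This is only a sketch: the text does not spell out how the 2-week lag enters Eq. (1), so the helper assumes the m autoregressive terms are simply shifted so that the newest CDC value used is from week t−2. The function name usable_weeks and the parameter cdc_delay are hypothetical.

```python
def usable_weeks(t, m, n, cdc_delay=2):
    """Week indices available when predicting week t.

    Assumption: the m CDC (autoregressive) terms start at week
    t - cdc_delay, while the n OSN terms start at the current week t.
    """
    cdc_weeks = [t - cdc_delay - k for k in range(m)]
    osn_weeks = [t - j for j in range(n)]
    return cdc_weeks, osn_weeks
```

For example, with t = 10, m = 2 and n = 3, the helper returns CDC weeks [8, 7] and OSN weeks [10, 9, 8], so no CDC value newer than week t−2 is ever used.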
In order to predict ILI rates in a particular week given current OSN data and the most
recent ILI data from the CDC, we must estimate the coefficients a_i, b_j and c in Eq. (1).
Also, in practice, the model orders m and n are unknown and must be estimated. In our
experiment, we vary m from 0 to 2 and n from 0 to 3 in Eq. (1) in order to obtain the
best values of m and n to use for prediction. Intuitively, this answers the question