the correlation that existed then. To mitigate this problem, we removed the spurious
tweets using a filtering technique that trains a document classifier to label whether a
message is indicative of a flu event or not.
4.1 Text Classification
In an information retrieval scenario, text mining seeks to extract useful information
from unstructured textual data. Using a simple “bag-of-words” text representation
based on a vector space, our algorithm identifies messages in which the user mentions
having contracted the flu or having observed the flu among friends, family, relatives,
etc. The accuracy of such a model depends heavily on how well it is trained, which we
measure in terms of precision, recall and F-measure.
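As a rough illustration of the bag-of-words representation described above, the following sketch maps each message to a vector of term counts over a shared vocabulary. The tokenizer, the function names, and the example messages are assumptions made for illustration and are not taken from the study.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(messages):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(message, vocabulary):
    """Return the term-count vector of one message over the vocabulary.

    Tokens that are not in the vocabulary are ignored.
    """
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(message)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Hypothetical example messages, not drawn from the study's dataset.
messages = [
    "I have the flu and feel awful",
    "My brother has the flu",
    "Flu shot clinic opens today",
]
vocabulary = build_vocabulary(messages)
vectors = [bag_of_words(m, vocabulary) for m in messages]
```

Word order is discarded in this representation; each message is reduced to the counts of the words it contains, which is what a vector-space classifier consumes.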
The set of possible labels for a given instance can be divided into two subsets, one
of which is considered “relevant”. To create such an annotated dataset, which demands
human intelligence, we use Amazon Mechanical Turk workers (Turks) to manually classify
a sample of 25,000 tweets and 10,000 status updates. Every message is classified by
exactly three Turks, and the majority label is assigned as the final class for that message.
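The majority rule over three annotations can be sketched as follows. The function name, the label strings, and the sample annotations are hypothetical; the source states only that each message receives three labels and that the majority decides.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by at least two of three annotators."""
    if len(labels) != 3:
        raise ValueError("expected exactly three annotations per message")
    label, votes = Counter(labels).most_common(1)[0]
    if votes < 2:
        # Cannot happen with two classes, but guards against stray labels.
        raise ValueError("no majority among the annotations")
    return label


# Hypothetical annotations keyed by message id (illustrative only).
annotations = {
    "msg-1": ["yes", "yes", "no"],
    "msg-2": ["no", "no", "no"],
    "msg-3": ["no", "yes", "yes"],
}
final_labels = {mid: majority_label(votes) for mid, votes in annotations.items()}
```

With two classes and an odd number of annotators, a strict majority always exists, so no tie-breaking rule is needed.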
Table 1. Twitter text classification 10-fold cross-validation results (left) followed by Facebook's
10-fold cross-validation results (right)

                       Twitter                        Facebook
Classifier      Class  Precision  Recall  F-value    Precision  Recall  F-value
J48             Yes    0.801      0.791   0.796      0.684      0.785   0.731
                No     0.813      0.704   0.755      0.629      0.501   0.557
Naive Bayesian  Yes    0.725      0.829   0.773      0.688      0.847   0.759
                No     0.813      0.704   0.755      0.69       0.47    0.559
SVM             Yes    0.807      0.822   0.814      0.696      0.857   0.768
                No     0.829      0.814   0.822      0.71       0.485   0.576
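For readers unfamiliar with the metrics in Table 1, the sketch below shows how precision, recall and F-value are conventionally computed for one class. It assumes the F-value is the balanced F1 score (the harmonic mean of precision and recall), which the source does not state explicitly but which agrees with the tabulated figures; the function names and the confusion counts are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts.

    tp: messages correctly assigned to the class
    fp: messages wrongly assigned to the class
    fn: messages of the class that were assigned elsewhere
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Consistency check against Table 1: SVM, Twitter, class "Yes"
# (precision 0.807, recall 0.822, reported F-value 0.814).
svm_twitter_yes_f = round(f_measure(0.807, 0.822), 3)

# Hypothetical confusion counts, for illustration only.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
```

Recomputing the F-value from the tabulated precision and recall in this way is a quick check that a reconstructed table row is internally consistent.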
The training dataset is fed as input to three classifiers: a decision tree (J48), a
Support Vector Machine (SVM) and Naive Bayesian. For efficient learning, the
configurations we incorporated within our text classification algorithm include term
frequency-inverse document frequency (tf-idf) weighting, stemming, a stopword list, a
limit on the number of words to keep (the feature vector set) and class reordering.
Based on the results shown in Table 1, we conclude that the SVM classifier, which
attains the highest precision and F-value for every class on both datasets, outperforms
the other classifiers when it comes to text classification for our data set. Applying
SVM to unclassified data originating from within the United States resulted in a
Twitter dataset with 280K positively classified tweets from 187K unique Twitter users
and 185K positively classified Facebook posts from 164K unique Facebook users. To gauge
whether the number of unique Twitter users mentioning the flu per week is a good measure
of the CDC's reported ILI data, we plot (in Figure 2) the number of Twitter users per
week against the percentage of weighted ILI visits, which yields a high Pearson
correlation coefficient of 0.8907. A similar plot of the number of unique Facebook users
mentioning the flu per week against the percentage of weighted ILI visits yields a
Pearson correlation coefficient of 0.8728.
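The Pearson correlation coefficient used for this comparison can be computed as in the sketch below. The weekly series are made-up placeholder values, not the study's data, and the function and variable names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder weekly values (illustrative only, not from the study):
# unique users mentioning the flu, and % weighted ILI visits.
weekly_users = [1200, 1850, 2600, 3100, 2400]
ili_percent = [1.4, 1.9, 2.8, 3.3, 2.5]
r = pearson(weekly_users, ili_percent)
```

A coefficient close to 1 indicates that the weekly user counts rise and fall together with the ILI percentages; it does not by itself establish that one series predicts the other.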
 