of reports. Integration of manual and automated analyses is also given consideration. Given time and space limitations, the survey was intended only to be indicative of current technology trends, without claiming to be comprehensive. A number of other systems that perform similar functions also exist, as do sites that specialize in reporting natural disaster information, such as the United Nations' Global Disaster Alert and Coordination System (GDACS). Of particular note are human network systems; ProMED-mail is an outstanding example that is used as a source by several of the systems we survey here (Madoff 2004). To provide technological context for the discussion that follows, we briefly detail the basic methodological processes that a prototypical system needs to employ; illustrative code sketches for the first six stages follow the list:
1. Data ingestion is the first stage of processing, with sources originating from a variety of document types such as e-mails, newswire reports, business reports, and blogs (Web logs). Content can be formatted in standard syntaxes, including HTML (HyperText Markup Language), RSS (Really Simple Syndication) feeds, and PDF (Portable Document Format) documents.
2. Data cleansing is a technologically mundane process; however, it is vital in practice, both to remove unwanted noise from the text (e.g., advertisements or links to unrelated news stories) and to rejoin broken sentences.
3. Data triage is applied after the first two stages; it is the stage during which the more-or-less clean text is grouped into topic categories, either for trashing or for subsequent processing using detailed fact extraction. Trashing is necessary for documents that fall clearly outside the task definition. At this stage, redundant information (e.g., multiple reports of the same event) is usually detected through document clustering.
4. Machine translation of the source text may be required during the
data triage stage if the system does not have a native fact extraction
capability in the source language.
5. Fact extraction is used to obtain structured information about an event, such as the name of the condition, the type of agent, the number of victims, and the time and location at which the event happened. In other words, this is the who, what, where, when, and how of an event.
6. Significance scores are calculated using results from the available data. These may come from the data triage stage alone or be computed in conjunction with fact extraction. High-end systems use sophisticated statistical analysis to assign an alerting level to each detected event.
7. Human judgment is key throughout these processes. It is almost always needed to understand what is abnormal and to discover rare events that automated analysis alone may miss.
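As a minimal illustration of stage 1, the Python sketch below ingests items from an RSS feed using only the standard library. The feed URL is a placeholder, and a real system would add parsers for the other formats mentioned above (HTML pages, PDF documents, e-mail).

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/outbreak-news/rss.xml"  # placeholder feed

def ingest_rss(url):
    """Fetch an RSS feed and yield one dictionary per <item> element."""
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for item in tree.iterfind(".//item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }

if __name__ == "__main__":
    for doc in ingest_rss(FEED_URL):
        print(doc["title"])
```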
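Stage 2 can often be approximated with simple heuristics. The sketch below assumes that noise lines match a small set of patterns (the patterns shown are illustrative, not drawn from any surveyed system) and that a line ending without sentence punctuation was broken by page layout.

```python
import re

# Illustrative noise patterns; a real system would tune these per source.
NOISE = re.compile(r"advertisement|sponsored|related stories|click here", re.I)

def cleanse(text):
    """Drop noisy lines, then rejoin lines broken mid-sentence."""
    lines = [ln.strip() for ln in text.splitlines()
             if ln.strip() and not NOISE.search(ln)]
    joined = []
    for ln in lines:
        # A previous line that does not end in sentence punctuation was
        # probably split by layout, so glue the current line onto it.
        if joined and not joined[-1].endswith((".", "!", "?", ":")):
            joined[-1] += " " + ln
        else:
            joined.append(ln)
    return "\n".join(joined)
```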
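For the redundancy-detection part of stage 3, one common technique (assumed here; the surveyed systems may differ) is to compare documents as bag-of-words vectors and treat high cosine similarity as evidence of duplicate reporting. The 0.8 threshold is an arbitrary illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector over lowercased word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_duplicates(docs, threshold=0.8):
    """Return index pairs of documents likely reporting the same event."""
    vecs = [vectorize(d) for d in docs]
    return [(i, j)
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) >= threshold]
```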
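Stage 4 is essentially a routing decision: a document is sent through translation only when its language is outside the extractor's native coverage. The sketch below uses a crude stopword-overlap language guess as a stand-in for a real language identifier, and leaves the translation call as an explicit stub.

```python
# Crude stopword inventories; a real system would use a trained identifier.
STOPWORDS = {
    "en": {"the", "and", "of", "in", "is"},
    "fr": {"le", "la", "et", "de", "est"},
    "es": {"el", "los", "y", "de", "es"},
}
NATIVE_LANGUAGES = {"en"}  # languages the fact extractor handles directly

def guess_language(text):
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

def route(document):
    """Pass a native-language document through; otherwise translate first."""
    lang = guess_language(document)
    if lang in NATIVE_LANGUAGES:
        return document
    return translate(document, source=lang, target="en")

def translate(text, source, target):
    # Stub: plug in whatever machine translation service is available.
    raise NotImplementedError
```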
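A heavily simplified stage-5 extractor is sketched below: a single regular expression that pulls the victim count, condition, and location out of sentences shaped like "N cases of X reported in Y". Real systems use much richer grammars, gazetteers, and temporal normalizers; the pattern here is purely illustrative.

```python
import re

# Matches sentences like "12 cases of cholera reported in Dhaka".
PATTERN = re.compile(
    r"(?P<count>\d+)\s+cases?\s+of\s+(?P<condition>[a-zA-Z ]+?)\s+"
    r"reported\s+in\s+(?P<location>[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*)"
)

def extract_facts(sentence):
    """Return a structured event record, or None if no pattern matches."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return {
        "condition": m.group("condition").strip(),
        "victims": int(m.group("count")),
        "location": m.group("location"),
    }

print(extract_facts("12 cases of cholera reported in Dhaka on 3 May"))
# -> {'condition': 'cholera', 'victims': 12, 'location': 'Dhaka'}
```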
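For stage 6, one widely used baseline (assumed here rather than taken from any particular surveyed system) is to z-score the current case count against a historical window and map the score onto alert levels; the thresholds below are illustrative only.

```python
import statistics

def alert_level(history, current):
    """Map the z-score of the current count against history to an alert level."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    z = (current - mean) / sd if sd else 0.0
    # Illustrative cut points; production systems calibrate per disease/region.
    if z >= 3.0:
        return "red"
    if z >= 2.0:
        return "orange"
    if z >= 1.0:
        return "yellow"
    return "green"

print(alert_level([2, 3, 1, 4, 2, 3, 2], current=9))  # -> red
```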