15.3.1 Data Sources
To simplify news collection, BioCaster ingests data in RSS format.
This provides low-maintenance access to frequently updated Web feeds,
such as news or blogs. BioCaster ingests approximately 1700 feeds from a
large variety of sites, including Google News, ProMED-mail, the European
Media Monitor, WHO outbreak reports, and news reported by international,
national, and local providers. Since 2009, the BioCaster group has outsourced
part of its news collection to a private media monitoring company, allowing
access to more than 90,000 news providers in approximately 110 countries.
However, because of copyright restrictions, links to these reports will be
made available to logged-in users only. The BioCaster group expects the number
of detected reports to increase as the system extends operational coverage of
languages to Chinese, French, Russian, Spanish, and Portuguese.
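
As a minimal sketch of what ingesting an RSS feed involves, the following Python example parses an RSS 2.0 document and extracts the title, link, and publication date of each item. It is not BioCaster's actual ingestion code: the sample feed, the example.org URLs, and the function name parse_rss_items are assumptions made for illustration, and a real system would fetch each feed over HTTP rather than read an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used for illustration only; it is not BioCaster input.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example health news</title>
    <item>
      <title>Cluster of influenza cases reported</title>
      <link>http://example.org/news/1</link>
      <pubDate>Mon, 06 Apr 2009 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Local market reopens</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element; missing fields become empty strings."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


items = parse_rss_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], entry["link"])
```

Because every feed shares the same item structure, one parser of this kind can serve many sources, which is what keeps maintenance low as feeds are added.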
15.3.2 Selection and Encoding of Diseases
The central knowledge resource within BioCaster is a multilingual application
ontology (BCO) that was developed by an interdisciplinary team of experts
with skills spanning computational linguistics, national public health,
genetics, and anthropology (Collier et al. 2007).
Formal concept analysis was used to organize the BCO around a backbone of the
Suggested Upper Merged Ontology (SUMO) upper-level taxonomy (Kawazoe
et al. 2006; Kawazoe et al. 2008). Domain entity classes such as “Disease,”
“Country,” “Province,” “Symptom,” and “Chemical” were carefully grafted
onto this taxonomy (Niles and Pease, 2001). Root terms (the key concepts that
play roles in events) appear as instances of the domain entity classes. The
selection of terms was centered on diseases selected from various countries'
notifiable disease lists and ranked for public health impact. The resulting
ontology is made available for browsing on the BioCaster portal site, and
also as a free and downloadable OWL (Web Ontology Language) file. The
third version of the ontology, released in 2009, encodes multilingual equivalences
between eleven languages: Chinese, English, French, Indonesian,
Japanese, Korean, Malay, Spanish, Russian, Thai, and Vietnamese. Cross-
language term equivalents are handled as multilingual synonym sets in a
manner similar to that used in EuroWordNet (Vossen 1998). The new version
of the system will contain more than 300 human and animal diseases,
representing an increase of nearly 200 from the second version released in
April 2008.
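
The idea of root terms as instances of domain entity classes, each carrying a multilingual synonym set, can be sketched as a small data structure. The Python example below is an illustration under stated assumptions and does not reproduce the BCO's OWL encoding: the class RootTerm, the identifier DISEASE_INFLUENZA, the helper functions, and the sample synonyms are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RootTerm:
    term_id: str        # hypothetical identifier, not a real BCO ID
    entity_class: str   # domain entity class, e.g. "Disease"
    # Maps a language code to the list of equivalent terms in that language.
    synonyms: dict = field(default_factory=dict)


def build_index(root_terms):
    """Map each (language, lowercased surface term) pair to its root term."""
    index = {}
    for root_term in root_terms:
        for lang, terms in root_term.synonyms.items():
            for term in terms:
                index[(lang, term.casefold())] = root_term
    return index


def lookup(index, lang, surface):
    """Return the root term for a surface form, or None if it is unknown."""
    return index.get((lang, surface.casefold()))


# Illustrative entry only; the synonyms are not copied from the BCO.
influenza = RootTerm(
    term_id="DISEASE_INFLUENZA",
    entity_class="Disease",
    synonyms={"en": ["influenza", "flu"], "fr": ["grippe"], "es": ["gripe"]},
)

index = build_index([influenza])
match = lookup(index, "fr", "Grippe")
print(match.term_id, match.entity_class)
```

With this layout, a mention in any covered language resolves to the same root term, so downstream processing can work with language-independent identifiers.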
15.3.3 Automated Analysis
As shown in Figure 15.3, the initial stages of automatic analysis begin
with data ingestion and cleansing, followed by machine translation
(MT). Although BioCaster has native named entity and event extraction