and two-thirds on nonrelevant topics, such as business, politics, technology,
sport, and entertainment.
An important feature of the topic classification task is the need for high
recall, so that no true positives are rejected. After side-by-side comparisons
with other learning algorithms, the BioCaster group selected a support
vector machine model that helped achieve a classification accuracy of over
93.5% (F-score = 91.2). However, a small proportion of false positives still
remained: for example, when the condition is unclassified or vague, when
the outbreak is hypothetical, or when the outbreak is historical or negative.
More subtle borderline cases frequently occur in stories that discuss disease
outbreak reports, but not as their main topical contribution, such as in pre-
vention and control campaigns. Later stages of processing are designed to
detect and, if necessary, reject such reports.
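The relationship between recall, false positives, accuracy, and F-score can be made concrete with a small sketch. It assumes the balanced F-score (harmonic mean of precision and recall), which the text does not specify, and the counts below are invented for illustration only; they are not BioCaster's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_score, accuracy) from confusion counts.

    Assumes the balanced F-score, i.e. the harmonic mean of precision
    and recall; the source does not say which variant was reported.
    """
    precision = tp / (tp + fp)   # share of accepted stories that are relevant
    recall = tp / (tp + fn)      # share of relevant stories that are accepted
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy


# Hypothetical counts: 100 relevant and 200 nonrelevant stories.
precision, recall, f_score, accuracy = classification_metrics(
    tp=90, fp=10, fn=10, tn=190
)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F={f_score:.3f} accuracy={accuracy:.3f}")
```

With a corpus dominated by nonrelevant stories, the many true negatives raise accuracy above the F-score, which is one reason both figures are worth reporting.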
For named entity analysis and event extraction, the BioCaster group devel-
oped Simple Rule Language (SRL), a freely available regular expression pattern-
matching language. SRL is designed to allow users without a background in
computer science to quickly build up rule topics. Although this is laborious in
general, the BioCaster group has tried to make the task easier by developing
a freely available graphical user interface (McCrae et al. 2009). SRL has influ-
ences from earlier pattern-based languages, such as Declarative Information
Analysis Language (DIAL), and incorporates a capability to match string
literals, named entity classes, skipwords, and word lists (Feldman 2003).
The general SRL syntax is a label followed by a head expression and a body
expression. The head expression is output if the regular expression in the
body matches the text. Examples follow:
Ex 1. “D1: name(disease) { list(%disease) }” matches any phrase in
the list “disease” and outputs a named entity of type “disease.”
Ex 2. “L2: name(location) { list(@cardinal_directions) list(@country) }”
matches any country name preceded by a cardinal direction
such as “northeastern” and outputs the whole phrase as a named
entity of type “location.”
Ex 3. “IT1: international_travel("true") :- "recently" "traveled" "to"
words(,2) name(location,L) { list(@country) }” matches the phrase
“recently traveled to” followed by up to two words and a location
name and outputs the fact that “international travel” is true.
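The behavior of the three example rules can be approximated with ordinary regular expressions. The following Python sketch is not SRL and not the BioCaster implementation; the word lists, the `list_pattern` helper, and the `annotate` function are hypothetical stand-ins chosen only to mirror what Ex 1 to Ex 3 describe.

```python
import re

# Hypothetical word lists; the real lists are described as coming
# from resources such as the BCO.
WORD_LISTS = {
    "disease": ["avian influenza", "cholera", "dengue fever"],
    "cardinal_directions": ["northern", "southern", "northeastern", "western"],
    "country": ["Viet Nam", "Thailand", "Japan"],
}


def list_pattern(name):
    """Build a regex alternation matching any phrase in the named list."""
    # Longer phrases come first so the alternation prefers them.
    phrases = sorted(WORD_LISTS[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(p) for p in phrases) + ")"


# Ex 1 analogue: any phrase in the "disease" list is a disease entity.
DISEASE = re.compile(r"\b" + list_pattern("disease") + r"\b", re.IGNORECASE)

# Ex 2 analogue: a cardinal direction followed by a country name.
LOCATION = re.compile(
    r"\b" + list_pattern("cardinal_directions") + r"\s+"
    + list_pattern("country") + r"\b",
    re.IGNORECASE,
)

# Ex 3 analogue: "recently traveled to", up to two skipped words,
# then a country name.
TRAVEL = re.compile(
    r"\brecently\s+traveled\s+to\s+(?:\w+\s+){0,2}"
    + list_pattern("country") + r"\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return (entities, facts) found in text by the three rule analogues."""
    entities = [("disease", m.group(0)) for m in DISEASE.finditer(text)]
    entities += [("location", m.group(0)) for m in LOCATION.finditer(text)]
    facts = {"international_travel": bool(TRAVEL.search(text))}
    return entities, facts


entities, facts = annotate(
    "The patient, who recently traveled to rural northern Thailand, "
    "was diagnosed with cholera."
)
print(entities)
print(facts)
```

Keeping the word lists separate from the patterns reflects the design point made in the text: new terms can be added by editing a list, without touching the rule itself.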
As shown in the examples above, lists and named entities are used to encode
semantically related terms such as disease names, country names, victim
expressions, verbs of infection, and so on with several of the lists coming
from the BCO. The advantage of this approach is that it can be easily used by
nonexperts in software engineering, can be quickly changed to accommodate
new terms or events, and is not limited to any particular language. SRL is
therefore well suited to resource-poor languages, but can easily be extended