Biomedical Engineering Reference
In-Depth Information
In the extraction of biomedical entities and biomedical relations (discussed in the
next section), the performance of methods and systems is generally evaluated using
two scores: precision and recall. They are defined as follows:
\[
\text{Precision} = \frac{\text{True matches}}{\text{True matches} + \text{False matches}}
\]
\[
\text{Recall} = \frac{\text{True matches}}{\text{True matches} + \text{Missed matches}}
\]
Precision and recall can be combined to calculate the F-measure, which is the
weighted harmonic mean of precision and recall (shown here with equal weights):
\[
F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
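As a minimal sketch, the three scores can be computed directly from match counts. The function names and the counts in the example are hypothetical and chosen only for illustration; returning 0.0 for an empty denominator is an assumption, not part of the definitions.

```python
def precision(true_matches: int, false_matches: int) -> float:
    """Fraction of reported matches that are correct."""
    reported = true_matches + false_matches
    # Assumption: return 0.0 when nothing was reported.
    return true_matches / reported if reported else 0.0


def recall(true_matches: int, missed_matches: int) -> float:
    """Fraction of the entities actually present that were found."""
    relevant = true_matches + missed_matches
    # Assumption: return 0.0 when there was nothing to find.
    return true_matches / relevant if relevant else 0.0


def f_measure(p: float, r: float) -> float:
    """Equal-weight harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 80 true matches, 20 false matches, 40 missed matches.
p = precision(true_matches=80, false_matches=20)
r = recall(true_matches=80, missed_matches=40)
f = f_measure(p, r)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
# prints: precision=0.800 recall=0.667 F=0.727
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot reach a high F-measure by maximizing only one of the scores.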
There are several approaches to biomedical NER: rule-based [48, 49],
dictionary-based [50-52], and model-based [53, 54]. In rule-based
approaches, predefined rules describe the composition of named entities
and their context. Surface clues, such as capital letters, symbols, and digits,
can be used to extract candidates for gene and protein names. These candidates
serve as core terms and can be expanded using syntactic rules to include
confirming words, such as GENE, PROTEIN, and RECEPTOR, for term refinement.
Part-of-speech (POS) tags can also be used to further improve the rules.
Rule-based NER usually does not perform well on unseen names, and the
construction and tuning of rules can be time-consuming.
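The rule-based idea above can be sketched with a regular expression. The surface-clue rule used here (letters followed by at least one digit, with an optional hyphen) and the example sentence are assumptions made for illustration; the rule sets used in practice are far richer.

```python
import re

# Assumed surface-clue rule: letters, an optional hyphen, then at least one
# digit, e.g. "BRCA1", "p53", "IL-2". This is an illustrative rule only.
CORE = r"[A-Za-z]+-?\d+[A-Za-z0-9-]*"
# Expansion rule: optionally absorb a confirming word that follows the core term.
CONFIRM = r"(?:\s+(?:gene|protein|receptor))?"
PATTERN = re.compile(rf"\b{CORE}{CONFIRM}\b", re.IGNORECASE)


def extract_candidates(text: str) -> list[str]:
    """Return candidate gene/protein names found by the surface-clue rule."""
    return [m.group(0) for m in PATTERN.finditer(text)]


# Hypothetical example sentence.
sentence = ("Mutations in the BRCA1 gene and loss of p53 protein "
            "affect the IL-2 receptor pathway.")
print(extract_candidates(sentence))
# prints: ['BRCA1 gene', 'p53 protein', 'IL-2 receptor']
```

A name without digits, such as "insulin", is missed by this rule, which illustrates why rule-based NER tends to perform poorly on names its rules did not anticipate.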
Dictionary-based NER uses collections of names to identify entities in different
categories. Named entities are located in free text by exact matching against the
names in the dictionaries. Dictionary-based NER achieves high precision, but
depending on the completeness of the dictionaries, recall can be low. Recall can be
improved with fuzzy matching. In biomedical text analysis, dictionaries of gene,
protein, and other biological names, as well as their synonyms, can be built from
well-annotated databases.
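The following sketch contrasts exact and fuzzy dictionary lookup. The mini-dictionary, the example sentence, and the similarity cutoff of 0.8 are hypothetical; fuzzy matching is done here with `difflib.get_close_matches` from the Python standard library, which is one possible choice among many.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary mapping a name or synonym to its category.
DICTIONARY = {
    "brca1": "gene",
    "tp53": "gene",
    "p53": "protein",
    "insulin": "protein",
}


def tag_tokens(text: str, fuzzy: bool = False, cutoff: float = 0.8):
    """Return (token, category, match_type) for tokens found in the dictionary."""
    hits = []
    for token in text.split():
        word = token.strip(".,;:()").lower()
        if word in DICTIONARY:
            hits.append((word, DICTIONARY[word], "exact"))
        elif fuzzy:
            # Fall back to the single most similar dictionary entry, if any
            # entry reaches the (assumed) similarity cutoff.
            close = get_close_matches(word, list(DICTIONARY), n=1, cutoff=cutoff)
            if close:
                hits.append((word, DICTIONARY[close[0]], "fuzzy"))
    return hits


# Hypothetical sentence with a spelling variant and a hyphenated variant.
sentence = "Insuline and BRCA-1 interact with p53."
print(tag_tokens(sentence))
# prints: [('p53', 'protein', 'exact')]
print(tag_tokens(sentence, fuzzy=True))
# prints: [('insuline', 'protein', 'fuzzy'), ('brca-1', 'gene', 'fuzzy'), ('p53', 'protein', 'exact')]
```

Exact matching finds only one of the three names, while fuzzy matching recovers the two variants. Lowering the cutoff raises recall further but risks false matches, which lowers precision.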
Model-based or classification-based NER uses machine learning approaches.
Commonly used techniques include naïve Bayes, support vector machines (SVMs),
decision trees, and Markov models. Using selected features such as word/phrase
occurrence, word sequence tag, POS tag, and dictionary match, classifiers are
trained on a previously annotated corpus. Model-based NER is sensitive to the
selection of features used for training and classification and to the quality of
the corpus text used in training.
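To make the training step concrete, here is a minimal naïve Bayes token classifier written in plain Python. The three binary features, the tiny annotated corpus, the dictionary, and the labels ENTITY and O are all hypothetical; a real system would use a much larger corpus and feature set, and typically an established machine learning library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical dictionary used for the "dictionary match" feature.
DICTIONARY = {"insulin", "brca1"}


def features(token: str) -> dict:
    """Assumed binary features: surface clues plus a dictionary match."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "in_dict": token.lower() in DICTIONARY,
    }


class NaiveBayes:
    """Naïve Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.label_counts = Counter(labels)
        # label -> Counter of (feature_name, feature_value) pairs
        self.feature_counts = defaultdict(Counter)
        for feats, label in zip(samples, labels):
            for item in feats.items():
                self.feature_counts[label][item] += 1
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for item in feats.items():
                # Each feature has two possible values, hence the "+ 2".
                seen = self.feature_counts[label][item]
                score += math.log((seen + 1) / (count + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical annotated corpus: (token, label) pairs.
corpus = [
    ("BRCA1", "ENTITY"), ("p53", "ENTITY"), ("IL-2", "ENTITY"),
    ("insulin", "ENTITY"), ("the", "O"), ("cells", "O"), ("In", "O"),
    ("expression", "O"), ("of", "O"), ("binds", "O"),
]
model = NaiveBayes().fit(
    [features(tok) for tok, _ in corpus],
    [label for _, label in corpus],
)
print(model.predict(features("TP53")))  # prints: ENTITY
print(model.predict(features("with")))  # prints: O
```

The unseen name "TP53" is labeled as an entity because it shares surface features with the annotated examples. Removing the digit and capitalization features from `features` would change that outcome, which illustrates the sensitivity to feature selection noted above.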
9.5.3 Mining Relations Between Named Entities
Identification of biological entities is the first and crucial step, while the extraction
of associations between proteins and their functional features is the goal of the
analysis of biological text data. There are two levels of entity relation mining for
biomedical literature. At the low level, a relationship between named entities is implied
 