Data Analysis - Biomedical Informatics in Translational Research

In the extraction of biomedical entities and biomedical relations (discussed in the next section), the performance of methods and systems is generally evaluated using two scores: precision and recall. They are defined as follows:

$$\text{Precision} = \frac{\text{True matches}}{\text{True matches} + \text{False matches}}$$

$$\text{Recall} = \frac{\text{True matches}}{\text{True matches} + \text{Missed matches}}$$

Precision and recall can be combined to calculate the F-measure, which is the weighted harmonic mean of precision and recall; with equal weights it reduces to:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
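As a minimal sketch, the three scores can be computed directly from match counts. The function name and the example counts below are illustrative assumptions, not values from any evaluation; returning 0.0 when a denominator is zero is also a convention chosen here.

```python
def precision_recall_f(true_matches, false_matches, missed_matches):
    """Return (precision, recall, F-measure) from raw match counts."""
    predicted = true_matches + false_matches   # everything the system reported
    actual = true_matches + missed_matches     # everything it should have reported
    precision = true_matches / predicted if predicted else 0.0
    recall = true_matches / actual if actual else 0.0
    denom = precision + recall
    # Equal-weight harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 80 true matches, 20 false matches, 40 missed matches.
p, r, f = precision_recall_f(80, 20, 40)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
# precision=0.800 recall=0.667 F=0.727
```

Because the harmonic mean is pulled toward the smaller of the two scores, a system cannot reach a high F-measure by maximizing precision or recall alone.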

There are several approaches to biomedical NER: rule-based [48, 49], dictionary-based [50-52], and model-based [53, 54]. In rule-based approaches, predefined rules are used to describe the composition of the named entities and their context. Surface clues, such as capital letters, symbols, and digits, might be used to extract candidates for gene and protein names. These candidates serve as the core terms and can be further expanded using syntactic rules to include confirming words, such as GENE, PROTEIN, and RECEPTOR, for term refinement. Part-of-speech (POS) tags can also be used to further improve the rules. Rule-based NER usually does not perform well on unseen names, and the construction and tuning of rules can be time-consuming.
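A rule of this kind might be sketched with a regular expression. The specific rule below (a token that mixes letters and digits, optionally hyphenated, optionally followed by a confirming word) is an assumption made for illustration; it is not the rule set of any cited system, and real systems combine many such rules.

```python
import re

# Surface clue (assumed rule): a token containing at least one letter and at
# least one digit, with optional internal hyphens, e.g. "BRCA1" or "IL-2".
CORE = (
    r"\b(?=[A-Za-z0-9-]*[A-Za-z])(?=[A-Za-z0-9-]*\d)"
    r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
)
# Expansion step: absorb a confirming word that directly follows the core term.
CONFIRM = r"(?:\s+(?i:gene|protein|receptor)\b)?"
CANDIDATE = re.compile(CORE + CONFIRM)


def find_candidates(text):
    """Return candidate gene/protein mentions found by the surface rule."""
    return CANDIDATE.findall(text)


sentence = "The p53 protein interacts with BRCA1 and the IL-2 receptor in 2010."
print(find_candidates(sentence))
```

A purely numeric token such as "2010" is rejected because it contains no letter, but a token like "H2O" would still be accepted, which illustrates why such rules need tuning.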

Dictionary-based NER uses collections of names to identify entities in different categories. Named entities are located in free text by exact matching to names in dictionaries. Dictionary-based NER achieves high precision, but recall can be low, depending on the completeness of the dictionaries. Recall can be improved with fuzzy matching. In biomedical textual analysis, dictionaries of gene, protein, and some other biological names, as well as their synonyms, can be built using well-annotated databases.
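The difference between exact and fuzzy lookup can be sketched as follows. The tiny dictionary, the tokenizer, and the similarity cutoff of 0.85 are assumptions chosen for the example; fuzzy matching here uses `difflib.get_close_matches` from the Python standard library as a stand-in for whatever string-similarity method a real system would use.

```python
import difflib
import re

# Hypothetical mini-dictionary: lowercase names and synonyms -> canonical symbol.
GENE_DICT = {
    "tp53": "TP53",
    "p53": "TP53",
    "brca1": "BRCA1",
    "interleukin-2": "IL2",
    "il-2": "IL2",
}


def tokenize(text):
    """Split text into alphanumeric tokens, keeping internal hyphens."""
    return re.findall(r"[A-Za-z0-9-]+", text)


def exact_match(text):
    """Return (token, symbol) pairs for tokens found verbatim in the dictionary."""
    return [(t, GENE_DICT[t.lower()]) for t in tokenize(text) if t.lower() in GENE_DICT]


def fuzzy_match(text, cutoff=0.85):
    """Return (token, symbol) pairs for tokens similar to a dictionary name."""
    hits = []
    for tok in tokenize(text):
        close = difflib.get_close_matches(tok.lower(), list(GENE_DICT), n=1, cutoff=cutoff)
        if close:
            hits.append((tok, GENE_DICT[close[0]]))
    return hits


text = "Mutations in BRCA-1 and interleukin2 were found alongside p53."
print(exact_match(text))
print(fuzzy_match(text))
```

In this example the spelling variants "BRCA-1" and "interleukin2" are missed by exact lookup but recovered by fuzzy lookup. Lowering the cutoff raises recall further at the cost of more false matches.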

Model-based or classification-based NER uses machine learning approaches. Commonly used techniques include naïve Bayes, SVM, decision trees, and Markov models. Using selected features such as word/phrase occurrence, word sequence tag, POS tag, and dictionary match, classifiers are trained on a previously annotated corpus. Model-based NER is sensitive to the selection of features used for training and classification and to the quality of the corpus text used in training.
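To make the training step concrete, here is a minimal naïve Bayes token classifier written from scratch. The three binary features, the label names, the small dictionary, and the ten-token "corpus" are all hypothetical; a real system would use far richer features (including context and POS tags) and a properly annotated corpus.

```python
import math
from collections import Counter, defaultdict

KNOWN_NAMES = {"brca1", "p53"}  # hypothetical dictionary used as one feature


def features(token):
    """Binary surface features for a single token (illustrative selection)."""
    return {
        "has_upper": any(c.isupper() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "in_dict": token.lower() in KNOWN_NAMES,
    }


def train(labeled_tokens):
    """Count labels and (feature, value) occurrences per label."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for token, label in labeled_tokens:
        label_counts[label] += 1
        for name, value in features(token).items():
            feat_counts[label][(name, value)] += 1
    return label_counts, feat_counts


def predict(token, model):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features(token).items():
            # Laplace smoothing; each feature has two possible values, hence +2.
            score += math.log((feat_counts[label][(name, value)] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated corpus: GENE marks a gene/protein name, O anything else.
corpus = [
    ("BRCA1", "GENE"), ("p53", "GENE"), ("IL2", "GENE"), ("TNF", "GENE"),
    ("the", "O"), ("binds", "O"), ("to", "O"), ("In", "O"), ("cells", "O"), ("2010", "O"),
]
model = train(corpus)
print(predict("EGFR2", model), predict("with", model))
```

Removing a feature from `features` or relabeling a few corpus entries changes the predictions, which is a small-scale view of the sensitivity to feature selection and corpus quality noted above.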

9.5.3 Mining Relations Between Named Entities

Identification of biological entities is the first and crucial step, while the extraction of associations between proteins and their functional features is the goal of the analysis of biological text data. There are two levels of entity relation mining for biomedical literature. At the low level, a relationship between named entities is implied
