(EMR) systems [3]. Researchers often spend a lot of time and resources extracting
patient information from unstructured clinical notes. Manual extraction of HTN
information is particularly challenging and tedious because it is usually mentioned
in multiple records for a single patient. At the same time, coding this HTN
information to standard ontologies such as SNOMED-CT adds a further burden to manual
extraction.
Simple clinical text mining techniques can be employed to extract HTN informa-
tion very easily from unstructured clinical notes. There are various tools to extract
HTN information from unstructured clinical notes or biomedical text. However, these
tools have limited capabilities in extracting HTN information. For example, MetaMap
[4] is a popular biomedical information extraction system that can identify HTN
mentions but cannot infer HTN information from medications or lab values. On the
other hand, there are rule-based tools that can recognize blood pressure (BP) values
or medications but cannot infer whether those values or medications are relevant to
HTN [5-7]. In other words, these systems cannot differentiate between high BP and low
BP. In addition, the range of BP values considered to indicate HTN varies from
country to country. In this study, we present a simple HTN information extraction
system called HTNSystem, which is capable of extracting mentions of hypertension and
inferring HTN information from BP lab values
from unstructured clinical notes. HTNSystem is a rule-based information system that
uses MetaMap as a core component together with a custom-built BP value extractor and
rule-based post-processing components. The BP value extractor component was
originally built as part of the TMUNSW system developed for the 2014 i2b2/UTHealth
Shared-Task 2 and 4 [8, 9]. As part of HTNSystem, the old BP value extractor was
significantly improved to increase performance (more details in the results section).
Overall, HTNSystem is generic and highly configurable, allowing end users and
developers to customize it according to their preferences or suggested clinical
guidelines.
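To make the idea of rule-based BP extraction with configurable cut-offs concrete, the following is a minimal sketch and not the HTNSystem implementation. The names BPThresholds, BP_PATTERN, extract_bp_values and indicates_htn are hypothetical, the regular expression covers only simple "BP 150/95"-style mentions, and the 140/90 mmHg default is one commonly used cut-off chosen here purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BPThresholds:
    """Hypothetical configurable cut-offs in mmHg (assumed, not from the paper)."""
    systolic: int = 140
    diastolic: int = 90


# Matches simple mentions such as "BP 150/95" or "blood pressure: 128 / 82".
# Up to 15 non-digit characters may separate the keyword from the reading.
BP_PATTERN = re.compile(
    r"\b(?:bp|blood\s+pressure)\b[^\d\n]{0,15}(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def extract_bp_values(text):
    """Return a list of (systolic, diastolic) integer pairs found in the text."""
    return [(int(s), int(d)) for s, d in BP_PATTERN.findall(text)]


def indicates_htn(systolic, diastolic, thresholds=BPThresholds()):
    """Return True if either value reaches the configured cut-off."""
    return systolic >= thresholds.systolic or diastolic >= thresholds.diastolic


if __name__ == "__main__":
    note = "BP 150/95 today. Last visit blood pressure: 128 / 82."
    for sys_bp, dia_bp in extract_bp_values(note):
        print(sys_bp, dia_bp, indicates_htn(sys_bp, dia_bp))
```

Passing a different BPThresholds instance, for example BPThresholds(systolic=130, diastolic=80), changes which readings are flagged without touching the extraction rule, which is the kind of guideline-specific configuration the text describes.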
2 Materials and Methods
2.1 2014 i2b2/UTHealth Shared-Task 2 Corpus
The 2014 i2b2/UTHealth Shared-Task 2¹ corpus is a clinical data set distributed by
the organizers [10]. The corpus represents longitudinal data of diabetic patients
collected for the purpose of identifying CVD risk factors. It was distributed in three
sets as part of the shared task. Table 1 presents summary-level statistics of the
corpus. The two training sets consist of 521 and 269 unstructured clinical notes
(hereafter referred to as records), respectively, and the test set consists of 514
records. The records in the training data sets were distributed in XML (Extensible
Markup Language) format and included annotations on CVD risk factors. Each record in
the corpus was manually annotated by three different annotators. The risk factors
identified in the corpus were Hypertension, Diabetes, Obesity, Medication, Coronary
artery disease and Smoking history. Three
1 https://www.i2b2.org/NLP/HeartDisease/