suffixes can provide good clues for classifying section headings. For example, words
that end in "Hx" relate to medical history; PSurHx, for instance, refers to the
“past surgical history” section. This work used prefixes and suffixes with a length
of 2 characters.
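As an illustration of the affix features described above, the following is a minimal sketch rather than the authors' implementation. The function name `affix_features` and the returned dictionary keys are hypothetical; only the affix length of 2 comes from the text.

```python
def affix_features(word, n=2):
    """Return the n-character prefix and suffix of a word as features.

    Assumption of this sketch: words shorter than n characters simply
    contribute the whole word as both prefix and suffix.
    """
    return {"prefix": word[:n], "suffix": word[-n:]}


# "PSurHx" is the heading abbreviation mentioned in the text.
print(affix_features("PSurHx"))  # {'prefix': 'PS', 'suffix': 'Hx'}
```

Here the suffix "Hx" is the clue that would point a classifier toward a history-related section.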
Orthographic Features
The surface strings of section headings in an EHR may vary, but they still follow
certain rules established by usage. The orthographic features were developed to capture
these subtle writing conventions. Each orthographic feature is implemented as a regular
expression that captures a writing rule of section headings in terms of spelling,
hyphenation, or capitalization. If the current word matches the pattern of an orthographic
feature, the feature value is 1; otherwise, the value is 0.
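The binary, regular-expression-based scheme described above might be sketched as follows. The source does not list its actual expressions, so every pattern and feature name below is a hypothetical example chosen only to show the mechanism.

```python
import re

# Hypothetical patterns: the original work does not publish its expressions.
ORTHOGRAPHIC_PATTERNS = {
    "all_caps": re.compile(r"^[A-Z]+$"),              # e.g. "HPI"
    "init_cap": re.compile(r"^[A-Z][a-z]+$"),         # e.g. "History"
    "multi_cap": re.compile(r"^(?:[A-Z][a-z]*){2,}$"),  # e.g. "PSurHx"
    "has_hyphen": re.compile(r"-"),                   # e.g. "follow-up"
    "ends_with_colon": re.compile(r":$"),             # e.g. "HPI:"
}


def orthographic_features(word):
    """Map each feature name to 1 if the word matches its pattern, else 0."""
    return {
        name: int(bool(pattern.search(word)))
        for name, pattern in ORTHOGRAPHIC_PATTERNS.items()
    }


print(orthographic_features("PSurHx"))
```

Each feature is independent, so a single word such as "HPI" can switch on more than one of them (here both `all_caps` and `multi_cap`).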
Layout Features
Given the variety of EHR layouts, the original line breaks of the raw text can
help the machine learning model identify the section headings that lead section
blocks. This work developed layout features to capture this line break information. In
our implementation, for a given split sentence, the value of the layout feature is 1 if
the previous line in the original raw text was empty, and 0 otherwise. Take the
third line of Figure 1 as an example: the value of its layout feature with block size 1
would be 1, whereas the value for the fourth line is 0. The block size for the layout
features was set to six, meaning that for a given sentence, the three preceding and the
three following lines were considered.
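The layout feature can be sketched as below. This is an illustration under stated assumptions, not the authors' code: each raw line is treated as one sentence, whitespace-only lines count as empty, positions outside the document are padded with 0, and the current line (offset 0) is included alongside its six neighbours. The function names and the sample text are hypothetical.

```python
def blank_before_flags(raw_lines):
    """Return 1 for each line whose preceding raw line is empty, else 0."""
    flags = []
    for i in range(len(raw_lines)):
        prev_empty = i > 0 and raw_lines[i - 1].strip() == ""
        flags.append(int(prev_empty))
    return flags


def layout_features(flags, i, half_window=3):
    """Collect the flags of line i and of the lines up to half_window away.

    Offsets that fall outside the document are padded with 0 (an assumption).
    """
    feats = {}
    for off in range(-half_window, half_window + 1):
        j = i + off
        feats[f"layout[{off:+d}]"] = flags[j] if 0 <= j < len(flags) else 0
    return feats


# Made-up text for illustration only.
raw = ["Record date", "", "CC: chest pain", "Patient reports pain.", "", "PMH:", "Diabetes"]
flags = blank_before_flags(raw)
print(flags)                      # [0, 0, 1, 0, 0, 1, 0]
print(layout_features(flags, 2))  # features for the third line
```

In this toy text the third line follows an empty line, so its own flag is 1, while the fourth line gets 0, mirroring the Figure 1 example in the text.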
3 Experiment
3.1 Dataset
The Track 2 dataset released for the i2b2 2014 shared task was used in this work. The
dataset was initially divided into three subsets: set1 (521 records), set2 (269
records), and a testing set (514 records). After the manual annotation of section
headings was completed, the dataset contained a total of 1,304 medical records annotated
with 13,962 section headings.
This work analyzed the compiled corpus and generated the following statistics of
existing section headings. Among all annotated sections, 803 (5.7%) are the “chief
complaints” section, which was found to be presented in several alternative spellings
such as CC, chief, and reason. 843 (6.0%) are the “present illness” section. 2,701
(19%) are “personal histories”, which may include subsections like social history,
medication, allergy, substance, marital status, activity and general health status. 486
(3.4%) are “family histories”, 1,104 (7.9%) are “physical examinations”, 401 (2.8%)
are “laboratory examinations”, and 87 (< 1.0%) are “radiology reports”. 103 (< 1.0%)
are “data” sections, which include laboratory and radiology results. 884 (6.3%) are
“diagnosis” or “impressions”, 468 (3.3%) are “plans” or “recommendations”, and the
remaining 6,081 (43.6%) are other section names, including patient name, physician