4.4 Learning to Annotate Cases with Knowledge Roles
To learn to annotate cases with knowledge roles, we implemented a software
framework, shown in Figure 4.5. Only the preparation of documents (described in
Section 4.4.1) is performed outside of this framework. The remainder of this
section presents every component of the framework in detail.
[Figure 4.5 components, numbered 1 to 8 in the figure: Corpus; Tagging; Parsing; Tree Representation; Feature Creation; Corpus Statistics and Clustering; Selection & Annotation; Bootstrap Initialization; Learning Algorithm; Active Learning.]
Fig. 4.5. The Learning Framework Architecture.
4.4.1 Document Preparation
In Section 4.2.1 it was mentioned that our documents are official diagnostic
reports, hierarchically structured in several sections and subsections and written
using MS® Word. Extracting text from such documents while preserving the
content structure is a difficult task. In completing it we were fortunate twice. First,
with MS® Office 2003 the XML-based format WordML was introduced, which permits
storing MS® Word documents directly in XML. Second, the documents were originally
created using an MS® Word document template, so the majority of them
had the same structure. Still, many problems needed to be handled. MS® Word
mixes formatting instructions with content very heavily, and this is also reflected in
its XML format. In addition, information about spelling, versioning, hidden template
elements, and so on is also stored. Thus, one needs to explore the XML output of
the documents to find out how to distinguish text and content structure from
unimportant information. Such a process will always be a heuristic one, depending on the
nature of the documents. We wrote a program that reads the XML document tree,
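To illustrate the kind of heuristic described above, the following is a minimal sketch in Python, not the authors' program. It assumes the Word 2003 WordML namespace and the usual w:p / w:r / w:t paragraph, run, and text elements. The sample document, the style names ("Heading1", "Normal"), and the function name extract_paragraphs are hypothetical and serve only as illustration. The sketch keeps each paragraph's style name and text, and ignores run formatting and spelling markers.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordML documents (assumed here).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def extract_paragraphs(xml_text):
    """Return (style, text) pairs for every non-empty paragraph.

    Formatting instructions (w:rPr) and spelling markers (w:proofErr)
    are skipped simply because only w:t text nodes are collected.
    """
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        # The paragraph style, if present, hints at headings vs. body text.
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W}}}val") if style_el is not None else None
        # Concatenate the text of all runs in this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t")).strip()
        if text:  # drop empty paragraphs, e.g. unused template elements
            result.append((style, text))
    return result


# Hypothetical, hand-written sample; real documents are far noisier.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Measurements</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>The values were </w:t></w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r><w:t>normal.</w:t></w:r>
      <w:proofErr w:type="spellEnd"/>
    </w:p>
    <w:p><w:pPr><w:pStyle w:val="Normal"/></w:pPr></w:p>
  </w:body>
</w:wordDocument>"""

paragraphs = extract_paragraphs(SAMPLE)
for style, text in paragraphs:
    print(style, "|", text)
```

In this sketch the heading style is what would let a later step rebuild the section hierarchy. Which styles actually mark sections depends on the document template, so that mapping has to be found by inspecting the XML output of the real documents.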
 