and for each section with a specified label (from the document template) it extracts
the pure text and stores it in a new XML document, as the excerpt in Figure 4.6
shows.
<section title="Measurements">
<subsection title="Stator_Winding">
<measurement title="Visual_Control">
<submeasurement title="Overhang_Support">
<evaluation>
Die Wickelkopfabstützung AS und NS befand sich in einem ...
</evaluation>
<action>Keine</action>
</submeasurement>
...
Fig. 4.6. Excerpt of the XML representation of the documents.
Based on such an XML representation, we create subcorpora of text containing
measurement evaluations of the same type, stored as paragraphs of one to many
sentences.
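As a minimal sketch of how such subcorpora might be assembled, the evaluation texts can be grouped by measurement type using Python's standard library. The element and attribute names (measurement, submeasurement, evaluation, title) are those visible in Figure 4.6; the function name collect_evaluations, the grouping key, and the sample document are assumptions made for illustration and are not taken from the original system.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_evaluations(xml_text):
    """Group evaluation paragraphs by (measurement, submeasurement) title.

    Assumes the element layout of Figure 4.6. The function name and the
    grouping key are illustrative choices, not the original implementation.
    """
    root = ET.fromstring(xml_text.strip())
    subcorpora = defaultdict(list)
    for measurement in root.iter("measurement"):
        for sub in measurement.iter("submeasurement"):
            for evaluation in sub.iter("evaluation"):
                text = (evaluation.text or "").strip()
                if text:
                    key = (measurement.get("title"), sub.get("title"))
                    subcorpora[key].append(text)
    return dict(subcorpora)


# Hypothetical, well-formed sample modeled on Figure 4.6; the evaluation
# text is a placeholder, not content from the real reports.
SAMPLE = """
<section title="Measurements">
  <subsection title="Stator_Winding">
    <measurement title="Visual_Control">
      <submeasurement title="Overhang_Support">
        <evaluation>Erster Beispieltext.</evaluation>
        <action>Keine</action>
      </submeasurement>
    </measurement>
  </subsection>
</section>
"""

if __name__ == "__main__":
    for key, paragraphs in collect_evaluations(SAMPLE).items():
        print(key, paragraphs)
```

Each key then identifies one subcorpus, and its list holds the paragraphs that are later passed to the tagger.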
4.4.2 Tagging
The part-of-speech (POS) tagger that we used, TreeTagger 4 [26], is a probabilistic
tagger with parameter files for several languages: German, English, French,
and Italian. When we encountered a few small problems, the author of the tool was very
cooperative in providing fixes. Nevertheless, our primary interest in using the tagger
was not the POS tagging itself (the parser, as shown in Section 4.4.3, performs both
tagging and parsing), but obtaining stem information (since German has
a very rich morphology) and dividing the paragraphs into sentences (since the sentence
is the unit of operation for the next processing steps).
The tag set used for tagging German is slightly different from that of English. 5
Figure 4.7 shows the output of the tagger for a short sentence. 6
As indicated in Figure 4.7, to create sentences it suffices to find the lines
containing ". $. ." (one sentence contains all the words between two such
lines). In general, this is a very good heuristic, but its accuracy depends on the
nature of the text. For example, while the tagger correctly tagged abbreviations
found in its list of abbreviations (and the list of abbreviations can be customized by
adding abbreviations common to the domain of the text), it got confused when the
same abbreviations were found inside parentheses, as the examples in Figure 4.8 for
the word 'ca.' (circa) show.
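The sentence-splitting heuristic can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the tagger output has one token per line with three tab-separated columns (token, tag, lemma), the function name split_sentences is hypothetical, and the sample lines are hand-written rather than real tagger output.

```python
def split_sentences(tagger_output):
    """Split token/tag/lemma lines into sentences.

    A sentence is closed by a line whose token is "." and whose tag is
    "$.". Assumes three tab-separated columns per line.
    """
    sentences, current = [], []
    for line in tagger_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, tag, lemma = parts
        current.append((token, tag, lemma))
        if token == "." and tag == "$.":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing period line
        sentences.append(current)
    return sentences


# Hand-written illustration of the assumed format, not real tagger output.
SAMPLE_OUTPUT = (
    "Zustand\tNN\tZustand\n"
    "gut\tADJD\tgut\n"
    ".\t$.\t.\n"
    "Keine\tPIAT\tkein\n"
    "Maßnahmen\tNN\tMaßnahme\n"
    ".\t$.\t.\n"
)

if __name__ == "__main__":
    for sentence in split_sentences(SAMPLE_OUTPUT):
        print([lemma for _, _, lemma in sentence])
```

The sketch inherits the weakness described above: whenever the tagger emits a separate sentence-final period line after an abbreviation such as 'ca.' inside parentheses, the sentence is cut at that point.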
If such phenomena occur often, they become a problem for the further correct
processing of sentences, although one becomes aware of such problems only in the
4 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
5 http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html
6 Translation: A generally good external winding condition is present.