Table 4.2. Results of experiments for the contribution of stem information on learning.

Experiment                  Recall   Precision
a) no stems (only words)    90.38    92.32
b) only stems               91.29    93.40
Still, the reason for compiling a list of stems was not to reduce data sparsity caused by word inflections, but to capture word composition, a phenomenon typical of the German language. For example, all the words in the first row of Table 4.3 are compound words that belong to the same semantic category, identified by their last constituent 'wert' (value); i.e., they all denote values of different measured quantities and as such have similar meanings. This similarity cannot be detected by comparing the words in their original form, but it becomes apparent when comparing the stem-based representations in the second row.
Table 4.3. Original words (first row), words composed of stems (second row).

Ableitstromwerte, Gesamtstromwerte, Isolationswiderstandswerte, Isolationsstromwerte, Kapazitätswerte, Ladestromwerte, Stromwerten, Verlustfaktoranfangswert, etc.

Ableit-Strom-Wert, Gesamt-Strom-Wert, Isolation-Widerstand-Wert, Isolation-Strom-Wert, Kapazität-Wert, Lade-Strom-Wert, Strom-Wert, Verlustfaktor-Anfang-Wert, etc.
Unfortunately, there are only a few tools available for morphological analysis of
German words. We tried Morphy [17], which is publicly available, but it was not
able to analyze any of our domain-specific words. Therefore, we had to perform this
task by hand.
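The kind of decomposition performed by hand here can be sketched as a greedy longest-match split against a stem lexicon, backtracking over common German linking elements ("Fugenelemente"). The lexicon entries and linker set below are illustrative assumptions, not the actual hand-built list from this work:

```python
# Greedy decomposition of German compounds into known stems, with
# backtracking over common linking elements ("Fugenelemente").
# STEMS is a hypothetical hand-built sample lexicon, not the actual
# list used in this work.
STEMS = {"ableit", "strom", "wert", "gesamt", "isolation",
         "widerstand", "lade", "verlustfaktor", "anfang"}
LINKERS = ("s", "es", "en", "e", "")   # tried in order; "" last

def split_compound(word):
    """Return the list of stems composing `word`, or None if it cannot
    be fully decomposed with the given lexicon."""
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):   # longest stem prefix first
        if word[:end] in STEMS:
            rest = word[end:]
            for link in LINKERS:
                if rest.startswith(link):
                    tail = split_compound(rest[len(link):])
                    if tail is not None:
                        return [word[:end]] + tail
    return None
```

On the examples of Table 4.3, such a splitter would map 'Ableitstromwerte' to the stems ableit, strom, wert, which can then be rejoined as Ableit-Strom-Wert; a word containing no known stem yields no decomposition.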
4.4.3 Parsing
Syntactic parsing is one of the most important steps in the learning framework, since the produced parse trees serve as input for the creation of features used for learning. Since we are interested in obtaining high-quality parsing results, we experimented with three different parsers: the Stanford parser (Klein 2005), the BitPar parser [27, 25], and the Sleepy parser [7]. What these parsers have in common is that they are all based on unlexicalized probabilistic context-free grammars (PCFGs) [18], are trained on the same German corpus, Negra 7 (or its superset Tiger 8 ), and have publicly available source code. Still, they differ in the degree to which they model certain structural aspects of the German language, in their annotation schemas, and in the infor-
7 http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
8 http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
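The unlexicalized PCFG machinery shared by these parsers can be illustrated with a toy Viterbi CKY decoder over a hand-written grammar in Chomsky normal form. The grammar, probabilities, and example sentence below are invented for illustration (the tag names follow the STTS tagset used by Negra/Tiger); real parsers estimate their rules from the treebank:

```python
import math
from collections import defaultdict

# Toy unlexicalized PCFG in Chomsky normal form; rules and probabilities
# are invented for illustration, not taken from Negra/Tiger.
LEXICAL = {                    # (preterminal, word) -> probability
    ("ART", "der"): 1.0,
    ("NN", "Strom"): 0.5,
    ("NN", "Wert"): 0.5,
    ("VVFIN", "steigt"): 1.0,
}
BINARY = {                     # (parent, (left, right)) -> probability
    ("S", ("NP", "VVFIN")): 1.0,
    ("NP", ("ART", "NN")): 1.0,
}

def viterbi_parse(words):
    """Return the most probable parse tree for `words`, or None."""
    n = len(words)
    # chart[(i, j)][label] = (log-probability, backpointer)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (label, word), p in LEXICAL.items():
            if word == w:
                chart[(i, i + 1)][label] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (parent, (left, right)), p in BINARY.items():
                for k in range(i + 1, j):       # all split points
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        score = (math.log(p) + chart[(i, k)][left][0]
                                 + chart[(k, j)][right][0])
                        best = chart[(i, j)].get(parent, (-math.inf, None))
                        if score > best[0]:
                            chart[(i, j)][parent] = (score, (k, left, right))
    return build_tree(chart, "S", 0, n) if "S" in chart[(0, n)] else None

def build_tree(chart, label, i, j):
    _, back = chart[(i, j)][label]
    if isinstance(back, str):      # preterminal: backpointer is the word
        return (label, back)
    k, left, right = back
    return (label, build_tree(chart, left, i, k),
            build_tree(chart, right, k, j))
```

For the toy sentence "der Strom steigt" this yields the tree (S (NP (ART der) (NN Strom)) (VVFIN steigt)); the structural modeling choices mentioned above (e.g., how the parsers markovize or refine such rules) are exactly where the three parsers diverge.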