The recognition algorithm in its most primitive form consists of matching the
surface of the unknown letter sequence with the corresponding surface (key)
in a full-form lexicon, thus providing access to the relevant lexical description.
2.5.1 MATCHING AN UNANALYZED SURFACE ONTO A KEY
unanalyzed word form surface:   learns
                                  matching
morphosyntactic analysis:       learn/s, [categorization, lemmatization]
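The matching step in 2.5.1 can be sketched as a plain dictionary lookup. This is a minimal illustration, not the implementation described in the text: the lexicon contents, the field names, and the function recognize_full_form are assumptions made for the example.

```python
# Hypothetical full-form lexicon: every complete surface is its own key.
# Entries and field names are illustrative assumptions.
FULL_FORM_LEXICON = {
    "learn": {
        "segmentation": "learn",
        "lemma": "learn",
        "category": "verb, base form",
    },
    "learns": {
        "segmentation": "learn/s",
        "lemma": "learn",
        "category": "verb, 3rd person singular present",
    },
    "learned": {
        "segmentation": "learn/ed",
        "lemma": "learn",
        "category": "verb, past",
    },
}


def recognize_full_form(surface):
    """Match an unanalyzed surface against the lexicon keys.

    Returns the lexical description, or None if the surface is not a key.
    """
    return FULL_FORM_LEXICON.get(surface)


analysis = recognize_full_form("learns")
print(analysis["segmentation"], analysis["lemma"], analysis["category"])
```

In such a design every inflected form needs its own entry, so a form missing from the lexicon simply fails to match.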
In DBS, the recognition algorithm consists of (i) the segmentation of the letter
sequence of a surface into known but unanalyzed parts, (ii) lexical lookup of
these parts in a trie structure, 30 and (iii) composition of the analyzed parts into
well-formed analyzed word forms (cf. FoCL'99, Chap. 14). This requires (i)
an online lexicon for base forms, (ii) allo-rules for deriving different variants
of a morpheme, e.g., wolf and wolv-, before runtime, and (iii) combi-rules for
combining the analyzed allomorphs during runtime.
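The three steps (segmentation, trie lookup, composition) can be sketched in a toy form as follows. This is an illustration under stated assumptions, not the DBS implementation: the classes TrieNode and Trie, the allomorph table, the kind labels, and the rule table COMBI_RULES are all hypothetical, the allomorphs are listed by hand rather than derived by allo-rules, and only stem-plus-one-suffix forms are handled.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.entries = []  # analyses of allomorphs that end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, surface, entry):
        node = self.root
        for ch in surface:
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)

    def prefixes(self, text, start):
        """Yield (end, entry) for each stored allomorph starting at text[start]."""
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            for entry in node.entries:
                yield i + 1, entry


# Assumed allomorph lexicon, standing in for the output of allo-rules
# that would be computed before runtime.
ALLOMORPHS = [
    ("learn", {"lemma": "learn", "kind": "verb_stem"}),
    ("wolf", {"lemma": "wolf", "kind": "noun_stem_sg"}),
    ("wolv", {"lemma": "wolf", "kind": "noun_stem_pl"}),
    ("s", {"lemma": None, "kind": "suffix_s"}),
    ("es", {"lemma": None, "kind": "suffix_es"}),
]

# Assumed combi-rules: which (stem kind, suffix kind) pairs are well-formed.
COMBI_RULES = {
    ("verb_stem", "suffix_s"): "verb, 3rd person singular present",
    ("noun_stem_pl", "suffix_es"): "noun, plural",
}

# Stem kinds that may occur as a word form without a suffix.
FREE_FORMS = {"verb_stem": "verb, base form", "noun_stem_sg": "noun, singular"}

TRIE = Trie()
for surface, entry in ALLOMORPHS:
    TRIE.insert(surface, entry)


def recognize(surface, trie=TRIE):
    """Segment the surface, look up the parts, and compose valid analyses."""
    analyses = []
    for mid, stem in trie.prefixes(surface, 0):
        if mid == len(surface):
            if stem["kind"] in FREE_FORMS:
                analyses.append({"segmentation": surface, "lemma": stem["lemma"],
                                 "category": FREE_FORMS[stem["kind"]]})
            continue
        for end, suffix in trie.prefixes(surface, mid):
            category = COMBI_RULES.get((stem["kind"], suffix["kind"]))
            if end == len(surface) and category:
                analyses.append({
                    "segmentation": surface[:mid] + "/" + surface[mid:],
                    "lemma": stem["lemma"],
                    "category": category,
                })
    return analyses


print(recognize("learns"))
print(recognize("wolves"))
```

Because the trie is walked once per start position, all allomorphs that begin at that position are found in a single pass over the letters. The combi-rules then filter out segmentations such as wolf/es, whose parts are known but do not combine.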
Building such a system of automatic word form recognition for any given
natural language is not particularly difficult, even for writing systems based on
characters, e.g., Chinese and Japanese, rather than letters. Given (i) an online
dictionary of the natural language of choice, (ii) a suitable off-the-shelf software
framework, and (iii) a properly trained computational linguist, an initial
system can be completed in less than six months. 31 It will provide accurate,
highly detailed analyses of about 90% of the word form types in a corpus.
Increasing the recognition rate to approximately 100% is merely a matter
of additional work. 32 It consists of adding missing entries to the online lexicon
and improving the rules for allomorphy and for inflection or agglutination,
derivation, and composition. To maintain a recognition rate of practically
100% over longer periods of time, the system must be serviced continually,
based on an RMD corpus, i.e., a Reference Monitor corpus with a Domain
structure (Sect. 12.2).
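The recognition rate over word form types, and the list of types still missing from the lexicon, could be computed along the following lines. This is a minimal sketch: the function name type_coverage, the toy token list, and the stand-in recognizer are assumptions made for the example, not part of the system described above.

```python
from collections import Counter


def type_coverage(tokens, recognize):
    """Return (rate, unrecognized) for a list of word form tokens.

    rate is the share of distinct word form types that the recognizer
    accepts; unrecognized lists the remaining types, most frequent first,
    as candidates for new lexicon entries or rule fixes.
    """
    counts = Counter(tokens)
    unrecognized = [t for t in counts if not recognize(t)]
    rate = 1 - len(unrecognized) / len(counts) if counts else 0.0
    unrecognized.sort(key=lambda t: (-counts[t], t))
    return rate, unrecognized


# Toy corpus and a stand-in recognizer (both hypothetical).
known = {"learns", "wolves"}
tokens = ["learns", "learns", "wolves", "blorf", "blorf", "blorf", "zzz"]
rate, missing = type_coverage(tokens, lambda t: t in known)
print(rate, missing)
```

Sorting the unrecognized types by frequency is one way to decide which entries to add first when servicing the system against a monitor corpus.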
2.6 Backbone of the Communication Cycle
Automatic word form recognition is the first step of natural language interpretation
in the hear mode. Automatic word form synthesis is the last step of
30 See Knuth (1998, pp. 495-512).
31 This is the standard period of time for writing an MA thesis at the University of Erlangen-Nürnberg.
32 This is in contrast to the statistical method, which does not lend itself to the correction of specific
errors. See FoCL'99, Sect. 15.5.