Digital Signal Processing Reference
more likely that a heuristic feature selection gets stuck in a local optimum given a
larger feature space. Compared to BoW, the requirements on the automatic speech or
singing recognition system are higher, as it has to recognise more consecutive words
accurately.
6.3.3 Bag of Character N-grams
N-grams can also be created on the character level by observing N-grams of characters
instead of words. This leads to Bag of Character N-grams (BoCNG). Like BoW and
BoNG, these are based on a mapping from text to a numeric feature space. Successes of
BoCNG were reported in the field of (spoken) document retrieval [79] and affect
recognition [80]. As for BoNG, N-grams of different lengths can be observed in
combination, determined by a minimum string length of c_min characters
and a maximum string length of c_max characters. Word boundaries can optionally be
ignored. For each character N-gram, a mapping to a numeric feature is realised as for
words in BoW. Because observation of N-grams at character level naturally results in
considerably more possible features than for BoW, more 'aggressive' stopping can
be used to discard rare strings.
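The extraction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the cited work: the function names, the parameter min_count (a simple count threshold standing in for 'stopping'), and the choice to drop whitespace when word boundaries are ignored are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, c_min=2, c_max=4, ignore_word_boundaries=False):
    """Return all character N-grams of length c_min..c_max (inclusive)."""
    text = text.lower()
    if ignore_word_boundaries:
        # Assumption: whitespace is dropped, so N-grams may span adjacent words.
        units = ["".join(text.split())]
    else:
        # N-grams are extracted within each whitespace-separated word only.
        units = text.split()
    grams = []
    for unit in units:
        for n in range(c_min, c_max + 1):
            for i in range(len(unit) - n + 1):
                grams.append(unit[i:i + n])
    return grams


def build_vocabulary(documents, min_count=2, **kwargs):
    """Map each sufficiently frequent N-gram to a feature index.

    min_count is a hypothetical stand-in for 'stopping': N-grams seen
    fewer than min_count times in the corpus are discarded.
    """
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, **kwargs))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: idx for idx, g in enumerate(kept)}


def vectorise(text, vocabulary, **kwargs):
    """Count occurrences of each vocabulary N-gram in text."""
    vec = [0] * len(vocabulary)
    for g in char_ngrams(text, **kwargs):
        idx = vocabulary.get(g)
        if idx is not None:
            vec[idx] += 1
    return vec


vocab = build_vocabulary(["happy", "unhappy"], min_count=2, c_min=3, c_max=3)
features = vectorise("happy", vocab, c_min=3, c_max=3)
```

Raising min_count shrinks the feature space quickly, which is the practical reason a stricter threshold is attractive at character level.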
BoCNG has some interesting characteristics. Stemming on word level is implicitly
modelled by using N-grams of characters: one or more words can be mapped
to a base form if they contain similar character substrings. In contrast to BoW,
the BoCNG approach has a finer resolution because it observes the character level. Given
successful feature selection, only strings of relevant lengths are kept in the feature
space. Further, BoCNG can handle unseen compound words if these consist of
substrings contained in the feature space. This may be relevant for 'open-vocabulary'
languages such as German, which allow the formation of long compound words.
Instead of characters in the sense of graphemes, phonemes from the ASR engine can
be used, which may lead to an improvement [79].
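Two of these characteristics, implicit stemming and the handling of unseen compounds, can be made concrete with a small Python sketch. The helper char_ngram_set and the example words are hypothetical choices for illustration, with c_min = 3 and c_max = 4 assumed.

```python
def char_ngram_set(word, c_min=3, c_max=4):
    """Return the set of character N-grams of a single word."""
    word = word.lower()
    return {word[i:i + n]
            for n in range(c_min, c_max + 1)
            for i in range(len(word) - n + 1)}


# Implicit stemming: inflected forms share N-grams of their common stem.
shared = char_ngram_set("singing") & char_ngram_set("singer")

# Unseen compound: the feature space is built from 'haus' and 'boot' only,
# yet the German compound 'hausboot' still activates known features.
known = char_ngram_set("haus") | char_ngram_set("boot")
covered = char_ngram_set("hausboot") & known

print(sorted(shared))
print(sorted(covered))
```

The N-grams that straddle the joint of the compound (such as 'usb') are not in the feature space and are simply ignored, while the parts that were seen still contribute.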
Other feature variants are also conceivable and in use, such as
N-grams of syllables. Compared to character N-grams, the vocabulary size, and thus
the number of combinations for higher N, is significantly reduced.
6.3.4 On-Line Knowledge
Apart from the data-driven approaches for linguistic analysis introduced so far,
open-domain methods can be applied which are based on knowledge sources (e.g., [81, 82]).
On-line knowledge sources are publicly available on the Internet. In natural
language processing, such databases provide linguistic knowledge, such as information
on words, concepts, or phrases, as well as on connections among them. Connections
among such entities (again referred to as words or terms in the following,
independently of their type) include common-sense knowledge or lexical relations. Various
 