Spectral band energies and energy densities, e.g., for the following seven
octave-based frequency intervals: 0-200 Hz, 200-400 Hz, 400-800 Hz, 800 Hz-1.6 kHz,
1.6-3.2 kHz, 3.2-6.4 kHz, and 6.4-12.8 kHz.
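The band energies listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single unwindowed frame, a naive DFT, and it takes "energy density" to mean band energy divided by band width in Hz. The helper names power_spectrum and band_energies are hypothetical.

```python
import cmath
import math

# Octave-based band edges in Hz, taken from the seven intervals listed above.
BAND_EDGES_HZ = (0, 200, 400, 800, 1600, 3200, 6400, 12800)


def power_spectrum(frame, sample_rate):
    """Naive DFT power spectrum |X[k]|^2 for bins k = 0 .. N/2 (O(N^2), sketch only)."""
    n = len(frame)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def band_energies(freqs, power, edges=BAND_EDGES_HZ):
    """Sum spectral power per band [lo, hi); bins outside all bands are ignored.

    'Energy density' is assumed here to mean band energy per Hz of band width.
    """
    bands = list(zip(edges[:-1], edges[1:]))
    energies = [sum(p for f, p in zip(freqs, power) if lo <= f < hi) for lo, hi in bands]
    densities = [e / (hi - lo) for e, (lo, hi) in zip(energies, bands)]
    return energies, densities


if __name__ == "__main__":
    # 1 kHz sine, 32 kHz sampling, 320-sample frame: bin spacing is exactly 100 Hz.
    sr = 32000
    frame = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(320)]
    energies, densities = band_energies(*power_spectrum(frame, sr))
    print([round(e, 3) for e in energies])
```

For the test tone, practically all energy falls into the fourth band (800 Hz-1.6 kHz). In practice one would window the frame and use an FFT; the naive DFT only keeps the sketch dependency-free.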
Further, a set of LLDs is standardised in the MPEG-7 standard for audio analysis.¹ This
set is often used for music analysis, but it is also well suited to general sound analysis.
The LLDs are audio power; audio spectrum centroid and spread; fundamental frequency;
harmonicity; log attack time; harmonic spectral centroid, deviation, spread, and
variation; temporal centroid; spectral centroid; and audio spectrum envelope, flatness,
projection, and basis.
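Two of the descriptors named above, spectrum centroid and spread, can be illustrated with a simplified sketch. It assumes a linear frequency axis and a precomputed power spectrum; the MPEG-7 AudioSpectrumCentroid and AudioSpectrumSpread descriptors are defined on a logarithmic frequency scale, so values from this sketch will differ from standard-conformant ones. The function name is hypothetical.

```python
import math


def spectral_centroid_spread(freqs, power):
    """Power-weighted mean frequency (centroid) and standard deviation around it (spread).

    Simplified linear-frequency version; MPEG-7 defines these descriptors
    on a logarithmic frequency scale instead.
    """
    total = sum(power)
    if total == 0:
        return 0.0, 0.0  # silent frame: both defined as zero (an assumption)
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    variance = sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    return centroid, math.sqrt(variance)


if __name__ == "__main__":
    centroid, spread = spectral_centroid_spread([100.0, 200.0, 300.0], [1.0, 2.0, 1.0])
    print(centroid, round(spread, 2))  # 200.0 70.71
```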
6.3 Textual Descriptors
For some tasks, especially the recognition of emotion and of speaker states and traits,
the spoken content is important. The acoustic LLDs described in the previous
sections contain information only on 'how' something is said, not on 'what'
is being said. In real applications, the sequence of spoken words has to be obtained
by automatic speech recognition (ASR). To assess the maximum gain in recognition
performance that a system can reach by considering the textual content, most
experiments use a manually transcribed ground truth instead.
Several such studies have shown that linguistic analysis of spoken (or
sung) text can complement the acoustic information and thus improve the combined
recognition performance, e.g., in emotion recognition from speech [54-57] or music
mood recognition [58].
This section presents different approaches to linguistic analysis. Although they
were mostly established for processing textual strings such as words, or chord
sequences in music, any other information that can be represented as a string of
symbolic entities can be modelled in a similar fashion [59]. In the following, for
the sake of simplicity, we speak of 'words' consisting of 'characters' to denote
the basic string units of analysis.
Often, only a fraction of these words conveys information relevant to the target
task, and many words are similar or related in meaning. To reduce the information
in a meaningful way, two methods are usually applied: stopping and stemming.
Stopping is the exclusion of words from the vocabulary because of their low relevance
in the context of the analysis. It is usually performed either by expert rules, such as
the exclusion of function words, or by a data-driven evaluation. A popular data-driven
method is to require a minimum frequency f_min in the database for a word to become
part of the vocabulary. Rare words are thus discarded. However, frequently appearing
function words, which may be irrelevant in many search tasks, are retained. Therefore,
an additional data-driven feature selection based on a suitable criterion, such as
information gain, can be used.
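The stopping procedure described above can be sketched as follows: an expert stop list plus a minimum corpus frequency f_min builds the vocabulary, and information gain then ranks the surviving words. This is a minimal sketch, not the author's implementation; the function names, the toy corpus, and the choice of a binary "word occurs in document" feature for computing information gain are illustrative assumptions.

```python
import math
from collections import Counter


def build_vocabulary(documents, f_min=2, stop_words=frozenset()):
    """Stopping: keep words seen at least f_min times, minus an expert stop list."""
    counts = Counter(word for doc in documents for word in doc)
    return {w for w, c in counts.items() if c >= f_min and w not in stop_words}


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(documents, labels, word):
    """IG of the binary feature 'word occurs in the document' w.r.t. the labels."""
    present = [y for doc, y in zip(documents, labels) if word in doc]
    absent = [y for doc, y in zip(documents, labels) if word not in doc]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - conditional


def select_by_information_gain(documents, labels, vocabulary, k):
    """Return the k vocabulary words with the highest IG (ties broken alphabetically)."""
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(documents, labels, w), w))
    return ranked[:k]


if __name__ == "__main__":
    docs = [["i", "am", "so", "very", "happy"], ["i", "am", "happy"],
            ["i", "am", "so", "sad"], ["i", "am", "sad"]]
    labels = ["pos", "pos", "neg", "neg"]
    vocab = build_vocabulary(docs, f_min=2, stop_words={"i"})
    print(sorted(vocab))  # ['am', 'happy', 'sad', 'so']
    print(select_by_information_gain(docs, labels, vocab, k=2))  # ['happy', 'sad']
```

In the toy corpus, the rare word "very" is removed by f_min and "i" by the stop list, while "am" and "so" pass the frequency threshold but carry zero information gain, which mirrors the observation that frequent function words survive frequency-based stopping.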
¹ ISO/IEC JTC 1/SC 29/WG 11 N7708.