Digital Signal Processing Reference
below the turn level or complete turns, etc. [ 4 ]. For music, these can be beats, single or multiple consecutive bars, and parts such as the chorus or bridge. Obviously, higher-level chunking requires suitable pre-analysis such as audio activity detection, voicing analysis, or complex structural analysis (see Sect. 6.1.3 for a discussion).
Supra-segmental analysis and (hierarchical) functional extraction: Next, the
method of segment level analysis has to be defined. If—as mentioned in the previous
section—a classifier operates directly on the LLD frames, either dynamic approaches
have to be used, or the frame-wise results have to be combined to a single segment
level result (late fusion, cf. below). Alternatively, or additionally, LLD feature vectors
can be combined into a single feature vector per segment, and then only a single
classification result is obtained. We refer to this method as 'supra-segmental' analysis.
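As a minimal illustration of the late-fusion alternative mentioned above, the frame-wise classifier outputs can be combined into one segment-level decision, for instance by averaging the per-frame class posteriors (a simple sum/mean rule; the function name and toy values below are illustrative, not from the source):

```python
import numpy as np

def late_fusion(frame_probs):
    """Combine frame-wise class posteriors into a single segment-level
    decision by averaging the probabilities over all frames."""
    frame_probs = np.asarray(frame_probs)      # shape: (n_frames, n_classes)
    segment_probs = frame_probs.mean(axis=0)   # mean posterior per class
    return int(np.argmax(segment_probs))       # winning class index

# Toy example: three frames, two classes; class 1 wins on average.
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(late_fusion(probs))  # -> 1
```

Majority voting over frame-wise hard decisions is an equally common variant of this late-fusion step.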
In case that the length of all segments is constant, we can concatenate all LLD
feature vectors within the segment to a single, higher-dimensional feature vector. If
the length varies (e.g., for sentences, beats or bars in music, etc.), this approach is not
feasible, as the dimensionality of the resulting high-dimensional vector will not be
constant—which is usually required by classifiers. In this case, it is common practice
to summarise the LLD feature vectors by applying 'functionals' to them. These can
be statistical descriptors such as mean or standard deviation; in this case, information
from a pre-trained Gaussian (mixture) model of the features can be used to obtain
more robust estimates ('universal background model' approach). Other commonly
used statistics of the feature distribution comprise percentiles and higher moments.
Furthermore, one can compute descriptors related to the temporal evolution of the
LLDs, such as statistics of peaks (number, distances, etc.), spectrum (e.g., DCT
coefficients) or autoregressive coefficients. The result is a feature vector per segment
with a constant dimensionality d = N_LLD · N_func, where N_LLD and N_func are the
numbers of LLDs and functionals, respectively. This method of summarisation can
also be repeated on higher levels, i.e., 'functionals of functionals' can be computed,
etc. This leads to a hierarchical representation, referred to as analytical features [ 5 ]
and feature brute-forcing [ 6 , 7 ].
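The mapping from a variable-length LLD sequence to a fixed-length supra-segmental vector can be sketched as follows; the four functionals chosen here (mean, standard deviation, 95th percentile, range) are illustrative stand-ins for the statistical descriptors discussed above:

```python
import numpy as np

def apply_functionals(lld_frames):
    """Map a variable-length sequence of LLD frames (n_frames x N_LLD)
    to one fixed-length feature vector of size d = N_LLD * N_func,
    here with N_func = 4 example functionals."""
    x = np.asarray(lld_frames)
    funcs = [
        x.mean(axis=0),                # arithmetic mean
        x.std(axis=0),                 # standard deviation
        np.percentile(x, 95, axis=0),  # 95th percentile
        x.max(axis=0) - x.min(axis=0), # range (max minus min)
    ]
    return np.concatenate(funcs)

# Segments of different length yield vectors of identical dimensionality:
seg_a = np.random.randn(50, 3)   # 50 frames, N_LLD = 3
seg_b = np.random.randn(80, 3)   # 80 frames, same N_LLD
print(apply_functionals(seg_a).shape)  # -> (12,)
print(apply_functionals(seg_b).shape)  # -> (12,)
```

Applying the same function again to a sequence of such segment-level vectors would yield the 'functionals of functionals' hierarchy described above.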
Feature reduction: As in any other pattern recognition task, reducing the parameter space to those parameters that are most highly correlated with the classification problem of interest is beneficial in terms of classification accuracy, model complexity, and speed.
In this step, the feature space is transformed in order to reduce the covariance between features in the new space, usually by a translation to the origin of the original feature space and a rotation that removes covariances outside the main diagonal of the covariance matrix. This is typically achieved by Principal Component Analysis (PCA) [ 8 ]. Linear Discriminant Analysis (LDA) additionally employs target
information (usually discrete class labels) to maximise the distance between class
centres and minimise the dispersion within each class. Next, a reduction takes place by selecting a limited number of features in the new space; in the case of PCA and LDA, by choosing the components with the highest corresponding eigenvalues. These reduced features still require extraction of all features in the original space; in the case of principal components, this is because each component in the new space is a linear combination of all original features.
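The PCA steps just described (translation to the origin, rotation onto the eigenvectors of the covariance matrix, selection of the top components) can be sketched in a few lines of NumPy; this is a didactic sketch, not the book's implementation:

```python
import numpy as np

def pca_reduce(X, k):
    """Centre the data (translation to the origin), rotate onto the
    eigenvectors of the covariance matrix, and keep the k components
    with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                 # translation to the origin
    cov = np.cov(Xc, rowvar=False)          # covariance in original space
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the top-k eigenvalues
    return Xc @ eigvecs[:, order]           # rotation plus selection

X = np.random.randn(200, 10)   # 200 samples, 10 original features
Z = pca_reduce(X, 3)
print(Z.shape)  # -> (200, 3)
```

Note that every retained component mixes all ten original features, which is exactly why all original features must still be extracted even after the reduction.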