Digital Signal Processing Reference
below the turn level or complete turns, etc. [ 4 ]. For music, these can be beats, single or multiple consecutive bars, and parts such as the chorus or bridge. Obviously, higher-level chunking requires suitable pre-analysis such as audio activity detection, voicing analysis, or complex structural analysis (see Sect. 6.1.3 for a discussion).
Supra-segmental analysis and (hierarchical) functional extraction: Next, the
method of segment level analysis has to be defined. If—as mentioned in the previous
section—a classifier operates directly on the LLD frames, either dynamic approaches
have to be used, or the frame-wise results have to be combined to a single segment
level result (late fusion, cf. below). Alternatively, or additionally, LLD feature vectors
can be combined into a single feature vector per segment, and then only a single
classification result is obtained. We refer to this method as 'supra-segmental' analysis.
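As a minimal illustration of the late-fusion alternative mentioned above, the frame-wise classifier outputs can be combined into one segment-level decision, for instance by averaging the per-frame class posteriors (a simple sum/mean rule; the function name and toy values below are illustrative, not from the source):

```python
import numpy as np

def late_fusion(frame_probs):
    """Combine frame-wise class posteriors into a single segment-level
    decision by averaging the probabilities over all frames."""
    frame_probs = np.asarray(frame_probs)      # shape: (n_frames, n_classes)
    segment_probs = frame_probs.mean(axis=0)   # mean posterior per class
    return int(np.argmax(segment_probs))       # winning class index

# Toy example: three frames, two classes; class 1 wins on average.
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(late_fusion(probs))  # -> 1
```

Majority voting over frame-wise hard decisions is an equally common variant of this late-fusion step.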
In case that the length of all segments is constant, we can concatenate all LLD
feature vectors within the segment to a single, higher-dimensional feature vector. If
the length varies (e.g., for sentences, beats or bars in music, etc.), this approach is not
feasible, as the dimensionality of the resulting high-dimensional vector will not be
constant—which is usually required by classifiers. In this case, it is common practice
to summarise the LLD feature vectors by applying 'functionals' to them. These can
be statistical descriptors such as mean or standard deviation; in this case, information
from a pre-trained Gaussian (mixture) model of the features can be used to obtain
more robust estimates ('universal background model' approach). Other commonly
used statistics of the feature distribution comprise percentiles and higher moments.
Furthermore, one can compute descriptors related to the temporal evolution of the
LLDs, such as statistics of peaks (number, distances, etc.), spectrum (e.g., DCT
coefficients) or autoregressive coefficients. The result is a feature vector per segment
with a constant dimensionality d = N_LLD · N_func, where N_LLD and N_func are the
numbers of LLDs and functionals, respectively. This method of summarisation can
also be repeated on higher levels, i.e., 'functionals of functionals' can be computed,
etc. This leads to a hierarchical representation, referred to as analytical features [ 5 ]
and feature brute-forcing [ 6 , 7 ].
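The mapping from a variable-length LLD sequence to a fixed-length supra-segmental vector can be sketched as follows; the four functionals chosen here (mean, standard deviation, 95th percentile, range) are illustrative stand-ins for the statistical descriptors discussed above:

```python
import numpy as np

def apply_functionals(lld_frames):
    """Map a variable-length sequence of LLD frames (n_frames x N_LLD)
    to one fixed-length feature vector of size d = N_LLD * N_func,
    here with N_func = 4 example functionals."""
    x = np.asarray(lld_frames)
    funcs = [
        x.mean(axis=0),                # arithmetic mean
        x.std(axis=0),                 # standard deviation
        np.percentile(x, 95, axis=0),  # 95th percentile
        x.max(axis=0) - x.min(axis=0), # range (max minus min)
    ]
    return np.concatenate(funcs)

# Segments of different length yield vectors of identical dimensionality:
seg_a = np.random.randn(50, 3)   # 50 frames, N_LLD = 3
seg_b = np.random.randn(80, 3)   # 80 frames, same N_LLD
print(apply_functionals(seg_a).shape)  # -> (12,)
print(apply_functionals(seg_b).shape)  # -> (12,)
```

Applying the same function again to a sequence of such segment-level vectors would yield the 'functionals of functionals' hierarchy described above.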
Feature reduction: As in any other pattern recognition task, reducing the parameter space to those parameters that are most highly correlated with the classification problem of interest is beneficial in terms of classification accuracy, model complexity, and speed.
In this step, the feature space is transformed in order to reduce the covariance between features in the new space, usually by a translation to the origin of the original feature space and a rotation that removes covariances outside the main diagonal of the covariance matrix. This is typically achieved by Principal Component Analysis (PCA) [ 8 ]. Linear Discriminant Analysis (LDA) additionally employs target
information (usually discrete class labels) to maximise the distance between class
centres and minimise the dispersion within each class. Next, a reduction takes place by selecting a limited number of features in the new space; in the case of PCA and LDA, by choosing the components with the highest corresponding eigenvalues. These reduced features still require extraction of all features in the original space; in the case of principal components, this is because each component in the new space is a linear combination of all original features.
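The PCA steps just described (translation to the origin, rotation onto the eigenvectors of the covariance matrix, selection of the top components) can be sketched in a few lines of NumPy; this is a didactic sketch, not the book's implementation:

```python
import numpy as np

def pca_reduce(X, k):
    """Centre the data (translation to the origin), rotate onto the
    eigenvectors of the covariance matrix, and keep the k components
    with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                 # translation to the origin
    cov = np.cov(Xc, rowvar=False)          # covariance in original space
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the top-k eigenvalues
    return Xc @ eigvecs[:, order]           # rotation plus selection

X = np.random.randn(200, 10)   # 200 samples, 10 original features
Z = pca_reduce(X, 3)
print(Z.shape)  # -> (200, 3)
```

Note that every retained component mixes all ten original features, which is exactly why all original features must still be extracted even after the reduction.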