all further analysis if some should be deleted. As for insertions and substitutions,
these are only critical if they change the 'tone' of the content.
For the alternative processing of written text, some text pre-processing will usually
be needed. First, delimiters such as punctuation can be used for segmentation. Then,
capital letters are often de-capitalised to avoid double entries for the same word. Finally,
it may be reasonable to allow for some word replacement rules or the calculation of the edit
distance between written words and their counterparts in the vocabulary. This may
cover misspellings of words or varieties such as British, American,
or Australian English (e.g., [68]).
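To make these steps concrete, the following minimal Python sketch segments at punctuation, de-capitalises, applies a replacement rule, and maps unknown words to their nearest vocabulary entry by edit distance. The vocabulary, the replacement rule, and the distance threshold are hypothetical placeholders, not values from the text.

```python
import re

# Hypothetical example vocabulary and spelling-variant replacement rules
VOCABULARY = {"colour", "analysis", "great", "movie"}
REPLACEMENTS = {"color": "colour"}  # e.g., American -> British spelling

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def preprocess(text: str, max_dist: int = 1) -> list[str]:
    # 1) Segment at punctuation and whitespace, 2) de-capitalise
    tokens = [t.lower() for t in re.split(r"[\s.,;:!?]+", text) if t]
    words = []
    for t in tokens:
        t = REPLACEMENTS.get(t, t)           # 3a) word replacement rules
        if t not in VOCABULARY:              # 3b) nearest vocabulary entry
            best = min(VOCABULARY, key=lambda v: edit_distance(t, v))
            if edit_distance(t, best) <= max_dist:
                t = best
        words.append(t)
    return words

print(preprocess("Color analysis: a grat movie!"))
# -> ['colour', 'analysis', 'a', 'great', 'movie']
```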
We will next look at different methods for generating linguistic features.
6.3.1 Bag of Words
The basic idea behind Bag of Words (BoW) is the representation of symbolic information
in a numeric feature space. Each feature thereby represents the occurrence of a
specific 'word', i.e., a symbolic entity, in the string under analysis. BoW, originally
developed for document retrieval [69], was successfully applied to the fields of
emotion [57] and interest (cf. [70, 71]) recognition from text and speech, and became a
popular approach in these fields [62, 72]. The recognition is often based on speech
turns or larger segments, such as paragraphs or the entire lyrics of a song. Every
such sequence $\mathcal{S}$ can be described by the set of its contained word entities $w_i$, i.e.,
$\mathcal{S} = \{w_1, \ldots, w_S\}$, where $S = |\mathcal{S}|$ is the sequence length. The BoW method considers
these words $w_i$ as units of interest. For a given training set $\mathcal{L}$, all different words
build the word inventory, the 'vocabulary' $\mathcal{V} = \{w_1, \ldots, w_V\}$, with $V = |\mathcal{V}|$ being
the size of this vocabulary. Particularly in spoken or sung language analysis,
non-linguistic vocalisations like sighs and yawns [73], laughs [74, 75], cries [76],
and coughs [77] can also be integrated into such a vocabulary [62, 70] for the decoding
of speech [78] or singing.
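As an illustration of the vocabulary construction, here is a minimal sketch. The two training sequences are hypothetical, and the `*laugh*` token merely indicates how a non-linguistic vocalisation could enter the vocabulary as an ordinary entry.

```python
# Hypothetical pre-processed training sequences; non-linguistic
# vocalisations such as *laugh* can simply be added as tokens.
training_set = [
    ["this", "movie", "was", "great", "*laugh*"],
    ["what", "a", "great", "great", "song"],
]

# The vocabulary V collects all different words of the training set;
# a fixed index i is assigned to each word w_i.
vocabulary = sorted({w for sequence in training_set for w in sequence})
index = {w: i for i, w in enumerate(vocabulary)}
V = len(vocabulary)  # V = |V|, the size of the vocabulary
print(V, vocabulary)
```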
For each word $w_i$, $i \in \{1, \ldots, V\}$, in the vocabulary, a corresponding feature
$x_i$ is created. This may easily lead to a high-dimensional feature vector space. Each
sequence $\mathcal{S}_j$ can then be mapped to a vector $x_j$ in this feature space. Ways to determine
the value of each feature $x_i$ include, first, counting the number of occurrences of the
word $w_i$ in the sentence $\mathcal{S}_j$, resulting in the word frequency $f_{i,j}$. As a simplification,
the binary occurrence (or non-occurrence) of a word can be used. The 'term frequency'
can also be transformed in other ways (cf. [69]), for example by application of the
logarithm, giving the term frequency transformation (TF):
$$\mathrm{TF}_{i,j} = \log\left(c + f_{i,j}\right), \qquad (6.80)$$
where the offset parameter $c$ prevents definition problems in case of $f_{i,j} = 0$. It is
often set to $c = 1$.
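Continuing the sketch, each sequence $\mathcal{S}_j$ can be mapped to a $V$-dimensional vector. The three variants below are the raw word frequency $f_{i,j}$, its binary simplification, and the logarithmic TF of Eq. (6.80) with the default offset $c = 1$:

```python
import math

def word_frequencies(sequence, index):
    """f_{i,j}: occurrence count of each vocabulary word w_i in sequence j."""
    f = [0] * len(index)
    for w in sequence:
        if w in index:                   # out-of-vocabulary words are ignored
            f[index[w]] += 1
    return f

def bow_vector(sequence, index, mode="tf", c=1.0):
    f = word_frequencies(sequence, index)
    if mode == "binary":                 # binary occurrence / non-occurrence
        return [1.0 if fi > 0 else 0.0 for fi in f]
    if mode == "tf":                     # Eq. (6.80): TF = log(c + f_{i,j})
        return [math.log(c + fi) for fi in f]
    return [float(fi) for fi in f]       # raw word frequency f_{i,j}
```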
Another measure is the inverse document frequency transformation (IDF). For $|\mathcal{L}|$
as the number of sequences in the training set $\mathcal{L}$, and $L_i$ as the number
of sentences in which the word $w_i$ appears, the IDF transformation is given by:

$$\mathrm{IDF}_{i,j} = f_{i,j}\,\log\frac{|\mathcal{L}|}{L_i}. \qquad (6.81)$$
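A sketch of this weighting, continuing the TF example above; note that several IDF variants exist in the literature, and returning a zero weight for words never seen in training is an assumption of this sketch rather than part of the text:

```python
import math

def idf_weights(training_set, index):
    """IDF weight log(|L| / L_i), with L_i = number of sequences containing w_i."""
    L = len(training_set)                # |L|: number of training sequences
    L_i = [0] * len(index)
    for sequence in training_set:
        for w in set(sequence):          # count each sequence at most once
            if w in index:
                L_i[index[w]] += 1
    # Assumption: words with L_i = 0 receive weight 0 to avoid division by zero.
    return [math.log(L / li) if li > 0 else 0.0 for li in L_i]

def idf_vector(sequence, index, idf):
    # IDF-weighted term frequency: f_{i,j} * log(|L| / L_i)
    f = word_frequencies(sequence, index)  # from the TF sketch above
    return [fi * wi for fi, wi in zip(f, idf)]
```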
 