Database, this schema has generally been retained. However, categories without sufficient audio instances were discarded or, in the case of the sound type 'birds', clustered with 'animals'. This procedure leaves the following seven common categories out of the 16 original cover classes [5]:
People : 45 different human behaviours, such as biting, a baby's crying, coughing, laughing, moaning, kissing, etc.
Animals (including birds): 69 different non-bird animals (such as cat, frog, bear,
lamb, etc.) and 16 kinds of birds (such as blackbird, etc.)
Nature : 19 kinds of sounds from the natural environment, for instance, earthquake, ocean waves, flame, rain, wind, etc.
Vehicles : 34 different types of vehicles and their behaviours, such as motorcycling,
braking, helicopter, closing (vehicle) door, etc.
Noisemakers : 13 different events in this domain, such as alarm, bell, whistle, horn, etc.
Office : office space sound events including keyboard typing, printing, telephoning,
mouse clicking, etc.
Musical Instruments : 62 various musical instruments, such as bass, drum, synthesiser, etc.
All audio files were converted to raw 16-bit encoding, single channel, at a 16 kHz sampling rate. This was needed to unify the various formats and rates used in the original versions as retrieved from the web. Each of the sound clips lasts between 1 s and 10 s. Roughly 15 hours of recording time and 16 937 instances were obtained in total, covering 276 sub-categories of real-life sound events. This set will be referred to as the FindSounds database in the following. Details on the distribution of FindSounds' instances and the total play time per category are summarised in Table 5.5. Note that, owing to the sheer size of the database, the categorisation was not counter-checked, i.e., the gold standard is based on the categorisation found on the web, which, according to [35], was created by experts.
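
To illustrate this unification step, the following is a minimal sketch of such a conversion in Python, assuming the librosa and soundfile packages are available; the helper name and file paths are hypothetical, as the actual tool chain used to prepare the database is not specified here:

    import librosa
    import soundfile as sf

    def unify_audio(in_path, out_path, target_sr=16000):
        # librosa returns a mono float waveform resampled to target_sr
        audio, sr = librosa.load(in_path, sr=target_sr, mono=True)
        # write signed 16-bit PCM samples; a WAV container is used here
        # for convenience, while the text refers to raw (headerless) data
        sf.write(out_path, audio, sr, subtype='PCM_16')

    unify_audio('clip_original.mp3', 'clip_16k_mono.wav')
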
5.3.3.2 Emotional FindSounds Database
As was shown in the last section, the FindSounds database is well suited for sound event classification. If one additionally aims at recognising the emotion evoked in a listener by a sound, an additional annotation is needed, as described in [36]. As was seen for the annotation of music mood above, a typical problem in general emotion recognition is the selection of a suitable emotion representation model [37, 38]. For the recognition of the emotion evoked in a human listener by sound, Thayer's frequently encountered 2-D model [22] with valence and arousal as dimensions is again adopted. To account for the divergence between individual labellers, the evaluator weighted estimator (EWE) is used as gold standard; it can improve the robustness of sound emotion recognition (here regression) results by making the gold standard more consistent.
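
Concretely, the EWE weights each labeller's ratings by that labeller's reliability, commonly estimated as the correlation of their ratings with the mean ratings of the remaining labellers, before averaging. A minimal NumPy sketch of this computation follows, assuming ratings are stored as an instances-by-labellers matrix; the function name and the example values are hypothetical:

    import numpy as np

    def evaluator_weighted_estimator(ratings):
        # ratings: (n_instances, n_labellers), e.g. valence on [-1, 1]
        ratings = np.asarray(ratings, dtype=float)
        n_labellers = ratings.shape[1]
        weights = np.empty(n_labellers)
        for k in range(n_labellers):
            # reliability of labeller k: correlation with the others' mean
            others = np.delete(ratings, k, axis=1).mean(axis=1)
            weights[k] = np.corrcoef(ratings[:, k], others)[0, 1]
        # discard anti-correlated labellers
        # (assumes at least one labeller has a positive weight)
        weights = np.clip(weights, 0.0, None)
        # weighted average per instance
        return ratings @ weights / weights.sum()

    # hypothetical example: 4 labellers rate 3 sound clips for valence
    valence = np.array([[ 0.2,  0.3,  0.1,  0.9],
                        [-0.5, -0.4, -0.6,  0.0],
                        [ 0.7,  0.8,  0.6,  0.2]])
    print(evaluator_weighted_estimator(valence))
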
To build the 'Emotional FindSounds Database', instances were chosen from the large FindSounds database described above. 390 sound files were