5 Experimental Setup
Before describing the experiments carried out and discussing the corresponding results, we first present the database used (Section 5.1), the particular feature extraction process, the kind of classifier, and the parameters of the search algorithms used to illustrate this chapter.
5.1 Available Database
The sound database used for the experiments consisted of a total of 2,627 files, with a
length of 2.5 seconds each. The sampling frequency was 22,050 Hz with 16 bits per
sample. The files correspond to the following categories: speech, music and noise.
Noise sources were varied, including those corresponding to the following environ-
ments: aircraft, bus, cafe, car, kindergarten, living room, nature, school, shop, sports,
traffic, train, and train station. Music files were both vocal and instrumental. The files
with speech in noise presented different Signal to Noise Ratios (SNRs) ranging from
0 to 10 dB.
The database has been divided into three different sets for training, validation and
test, including 943 (35%), 405 (15%) and 1,279 (50%) files respectively. The division
has been made randomly, ensuring that the relative proportion of files of each cate-
gory is preserved for each set.
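The stratified random split described above can be sketched in Python as follows (a minimal illustration; the function name, the per-class file dictionary, and the random seed are assumptions, not part of the original experimental code):

```python
import random

def stratified_split(files_by_class, fractions=(0.35, 0.15, 0.50), seed=0):
    """Randomly split files into train/validation/test sets while
    preserving the relative proportion of each category in every set."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, files in files_by_class.items():
        files = list(files)
        rng.shuffle(files)
        n = len(files)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        train += [(f, label) for f in files[:n_train]]
        val += [(f, label) for f in files[n_train:n_train + n_val]]
        test += [(f, label) for f in files[n_train + n_val:]]
    return train, val, test
```

Because the split is done per category, each of the three sets keeps roughly the same class proportions as the full database.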
5.2 Feature Extraction Stage
As described in Section 3.2, the particular feature extraction carried out in the ex-
periments may be summarized as follows:
1. The input signal is divided into frames with a length of 512 samples (23.22 ms at the considered sampling frequency), with no overlap between adjacent frames.
2. The Discrete Cosine Transform (DCT) is computed [14].
3. All considered features are calculated.
4. Finally, the mean and standard deviation are computed every 2.5 seconds in order to summarize the frame-level values.
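Assuming NumPy is available, the framing, transform, and aggregation steps above can be sketched as follows (the DCT-II is built from an explicit transform matrix for clarity, and all function names are illustrative, not taken from the original implementation):

```python
import numpy as np

FS = 22050       # sampling frequency (Hz)
FRAME_LEN = 512  # 512 samples, i.e. about 23.22 ms at 22,050 Hz

def frame_signal(x, frame_len=FRAME_LEN):
    """Split the signal into non-overlapping frames, dropping any tail
    shorter than one frame."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def dct_ii(frame):
    """DCT-II of one frame, computed via the explicit cosine basis:
    X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    N = len(frame)
    n = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5) * n[None, :])
    return basis.T @ frame

def aggregate(per_frame_features):
    """Mean and standard deviation of the per-frame feature values over
    one 2.5 s file, yielding one summary pair per feature."""
    f = np.asarray(per_frame_features)
    return f.mean(axis=0), f.std(axis=0)
```

In a real pipeline one would typically use an optimized DCT routine instead of the explicit matrix product, which costs O(N²) per frame.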
The following initial 37 features were considered:
- Mean and variance of: Spectral Centroid, Spectral Rolloff, Spectral Flux, Zero Crossing Rate (ZCR), Short Time Energy (STE), Spectral Flatness Measure (SFM) [22], and Voice2White (V2W) [23].
- High Zero Crossing Rate Ratio (HZCRR), Low Short Time Energy Ratio (LSTER) [24], and percentage of Low-Energy Frames (LEF).
- 20 Mel Frequency Cepstral Coefficients [25].
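As an illustration of two of the simpler listed features, the per-frame Zero Crossing Rate and Short Time Energy might be computed as below (a minimal sketch; the exact definitions and normalizations used in the chapter may differ):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return np.mean(frame ** 2)
```

Derived statistics such as HZCRR and LSTER are then obtained by comparing each frame's ZCR or STE against an average over a longer window.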
Since each of the 37 listed features is computed from two signals, namely the original time-domain sound signal and the residual of the linear prediction coefficients (LPC) analysis, the number of features to be selected by the HS algorithm is N_F = 2 × 37 = 74, which together form the final 74-feature vector F. Note that some of these features have been