Digital Signal Processing Reference
corpus [49]. COSINE is a relatively new database. It contains multi-party conversations
that were recorded in real-world environments. Speech was captured by a specially
crafted wearable recording system worn like a backpack. This allowed the speakers to
walk around in the street during the recordings. Participants were asked to speak
about anything they liked and to walk to various noisy locations. The corpus thus
consists of natural, spontaneous, and highly disfluent speech. The signal is partly
masked by indoor and outdoor noise sources including crowds, vehicles, and wind.
Seven microphones were used simultaneously per speaker. In keeping with the
restriction to monophonic sources adopted here, only speech recorded by a close-talk
microphone (Sennheiser ME-3) is used in the following.
All ten sessions that had been transcribed at the time of writing are used. These
contain 11.40 h of pairwise conversations and group discussions. The 37 speakers are
fluent, but not necessarily native, speakers of English. Each speaker participated in
exactly one session. Their ages range from 18 to 71 years with a median of 21 years.
COSINE's test set (sessions three and ten) is used for evaluation; it comprises 1.81 h
of speech. Sessions one and eight were chosen as the validation set (2.7 h of speech),
and the remaining six sessions make up the training set. The vocabulary size is 4.8 k,
and the out-of-vocabulary rate on the test set is 3.4 %.
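The session split and the notion of an out-of-vocabulary (OOV) rate can be made concrete with a minimal sketch. It assumes the sessions are numbered 1 to 10 and that the OOV rate is counted over running word tokens; the source states neither explicitly. The function name `oov_rate` and the toy word lists are invented for illustration and are not taken from COSINE.

```python
# Session split as described in the text, assuming sessions are numbered 1..10.
ALL_SESSIONS = set(range(1, 11))
TEST_SESSIONS = {3, 10}
VALIDATION_SESSIONS = {1, 8}
TRAINING_SESSIONS = ALL_SESSIONS - TEST_SESSIONS - VALIDATION_SESSIONS


def oov_rate(vocabulary, test_tokens):
    """Fraction of running test tokens that are not in the vocabulary.

    Assumption: the rate is token-based (every occurrence counts), which is
    a common convention but is not confirmed by the text.
    """
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    vocab = set(vocabulary)
    misses = sum(1 for token in test_tokens if token not in vocab)
    return misses / len(test_tokens)


# Toy data, invented for illustration only.
toy_vocabulary = ["the", "street", "is", "noisy"]
toy_test_tokens = ["the", "street", "is", "very", "noisy"]
toy_rate = oov_rate(toy_vocabulary, toy_test_tokens)  # 1 of 5 tokens is unseen
```

With this definition, a rate of 3.4 % would mean that roughly one in thirty running words of the test transcripts cannot be produced by the recogniser, whatever the acoustic model does.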
10.2.2 Performance
The frame-wise phoneme recognition rate of different network architectures is now
presented for the COSINE task described above and compared with that of a common
triphone HMM phoneme recogniser. Then, going from phonemes to words, the word
accuracy (WA) obtained by the multi-stream system introduced in Sect. 7.4.3 is compared
with the performance of a Tandem approach [35]. Again, a baseline is established
by a common HMM system that relies only on MFCC features. MFCCs 1-12 and
logarithmic energy, together with their first- and second-order regression
coefficients, are extracted as features for network input. To compensate for
stationary noise effects, CMS is applied to these features. An HMM system is used to
obtain phoneme boundaries via forced alignment. The following four network
architectures are considered: RNN, BRNN, LSTM networks, and BLSTM networks.
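The feature pipeline described above (static coefficients, regression coefficients, mean subtraction) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' front end: the static MFCC and log-energy values are taken as given, the regression window of two frames is an assumed value, and the mean is subtracted per utterance from all dimensions. The function names are hypothetical.

```python
import numpy as np


def deltas(feats, window=2):
    """Regression coefficients over time for feats of shape (frames, dims).

    Uses the usual least-squares slope formula with edge padding.
    The window of 2 frames is an assumption, not a value from the text.
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n : window + n + num_frames]
        behind = padded[window - n : window - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def mean_subtraction(feats):
    """Subtract the per-utterance mean of every feature dimension."""
    return feats - feats.mean(axis=0, keepdims=True)


def build_features(static):
    """static: (frames, 13) array holding MFCCs 1-12 plus log energy."""
    first = deltas(static)
    second = deltas(first)
    return mean_subtraction(np.hstack([static, first, second]))


# Random stand-in for real static features (illustration only).
rng = np.random.default_rng(0)
toy_static = rng.normal(size=(50, 13))
toy_features = build_features(toy_static)  # shape (50, 39)
```

Under these assumptions each frame is described by 13 static values plus two sets of 13 regression coefficients, which would give 39 inputs per frame.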
As network topology, three hidden layers (per input direction) were chosen for each
of these four types. These layers have 78, 128, and 80 hidden units, respectively,
and each memory block contains one memory cell. A learning rate of 10^-5 and a
momentum of 0.9 proved optimal for training. To improve the generalisation ability
of the networks, zero-mean Gaussian noise with standard deviation 0.6 was added to
the inputs during training. Prior to the actual training process, the weights were
initialised uniformly at random in the range from -0.1 to 0.1. Input and output gates
used tanh activation functions. The forget gates had logistic activation functions.
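The training settings listed above can be collected in a short NumPy sketch. It only illustrates weight initialisation, input noise, and a gradient step with momentum; it does not implement an LSTM or BLSTM network. The learning rate is taken as 10^-5 and the initialisation range as [-0.1, 0.1]; both readings are assumptions about the values quoted above. The input size of 39 and all function names are hypothetical as well.

```python
import numpy as np

LEARNING_RATE = 1e-5          # assumed reading of the quoted learning rate
MOMENTUM = 0.9
INPUT_NOISE_STD = 0.6
INIT_RANGE = 0.1              # assumed symmetric range [-0.1, 0.1]
HIDDEN_SIZES = (78, 128, 80)  # hidden units per layer, per input direction

rng = np.random.default_rng(0)


def init_weights(shape):
    """Draw weights uniformly at random from [-INIT_RANGE, INIT_RANGE)."""
    return rng.uniform(-INIT_RANGE, INIT_RANGE, size=shape)


def add_input_noise(inputs):
    """Add zero-mean Gaussian noise to the inputs (training time only)."""
    return inputs + rng.normal(0.0, INPUT_NOISE_STD, size=inputs.shape)


def momentum_step(weights, velocity, gradient):
    """One gradient-descent update with classical momentum."""
    velocity = MOMENTUM * velocity - LEARNING_RATE * gradient
    return weights + velocity, velocity


# Toy usage with a dummy gradient (illustration only).
toy_weights = init_weights((39, HIDDEN_SIZES[0]))
toy_velocity = np.zeros_like(toy_weights)
toy_gradient = np.ones_like(toy_weights)
new_weights, new_velocity = momentum_step(toy_weights, toy_velocity, toy_gradient)
noisy_inputs = add_input_noise(np.zeros((1000, 39)))
```

The noise is applied to the inputs only while training; at test time the features would be passed to the network unchanged.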
The standard (CMU) set of 41 different English phonemes is applied. The 41
phonemes include silence and short pause labels. Once no improvement on the