Impact of the Approaches Involved on Word-Graph Derivation from the ASR System (Pattern Recognition and Image Analysis)

Abstract

Finding the most likely sequence of symbols given a sequence of observations is a classical pattern recognition problem. This problem is frequently approached by means of the Viterbi algorithm, which aims at finding the most likely sequence of states within a trellis given a sequence of observations. The Viterbi algorithm is widely used within the automatic speech recognition (ASR) framework to find the expected sequence of words given the acoustic utterance, even though it provides a suboptimal result. Word-graphs (WGs) are also frequently provided as the ASR output as a means of obtaining alternative hypotheses, hopefully more accurate than the one provided by the Viterbi algorithm. The trouble is that WGs can grow in a computationally inefficient manner. The aim of this work is to fully describe a specific, computationally affordable method for obtaining a WG given the input utterance. The paper focuses specifically on the underlying approximations and their influence on both the spatial cost and the performance.

Keywords: Lattice, word-graphs, automatic speech recognition.

Introduction

Statistical decision theory is applied to a wide variety of problems within the pattern recognition framework that aim at minimising the probability of erroneous classifications. Maximising the posterior probability P(w|x) yields the most likely sequence of symbols, w, that matches a given sequence of input observations, x, as shown in eq. (1).


\hat{w} = \arg\max_{w} P(w \mid x) \qquad (1)

In many pattern recognition tasks, such as computer vision or automatic speech recognition (ASR), the knowledge source involved in the optimisation problem can be represented by stochastic finite-state models. Thus, the search problem is formulated as a decoding problem through a finite-state network. In this context, the sequence of symbols that most probably caused the sequence of observations should be decoded by considering all possible paths in the network that match the given input (the set of paths compatible with w is denoted by \Phi(w)).

\hat{w} = \arg\max_{w} \sum_{\phi \in \Phi(w)} P(w, \phi \mid x) \approx \arg\max_{w} \max_{\phi \in \Phi(w)} P(w, \phi \mid x) \qquad (2)

Nevertheless, a typical approximation is to consider the sequence of symbols associated with the most probable path, as in eq. (2). The Viterbi algorithm [1] is widely used to implement this maximum approximation and find the most probable path within the finite-state search space, since it allows an efficient implementation by means of dynamic programming. Nevertheless, for problems in which a sequence of symbols happens to be compatible with multiple sequences of states, this approach turns out to be suboptimal. This is the case in ASR, where, given the acoustic representation of the input signal (x), the same sequence of words (w) can be produced by a number of different paths through the search network.

Altogether, the Viterbi-based decoding algorithm does not provide the most likely sequence of words, which is the actual goal, but the word sequence associated with the most likely path in the search network. Moreover, additional assumptions made at the implementation stage also have an important impact on the final sequence of words.

In order to get better ASR performance, it seems of interest not to restrict ourselves to the single hypothesis provided by the Viterbi-based implementation but to take more hypotheses into account. Alternative hypotheses can be provided by means of a word-graph (WG), which can be seen as a fuzzy transcription of the acoustic utterance. WGs are an efficient means of representing a vast number of hypotheses. Numerous applications can benefit from WGs, by means of confidence measures [2,3], or by extracting the list of n-best hypotheses, which can be re-ordered with a different knowledge source [4].

However, WGs grow quickly with the length of the input utterance; thus, the computational cost associated with their generation can become unfeasible. To face this, some approximations, which entail a loss of accuracy, must be made. Since these approximations are poorly described in the literature, a standard method to compare the results obtained with WGs can hardly be adopted.

The aim of this work is to provide a full description of a method to derive a WG from the ASR system and, above all, of the approximations carried out to make this process computationally affordable. The practical approximations adopted to implement this sort of strategy are hardly ever reported in the literature, and reporting them is precisely our aim. The paper focuses specifically on the influence of these approximations on both the spatial cost and the performance of the system.


Fig. 1. Search space: cartesian product of acoustic observations and models’ states

ASR Decoding: Viterbi Algorithm

The goal of an ASR system is to obtain the most likely word sequence given the acoustic signal uttered by the speaker. In order to do this, Bayes' decision rule is applied to eq. (1), giving as a result eq. (3).

\hat{w} = \arg\max_{w} P(x \mid w) P(w) \qquad (3)

where P(w) is the probability assigned by the language model (LM) to the string w and P(x|w) is the probability assigned by the lexical-acoustic models. The decoding process in ASR can be represented by means of a search problem [5,6] where the search space consists of a finite-state network [7] comprising the aforementioned models (see Fig. 1).

To implement the search problem represented by eq. (2), each node in the search network is associated with the most likely path that reaches the node at a specific time-frame t. Thus, each node is linked with a predecessor node reached at time t-1. If we turn to Fig. 1, the blue node reached at time t3 has two possible predecessors, but only the information related to the one on the most likely path is kept, that is, the blue node reached at time t2. After the forward decoding phase, the final node storing the highest accumulated probability can be traced backwards, predecessor by predecessor, giving as a result the sequence of words associated with the most likely path.

There are a few approximations well worth bearing in mind when it comes to making this implementation efficient. The algorithm is typically sped up by means of a beam search. That is, at each time frame not all the nodes remain alive for the subsequent search; instead, the nodes whose accumulated probability does not exceed a percentage of the highest accumulated probability are no longer explored.
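
As an illustration, the following minimal Python sketch outlines this beam-pruned forward pass with a single backpointer per node. The network representation (initial_states, arcs, emit_logprob) and the log-domain beam width are our own assumptions for illustration, not the implementation used in this work.

```python
# Minimal sketch of a beam-pruned Viterbi forward pass (n = 1 predecessor).
# arcs: dict mapping state -> list of (next_state, transition_logprob).
def viterbi_forward(observations, initial_states, arcs, emit_logprob, beam=10.0):
    alive = {s: 0.0 for s in initial_states}  # best log-prob reaching each node
    backptr = []                              # one predecessor map per frame
    for x_t in observations:
        scores, preds = {}, {}
        for s, acc in alive.items():
            for nxt, trans in arcs.get(s, ()):
                cand = acc + trans + emit_logprob(nxt, x_t)
                # n = 1: keep only the single most likely predecessor
                if nxt not in scores or cand > scores[nxt]:
                    scores[nxt], preds[nxt] = cand, s
        # Beam pruning: drop nodes falling too far below the best score
        # (a multiplicative threshold becomes additive in the log domain).
        # Assumes at least one node survives at every frame.
        best = max(scores.values())
        alive = {s: v for s, v in scores.items() if v >= best - beam}
        backptr.append(preds)
    return alive, backptr
```

Tracing the most probable surviving final node backwards through backptr yields the word sequence of eq. (2).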

Approaches Involved in Getting the Word-Graph

Forward Decoding

The key issue in deriving a word-graph from the trellis (see Fig. 1) is to allow each node to store n predecessors instead of a single one. Note that in the implementation of the Viterbi algorithm each node would store only a single predecessor (n=1), and as a result a single sequence of words is obtained.

Intuitive though this procedure might seem, there are implementation nuances that are worthy of further consideration. For instance, strictly speaking, if n predecessors were allowed at all the nodes of the search space comprising the involved models, a graph of phone-like units would be derived. Alternative pronunciations and time segmentations of the same word would be contained in such a graph. Note that the alternative ways in which a word can be uttered have an impact on the lexical-acoustic models but no relevance at the word level. As a result, in order to derive a word-graph, the computational cost can be significantly reduced by allowing only the nodes that match states of the LM to store n predecessors.


Fig. 2. Structure generated in the forward decoding phase for n=2

The alternative word-level hypotheses are stored in a structure that encompasses the list of the n best predecessors and the probabilities associated with the corresponding paths. As an example, Fig. 2 shows the structure created in the forward decoding phase assuming n=2. Thus, when the word w is reached, the information related to the best 2 paths (hyp1 and hyp2) is stored.
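
One possible form of such a structure is sketched below: each word-level node keeps a bounded list of its n best predecessors together with the accumulated log-probabilities of the corresponding paths. The class and method names are hypothetical.

```python
# Hypothetical sketch of the structure of Fig. 2 (n best predecessors).
class WordNode:
    def __init__(self, word, t, n=2):
        self.word = word    # word label
        self.t = t          # time frame at which the word is reached
        self.n = n          # number of predecessors allowed
        self.preds = []     # (logprob, predecessor) pairs, best first

    def add_predecessor(self, logprob, pred):
        # Keep only the n most probable hypotheses reaching this node.
        self.preds.append((logprob, pred))
        self.preds.sort(key=lambda p: p[0], reverse=True)
        del self.preds[self.n:]

    def best_logprob(self):
        # Score with which the node competes in the beam (see below).
        return self.preds[0][0]
```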

In an attempt to cut down the search time, the beam-search strategy is applied: at each time-frame the nodes storing paths with low probability are pruned. Note the assumption intrinsic to this node-pruning strategy. While a node is allowed to store up to n predecessors with their respective accumulated probabilities, it is the probability of the most probable amongst the n predecessors that is associated with the node as far as pruning is concerned. As a result, a node storing one predecessor with a very high probability and the remaining predecessors with very low probabilities may remain alive, in contrast to other nodes in which the probabilities of all the predecessors are high, yet not high enough compared to the most likely one in a given time-frame. On this account, the beam-search strategy implies an additional approximation that did not occur in the standard decoding, where the allowed number of stored predecessors was only one.
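
A sketch of this node-level pruning rule, reusing the hypothetical WordNode above:

```python
# A node survives as long as its single best predecessor is within the beam,
# regardless of how weak the remaining stored predecessors are.
def prune_word_nodes(nodes, beam):
    best = max(node.best_logprob() for node in nodes)
    return [node for node in nodes if node.best_logprob() >= best - beam]
```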

Backward Decoding: Getting the Word-Graph from the Lattice

Once the entire observation has been analyzed in the forward decoding phase, the n best final states are considered and traced backwards until the initial state in the lattice is reached. As a result, a WG is generated conveying the following information: word label, probability of the edge, and time-frame segmentation (see Fig. 3 for an example).
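
Under the same hypothetical structures as above, the backward phase might look as follows: every stored predecessor link is turned into one WG edge annotated with the word label, the path log-probability and the time segmentation.

```python
# Sketch of the backward phase: trace the n best final nodes backwards,
# emitting one word-graph edge per stored predecessor link.
def lattice_to_wordgraph(final_nodes):
    edges, stack, visited = [], list(final_nodes), set()
    while stack:
        node = stack.pop()
        if id(node) in visited:
            continue
        visited.add(id(node))
        for logprob, pred in node.preds:
            # edge = (start frame, end frame, word label, log-probability)
            edges.append((pred.t, node.t, node.word, logprob))
            stack.append(pred)
    return edges
```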

Although the WG is an efficient means of storing a vast number of hypotheses, it grows quickly with the length of the input utterance. In order to avoid this inefficient behavior, a new approach has been adopted. The idea is based on merging nodes of the WG associated with the same state in the LM, but not all of them, only those within a neighborhood of m time-frames. Denoting the duration of a time-frame by \Delta t_w, two nodes u and v associated with the same LM state are merged whenever:

|t_u - t_v| \le m \cdot \Delta t_w

Fig. 3. Example of a part of a WG showing the word log-probability and time-frame

For example, in Fig. 3, given a time-frame constant \Delta t_w = 1, if nodes 4 and 5 are associated with the same state of the LM, they would be merged for values of m > 2. Merging nodes has the advantage of providing a WG of smaller size. In general, the neighborhood constraint prevents us from generating fake hypotheses, but this is not ensured for particular states of the LM such as unigrams: the drawback of merging nodes of the WG is that hypotheses with fake probabilities can be generated.
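
A minimal sketch of this merging step, under stated assumptions: each WG node is assumed to expose its LM state (lm_state) and an absorb helper that redirects the absorbed node's edges; both, like the exact form of the criterion, are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of merging WG nodes that share the same LM state
# and lie within a neighborhood of m time frames (|t_u - t_v| <= m * dt_w).
def merge_nodes(nodes, m, dt_w=1.0):
    merged = []
    for node in sorted(nodes, key=lambda v: (v.lm_state, v.t)):
        last = merged[-1] if merged else None
        if last and last.lm_state == node.lm_state \
                and abs(node.t - last.t) <= m * dt_w:
            last.absorb(node)  # assumed helper: redirect node's edges to last
        else:
            merged.append(node)
    return merged
```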

On the other hand, as mentioned above, the WG is an efficient means of storing a vast number of hypotheses, while some of them may turn out to be redundant. Thus, in this work we have only explored a number of hypotheses below a given threshold. This additional approximation is also necessary to make the process computationally efficient. Specifically, a depth-first search has been carried out over the WG in an attempt to extract the most likely hypotheses. Nevertheless, this method leads to a number of hypotheses that share the same initial words, while hypotheses whose first words have low probabilities might be left aside.
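
A sketch of such a depth-first extraction, halted after a fixed number of complete hypotheses (the successors and is_final interfaces are assumptions), which also illustrates the bias just mentioned:

```python
# Depth-first n-best extraction from the WG. Exploring the most probable
# edges first makes the early hypotheses good, but also makes them share
# the same initial words, as noted above.
def depth_first_nbest(start, successors, is_final, max_hyps):
    hyps = []

    def dfs(node, words, logprob):
        if len(hyps) >= max_hyps:
            return
        if is_final(node):
            hyps.append((logprob, words))
            return
        for word, lp, nxt in sorted(successors(node), key=lambda e: -e[1]):
            dfs(nxt, words + [word], logprob + lp)

    dfs(start, [], 0.0)
    return hyps
```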

Experimental Results

The experiments were carried out on two different corpora. The Dihana corpus consists of 900 human-machine dialogues in Spanish regarding information about train timetables, fares, destinations and services [8]. This task has an intrinsically high level of difficulty due to the spontaneity of the speech. It comprises 5,590 different sentences to train the LM, with a vocabulary of 865 words; the test set includes 1,348 spoken utterances. The MetEus corpus is a weather-forecast corpus in Spanish, Basque and English [9]. In this work we focus on the Basque ASR task. The training corpus consists of 7,523 different sentences with a vocabulary of 1,135 words, and the test set includes 1,798 spoken utterances.

In these tasks, for each input utterance in the test set, a series of WGs was generated for different values of n (the number of allowed predecessors) and m (the number of time-frames merged). In the following section the influence of both parameters (n and m) is evaluated in terms of spatial cost and performance.

The influence of merging the time frames on the number of edges of the WG for Dihana (the MetEus task follows the same trend) is represented in Fig. 4. The ordinate shows the number of word-graphs within the edge-range represented in the abscissa. The figure includes series of histograms associated with time-merging widths ranging from m=0 to m=3. It shows that as the number of merged frames increases, the number of small WGs gets higher and the number of big WGs decreases. From these results it follows that the merging procedure reduces the size of the WGs.

The performance of the system, in terms of word error-rate (WER), is given in Table 1. The WGs were evaluated in a twofold manner: on the one hand, using the most probable hypothesis within the WGs (denoted by "p-best") and, on the other hand, using the oracle-best hypothesis (denoted by "oracle").


Fig. 4. Number of edges of the word-graphs obtained with n=2 and varying 0 ≤ m ≤ 3

The oracle score provides the best performance achievable considering all the hypotheses involved in the WGs. Comparing the "oracle" and "p-best" results presented in Table 1, it turns out that the most probable hypothesis is not necessarily the most accurate one. Thus, the benefit of using alternative hypotheses rather than only the most likely one becomes apparent.

Comparing the oracle performance for different time-merging choices, it turns out that increasing m hardly degrades it, with the additional advantage of dealing with smaller graphs. By contrast, comparing the differences for the most likely hypothesis (p-best), the more states we merge the more degraded the first hypothesis is. In conclusion, while the probabilities are blurred in the merging process, the most accurate hypothesis remains amongst the hypotheses provided by the graph.

If we compare the results obtained with the Viterbi-like implementation (m=0, n=1) with the "p-best" obtained with m=0 and n=2, it might seem as though the WG does not help. However, this decrement in performance is due to the different approximations (described in the previous sections) devoted to making the computation affordable, and not to the WG itself. Moreover, the oracle results reveal that the WGs offer significantly more accurate hypotheses than the Viterbi-like approach (a reduction of 28% and 33% in WER for Dihana and MetEus, respectively). As a result, resorting to re-scoring techniques would allow hypotheses with better system performance to be extracted.

Table 1. WER for several values of the number of stored predecessors (n) and the number of frames merged (m) in both the Dihana and MetEus corpora

                      Dihana                                MetEus
(n,m)    (1,0)  (2,0)  (2,1)  (2,2)  (2,3)    (1,0)  (2,0)  (2,1)  (2,2)  (2,3)
p-best   18.37  20.01  21.15  22.01  22.68    12.78  12.90  13.47  14.38  15.05
oracle   18.37  13.09  13.26  13.30  13.42    12.78   8.45   8.45   8.44   8.45

The difference in performance between the MetEus and Dihana tasks has to do with the intrinsic difficulty of each task: while MetEus consists of read speech, Dihana contains spontaneous speech with, at times, very long sentences.

Concluding Remarks and Future Work

We have presented a way of obtaining a WG from the lattice explored in the ASR process by storing a number (n) of predecessors at each node. Several approximations have been applied (merging states, obtaining the n-best list of hypotheses from the WG, etc.) in order to make the computation affordable.

It has turned out that merging states alleviates the computational cost of the WG at the expense of generating hypotheses with possibly fake probabilities. Nevertheless, the explored WGs still contain more accurate hypotheses than the one obtained without considering a WG. Other minimisation strategies that have proven successful in other fields of pattern recognition (such as graph-cut theory in computer vision [10]) could also be explored.

Currently we are focusing on re-scoring methods that take advantage of the time segmentation and the probabilities derived from the WG, and also on the integration of this sort of WG with other kinds of WGs aimed at speech translation or understanding systems.
