Phone Loop Fast-Match Merged in Decoders

These days speech recognition technology is in the focus of many large corporations like Google, Baidu, and Microsoft. It is an exciting time of shifting paradigms and approaches, which leads to quite significant improvements in the accuracy and stability of the technology. Many ideas which seemed fundamental are now being questioned; for example, modern research tries to replace Hidden Markov Models with recurrent neural networks and more complex structures like long short-term memory (LSTM) networks. There is a lot of marketing here too; for example, LSTM papers often report that such networks significantly improve frame classification accuracy, which was never the parameter to optimize in an ASR system, while the word error rate results of LSTM networks are not as impressive as the frame accuracy results.

One idea that has been quite successfully challenged is the use of feature context in a speech recognition system. MFCC features with delta-deltas usually cover just 7-9 frames around the current frame, while modern deep neural network (DNN) classifiers can successfully use up to 30 frames of context. That certainly improves the accuracy of speech recognition for DNN systems. One important thing here is that we still use breadth-first search in our decoders, so this context is only used in classification, not in search. This is a major drawback in modern systems.
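To illustrate the difference in context width, here is a minimal sketch (in Python with NumPy; the function name and window sizes are purely illustrative) of how a wide window of frames is stacked into a single DNN input vector:

```python
import numpy as np

def stack_context(features, left=15, right=15):
    """Stack +/- N neighbouring frames around each frame.

    features: (num_frames, feat_dim) array of per-frame features.
    Returns an array of shape (num_frames, (left + right + 1) * feat_dim),
    where each row is the concatenation of the surrounding frames
    (edges are padded by repeating the first/last frame).
    """
    num_frames, dim = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),
        features,
        np.repeat(features[-1:], right, axis=0),
    ])
    stacked = np.zeros((num_frames, (left + right + 1) * dim))
    for t in range(num_frames):
        window = padded[t:t + left + right + 1]     # frames t-left .. t+right
        stacked[t] = window.reshape(-1)
    return stacked

# 100 frames of 13-dimensional MFCCs; a DNN input covering ~31 frames of context
mfcc = np.random.randn(100, 13)
dnn_input = stack_context(mfcc, left=15, right=15)  # shape (100, 403)
```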

If we consider best-path search in a graph, there are two approaches: breadth-first search, where we first explore all branches coming out of a node, and depth-first search, where we continue along a path up to some point without looking at the other branches. Modern decoders mostly use breadth-first search (BFS), and in many cases this is a suboptimal approach.
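For reference, a time-synchronous breadth-first beam search looks roughly like the following toy sketch: all hypotheses are advanced one frame at a time, and anything falling outside the beam in that frame is discarded immediately (all names here are illustrative, not the actual decoder code):

```python
import math

def beam_search(frames, start_state, transitions, emission_logprob, beam=10.0):
    """Toy time-synchronous Viterbi beam search.

    frames:           iterable of observation frames
    transitions:      dict state -> list of (next_state, transition_logprob)
    emission_logprob: function (state, frame) -> log-probability
    beam:             prune hypotheses scoring worse than best - beam
    """
    active = {start_state: 0.0}                  # state -> best path score
    for frame in frames:
        expanded = {}
        for state, score in active.items():      # breadth-first: expand every active state
            for nxt, trans_lp in transitions[state]:
                new_score = score + trans_lp + emission_logprob(nxt, frame)
                if new_score > expanded.get(nxt, -math.inf):
                    expanded[nxt] = new_score
        best = max(expanded.values())
        # decisions are made per frame; paths outside the beam are gone for good
        active = {s: sc for s, sc in expanded.items() if sc > best - beam}
    return max(active.items(), key=lambda kv: kv[1])
```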

For example, if one frame is noisy and the next frame is correct, we cannot recover, because the correct path has already been pruned on the noisy frame. There is no way back. Only by looking at several frames at once can we figure out which path is correct and which is not.

Another advantage of depth-first search is speed. By using a larger context we can quickly reduce the hypothesis space; for example, by looking at the 3-4 following phonemes we can sharply reduce the number of words to search.
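A toy sketch of the idea: if a fast first pass tells us the next few phones, only the words whose pronunciations start with that phone sequence need to be kept (the lexicon and names below are made up for illustration):

```python
# Toy lexicon: word -> phone sequence
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "speed":  ["S", "P", "IY", "D"],
    "spin":   ["S", "P", "IH", "N"],
    "cat":    ["K", "AE", "T"],
    "dog":    ["D", "AO", "G"],
}

def candidate_words(predicted_phones, lexicon=LEXICON):
    """Keep only words consistent with the next few predicted phones."""
    n = len(predicted_phones)
    return [w for w, phones in lexicon.items()
            if phones[:n] == list(predicted_phones)]

# If the fast match predicts "S P IY", only two of the five words survive.
print(candidate_words(["S", "P", "IY"]))   # ['speech', 'speed']
```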

Depth-first search was quite popular back in the 1990s; many decoders of that time used it, such as the Dragon decoder, the IBM Stack decoder, and the Noway decoder. Unfortunately, with the introduction of the WFST framework, those decoders declined.

The WFST framework is also an interesting case to consider here. The problem with triphone models is that they explode when we consider cross-word transitions. We do not know the following phone, so we have to consider all possible variants and every possible history, which grows the search space significantly. Previously this problem was solved in different ways; for example, developers first ran a multipass search without cross-word triphones and then rescored with cross-word triphones (the pocketsphinx approach). This problem is well covered in the paper "Dynamic Network Decoding Revisited" by H. Soltau and G. Saon.
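A rough back-of-the-envelope sketch of why the cross-word case explodes: at a word boundary the last phone needs a separate triphone variant for every distinct phone the next word may start with, so the number of word-boundary arcs is multiplied accordingly (toy numbers, not measurements from a real system):

```python
def crossword_arcs(num_word_ends, distinct_initial_phones):
    """Rough count of word-boundary arcs with cross-word triphones.

    Within a word, the right context of the final phone is unknown until
    the next word is chosen, so each word-end arc must be duplicated for
    every distinct phone a following word can start with.
    """
    return num_word_ends * distinct_initial_phones

# A 100k-word vocabulary whose words start with ~40 distinct phones:
print(crossword_arcs(100_000, 1))    # 100000 arcs if right context is ignored
print(crossword_arcs(100_000, 40))   # 4000000 arcs once right context is modelled
```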

By using graph reduction operations, WFST solves a very specific problem: it improves decoding speed by properly compressing cross-word contexts. With a free implementation in the OpenFST toolkit, this method became very popular. The improvement in decoding speed allowed performance to increase two- to three-fold, but the approach also has disadvantages. Due to the strict and simplistic formalism of WFST, it is not easy to perform more complex searches or to integrate more complex models into the search, for example a hierarchical language model. It is also quite memory-intensive, because the precompiled WFST graph has to reside in system memory during decoding. Attempts to overcome the WFST restrictions continue these days; one possible approach is dynamic context compression in the decoder, inspired by WFST ideas, which still requires recombination of word labels and careful tracking of context. And if we consider simple things like noises and paralinguistic fillers, the system becomes much more complex to implement in the WFST framework.

If we step outside the restrictions of breadth-first search, another solution appears. Just by looking at the following phones we can greatly reduce the number of hypotheses to consider during cross-word transitions, and thus we no longer need complex WFST compression. All we need is to tolerate a small delay in recognition. The acceptable delay can be derived from human response expectations: about 0.2 to 0.5 seconds is fine, so at the standard 100 frames per second we can look up to 50 frames ahead. This is an old idea which was used in the 1990s and was somewhat supported in pocketsphinx: a phone loop fast match. We first decode the audio with a very fast phonetic decoder, and only after a delay do we start the main large-vocabulary search. This way we not only reduce the cross-word search space but also improve decoding inside the lextree, because we can predict the following phones efficiently. Because of that we consider phone-loop search a quite important feature of the decoder. Such ideas have long been advocated by Dr. James Baker and by Dr. Tony Robinson of CantabResearch.
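The overall structure of such a decoder can be sketched as follows: a cheap phone-loop pass runs a fixed number of frames ahead of the main search, and the main pass only expands arcs whose phone the fast match has seen in that lookahead window. This is a simplified sketch under the usual assumption of a 10 ms frame shift (100 frames per second); the function names are made up and do not correspond to the actual pocketsphinx internals:

```python
FRAME_RATE = 100                              # frames per second (10 ms frame shift)
MAX_DELAY_SEC = 0.5                           # tolerable response delay
LOOKAHEAD = int(MAX_DELAY_SEC * FRAME_RATE)   # up to 50 frames of lookahead

def decode(frames, phone_loop_pass, main_pass):
    """Two-pass decoding: a fast phone loop runs LOOKAHEAD frames ahead of
    the main large-vocabulary search and tells it which phones are plausible
    in the near future.

    phone_loop_pass(frame) -> set of phones surviving the fast match
    main_pass(frame, allowed_phones=...) -> advances the main search one frame
    """
    active_phones = []                        # per-frame phone sets from the fast match
    for t, frame in enumerate(frames):
        # Fast phonetic pass on the current frame.
        active_phones.append(phone_loop_pass(frame))
        # The main search lags by LOOKAHEAD frames, so for frame t - LOOKAHEAD
        # it already knows which phones appear in the next LOOKAHEAD frames.
        main_t = t - LOOKAHEAD
        if main_t >= 0:
            upcoming = set().union(*active_phones[main_t:t + 1])
            main_pass(frames[main_t], allowed_phones=upcoming)
    # Flush the remaining tail once the audio ends.
    for main_t in range(max(len(frames) - LOOKAHEAD, 0), len(frames)):
        upcoming = set().union(*active_phones[main_t:])
        main_pass(frames[main_t], allowed_phones=upcoming)
```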

Recently we improved the fast match in sphinx4 and fixed fast-match issues in pocketsphinx. As of today, phone-loop fast match is enabled by default in both sphinx4 and pocketsphinx. It greatly improves decoding speed, and combined with PTM models you can decode large-vocabulary speech on a desktop in real time with sphinx4 with high accuracy. That is a pretty big improvement. Please check out either pocketsphinx or sphinx4, try it, and let us know what you think.
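As an example, with the Python bindings of the older pocketsphinx-python package a decoder can be set up roughly as below. The phone-loop fast-match options shown (-pl_window, -pl_weight) are the ones I recall from the pocketsphinx argument list; check the help output of your version for the exact names and defaults, and adjust the model paths to your installation:

```python
import os
from pocketsphinx.pocketsphinx import Decoder

MODELDIR = "/usr/local/share/pocketsphinx/model"   # adjust to your installation

config = Decoder.default_config()
config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))
config.set_string('-lm', os.path.join(MODELDIR, 'en-us/en-us.lm.bin'))
config.set_string('-dict', os.path.join(MODELDIR, 'en-us/cmudict-en-us.dict'))
# Phone-loop fast-match options (enabled by default in recent versions):
# -pl_window is the phoneme lookahead window, -pl_weight the phone-loop weight.
config.set_int('-pl_window', 5)
config.set_float('-pl_weight', 3.0)

decoder = Decoder(config)
decoder.start_utt()
with open('goforward.raw', 'rb') as f:             # 16 kHz, 16-bit mono raw audio
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else "")
```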

4 Responses to “Phone Loop Fast-Match Merged in Decoders”

  1. Andrew Morris says:

    I haven’t read this article very carefully yet, but do the methods you are advocating require looking ahead several phonemes? If so there are many applications, especially in command and control, where the required response delay would not be acceptable because each word should be recognised as soon as it is completed, if not before.

    Even so I can see that the extra context information provided by such look ahead would be advantageous in applications where immediate response is not important, so that would generally be acceptable as long as it was coded in such a way that it was easy to adjust the acceptable look ahead distance.

  2. admin says:

    Andrew, this is correct, we need to consider some number of phonemes ahead. Like I wrote in the post, 0.2 to 0.5 seconds of speech is a tolerable delay for most applications, so we could easily look 1-2 phonemes ahead. In case we face confidence issues we might need to increase the delay to more phonemes, giving a slower response time but more accuracy. Just like humans do.

  3. James Salsman says:

    Viterbi beam search isn’t breadth-first, it’s best first with length of match as a heuristic. See for example page 5 of http://www.cs.rochester.edu/u/james/CSC248/Lec9.pdf

  4. Andrew Morris says:

    “0.2 to 0.5 seconds of speech is a tolerable delay for most applications”

    For shoot-em-up type games (or serious applications) any delay at all is highly regrettable, if not completely unacceptable. There would have to be an option to extend context into the past without encroaching on the future.

    There is a technique which allows continual traceback from frame t to frame (t-d), in the Viterbi dynamic programming lattice, where just one word sequence hypothesis has a probability, accumulated from all paths at t which trace back to it, of 1. How far back you have to trace to find this point where it is safe to make a decision depends on the vocabulary and grammar in use. This allows the word being spoken at t to be confidently recognised as soon as it is disambiguated, which is not always but most often before it has finished being spoken, for all but the shortest words.
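    A rough sketch of that kind of traceback (the data structures are hypothetical; each surviving path carries its word history, and only the prefix shared by all of them is output):

    ```python
    def confident_prefix(active_histories):
        """Find the word prefix shared by every active hypothesis at frame t.

        active_histories: list of word-history tuples, one per surviving path,
        e.g. [("go", "forward", "ten"), ("go", "forward", "tent")].
        Everything in the common prefix can already be output with confidence.
        """
        if not active_histories:
            return ()
        shortest = min(active_histories, key=len)
        prefix = []
        for i, word in enumerate(shortest):
            if all(h[i] == word for h in active_histories):
                prefix.append(word)
            else:
                break
        return tuple(prefix)

    # Both hypotheses agree on the first two words, so they can be emitted now.
    print(confident_prefix([("go", "forward", "ten"), ("go", "forward", "tent")]))
    # ('go', 'forward')
    ```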

    Permitting models which required any look ahead would significantly decrease the proportion of words which could be recognised in that way as soon as it was possible to do so. Humans do occasionally make mistakes in speech recognition and then have to trace back further in time to find a grammatically legal interpretation, but that is because they tend to interpret on the fly, since the most likely hypothesis is usually correct even if it is not the only possibility. Using the above-mentioned technique the machine could never accept any hypothesis where residual uncertainty remained, unless the beam width in use was too narrow.

    The importance of minimising response delay in interactive applications of any kind cannot be overstated. Not so long ago I was working on software for gesture recognition using the company’s own 3D sensor. The combination of hardware and software used resulted in a delay of perhaps 1 or 2 frames at 30 fps. Another company then introduced a different hand gesture sensor with sub-frame delay. The improvement in the feel of the interactive response was so strong that the company I was working for would have been blown out of the water, were it not for the fact that the new sensor was bulky and only had a range of about one metre.

    Apart from the above argument, the idea of training acoustic models on a time context window of duration longer than about one or two phonemes could never work in practice because the variation in such long samples would be so great that the amount of speech data required to train models to generalise correctly would be prohibitive. On the other hand it is well known that disambiguation often requires a grammar model window which is much larger, possibly spanning whole sentences, or even further. Statistical grammars can also suffer from explosive training data requirements with context duration, so care needs to be taken there too. There comes a point where large grammar must give way to knowledge modelling.