In many languages the amount of lexical forms is huge due to morphology. Even simple vocabulary can contain several million forms and variations. It’s hard to recognize such a big vocabulary because of huge search space. Decoder is slow and a language model takes enormous amount of memory.
Of course brute force approach make sense and actually quite successful but better ones already suggested. For example using morphological segmener we can build a language model and the acoustic model which can describe the same vocabulary in way smaller number of subword items. Real words are combined from the chunks which are separate entities in a language model. This way our search space is efficiently represented and the speed is comparable to English models.
The tricky part is to properly segment the words. Because pronunciaiton of decomposition is not so straightforward it takes some effort to build the split. We are happy that our contributor Zamir Ostroukhov managed to solve that problem. He created the acoustic model from the audiobooks from the Voxforge database and used large text corpora to create a morphologically-segmented language model. This is a very promising approach for morphologically-rich language so we look forward to see this framework as a part of CMUSphinx. Maybe this framework could be extended to multilevel speech representation which could hold both subwords and sentence-level items.
Over the last couple of weeks Long Audio Alignment Project has had a lot of new developments. It was understood that the accuracy of audio alignment would improve even further if some approximate time information for certain words were known from before the actual alignment. A decoder without such an information, during alignment has to go through all the frames of the audio to finally tell which alignment hypothesis scores the best. However, with such approximate timed information the decoder only has to wait until one (or more) of it’s hypothesis token agree with this additional information, allowing it to prune out the rest of the tokens. This helps keeping the beam size small.
A highly configurable phrase spotter was hence implemented to get this timed information. Phrase spotter creates a left to right no skips grammar from the words in the phrase and (like Aligner) uses a garbage model to model all out of grammar utterances. The grammar has been chosen this way to ensure that a phrase is recognized only when all the words in the phrase are present in the utterance without a skip. Corresponding changes in the linguist were made as well to ( allow and ) ensure that one a Phone Loop is inserted only at the start of a phrase in the search graph.
Aligner search manager was then designed to exploit this phrase spotter’s result and perform audio alignment. As a result, even with much more complicated grammar, the aligner can now align audio with much better accuracy (almost 0% error) , however memory requirements for aligning very large text allowing large skips still poses a problem. For now I am profiling the memory usage to locate and tackle this issue.
In the last post regarding the Long Audio Training it was indicated that there were still some problems in the reduced Baum-Welch implementation. Fortunately these were identified as memory leaks introduced during the optimization and were fixed. In the past days I have made some extensive tests, which show, that modified algorithms perform significantly better than the original version of SphinxTrain with the respect to the memory consumption.
A mixture of cool technologies could help to create really innovating applications. Check this video with demonstratoin of CMUSphinx capabilities when it’s combined with OpenCV video recognition library.
Making subtitles from scratch usually consists of two tedious tasks: (1) figuring out the times when someone starts and ends speaking — in subtitle length pieces — and (2) typing down text corresponding to that speech. An approach often worth attempting is to automate a large part of this work by using speech recognition to generate subtitles from given video. This method cannot be expected to produce release quality subtitles on its own, but it should provide a rough first draft, which can be finished by usual manual methods. With most video sources, the actual speech recognition cannot be expected to perform well, but voice recognition should provide decent results for the start and end times of subtitles.
Gaupol’s speech recognition uses the excellent CMU Sphinx speech recognition toolkit developed at Carnegie Mellon University — to be exact, the pocketsphinx plugin for the GStreamer multimedia framework.
My last update on Phone Loops indicated some significant improvements in Out of Grammar words recognition while aligning audio files with text, but it also remarked on alignment of large audio file’s need for some special attention because of the huge search space generated the linguist and it’s associated memory.
I hence modified the linguist to dynamically generate search graph during the process of recognition by adding Phone Loop only at the current grammar state, hence significantly reducing the memory required to store 1 phone-loop per word in the transcription. Some tests were conducted as well to determine ‘Out of Grammar Branch” probability and ‘Phone Insertion Probability” for audio files with some noise and close to 3% error in transcription. Best results in Out of Grammar word recognition were achieved at out of grammar branch probability ~ 1E-8 and phone insertion probability ~ 1E – 80. The word error rate hence obtained was close 5%.
Current version of the linguist is now available in long-audio-aligner branch. We now proceed towards the implementation of Keyword spotting algorithm for improving the alignment even further.
It’s been a while since my last post. In theese days I was modifying the Baum-Welch algorithm to the reduced version, which is finally complete.
Forward, backward and Viterbi methods were changed in the following way:
‘Reduced’ forward method was created. This method computes the checkpoints for later re-computation of actual alpha & scaling values. The size of reduced matrices is a function of block size, which is taken as a parameter.
‘Local’ forward method was created. This method performs the alpha values re-computation for a particular checkpoint (block of values).
As SphinxTrain has Viterbi back-pointers computation embedded in the Forward pass, the modification of Viterbi was just to use the reduced forward and to recompute the alpha values with the local forward.
Backward update was modified in a similar way as the Viterbi.
The modification was successfully tested on an4 database. It performed somewhat slower, which was anticipated, as the modified algorithm does more computation.
I also tried the modification on the ‘rita’ (long audio) database. I was forced to quit the computation as it took all my system’s memory. This sadly seems as no improvement in the memory demands and might suggest that some of the memory demands are not in the forward/backward/Viterbi as well as that I might have just introduced some memory leaks. During the brief tests the block_size parameter was set arbitrarily to 11, not the sqrt function of time frames count, which also may have some performance consequences.
The actual slow-down and memory requirements are subject to more detailed tests.
Regarding the CUDA, I have gain access to 3 CUDA machines. Two of them belong to Sitola, The Laboratory of Advanced Network Technologies. The access to these machines is provided by MetaCentrum, Czech academic grid organization providing the computation and storage resources. The cards are GeForce GTX 285, GeForce 8800 Ultra and GeForce 8400M GS (a rather low-end one in my personal laptop). These are devices the CUDA development and testing will take place on. More info to come, please check the project page.
Our objective this week was to model presence of words in the utterance that are not in the transcription. The approach used was to model it using Phone Loops. A phone loop contains all the phones of an acoustic model and can model any utterance (i.e. also words in the transcription). Hence the key to good alignment using phone loop is an optimal branch probability which is large enough so that recogniser does not mistake a OOV word as a word in the grammar and small enough to not replace a word in grammar by a OOV word.
A linguist satisfying the above criteria has been added to long audio aligner branch. However, the linguist performs quite well for small sized transcriptions , the size of the search graph produced is too large for small sized transcription. We plan to generate this search graph dynamically now, to solve this memory issue. This way the memory requirements for generating and storing huge search graphs will be reduced to almost O(1) .
Extending my work from last week, I have added a framework for testing audio aligner. The framework includes tools to read transcription from a file and convert it into a format compatible with grammar, corrupt a transcription with desired error rates of particular types and finally a pronunciation generator for OOV words.
The problem we are trying to tackle first is of word insertions. For this, the aligner grammar has been modified to contain word self-loops and one backward jump per word. The word error rates for small audio (~ 5 min) with word repetitions is less than 2% with the current configuration. However it models word repetitions quite well, it can’t model word insertion completely as words inserted don’t necessarily come from the neighborhood of recently decoded word.
A similar (but not quite the same) out of grammar utterance recognition problem was attempted in S4 by adding an out of grammar branch (PhoneLoops) in SentenceHMM parallel to the one obtained by expanding grammar. But this addition does not model insertion of a couple of words in an utterance that “almost” fits the grammar.
So I plan to modify the same approach, but rather than having a branch parallel to one generated from grammar, I will have similar branches between consecutive words nodes in the grammar. The branch out probability for this branch will require some experimentation as well, but I expect this addition to model insertion of words completely.
During the last week the memory management of the current SphinxTrain version was examined using Valgrind family tools.
Program was profiled on one pass Baum-Welch training on the whole an4 database (948 files) and on one recording from custom ‘rita’ database. The total length of traing recordings was ~42 minutes and 5:31 minutes respectively. The reason for the testing on rita database considered only one file was the big amount of time taken by the training process. In both cases this was about 8 minutes without Valgrind profiling. (This time has increased about 15 times when profiling.)
First interesting result is that the training on 42 minutes of short audio files (each few seconds long) took about the same as training on a single file 5:31 minutes long.
During the experiments an issue with extensive memory reallocation was found in the forward algorithm. This was cca 760,000 of __ckd_realloc__ function calls when training on an4 database. The cause was found by the analysis of source codes: initial active state set allocation size was repeatedly set too low and after that vastly insreased (by about 1.3 GB on rita database) by the step of a few bytes. Algorithm was modified to reallocate the memory by quadratic step instead of linear. The modification reduced the reallocation to the 18,500 function calls on an4 database, thus by the factor of about 40. The tradeof was slight increase in the total amount of memory allocated from 5.7 MB to 6.2 MB and from 1.4 GB to 1.7 GB on an4 and rita databases respectively. The multiplication factor was set to 2 but this was arbitrary and can be subject to optimization.
In theory this modification could also reduce the running time of the algorithm but significant reduce of time was not measured (approximately 8 minutes +- few seconds in all cases – an4 & rita with and without the modification).
A question was whether a substantial fraction of memory demands could be attributed to memory leaks. This has shown not to be the case. Some memory leaks were found but nothing significant – under 1 MB in all cases. It was concluded that the reduced space technique implementation is necessary.
The work on the implementation of reduced space Baum-Welch has begun. This will involve a major rewriting of the forward, backward and Viterbi algorithms using checkpoints method (as described in article http://www.cse.ucsc.edu/research/compbio/papers/samspace.pdf). This method allows parametrization reducing the memory use from linear with respect to T to an arbitrary integer-root of T. The tradeof is increase in the computation length by the same factor. During this also the memory leaks should be completely prevented.
In order of testing this should be done in few phases to ensure the new implementation is correct. More information will be found at the project wiki http://cmusphinx.sourceforge.net/wiki/longaudiotraining as the work progresses.