Grapheme-to-phoneme tool based on sequence-to-sequence learning

April 28th, 2016

Recurrent neural networks (RNNs) with long short-term memory (LSTM) cells have recently shown very promising results in language modeling, machine translation, speech recognition and other fields related to sequence processing. The nice thing is that such a system is almost plug and play: you feed it inputs and get decent accuracy.

We have released a grapheme-to-phoneme toolkit based on the sequence-to-sequence encoder-decoder LSTM architecture originally developed for machine translation. This approach has already been used successfully by Microsoft and Google for grapheme-to-phoneme conversion. The great thing about it is its simplicity: one LSTM encodes the character sequence into a continuous space, and another LSTM decodes the phoneme sequence using an attention mechanism. Interestingly, training does not require grapheme-phoneme alignments as conventional WFST approaches do; the model simply learns from the data.
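To give a feel for what the attention mechanism does at each decoding step, here is a minimal sketch in plain Python. It is not the toolkit's code: it assumes simple dot-product scoring, and the names `softmax` and `attend` and the toy vectors are made up for illustration. The actual TensorFlow model may score and combine states differently.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One attention step (dot-product scoring is an assumption here).

    Scores each encoder state by its similarity to the current decoder
    state, then returns the weights and the weighted average of the
    encoder states (the context vector).
    """
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy vectors standing in for the encoder states of the letters "c", "a", "t".
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Made-up decoder state that most resembles the second letter.
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

The weights show which letters the decoder is looking at while it emits the next phoneme. This soft lookup is what lets the model do without an explicit grapheme-phoneme alignment.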

This implementation is based on TensorFlow, which allows efficient training on both CPU and GPU.

The code is available in the CMUSphinx section on GitHub.

You can download an example model with 2 hidden layers and 64 units per layer, trained on the CMU dictionary. It generates pronunciations for new words and can be used in the following simple way:

python --decode [your_wordlist] --model [path_to_model]

In addition, you can train new G2P models from any existing dictionary:

python --train [train_dictionary.dic] --model [output_model_path]
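Since no alignment is needed, the training data is just a list of letter sequences paired with phoneme sequences. The sketch below shows one way to read such pairs. It assumes a CMUdict-style file (one entry per line, the word followed by space-separated phonemes, alternative pronunciations marked as `word(2)`, comment lines starting with `;;;`). The function name `read_dictionary` is hypothetical and this is not the toolkit's own loader.

```python
import re

def read_dictionary(lines):
    """Return a list of (letters, phonemes) pairs from CMUdict-style lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        word, *phones = line.split()
        if not phones:  # entry without a pronunciation, nothing to learn from
            continue
        word = re.sub(r"\(\d+\)$", "", word)  # "read(2)" -> "read"
        pairs.append((list(word), phones))
    return pairs

# Small in-memory sample in the assumed format.
sample = [
    ";;; comment line",
    "hello HH AH L OW",
    "read R IY D",
    "read(2) R EH D",
]
pairs = read_dictionary(sample)
```

Each pair is one training example: the letters go to the encoder and the phonemes are the decoder's target. Note that the two sequences can have different lengths.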

The tool lets you select various training parameters. Feel free to experiment with the number of parameters and the learning rate. Training is not fast: a large model takes about 1 day to train, though it should be faster on a GPU.

We are still testing the accuracy of the model, but it seems to be comparable with the Phonetisaurus tool. A small model with hidden layer size 64 performs slightly worse but is very small (500 KB); a large model with 512 units in the hidden layer is slightly more accurate.
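If you want to compare models yourself, G2P accuracy is commonly reported as word error rate (the share of words with any mistake in the pronunciation) and phoneme error rate (edit distance over the total number of reference phonemes). The post does not say which metric we used, so treat the following as a rough sketch with hypothetical function names and toy data, not as our evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def error_rates(references, hypotheses):
    """Return (word error rate, phoneme error rate) for parallel lists."""
    wrong_words = sum(ref != hyp for ref, hyp in zip(references, hypotheses))
    phone_errors = sum(edit_distance(ref, hyp)
                       for ref, hyp in zip(references, hypotheses))
    total_phones = sum(len(ref) for ref in references)
    return wrong_words / len(references), phone_errors / total_phones

# Toy data: two reference pronunciations and two model outputs.
refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AE", "T"], ["D", "AA", "G"]]
wer, per = error_rates(refs, hyps)
```

In this toy example one of the two words is wrong and one of the six phonemes is wrong, so the word error rate is much higher than the phoneme error rate. That is typical, which is why it helps to report both.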

JabberChess, A Mobile Chess App Speech Recognizer for Visual and Motion Impaired Users

March 2nd, 2016


Jackson Chen, a high-schooler from Colorado, US, has recently released a voice-controlled chess application built with CMUSphinx. You can find it in the App Store.

It is a very interesting case of CMUSphinx used in a real-life application, not least because Jackson has written a complete report on his experience building the chess application and shared the models he created in the process. He did a huge amount of work testing the application in real-life conditions. He compared the performance of CMUSphinx grammars and language models, and he also explored recognition with a commercial engine from Nuance. You can easily guess which engine provided the best accuracy; if you want to learn more, please check the full report. You can find the models here.

If you have built an application with CMUSphinx, please let us know!

OpenEars 2.5 is out

February 23rd, 2016

See the details here. OpenEars now supports speech recognition in English, Spanish, Mandarin Chinese, French, German, and Dutch.