Speech recognition on Kindle Touch with CMUSphinx

We are happy to announce that CMUSphinx-powered speech recognition has come to the Amazon Kindle. “Vague”, the Voice Activated GUi Extension, was recently introduced and is already available for your Kindle through KUAL, a unified launcher.

Vague screenshot on Kual

If you have an old Kindle sitting around, or you just want to get a little more out of the one you use every day, jailbreaking is simple. Once you jailbreak, KUAL is a worthwhile little application launcher that gives you easy access to what you download.

KUAL works with pretty much every single Kindle model. Once it’s installed, you can run the program, and you’re given a simple, easy to use launcher to access everything on your Kindle. That means games, VNC clients, apps, and plenty more. It’s a nifty little launcher, and the fact it works on pretty much every Kindle out there makes it simple to use.

Vague allows you to navigate through your book reader and launch various tools and, more importantly, it is designed with extensibility in mind. That means you can easily add your own commands with just a simple script!

Great job done!

OpenEars version 1.3.0 Preview Is Available

A new version of OpenEars was recently announced. The main feature of the new 1.3.0 release is an upgrade to the latest CMUSphinx codebase, pocketsphinx-0.8. This upgrade should bring additional stability and performance, so you are welcome to try it!

OpenEars is the most popular free offline speech recognition and text-to-speech framework on iOS, and the basis for the OpenEars Platform, a plugin system that lets you drag-and-drop new speech capabilities into your iOS app.

If you are interested in examples of applications built with CMUSphinx and the OpenEars framework, please visit this cool project. Photo editing can be a challenging task, and it becomes even more difficult on the small, portable screens, such as those of camera phones, that are now frequently used to edit images. To address this problem, PixelTone, a multimodal photo editing interface that combines speech and direct manipulation, was created:

This truly creative application demonstrates how powerful a multimodal interface built with CMUSphinx can be. Your application could be the next voice-enabled one!

Pocketsphinx For Android In Google Play

Pocketsphinx is a great alternative to closed-source vendor SDKs thanks to its open source nature, extensibility and features. If you are looking to implement a speech application on Android, feel free to try Pocketsphinx. To get started, you can use existing applications like Inimesed.

It’s a great application for selecting contacts, and you can install it on your device with a single click.

The sources and related materials are available on GitHub. Many thanks to Kaarel Kaljurand for his great software!

If you know some other applications using CMUSphinx, feel free to share!

A new English language model release

A new English language model is available (updated) for download on our new Torrent tracker.

This is a good trigram language model for general transcription, trained on various open sources, for example Gutenberg texts.

It achieves good transcription performance on various types of texts. For example, the perplexities on the following test sets are:

TED talks from IWSLT 2012

Perplexity: 158.3

Lecture transcription task

Perplexity: 206.677
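
For readers unfamiliar with the metric: perplexity is the inverse geometric mean of the per-word probabilities the model assigns to the test text, so lower is better. Below is a minimal Python sketch, assuming the per-word log10 probabilities have already been obtained from some language model toolkit. The function name and the toy input are illustrative and not part of any CMUSphinx tool.

```python
import math


def perplexity(log10_probs):
    """Perplexity of a test set from per-word log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one word probability")
    # Average log10 probability per word.
    avg = sum(log10_probs) / len(log10_probs)
    # Perplexity is 10 raised to the negative average log10 probability.
    return 10 ** (-avg)


if __name__ == "__main__":
    # Toy input: four words, each assigned probability 0.01 by the model.
    print(perplexity([math.log10(0.01)] * 4))
```

A model that assigns every word a probability of 0.01 has a perplexity of 100, so the figures above mean the model is, on average, about as uncertain as if it were choosing among roughly 160 to 200 equally likely words.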

Besides the transcription task, this model should be significantly better on conversational data such as movie transcription.

The language model was pruned with a beam of 5e-9 to reduce its size. It can be pruned further if needed, or the vocabulary can be reduced to fit the target domain.

Help to distribute CMUSphinx data through Bittorrent

Modern speech recognition algorithms require an enormous amount of data to estimate speech parameters. Audio recordings, transcriptions, texts for the language model, pronunciation dictionaries and vocabularies are all collected by speech developers. While this will not necessarily be the case in the future, and better algorithms might require just a few examples, today you need to process thousands of hours of recordings to build a speech recognition system.

Estimates show that a human hears thousands of hours of speech before learning to understand it. Note that humans also have prior knowledge structures embedded in the brain that we are not aware of. Google trains its models on 100 thousand hours of audio recordings and petabytes of transcriptions, yet it is still behind human performance in speech recognition tasks. For search queries it still has a word error rate of 10%, and for YouTube Google’s word error rate is over 40%.

mrflip/CC BY-NC-SA 2.0

Google has vast resources, but so do we. We can definitely collect, process and share even more data than Google has. The first step in this direction is to create shared storage for audio data and CMUSphinx models.

We created a torrent tracker specifically to distribute legal speech data related to CMUSphinx, speech recognition, speech technologies and natural language processing. Thanks to Elias Majic, the tracker is available at

http://cmusphinx.info

Currently the tracker contains torrents for the existing acoustic and language models, but new, more accurate models for US English and other languages will be released soon.

We encourage you to make other speech-related data available through our tracker. Please contact the cmusphinx-devel@lists.sourceforge.net mailing list if you want to add your data set to the tracker.

Please help us distribute the data: start a client on your host and make the data available to others.

To learn more about BitTorrent, visit this link or search the web; there is a vast amount of resources about it.

You might wonder what the next step is. Pretty soon we will be able to run a distributed acoustic model training system that trains the acoustic model using a vast amount of distributed data and computing power. With a BOINC-based grid computing network of CMUSphinx tools, together we will create the most accurate models for speech. Stay tuned.

New release: sphinxbase-0.8, pocketsphinx-0.8 and sphinxtrain-0.8

We are pleased to announce that today a pack of CMUSphinx packages was released:

  • sphinxbase-0.8
  • pocketsphinx-0.8
  • sphinxtrain-0.8

For the download links see:

http://cmusphinx.sourceforge.net/wiki/download

The biggest update in this release is a new sphinxtrain. Code sharing between sphinxbase and sphinxtrain has increased significantly, bringing a more consistent codebase and interface, more accurate memory management and increased usability.

Besides that, a single sphinxtrain binary has been introduced to provide easy and flexible access to the whole training procedure. In the future we hope to reduce the number of Perl scripts in the training setup and to port everything to Python. This will open up access to the advanced Python ecosystem, including scientific packages, graphics and distributed computing.

Another notable change in this release is a new openfst-based G2P framework implemented during Google Summer of Code. Credit for this goes to Josef Robert Novak and John Salatas. This framework is also supported by sphinx4 and provides a uniform and accurate algorithm to create dictionaries from word lists.

Numerous bug fixes and improvements were submitted by our contributors. We are grateful to the great developers who made this release possible. Many thanks to our star team, which is impressively long:

Alessandro Curzi
Alexandru-Dan Tomescu
Balkce
Bhiksha Raj
Blake Lemoine
Boris Mansencal
Douglas Bagnall
Erik Andresen
Evandro Gouvea
Glenn Pierce
Halle Winkler
Jidong Tao
John Salatas
Josef Novak
Kho-And-Mica
Kris Thielemans
Lionel Koenig
Marc Legendre
Melmahdy
Michal Krajnansky
Nicola Murino
Pankaj Pailwar
Paul Dixon
Pecastro
Peter Grasch
Riccardo Magliocchetti
Scott Silliman
Shea Levy
Tanel Alumae
Tony Robinson
Vassil Panayotov
Vijay Abharadwaj
Vyacheslav Klimkov
Yuri Orlov
Zheng6822

For more detailed information see the NEWS file in the corresponding packages.

The new sphinx4 package and an Android demo using pocketsphinx will be released soon, finalizing the release cycle. After that, great new features will start making their way into the codebase. Stay tuned.

A bunch of great CMUSphinx posts

For those who are interested in CMUSphinx on mobile, please check out the PolitePix blog, where you can find some interesting ideas about pocketsphinx on iPhone:

OpenEars tips #1: create a language model before runtime from a text file or corpus

OpenEars tips #2: N-Best hypotheses with OpenEars

OpenEars tips #3: Acoustic model adaptation

OpenEars tips #4: Testing someone else’s recognition results using a recording

OpenEars is the easiest way to try open offline speech recognition on the iPhone platform. If you are interested in adding speech recognition to your iPhone application, you should definitely check it out.

GSoC 2012: Pronunciation Evaluation using CMUSphinx – Project Conclusions

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Final Report)

This article briefly summarizes the implementation of the GSoC 2012 Pronunciation Evaluation project.

I started with sphinx forced alignment and obtained the spectral-matching acoustic scores and durations at the phone and word level using WSJ models. After that I concentrated mainly on two things as part of the GSoC 2012 project: edit-distance neighbor-phone decoding, and scoring routines for both text-dependent and text-independent systems.

Edit-distance Neighbor phones decoding:

1. I started with a single-phone decoder and then explored a three-phone decoder, a word decoder and a complete phrase decoder, providing neighbor phones as alternatives to the expected phone.
2. The decoding results showed that word-level and phrase-level decoding using JSGF give almost the same results.
3. This method helps to detect mispronunciations at the phone level, and can detect homographs as well if the percentage of decoding errors can be reduced.
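
The idea of offering neighbor phones as alternatives can be illustrated with a small grammar generator. This is only a sketch under stated assumptions, not the project's actual code: the function name, the grammar name and the neighbor table are hypothetical, and the real neighbor sets used in the project are not given in this article. It emits a JSGF grammar in which every expected phone may be replaced by one of its neighbors.

```python
def neighbor_grammar(expected, neighbors, name="neighbors"):
    """Build a JSGF grammar allowing each expected phone or one of its neighbors.

    expected  -- list of phone names in the order they should be spoken
    neighbors -- dict mapping a phone to a list of alternative phones
                 (hypothetical table, supplied by the caller)
    """
    slots = []
    for phone in expected:
        # The expected phone comes first, followed by its distinct neighbors.
        alternatives = [phone] + [p for p in neighbors.get(phone, []) if p != phone]
        slots.append("( " + " | ".join(alternatives) + " )")
    return "#JSGF V1.0;\ngrammar {0};\npublic <{0}> = {1};\n".format(
        name, " ".join(slots)
    )


if __name__ == "__main__":
    # Illustrative neighbor sets only; not taken from the project.
    table = {"ae": ["eh", "ah"], "t": ["d"]}
    print(neighbor_grammar(["k", "ae", "t"], table))
```

If the decoder picks an alternative other than the first one in a slot, that slot marks a candidate mispronunciation of the expected phone.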

Scoring Routines:

Text-dependent:
This method is based on exemplars for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Then, given a test recording, each phone in the phrase is compared with the exemplar statistics. Z-scores are calculated, and normalized scores are then derived from the maximum and minimum z-scores of the exemplar recordings. All phone scores are aggregated to get a word score, and all word scores are then aggregated with POS weights to get the complete phrase score.
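
The scoring steps can be sketched roughly as follows. The exact normalization and aggregation formulas are not given in this summary, so this sketch makes several assumptions: min-max scaling of the z-score clipped to [0, 1], a plain mean to combine phone scores into a word score, and a weighted mean with caller-supplied POS weights for the phrase score. All function names are illustrative.

```python
from statistics import mean, pstdev


def normalized_score(value, exemplar_values):
    """Scale the z-score of `value` between the min and max exemplar z-scores."""
    mu = mean(exemplar_values)
    sigma = pstdev(exemplar_values)
    if sigma == 0:
        # Degenerate case: all exemplars are identical.
        return 1.0 if value == mu else 0.0
    z_exemplars = [(x - mu) / sigma for x in exemplar_values]
    z_min, z_max = min(z_exemplars), max(z_exemplars)
    z = (value - mu) / sigma
    # Assumed normalization: min-max scaling, clipped to the range [0, 1].
    return min(1.0, max(0.0, (z - z_min) / (z_max - z_min)))


def word_score(phone_scores):
    """Assumed aggregation: a word score is the mean of its phone scores."""
    return mean(phone_scores)


def phrase_score(word_scores, pos_weights):
    """Weighted mean of word scores using part-of-speech weights."""
    total = sum(pos_weights)
    return sum(s * w for s, w in zip(word_scores, pos_weights)) / total


if __name__ == "__main__":
    exemplar_acoustic = [-10.0, -20.0, -30.0]  # toy exemplar scores for one phone
    print(normalized_score(-20.0, exemplar_acoustic))
    print(phrase_score([1.0, 0.5], [2.0, 1.0]))
```

The same `normalized_score` function can be applied separately to acoustic scores and to durations before the results are combined.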

Text-independent:
This method is based on predetermined statistics built from any corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (begin/middle/end) in the word. Given any test file, the acoustic score and duration of each phone are compared with the corresponding phone statistics based on this contextual information. The scoring method is the same as that of the text-dependent system.
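
Collecting position-dependent statistics could look something like the sketch below. It assumes that phone-level alignments of the corpus are already available as lists of (phone, duration) pairs per word; the data layout, the function names and the rule that a single-phone word counts as "begin" are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def position_of(index, length):
    """Classify a phone's position within a word of `length` phones."""
    if index == 0:
        return "begin"  # assumption: a one-phone word counts as "begin"
    if index == length - 1:
        return "end"
    return "middle"


def build_stats(aligned_words):
    """Return {(phone, position): (mean duration, standard deviation)}.

    aligned_words -- iterable of words, each a list of (phone, duration) pairs
    """
    samples = defaultdict(list)
    for word in aligned_words:
        for i, (phone, duration) in enumerate(word):
            samples[(phone, position_of(i, len(word)))].append(duration)
    return {key: (mean(v), pstdev(v)) for key, v in samples.items()}


if __name__ == "__main__":
    # Toy alignments with durations in seconds; not real TIMIT data.
    words = [
        [("k", 0.08), ("ae", 0.12), ("t", 0.06)],
        [("t", 0.05), ("uw", 0.15)],
    ]
    for key, value in sorted(build_stats(words).items()):
        print(key, value)
```

At test time, each phone of the test recording is looked up by its (phone, position) key and scored against the stored mean and deviation.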

Demo:
Please try our demo @ http://talknicer.net/~ronanki/test/ and help us by giving the feedback.

Documentation and Code:
All code is uploaded to the CMUSphinx svn @ http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/ and raw documentation of the project can be found here.

Conclusions:
The pronunciation evaluation system helps second-language learners improve their pronunciation by trying multiple times, and lets them correct themselves by giving the necessary feedback at the phone and word level. I couldn’t complete some of the things I mentioned earlier during the project, such as CART modelling. But I hope that I can keep contributing to this project in the future as well.

This summer has been a great experience for me. Google Summer of Code 2012 has finally ended. As a final note, this article is just a summary of the work during the project; an extensive set of documentation will be added at http://cmusphinx.sourceforge.net/wiki/faq#qhow_to_implement_pronunciation_evaluation. You can also read more about this project and the weekly progress reports at http://pronunciationeval.blogspot.in/

GSoC 2012: Pronunciation Evaluation #Troy – Project Conclusions

(author: Troy Lee)

This article briefly summarizes the design and implementation of the Pronunciation Evaluation Web Portal for the GSoC 2012 Pronunciation Evaluation Project.

The pronunciation evaluation system consists mainly of the following components:

1) Database management module: Store, retrieve and update all the necessary information, including user information and data such as phrases, words, correct pronunciations and assessment scores.

2) User management module: New user registration, information update, change/reset password and so on.

3) Audio recording and playback module: Recording the user’s pronunciation for further processing.

4) Exemplar verification module: Decide whether a given recording is an exemplar or not.

5) Pronunciation assessment module: Provide numerical evaluation at the phoneme level (which could be aggregated to form higher level evaluation scores) in both acoustic and duration aspects.

6) Phrase library module: Allow users to add new phrases to the database for evaluation.

7) Human evaluation module: Support human experts to evaluate the users’ pronunciations which could be compared with the automatically generated evaluations.

The website can be tested at http://talknicer.net/~li-bo/datacollection/login.php. Do let me know (troy.lee2008@gmail.com) if you encounter any problem, as the site needs quite a lot of testing before it works robustly. The complete setup of the website can be found at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy/. More details on functionality and implementation can be found in a more manual-like report:

Although this is the end of this GSoC, it is just the start of our project, which leverages open source tools to improve people’s lives around the world using speech technologies. We are currently preparing to use Amazon Mechanical Turk to collect more exemplar data through our web portal, to build a rich database for improved pronunciation evaluation performance, and to make learning much more fun through gamification.

GSoC 2012: Grapheme to Phoneme Conversion in sphinx-4 – Project conclusions

(author: John Salatas)

Foreword

This article summarizes the Grapheme-to-Phoneme (g2p) in sphinx-4 project, which was part of the GSoC 2012 program and can be thought of as an integration of the phonetisaurus [1] g2p application with both SphinxTrain and Sphinx-4. The project can be divided into three parts: the g2p model training procedure integrated into the SphinxTrain application, the java g2p decoder integrated into Sphinx-4, and the new java FST framework which was created for the project’s needs.

The training procedure

The training procedure is based on the original phonetisaurus training procedure, using the openGRM NGram Library instead of the MITLM toolkit. In order to use it, you first need to install the openFST [2] and openGRM NGram [3] libraries on your system and then build the SphinxTrain application, passing the --enable-g2p-decoder parameter to the autogen.sh script.

Training an acoustic model by following the instructions at [4] can also train a g2p model. In addition to the steps in [4], after running the sphinxtrain -t an4 setup command you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the training configuration file to

$CFG_G2P_MODEL = 'yes';

With the g2p functionality enabled, the SphinxTrain application will train a new g2p model from the provided dictionary during its initial steps, and will then use it to supply any pronunciations missing from the training transcription file.

The new java FST framework

In order to use the generated g2p model in java, we needed to port the original phonetisaurus decoder to java. As a first step, a general-purpose java FST framework was created. It can handle FST models generated with the openFST library and contains all the FST functionality and operations needed by the g2p decoder.

The java FST framework is available at CMUSphinx SVN Repository in [5].

Using the g2p models in sphinx-4

Given the text-format files of the g2p model created with SphinxTrain (the fst text file and the input/output symbol table files), we first need to convert them to the java FST binary format. This can be done using the openfst2java.sh script, which is distributed with the java FST framework. The script accepts two parameters: the first points to the base location (path and base filename, excluding extensions) of the trained model’s text-format files, and the second gives the full path and filename to which the java FST model will be saved.
After the conversion, in order to use the java FST model, we need to add the following lines to the dictionary component in the configuration file:

<property name="allowMissingWords" value="true"/>
<property name="createMissingWords" value="true"/>
<property name="g2pModelPath" value="path_to_the_g2p_model"/>
<property name="g2pMaxPron" value="2"/>

Note that the “wordReplacement” property should not exist in the dictionary component. The “g2pModelPath” property should contain a URI pointing to the g2p model in java FST format. The “g2pMaxPron” property holds the number of different pronunciations generated by the g2p decoder for each word. More information about sphinx-4 configuration can be found at [6].

Conclusion

In addition to the new g2p feature introduced in sphinx-4, we need to emphasize the new java FST framework. Its usage and extensive testing in the sphinx-4 g2p decoder suggest that the implemented functionality is usable in general, although it may lack functionality required by other applications (e.g. additional operations), which in any case should not be hard to implement.

As a final note, this article is just a summary of the work during the project; an extensive set of documentation is available at the GSoC project page [7].

References

[1] phonetisaurus A WFST-driven Phoneticizer

[2] OpenFst Library Home Page

[3] OpenGrm NGram Library

[4] Training Acoustic Model For CMUSphinx

[5] Java FST Framework SVN Repository

[6] Sphinx-4 Application Programmer’s Guide

[7] “GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4”