CMUDict 0.7b update

Recently Professor Rudnicky updated CMUDict with the latest changes, and we now have version cmudict-0.7b, which you are welcome to check out from Subversion and use in your applications. CMUDict, the phonetic dictionary for US English, has long been one of the major components of the CMUSphinx toolkit. It has a long history as a unique resource for English pronunciation and is used by many other speech projects, both commercial and open source.

There are a few things in CMUDict that we would love to improve. These improvements would have a large impact on speech recognition research and on speech recognition technology overall:

Phonemic vs phonetic

CMUDict was originally a phonetic dictionary, as opposed to a phonemic one. It contains approximations of word pronunciations: it describes how a native US speaker would pronounce a word in read speech. In other conditions, even a native US speaker would pronounce the same words differently. For example,

uniform Y UW1 N AH0 F AO2 R M

with AH0 in the middle is already a reduced form of

uniform Y UW1 N IH0 F AO2 R M

which a speaker would use when pronouncing the word slowly. On the other hand, the dictionary doesn’t have the form

uniform Y UW1 N AH0 F AH0 R M

which a native speaker would use in fast conversational speech. There are many cases like this. Because of this structure, the dictionary is ready to use in a speech recognition system, but it is hard to conduct research on real phonetic reduction in various contexts, simply because the dictionary often lacks the original form. In modern systems, where phonetic reduction is becoming more important, we need more information about it in the dictionary. Hopefully, one day we will be able to collect both the original phonemic representation of each word and its possible phonetic realizations.
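To make the idea of storing several forms per word concrete, here is a minimal Python sketch that reads dictionary lines and groups pronunciation variants under one word. It assumes the usual CMUDict conventions (comment lines starting with ";;;" and alternative pronunciations written as "word(2)"); the three sample entries are the forms discussed above, and the "(2)" and "(3)" entries are illustrative only, not lines taken from the real dictionary.

```python
from collections import defaultdict


def parse_cmudict(lines):
    """Map each word to the list of its pronunciations (lists of phones)."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        # Alternative pronunciations are written as word(2), word(3), ...
        base = word.split("(")[0].lower()
        entries[base].append(phones)
    return dict(entries)


def strip_stress(phones):
    """Drop the trailing stress digit from vowels: 'AH0' -> 'AH'."""
    return [p.rstrip("012") for p in phones]


# Illustrative entries: the dictionary form, the slow form and the
# fast conversational form mentioned in the text.
sample = [
    "uniform Y UW1 N AH0 F AO2 R M",
    "uniform(2) Y UW1 N IH0 F AO2 R M",
    "uniform(3) Y UW1 N AH0 F AH0 R M",
]

variants = parse_cmudict(sample)
for phones in variants["uniform"]:
    print(" ".join(phones))
```

A structure like this could later be extended to mark which variant is the phonemic base form and which ones are reductions, but how to annotate that is an open design question.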

Newly appeared words

Words that appeared in the last few years and are now widely used are often missing from CMUDict: “skype”, “ipad”, “spotify”. There are many important entries to add. Well, “spotify” has to be added to my spellchecker first. Hopefully we can update the dictionary more frequently. A reasonable estimate of the required size of the dictionary is about 200-500 thousand US English words, so the dictionary has to at least double in size. That’s a lot of work to review.

Word origins and morphology

There are many research projects on predicting word pronunciations automatically. Still, for CMUDict the symbol error rate is about 8%, which results in a word error rate of about 30%. Sadly, these projects often treat words as black boxes, with no attempt to add any sense to them. The context in which a word is used and the origin of the word matter a lot: is it a surname, an abbreviation, a geographical term, or a scientific term? Such information could greatly improve both the quality of the dictionary and the accuracy of the prediction.
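The gap between the two error rates above follows from simple arithmetic: a word counts as wrong if any one of its phones is wrong. The sketch below is a back-of-the-envelope model only. It assumes phone errors are independent, which real pronunciation predictors do not satisfy, and the word lengths are illustrative values rather than CMUDict statistics.

```python
def word_error_rate(symbol_error_rate, phones_per_word):
    """Probability that at least one phone in a word is predicted wrongly,
    assuming independent errors (a simplifying assumption)."""
    return 1.0 - (1.0 - symbol_error_rate) ** phones_per_word


# With an 8% symbol error rate, longer words fail more often.
for length in (3, 4, 5, 6):
    print(length, round(word_error_rate(0.08, length), 3))
```

Under this model, words of four to five phones already land at roughly 28% to 34% word error, which is the same order as the figure quoted above.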

Other languages

There is growing interest in supporting other languages in the CMUSphinx toolkit: Spanish, French, British English. One serious problem is that we still lack a lot of data for them, and the dictionary is one big issue. Hopefully, we will be able to build an initial approximation of the dictionary with rule-based systems for at least some of these languages. Such data would enable research on multi-language and language-independent speech recognizers and would greatly benefit speech recognition toolkits.

Automatic dictionary acquisition

This is still an emerging technology, but there have already been some advances in automatic dictionary collection with software by LIUM. One can imagine a tool that scans through audio, learns the words it encounters, and generates pronunciations for them to add to the dictionary. Hopefully, such tools are not far off in the future of speech recognition.

So there is a lot of room for improvement in CMUDict alone, which matters for speech recognition research regardless of the toolkit or the speech recognition implementation. For that reason it is worth noting that CMUDict is also available on Github, so you are welcome to clone the repository, make your changes, and submit a pull request. That would be very much appreciated!

Mozilla Announces WebSpeech API

One of the main problems with existing open source speech recognition systems is that they are not really designed to be used in end-user software. They are mainly research projects created by universities, and they are intended to support new research. They make it easy to add new features quickly and to get the best results in various evaluations.

End-user software doesn’t work like that. You might not need to demonstrate the best accuracy, but you do need to match user expectations. For example, a user expects a reasonable result even when speaking far from the microphone or whispering. No modern system can recognize whispered speech reliably, so expectations are not met and complaints follow. A lot of work is required to solve problems like this.

However, since many commercial companies have promoted speech recognition to end users, open source software has a chance too. We can build software for the mass market and match commercial solutions in accuracy and robustness. An important step here is to gain the attention of the audience: instead of software for geeks, we need to become software for everybody, which is a very hard problem to solve. It’s great that there are projects with big ambitions here, in particular the Mozilla Foundation.

Recently the Mozilla Foundation announced a project to support the WebSpeech HTML API in their browsers. Celebrating 10 years of Firefox development, Mozilla CTO Andreas Gal announced this and many other features coming to the Mozilla codebase. A base system was implemented during a Google Summer of Code project by Andre Natal, and Andre continues to work on the project. You can get some idea of where it is now and how it is developing from the following post. So we will probably see a speech interface in the Firefox browser and Firefox OS pretty soon.

One of the main issues for wide adoption of speech interfaces will be support for both big and small languages. Firefox considers this an important direction of development, in particular support for Indian languages. I hope we are going to see a lot of progress here.

Pocketvox is listening to you!

We are pleased to announce a release of Pocketvox, a voice control project for the Linux desktop.

What is Pocketvox?

Pocketvox is a desktop application and a development library written in C. It was created by Benoit Franquet, a newly graduated French engineer who is visually impaired and passionate about desktop accessibility and application development.

Pocketvox is based on several well-known open source libraries: Sphinxbase, Pocketsphinx, GStreamer, GLib-2.0, GObject-2.0, Gtk+ and others from the GNOME project, Espeak to make Pocketvox respond with voice, and Xlib to detect the focused window.

Pocketvox comes with several tools that let both developers and desktop users create or use voice-controlled applications.

Pocketvox for desktop users

For the desktop, Pocketvox comes with:

- A menu launcher to launch the Pocketvox application
- A panel applet to manage Pocketvox
- A configuration GUI and its launcher in the menu
- A common way to store the configuration using GSettings
- A way to choose the microphone input
- A way to send commands over TCP
- An activation keyword in order to start actions when you say a specific word
- A very flexible way to manage modules

Pocketvox for developers

For developers, Pocketvox provides:

- A .pc file to develop with Pocketvox
- A Python interface
- A very easy way to design your own module in Python

The Python interface has been built using GObject introspection. A very basic example is available on the Github page of the project.

Some addons Pocketvox comes with

Pocketvox has been built around a system of modules. Why? Because this allows desktop users to define and create new modules very quickly, just by writing a dictionary file with a very simple structure like this:

open my documents=xdg-open ~/Documents
open my images=xdg-open ~/Images

Moreover, thanks to this structure users are able to associate a module with a specific application. When Pocketvox is running, it detects the focused application and executes the commands listed in the dictionary file of the corresponding module.
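The dictionary format shown above is simple enough to read in a few lines. The following Python sketch is only an illustration of the "spoken phrase=command" structure; it is not Pocketvox's own parser (Pocketvox itself is written in C), and the function names are made up for this example.

```python
def parse_module_dictionary(lines):
    """Parse 'spoken phrase=shell command' lines into a dict."""
    commands = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines
        # Split on the first '=' only, so the command may contain '=' itself.
        phrase, command = line.split("=", 1)
        commands[phrase.strip().lower()] = command.strip()
    return commands


def command_for(recognized_text, commands):
    """Return the command bound to a recognized phrase, or None."""
    return commands.get(recognized_text.strip().lower())


module = parse_module_dictionary([
    "open my documents=xdg-open ~/Documents",
    "open my images=xdg-open ~/Images",
])

print(command_for("Open my documents", module))  # xdg-open ~/Documents
```

Matching is done on the lowercased phrase here, which is a choice made for this sketch; the real application may normalize recognized text differently.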

In addition, some bash scripts have been integrated into the Pocketvox project so that users can quickly create a custom language model using the cmuclmtk toolkit.

Pocketvox works out of the box with the language models, acoustic models and dictionaries available on our website.

Pocketvox is ready for translation and is already available in French and English. All the steps to translate it are described in the Pocketvox repository on Github. Pocketvox is waiting for you to make it available in other languages.

How to get Pocketvox?

You can find all the information you need to try Pocketvox on the Github page of the project.

The first release was published yesterday, so you can get it here.

A way to make the world accessible

We recently got information about the CDRV project, a great effort to make the world more accessible.

CDRV (“Device Controller by Speech Recognition” in Spanish) is a device which can control a lift chair using voice commands. The purpose of this project is to help people with mobility problems, such as people with disabilities or elderly people, who are not able to use their hands to control the lift chair. The hardware of the CDRV consists of an overclocked Raspberry Pi model B and an extension board developed to control the motors of the lift chair, a Wi-Fi dongle, an external manual control and a Logitech USB microphone. The operating system was built with Buildroot and runs completely in RAM, so it is possible to turn off the power without shutting down the OS. The extension board was designed with KiCAD, and the plastic box was designed entirely with OpenSCAD and printed with a Prusa i3 RepRap.

The offline speech recognition software that runs on the Raspberry Pi is the C implementation of CMUSphinx: pocketsphinx. It uses the Spanish acoustic model provided by the CMUSphinx community and a slightly modified version of the pocketsphinx_continuous application to perform a command and control task. A language model with approximately 25 common words behaves like a garbage model. No confidence measures are computed.

CDRV is continuously listening for the utterance “Lift Chair Activation” (“Activación Sillón”, actually). Once this command is recognized, it produces a confirmation beep. Then it waits for a few seconds, listening for the actions “Up” (Sube) or “Down” (Baja). If either of these actions is recognized, the lift chair is activated accordingly. If not, a deactivation beep is emitted. At any time, the command “Stop” (Para) turns off the lift chair motors.
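The command flow described above is a small state machine, and it can be sketched in Python as follows. This is an illustration of the described behavior only, not the CDRV source code: the class name, the event strings and the 5 second timeout are assumptions (the text only says the device waits for some seconds), and recognized phrases are passed in as plain strings instead of coming from pocketsphinx.

```python
IDLE, ARMED = "idle", "armed"

ACTIVATION = "activación sillón"
ACTIONS = {"sube": "motors up", "baja": "motors down"}
STOP = "para"


class LiftChairController:
    def __init__(self, timeout=5.0):
        self.timeout = timeout  # assumed value; the text says "some seconds"
        self.state = IDLE
        self.armed_at = None
        self.events = []  # what the device would do, in order

    def tick(self, now):
        """Give up waiting for an action once the timeout has passed."""
        if self.state == ARMED and now - self.armed_at > self.timeout:
            self.events.append("deactivation beep")
            self.state = IDLE

    def on_phrase(self, phrase, now):
        """Feed one recognized phrase with its timestamp in seconds."""
        self.tick(now)
        phrase = phrase.strip().lower()
        if phrase == STOP:
            # "Stop" is accepted at any time.
            self.events.append("motors off")
            self.state = IDLE
        elif self.state == IDLE and phrase == ACTIVATION:
            self.events.append("confirmation beep")
            self.state, self.armed_at = ARMED, now
        elif self.state == ARMED and phrase in ACTIONS:
            self.events.append(ACTIONS[phrase])
            self.state = IDLE
        return self.state


chair = LiftChairController()
chair.on_phrase("Activación Sillón", now=0.0)
chair.on_phrase("Sube", now=1.5)
chair.on_phrase("Para", now=4.0)
print(chair.events)
```

Requiring the activation phrase before any movement command is what keeps the chair from reacting to “Sube” or “Baja” heard in ordinary conversation or from a television.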

The device has been tested with real users in real situations, with very good results. A few users are not recognized properly because of their particular phonetics, but in general the device reacts only when the correct words are pronounced, even with a loud television nearby. In the video below you can find a short demonstration of the device.

This project is being developed for a non-profit organization called CVI (Center of independent living). The non-profit ONCE foundation is funding the project, and the UPC university provides technical support.

Pocketsphinx Ruby is available on Github

We are pleased to announce the availability of the Ruby bindings for pocketsphinx created by Howard Wilson.

pocketsphinx-ruby is a high-level Ruby wrapper for the pocketsphinx C API. It uses the Ruby Foreign Function Interface (FFI) to directly load and call functions in libpocketsphinx, as well as libsphinxad for recording live audio using a number of different audio backends.

The goal of the project is to make it as easy as possible for the Ruby community to experiment with speech recognition, in particular for use in grammar-based command and control applications. Setting up a real time recognizer is as simple as:

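configuration = Pocketsphinx::Configuration::Grammar.new do
  sentence "Go forward ten meters"
  sentence "Go backward ten meters"
end

Pocketsphinx::LiveSpeechRecognizer.new(configuration).recognize do |speech|
  puts speech
end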

This library supports Ruby MRI 1.9.3+, JRuby, and Rubinius. It depends on the current development versions of Pocketsphinx and Sphinxbase; Homebrew recipes are available for a quick start on OS X.

LIUM releases TEDLIUM corpus version 2

The LIUM team, the main CMUSphinx contributor, today announced the release of TEDLIUM corpus version 2, an amazing database prepared from transcribed TED talks.

Details on this update can be found in the corresponding publication:

A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks”, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 2014.

This database of 200 hours of speech allows you to build a speech recognition system with very good performance using open source toolkits like Kaldi or CMUSphinx. A Kaldi recipe for TEDLIUM v1 is available in the repository, and we hope that the update to TEDLIUM v2 will be available soon.

Modern technology like automatic alignment of transcribed audio has made it easy to create very competitive databases. It is easy to predict that the size of the available databases will quickly grow to thousands of hours, and that we will therefore see a very significant improvement in the accuracy of open source recognition. The problem is that quite powerful training clusters will be required to work with such databases: it is not possible to train a model on a single server in an acceptable amount of time.

CMUSphinx is available on Windows Phone Platform

Microsoft has traditionally had very good speech recognition technology. The recently announced speech recognition assistant Cortana is one of the best assistants available. However, it might lack support for your native language or simply not behave the way you expect (hey, Siri still doesn’t support many languages either).

Thanks to the wonderful work of Toine de Boer, you can now enjoy Pocketsphinx on the Windows Phone platform. It is as straightforward as on Android: just download the project from our Github, import it into Visual Studio and run it on the phone. You can enjoy all the features of CMUSphinx on Windows Phone: continuous hands-free operation, switchable grammars, and support for custom acoustic and language models. There is no need to wait for speech recognition input in the game. We hope this opens up possibilities for great new applications.

The demo listens continuously for the keyphrase “oh mighty computer”, and once the keyphrase is detected it switches to grammar mode to let you input some information. Let us know how it works.

Python decoding example

The Python programming language has become amazingly popular recently thanks to the elegance of the language, the wide range of tools for scientific computing, including scipy and NLTK, and the immediacy of a “scripting” style language. We often get requests to explain how to decode with pocketsphinx and Python.

Another interesting activity around CMUSphinx is an updated acoustic model for the German language. Frequent updates are posted on the Voxforge website by Guenter, so please check out his new, much improved German models there. With new improvements like the audio aligner tool, you can build a very accurate model for almost any language in a week or so.

To summarize these new features, one of our users, Matthias, provided a nice tutorial on how to get started with Pocketsphinx, Python and the German models. With the new SWIG-based API we have increased support for the decoder features available in Python, so you can now do almost the same things from Python that you can do from C. If you are interested, please check his blog post.

If you have issues with the Python API or want to help with your language, let us know.

Long Audio Aligner Landed in Trunk

After three years of development we have finally merged an aligner for long audio files into trunk. The aligner takes an audio file and the corresponding text and outputs timestamps for every word in the audio. This functionality is useful for processing transcribed recordings such as podcasts, with applications like better support for audio editing or automatic subtitle synchronization. Another important application is acoustic model training: with the new feature you can easily collect databases of thousands of hours for your native language from data on the Internet, such as news broadcasts, podcasts and audio books. With this new feature we expect the list of supported languages to grow very fast.

To access the new feature, check out sphinx4 from Subversion or from our new repository on Github and build the code with Maven using “mvn install”.

For the best accuracy, download the En-US generic acoustic model from downloads, as well as the g2p model for US English.

Then run the alignment:

java -cp sphinx4-samples/target/sphinx4-samples-1.0-SNAPSHOT-jar-with-dependencies.jar \
edu.cmu.sphinx.demo.aligner.AlignerDemo file.wav file.txt en-us-generic \
cmudict-5prealpha.dict cmudict-5prealpha.fst.ser

The result will look like this:

+ of                        [10110:10180]
  there                     [11470:11580]
  are                       [11670:11710]
- missing

Here + denotes an inserted word and - denotes a missing word. The numbers are times in milliseconds.
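If you want to post-process this output, for example to generate subtitles, it is easy to parse. The Python sketch below assumes only the layout shown in the sample above (an optional + or - marker, a word, and an optional [start:end] pair); the function name and the status labels are made up for this example.

```python
import re

# optional marker, the word, and an optional [start:end] time pair
LINE_RE = re.compile(r"^([+-]?)\s*(\S+)(?:\s+\[(\d+):(\d+)\])?\s*$")

STATUS = {"+": "inserted", "-": "missing", "": "aligned"}


def parse_alignment(lines):
    """Return a list of (status, word, (start_ms, end_ms) or None) tuples."""
    result = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip blank or unexpected lines
        marker, word, start, end = match.groups()
        times = (int(start), int(end)) if start is not None else None
        result.append((STATUS[marker], word, times))
    return result


sample = [
    "+ of                        [10110:10180]",
    "  there                     [11470:11580]",
    "  are                       [11670:11710]",
    "- missing",
]

for status, word, times in parse_alignment(sample):
    print(status, word, times)
```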

Please remember that the input file must be 16 kHz, 16-bit, mono. The text must be preprocessed, since the algorithm doesn’t handle numbers yet.
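A quick way to verify the audio format before running the aligner is to inspect the WAV header. The following Python sketch uses only the standard `wave` module; the helper names are made up for this example, and the silent files it writes exist only to demonstrate the check.

```python
import os
import tempfile
import wave


def check_wav_format(path):
    """Return a list of problems; an empty list means 16 kHz, 16-bit, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append("sample rate is %d Hz, expected 16000" % wav.getframerate())
        if wav.getsampwidth() != 2:
            problems.append("sample width is %d bytes, expected 2" % wav.getsampwidth())
        if wav.getnchannels() != 1:
            problems.append("%d channels, expected 1" % wav.getnchannels())
    return problems


def write_silence(path, rate=16000, channels=1, seconds=0.1):
    """Write a short silent 16-bit WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good)
    write_silence(bad, rate=44100, channels=2)
    print(check_wav_format(good))
    print(check_wav_format(bad))
```

Files in another format have to be converted with an audio tool of your choice before alignment; the check above only reports the mismatch.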

Work on long alignment started as a 2011 GSoC project proposed by Bhiksha Raj and James Baker. Apurv Tiwari made a lot of progress on it, but we were not able to produce a robust alignment algorithm: it still failed in too many cases, and the failures were critical. Finally we changed the algorithm to multipass decoding, and it started to work better and to recover gracefully from errors in the transcription. Alexander Solovets was responsible for the implementation. The algorithm still doesn’t handle some important tokens like numbers or abbreviations, and its speed needs improvement, but it is already useful, so we can proceed with the next steps of model training. We hope to improve the situation in the near future.

SOA Architecture For Speech Recognition

It’s interesting that, as speech recognition becomes widespread, the approach to the architecture of a speech recognition system changes significantly. When only a single application needed speech recognition, it was enough to provide a simple library of speech recognition functions, like pocketsphinx, and link it into the application. That is still a valid approach for embedded devices and specialized deployments. However, things change once you start to plan a speech recognition framework for the desktop. Many applications require a voice interface, and we need to let all of them interact with the user. Each of them needs time to load the models into memory and memory to hold them. Since these requirements are pretty high, it becomes obvious that speech recognition has to be placed into a centralized process. Naturally, the concept of a speech recognition server appears.

It’s interesting that many speech recognition projects start to talk about the server:

- Simon has been using a common daemon (SimonD), managed over sockets, to provide speech recognition functions.
- Rodrigo Parra is implementing a dbus-based server for the TamTam Listens project, a speech recognition framework for the Sugar OLPC project. This is very active work in progress; subscribe to the Tumblr blog to get the latest updates.
- Andre Natal talks about a speech recognition server for FirefoxOS in his summer project.

Right now the solution is not yet stable; it is more of a work in progress. It would be great if these efforts could converge in the future. CMUSphinx could probably be the common denominator here and provide a desktop service for applications that want to implement voice interfaces. A standalone library is certainly still needed, so we shouldn’t focus only on the service architecture, but a service would be a good addition too. It could provide common interfaces for applications, which would just need to register their required commands with the service.
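To illustrate what "registering commands with the service" could look like, here is a minimal in-process Python sketch. It is purely hypothetical: the class and method names are invented for this example and do not correspond to any existing CMUSphinx, Simon or TamTam Listens API, and a real service would sit behind D-Bus or a socket and run the recognizer itself.

```python
class RecognitionService:
    """Toy stand-in for a shared desktop speech recognition service.

    The models would be loaded once here, and every application would
    only register the phrases it cares about.
    """

    def __init__(self):
        self._handlers = {}  # phrase -> (application name, callback)

    def register(self, app, phrase, callback):
        """Let an application claim a phrase; one owner per phrase."""
        key = phrase.strip().lower()
        if key in self._handlers:
            owner = self._handlers[key][0]
            raise ValueError("%r is already registered by %s" % (phrase, owner))
        self._handlers[key] = (app, callback)

    def dispatch(self, recognized_text):
        """Route a recognition result to the application that registered it."""
        entry = self._handlers.get(recognized_text.strip().lower())
        if entry is None:
            return None  # nobody asked for this phrase
        app, callback = entry
        return app, callback()


service = RecognitionService()
service.register("player", "next song", lambda: "skipping track")
service.register("browser", "open new tab", lambda: "tab opened")

print(service.dispatch("Next song"))
```

Rejecting duplicate registrations is just one possible policy; a real service might instead route a phrase to whichever application currently has focus.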

Of course there is the option of putting everything in the cloud, but a cloud solution has its own disadvantages. Privacy concerns remain, and data connections are still expensive and slow. There are similar issues with other resource-intensive APIs like text-to-speech, desktop natural language processing, translation, and so on, so soon quite a lot of memory on the desktop will be spent on desktop intelligence. So reserve a few more gigabytes of memory in your systems; they will be taken pretty soon.