Python decoding example

The Python programming language has been gaining remarkable popularity recently thanks to the elegance of the language, a wide range of tools for scientific computing including SciPy and NLTK, and the immediacy of a “scripting” style language. We often get requests to explain how to decode with pocketsphinx from Python.

Another interesting activity around CMUSphinx is an updated acoustic model for the German language. Frequent updates are posted on the Voxforge website by Guenter; please check his new, much improved German models here: http://goofy.zamia.org/voxforge/de/. With new tools like the audio aligner you can build a very accurate model for almost any language in a week or so.

To bring these new features together, one of our users, Matthias, provided a nice tutorial on how to get started with Pocketsphinx, Python and the German models. With the new SWIG-based API we have increased the coverage of decoder features available in Python; you can now do almost everything from Python that you can do from C. If you are interested, please check his blog post here:

https://mattze96.safe-ws.de/blog/?p=640

If you have issues with the Python API or want to help with your language, let us know.
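
For reference, here is a minimal decoding sketch using the SWIG-based Python bindings. All paths are placeholders, and the import path and the exact start_utt signature may differ slightly between versions, so treat it as a starting point rather than the definitive API:

# Minimal sketch, assuming the SWIG-based pocketsphinx Python bindings are
# installed; replace the model, dictionary and audio paths with your own.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/acoustic-model')      # e.g. the German model
config.set_string('-lm', '/path/to/language-model.lm')
config.set_string('-dict', '/path/to/dictionary.dic')
decoder = Decoder(config)

decoder.start_utt()          # older bindings may expect an utterance id here
with open('/path/to/audio-16khz-mono.raw', 'rb') as audio:
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
if hyp is not None:
    print(hyp.hypstr)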

Long Audio Aligner Landed in Trunk

After three years of development we have finally merged an aligner for long audio files into trunk. The aligner takes an audio file and the corresponding text and dumps timestamps for every word in the audio. This functionality is useful for processing transcribed files such as podcasts, with further applications like better support for audio editing or automatic subtitle synchronization. Another important application is acoustic model training: with this new feature you can easily collect databases of thousands of hours for your native language from data on the Internet such as news broadcasts, podcasts and audio books. We therefore expect the list of supported languages to grow very fast.

To access the new feature, check out sphinx4 from Subversion or from our new repository on GitHub at http://github.com/cmusphinx/sphinx4 and build the code with Maven using “mvn install”.

For the best accuracy, download the generic US English acoustic model from the downloads section as well as the g2p model for US English.

Then run the alignment:

java -cp sphinx4-samples/target/sphinx4-samples-1.0-SNAPSHOT-jar-with-dependencies.jar \
edu.cmu.sphinx.demo.aligner.AlignerDemo file.wav file.txt en-us-generic \
cmudict-5prealpha.dict cmudict-5prealpha.fst.ser

The result will look like this:

+ of                        [10110:10180]
  there                     [11470:11580]
  are                       [11670:11710]
- missing

Here + denotes an inserted word and – marks a missing word. The numbers are times in milliseconds.

Please remember that the input file must be 16 kHz, 16-bit mono. The text must be preprocessed; the algorithm doesn’t handle numbers yet.
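
If you want to post-process the alignment programmatically, a small parsing sketch like the one below may help. The line format ("word [start:end]" with an optional +/- marker) is inferred from the example output above and may differ in detail:

# Illustrative parser for the aligner output shown above; the exact format
# may vary, so adjust the regular expression to your actual output.
import re

LINE = re.compile(r'^([+-]?)\s*(\S+)(?:\s+\[(\d+):(\d+)\])?\s*$')

def parse_alignment(lines):
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        marker, word, start, end = m.groups()
        if start is None:                 # e.g. "- missing": no timestamps
            yield marker, word, None, None
        else:
            yield marker, word, int(start), int(end)   # times in milliseconds

with open('alignment.txt') as f:          # hypothetical file holding the output
    for marker, word, start, end in parse_alignment(f):
        print(marker or ' ', word, start, end)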

Work on long alignment started during a 2011 GSoC project with a proposal from Bhiksha Raj and James Baker. Apurv Tiwari made a lot of progress on it; however, we were not able to produce a robust alignment algorithm. It still failed on too many cases, and the failures were critical. Finally we changed the algorithm to multipass decoding, and it started to work better and to survive gracefully in the presence of errors in the transcription. Alexander Solovets was responsible for the implementation. The algorithm still doesn’t handle some important tokens like numbers or abbreviations, and the speed needs improvement, but it is already useful, so we can proceed with the next steps of model training. We hope to improve the situation in the near future.

SOA Architecture For Speech Recognition

It’s interesting that as speech recognition becomes widespread, the approach to the architecture of a speech recognition system changes significantly. When only a single application needed speech recognition, it was enough to provide a simple library for the speech recognition functions, like pocketsphinx, and link it into the application. That is still a valid approach for embedded devices and specialized deployments. However, the picture changes significantly when you start to plan a speech recognition framework for the desktop. Many applications require a voice interface, and we need to let all of them interact with the user. Each interaction requires time to load the models into memory and memory to hold them. Since the requirements are quite high, it becomes obvious that the speech recognition service has to be placed in a centralized process. Naturally, the concept of a speech recognition server appears.

Indeed, many speech recognition projects have started to talk about a server:

Simon has been using a common daemon (SimonD), managed over sockets, to provide speech recognition functions.

Rodrigo Parra is implementing a D-Bus-based server for the TamTam Listens project – a speech recognition framework for the Sugar OLPC project. This is very active work in progress; subscribe to the Tumblr blog to get the latest updates.

Andre Natal is talking about a speech recognition server for FirefoxOS in his summer project.

Right now none of these solutions is stable; they are all works in progress. It would be great if such efforts could converge in the future; CMUSphinx could probably be the common denominator here and provide the desktop service for applications looking to implement voice interfaces. A standalone library is certainly still needed, and we shouldn’t focus only on the service architecture, but a service would be a good addition. It could provide common interfaces for applications that just need to register the required commands with the service.
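
As a purely hypothetical sketch of what such a registration interface could look like over D-Bus, the service name, object path, method and signal below are invented for illustration and are not an existing CMUSphinx API:

# Hypothetical only: org.cmusphinx.Recognizer and its members are invented here.
import dbus

bus = dbus.SessionBus()
service = bus.get_object('org.cmusphinx.Recognizer', '/org/cmusphinx/Recognizer')
recognizer = dbus.Interface(service, dbus_interface='org.cmusphinx.Recognizer')

# An application registers only the commands it cares about...
recognizer.RegisterCommands(['open mail', 'next song', 'stop playback'])

# ...and gets notified when one of them is recognized
# (a GLib main loop would be needed to actually deliver the signal).
def on_command(command):
    print('Heard:', command)

bus.add_signal_receiver(on_command,
                        dbus_interface='org.cmusphinx.Recognizer',
                        signal_name='CommandRecognized')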

Of course there is the option to put everything in the cloud, but a cloud solution has its own disadvantages: privacy concerns remain, and data connections are still expensive and slow. There are similar issues with other resource-intensive APIs like text-to-speech, desktop natural language processing, translation and so on, so quite a lot of desktop memory will soon be spent on desktop intelligence. Reserve a few more gigabytes of memory in your systems; it will be taken pretty soon.

OpenEars introduces easy grammars for Pocketsphinx

The proper design of a voice interface is still quite a challenging task. It seems easy to generate a string from the user's speech and act on it, but in practice things are far more complicated. On mobile devices there are no resources to decode arbitrary speech, so sometimes you need to restrict the interaction to a certain domain. Even within a restricted domain, it's not clear how to handle non-straightforward interaction with the user involving repetitions, delays and corrections. Suppose you want to recognize just “left” or “right”: what do you do if the user says “left, hm, no, right”? What if the word “right” is uttered in a context like “you are right” – do you need to react to it as well?

We provide two major ways to describe the user language – grammars and language models. Many people prefer language models because of how easy it is to create them with a web service: you just submit a list of phrases and get the data back. However, this is a slippery road. The issue is that language model generation code usually makes significant assumptions about the distribution of probabilities of unseen n-grams in the target language and computes the probabilities of unseen combinations from those assumptions. For most simple cases the assumptions are wrong. For example, our SimpleLM uses constant backoff with a 0.5 discount ratio, which means some unusual word combinations get nonzero probability. Most of the time that is not what you expect. If you are using language modeling toolkits, please be aware that default smoothing methods like Good-Turing or Kneser-Ney assume you feed them really huge texts; for small data sizes you most likely need a different discounting scheme.
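
To see why this matters, here is a toy illustration with invented numbers of how a constant-backoff model leaks probability mass to a word pair that never appeared in the submitted phrases:

# Toy numbers only: a backed-off bigram estimate is roughly
# backoff_weight(w1) * P(w2), so unseen pairs still get probability.
p_unigram = {'left': 0.25, 'right': 0.25}   # assumed unigram probabilities
backoff_weight = 0.5                        # the constant discount mentioned above

# "right left" was never in the phrase list, yet:
p_right_then_left = backoff_weight * p_unigram['left']
print(p_right_then_left)   # 0.125, nonzero, so the decoder may hypothesize it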

On the other hand, grammars are complex to create online, they are hard to debug, and there are many unseen cases that are hard to cover with a grammar. If you have more than 10 rules in a grammar, I can tell you that you are doing something wrong: you are not accounting for the probabilities of the rules properly, and your grammar is probably suboptimal for efficient recognition. Grammars make sense only for very simple lists of commands. Then comes the issue of the format itself, which should be both human-readable and machine-parseable. We use JSGF grammars, but they require special parsers and are not well supported by automatic tools outside of CMUSphinx. Most of the world uses XML-based grammars like SRGS, but you know how hard it is to edit XML manually. Thankfully, we don’t use XML for everything anymore; there are far more readable formats like JSON. Finally, you probably want to create grammars on the fly from a simple list of strings based on context, without writing any text files to storage.
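
As a purely illustrative sketch of that last point, one could build a small JSGF grammar string on the fly from a plain list of commands; the helper below is hypothetical and not part of any CMUSphinx API:

# Hypothetical helper: turn a list of command phrases into a JSGF grammar
# string in memory, without writing any file to storage.
def commands_to_jsgf(commands, name='commands'):
    alternatives = ' | '.join(commands)
    return ('#JSGF V1.0;\n'
            'grammar {0};\n'
            'public <command> = ( {1} );\n').format(name, alternatives)

print(commands_to_jsgf(['go left', 'go right', 'move forward']))
# #JSGF V1.0;
# grammar commands;
# public <command> = ( go left | go right | move forward );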

It’s amazing to see that OpenEars, a speech recognition toolkit for iOS based on CMUSphinx, has proposed a solution for this issue. The recently released version 1.7 introduces a nice way to create on-the-fly grammars through the API, directly from in-memory data. A grammar looks like this:

@{
     ThisWillBeSaidOnce : @[
         @{ OneOfTheseCanBeSaidOnce : @[@"HELLO COMPUTER", @"GREETINGS ROBOT"]},
         @{ OneOfTheseWillBeSaidOnce : @[@"DO THE FOLLOWING", @"INSTRUCTION"]},
         @{ OneOfTheseWillBeSaidOnce : @[@"GO", @"MOVE"]},
         @{ThisWillBeSaidWithOptionalRepetitions : @[
             @{ OneOfTheseWillBeSaidOnce : @[@"10", @"20",@"30"]},
             @{ OneOfTheseWillBeSaidOnce : @[@"LEFT", @"RIGHT", @"FORWARD"]}
         ]},
         @{ OneOfTheseWillBeSaidOnce : @[@"EXECUTE", @"DO IT"]},
         @{ ThisCanBeSaidOnce : @[@"THANK YOU"]}
     ]
 };

and is defined directly in the code. This method uses native Objective-C primitives for grammar construction, so you don’t need to learn any extra grammar syntax and you don’t need to create any text files. I think this approach will be popular among OpenEars developers, and perhaps one day something similar will be merged into the Pocketsphinx core. The final design is still evolving, but it seems to be a step in the right direction.

Feel free to contact Halle Winkler, the OpenEars author, if you are interested in this new way to define grammars.

Speech projects in GSoC 2014

Google Summer of Code is definitely one of the largest programs in the open source world: 1400 students will enjoy participating in open source projects this summer. Four projects in the pool are dedicated to speech recognition, and it is really amazing that all of them are planning to use CMUSphinx!

Here is the list of hot new projects for you to track and participate in:

Speech to Text Enhancement Engine for Apache Stanbol
Student: Suman Saurabh
Organization: Apache Software Foundation
Assigned mentors: Andreas Kuckartz

The enhancement engine uses the Sphinx library to convert captured audio. A media (audio/video) data file is parsed into a ContentItem and converted to a proper audio format by the Xuggler libraries. The speech is then extracted by Sphinx to ‘plain/text’, annotated with the temporal position of the extracted text. Sphinx uses an acoustic model and a language model to map the utterances to text, so the engine will also support uploading acoustic and language models.

Development of both online and offline speech recognition for B2G and Firefox

Student: Andre Natal
Organization: Mozilla
Assigned mentors: Guilherme Gonçalves
Short description: Mozilla needs to fill the gap between B2G and other mobile OSes, and desktop Firefox also lacks this important feature, which is already available in Google Chrome. In addition, we’ll have a new Web API empowering developers, and every speech recognition application already developed for and running on Chrome will start to work on Firefox without changes. In the future, this can be integrated into other Mozilla products, opening the window to a whole new class of interactive applications.

I know Andre very well; he is a very talented person, so I’m sure this project will be a huge success. By the way, you can track it in the GitHub repository too: https://github.com/andrenatal/speechrtc

Sugar Listens – Speech Recognition within the Sugar Learning Platform

Student: Rodrigo Parra
Organization: Sugar Labs
Assigned mentors: tchx84
Short description: Sugar Listens seeks to provide an easy-to-use speech recognition API to educational content developers within the Sugar Learning Platform. This will allow developers to integrate speech-enabled interfaces into their Sugar Activities, letting users interact with Sugar through voice commands. This goal will be achieved by exposing the open-source speech recognition engine Pocketsphinx as a D-Bus service.

Integrate Plasma Media Center with Simon to make navigation easier

Student: Ashish Madeti
Organization: KDE
Assigned mentors: Peter Grasch, Shantanu Tushar
Short description: Users can currently navigate Plasma Media Center with the keyboard and mouse. Now I will add voice as a way for users to navigate and use PMC. This will be done by integrating PMC with Simon.

I know Simon has a long and successful history of GSoC participation, so this project is also going to be very interesting.

Also, this summer we are going to run a few student internships unrelated to GSoC. That is going to be very interesting too – stay tuned!

Jasper – personal assistant for Raspberry Pi

Personal assistants are hot these days, and an open source personal assistant is a dream for many developers. The recently released Jasper makes it really easy to install a personal assistant on a Raspberry Pi and use it for custom voice commands, information retrieval and so on. Jasper is written in Python and can be extended through its API. More importantly, Jasper uses CMUSphinx for offline speech recognition – a long-awaited capability for assistant developers.
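
For example, a custom command module roughly follows the convention described in Jasper's documentation: a WORDS list plus isValid and handle functions. Treat the sketch below as approximate and check the project docs for the exact contract:

# Approximate sketch of a Jasper command module (WORDS / isValid / handle);
# verify the exact interface against the Jasper documentation.
import re

WORDS = ["LIGHTS"]

def isValid(text):
    # Jasper calls this to decide whether this module should handle the phrase.
    return bool(re.search(r'\blights\b', text, re.IGNORECASE))

def handle(text, mic, profile):
    # 'mic' provides text-to-speech output, 'profile' holds user settings.
    mic.say("Toggling the lights.")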

Try Jasper: download its source code from GitHub, modify it according to your needs and contribute new features.

We wish the Jasper project all the best and hope it will become popular.

Smile – Smart Picture Voice-Annotation using CMUSphinx

It’s great that more and more applications are using CMUSphinx. Recently, a smart picture voice-annotation/tagging Android app that uses speech recognition to automatically generate multiple tag suggestions was announced. Smile uses CMU Sphinx4 as one of its ASR recognizers.

Smile is a camera and picture gallery application that lets you voice-annotate your photos right away. Smile then automatically extracts the text annotation from the recorded voice using speech recognition technology and embeds the text annotation and the voice as metadata inside the picture. So if you are searching for a picture among the hundreds of pictures on your phone, you can find it easily using Smile’s search facility. Moreover, with Smile you can share the photo with friends, with or without the speech and text. Your friends or colleagues can see the picture even if they don’t have Smile; with Smile they can also see the text and listen to the speech. It is great for fun, for quick greeting photos, for photos with a voice record to cherish memories, for documentation purposes, for expense receipts, and many more uses.

To play with Smile, you can download it from the Google Play Store.

Public Release of the OpenDial Toolkit

It is not easy to build intelligent software; the cooperation at all levels of speech processing needs to be tight. For example, you should not only recognize the input but also understand it and, more importantly, respond to it. You need to give your software the ability to understand the current situation, correct the results and respond intelligently. A great piece of software to assist you with that has recently been released.

OpenDial is a Java-based software toolkit that can be used to develop robust and adaptive dialogue systems for various domains. Dialogue understanding, management and generation models are expressed in OpenDial through probabilistic rules encoded in a simple XML format.

You can find more information about the toolkit on the project website:

http://opendial.googlecode.com

The current release contains a completely refactored code base, a large set of unit tests and a full-fledged graphical interface. You can also find on the website practical examples of dialogue domains and step-by-step documentation on how to use the toolkit.

Try it!

Processing Speech Recognition Results With Wit.AI

One of the biggest challenges for developers today is the natural user interface. People already use gestures and speech to interact with their PCs and devices; such natural ways of interacting with technology make it easier to learn how to operate it. The biggest companies, like Microsoft and Intel, are putting a lot of effort into research on natural interaction.

CMUSphinx is a critical component of the open source infrastructure for creating natural user interfaces. However, it is not the only component required to build an application. One of the most frequently asked questions is: how do I analyze speech recognition output to turn it into actionable information? The answer is not simple; again, it is all about complex NLP technology that you can apply to analyze user intent, as well as a dataset to help you with the analysis.

In simple cases you can just parse number strings to turn them into values, or apply regex pattern matching to extract the name of the object to act upon. In Sphinx4 there is support for parsing grammar output to assign semantic values to the user request. In general, though, this is a more complex task.
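
As a quick illustration of the regex approach, where both the pattern and the hypothesis string are made up for this example:

# Minimal sketch of regex-based extraction from a recognition hypothesis.
import re

hypothesis = "set volume to twenty five"          # e.g. output of the decoder

match = re.search(r"set (?P<object>\w+) to (?P<value>.+)", hypothesis)
if match:
    print(match.group('object'), '->', match.group('value'))
    # prints: volume -> twenty five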

Recently, Wit.AI announced the availability of their NLP technology for developers. If you are looking for a simple technology for creating a natural language interface, Wit.AI seems a good thing to try. Today, with a combination of the best engines, like CMUSphinx and Wit, you can finally bring the power of voice to your app.

You can build an NLP analysis engine with Wit.AI in three simple stages:

  1. Provide a few examples of the responses you expect.
  2. Send raw user input to the API and get structured information in return (see the sketch after this list).
  3. Wit learns from usage and helps you improve your configuration.
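
Here is a hedged sketch of step 2, assuming the Wit.AI HTTP message endpoint and a server access token; check the current Wit.AI documentation for the exact parameters and response layout:

# Assumes the https://api.wit.ai/message endpoint with Bearer authentication;
# the token below is a placeholder.
import requests

WIT_TOKEN = 'YOUR_SERVER_ACCESS_TOKEN'

def parse_utterance(text):
    resp = requests.get('https://api.wit.ai/message',
                        params={'q': text},
                        headers={'Authorization': 'Bearer ' + WIT_TOKEN})
    resp.raise_for_status()
    return resp.json()   # structured intent/entities extracted from the text

# e.g. feed it the hypothesis string produced by pocketsphinx
print(parse_utterance('turn the lights off in the kitchen'))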

Bringing natural language understanding to the mass of developers is a really hard problem, and we are glad that tools are appearing to simplify the solution.

Pocketsphinx Python bindings ported from Cython to SWIG

As of today, a large change switching to SWIG-generated Python bindings has been merged into the pocketsphinx and sphinxbase trunk.

SWIG is an interface compiler that connects programs written in C and C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. In addition, SWIG provides a variety of customization features that let you tailor the wrapping process to suit your application.

With this port we hope to increase the coverage of the pocketsphinx bindings and provide a uniform and documented interface in various languages: Python, Ruby, Java.

To test the change, check out sphinxbase and pocketsphinx from trunk and see the examples in pocketsphinx/swig/python/test.