One of the biggest challenges for developers today is building a natural user interface. People already use gesture and speech to interact with their PCs and devices; such natural ways of interacting with technology make it easier to learn how to operate it. Major companies like Microsoft and Intel are putting a lot of effort into research on natural interaction.
CMUSphinx is a critical component of the open source infrastructure for creating natural user interfaces. However, it is not the only component required to build an application. One of the most frequently asked questions is: how do I analyze speech recognition output to turn it into actionable information? The answer is not simple; it requires complex NLP technology to analyze user intent, as well as a dataset to support that analysis.
In simple cases you can just parse number strings to turn them into values, or apply regex pattern matching to extract the name of the object to act upon. Sphinx4 also includes a mechanism for parsing grammar output to assign semantic values to a user request. In general, though, this is a more complex task.
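As a minimal illustration of the regex approach, the sketch below pulls an action and an object name out of a recognized hypothesis. The phrases and the pattern are hypothetical examples, not part of any CMUSphinx API:

import re

# Hypothetical command pattern: "turn on/off the <device>"
PATTERN = re.compile(r"turn (?P<action>on|off) the (?P<device>\w+)")

def parse_hypothesis(hypothesis):
    """Extract an action and a target device from recognized text."""
    match = PATTERN.search(hypothesis)
    if match is None:
        return None
    return {"action": match.group("action"), "device": match.group("device")}

print(parse_hypothesis("please turn on the lights"))
# {'action': 'on', 'device': 'lights'}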
Recently, Wit.AI announced the availability of their NLP technology for developers. If you are looking for a simple way to create a natural language interface, Wit.AI seems to be a good thing to try. With the combination of engines like CMUSphinx and Wit, you can finally bring the power of voice to your app.
You can build an NLP analysis engine with Wit.AI in three simple stages:

1. Provide a few examples of the responses you expect.
2. Send raw user input to the API; you get structured information in return (see the sketch below).
3. Wit learns from usage and helps you improve your configuration.
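As a minimal sketch of stage 2, the snippet below sends raw text to Wit's HTTP message endpoint and prints the structured JSON it returns. The token is a placeholder you would obtain from the Wit.AI console; check Wit's documentation for the current endpoint and parameters:

import json
import urllib.parse
import urllib.request

# Placeholder: obtain a real server access token from the Wit.AI console
WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"

def wit_message(text):
    """Send raw user input to Wit.AI and return the structured JSON response."""
    url = "https://api.wit.ai/message?q=" + urllib.parse.quote(text)
    request = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + WIT_TOKEN})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

print(wit_message("turn on the lights in the kitchen"))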
Bringing natural language understanding to the mass of developers is a really hard problem, and we are glad to see tools appearing that simplify the solution.
As of today, a large change introducing SWIG-generated Python bindings has been merged into the pocketsphinx and sphinxbase trunk.
SWIG is an interface compiler that connects programs written in C and C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. In addition, SWIG provides a variety of customization features that let you tailor the wrapping process to suit your application.
With this port we hope to increase the coverage of the pocketsphinx bindings and provide a uniform, documented interface across various languages: Python, Ruby, and Java.
To test the change, check out sphinxbase and pocketsphinx from trunk and see the examples in pocketsphinx/swig/python/test.
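As a rough sketch of what decoding with the SWIG-generated Python bindings looks like, the snippet below decodes a raw audio file. The model paths are placeholders; the examples in the test directory are the authoritative reference:

from pocketsphinx import Decoder

# Placeholder model paths: substitute the acoustic model, language model,
# and dictionary shipped with your pocketsphinx installation
config = Decoder.default_config()
config.set_string('-hmm', '/path/to/acoustic/model')
config.set_string('-lm', '/path/to/language/model.DMP')
config.set_string('-dict', '/path/to/dictionary.dic')
decoder = Decoder(config)

# Decode a 16 kHz, 16-bit mono raw audio file
with open('goforward.raw', 'rb') as audio:
    decoder.decode_raw(audio)

print('Recognized:', decoder.hyp().hypstr)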
It is an old idea to implement an open source dictation tool everyone could use: no servers, no networking, no need to share your private speech with anyone else. This is certainly not a trivial project, and it has been started many times, but it is something really world-changing. Now it is live again, powered by CMUSphinx.
It has long been a dream to voice-enable websites. However, no good technology existed for this, because speech recognition on the web either required a connection to a server or required installing a binary plugin.
The great news is that you can now use CMUSphinx in any modern browser, completely on the client side. No installation, no need to maintain a voice recognition server farm. This is a really cool technology.
Presently, the Ubuntu developers are working on strengthening Ubuntu for phones (Ubuntu Touch), a development in which speech recognition will probably play a relevant role.
Among the use cases, speech recognition was demoed (at a mockup level) as part of HUD 2.0, allowing the user to trigger commands by pressing a button and speaking into the phone's microphone; the spoken command is translated and applied just like a regular command.
Today's great news is that PocketSphinx has just landed in Ubuntu 13.10 by default, moving from its previous universe availability (via the Ubuntu Software Center) directly into main (landing via the regular updates). This means PocketSphinx is to be used in the upcoming Unity 8 release on the desktop, probably to let users access the full spectrum of Unity 8's features.
It's definitely just the beginning of the work, but it's really great to see CMUSphinx on its way to the desktop. There will certainly be many problems along the way, since a proper implementation of a speech recognition system is not a trivial task and needs certain expertise. Your help is needed here; otherwise all the issues will be blamed on PocketSphinx.
We are happy to announce that CMUSphinx-powered speech recognition has come to the Amazon Kindle. "Vague", or Voice Activated GUi Extension, was recently introduced and is already available for your Kindle with KUAL, a unified launcher.
Vague screenshot on KUAL
If you have an old Kindle sitting around, or you just want to get a little more out of the one you use every day, jailbreaking is simple. Once you jailbreak, KUAL is a worthwhile little application launcher that gives you easy access to what you download.
KUAL works with pretty much every single Kindle model. Once it's installed, you can run the program and you're given a simple, easy-to-use launcher to access everything on your Kindle: games, VNC clients, apps, and plenty more. It's a nifty little launcher, and it works on pretty much every Kindle out there.
Vague allows you to navigate through your book reader and launch various tools, and, more importantly, it was designed with extensibility in mind. That means you can easily add your own commands with just a simple script!
Recently, a new version of OpenEars was announced. The main feature of the new 1.3.0 release is an upgrade to the latest CMUSphinx codebase, pocketsphinx-0.8. This upgrade should bring additional stability and performance, so you are welcome to try it!
OpenEars is the most popular free offline speech recognition and text-to-speech framework on iOS, and the basis for the OpenEars Platform, a plugin system that lets you drag-and-drop new speech capabilities into your iOS app.
If you are interested in examples of applications built with CMUSphinx and the OpenEars framework, please visit this cool project. Photo editing can be a challenging task, and it becomes even more difficult on small, portable screens such as camera phones, which are now frequently used to edit images. To address this problem, PixelTone, a multimodal photo editing interface that combines speech and direct manipulation, was created:
This truly creative application demonstrates how a powerful multimodal framework can be built with CMUSphinx. Your application could be the next voice-enabled one!
PocketSphinx is a great alternative to closed-source vendor SDKs due to its open source nature, extensibility, and features. If you are looking to implement a speech application on Android, feel free to try PocketSphinx. To get started, you can look at existing applications like Inimesed.
It's a great application for selecting contacts by voice, and you can install it on your device with a single click.
The sources and related materials are available on GitHub. Many thanks to Kaarel Kaljurand for his great software!
If you know some other applications using CMUSphinx, feel free to share!
Modern speech recognition algorithms require an enormous amount of data to estimate speech parameters. Audio recordings, transcriptions, texts for the language model, pronunciation dictionaries, and vocabularies are all collected by speech developers. While this may not be the case in the future, and better algorithms might require just a few examples, today you need to process thousands of hours of recordings to build a speech recognition system.
Estimates show that a human hears thousands of hours of speech before learning to understand it, and note that humans also have prior knowledge structures embedded in the brain that we are not fully aware of. Google trains its models on 100 thousand hours of audio recordings and petabytes of transcriptions, yet it is still behind human performance in speech recognition tasks. For search queries Google's word error rate is still around 10%; for YouTube it is over 40%.
While Google has vast resources, so do we. We can definitely collect, process, and share even more data than Google has. The first step in this direction is to create shared storage for the audio data and CMUSphinx models.
We created a torrent tracker specifically to distribute legal speech data related to CMUSphinx, speech recognition, speech technologies, and natural language processing. Thanks to Elias Majic, the tracker is available at
Currently the tracker contains torrents for the existing acoustic and language models, but new, more accurate models for US English and other languages will be released soon.
We encourage you to make other speech-related data available through our tracker. Please contact the firstname.lastname@example.org mailing list if you want to add your data set to the tracker.
Please help us distribute the data: start a client on your host and make the data available to others.
To learn more about BitTorrent, visit this link or search the web; there is a vast amount of resources about it.
You might wonder what the next step is. Pretty soon we will be able to run a distributed acoustic model training system that trains acoustic models using vast amounts of distributed data and computing power. With a BOINC-style grid computation network of CMUSphinx tools, together we will create the most accurate models for speech. Stay tuned.