Speech projects in GSoC 2014

Google Summer of Code is definitely one of the largest programs in the open source world: about 1400 students will participate in open source projects this summer. Four projects in the pool are dedicated to speech recognition, and it is really amazing that all of them are planning to use CMUSphinx!

Here is the list of hot new projects for you to track and participate in:

Speech to Text Enhancement Engine for Apache Stanbol
Student: Suman Saurabh
Organization: Apache Software Foundation
Assigned mentors: Andreas Kuckartz

The enhancement engine uses the Sphinx library to convert the captured audio. The media (audio/video) file is parsed from the ContentItem and converted to a proper audio format with the Xuggler libraries. Speech is then extracted by Sphinx to 'text/plain', annotated with the temporal position of the extracted text. Sphinx uses an acoustic model and a language model to map utterances to text, so the engine will also support uploading custom acoustic and language models.
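
For illustration, here is a rough sketch of extracting text together with word-level time positions, using the pocketsphinx Python bindings (the Stanbol engine itself builds on the Java Sphinx libraries, and the model paths below are placeholders):

    from pocketsphinx import Decoder

    # Placeholder model paths; the engine lets you upload your own models.
    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/acoustic-model')
    config.set_string('-lm', '/path/to/language-model.lm')
    config.set_string('-dict', '/path/to/pronunciation.dict')
    decoder = Decoder(config)

    decoder.start_utt()
    with open('speech.raw', 'rb') as audio:  # 16 kHz, 16-bit mono PCM
        decoder.process_raw(audio.read(), False, True)
    decoder.end_utt()

    # Each segment carries a recognized word and its frame range; at the
    # default 100 frames per second, frame / 100.0 gives seconds.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)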

Development of both online and offline speech recognition for B2G and Firefox

Student: Andre Natal
Organization: Mozilla
Assigned mentors: Guilherme Gonçalves
Short description: Mozilla needs to fill the gap between B2G and other mobile OSes, and desktop Firefox also lacks this important feature, which is already available in Google Chrome. In addition, we'll have a new Web API empowering developers, and every speech recognition application already developed and running on Chrome will start to work on Firefox without changes. In the future, this can be integrated into other Mozilla products, opening the door to a whole new class of interactive applications.

I know Andre very well; he is a very talented person, so I'm sure this project will be a huge success. By the way, you can track it in its GitHub repository too: https://github.com/andrenatal/speechrtc

Sugar Listens – Speech Recognition within the Sugar Learning Platform

Student: Rodrigo Parra
Organization: Sugar Labs
Assigned mentors: tchx84
Short description: Sugar Listens seeks to provide an easy-to-use speech recognition API to educational content developers within the Sugar Learning Platform. This will allow developers to integrate speech-enabled interfaces into their Sugar Activities, letting users interact with Sugar through voice commands. This goal will be achieved by exposing the open-source speech recognition engine Pocketsphinx as a D-Bus service.
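
As a rough illustration of that architecture, a Python process can own a name on the session bus and broadcast recognition results as a D-Bus signal. The bus name, object path and signal below are hypothetical placeholders, not Sugar Listens' actual interface:

    import dbus
    import dbus.service
    from dbus.mainloop.glib import DBusGMainLoop
    from gi.repository import GLib

    class SpeechService(dbus.service.Object):
        def __init__(self, bus):
            # Hypothetical object path, for illustration only.
            dbus.service.Object.__init__(self, bus, '/org/example/SpeechRecognizer')

        @dbus.service.signal('org.example.SpeechRecognizer', signature='s')
        def Recognized(self, text):
            # Emitted with the decoded hypothesis after each utterance;
            # any activity on the session bus can subscribe to it.
            pass

    DBusGMainLoop(set_as_default=True)
    bus = dbus.SessionBus()
    name = dbus.service.BusName('org.example.SpeechRecognizer', bus)
    service = SpeechService(bus)
    # A real service would feed microphone audio to Pocketsphinx here and
    # call service.Recognized(hypothesis) for every decoded utterance.
    GLib.MainLoop().run()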

Integrate Plasma Media Center with Simon to make navigation easier

Student: Ashish Madeti
Organization: KDE
Assigned mentors: Peter Grasch, Shantanu Tushar
Short description: Users can currently navigate Plasma Media Center with the keyboard and mouse. Now, I will add voice as a way for a user to navigate and use PMC. This will be done by integrating PMC with Simon.

I know Simon has a long and successful history of GSoC participation, so this project is also going to be very interesting.

Also, this summer we are going to run a few student internships unrelated to GSoC. They are going to be very interesting too, so stay tuned!

Jasper – personal assistant for the Raspberry Pi

Personal assistants are hot these days, and an open source personal assistant is a dream for many developers. The recently released Jasper makes it really easy to install a personal assistant on the Raspberry Pi and use it for custom voice commands, information retrieval and so on. Jasper is written in Python and can be extended through its API. More importantly, Jasper uses CMUSphinx for offline speech recognition, a long-awaited capability for assistant developers.
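
For instance, a custom command is just a Python module that declares the keywords it listens for plus two callbacks. This is a minimal sketch following the module convention from Jasper's documentation (details may differ between versions):

    # time_module.py -- a minimal custom Jasper module sketch
    import re
    from datetime import datetime

    # Keywords compiled into the recognizer's vocabulary for this module.
    WORDS = ["TIME"]

    def isValid(text):
        # Decide whether this module should handle the transcription.
        return bool(re.search(r'\btime\b', text, re.IGNORECASE))

    def handle(text, mic, profile):
        # Respond through the mic/TTS object that Jasper passes in.
        now = datetime.now().strftime("%I:%M %p")
        mic.say("It is %s right now." % now)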

Try Jasper: download its source code from GitHub, modify it according to your needs, and contribute new features.

We wish all the best to the Jasper project and hope it will become popular.

Smile – Smart Picture Voice-Annotation using CMUSphinx

It's great that more and more applications are using CMUSphinx. Recently a smart picture voice-annotation/tagging Android app that uses speech recognition to automatically generate multiple tag suggestions was announced. Smile uses CMU Sphinx4 as one of its ASR recognizers.

Smile is a camera and picture gallery application that helps you immediately voice-annotate your photos. Smile will then automatically extract the text (annotation) from the recorded voice using speech recognition technology, and will embed the text annotation and the voice as metadata inside the picture. So if you are searching for a picture among the hundreds of pictures you have on your phone, you can find it easily using Smile's search facility. Even more, with Smile you can share a photo with friends, with or without the speech and text. Your friends or colleagues can see the picture even if they don't have Smile; with Smile they can also see the text and listen to the speech. Great for fun and quick greeting photos, for cherishing memories, for documentation purposes, expense receipts, and many more uses.

To play with Smile, you can download it from the Google Play Store.

Public Release of the OpenDial Toolkit

It is not easy to build intelligent software; cooperation on all levels of speech processing needs to be tight. For example, you should not only recognize the input but also understand it and, more importantly, respond to it. You need to give your software the ability to understand the current situation, correct the results and respond intelligently. A great piece of software to assist you with that has recently been released.

OpenDial is a Java-based software toolkit that can be used to develop robust and adaptive dialogue systems for various domains. Dialogue understanding, management and generation models are expressed in OpenDial through probabilistic rules encoded in a simple XML format.

You can find more information about the toolkit on the project website:

http://opendial.googlecode.com

The current release contains a completely refactored code base, a large set of unit tests and a full-fledged graphical interface. You can also find on the website practical examples of dialogue domains and step-by-step documentation on how to use the toolkit.

Try it!

Processing Speech Recognition Results With Wit.AI

The biggest challenge for developers today is the natural user interface. People already use gestures and speech to interact with their PCs and devices; such natural ways to interact with technologies make them easier to learn and operate. Big companies like Microsoft and Intel are putting a lot of effort into research on natural interaction.

CMUSphinx is a critical component of the open source infrastructure for creating natural user interfaces. However, it is not the only component required to build an application. One of the most frequently asked questions is: how do I analyze speech recognition output to turn it into actionable information? The answer is not simple; again, it is all about complex NLP technology, which you can apply to analyze user intent, as well as a dataset to help you with the analysis.

In simple cases you can just parse number strings to turn them into values, or apply regex pattern matching to extract the name of the object to act upon. Sphinx4 includes a mechanism that can parse grammar output to assign semantic values in a user request. In general, though, this is a more complex task.
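
For example, a hypothetical command grammar like "turn on/off the <object>" can be handled with a single pattern:

    import re

    # Illustrative pattern for a simple command grammar.
    hypothesis = "turn on the kitchen light"
    match = re.match(r"turn (?P<action>on|off) the (?P<object>.+)", hypothesis)
    if match:
        print(match.group("action"))  # -> "on"
        print(match.group("object"))  # -> "kitchen light"

Beyond such toy patterns, though, you quickly need statistical models and training data.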

Recently, Wit.AI announced the availability of their NLP technology for developers. If you are looking for a simple technology to create a natural language interface, Wit.AI seems to be a good thing to try. Today, with the combination of the best engines like CMUSphinx and Wit, you can finally bring the power of voice to your app.

You can build an NLP analysis engine with Wit.AI in three simple stages:

  1. Provide a few examples of the responses you expect.
  2. Send raw user input to the API and get structured information in return (see the sketch after this list).
  3. Wit learns from usage and helps you improve your configuration.
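
In practice, stage 2 is a single HTTP request once CMUSphinx has produced a transcription. The sketch below uses the requests library and a hypothetical access token; check Wit.AI's documentation for the exact response format:

    import requests

    # Hypothetical token; you receive a real one when registering a Wit.AI app.
    WIT_ACCESS_TOKEN = "YOUR_WIT_ACCESS_TOKEN"

    def interpret(text):
        # Send the raw recognizer output to Wit.AI and return its parse.
        response = requests.get(
            "https://api.wit.ai/message",
            params={"q": text},
            headers={"Authorization": "Bearer " + WIT_ACCESS_TOKEN},
        )
        response.raise_for_status()
        return response.json()  # structured intent and entities

    # e.g. feed in the hypothesis produced by a CMUSphinx decoder:
    print(interpret("wake me up at seven tomorrow"))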

Bringing natural language understanding to the masses of developers is a really hard problem, and we are glad that tools are appearing to simplify the solution.

Pocketsphinx Python bindings ported from Cython to SWIG

As of today, a large change switching to SWIG-generated Python bindings has been merged into the pocketsphinx and sphinxbase trunk.

SWIG is an interface compiler that connects programs written in C and C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. In addition, SWIG provides a variety of customization features that let you tailor the wrapping process to suit your application.

With this port we hope to increase the coverage of the pocketsphinx bindings and provide a uniform, documented interface in various languages: Python, Ruby and Java.

To test the change, check out sphinxbase and pocketsphinx from trunk and see the examples in pocketsphinx/swig/python/test.
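
A minimal decoding session with the new bindings looks roughly like the sketch below (model and audio paths are placeholders; see the test directory for complete, up-to-date examples):

    from pocketsphinx import Decoder

    # Placeholder model paths; point these at your installed models.
    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/en-us')
    config.set_string('-lm', '/path/to/en-us.lm')
    config.set_string('-dict', '/path/to/cmudict-en-us.dict')
    decoder = Decoder(config)

    # Stream raw 16 kHz, 16-bit mono PCM audio to the decoder in chunks.
    decoder.start_utt()
    with open('goforward.raw', 'rb') as stream:
        while True:
            buf = stream.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    print('Hypothesis:', decoder.hyp().hypstr)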

Open Source Dictation is Coming

It is an old idea to implement an open source dictation tool everyone could use: no servers, no networking, no need to share your private speech with someone else. This is certainly not a trivial project, and it has been started many times, but it is something really world-changing. Now it's live again, powered by CMUSphinx.

Read the details about the ongoing effort of the Simon project to implement open source dictation.

Voice-enable Your Website With CMUSphinx

It has been a long dream to voice-enable websites. However, no good technology existed for this: speech recognition on the web either required a connection to a server or the installation of a binary plugin.

The great news is that you can now use CMUSphinx in any modern browser, completely on the client side. There is no need for installation and no need to maintain a voice recognition server farm. This is a really cool technology.

Sylvain Chevalier has been working on a port of Pocketsphinx to JavaScript using Emscripten. Combined with the Web Audio API, it works great as a real-time recognizer for web applications, running entirely in the browser, without plug-ins.

It's on GitHub (https://github.com/syl22-00/pocketsphinx.js); comments, suggestions and contributions are more than welcome!

Pocketsphinx will be used in Ubuntu Unity 8

Months ago, Mark Shuttleworth announced and explained Ubuntu's convergence vision, in which a single OS is to power phones, tablets, desktops, TVs, etc.

Presently, the Ubuntu developers are working on strengthening Ubuntu for phones (Ubuntu Touch), a development in which speech recognition will probably play a relevant role.

Among the use cases, speech recognition was demoed (at a mockup level) as part of HUD 2.0, basically allowing the user to trigger commands by pressing a button and speaking into the phone's microphone; the spoken command is then translated and applied like a regular command.

Today's great news is that PocketSphinx has just landed in Ubuntu 13.10 by default, shifted from its previous universe availability (via the Ubuntu Software Center) directly into main (landing via the regular updates). This means PocketSphinx is to be used in the upcoming Unity 8 release on the desktop, probably to let users fully grasp Unity 8's features via a full spectrum of functionality.

It's definitely just the beginning of the work, but it's really great to see CMUSphinx on its way to the desktop. There will definitely be many problems along the way, since a proper implementation of a speech recognition system is not a trivial task and needs certain expertise. Your help is needed here; otherwise all the issues will be assigned to Pocketsphinx.

Speech recognition on Kindle Touch with CMUSphinx

We are happy to announce that CMUSphinx-powered speech recognition has come to the Amazon Kindle. "Vague", or Voice Activated GUI Extension, was recently introduced and is already available for your Kindle through KUAL, a unified launcher.

If you have an old Kindle sitting around, or you just want to get a little more out of the one you use every day, jailbreaking is simple. Once you jailbreak, KUAL is a worthwhile little application launcher that gives you easy access to what you download.

KUAL works with pretty much every Kindle model. Once it's installed, you can run the program and you're given a simple, easy-to-use launcher to access everything on your Kindle. That means games, VNC clients, apps, and plenty more. It's a nifty little launcher, and the fact that it works on pretty much every Kindle out there makes it simple to use.

Vague allows you to navigate through your book reader and launch various tools; more importantly, it was designed to be highly extensible. That means you can add your own commands easily with just a simple script!

Great job!
