Using PocketSphinx with GStreamer and Python (or Vala)

A very recent and useful addition to PocketSphinx (currently in SVN only; it will be released with version 0.5) is support for the GStreamer streaming media framework. What this means is that the PocketSphinx decoder can be treated as an element in a media processing pipeline, specifically, one which filters audio into text.

This is useful because it simplifies a lot of the work involved in writing interactive speech applications. You don't need to handle any of the difficulties of dealing with audio input. Also, perhaps more importantly, it means that speech input can be integrated into a GUI application using GTK+ or Maemo without agonizing pain.

Overview

This document will walk you through the steps of building a simple demo using GTK+, GStreamer, and Python. The only thing this program will do is recognize speech and display it in a text area, but it should give you the tools you need to do more interesting things. The instructions are for Python, but if Vala is more your thing, there is a port of the complete code listing to Vala at the bottom.

Before starting, you should have an up-to-date version of PocketSphinx from the Subversion repository, as well as GStreamer, GTK+, and the Python modules for both of them. You should make sure that GStreamer is able to find the PocketSphinx plugin. If you have installed PocketSphinx in /usr/local, then you may have to set the following environment variable:

export GST_PLUGIN_PATH=/usr/local/lib/gstreamer-0.10

If you have installed PocketSphinx under a different prefix, you will also need to set the LD_LIBRARY_PATH variable. If your installation prefix is $psprefix, you would set these variables:

export LD_LIBRARY_PATH=$psprefix/lib
export GST_PLUGIN_PATH=$psprefix/lib/gstreamer-0.10

To verify that GStreamer can find the plugin, run gst-inspect pocketsphinx. You should get a large amount of output, ending with something like this:

Element Signals:
  "partial-result" :  void user_function (GstElement* object,
                                          gchararray arg0,
                                          gchararray arg1,
                                          gpointer user_data);
  "result" :  void user_function (GstElement* object,
                                  gchararray arg0,
                                  gchararray arg1,
                                  gpointer user_data);

Background Reading

Before you work through this document, it would be a good idea to read the introductory GStreamer application development documentation and a PyGTK tutorial, since we only cover the speech-specific parts here.

Of course, we also assume that you know Python.

Skeleton of a simple GUI program

Our simple demo program will just consist of a window, a text box, and a button which the user can push to start and stop speech recognition.

We will start by creating a Python class representing our demo application:

import pygtk
pygtk.require('2.0')
import gtk

class DemoApp(object):
    """GStreamer/PocketSphinx Demo Application"""
    def __init__(self):
        """Initialize a DemoApp object"""
        self.init_gui()
        self.init_gst()

    def init_gui(self):
        """Initialize the GUI components"""

    def init_gst(self):
        """Initialize the speech components"""

app = DemoApp()
gtk.main()

Now let's fill in the init_gui method. We are going to create a window with a gtk.VBox in it, holding a gtk.TextView (with associated gtk.TextBuffer) and a gtk.ToggleButton.

    def init_gui(self):
        """Initialize the GUI components"""
        self.window = gtk.Window()
        self.window.connect("delete-event", gtk.main_quit)
        self.window.set_default_size(400,200)
        self.window.set_border_width(10)
        vbox = gtk.VBox()
        self.textbuf = gtk.TextBuffer()
        self.text = gtk.TextView(self.textbuf)
        self.text.set_wrap_mode(gtk.WRAP_WORD)
        vbox.pack_start(self.text)
        self.button = gtk.ToggleButton("Speak")
        self.button.connect('clicked', self.button_clicked)
        vbox.pack_start(self.button, False, False, 5)
        self.window.add(vbox)
        self.window.show_all()

This gives us a nice text box and a button that does nothing. Now, we want to introduce GStreamer and PocketSphinx to our program, so that clicking the button makes it recognize speech and insert it into the textbox at the current location.

Adding GStreamer to the program

To use GStreamer, we need to add a few more lines of imports to the top of the program:

import gobject
import pygst
pygst.require('0.10')
gobject.threads_init() # This is very important!
import gst

Now, we will fill in the init_gst method, which initializes a GStreamer pipeline that will do speech recognition for us.

Creating the pipeline

For simplicity we are creating it using the gst.parse_launch function, which reads a textual description and creates a pipeline from it. If you are not running GNOME, you may need to change gconfaudiosrc to a different source element, such as alsasrc or osssrc.

        self.pipeline = gst.parse_launch('gconfaudiosrc ! audioconvert ! audioresample '
                                         + '! vader name=vad auto-threshold=true '
                                         + '! pocketsphinx name=asr ! fakesink')

This pipeline consists of an audio source, followed by conversion and resampling (the pocketsphinx element currently requires 8kHz, 16-bit PCM audio), then voice activity detection and recognition. The fakesink element discards the output of the speech recognition element (more on this below).
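Before wiring the pipeline into the GUI, you can sanity-check it from a shell with gst-launch. This mainly verifies that the plugin loads and that audio flows; recognition results are delivered through the element's signals, so don't expect text on stdout:

gst-launch gconfaudiosrc ! audioconvert ! audioresample \
    ! vader name=vad auto-threshold=true ! pocketsphinx ! fakesink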

Running the pipeline

We are going to set up our program so that the pipeline starts out being paused, and is set to play (i.e. start recognition) when the user presses the “Speak” button. Once a final recognition result is retrieved, we will put it back in paused mode and wait for another button press. If the user presses the button in the middle of recognition, we force the current utterance to end by setting the vader element's silent property; the pipeline is then paused once the final result arrives (see the application_message handler below). We will implement this using the button_clicked method which was connected to the button's signal in init_gui:

    def button_clicked(self, button):
        """Handle button presses."""
        if button.get_active():
            button.set_label("Stop")
            self.pipeline.set_state(gst.STATE_PLAYING)
        else:
            button.set_label("Speak")
            vader = self.pipeline.get_by_name('vad')
            vader.set_property('silent', True)

The 'pocketsphinx' element

The pocketsphinx element functions as a filter - it takes audio data as input and produces text as output. This makes sense in the GStreamer framework, where data flows from a source to a sink, and is potentially useful for captioning or other multimedia applications. However, because automatic speech recognition inherently operates on arbitrarily large chunks of data (“utterances”), and because the recognition result for an utterance can change as more data becomes available, this simple streaming data flow is not suitable for most ASR applications.

In practice, we want to treat the speech recognizer more like an input device in GUI programming, which emits a stream of events that can be subscribed to and interpreted by a controller object. Fortunately, GStreamer allows us to do (almost) exactly this, using the same mechanism used for GTK+ widgets. So, we are going to connect methods in DemoApp to the signals emitted by the pocketsphinx element:

        asr = self.pipeline.get_by_name('asr')
        asr.connect('partial_result', self.asr_partial_result)
        asr.connect('result', self.asr_result)

Now, the methods asr_partial_result and asr_result will be called whenever a partial or complete utterance is decoded.

The pocketsphinx element has a number of properties which can be used to select different acoustic and language models and to tune the performance of the recognizer. See PocketSphinxElementProperties for more information, or run gst-inspect pocketsphinx to get the embedded documentation.
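If you would rather enumerate these properties from Python, the static gobject bindings used throughout this tutorial can do it. A quick sketch, assuming the asr element obtained above:

for spec in gobject.list_properties(asr):
    print spec.name, '-', spec.blurb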

Most importantly, you will need to set the language model property (lm) in order to do anything useful. The GStreamer element defaults to a “toy” language model designed for commanding wheeled robots, so it will only recognize things like “go forward three meters” and “turn left”. We show how to point lm (and the matching dict) at your own models in the Improving Accuracy section below.

Since initializing the speech recognizer takes some time, there is a “magical” property which allows you to force it to be initialized before you actually start feeding it data. We will use this here:

        asr.set_property('configured', True)

The 'vader' element

How does the pocketsphinx element know when an utterance starts and ends? This job is done by the Voice Activity DEtectoR (“VADER”) element, which inserts events into the audio stream marking utterance start and end points.

This element doesn't really have many options, but you'll notice that we have set auto-threshold=true in the pipeline above. This will use the first few frames of audio to determine the background noise level, which can be important if you have a lousy soundcard or a far-field microphone. As above, run gst-inspect vader to get the embedded documentation.

You can also force a transition to silence (or to speech) by setting the silent property explicitly.
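For example, given the pipeline above, forcing an utterance boundary from application code looks like this (it is the same trick our button_clicked handler uses to stop recognition):

        vader = self.pipeline.get_by_name('vad')
        vader.set_property('silent', True)   # force a transition to silence
        # ... and setting it back to False forces a transition to speech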

Riding the bus

Now, the slightly complicated part. Although we can connect methods to the signals emitted by the pocketsphinx element, we are quite limited in what we can actually do within those handlers. This is because GStreamer may run elements in a separate thread from the rest of the program, and it is not possible to do any GUI operations from outside the main thread. Therefore, we need a way to forward these messages to the main thread where they can be dealt with inside the GUI event loop.

The “pipeline” object in GStreamer provides us with a convenient way to do this, using a “message bus”. This is simply a thread-safe event queue which allows you to pass messages out of element context to the main program. As you might expect, the bus can emit signals when a message is received, and you can connect methods to these signals. There is a specific type of message called an “application message” which is reserved for use by applications to do exactly what we want to do. You can connect a signal handler to this specific type of message by connecting to the message::application signal on the bus after calling bus.add_signal_watch():

        bus = self.pipeline.get_bus()
        bus.add_signal_watch()
        bus.connect('message::application', self.application_message)

Now, we will write the asr_partial_result and asr_result methods mentioned above, and have them post messages on the bus:

    def asr_partial_result(self, asr, text, uttid):
        """Forward partial result signals on the bus to the main thread."""
        struct = gst.Structure('partial_result')
        struct.set_value('hyp', text)
        struct.set_value('uttid', uttid)
        asr.post_message(gst.message_new_application(asr, struct))

    def asr_result(self, asr, text, uttid):
        """Forward result signals on the bus to the main thread."""
        struct = gst.Structure('result')
        struct.set_value('hyp', text)
        struct.set_value('uttid', uttid)
        asr.post_message(gst.message_new_application(asr, struct))

Having done this, we now need to implement the application_message method which will handle these messages. We can use the string passed to the gst.Structure constructor to demultiplex these messages and call the appropriate methods to update the GUI. Note that when we receive a final result, we set the pipeline's state to paused and reset the button to its original state. The button's label will be updated automatically in the button_clicked method.

    def application_message(self, bus, msg):
        """Receive application messages from the bus."""
        msgtype = msg.structure.get_name()
        if msgtype == 'partial_result':
            self.partial_result(msg.structure['hyp'], msg.structure['uttid'])
        elif msgtype == 'result':
            self.final_result(msg.structure['hyp'], msg.structure['uttid'])
            self.pipeline.set_state(gst.STATE_PAUSED)
            self.button.set_active(False)

Updating the text buffer

Finally, we need to implement the methods which update the text buffer with the current partial and final recognition results. There isn't much of interest here in relation to GStreamer and PocketSphinx.

    def partial_result(self, hyp, uttid):
        """Delete any previous selection, insert text and select it."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        ins = self.textbuf.get_insert()
        iter = self.textbuf.get_iter_at_mark(ins)
        iter.backward_chars(len(hyp))
        self.textbuf.move_mark(ins, iter)
        self.textbuf.end_user_action()

    def final_result(self, hyp, uttid):
        """Insert the final result."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        self.textbuf.end_user_action()

Improving Accuracy

You will notice that the accuracy of the speech recognition is pretty bad. In particular, the first utterance is basically never recognized correctly, and the system makes weird (and occasionally amusing) substitutions in subsequent utterances.

These are actually two separate problems. We're working on the first one (it relates to CepstralMeanNormalization, for those who know what that is). Unless you have a particularly difficult accent, the second one is largely a matter of LanguageModeling. If you have a limited set of sentences that you want to recognize, you can follow the directions in LanguageModelHowto to create an improved language model.

To use your improved language model with GStreamer, you just have to set the lm and dict properties on the pocketsphinx element. So, if your language model is in /home/user/mylanguagemodel.lm and the associated dictionary is /home/user/mylanguagemodel.dic, you would add these lines to the init_gst method (before the configured property is set):

        asr.set_property('lm', '/home/user/mylanguagemodel.lm')
        asr.set_property('dict', '/home/user/mylanguagemodel.dic')

Code Listing

You can download the Python code for this example. A complete code listing follows:

#!/usr/bin/env python

# Copyright (c) 2008 Carnegie Mellon University.
#
# You may modify and redistribute this file under the same terms as
# the CMU Sphinx system.  See
# http://cmusphinx.sourceforge.net/html/LICENSE for more information.

import pygtk
pygtk.require('2.0')
import gtk

import gobject
import pygst
pygst.require('0.10')
gobject.threads_init()
import gst

class DemoApp(object):
    """GStreamer/PocketSphinx Demo Application"""
    def __init__(self):
        """Initialize a DemoApp object"""
        self.init_gui()
        self.init_gst()

    def init_gui(self):
        """Initialize the GUI components"""
        self.window = gtk.Window()
        self.window.connect("delete-event", gtk.main_quit)
        self.window.set_default_size(400,200)
        self.window.set_border_width(10)
        vbox = gtk.VBox()
        self.textbuf = gtk.TextBuffer()
        self.text = gtk.TextView(self.textbuf)
        self.text.set_wrap_mode(gtk.WRAP_WORD)
        vbox.pack_start(self.text)
        self.button = gtk.ToggleButton("Speak")
        self.button.connect('clicked', self.button_clicked)
        vbox.pack_start(self.button, False, False, 5)
        self.window.add(vbox)
        self.window.show_all()

    def init_gst(self):
        """Initialize the speech components"""
        self.pipeline = gst.parse_launch('gconfaudiosrc ! audioconvert ! audioresample '
                                         + '! vader name=vad auto-threshold=true '
                                         + '! pocketsphinx name=asr ! fakesink')
        asr = self.pipeline.get_by_name('asr')
        asr.connect('partial_result', self.asr_partial_result)
        asr.connect('result', self.asr_result)
        asr.set_property('configured', True)

        bus = self.pipeline.get_bus()
        bus.add_signal_watch()
        bus.connect('message::application', self.application_message)

        self.pipeline.set_state(gst.STATE_PAUSED)

    def asr_partial_result(self, asr, text, uttid):
        """Forward partial result signals on the bus to the main thread."""
        struct = gst.Structure('partial_result')
        struct.set_value('hyp', text)
        struct.set_value('uttid', uttid)
        asr.post_message(gst.message_new_application(asr, struct))

    def asr_result(self, asr, text, uttid):
        """Forward result signals on the bus to the main thread."""
        struct = gst.Structure('result')
        struct.set_value('hyp', text)
        struct.set_value('uttid', uttid)
        asr.post_message(gst.message_new_application(asr, struct))

    def application_message(self, bus, msg):
        """Receive application messages from the bus."""
        msgtype = msg.structure.get_name()
        if msgtype == 'partial_result':
            self.partial_result(msg.structure['hyp'], msg.structure['uttid'])
        elif msgtype == 'result':
            self.final_result(msg.structure['hyp'], msg.structure['uttid'])
            self.pipeline.set_state(gst.STATE_PAUSED)
            self.button.set_active(False)

    def partial_result(self, hyp, uttid):
        """Delete any previous selection, insert text and select it."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        ins = self.textbuf.get_insert()
        iter = self.textbuf.get_iter_at_mark(ins)
        iter.backward_chars(len(hyp))
        self.textbuf.move_mark(ins, iter)
        self.textbuf.end_user_action()

    def final_result(self, hyp, uttid):
        """Insert the final result."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        self.textbuf.end_user_action()

    def button_clicked(self, button):
        """Handle button presses."""
        if button.get_active():
            button.set_label("Stop")
            self.pipeline.set_state(gst.STATE_PLAYING)
        else:
            button.set_label("Speak")
            vader = self.pipeline.get_by_name('vad')
            vader.set_property('silent', True)

app = DemoApp()
gtk.main()
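To run the demo, save the listing to a file (the name livedemo.py below is arbitrary) and, with the environment variables from the top of this page set, start it from a terminal:

python livedemo.py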

Code Listing (Vala Port)

// Copyright (c) 2008 Carnegie Mellon University.
//
// You may modify and redistribute this file under the same terms as
// the CMU Sphinx system.  See
// http://cmusphinx.sourceforge.net/html/LICENSE for more information.

// valac --pkg gstreamer-0.10 --pkg gtk+-2.0 sphinx_livedemo.vala
using Gtk;
using Gst;


public class DemoApp : GLib.Object {
    private Gtk.Window window;
    private Gtk.TextBuffer textbuf;
    private dynamic Gst.Element asr;
    private dynamic Gst.Pipeline pipeline;
    private Gst.Element vader;
    private Gtk.TextView text;
    private Gtk.ToggleButton button;
    
    // GStreamer/PocketSphinx Demo Application
    public DemoApp() {
        // Initialize a DemoApp object
        this.init_gui();
        this.init_gst();
    }

    private void init_gui() {
        // Initialize the GUI components
        this.window = new Gtk.Window();
        this.window.delete_event.connect( () => { Gtk.main_quit(); return false; });
        this.window.set_default_size(400,200);
        this.window.set_border_width(10);
        
        var vbox        = new Gtk.VBox(false, 0);
        this.textbuf    = new Gtk.TextBuffer(null);
        
        text            = new Gtk.TextView.with_buffer(this.textbuf);
        text.set_wrap_mode(WrapMode.WORD);
        
        vbox.pack_start(text, true, true, 0);
        
        button = new Gtk.ToggleButton.with_label("Speak");
        button.clicked.connect(this.button_clicked);
        
        vbox.pack_start(button, false, false, 5);
        
        this.window.add(vbox);
        this.window.show_all();
    }

    private void init_gst() {
        // Initialize the speech components
        try {
            this.pipeline = (Gst.Pipeline) Gst.parse_launch("gconfaudiosrc ! audioconvert ! audioresample ! " + 
                                                            "vader name=vad auto-threshold=true ! pocketsphinx name=asr ! fakesink");
        }
        catch(Error e) {
            print("%s\n", e.message);
        }
        this.asr = this.pipeline.get_by_name("asr");
        this.asr.partial_result.connect(this.asr_partial_result);
        this.asr.result.connect(this.asr_result);
        this.asr.set_property("configured", true);
        
        var bus = this.pipeline.get_bus();
        bus.add_signal_watch();
        bus.message.connect(this.application_message);
        this.pipeline.set_state(Gst.State.PAUSED);
    }

    private void asr_partial_result(Gst.Element sender, string text, string uttid) {
        // Forward partial result signals on the bus to the main thread.
        var gststruct = new Gst.Structure.empty("partial_result");
        gststruct.set_value("hyp", text);
        gststruct.set_value("uttid", uttid);
        asr.post_message(new Gst.Message.application(this.asr, gststruct));
    }

    private void asr_result(Gst.Element sender, string text, string uttid) {
        // Forward result signals on the bus to the main thread.
        var gststruct = new Gst.Structure.empty("result");
        gststruct.set_value("hyp", text);
        gststruct.set_value("uttid", uttid);
        asr.post_message(new Gst.Message.application(this.asr, gststruct));
    }

    private void application_message(Gst.Bus bus, Gst.Message msg) {
        // Receive application messages from the bus.
        if(msg.type != Gst.MessageType.APPLICATION)
            return;
        if(msg.get_structure() == null)
            return;
        string msgtype = msg.get_structure().get_name();
        if(msgtype == "partial_result") {
            GLib.Value hy = msg.get_structure().get_value("hyp");
            GLib.Value ut = msg.get_structure().get_value("uttid");
            this.partial_result(hy, ut);
        }
        else if(msgtype == "result") {
            GLib.Value hy = msg.get_structure().get_value("hyp");
            GLib.Value ut = msg.get_structure().get_value("uttid");
            this.final_result(hy, ut);
            this.pipeline.set_state(Gst.State.PAUSED);
            this.button.set_active(false);
        }
    }

    private void partial_result(GLib.Value hyp, GLib.Value uttid) {
        // Delete any previous selection, insert text and select it.
        // All this stuff appears as one single action
        this.textbuf.begin_user_action();
        this.textbuf.delete_selection(true, this.text.get_editable());
        this.textbuf.insert_at_cursor((string)hyp, ((string)hyp).length);
        var ins     = this.textbuf.get_insert();
        Gtk.TextIter iter;
        this.textbuf.get_iter_at_mark(out iter, ins);
        iter.backward_chars(((string)hyp).length);
        this.textbuf.move_mark(ins, iter);
        this.textbuf.end_user_action();
    }

    private void final_result(GLib.Value hyp, GLib.Value uttid) {
        // Insert the final result.
        // All this stuff appears as one single action
        this.textbuf.begin_user_action();
        this.textbuf.delete_selection(true, this.text.get_editable());
        this.textbuf.insert_at_cursor(((string)hyp), ((string)hyp).length);
        this.textbuf.end_user_action();
    }

    private void button_clicked(Gtk.Widget sender) {
        // Handle button presses.
        if(((ToggleButton)sender).get_active()) {
            ((ToggleButton)sender).set_label("Stop");
            this.pipeline.set_state(Gst.State.PLAYING);
        }
        else {
            ((ToggleButton)sender).set_label("Speak");
            vader = this.pipeline.get_by_name("vad");
            vader.set_property("silent", true);
        }
    }
}


void main(string[] args) {
    Gtk.init(ref args);
    Gst.init(ref args);
    
    var app = new DemoApp();
    Gtk.main();
}

 