Using PocketSphinx with GStreamer and Python

PocketSphinx includes support for the GStreamer streaming media framework. This means that the PocketSphinx decoder can be treated as an element in a media processing pipeline, specifically one that filters audio into text.

This is useful because it simplifies a lot of the work involved in writing interactive speech applications. You don't need to handle any of the difficulties of dealing with audio input. Also, perhaps more importantly, it means that speech input can be integrated into a GUI application using GTK+ without agonizing pain.

Overview

This document will walk you through the steps of building a simple demo using GTK+, GStreamer, and Python. The only thing this program will do is recognize speech and display it in a text area, but it should give you the tools you need to do more interesting things.

Before starting, you should have an up-to-date version of PocketSphinx from the Subversion repository, as well as GStreamer, GTK+, and the Python modules for both of them. You should make sure that GStreamer is able to find the PocketSphinx plugin. If you have installed PocketSphinx in /usr/local, then you may have to set the following environment variable:

export GST_PLUGIN_PATH=/usr/local/lib/gstreamer-1.0

If you have installed PocketSphinx under a different prefix, you will also need to set the LD_LIBRARY_PATH variable. If your installation prefix is $psprefix, then you would want to set these variables.

export LD_LIBRARY_PATH=$psprefix/lib
export GST_PLUGIN_PATH=$psprefix/lib/gstreamer-1.0

To verify that GStreamer can find the plugin, run gst-inspect-1.0 pocketsphinx. You should get a large amount of output, ending with something like this:

  decoder             : The underlying decoder
                        flags: readable
                        Boxed pointer of type "PSDecoder"
  configured          : Set this to finalize configuration
                        flags: readable, writable
                        Boolean. Default: true
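
You can also verify this from Python. Here is a minimal sketch, assuming the GStreamer 1.0 Python bindings (gi) are installed; it simply looks the element up in the GStreamer registry:

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
# ElementFactory.find returns None if GStreamer cannot locate the plugin.
if Gst.ElementFactory.find('pocketsphinx') is None:
    print("pocketsphinx element not found; check GST_PLUGIN_PATH")
else:
    print("pocketsphinx element found")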

Background Reading

Before you look at this, it would be a good idea to get acquainted with the basics of GStreamer and GTK+ programming in Python, using their respective documentation and tutorials.

Of course, we also assume that you know Python.

Skeleton of a simple GUI program

Our simple demo program will just consist of a window, a text box, and a button which the user can push to start and stop speech recognition.

We will start by creating a Python class representing our demo application:

from gi import pygtkcompat
pygtkcompat.enable()
pygtkcompat.enable_gtk(version='3.0')

import gtk

class DemoApp(object):
    """GStreamer/PocketSphinx Demo Application"""
    def __init__(self):
        """Initialize a DemoApp object"""
        self.init_gui()
        self.init_gst()

    def init_gui(self):
        """Initialize the GUI components"""

    def init_gst(self):
        """Initialize the speech components"""

app = DemoApp()
gtk.main()

Now let's fill in the init_gui method. We are going to create a window with a gtk.VBox in it, holding a gtk.TextView (with associated gtk.TextBuffer) and a gtk.ToggleButton.

    def init_gui(self):
        """Initialize the GUI components"""
        self.window = gtk.Window()
        self.window.connect("delete-event", gtk.main_quit)
        self.window.set_default_size(400,200)
        self.window.set_border_width(10)
        vbox = gtk.VBox()
        self.textbuf = gtk.TextBuffer()
        self.text = gtk.TextView(buffer=self.textbuf)
        self.text.set_wrap_mode(gtk.WRAP_WORD)
        vbox.pack_start(self.text)
        self.button = gtk.ToggleButton("Speak")
        self.button.connect('clicked', self.button_clicked)
        vbox.pack_start(self.button, False, False, 5)
        self.window.add(vbox)
        self.window.show_all()

This gives us a nice text box and a button that does nothing. Now, we want to introduce GStreamer and PocketSphinx to our program, so that clicking the button makes it recognize speech and insert it into the textbox at the current location.

Adding GStreamer to the program

To use GStreamer, we need to add a few more lines of imports to the top of the program:

import gi
gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst
GObject.threads_init()
Gst.init(None)
gst = Gst

Now, we will fill in the init_gst method, which initializes a GStreamer pipeline that will do speech recognition for us.

Creating the pipeline

For simplicity we are creating it using the gst.parse_launch function, which reads a textual description and creates a pipeline from it. If automatic audio source selection does not work on your system, you may need to change autoaudiosrc to a different source element, such as alsasrc or osssrc.

        self.pipeline = gst.parse_launch('autoaudiosrc ! audioconvert ! audioresample '
                                         + '! pocketsphinx name=asr ! fakesink')

This pipeline consists of an audio source, followed by conversion and resampling (the pocketsphinx element currently requires 16kHz, 16-bit PCM audio), followed by recognition. The fakesink element discards the output of the speech recognition element (more on this below). Earlier versions used a separate vader element for voice activity detection; voice activity is now detected inside the decoder for better accuracy.
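
If you want to pin the audio format down explicitly, you can insert a caps filter between the resampler and the recognizer. This is a sketch, not required for the demo, assuming the element accepts 16kHz, 16-bit mono PCM as described above:

        self.pipeline = gst.parse_launch('autoaudiosrc ! audioconvert ! audioresample '
                                         + '! audio/x-raw,format=S16LE,channels=1,rate=16000 '
                                         + '! pocketsphinx name=asr ! fakesink')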

Running the pipeline

We are going to set up our program so that the pipeline starts out being paused, and is set to play (i.e. start recognition) when the user presses the “Speak” button. Once a final recognition result is retrieved we will put it back in paused mode and wait for another button press. If the user presses the button in the middle of recognition, it will halt speech recognition immediately. We will implement this using the button_clicked method which was connected to the button's signal in init_gui:

    def button_clicked(self, button):
        """Handle button presses."""
        if button.get_active():
            button.set_label("Stop")
            self.pipeline.set_state(gst.State.PLAYING)
        else:
            button.set_label("Speak")
            self.pipeline.set_state(gst.State.PAUSED)

The 'pocketsphinx' element

The pocketsphinx element functions as a filter: it takes audio data as input and produces text as output. This makes sense in the GStreamer framework, where data flows from a source to a sink, and is potentially useful for captioning or other multimedia applications. However, because automatic speech recognition by nature operates on arbitrarily large chunks of data (“utterances”), and the recognition result for an utterance can change as more data becomes available, this simple streaming data flow is not suitable for most ASR applications.

In practice, we want to treat the speech recognizer more like an input device in GUI programming, which emits a stream of messages that can be subscribed to and interpreted by a controller object. Fortunately, GStreamer allows us to do (almost) exactly this, using the same mechanism used for GTK+ widgets. So, we are going to connect methods in DemoApp to the messages emitted by the pocketsphinx element:

        bus = self.pipeline.get_bus()
        bus.add_signal_watch()
        bus.connect('message::element', self.element_message)

    def element_message(self, bus, msg):
        """Receive element messages from the bus."""
        msgtype = msg.get_structure().get_name()
        if msgtype != 'pocketsphinx':
            return

        if msg.get_structure()['final']:
            self.final_result(msg.get_structure()['hypothesis'],
                              msg.get_structure()['confidence'])
            self.pipeline.set_state(gst.State.PAUSED)
            self.button.set_active(False)
        elif msg.get_structure()['hypothesis']:
            self.partial_result(msg.get_structure()['hypothesis'])

Now, the methods partial_result and final_result will be called whenever a partial or complete utterance is decoded.

The pocketsphinx element has a number of properties which can be used to select different acoustic and language models and to tune the performance of the recognizer. See PocketSphinxElementProperties for more information, or run gst-inspect-1.0 pocketsphinx to get the embedded documentation.
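
You can also enumerate the available properties from Python. A minimal sketch, assuming the pipeline from init_gst with the element named asr:

        asr = self.pipeline.get_by_name('asr')
        # Each GParamSpec carries the property name and a short description.
        for prop in asr.list_properties():
            print(prop.name, '-', prop.blurb)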

Most importantly, you will need to set the language model property (lm) in order to do anything useful. The GStreamer element defaults to a “toy” language model designed for commanding wheeled robots, so it will only recognize things like “go forward three meters” and “turn left”.

Since initializing the speech recognizer takes some time, there is a “magical” property which allows you to force it to be initialized before you actually start feeding it data. We will use this here:

        asr = self.pipeline.get_by_name('asr')
        asr.set_property('configured', True)

Updating the text buffer

Finally, we need to implement the methods which update the text buffer with the current partial and final recognition results. There isn't much of interest here in relation to GStreamer and PocketSphinx.

    def partial_result(self, hyp):
        """Delete any previous selection, insert text and select it."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        ins = self.textbuf.get_insert()
        iter = self.textbuf.get_iter_at_mark(ins)
        iter.backward_chars(len(hyp))
        self.textbuf.move_mark(ins, iter)
        self.textbuf.end_user_action()

    def final_result(self, hyp, confidence):
        """Insert the final result."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        self.textbuf.end_user_action()

Improving Accuracy

You will notice that the accuracy of the speech recognition is pretty bad. In particular, the first utterance is basically never recognized correctly, and the system makes weird (and occasionally amusing) substitutions in subsequent utterances.

These are actually two separate problems. We're working on the first one (it relates to CepstralMeanNormalization, for those who know what that is). Unless you have a particularly difficult accent, the second one is largely a matter of LanguageModeling. If you have a limited set of sentences that you want to recognize, you can follow the directions in LanguageModelHowto to create an improved language model.

To use your improved language model with GStreamer, you just have to set the lm and dict properties on the pocketsphinx element. So, if your language model is in /home/user/mylanguagemodel.lm and the associated dictionary is /home/user/mylanguagemodel.dic, you would add these lines to the init_gst method (before the configured property is set):

        asr.set_property('lm', '/home/user/mylanguagemodel.lm')
        asr.set_property('dict', '/home/user/mylanguagemodel.dic')
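
Alternatively, since gst.parse_launch can set element properties directly in the pipeline description (the complete listing below sets configured=true this way), you could build the pipeline with the same hypothetical paths inline:

        self.pipeline = gst.parse_launch('autoaudiosrc ! audioconvert ! audioresample '
                                         + '! pocketsphinx lm=/home/user/mylanguagemodel.lm '
                                         + 'dict=/home/user/mylanguagemodel.dic ! fakesink')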

Code Listing

A complete code listing for this example follows:

from gi import pygtkcompat
import gi

gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst
GObject.threads_init()
Gst.init(None)

gst = Gst

print("Using pygtkcompat and Gst from gi")

pygtkcompat.enable()
pygtkcompat.enable_gtk(version='3.0')

import gtk

class DemoApp(object):
    """GStreamer/PocketSphinx Demo Application"""
    def __init__(self):
        """Initialize a DemoApp object"""
        self.init_gui()
        self.init_gst()

    def init_gui(self):
        """Initialize the GUI components"""
        self.window = gtk.Window()
        self.window.connect("delete-event", gtk.main_quit)
        self.window.set_default_size(400,200)
        self.window.set_border_width(10)
        vbox = gtk.VBox()
        self.textbuf = gtk.TextBuffer()
        self.text = gtk.TextView(buffer=self.textbuf)
        self.text.set_wrap_mode(gtk.WRAP_WORD)
        vbox.pack_start(self.text)
        self.button = gtk.ToggleButton("Speak")
        self.button.connect('clicked', self.button_clicked)
        vbox.pack_start(self.button, False, False, 5)
        self.window.add(vbox)
        self.window.show_all()

    def init_gst(self):
        """Initialize the speech components"""
        self.pipeline = gst.parse_launch('autoaudiosrc ! audioconvert ! audioresample '
                                         + '! pocketsphinx configured=true ! fakesink')
        bus = self.pipeline.get_bus()
        bus.add_signal_watch()
        bus.connect('message::element', self.element_message)

        self.pipeline.set_state(gst.State.PAUSED)

    def element_message(self, bus, msg):
        """Receive element messages from the bus."""
        msgtype = msg.get_structure().get_name()
        if msgtype != 'pocketsphinx':
            return

        if msg.get_structure()['final']:
            self.final_result(msg.get_structure()['hypothesis'], msg.get_structure()['confidence'])
            self.pipeline.set_state(gst.State.PAUSED)
            self.button.set_active(False)
        elif msg.get_structure()['hypothesis']:
            self.partial_result(msg.get_structure()['hypothesis'])

    def partial_result(self, hyp):
        """Delete any previous selection, insert text and select it."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        ins = self.textbuf.get_insert()
        iter = self.textbuf.get_iter_at_mark(ins)
        iter.backward_chars(len(hyp))
        self.textbuf.move_mark(ins, iter)
        self.textbuf.end_user_action()

    def final_result(self, hyp, confidence):
        """Insert the final result."""
        # All this stuff appears as one single action
        self.textbuf.begin_user_action()
        self.textbuf.delete_selection(True, self.text.get_editable())
        self.textbuf.insert_at_cursor(hyp)
        self.textbuf.end_user_action()

    def button_clicked(self, button):
        """Handle button presses."""
        if button.get_active():
            button.set_label("Stop")
            self.pipeline.set_state(gst.State.PLAYING)
        else:
            button.set_label("Speak")
            self.pipeline.set_state(gst.State.PAUSED)

app = DemoApp()
gtk.main()