Building an application with PocketSphinx

Installation

PocketSphinx is a library that depends on another library, SphinxBase, which provides common functionality across all CMUSphinx projects. To install PocketSphinx, you need to install both PocketSphinx and SphinxBase. PocketSphinx can be used on both Linux and Windows.

First, download the released pocketsphinx and sphinxbase packages, check them out from Subversion, or download a snapshot. For more details, see the download page. Unpack them into the same directory. On Windows, you will need to rename 'sphinxbase-X.Y' (where X.Y is the SphinxBase version number) to simply 'sphinxbase' for this to work.

Unix-like Installation

In a unix-like environment (such as Linux, Solaris, FreeBSD etc):

  • First, build and install SphinxBase. If you downloaded directly from the repository, you need to run autogen.sh at least once to generate the configure file:
% ./autogen.sh
  • If you downloaded the release version, or have already run autogen.sh, compile and install:
% ./configure
% make
% make install
  • If you want to use fixed-point arithmetic, you must configure SphinxBase with the --enable-fixed option. You can also set the installation prefix with --prefix, and configure with or without Python.
  • SphinxBase will be installed under /usr/local/ by default. Not every system loads libraries from this location automatically. To make them available, configure the search path for shared libraries, either in the file /etc/ld.so.conf or by exporting environment variables (PKG_CONFIG_PATH additionally lets pkg-config find the installed packages):
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
  • Then change to the pocketsphinx folder and perform the same steps:
% ./configure
% make
% make install
  • To test the installation, run pocketsphinx_continuous and check that it recognizes the words you speak into the microphone.

Windows

In MS Windows (TM), under MS Visual Studio 2008 (or newer - we test with Visual C++ 2008 Express):

  • load sphinxbase.sln located in sphinxbase directory
  • compile all the projects in SphinxBase (from sphinxbase.sln)
  • load pocketsphinx.sln in pocketsphinx directory
  • compile all the projects in PocketSphinx

MS Visual Studio will build the executables under .\bin\Release or .\bin\Debug (depending on the configuration you select in Visual Studio), and the libraries under .\lib\Release or .\lib\Build. To run pocketsphinx_continuous, copy sphinxbase.dll to the bin folder first; otherwise the executable will fail to find this library.

XCode Installation (for iPhone)

SphinxBase uses the standard Unix autogen system, and an included script, build_for_iphone.sh, sets up configure to create XCode-friendly binaries.

./autogen.sh
./build_for_iphone.sh simulator
./build_for_iphone.sh device

Then in XCode, open your project info and, for 'All Configurations', set:

'Header Search Paths' = "$(HOME)$(SDK_DIR)/include/pocketsphinx"
'Library Search Paths' = "$(HOME)$(SDK_DIR)/lib"
'Other Linker Flags' = "-lpocketsphinx"

Pocketsphinx API Core Ideas

The PocketSphinx API is designed to ease the use of speech recognizer functionality in your applications. Compared with the old API, it has the following advantages:

  1. It is much more likely to remain stable both in terms of source and binary compatibility, due to the use of abstract types.
  2. It is fully re-entrant, so there is no problem having multiple decoders in the same process.
  3. The new language model API (in SphinxBase) supports linear interpolation of multiple models at run-time.
  4. It has enabled a drastic reduction in code footprint and a modest but significant reduction in memory consumption.
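
As an aside on item 3, linear interpolation combines several language models by taking a weighted sum of their probabilities, with weights that sum to 1. The sketch below shows only the textbook arithmetic in plain C. The function interpolate() is a hypothetical helper, not a SphinxBase function, and the real library may well do this differently (for example in the log domain).

```c
#include <stddef.h>

/*
 * Textbook linear interpolation (illustrative, not SphinxBase code):
 *   P(w|h) = sum_i weight[i] * prob[i]
 * where prob[i] is model i's probability for the same word and history,
 * and the weights are non-negative and sum to 1.
 */
static double
interpolate(const double *prob, const double *weight, size_t n_models)
{
        double p = 0.0;
        size_t i;

        for (i = 0; i < n_models; i++)
                p += weight[i] * prob[i];
        return p;
}
```

For example, two models that assign probabilities 0.5 and 0.25 to the same word, interpolated with equal weights of 0.5, give a combined probability of 0.375.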

Reference documentation for the new API is available at http://cmusphinx.sourceforge.net/api/pocketsphinx/

Basic Usage (hello world)

There are a few key things you need to know about using the API:

  1. Command-line parsing is done externally (in <cmd_ln.h>)
  2. Everything takes a ps_decoder_t * as the first argument.

To illustrate the new API, we will step through a simple “hello world” example. This example is somewhat specific to Unix in the locations of files and the compilation process. We will create a C source file called hello_ps.c. To compile it (on Unix), use this command:

gcc -o hello_ps hello_ps.c \
    -DMODELDIR=\"`pkg-config --variable=modeldir pocketsphinx`\" \
    `pkg-config --cflags --libs pocketsphinx sphinxbase`

Please note that compilation errors at this point usually mean that the installation steps above were not followed completely. For example, PocketSphinx must be properly installed to be available through the pkg-config system. To check that it is installed properly, run pkg-config --cflags --libs pocketsphinx sphinxbase from the command line and verify that the output looks like this:

-I/usr/local/include -I/usr/local/include/sphinxbase -I/usr/local/include/pocketsphinx  
-L/usr/local/lib -lpocketsphinx -lsphinxbase -lsphinxad

Initialization

The first thing we need to do is to create a configuration object, which for historical reasons is called cmd_ln_t. Along with the general boilerplate for our C program, we will do it like this:

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
        ps_decoder_t *ps;
        cmd_ln_t *config;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", MODELDIR "/hmm/en_US/hub4wsj_sc_8k",
                             "-lm", MODELDIR "/lm/en/turtle.DMP",
                             "-dict", MODELDIR "/lm/en/turtle.dic",
                             NULL);
        if (config == NULL)
                return 1;

        return 0;
}

The cmd_ln_init() function takes a variable number of null-terminated string arguments, followed by NULL. The first argument is any previous cmd_ln_t * which is to be updated. The second argument is an array of argument definitions - the standard set can be obtained by calling ps_args(). The third argument is a flag telling the argument parser to be “strict” - if this is TRUE, then duplicate arguments or unknown arguments will cause parsing to fail.
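
The calling convention can be tried out without SphinxBase. The sketch below is not the cmd_ln_init() implementation. It is a toy function, count_options() (a hypothetical name), that walks a NULL-terminated list of name/value strings in the same style and applies a "strict" check against a small table of known names. How this toy treats unknown or duplicate names in non-strict mode (it skips them) is an assumption made for illustration only.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Toy table of accepted option names (illustrative only). */
#define N_KNOWN 3
static const char *const known_names[N_KNOWN] = { "-hmm", "-lm", "-dict" };

/* Return the index of name in known_names, or -1 if it is not listed. */
static int
find_name(const char *name)
{
        int i;

        for (i = 0; i < N_KNOWN; i++)
                if (strcmp(known_names[i], name) == 0)
                        return i;
        return -1;
}

/*
 * Walk name/value string pairs until the terminating NULL.
 * Returns the number of pairs accepted, or -1 if a name has no value or,
 * when strict is non-zero, if a name is unknown or given twice.
 */
static int
count_options(int strict, ...)
{
        va_list args;
        const char *name, *value;
        int seen[N_KNOWN] = { 0 };
        int n = 0;

        va_start(args, strict);
        while ((name = va_arg(args, const char *)) != NULL) {
                int idx = find_name(name);

                value = va_arg(args, const char *);
                if (value == NULL) {            /* name without a value */
                        n = -1;
                        break;
                }
                if (idx < 0 || seen[idx]) {     /* unknown or duplicate */
                        if (strict) {
                                n = -1;
                                break;
                        }
                        continue;               /* non-strict: skip it */
                }
                seen[idx] = 1;
                n++;
        }
        va_end(args);
        return n;
}
```

A call such as count_options(1, "-hmm", "a", "-lm", "b", (const char *)NULL) accepts both pairs, while adding an unknown "-bogus" pair makes the strict call fail. The explicit cast on the terminator keeps the variadic call portable even where NULL is defined as a plain integer constant.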

The MODELDIR macro is defined on the GCC command-line by using pkg-config to obtain the modeldir variable from PocketSphinx configuration. On Windows, you can simply add a preprocessor definition to the code, such as this:

#define MODELDIR "c:/sphinx/model"

(replace this with wherever your models are installed). Now, to initialize the decoder, use ps_init:

        ps = ps_init(config);
        if (ps == NULL)
                return 1;

Decoding a file stream

Because live audio input is somewhat platform-specific, we will confine ourselves to decoding audio files. The “turtle” language model covers a very simple “robot control” language, with phrases such as “go forward ten meters”. In fact, an audio file containing this very sentence is helpfully included in the PocketSphinx source code: test/data/goforward.raw. Copy it to the current directory. If you want to create your own version of it, it needs to be a single-channel (monaural), little-endian, unheadered 16-bit signed PCM audio file sampled at 16000 Hz.
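
The format requirements can be made concrete with a small sketch. Assuming you already have mono samples in memory as 16-bit signed integers, the hypothetical helper write_pcm16le() below (not part of PocketSphinx) writes them as unheadered little-endian PCM regardless of the host byte order. pcm_duration_seconds() is likewise an illustrative helper that derives the duration from the file size at 16000 Hz.

```c
#include <stdint.h>
#include <stdio.h>

/* Sample rate required for the input file, as stated in the text. */
#define SAMPLE_RATE_HZ 16000

/*
 * Hypothetical helper (not part of PocketSphinx): write n mono samples as
 * unheadered 16-bit signed little-endian PCM. Each sample is emitted low
 * byte first, so the output is identical on big- and little-endian hosts.
 * Returns the number of samples written.
 */
static size_t
write_pcm16le(FILE *out, const int16_t *samples, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++) {
                uint16_t u = (uint16_t)samples[i];
                unsigned char bytes[2];

                bytes[0] = (unsigned char)(u & 0xff);        /* low byte */
                bytes[1] = (unsigned char)((u >> 8) & 0xff); /* high byte */
                if (fwrite(bytes, 1, 2, out) != 2)
                        break;
        }
        return i;
}

/* Duration in seconds of a raw file of n_bytes bytes (2 bytes per sample). */
static double
pcm_duration_seconds(long n_bytes)
{
        return (double)(n_bytes / 2) / SAMPLE_RATE_HZ;
}
```

Open the output file in binary mode ("wb") so that no newline translation is applied on Windows. A file of 32000 bytes in this format holds exactly one second of audio.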

To decode the file, we will first open it:

        FILE *fh;

        fh = fopen("goforward.raw", "rb");
        if (fh == NULL) {
                perror("Failed to open goforward.raw");
                return 1;
        }

And then decode it, using ps_decode_raw():

        int rv;

        rv = ps_decode_raw(ps, fh, "goforward", -1);
        if (rv < 0)
                return 1;

Now, to get the hypothesis, we can use ps_get_hyp():

        char const *hyp, *uttid;
        int32 score;

        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

Decoding audio data from memory

Now, we will decode the same file again, but using the API for decoding audio data from blocks of memory. In this case, we need to first start the utterance using ps_start_utt():

        fseek(fh, 0, SEEK_SET);
        rv = ps_start_utt(ps, "goforward");
        if (rv < 0)
                return 1;

We will then read 512 samples at a time from the file, and feed them to the decoder using ps_process_raw():

        int16 buf[512];
        size_t nsamp;

        /* Stop when fread() delivers no more samples (end of file or error). */
        while ((nsamp = fread(buf, 2, 512, fh)) > 0) {
                rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
                if (rv < 0)
                        return 1;
        }
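
The block size of 512 is arbitrary, and the last block of a file is normally shorter; fread() reports how many samples it actually delivered. The sketch below isolates this reading pattern so it can be tried without a decoder. The names feed_blocks(), block_fn and count_block() are hypothetical, and the callback merely stands in for the call to ps_process_raw().

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SAMPLES 512

/* Hypothetical callback type standing in for the decoder call. */
typedef int (*block_fn)(const int16_t *data, size_t n_samples, void *user);

/*
 * Read 16-bit samples from fh in blocks of up to BLOCK_SAMPLES and pass
 * each non-empty block to fn. Stops at end of file, on a read error, or
 * when fn returns a negative value. Returns the total number of samples
 * delivered, or -1 if fn failed.
 */
static long
feed_blocks(FILE *fh, block_fn fn, void *user)
{
        int16_t buf[BLOCK_SAMPLES];
        size_t nsamp;
        long total = 0;

        while ((nsamp = fread(buf, sizeof(buf[0]), BLOCK_SAMPLES, fh)) > 0) {
                if (fn(buf, nsamp, user) < 0)
                        return -1;
                total += (long)nsamp;
        }
        return total;
}

/* Example callback: count how many blocks were delivered. */
static int
count_block(const int16_t *data, size_t n_samples, void *user)
{
        (void)data;
        (void)n_samples;
        ++*(int *)user;
        return 0;
}
```

With a 1300-sample file, the callback sees three blocks of 512, 512 and 276 samples.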

Then we will need to mark the end of the utterance using ps_end_utt():

        rv = ps_end_utt(ps);
        if (rv < 0)
                return 1;

Retrieving the hypothesis string works in exactly the same way:

        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

Cleaning up

To clean up, simply call ps_free() on the object that was returned by ps_init(). You should not do anything to free the configuration object.

Code listing

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
        ps_decoder_t *ps;
        cmd_ln_t *config;
        FILE *fh;
        char const *hyp, *uttid;
        int16 buf[512];
        size_t nsamp;
        int rv;
        int32 score;

        /* Create the configuration and initialize the decoder. */
        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", MODELDIR "/hmm/en_US/hub4wsj_sc_8k",
                             "-lm", MODELDIR "/lm/en/turtle.DMP",
                             "-dict", MODELDIR "/lm/en/turtle.dic",
                             NULL);
        if (config == NULL)
                return 1;
        ps = ps_init(config);
        if (ps == NULL)
                return 1;

        fh = fopen("goforward.raw", "rb");
        if (fh == NULL) {
                perror("Failed to open goforward.raw");
                return 1;
        }

        /* First pass: decode the whole file in a single call. */
        rv = ps_decode_raw(ps, fh, "goforward", -1);
        if (rv < 0)
                return 1;
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

        /* Second pass: rewind and feed the same audio in 512-sample blocks. */
        fseek(fh, 0, SEEK_SET);
        rv = ps_start_utt(ps, "goforward");
        if (rv < 0)
                return 1;
        while ((nsamp = fread(buf, 2, 512, fh)) > 0) {
                rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
                if (rv < 0)
                        return 1;
        }
        rv = ps_end_utt(ps);
        if (rv < 0)
                return 1;
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

        fclose(fh);
        ps_free(ps);
        return 0;
}

Advanced Usage

For more complicated uses, there are some significant differences from the old API:

  1. There are no longer separate functions for getting partial and full hypotheses.
  2. Word segmentations are accessed via iterators rather than being returned as arrays or lists.
  3. Language model switching is done externally (in <ngram_model.h>)

The first of these is straightforward. Before, you had to use uttproc_partial_result() to get partial results (i.e. before uttproc_end_utt() was called), and uttproc_result() for full results. Now, ps_get_hyp() works for both.

For word segmentations, the API provides an iterator object which is used to, well, iterate over the sequence of words. This iterator object is an abstract type, with some accessors provided to obtain timepoints, scores, and (most interestingly) posterior probabilities for each word.
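
Assuming the timepoints returned by the accessors are frame indices with an inclusive end frame, and that the decoder runs at 100 frames per second (both are assumptions to check against the API reference and your configuration), they can be converted to seconds as sketched below. The helper names are hypothetical and not part of the API.

```c
/* Assumed frame rate: 100 frames per second (check your configuration). */
#define FRAMES_PER_SECOND 100

/* Time in seconds at which a word starting in start_frame begins. */
static double
frame_start_seconds(int start_frame)
{
        return (double)start_frame / FRAMES_PER_SECOND;
}

/*
 * Time in seconds at which a word ending in end_frame ends. The end frame
 * is assumed to be inclusive, so the word lasts until the next frame starts.
 */
static double
frame_end_seconds(int end_frame)
{
        return (double)(end_frame + 1) / FRAMES_PER_SECOND;
}
```

Under these assumptions, a word spanning frames 150 to 199 starts at 1.5 seconds and ends at 2.0 seconds.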

Finally, language model switching is quite different. The decoder is always associated with a language model set object (yes, even if there is only one language model). Switching language models is accomplished by:

  1. Getting a handle to the language model set object: ps_get_lmset()
  2. Selecting the new language model: ngram_model_set_select()
  3. Telling the decoder the language model set has been updated: ps_update_lmset()
 
tutorialpocketsphinx.txt · Last modified: 2012/03/17 21:33 by admin
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported