Open Source Toolkit For Speech Recognition Project by Carnegie Mellon University
Pocketsphinx is a library that depends on another library called SphinxBase, which provides common functionality across all CMUSphinx projects. To install Pocketsphinx, you need to install both Pocketsphinx and SphinxBase. Pocketsphinx can be used on both Linux and Windows.
First of all, download the released packages pocketsphinx and sphinxbase, check them out from Subversion, or download a snapshot. For more details see the download page. Unpack them into the same directory. On Windows, you will need to rename 'sphinxbase-X.Y' (where X.Y is the SphinxBase version number) to simply 'sphinxbase' for this to work.
In a unix-like environment (such as Linux, Solaris, FreeBSD etc):
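Build and install sphinxbase first. If you checked the sources out from the repository, run autogen.sh at least once to generate the configure file:
% ./autogen.sh
If you downloaded a release package, or have already run autogen.sh at least once, then compile and install:
% ./configure
% make
% make install
By default, everything is installed into the /usr/local/ folder; you can change this with --prefix. You can also configure with or without python. Not every system loads libraries from this folder automatically. To load them, you need to configure the path that is searched for shared libraries. This can be done either in the file /etc/ld.so.conf or by exporting environment variables:
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
Then build and install pocketsphinx in the same way from its own directory:
% ./configure
% make
% make install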
In MS Windows (TM), under MS Visual Studio 2008 (or newer - we test with Visual C++ 2008 Express):
1. Build all the projects in SphinxBase (from sphinxbase.sln).
2. Load pocketsphinx.sln in the pocketsphinx directory and build all of its projects.
MS Visual Studio will build the executables under .\bin\Release or .\bin\Debug (depending on the configuration you select in MS Visual Studio), and the libraries under .\lib\Release or .\lib\Build. To run pocketsphinx_continuous, don't forget to copy sphinxbase.dll to the bin folder. Otherwise the executable will fail to find this library.
SphinxBase uses the standard unix autogen system, and it includes a script, build_for_iphone.sh, that sets up configure to create binaries that are XCode friendly.
./autogen.sh
./build_for_iphone.sh simulator
./build_for_iphone.sh device
Then in XCode, open your project info and, for 'All Configurations', set:
'Header Search Paths' = "$(HOME)$(SDK_DIR)/include/pocketsphinx"
'Library Search Paths' = "$(HOME)$(SDK_DIR)/lib"
'Other Linker Flags' = "-lpocketsphinx"
The Pocketsphinx API is designed to ease the use of speech recognizer functionality in your applications.
Reference documentation for the new API is available at http://cmusphinx.sourceforge.net/api/pocketsphinx/
There are a few key things you need to know about how to use the API:
1. Command-line parsing is done externally (in <cmd_ln.h>).
2. Everything takes a ps_decoder_t * as the first argument.
To illustrate the new API, we will step through a simple “hello world” example. This example is somewhat specific to Unix in the locations of files and the compilation process. We will create a C source file called hello_ps.c. To compile it (on Unix), use this command:
gcc -o hello_ps hello_ps.c \
-DMODELDIR=\"`pkg-config --variable=modeldir pocketsphinx`\" \
`pkg-config --cflags --libs pocketsphinx sphinxbase`
Please note that compilation errors here usually mean that the installation steps above were not followed completely. For example, pocketsphinx needs to be properly installed to be available through the pkg-config system. To check that pocketsphinx is installed properly, run pkg-config --cflags --libs pocketsphinx sphinxbase from the command line and check that the output looks like this:
-I/usr/local/include -I/usr/local/include/sphinxbase -I/usr/local/include/pocketsphinx -L/usr/local/lib -lpocketsphinx -lsphinxbase -lsphinxad
The first thing we need to do is to create a configuration object, which for historical reasons is called cmd_ln_t. Along with the general boilerplate for our C program, we will do it like this:
#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
    ps_decoder_t *ps;
    cmd_ln_t *config;

    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-hmm", MODELDIR "/hmm/en_US/hub4wsj_sc_8k",
                         "-lm", MODELDIR "/lm/en/turtle.DMP",
                         "-dict", MODELDIR "/lm/en/turtle.dic",
                         NULL);
    if (config == NULL)
        return 1;

    return 0;
}
The cmd_ln_init() function takes a variable number of null-terminated string arguments, followed by NULL. The first argument is any previous cmd_ln_t * which is to be updated. The second argument is an array of argument definitions - the standard set can be obtained by calling ps_args(). The third argument is a flag telling the argument parser to be “strict” - if this is TRUE, then duplicate arguments or unknown arguments will cause parsing to fail.
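The NULL-terminated calling convention described above can be illustrated without the library. The following is only a sketch: find_option() is a hypothetical helper written for this example, not part of PocketSphinx or SphinxBase, and it makes no claim about how cmd_ln_init() is implemented. It shows why the trailing NULL matters: it is the only way the callee can tell where the list of name/value pairs ends.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of PocketSphinx: walks a NULL-terminated
 * list of "-name", "value" string pairs and returns the value given for
 * `wanted`, or NULL if that name does not appear.
 */
static const char *
find_option(const char *wanted, ...)
{
    va_list ap;
    const char *name;
    const char *found = NULL;

    assert(wanted != NULL);
    va_start(ap, wanted);
    /* Stop at the terminating NULL, which is why callers must supply it. */
    while ((name = va_arg(ap, const char *)) != NULL) {
        const char *value = va_arg(ap, const char *);
        if (value == NULL)
            break;              /* a name without a value: malformed list */
        if (found == NULL && strcmp(name, wanted) == 0)
            found = value;      /* remember the first match */
    }
    va_end(ap);
    return found;
}
```

For example, find_option("-lm", "-hmm", "model", "-lm", "turtle.DMP", (const char *)NULL) returns a pointer to "turtle.DMP". The explicit cast on the terminator is a defensive habit for variadic calls, since NULL is not guaranteed to have pointer type on every C implementation.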
The MODELDIR macro is defined on the GCC command-line by using pkg-config to obtain the modeldir variable from PocketSphinx configuration. On Windows, you can simply add a preprocessor definition to the code, such as this:
#define MODELDIR "c:/sphinx/model"
(replace this with wherever your models are installed). Now, to initialize the decoder, use ps_init:
ps = ps_init(config);
if (ps == NULL)
    return 1;
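As an aside on the MODELDIR macro used in the configuration: expressions such as MODELDIR "/lm/en/turtle.DMP" rely on the C rule that adjacent string literals are joined at compile time. The sketch below shows this in isolation. The fallback path and the names lm_path and under_modeldir() are assumptions made for this example; they are not PocketSphinx defaults.

```c
#include <assert.h>
#include <string.h>

/* Fallback used only when the compiler was not given -DMODELDIR=...;
 * this path is a placeholder assumption, not a PocketSphinx default. */
#ifndef MODELDIR
#define MODELDIR "/usr/local/share/pocketsphinx/model"
#endif

/* Adjacent string literals are joined at compile time, so this is one
 * constant string; nothing is concatenated at run time. */
static const char *const lm_path = MODELDIR "/lm/en/turtle.DMP";

/* Returns non-zero if `path` starts with the compiled-in model directory. */
static int
under_modeldir(const char *path)
{
    assert(path != NULL);
    return strncmp(path, MODELDIR, strlen(MODELDIR)) == 0;
}
```

Because the joining happens in the compiler, MODELDIR must expand to a quoted string literal, which is why the GCC command line above escapes the quotes around the pkg-config output.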
Because live audio input is somewhat platform-specific, we will confine ourselves to decoding audio files. The “turtle” language model recognizes a very simple “robot control” language, which recognizes phrases such as “go forward ten meters”. In fact, there is an audio file helpfully included in the PocketSphinx source code which contains this very sentence. You can find it in test/data/goforward.raw. Copy it to the current directory. If you want to create your own version of it, it needs to be a single-channel (monaural), little-endian, unheadered 16-bit signed PCM audio file sampled at 16000 Hz.
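If you do create your own audio file, the format requirements above can be met by hand. The following sketch writes samples in that layout regardless of the byte order of the machine it runs on. The names write_raw_pcm(), raw_pcm_seconds() and SAMPLE_RATE are hypothetical and chosen for this example only; nothing here comes from the PocketSphinx API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SAMPLE_RATE 16000   /* samples per second, as required above */

/*
 * Hypothetical helper: writes `n` mono samples to `path` as unheadered,
 * little-endian, 16-bit signed PCM. Returns 0 on success, -1 on error.
 */
static int
write_raw_pcm(const char *path, const int16_t *samples, size_t n)
{
    FILE *fh;
    size_t i;

    assert(path != NULL && samples != NULL);
    fh = fopen(path, "wb");
    if (fh == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        uint16_t u = (uint16_t)samples[i];
        unsigned char bytes[2];

        bytes[0] = (unsigned char)(u & 0xff);         /* low byte first */
        bytes[1] = (unsigned char)((u >> 8) & 0xff);  /* then high byte */
        if (fwrite(bytes, 1, 2, fh) != 2) {
            fclose(fh);
            return -1;
        }
    }
    return fclose(fh) == 0 ? 0 : -1;
}

/* Duration in seconds of a raw file of `nbytes` bytes in this format. */
static double
raw_pcm_seconds(long nbytes)
{
    return (double)(nbytes / 2) / SAMPLE_RATE;
}
```

Since the file has no header, its length is the only clue to its duration: two bytes per sample at 16000 samples per second means 32000 bytes per second of audio.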
To do this, we will first open the file:
FILE *fh;

fh = fopen("goforward.raw", "rb");
if (fh == NULL) {
    perror("Failed to open goforward.raw");
    return 1;
}
And then decode it, using ps_decode_raw():
rv = ps_decode_raw(ps, fh, "goforward", -1);
if (rv < 0)
    return 1;
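Now, to get the hypothesis, we can use ps_get_hyp(). The declarations shown here, including the rv used above, belong with the other declarations at the top of main():
char const *hyp, *uttid;
int rv;
int32 score;

hyp = ps_get_hyp(ps, &score, &uttid);
if (hyp == NULL)
    return 1;
printf("Recognized: %s\n", hyp);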
Now, we will decode the same file again, but using the API for decoding audio data from blocks of memory. In this case, we need to first start the utterance using ps_start_utt():
fseek(fh, 0, SEEK_SET);
rv = ps_start_utt(ps, "goforward");
if (rv < 0)
    return 1;
We will then read 512 samples at a time from the file, and feed them to the decoder using ps_process_raw():
int16 buf[512];
size_t nsamp;

/* Loop on fread()'s return value so the loop also ends on a read error. */
while ((nsamp = fread(buf, 2, 512, fh)) > 0) {
    rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
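The block-reading pattern can be tried on its own, without a decoder. In the sketch below, the callback type block_fn stands in for the call to ps_process_raw(); it, feed_blocks(), count_block() and BLOCK_SAMPLES are hypothetical names introduced for this example and are not part of the PocketSphinx API. The point it illustrates is that the loop is driven by the number of samples actually read, so a short final block is still delivered and a read error ends the loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SAMPLES 512

/* Hypothetical stand-in for the decoder call: receives one block of samples
 * and returns a negative value to report an error. */
typedef int (*block_fn)(const int16_t *buf, size_t nsamp, void *user);

/*
 * Reads 16-bit samples from `fh` in blocks of up to BLOCK_SAMPLES and passes
 * each non-empty block to `fn`. Returns the total number of samples
 * delivered, or -1 if `fn` reports an error or the read fails.
 */
static long
feed_blocks(FILE *fh, block_fn fn, void *user)
{
    int16_t buf[BLOCK_SAMPLES];
    size_t nsamp;
    long total = 0;

    assert(fh != NULL && fn != NULL);
    /* Loop on fread's return value so a short or empty final read ends it. */
    while ((nsamp = fread(buf, sizeof buf[0], BLOCK_SAMPLES, fh)) > 0) {
        if (fn(buf, nsamp, user) < 0)
            return -1;
        total += (long)nsamp;
    }
    return ferror(fh) ? -1 : total;
}

/* Example callback: counts how many blocks were delivered. */
static int
count_block(const int16_t *buf, size_t nsamp, void *user)
{
    (void)buf;
    (void)nsamp;
    ++*(int *)user;
    return 0;
}
```

With a file of 1300 samples, feed_blocks(fh, count_block, &blocks) delivers blocks of 512, 512 and 276 samples, returning 1300 and leaving blocks at 3.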
Then we will need to mark the end of the utterance using ps_end_utt():
rv = ps_end_utt(ps);
if (rv < 0)
    return 1;
Retrieving the hypothesis string works in exactly the same way:
hyp = ps_get_hyp(ps, &score, &uttid);
if (hyp == NULL)
    return 1;
printf("Recognized: %s\n", hyp);
To clean up, simply call ps_free() on the object that was returned by ps_init(). You should not do anything to free the configuration object. Putting the pieces together, the complete program is:
#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
    ps_decoder_t *ps;
    cmd_ln_t *config;
    FILE *fh;
    char const *hyp, *uttid;
    int16 buf[512];
    size_t nsamp;
    int rv;
    int32 score;

    /* Create the configuration object. */
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-hmm", MODELDIR "/hmm/en_US/hub4wsj_sc_8k",
                         "-lm", MODELDIR "/lm/en/turtle.DMP",
                         "-dict", MODELDIR "/lm/en/turtle.dic",
                         NULL);
    if (config == NULL)
        return 1;

    /* Initialize the decoder. */
    ps = ps_init(config);
    if (ps == NULL)
        return 1;

    fh = fopen("goforward.raw", "rb");
    if (fh == NULL) {
        perror("Failed to open goforward.raw");
        return 1;
    }

    /* First pass: decode the whole file in one call. */
    rv = ps_decode_raw(ps, fh, "goforward", -1);
    if (rv < 0)
        return 1;
    hyp = ps_get_hyp(ps, &score, &uttid);
    if (hyp == NULL)
        return 1;
    printf("Recognized: %s\n", hyp);

    /* Second pass: rewind and feed the same audio in blocks of 512 samples. */
    fseek(fh, 0, SEEK_SET);
    rv = ps_start_utt(ps, "goforward");
    if (rv < 0)
        return 1;
    /* Loop on fread()'s return value so the loop also ends on a read error. */
    while ((nsamp = fread(buf, 2, 512, fh)) > 0) {
        rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
    }
    rv = ps_end_utt(ps);
    if (rv < 0)
        return 1;
    hyp = ps_get_hyp(ps, &score, &uttid);
    if (hyp == NULL)
        return 1;
    printf("Recognized: %s\n", hyp);

    /* Clean up. */
    fclose(fh);
    ps_free(ps);
    return 0;
}
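If you are porting more complicated uses of the old API, there are some significant differences:
1. Partial and full results are retrieved with the same function.
2. Word segmentations are retrieved through an iterator.
3. Language model switching is done through the language model set API (in <ngram_model.h>).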
The first of these is straightforward. Before, you had to use uttproc_partial_result() to get partial results (i.e. before uttproc_end_utt() was called), and uttproc_result() for full results. Now, ps_get_hyp() works for both.
For word segmentations, the API provides an iterator object which is used to, well, iterate over the sequence of words. This iterator object is an abstract type, with some accessors provided to obtain timepoints, scores, and (most interestingly) posterior probabilities for each word.
Finally, language model switching is quite different. The decoder is always associated with a language model set object (yes, even if there is only one language model). Switching language models is accomplished by:
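1. Getting the language model set object with ps_get_lmset()
2. Selecting the new language model with ngram_model_set_select()
3. Telling the decoder that the language model set has changed with ps_update_lmset()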