Frequently Asked Questions (FAQ)

Q: Why is my accuracy poor

The first thing you need to understand is whether your accuracy is just lower than expected or very low. If it is very low, you most likely misconfigured the decoder. If it is lower than expected, there are various ways to improve it.

The first thing you should do is collect a database of test samples and measure the recognition accuracy. You need to dump utterances into wav files, write the reference text and use the decoder to decode them. Then calculate the WER using the word_align.pl tool from sphinxtrain. The required test database size depends on the accuracy you need to measure, but usually 30 minutes of transcribed audio is enough to test recognizer accuracy reliably.
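
For example, a minimal test run could look like the sketch below; the model paths and file names are only placeholders, so adjust them to your setup and to the location of word_align.pl in your sphinxtrain checkout.

# Batch-decode the test set listed in test.fileids (one utterance id per
# line, wav files in the wav/ directory) and write hypotheses to test.hyp.
pocketsphinx_batch -adcin yes -cepdir wav -cepext .wav \
    -ctl test.fileids -hmm your_acoustic_model -lm your.lm -dict your.dic \
    -hyp test.hyp

# Compare the hypotheses against the reference transcription;
# word_align.pl from sphinxtrain prints the word error rate (WER).
perl word_align.pl test.transcription test.hyp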

Only once you have a test database can you proceed with recognition accuracy optimization.

The top reasons for bad accuracy are:

  1. A mismatch in the sample rate or the number of channels of the incoming audio. It must be 16kHz (or 8kHz, depending on the training data), 16-bit, mono (single channel), little-endian. You need to fix the sample rate of the source by resampling (only if its rate is higher than that of the training data). You should not upsample a file and decode it with acoustic models trained on higher sampling rate audio. The audio file format (sample rate, number of channels) can be verified with the command below
    sox --i /path/to/audio/file 
  2. Zero (all-silence) regions in audio files decoded from mp3 break the decoder. You can use dither to introduce a small amount of random noise and solve this problem (see the sketch after this list).
  3. A mismatch with the acoustic model. To verify this hypothesis you need to construct a language model from the test database text. Such a language model will be very well matched and must give you high accuracy. If accuracy is still low, you need to work more on the acoustic model; you can use acoustic model adaptation to improve accuracy.
  4. A mismatch with the language model. You can create your own language model to match the vocabulary you are trying to decode (see the sketch after this list).
  5. A mismatch between the dictionary and the pronunciation of the words. In that case work must be done on the phonetic dictionary.
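
As a hedged illustration of the dither and language model points above (file names are placeholders; the cmuclmtk tools are installed separately):

# Point 2: re-encode with a small amount of dither to avoid all-zero
# silence regions (sox's dither effect).
sox input.wav dithered.wav dither

# Points 3-4: build a small ARPA language model from a text corpus
# (for example the test transcripts) with the cmuclmtk tools.
text2wfreq < corpus.txt > corpus.wfreq
wfreq2vocab < corpus.wfreq > corpus.vocab
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa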

Q: How to do noise reduction

There are multiple levels at which you can fight noise and corruption of the audio. Noise cancellation algorithms modify the audio itself, feature denoising can clean up the features, and you can also reduce noise and mismatch at the model level, just by adapting the model.

Recent CMUSphinx code has a noise cancellation feature. In sphinxbase/pocketsphinx/sphinxtrain it is the 'remove_noise' option; in sphinx4 it is the Denoise frontend component. So if you are using the latest version you should already be robust to noise to some degree. Most modern models are trained with noise cancellation already enabled; if you have your own model you need to retrain it.
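
As a hedged example, with a recent pocketsphinx the option can be toggled on the command line (model paths and file names are placeholders):

# Decode with the noise removal switched off to compare results; in
# recent versions the default is on.
pocketsphinx_continuous -hmm your_acoustic_model -lm your.lm -dict your.dic \
    -remove_noise no -infile noisy.wav > hyp.txt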

The algorithm implemented is spectral subtraction on the mel filterbank. There are certainly more advanced algorithms; if needed, you can extend the current implementation.

It is not recommended to perform external noise suppression, because many noise cancellation algorithms corrupt the speech spectrum in unusual ways and reduce speech recognition accuracy even more than the noise itself. For that reason you need to be very careful when selecting a noise cancellation algorithm; only some of them, like Ephraim-Malah or Kalman filtering, will work properly.

A reasonable way to fight noise is to adapt the model or to train it on noisy audio (so-called multistyle training). MLLR adaptation usually compensates for quite a significant part of the noise corruption.
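
For example, once you have computed an MLLR transform with the sphinxtrain adaptation tools (bw followed by mllr_solve, as described in the adaptation tutorial), it can be passed to the decoder roughly like this (paths and file names are placeholders):

# Decode with the MLLR transform written by mllr_solve (mllr_matrix).
pocketsphinx_continuous -hmm your_acoustic_model -lm your.lm -dict your.dic \
    -mllr mllr_matrix -infile test.wav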

Q: Can pocketsphinx reject out-of-grammar words and noises

There are a few ways to deal with OOV rejection; for more details see Rejecting Out-of-Grammar Utterances. The current state of implementation of those approaches is:

  • Garbage Models - require you to train a special model. There is no public model with garbage phones that can reject OOV words yet. There are models with fillers, but they reject only specific sounds (breath, laughter, um); they cannot reject OOV words.
  • Generic Word Model - same as above, requires you to train a special model. There are no public models yet.
  • Confidence Scores - a confidence score (ps_get_prob) can be reliably calculated only for a large vocabulary (> 100 words); it doesn't work with small grammars. There are approaches using phone-based confidence, and one of them was implemented in Sphinx-2, but pocketsphinx doesn't support them. Confidence scoring also requires three-pass recognition (enable both fwdflat and bestpath).

So for now the recommendation for rejection with a small grammar is to train your own model (and make it public). For a large language model (> 100 words), use the confidence score; see the sketch below.
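
For example, to make sure both extra passes needed for confidence scoring are enabled before querying ps_get_prob through the API (model paths are placeholders; both options default to yes in recent versions):

# Run the decoder with the fwdflat and bestpath passes enabled so that
# posterior probabilities (confidence scores) can be computed.
pocketsphinx_continuous -hmm your_acoustic_model -lm large_vocab.lm -dict your.dic \
    -fwdflat yes -bestpath yes -infile test.wav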

Q: pocketsphinx_continuous stuck in READY, nothing is recognized

If pocketsphinx_continuous shows READY and doesn't react to your speech, it means that pocketsphinx is recording silence. The reasons for that are:

  • Pocketsphinx is decoding from the wrong device. Check the log for the warning
         ''Warning: Could not find Mic element'' 
    

    (try to change the device with the -adcdev option; see the sketch after this list)

  • The recording volume is too low (try to increase the recording level in the volume settings)
  • The microphone is broken (check that other programs can record)
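
A hedged example of selecting the capture device explicitly (the device name plughw:1,0 is only an illustration; the accepted format depends on which audio framework sphinxbase was built with):

# On ALSA systems, list the available capture devices first.
arecord -l

# Then point pocketsphinx at the desired device.
pocketsphinx_continuous -hmm your_acoustic_model -lm your.lm -dict your.dic \
    -adcdev plughw:1,0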

Q: Which languages are supported

CMUSphinx itself is language-independent; you can recognize any language. However, it requires an acoustic model and a language model. We provide prebuilt models for many languages (English, Chinese, French, Spanish, German, Russian, etc.) in the download section.

Q: How to add support for a new language

The process of building support for a new language consists of the following steps:

  • Data collection (you can collect audiobooks with text transcription from projects like LibriVox, transcribed podcasts, or set up web-based data collection; you can also contribute to VoxForge). You can start very quickly with just a few hours of transcribed data.
  • Data cleanup
  • Model training
  • Testing

Most steps are described in the CMUSphinx Tutorial For Developers.

Q: I have an issue with CMUSphinx and need help

When you report a problem, always provide the following information:

  • Version of the software you are using
  • Information about your system
  • Actions you've made
  • Your expectations
  • What went wrong

If you want to get a fast answer, also submit the following information:

  • System logs
  • Test sample. Try to make the test sample as small and as self-contained as possible. It will help you get a fast and detailed answer.

See the How to ask questions howto for more details.

Q: What speech feature type does CMUSphinx use and what does it represent

CMUSphinx uses mel-cepstrum MFCC features. There are various types of MFCC which differ in the number of parameters, but they are not really different in accuracy (it might be a few percent worse or better).
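
If you want to inspect the features themselves, they can be dumped with the sphinx_fe tool from sphinxbase; a minimal sketch (file names are placeholders, and the parameters must match the feat.params of your model):

# Extract 13-dimensional MFCC vectors from a 16 kHz, 16 bit, mono wav file.
sphinx_fe -i speech.wav -o speech.mfc -mswav yes -samprate 16000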

The interpretation of MFCC (roughly introduced in Alan V. Oppenheim and Ronald W. Schafer, "From Frequency to Quefrency: A History of the Cepstrum", IEEE Signal Processing Magazine) is not applicable as such, and the use of 12 or 13 coefficients seems to be due to historical reasons in many of the reported cases. The choice of the number of MFCCs to include in an ASR system is largely empirical. To understand why any specific number of cepstral coefficients is used, you could do worse than look at very early (pre-HMM) papers. When using DTW with Euclidean or even Mahalanobis distances, it quickly became apparent that the very high cepstral coefficients were not helpful for recognition, and to a lesser extent, neither were the very low ones. The most common solution was to “lifter” the MFCCs - i.e. apply a weighting function to them to emphasise the mid-range coefficients. These liftering functions were “optimised” by a number of researchers, but they almost always ended up being close to zero by the time you got to the 12th coefficient.

In practice, the optimal number of coefficients depends on the quantity of training data, the details of the training algorithm (in particular how well the PDFs can be modelled as the dimensionality of the feature space increases), the number of Gaussian mixtures in the HMMs, the speaker and background noise characteristics, and sometimes the available computing resources.

In semi-continuous models CMUSphinx uses a specific packing of derivatives to optimize vector quantization and thus compress the model better. Over the years various features have been used; mostly they were selected by experiment.

Actually, MFCC coefficients are now mostly obsolete. Most commercial decoders and research systems use PLP (perceptual linear prediction) for feature extraction. PLP provides improved robustness over MFCC and slightly better accuracy, but probably has patent issues. We hope to see PLP models in CMUSphinx soon.

Q: How to implement "Wake-up listening"

Sometimes it is necessary to listen for speech continuously and respond only to commands. In that case you need a background grammar and an analysis module that decides what makes sense in the current context and what doesn't. This extra analysis module is a requirement. See the paper describing such a system for more details:

Hands-Free Voice Activation of Personal Communication Devices, by Sahar E. Bou-Ghazale and Ayman O. Asadi.

Q: How to implement "Pronunciation Evaluation"

Some applications may wish to measure speech consistency with “ideal” forms of pronunciation as represented by the acoustic model and the grammar. An example of this might be software that presents users with text to speak and evaluates their pronunciation against an assumed ideal. The goal is figuring out if speakers are speaking the text we assume them to be speaking clearly enough to proceed with a certain level of confidence.

While it is not possible to force Sphinx to generate a probabilistic match against a random entry in a language model or grammar file, it is possible to create acoustic and language models containing the text to be found along with common variants, and then measure which of the selected variants comes closest to the spoken text. Thus the statistical hypothesis testing apparatus can be applied to distinguish the match hypothesis from the mismatch hypothesis. Both hypotheses should be evaluated against proper data.

The research on computer-aided language learning is quite extensive. A good starting point is the following paper, though you will have to read some more papers to get the full picture:

The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring, by Franco et al.

Q: Pocketsphinx crashes on Windows in _lock_file

The log output and stack trace usually look like the following:

INFO: fe_interface.c(289): You are using internal mechanism to generate the seed.
INFO: feat.c(289): Initializing feature stream to type: '1s_c_d_dd', ceplen=13,CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: acmod.c(153): Reading linear feature trasformation from acoustic/feature.transform
INFO: mdef.c(520): Reading model definition: acoustic/mdef
INFO: bin_mdef.c(173): Allocation 104810 * 8 bytes (818 KiB) for CD tree
INFO: tmat.c(205): Reading HMM transition probability matrices:
acoustic/transition_matrices (After it I have crash)

Stack trace:
ntdll.dll!774f8db9()    
        [Frames below may be incorrect and/or missing, no symbols loaded for
ntdll.dll]
        ntdll.dll!774f8cc8()    
>       msvcr100.dll!_lock_file(_iobuf * pf)  Line 236 + 0xa bytes      C
        msvcr100.dll!fgets(char * string, int count, _iobuf * str)  Line 71 + 0x6
bytes   C
        sphinxbase.dll!002319ef()       
        msvcr100.dll!_unlock(int locknum)  Line 375     C
        msvcr100.dll!_unlock_file(_iobuf * pf)  Line 313 + 0xe bytes    C
        msvcr100.dll!fread_s(void * buffer, unsigned int bufferSize, unsigned int
elementSize, unsigned int count, _iobuf * stream)  Line 113 + 0x8 bytes C
        msvcr100.dll!fread(void * buffer, unsigned int elementSize, unsigned int count,
_iobuf * stream)  Line 303 + 0x13 bytes C
        sphinxbase.dll!00231743()       
        sphinxbase.dll!00231cbf()       

sphinxbase was compiled with the MultiThreadedDLL runtime; see the vcxproj:

 <RuntimeLibrary>MultiThreadedDLL</RuntimeLibrary>

If you don't compile your project with the same setting, it will crash. Use the proper runtime library or recompile sphinxbase.

Q: Problems with building the python module in sphinxbase on OSX

There are very common issues with building python modules on OSX. The sign of them is a warning about an unsupported architecture:

ld: warning: in ../src/libsphinxbase/.libs/libsphinxbase.dylib, file was built 
for unsupported file format which is not the architecture being linked (i386)

or

/usr/libexec/gcc/powerpc-apple-darwin10/4.2.1/as: assembler (
/usr/bin/../libexec/gcc/darwin/ppc/as or /usr/bin/../local/libexec/gcc/darwin/ppc/as) 
for architecture ppc not installed
Installed assemblers are:
/usr/bin/../libexec/gcc/darwin/x86_64/as for architecture x86_64
/usr/bin/../libexec/gcc/darwin/i386/as for architecture i386

This is a bug in setup.py on OSX. For details and solutions to this problem please read:

http://stackoverflow.com/questions/5256397/python-easy-install-fails-with-assembler-for-architecture-ppc-not-installed-on

You can also disable python in configure with the --without-python option.
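
For example, when rebuilding sphinxbase:

# Build sphinxbase without the python module on OSX.
./configure --without-python
make
sudo make install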

Q: Failed to open audio device(/dev/dsp): No such file or directory

The device file /dev/dsp is missing because OSS support is not enabled in the kernel. You can either compile pocketsphinx with ALSA support by installing the ALSA development headers (package libasound2-dev or alsa-devel) and recompiling, or you can install the oss-compat package to enable OSS support.
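
A hedged example for Debian/Ubuntu-style systems (package names differ on other distributions):

# Option 1: rebuild sphinxbase/pocketsphinx against ALSA.
sudo apt-get install libasound2-dev
./configure && make && sudo make install   # re-run in both sphinxbase and pocketsphinx

# Option 2: keep the OSS build and just provide /dev/dsp emulation.
sudo apt-get install oss-compat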

The installation process is not an issue once you understand the audio subsystems in Linux. The audio subsystem is unfortunately complex, but once you understand it things become easier. Historically, the audio subsystem has been quite fragmented. It includes the following major frameworks:

  • Old Unix-like DSP (OSS) framework – everything is handled by the kernel-space driver. Applications interact with the /dev/dsp device to produce and record audio
  • ALSA – the newer audio subsystem, partially in the kernel but also with a userspace library, libasound. ALSA also provides a DSP compatibility layer through the snd_pcm_oss driver, which creates a /dev/dsp device and emulates audio
  • Pulseaudio – an even newer system which works on top of the libasound ALSA library but provides a sound server to centralize all the processing. To communicate with the sound server it provides the libpulse library, which applications must use to record sound
  • Jack – another sound server, similar to Pulseaudio; it also works on top of ALSA and provides another library, libjack. There are other, less popular frameworks – for example ESD (the old GNOME sound server), ARTS (the old KDE sound server) and Portaudio (a portable library usable across Windows, Linux and Mac) – but sphinxbase doesn't support them

The recommended audio framework on Ubuntu is pulseaudio.

Sphinxbase and pocketsphinx support all these frameworks and automatically select the one you need at compile time. Pulseaudio has the highest priority. Before you install sphinxbase you need to decide which framework to use, and then install the development part of the corresponding framework.

For example, it's recommended to install the libpulse-dev package to provide access to pulseaudio; after that sphinxbase will automatically work with Pulseaudio. Once you work with pulseaudio you do not need the other frameworks. On an embedded device, try to configure ALSA instead.

Q: What is sample rate and how does it affect accuracy

Unfortunately we don't provide universal models for different bandwidths (8kHz models are about 10% worse in accuracy) and we cannot detect the sample rate yet. So before using the decoder you need to make sure that the sample rate of the decoder matches the sample rate of the input audio and that the bandwidth of the audio matches the bandwidth that was used to train the model. A mismatch results in very bad accuracy.

First of all you need to understand the difference between sample rate and frequency bandwidth. The sample rate is the rate of samples in the recording; a sample rate of 16000 means that 16000 samples are collected every second. You can resample audio with sox or with ffmpeg:

sox file.wav -r 16000 file-16000.wav
ffmpeg -i file.mp3 -ar 16000 file-16000.wav

Then there is the bandwidth – the range of frequencies included in the audio, which doesn't change with the sample rate. If the audio contained frequencies only up to 8kHz, then no matter how you resample it the bandwidth will stay the same, and due to the mismatched bandwidth the audio will not be recognized properly.

To check the bandwidth of the audio you need to look at its spectrum in an audio editor like Wavesurfer. You'll see a dark spectrum only up to 4kHz if the audio comes from an 8kHz source. If your audio was recorded from a telephone source or was compressed by a VoIP codec, most likely it is narrowband (8kHz) audio.

 [Figure: 8kHz bandwidth audio spectrum sampled at 16kHz, shown in Wavesurfer]
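
If you prefer the command line, a hedged alternative to an audio editor (this assumes a sox build that includes the spectrogram effect):

# Print the sample rate of the file.
sox --i -r file.wav

# Render the spectrum to a PNG; narrowband audio shows no energy above about 4 kHz.
sox file.wav -n spectrogram -o file-spectrogram.png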

So first you need to do the following: make sure that the frontend sample rate matches the sample rate of the audio, and then make sure that the bandwidth of the audio matches your model's bandwidth. Use the en-us generic model for 16kHz audio and the en-us-8khz generic model for 8kHz audio.

 