First, determine whether your accuracy is just lower than expected or very low. If it is very low, you have most likely misconfigured the decoder. If it is lower than expected, there are various ways to improve it.
The first thing you should do is collect a database of test samples and measure the recognition accuracy. Dump the utterances into wav files, write the reference text, and use the decoder to decode them. Then calculate the WER (word error rate) using the word_align.pl tool from Sphinxtrain. The required size of the test database depends on the accuracy, but usually 30 minutes of transcribed audio is enough to test recognizer accuracy reliably.
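To make the WER measurement concrete, here is a minimal sketch of how the number is computed. It is not the word_align.pl implementation; the function names `edit_distance` and `wer` are made up for this illustration, and it assumes the reference and hypothesis texts are already normalized (same case, no punctuation).

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    # prev[j] is the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs):
    """WER over (reference, hypothesis) string pairs:
    total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("the reference texts contain no words")
    return errors / words
```

Calling `wer` with a list of (reference, hypothesis) pairs from your test database gives a single figure that you can track while you change the configuration.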
Only once you have a test database can you proceed with optimizing recognition accuracy.
The top reasons for bad accuracy are:
sox --i /path/to/audio/file
Noise and audio corruption can be fought at several levels. Noise cancellation algorithms modify the audio itself, and feature denoising cleans up the features. You can also reduce noise and mismatch at the model level, just by adapting the model.
Recent CMUSphinx code has a noise cancellation feature. In sphinxbase/pocketsphinx/sphinxtrain it is the 'remove_noise' option. In sphinx4 it is the Denoise frontend component. So if you are using the latest version, you should already be robust to noise to some degree. Most modern models are already trained with noise cancellation; if you have your own model, you need to retrain it.
The implemented algorithm is spectral subtraction on the mel filterbank. There are certainly more advanced algorithms; if needed, you can extend the current implementation.
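The idea of spectral subtraction can be sketched in a few lines. This is a simplified illustration and not the sphinxbase code: it assumes that the first few frames contain only noise and uses their mean as a fixed noise estimate, whereas a practical implementation would track the noise level over time. The function name and the `noise_frames` and `floor` parameters are invented for this example.

```python
def spectral_subtraction(frames, noise_frames=10, floor=0.01):
    """Subtract an estimated noise spectrum from mel filterbank energies.

    frames: list of frames, each a list of non-negative filterbank energies.
    noise_frames: number of leading frames assumed to contain only noise.
    floor: fraction of the original energy kept as a minimum, so that the
           result never becomes negative.
    """
    if not frames:
        return []
    if noise_frames < 1:
        raise ValueError("noise_frames must be at least 1")
    lead = frames[:noise_frames]
    num_bands = len(frames[0])
    # Per-band noise estimate: mean energy over the leading frames.
    noise = [sum(frame[b] for frame in lead) / len(lead)
             for b in range(num_bands)]
    cleaned = []
    for frame in frames:
        # Subtract the noise estimate, but keep at least floor * energy.
        cleaned.append([max(e - n, floor * e) for e, n in zip(frame, noise)])
    return cleaned
```

The floor matters because the features are computed from the logarithm of these energies later on, so zero or negative values have to be avoided.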
External noise suppression is not recommended, because many noise cancellation algorithms corrupt the speech spectrum in unusual ways and reduce speech recognition accuracy even more than the noise itself. For that reason you need to be very careful when selecting a noise cancellation algorithm. Only some of them, like Ephraim-Malah or Kalman, will work properly.
A reasonable way to fight noise is to adapt the model or to train it on noisy audio; the latter is so-called multistyle training. MLLR adaptation usually compensates for a significant part of the noise corruption.
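If you do not have noisy recordings for multistyle training, one common way to produce them is to mix recorded noise into clean speech at a chosen signal-to-noise ratio. The sketch below shows only that mixing step. It is not part of Sphinxtrain; the function name `mix_at_snr` is hypothetical, and it assumes both signals are lists of samples at the same sample rate.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio
    of the mixture equals snr_db (in decibels)."""
    if not speech or not noise:
        raise ValueError("speech and noise must not be empty")
    # Repeat the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # Choose the gain so that
    # speech_power / (gain**2 * noise_power) == 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

Mixing the same training set at several SNR values gives the model a range of conditions to learn from.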
There are a few ways to deal with OOV rejection; for more details see Rejecting Out-of-Grammar Utterances. The implementation status of those approaches is:
So for now the recommendation for rejection with a small grammar is to train your own model (and make it public). For a large language model (> 100 words), use the confidence score.
If continuous shows READY and doesn't react to your speech, it means that pocketsphinx is recording silence. The reasons for that are:
''Warning: Could not find Mic element''
(try to change device with
CMUSphinx itself is language-independent; you can recognize any language. However, it requires an acoustic model and a language model. We provide prebuilt language models for many languages (English, Chinese, French, Spanish, German, Russian, etc.) in the download section.
The process of building a new language model consists of the following steps:
* Data collection (you can collect audiobooks with text transcription from projects like librivox or transcribed podcasts, or set up web data collection. You can also try to contribute to Voxforge. You can start very quickly with just a few hours of transcribed data.)
Most steps are described in the CMUSphinx Tutorial For Developers.
When you report a problem, always provide the following information:
If you want to get a fast answer, also submit the following information:
See the How to ask questions howto for more details.
CMUSphinx uses mel-cepstrum (MFCC) features. There are various types of MFCC that differ in the number of parameters, but they do not really differ in accuracy (it might be a few percent worse or better).
The interpretation of MFCC (roughly introduced in Alan V. Oppenheim and Ronald W. Schafer, From Frequency to Quefrency: A History of the Cepstrum, IEEE Signal Processing Magazine) is not applicable as such, and the use of 12 or 13 coefficients seems to be due to historical reasons in many of the reported cases. The choice of the number of MFCCs to include in an ASR system is largely empirical. To understand why any specific number of cepstral coefficients is used, you could do worse than look at very early (pre-HMM) papers. When using DTW with Euclidean or even Mahalanobis distances, it quickly became apparent that the very high cepstral coefficients were not helpful for recognition, and to a lesser extent, neither were the very low ones. The most common solution was to “lifter” the MFCCs, i.e. apply a weighting function to them to emphasise the mid-range coefficients. These liftering functions were “optimised” by a number of researchers, but they almost always ended up being close to zero by the time you got to the 12th coefficient.
In practice, the optimal number of coefficients depends on the quantity of training data, the details of the training algorithm (in particular how well the PDFs can be modelled as the dimensionality of the feature space increases), the number of Gaussian mixtures in the HMMs, the speaker and background noise characteristics, and sometimes the available computing resources.
For semicontinuous models CMUSphinx uses a specific packing of derivatives to optimize vector quantization and thus compress the model better. Over the years various features were used. Mostly they were selected by experiment.
Actually, MFCC coefficients are now mostly obsolete. Most commercial decoders and research systems use PLP (Perceptual Linear Prediction) for feature extraction. PLP provides improved robustness over MFCC and slightly better accuracy but probably has patent issues. We hope to see PLP models in CMUSphinx soon.
Sometimes it's necessary to listen for speech continuously and respond only to commands. In this case you need a background grammar and an analysis module that decides what makes sense in the current context and what doesn't. This extra analysis module is a requirement. See the papers describing the system for more details:
Sahar E. Bou-Ghazale and Ayman O. Asadi
Some applications may wish to measure speech consistency with “ideal” forms of pronunciation as represented by the acoustic model and the grammar. An example of this might be software that presents users with text to speak and evaluates their pronunciation against an assumed ideal. The goal is figuring out if speakers are speaking the text we assume them to be speaking clearly enough to proceed with a certain level of confidence.
While it is not possible to force Sphinx to generate a probabilistic match against a random entry in a language model or grammar file, it is possible to create acoustic and language models containing the text to be found along with common variants, and then measure which of the selected variants comes closest to the spoken text. Thus the statistical hypothesis testing apparatus can be applied to distinguish between the match hypothesis and the mismatch hypothesis. Both hypotheses should be evaluated against proper data.
Research on computer-aided language learning is quite extensive. A good starting point is the following paper, though you will have to read more papers to get the full picture.
The stack trace is usually the following:
INFO: fe_interface.c(289): You are using internal mechanism to generate the seed.
INFO: feat.c(289): Initializing feature stream to type: '1s_c_d_dd', ceplen=13,CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean= 12.00, mean[1..12]= 0.0
INFO: acmod.c(153): Reading linear feature trasformation from acoustic/feature.transform
INFO: mdef.c(520): Reading model definition: acoustic/mdef
INFO: bin_mdef.c(173): Allocation 104810 * 8 bytes (818 KiB) for CD tree
INFO: tmat.c(205): Reading HMM transition probability matrices: acoustic/transition_matrices
(After it I have crash)
Stack trace:
ntdll.dll!774f8db9()
[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]
ntdll.dll!774f8cc8()
> msvcr100.dll!_lock_file(_iobuf * pf) Line 236 + 0xa bytes C
msvcr100.dll!fgets(char * string, int count, _iobuf * str) Line 71 + 0x6 bytes C
sphinxbase.dll!002319ef()
msvcr100.dll!_unlock(int locknum) Line 375 C
msvcr100.dll!_unlock_file(_iobuf * pf) Line 313 + 0xe bytes C
msvcr100.dll!fread_s(void * buffer, unsigned int bufferSize, unsigned int elementSize, unsigned int count, _iobuf * stream) Line 113 + 0x8 bytes C
msvcr100.dll!fread(void * buffer, unsigned int elementSize, unsigned int count, _iobuf * stream) Line 303 + 0x13 bytes C
sphinxbase.dll!00231743()
sphinxbase.dll!00231cbf()
sphinxbase was compiled with the MultiThreadedDLL runtime, see in vcxproj
If you don't compile your project with the same setting, it will crash. Use the proper runtime library or recompile sphinxbase.
There are very common issues with building Python modules on OSX. Their sign is a warning about an unsupported architecture.
ld: warning: in ../src/libsphinxbase/.libs/libsphinxbase.dylib, file was built for unsupported file format which is not the architecture being linked (i386)
/usr/libexec/gcc/powerpc-apple-darwin10/4.2.1/as: assembler ( /usr/bin/../libexec/gcc/darwin/ppc/as or /usr/bin/../local/libexec/gcc/darwin/ppc/as) for architecture ppc not installed
Installed assemblers are:
/usr/bin/../libexec/gcc/darwin/x86_64/as for architecture x86_64
/usr/bin/../libexec/gcc/darwin/i386/as for architecture i386
This is a bug in setup.py for OSX. For details and solutions for this problem please read
You can also disable Python in configure with the --without-python configure option.
The device file /dev/dsp is missing because OSS support is not enabled in the kernel. You can either compile pocketsphinx with ALSA support, by installing the ALSA development headers from the libasound2-dev or alsa-devel package and recompiling, or install the oss-compat package to enable OSS support.
The installation process is not an issue if you understand the complexity of audio subsystems in Linux. Unfortunately the audio subsystem is complex, but once you understand it, things get easier. Historically, the audio subsystem has been pretty fragmented. It includes the following major frameworks:
The recommended audio framework on Ubuntu is pulseaudio.
Sphinxbase and pocketsphinx support all of these frameworks and automatically select one at compile time. Pulseaudio has the highest priority. Before you install sphinxbase, you need to decide which framework to use and then set up the development part of that framework.
For example, it's recommended to install the libpulse-dev package to provide access to pulseaudio; after that sphinxbase will automatically work with pulseaudio. Once you work with pulseaudio, you do not need the other frameworks. On an embedded device, try to configure ALSA.
Unfortunately we don't provide universal models for different bandwidths (8 kHz models are 10% worse in accuracy), and we cannot detect the sample rate yet. So before using the decoder you need to make sure both that the sample rate of the decoder matches the sample rate of the input audio and that the bandwidth of the audio matches the bandwidth that was used to train the model. A mismatch results in very bad accuracy.
First of all, you need to understand the difference between sample rate and frequency bandwidth. The sample rate is the rate of samples in the recording: a sample rate of 16000 means that 16000 samples are collected every second. You can resample audio with sox or with ffmpeg:
sox file.wav -r 16000 file-16000.wav
ffmpeg -i file.mp3 -ar 16000 file-16000.wav
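The sample rate of a WAV file can also be checked from a script before it is passed to the decoder. The sketch below uses Python's standard `wave` module. The function names are invented for this example, and the check for mono 16-bit audio is an assumption about the expected input format, so adjust it to your setup.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels, bits_per_sample) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate(),
                wav.getnchannels(),
                wav.getsampwidth() * 8)


def matches_decoder(path, expected_rate=16000):
    """True if the file is mono, 16-bit and has the expected sample rate.

    expected_rate should be the sample rate the decoder is configured for.
    """
    rate, channels, bits = wav_format(path)
    return rate == expected_rate and channels == 1 and bits == 16
```

Note that this only verifies the sample rate stored in the file header. It says nothing about the bandwidth of the audio.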
Then there is the bandwidth, the range of frequencies included in the audio; resampling to a higher sample rate does not extend it. If the audio had frequencies only up to 8 kHz, then no matter how you resample it, the bandwidth will still be the same. And due to the mismatched bandwidth, the audio will not be recognized properly.
To check the bandwidth of the audio, look at its spectrum in an audio editor like Wavesurfer. You'll see a dark spectrum only up to 4 kHz if the audio is 8 kHz. If your audio was recorded from a telephone source or was compressed by a VoIP codec, it most likely has only 8 kHz bandwidth.
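The visual check can be approximated numerically by measuring how much of the signal energy lies above 4 kHz. The sketch below is only a rough illustration with a hypothetical function name. It runs a naive DFT on one short frame of samples, so it is slow and should be given a few hundred samples taken from a part of the file that contains speech. A practical tool would use an FFT library and average over many windowed frames.

```python
import cmath
import math


def energy_fraction_above(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy above cutoff_hz in one frame of samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    high = 0.0
    # For a real signal, bins 1..n/2 cover all frequencies up to half the
    # sample rate. Bin 0 (the DC offset) is skipped.
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n > cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0
```

If a file with a 16000 sample rate gives a fraction close to zero for a 4000 Hz cutoff on several speech frames, the audio was probably upsampled from a narrowband source.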
So first make sure that the frontend sample rate matches the sample rate of the audio. Then make sure that the bandwidth of the audio matches the bandwidth of your model. Use the en-us generic model for 16 kHz bandwidth and the en-us-8khz generic model for 8 kHz bandwidth.