Speech recognition accuracy is not always great.
First, determine whether your accuracy is just lower than expected or very low. If it is very low, you have most likely misconfigured the decoder. If it is only lower than expected, there are various ways to improve it.
The first thing you should do is collect a database of test samples and measure the recognition accuracy. Dump the utterances into wav files, write the reference text, and use the decoder to decode them. Then calculate the word error rate (WER) using the word_align.pl tool from Sphinxtrain. The required test database size depends on the accuracy, but usually 30 minutes of transcribed audio is enough to test recognizer accuracy reliably.
Only once you have a test database can you proceed with optimizing recognition accuracy.
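To make the WER measurement concrete, here is a minimal sketch of how a word error rate can be computed from a reference and a hypothesis with a standard edit-distance alignment. It is not the word_align.pl implementation; the function names are hypothetical, and it assumes plain whitespace-separated words with no text normalization.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, subs, dels, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, dels, ins)          # match, no cost
            else:
                diagonal = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = cost[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = cost[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare on total edits first, so the total is minimal;
            # the split between error types may differ when there are ties.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return sum(align_counts(reference, hypothesis)) / n_words


if __name__ == "__main__":
    print(word_error_rate("turn on the light", "turn on light"))
```

In this example one of the four reference words is missing from the hypothesis, so the sketch counts a single deletion.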
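The top reasons for bad accuracy are:
A mismatch between the format of the incoming audio (sample rate, number of channels) and what the decoder expects. You can verify the format of an audio file with the command sox --i /path/to/audio/file. Find more information here: What is sample rate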
To test the recognition, you need to configure the decoder with the required parameters. In particular, you need a language model <your.lm>. For more details, see Building Language Model.
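Create a fileids file, test.fileids, with one utterance ID per line:
    test1
    test2
Create a transcription file, test.transcription, with the reference text of each utterance followed by its ID in parentheses:
    some text (test1)
    some text (test2)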
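For a larger test set it can be easier to generate both files from one source so the IDs always match. The following is a rough sketch, assuming one utterance ID per line in test.fileids and "text (id)" lines in test.transcription; the function name and the input dictionary are hypothetical.

```python
from pathlib import Path


def write_test_set(utterances, directory="."):
    """Write test.fileids and test.transcription from {utterance_id: reference_text}."""
    directory = Path(directory)
    ids = list(utterances)
    # One utterance ID per line, in the same order as the transcription.
    fileids = "".join(f"{utt_id}\n" for utt_id in ids)
    # Reference text followed by the utterance ID in parentheses.
    transcription = "".join(f"{utterances[utt_id]} ({utt_id})\n" for utt_id in ids)
    (directory / "test.fileids").write_text(fileids, encoding="utf-8")
    (directory / "test.transcription").write_text(transcription, encoding="utf-8")
    return fileids, transcription


if __name__ == "__main__":
    write_test_set({"test1": "some text", "test2": "some text"})
```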
Put the audio files in the wav folder. Make sure those files have the proper format and sample rate.
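The format check can also be scripted. Below is a small sketch using Python's standard wave module; it assumes the model expects 16 kHz, 16-bit, mono audio by default (pass 8000 for 8 kHz models), and the function name is hypothetical.

```python
import wave


def wav_problems(path, expected_rate=16000):
    """Return a list of format problems found in a WAV file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        for problem in wav_problems(wav_path):
            print(f"{wav_path}: {problem}")
```

Running it over every file in the wav folder before decoding catches a mismatched file early, before it shows up as a poor error rate.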
Now, let's run the decoder:
    pocketsphinx_batch \
        -adcin yes \
        -cepdir wav \
        -cepext .wav \
        -ctl test.fileids \
        -lm <your.lm, for example en-us.lm.dmp from pocketsphinx> \
        -dict <your.dic, for example cmudict-en-us.dict from pocketsphinx> \
        -hmm <your_hmm, for example en-us> \
        -hyp test.hyp

    word_align.pl test.transcription test.hyp
The word_align.pl script is part of the Sphinxtrain distribution.
Make sure to add -samprate 8000 to the above command if you are decoding 8 kHz files!
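If you run the test repeatedly, it may help to assemble the command in a script so that the -samprate option is not forgotten. This is only a sketch: the function name is hypothetical, the options are the ones from the command above, and the model file names in the example are the ones mentioned there and must be replaced with your own paths.

```python
import shlex


def build_batch_command(lm, dictionary, hmm, samprate=None,
                        ctl="test.fileids", hyp="test.hyp"):
    """Build the pocketsphinx_batch argument list used for the accuracy test."""
    cmd = [
        "pocketsphinx_batch",
        "-adcin", "yes",
        "-cepdir", "wav",
        "-cepext", ".wav",
        "-ctl", ctl,
        "-lm", lm,
        "-dict", dictionary,
        "-hmm", hmm,
        "-hyp", hyp,
    ]
    if samprate is not None:
        # Needed when the audio is not at the default rate, e.g. 8000 for 8 kHz files.
        cmd += ["-samprate", str(samprate)]
    return cmd


if __name__ == "__main__":
    command = build_batch_command(
        "en-us.lm.dmp", "cmudict-en-us.dict", "en-us", samprate=8000
    )
    # Print the command; to run it, pass the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```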
The word_align.pl script from Sphinxtrain reports the exact error rate, which you can use to decide whether adaptation worked for you. The output will look something like:
    TOTAL Words: 773 Correct: 669 Errors: 121
    TOTAL Percent correct = 86.55% Error = 15.65% Accuracy = 84.35%
    TOTAL Insertions: 17 Deletions: 11 Substitutions: 93
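The numbers in this report are related by simple arithmetic, which the following sketch reproduces from the word, insertion, deletion and substitution counts. The function name is hypothetical. Note that insertions count as errors but do not reduce the number of correct words, which is why accuracy is lower than percent correct.

```python
def summarize(words, insertions, deletions, substitutions):
    """Recompute the summary figures from the raw alignment counts."""
    correct = words - deletions - substitutions
    errors = insertions + deletions + substitutions
    error_percent = 100.0 * errors / words
    return {
        "correct": correct,
        "errors": errors,
        "percent_correct": round(100.0 * correct / words, 2),
        "error": round(error_percent, 2),
        "accuracy": round(100.0 - error_percent, 2),
    }


if __name__ == "__main__":
    print(summarize(words=773, insertions=17, deletions=11, substitutions=93))
```

With the counts from the sample report this gives 669 correct words, 121 errors, 86.55% correct, 15.65% error and 84.35% accuracy.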
To see the decoding speed, check the pocketsphinx logs. They should look like this:
    INFO: batch.c(761): 2484510: 9.09 seconds speech, 0.25 seconds CPU, 0.25 seconds wall
    INFO: batch.c(763): 2484510: 0.03 xRT (CPU), 0.03 xRT (elapsed)
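Here 0.03 xRT is the decoding speed expressed as a real-time factor: processing time divided by audio duration (0.25 seconds of CPU time for 9.09 seconds of speech). Values below 1 mean the decoder runs faster than real time.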
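To track speed across runs, the timing lines can be pulled out of the log with a regular expression. This is a sketch that assumes the log lines have exactly the "seconds speech, seconds CPU, seconds wall" layout shown above; the function name is hypothetical.

```python
import re

# Matches e.g. "9.09 seconds speech, 0.25 seconds CPU, 0.25 seconds wall"
TIMING_RE = re.compile(
    r"(?P<speech>\d+(?:\.\d+)?) seconds speech, "
    r"(?P<cpu>\d+(?:\.\d+)?) seconds CPU, "
    r"(?P<wall>\d+(?:\.\d+)?) seconds wall"
)


def real_time_factors(log_text):
    """Return a list of (cpu_xrt, wall_xrt) pairs, one per timing line found."""
    factors = []
    for match in TIMING_RE.finditer(log_text):
        speech = float(match.group("speech"))
        if speech == 0:
            continue  # avoid division by zero on empty utterances
        cpu = float(match.group("cpu"))
        wall = float(match.group("wall"))
        factors.append((cpu / speech, wall / speech))
    return factors


if __name__ == "__main__":
    line = ("INFO: batch.c(761): 2484510: 9.09 seconds speech, "
            "0.25 seconds CPU, 0.25 seconds wall")
    for cpu_xrt, wall_xrt in real_time_factors(line):
        print(f"{cpu_xrt:.2f} xRT (CPU), {wall_xrt:.2f} xRT (elapsed)")
```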