This page describes how to do simple acoustic model adaptation to improve speech recognition in your configuration. Note that adaptation doesn't necessarily adapt to a particular speaker; it simply improves the fit between the adaptation data and the model. For example, you can adapt to your own voice to improve dictation, but you can also adapt to your particular recording environment, your audio transmission channel, or your accent or the accent of your users. You can take a model trained on clean broadcast data and adapt it with telephone data to produce a telephone acoustic model. Cross-language adaptation also makes sense: for example, you can adapt an English model to the sounds of another language by creating a phoneset map and building a dictionary for the other language with the English phoneset.
The adaptation process takes transcribed data and improves the model you already have. It is more robust than training from scratch and can lead to good results even if your adaptation data is small. For example, five minutes of speech is enough to significantly improve dictation accuracy by adapting to a particular speaker.
The methods of adaptation are a bit different between PocketSphinx and Sphinx4 due to the different types of acoustic models used. For more technical information on that see AcousticModelTypes.
The first thing you need to do is create a corpus of adaptation data. It will consist of a list of sentences, a dictionary describing the pronunciation of all the words in that list, and a recording of you speaking each of those sentences.
The actual set of sentences you use is somewhat arbitrary, but ideally it should have good coverage of the most frequently used words or phonemes in the set of sentences or the type of text you want to recognize. We have had good results simply using sentences from the CMU ARCTIC text-to-speech databases. To that effect, here are the first 20 sentences from ARCTIC, a control file, a transcription file, and a dictionary for them:
The sections below will refer to these files, so it would be a good idea to download them now. You should also make sure that you have downloaded and compiled SphinxBase and SphinxTrain.
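If you build the corpus yourself instead of downloading the ARCTIC files, the control (fileids) and transcription files follow a simple line-oriented format. Here is a sketch using the first two ARCTIC sentences (whether `<s>`/`</s>` markers are required around each transcript depends on your SphinxTrain version, so match the format of the downloaded arctic20.transcription):

```shell
# arctic20.fileids: one recording name per line, without the .wav extension
cat > arctic20.fileids <<'EOF'
arctic_0001
arctic_0002
EOF

# arctic20.transcription: each sentence followed by its file id in parentheses
cat > arctic20.transcription <<'EOF'
author of the danger trail philip steels etc (arctic_0001)
not at this particular case tom apologized whittemore (arctic_0002)
EOF
```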
In case you are adapting to a single speaker, you can record the adaptation data yourself. This is unfortunately a bit more complicated than it ought to be. Basically, you need to record a single audio file for each sentence in the adaptation corpus, naming the files according to the names listed in
arctic20.fileids. In addition, make sure that you record at a sampling rate of 16 kHz (or 8 kHz if you adapt a telephone model), 16-bit, in mono (single channel).
If you are at a Linux command line, you can accomplish this in very nerdy style with the following
bash one-liner from the directory in which you downloaded
Since we are redirecting the output to /dev/null in the one-liner, you should first verify that you have the
sox package, and if not, install it using this command.
sudo apt-get install sox
Now, the one-liner is as follows
for i in `seq 1 20`; do fn=`printf arctic_%04d $i`; read sent; echo $sent; rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null; done < arctic20.txt
This will echo each sentence to the screen and start recording immediately. Hit Control-C to move on to the next sentence. You should see the following files in the current directory afterwards:
arctic_0001.wav arctic_0002.wav ..... arctic_0019.wav arctic20.dic arctic20.fileids arctic20.transcription arctic20.txt
If you hit Control-C immediately after you finish speaking a sentence, chances are that the recording truncated the last word. You should verify that these recordings sound okay. To do this, play them back with:
for i in *.wav; do play $i; done
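Beyond listening, you can also confirm that each file really was recorded at 16 kHz mono. If sox is installed, soxi arctic_0001.wav will show this directly; the sketch below does the same check with coreutils only. It assumes a canonical RIFF/WAVE header, and builds a tiny 28-byte sample header so the example is self-contained (normally you would point it at a real recording such as arctic_0001.wav):

```shell
# Build a minimal 16 kHz mono WAV header for demonstration (not a playable file)
printf 'RIFF\044\000\000\000WAVEfmt \020\000\000\000\001\000\001\000\200\076\000\000' > sample.wav

# In a canonical RIFF/WAVE header the channel count is a 2-byte little-endian
# integer at offset 22 and the sample rate a 4-byte little-endian integer at 24
chans=$(od -An -tu2 -j22 -N2 sample.wav | tr -d ' ')
rate=$(od -An -tu4 -j24 -N4 sample.wav | tr -d ' ')
echo "channels=$chans rate=$rate"   # channels=1 rate=16000
```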
If you are adapting to a channel, an accent, or some other generic property of the audio, you will need to collect somewhat more recordings manually. For example, in a call center you can record and transcribe a hundred calls and use them to improve recognizer accuracy by means of adaptation.
First we will copy the default acoustic model from PocketSphinx into the current directory in order to work on it. Assuming that you installed PocketSphinx under
/usr/local, the acoustic model directory is
/usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k. Copy this directory to your working directory:
cp -a /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k .
In order to run the adaptation tools, you must generate a set of acoustic model feature files from these WAV audio recordings. This can be done with the
sphinx_fe tool from SphinxBase. It is imperative that you make sure you are using the same acoustic parameters to extract these features as were used to train the standard acoustic model. Since PocketSphinx 0.4, these are stored in a file called
feat.params in the acoustic model directory. You can simply add it to the command line for
sphinx_fe, like this:
sphinx_fe -argfile hub4wsj_sc_8k/feat.params \
        -samprate 16000 -c arctic20.fileids \
        -di . -do . -ei wav -eo mfc -mswav yes
If you are using Sphinx4 and the model doesn't have a feat.params file, just omit the -argfile parameter to use the default settings.
You should now have the following files in your working directory:
arctic_0001.mfc arctic_0001.wav arctic_0002.mfc arctic_0002.wav arctic_0003.mfc arctic_0003.wav ..... arctic_0020.wav arctic20.dic arctic20.fileids arctic20.transcription arctic20.txt
Some models don't include all the files needed for adaptation. In particular, an extra mixture_weights file was left out of the PocketSphinx distribution in order to save space. You can download it from the code repository in the package called pocketsphinx-extra, from the folder pocketsphinx-extra/model/hmm/en_US/hub4_wsj_sc_3s_8k.cd_semi_5000, or check it out from subversion. Copy the mixture_weights file to your acoustic model folder.
Sometimes the sendump file can be converted back into a mixture_weights file; this is only possible for older sendump files. If you have installed the SphinxTrain Python modules, you can use SphinxTrain/python/cmusphinx/sendump.py to convert the sendump file from the acoustic model into a mixture_weights file. This does not work for the hub4_wsj acoustic model.
You will also need to convert the mdef file from the acoustic model to the plain text format used by the SphinxTrain tools. To do this, use the pocketsphinx_mdef_convert program:
pocketsphinx_mdef_convert -text hub4wsj_sc_8k/mdef hub4wsj_sc_8k/mdef.txt
The next step in adaptation is to collect statistics from the adaptation data. This is done using the
bw program from SphinxTrain. You should be able to find
bw tool in a sphinxtrain installation in the folder
/usr/local/libexec/sphinxtrain (or under another prefix on Linux) or in
bin\Release (in the sphinxtrain directory on Windows). Copy it to the working directory along with the map_adapt, mllr_solve and mk_s2sendump programs.
Now, to collect statistics, run:
./bw \
 -hmmdir hub4wsj_sc_8k \
 -moddeffn hub4wsj_sc_8k/mdef.txt \
 -ts2cbfn .semi. \
 -feat 1s_c_d_dd \
 -svspec 0-12/13-25/26-38 \
 -cmn current \
 -agc none \
 -dictfn arctic20.dic \
 -ctlfn arctic20.fileids \
 -lsnfn arctic20.transcription \
 -accumdir .
Make sure the arguments to the bw command match the parameters in the feat.params file inside the acoustic model folder. Note that not all the parameters from feat.params are supported by bw, only a few of them; for example, bw doesn't support upperf or other feature-extraction parameters. Use only the parameters bw accepts and skip the rest of feat.params.
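For reference, a semi-continuous feat.params might look something like the sketch below (illustrative values, not the authoritative contents of the hub4wsj model):

```
-lowerf 130
-upperf 6800
-nfilt 40
-transform dct
-lifter 22
-feat 1s_c_d_dd
-svspec 0-12/13-25/26-38
-agc none
-cmn current
```

Of these, options like -feat, -svspec, -agc and -cmn carry over to bw, while -lowerf, -upperf, -nfilt, -transform and -lifter belong to feature extraction (sphinx_fe) and must be skipped.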
For example, for most continuous models (like the ones used by Sphinx4) you don't need to include the svspec option; instead, just use -ts2cbfn .cont. For PTM models use -ts2cbfn .ptm.
If the model is missing the noisedict file, you also need an extra step: copy the fillerdict file into the directory that you chose in the hmmdir parameter, renaming it to noisedict.
MLLR transforms are supported by both PocketSphinx and Sphinx4. MLLR is a cheap adaptation method that is suitable when the amount of adaptation data is limited, which makes it a good choice for online adaptation. MLLR works best for continuous models; its effect on semi-continuous models is very limited, since semi-continuous models mostly rely on mixture weights. If you want the best accuracy, you can combine MLLR adaptation with the MAP adaptation described below.
Next we will generate an MLLR transformation which we will pass to the decoder to adapt the acoustic model at run-time. This is done with the mllr_solve program:
./mllr_solve \
 -meanfn hub4wsj_sc_8k/means \
 -varfn hub4wsj_sc_8k/variances \
 -outmllrfn mllr_matrix \
 -accumdir .
This command will create an adaptation data file called
mllr_matrix. Now, if you wish to decode with the adapted model, simply add
-mllr mllr_matrix (or whatever the path to the mllr_matrix file you created is) to your pocketsphinx command line.
MAP is a different adaptation method. Unlike MLLR, it does not create a generic transform but instead updates each parameter in the model. We will now copy the acoustic model directory and overwrite the copy with the adapted model files:
cp -a hub4wsj_sc_8k hub4wsj_sc_8kadapt
To do the adaptation, use the map_adapt program:
./map_adapt \
 -meanfn hub4wsj_sc_8k/means \
 -varfn hub4wsj_sc_8k/variances \
 -mixwfn hub4wsj_sc_8k/mixture_weights \
 -tmatfn hub4wsj_sc_8k/transition_matrices \
 -accumdir . \
 -mapmeanfn hub4wsj_sc_8kadapt/means \
 -mapvarfn hub4wsj_sc_8kadapt/variances \
 -mapmixwfn hub4wsj_sc_8kadapt/mixture_weights \
 -maptmatfn hub4wsj_sc_8kadapt/transition_matrices
If you want to save space, you can use the sendump file supported by PocketSphinx; for Sphinx4 you don't need it. To recreate the sendump file from the updated mixture_weights file:
./mk_s2sendump \
 -pocketsphinx yes \
 -moddeffn hub4wsj_sc_8kadapt/mdef.txt \
 -mixwfn hub4wsj_sc_8kadapt/mixture_weights \
 -sendumpfn hub4wsj_sc_8kadapt/sendump
Congratulations! You now have an adapted acoustic model. You can delete the file hub4wsj_sc_8kadapt/mdef.txt to save space if you like, because it is not used by the decoder.
For Sphinx4, the adaptation process is the same as for PocketSphinx, except that the model is continuous, so you need to start from a continuous Sphinx4 model. Continuous models often ship with a text mdef file, so you don't need to unpack the binary mdef and pack it back. Also, when running bw, the model type is .cont. rather than .semi., and the feature type is usually 1s_c_d_dd. The rest is the same: bw plus map_adapt will do the work.
After you have done the adaptation, it's critical to test the adaptation quality. To do that you need to set up a test database similar to the one used for adaptation and configure the decoder with the required parameters; in particular, you need a language model
<your.lm>. For more details see Building Language Model.
Create the fileids file adaptation-test.fileids:

test1
test2
Create the transcription file adaptation-test.transcription:

some text (test1)
some text (test2)
Put the audio files in
wav folder. Make sure those files have the proper format and sample rate.
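Putting the pieces together, the test-set layout from the steps above can be created like this (the utterance ids test1 and test2 and the transcript text are just placeholders for your own data):

```shell
mkdir -p wav

# One utterance id per line, matching the wav file names without extension
cat > adaptation-test.fileids <<'EOF'
test1
test2
EOF

# The reference transcript for each utterance, with its id in parentheses
cat > adaptation-test.transcription <<'EOF'
some text (test1)
some text (test2)
EOF

# Now copy test1.wav and test2.wav (16 kHz, 16-bit, mono) into wav/
```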
You can also use adaptation data for testing, but it's recommended to create a separate test set. Now, let's run the decoder:
pocketsphinx_batch \
 -adcin yes \
 -cepdir wav \
 -cepext .wav \
 -ctl adaptation-test.fileids \
 -lm <your.lm> \
 -dict <your.dic, for example arctic20.dic> \
 -hmm <your_new_adapted_model, for example hub4wsj_sc_8kadapt> \
 -hyp adaptation-test.hyp

word_align.pl adaptation-test.transcription adaptation-test.hyp
Make sure to add -samprate 8000 to the above command if you are decoding 8 kHz files!
word_align.pl from SphinxTrain will report the exact error rate, which you can use to decide whether adaptation worked for you. The output will look something like this:
TOTAL Words: 773 Correct: 669 Errors: 121
TOTAL Percent correct = 86.55% Error = 15.65% Accuracy = 84.35%
TOTAL Insertions: 17 Deletions: 11 Substitutions: 93
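The error rate reported by word_align.pl is simply (substitutions + deletions + insertions) divided by the number of reference words. You can check the totals from a report like the one above yourself:

```shell
# WER = (S + D + I) / N, using example totals:
# N = 773 reference words, S = 93 substitutions, D = 11 deletions, I = 17 insertions
awk 'BEGIN { n = 773; s = 93; d = 11; i = 17;
             printf "WER = %.2f%%\n", 100 * (s + d + i) / n }'
# prints: WER = 15.65%
```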
You can run the decoder on both the original acoustic model and the new acoustic model to estimate the improvement.
After adaptation, the acoustic model is located in the folder hub4wsj_sc_8kadapt. You need only that folder. Depending on the type of model you trained, it should have the following files:

mdef feat.params mixture_weights means noisedict transition_matrices variances
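As a quick sanity check, you can verify that the adapted model directory contains all of the expected files (the directory name hub4wsj_sc_8kadapt follows the earlier steps; adjust the file list to your model type):

```shell
# Report any file the decoder needs that is missing from the adapted model
model=hub4wsj_sc_8kadapt
for f in mdef feat.params mixture_weights means noisedict transition_matrices variances; do
    [ -e "$model/$f" ] || echo "missing: $model/$f"
done
```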
To use the model in PocketSphinx, simply put the model files into the resources of your application and point to it with the -hmm option:

pocketsphinx_continuous -hmm <your_new_model_folder> -lm <your_lm> -dict <your_dict>

In your own code, pass the -hmm engine configuration option through the cmd_ln_init function. Alternatively, you can replace the old model files with the new ones.
To use the trained model in sphinx4, you need to update the model location in the config file. Read the documentation on Using SphinxTrain models in sphinx4.
Now test your accuracy to make sure it is good.
From just a few sentences of adaptation data you should get about a 10% relative WER improvement.
A typical question is whether you need more or better training data, whether you are doing the adaptation correctly, whether your language model is the problem, or whether there is something intrinsically wrong with your configuration. Most likely you just ignored error messages that were printed during the process. To get more definite advice, you need to provide more information and give access to your experiment files.
We hope the adapted model gives you acceptable results. If not, try to improve your adaptation process: