People frequently want to use Sphinx for phoneme recognition, that is, to convert speech to a stream of phonemes rather than words. This is possible, although the results can be disappointing. The reason is that automatic speech recognition relies heavily on contextual constraints (i.e. language modeling) to guide the search algorithm. The phoneme recognition task is much less constrained than word decoding, and therefore the error rate (even when measured in terms of phoneme error for word decoding) is considerably higher. For mostly the same reason, phoneme decoding is quite slow.
That said, even very inaccurate phoneme decoding can be helpful for diverse tasks including pronunciation modeling, speaker identification, and voice conversion.
Currently Sphinx3 is the only Sphinx decoder that is able to do phoneme recognition. Setting up Sphinx3 for phoneme recognition is very nearly the same as setting it up for word recognition, with the exception of the extra flag -mode allphone. First, you need to create a dictionary, which is simply a list mapping each phoneme you wish to recognize to itself (this is currently required for no good reason). For the English phoneset it should look like this:
AA AA
AE AE
AH AH
AO AO
AW AW
AY AY
B B
CH CH
D D
DH DH
EH EH
.....
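Since every entry just repeats the phone, the dictionary can be generated rather than typed by hand. The following is a minimal sketch, assuming you have the phones available one per line; the function name make_phone_dict is made up for this example and is not part of Sphinx3.

```shell
# Hypothetical helper (not part of Sphinx3): read one phone per line
# on stdin and print "PHONE PHONE" dictionary entries on stdout.
# Blank lines are skipped.
make_phone_dict() {
    awk 'NF { print $1, $1 }'
}

# Example with the first few phones of the English phoneset shown above.
printf '%s\n' AA AE AH AO AW AY | make_phone_dict
```

Redirecting the output to a file such as phone.dict gives a dictionary in the format shown above.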
Next, you need a language model for the phoneset, which is built from phonetic transcriptions. You can take a text and convert it to phonetic strings using the phonetic dictionary provided with the voice: just replace each word with its corresponding transcription. Since the number of phones is small, the text doesn't need to be big either; a single book will do. If you have training data, you can use forced alignment to get a transcription with the dictionary variants, which makes the phonetic transcription more precise. Then you can build a language model from the phonetic transcription using any language model building tool, such as cmuclmtk. The resulting model looks like this:
\data\
ngram 1=35
ngram 2=340
ngram 3=1202

\1-grams:
-99.0000 <UNK> 0.0000
-1.8779 AA -2.3681
-3.2104 AE -1.1361
-1.4280 AH -2.4071
-1.9864 AO -2.2929
-2.4635 AW -1.8166
-1.5254 AY -2.3892
..........
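The word-to-transcription replacement described above can be sketched with awk. This is only an illustration under stated assumptions: the dictionary is assumed to have one "WORD PHONE PHONE ..." entry per line, with alternate pronunciations marked as WORD(2), WORD(3) and so on. The function name words_to_phones is hypothetical. Words missing from the dictionary are silently dropped, which you may not want for real training text.

```shell
# Hypothetical helper (not part of Sphinx3 or cmuclmtk).
# Usage: words_to_phones DICT < text
# Assumes DICT lines look like "HELLO HH AH L OW". Alternate entries
# such as "HELLO(2)" are skipped, so the first pronunciation is used.
words_to_phones() {
    awk '
        NR == FNR {                          # first input: the dictionary
            if ($1 !~ /\([0-9]+\)$/ && !($1 in pron)) {
                word = $1
                sub(/^[^ \t]+[ \t]+/, "")    # drop the word, keep the phones
                pron[word] = $0
            }
            next
        }
        {                                    # second input: the text on stdin
            line = ""
            for (i = 1; i <= NF; i++) {
                w = toupper($i)
                gsub(/^[^A-Z]+|[^A-Z]+$/, "", w)   # trim surrounding punctuation
                if (w in pron)
                    line = line (line == "" ? "" : " ") pron[w]
            }
            if (line != "") print line
        }
    ' "$1" -
}
```

Each line of input text becomes one line of phones, which can then be passed to the language model building tool of your choice.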
Finally, you will need a filler dictionary, which simply lists all of the noise (filler) phones in the acoustic model, like this:
+BREATH+ +BREATH+
+NOISE+ +NOISE+
+COUGH+ +COUGH+
+GARBAGE+ +GARBAGE+
+SMACK+ +SMACK+
+UH+ +UH+
+UM+ +UM+
+UHUM+ +UHUM+
You can download these dictionaries and a language model which will work with the acoustic model included with Sphinx3 from these locations:
Now you can run this in batch mode on some example data. We'll use this file from the AN4 corpus as an example:
Download all of the files above into a directory and change your working directory to that one. Now you will create a “control” file which tells the batch mode recognizer which files to recognize. This is a text file with the filename of each input (minus the extension) on each line, like this:
If you like you can download that from:
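If you have many input files, the control file can also be generated from a directory listing. Here is a minimal sketch; the function name make_ctl is made up for this example, and it assumes that every file with the given extension in the directory should be decoded.

```shell
# Hypothetical helper (not part of Sphinx3).
# Usage: make_ctl DIR EXT
# Prints the name of every file in DIR ending in EXT, with the
# extension removed, one per line.
make_ctl() {
    dir=$1
    ext=$2
    for f in "$dir"/*"$ext"; do
        [ -e "$f" ] || continue      # no matches: the glob stays literal
        basename "$f" "$ext"
    done
}
```

For example, running make_ctl . .sph > test.ctl in the directory holding your .sph files would write one entry per file to test.ctl.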
Now, run the sphinx3_decode program from where you compiled it. In the command below, $SPHINX3 stands for the directory where your Sphinx3 sources are. You can set it by changing to that directory in the shell and setting the SPHINX3 variable to the current directory. The full command is:
$SPHINX3/sphinx3_decode \
    -mode allphone \
    -ctl test.ctl \
    -cepdir . -cepext .sph \
    -adcin yes -adchdr 1024 \
    -hmm $SPHINX3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd \
    -lm interp_nodx.arpa.DMP -dict phone.dict -fdict filler.dict
You should see a bunch of debugging output followed by a line that looks like this:
FWDVIT: SIL IH N EH V AH N SIL T AH W AH N T IY S EH D EH N D IH K IH T K AE M P (cen8-fcaw-b)
That is your decoding result! If you want to decode many files in batch mode and record the results to a file, you can use the -hyp option to Sphinx3. For more help on the available options, simply run sphinx3_decode with no arguments.
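If all you have is a saved copy of the decoder's log output, the result lines can also be pulled out with standard tools. The sketch below assumes that every result line has exactly the shape of the FWDVIT example above (the prefix, the phones, then the utterance ID in parentheses at the end of the line). The function name extract_fwdvit and the log file name in the usage note are hypothetical.

```shell
# Hypothetical helper (not part of Sphinx3): read decoder log text on
# stdin and print "UTTID PHONE PHONE ..." for each line shaped like
#   FWDVIT: SIL IH N ... (cen8-fcaw-b)
# All other lines are ignored.
extract_fwdvit() {
    sed -n 's/^FWDVIT: \(.*\) (\([^()]*\))$/\2 \1/p'
}
```

For example, extract_fwdvit < decode.log would print one line per decoded utterance, starting with its ID.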