There are two types of models that describe language: grammars and statistical language models. Grammars describe very simple types of languages for command and control, and they are usually written by hand or generated automatically with scripting code. Grammars usually do not have probabilities for word sequences, but some elements might be weighted. Grammars can be written in JSGF format and usually have an extension such as .gram or .jsgf.
Statistical language models describe more complex language. They contain probabilities of words and word combinations. There are many ways to build statistical language models. When your data set is large, it makes sense to use the CMU language modeling toolkit. When the model is small, you can use a quick online web service. When you need specific options, or you just want to use your favorite toolkit, you can use any toolkit that builds ARPA models.
A language model can be stored and loaded in three different formats: text ARPA format, binary BIN format and binary DMP format. The ARPA format takes more space, but it can be edited. ARPA files have the .lm extension. The binary format takes significantly less space and is faster to load. Binary files have the .lm.bin extension. It is also possible to convert between formats. The DMP format is obsolete and not recommended.
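Because an ARPA file is plain text, it can be inspected with a few lines of code. The sketch below assumes the usual ARPA layout, in which a \data\ marker is followed by lines such as "ngram 1=4" that give the number of n-grams of each order. The function name read_ngram_counts and the tiny sample model are made up for illustration and are not part of any Sphinx tool.

```python
import re

def read_ngram_counts(arpa_text):
    """Return {order: count} parsed from the header of ARPA-format text."""
    counts = {}
    in_header = False
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The first section marker (for example \1-grams:) ends the header.
            break
    return counts

# Tiny hand-written sample with illustrative values, not a real model.
SAMPLE_ARPA = """\\data\\
ngram 1=4
ngram 2=3

\\1-grams:
-0.6990 </s>
-99.0000 <s> -0.3010
-0.6990 hello -0.3010
-0.6990 world -0.3010

\\2-grams:
-0.3010 <s> hello
-0.3010 hello world
-0.3010 world </s>

\\end\\
"""

print(read_ngram_counts(SAMPLE_ARPA))
```

The same function can be pointed at the text of a real .lm file to check quickly how many unigrams, bigrams and trigrams it contains.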
Grammars are usually written manually in JSGF format:
#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */

grammar hello;

public <greet> = (good morning | hello) ( bhiksha | evandro | paul | philip | rita | will );
For more information on the JSGF format, see the full documentation on W3C.
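To get a feel for what the Hello World grammar accepts, it can help to list its phrases. The sketch below does not parse JSGF; it simply copies the two alternative groups of the <greet> rule by hand and combines them. The names GREETINGS, NAMES and expand_greet are illustrative only.

```python
from itertools import product

# The two alternative groups of the <greet> rule, copied by hand.
GREETINGS = ["good morning", "hello"]
NAMES = ["bhiksha", "evandro", "paul", "philip", "rita", "will"]

def expand_greet():
    """List every phrase matched by <greet>: one greeting followed by one name."""
    return [f"{greeting} {name}" for greeting, name in product(GREETINGS, NAMES)]

phrases = expand_greet()
print(len(phrases))  # 2 greetings x 6 names
for phrase in phrases:
    print(phrase)
```

A grammar of this size accepts only a handful of fixed phrases, which is why grammars suit command and control tasks.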
You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.
First of all, you need to clean up the text: expand abbreviations, convert numbers to words, and remove non-word items. For example, to clean a Wikipedia XML dump you can use special Python scripts. To clean HTML pages you can try boilerpipe (http://code.google.com/p/boilerpipe/), a nice package created specifically to extract text from HTML.
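As a rough illustration of the cleanup step, the Python sketch below lowercases a line, expands a few abbreviations, spells out numbers below 100 and drops everything that is not a word. The abbreviation table, the function names and the handling of numbers are assumptions made for this example; real corpora need far larger tables and language-specific rules.

```python
import re

# Small illustrative table; a real one would be much larger.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "approx.": "approximately"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def normalize(line):
    """Return a lowercase, words-only version of one line of text."""
    words = []
    for token in line.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        # Strip punctuation, keeping letters, digits and apostrophes.
        token = re.sub(r"[^a-z0-9']", "", token)
        if token.isdigit() and int(token) < 100:
            words.append(number_to_words(int(token)))
        elif token.replace("'", "").isalpha():
            words.append(token)
        # Anything else (larger numbers, stray symbols) is dropped.
    return " ".join(words)

print(normalize("Dr. Smith lives at 42 Main St."))
```

Dropping tokens the script does not understand is the simplest policy. Depending on the corpus, it may be better to log them and extend the tables instead.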
For an example of how to create a language model from Wikipedia texts, please see
Once you have gone through the language modeling process, please submit your language model to the CMUSphinx project; we'd be glad to share it!
Language modeling for Mandarin is largely the same as for English, with one additional consideration: the input text must be word segmented. A segmentation tool and an associated word list are provided to accomplish this.
The process for creating a language model is as follows:
1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts.
The result should be the set of sentences that are bounded by the start and end sentence markers: <s> and </s>. Here's an example:
<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>
More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.
2) Generate the vocabulary file. This is a list of all the words in the file:
text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab
3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript. The commands in step 5 assume the edited vocabulary is saved as weather.vocab.
4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file. The commands in step 5 assume the filtered transcript is saved as weather.closed.txt.
5) Generate the ARPA format language model with the commands:
text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
idngram2lm -vocab_type 0 -idngram weather.idngram -vocab weather.vocab -arpa weather.lm
6) Generate the CMU binary form (BIN):
sphinx_lm_convert -i weather.lm -o weather.lm.bin
The CMUCLMTK tools and commands are documented at The CMU-Cambridge Language Modeling Toolkit page.
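Steps 1 to 4 above leave the text handling to you. The Python sketch below shows one possible way to add sentence markers, list the vocabulary and drop sentences that contain out-of-vocabulary words. It does not call or reproduce the CMUCLMTK tools; the function names and the sample sentences are assumptions made for this example.

```python
from collections import Counter

MARKERS = {"<s>", "</s>"}

def add_sentence_markers(sentences):
    """Wrap each non-empty sentence in <s> ... </s> (step 1)."""
    marked = []
    for sentence in sentences:
        words = sentence.split()
        if words:
            marked.append("<s> " + " ".join(words) + " </s>")
    return marked

def build_vocabulary(marked_sentences, min_count=1):
    """Return the sorted words seen at least min_count times (step 2).

    Sentence markers are left out of the returned list; that is a
    choice made for this sketch.
    """
    counts = Counter(
        word
        for sentence in marked_sentences
        for word in sentence.split()
        if word not in MARKERS
    )
    return sorted(word for word, count in counts.items() if count >= min_count)

def keep_closed_vocabulary(marked_sentences, vocabulary):
    """Keep only sentences whose words are all in the vocabulary (step 4)."""
    allowed = set(vocabulary) | MARKERS
    return [s for s in marked_sentences
            if all(word in allowed for word in s.split())]

raw = ["open browser", "open music player", "", "open brwser"]
marked = add_sentence_markers(raw)
vocabulary = build_vocabulary(marked)
vocabulary.remove("brwser")  # step 3: drop a misspelling by hand
closed = keep_closed_vocabulary(marked, vocabulary)
print("\n".join(closed))
```

In this toy run the misspelled sentence is simply discarded. As step 3 suggests, fixing the misspelling in the transcript would keep that sentence in the training data.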
You can also use any other toolkit that generates ARPA text files.
Then you can convert the model to binary format and use it as usual.
Some toolkits you can try:
If you are training a large vocabulary speech recognition system, language model training is outlined on a separate page, Building a large scale language model for domain-specific transcription.
If your language is English and your text is small, it is sometimes more convenient to use a web service to build the model. Language models built in this way are quite functional for simple command and control tasks. First of all, you need to create a corpus.
The “corpus” is just a list of sentences that you will use to train the language model. As an example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:
open browser
new e-mail
forward
backward
next window
last window
open music player
Then go to the page http://www.speech.cs.cmu.edu/tools/lmtool-new.html. Simply click on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE KNOWLEDGE BASE”.
The legacy version is also still available online: http://www.speech.cs.cmu.edu/tools/lmtool.html
You should see a page with some status messages, followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled “Dictionary” and “Language Model”. Download these files and make a note of their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly created language model with PocketSphinx.
To load large models quickly, you will probably want to convert them to binary format, which reduces decoder initialization time. This is not necessary for small models. Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4 detects the format automatically from the extension of the language model file.
The ARPA and binary formats are mutually convertible. You can produce one from the other with the sphinx_lm_convert command from sphinxbase:
sphinx_lm_convert -i model.lm -o model.lm.bin
sphinx_lm_convert -i model.lm.bin -ifmt bin -o model.lm -ofmt arpa
You can also convert old DMP models to bin format this way.
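If you convert models from a script, the commands above can be assembled programmatically. The following Python sketch builds the sphinx_lm_convert argument list using only the flags shown above (-i, -o, -ifmt, -ofmt) and guesses the format from the file name. The helper names are hypothetical, the sketch only distinguishes .lm.bin from everything else (DMP files are not handled), and actually running it requires sphinx_lm_convert to be installed and on your PATH.

```python
import subprocess

def guess_format(path):
    """Guess the model format from the file name: 'bin' for .lm.bin, else 'arpa'."""
    return "bin" if path.endswith(".lm.bin") else "arpa"

def convert_command(src, dst):
    """Build the sphinx_lm_convert argument list for converting src into dst."""
    return [
        "sphinx_lm_convert",
        "-i", src, "-ifmt", guess_format(src),
        "-o", dst, "-ofmt", guess_format(dst),
    ]

def convert(src, dst):
    """Run the conversion; raises CalledProcessError if the tool fails."""
    subprocess.run(convert_command(src, dst), check=True)

# Print the command instead of running it, so the sketch works without the tool.
print(" ".join(convert_command("model.lm", "model.lm.bin")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.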
This section will show you how to use, test, and improve the language model you created.
If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous which can be run from the command line to recognize speech. Assuming it is installed under /usr/local, and your dictionary and language model are called 8521.dic and 8521.lm and placed in the current folder, try running the following command:
pocketsphinx_continuous -inmic yes -lm 8521.lm -dict 8521.dic
This will use your new language model, the dictionary and the default acoustic model. On Windows you also have to specify the acoustic model folder with the -hmm option:
bin/Release/pocketsphinx_continuous.exe -inmic yes -lm 8521.lm -dict 8521.dic -hmm model/en-us/en-us
You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now you can try speaking some of the commands. It should be able to recognize them with complete accuracy. If not, you may have problems with your microphone or sound card.
In the Sphinx4 high-level API, you need to specify the location of the language model in Configuration:
If the model is in resources, you can reference it with a resource: URL
See the Sphinx4 tutorial for details.