Before you start developing a speech application, you need to consider several important points. They will define how you implement the application.
Speech technology puts several important limits on how an application can be implemented. For example, as noted above, it is impossible to recognize every known word of the language. You need to consider ways to overcome such limitations. Such ways are known for most types of applications and are described later in the tutorial. To follow them, you sometimes need to rethink how your application will behave and interact with the user.
Although we try to provide important examples, we obviously can't cover everything. There is no utterance verification or speaker identification example yet, though they could be created later. Most algorithms are widely covered in the scientific literature, and some of them are explained later in the tutorial. Moreover, new methods for solving old problems appear every year.
To name several common applications and the way to approach them:
Generic dictation is never really generic. You need to identify the domain you'll recognize, which can be dialogs, readings, meetings, voicemails, or legal or medical transcriptions. If you consider voicemails, note that the language there is far more restricted than general language. It's actually a very small vocabulary with a specialized sequence of terms:
There will be a lot of names and that's a problem, but you'll never find a voicemail about quantum physics and that's a very good thing. The recognizer will use the restrictions you provided with the language model to improve accuracy of the result.
You'll have to build a language model for your domain, but that's not as complicated as you might think. Don't be afraid of vocabulary size either: if you cover the 60k most common words in English, the accuracy will be about the same as with 120k words. For other languages with rich morphology the situation is different, but it is also solvable with morphology-based subwords. You also have to build a post-processing system, an adaptation system and a user-identification system.
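One way to check whether a vocabulary of a given size is enough for your domain is to measure how many words of held-out text fall outside it. The sketch below is a minimal illustration, not part of CMUSphinx: the function names, the whitespace tokenizer and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def top_vocabulary(training_text, size):
    """Return the `size` most frequent words of the training text as a set."""
    counts = Counter(tokenize(training_text))
    return {word for word, _ in counts.most_common(size)}


def oov_rate(vocabulary, heldout_text):
    """Fraction of held-out tokens that are missing from the vocabulary."""
    tokens = tokenize(heldout_text)
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


if __name__ == "__main__":
    # Toy data; in practice use real domain text for both sets.
    train = "please call me back please call john back tomorrow call me"
    heldout = "please call mary back"
    for size in (2, 4, 6):
        print(size, oov_rate(top_vocabulary(train, size), heldout))
```

If the out-of-vocabulary rate stops dropping as the vocabulary grows, adding more words is unlikely to help much, and the remaining misses (often names) need a different treatment.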
For recognition on an embedded processor, there are two options to consider: recognition on the server and recognition on the device. The former is more popular nowadays because it lets you use the power and flexibility of cloud computing.
Language learning will require you to build a framework for tracking incorrect pronunciations. That will include generation of incorrect pronunciations and scoring them.
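Generating incorrect pronunciations can be sketched as substituting phones in the correct pronunciation according to a table of typical learner confusions. The table below is hypothetical and only illustrates the idea; a real one would depend on the learner's native language. Scoring is not shown: it would come from the recognizer comparing the correct pronunciation against these variants.

```python
# Hypothetical confusion table: phones a learner might say instead of the target phone.
CONFUSIONS = {
    "TH": ["S", "T"],
    "IH": ["IY"],
    "W": ["V"],
}


def mispronunciations(phones, confusions=CONFUSIONS):
    """Return every variant of `phones` with exactly one phone substituted."""
    variants = []
    for index, phone in enumerate(phones):
        for wrong in confusions.get(phone, []):
            variants.append(phones[:index] + [wrong] + phones[index + 1:])
    return variants


if __name__ == "__main__":
    # "think" written as a list of phones
    for variant in mispronunciations(["TH", "IH", "NG", "K"]):
        print(" ".join(variant))
```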
For command and control, it was popular for a long time to use a finite state grammar. Unfortunately, we cannot recommend that nowadays. It's much better to employ a medium vocabulary recognizer with a semantic analysis framework on top, to improve the user experience and let users speak more or less natural language. In short, don't build command and control, build intelligent assistants instead.
For intelligent assistants you need not just recognition, but also intent parsing and a knowledge database. For more detail on how to implement this, you can check Lucida powered by OpenEphyra. Dialog systems will require a user feedback framework as well.
Voice search, semantic analysis and translation need to be built on top of the lattices generated by the engine. You need to take lattices with confidence scores and feed them into the upper levels, such as a translation engine.
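To illustrate why upper levels benefit from more than the single best result, the sketch below uses a simplified stand-in for a lattice: a list of slots, each holding alternative words with confidence scores. This structure, the function names and the scores are assumptions for the example and do not reflect the engine's actual lattice format.

```python
def best_path(network):
    """Pick the most confident (word, confidence) pair in every slot."""
    return [max(slot, key=lambda alternative: alternative[1]) for slot in network]


def search_terms(network, threshold):
    """For each slot, keep every word whose confidence reaches the threshold."""
    return [
        [word for word, confidence in slot if confidence >= threshold]
        for slot in network
    ]


if __name__ == "__main__":
    # Made-up alternatives and confidences for one utterance.
    network = [
        [("find", 0.9), ("fine", 0.1)],
        [("flights", 0.6), ("lights", 0.4)],
        [("to", 1.0)],
        [("boston", 0.55), ("austin", 0.45)],
    ]
    print(best_path(network))
    print(search_terms(network, threshold=0.4))
```

A search or translation layer that receives the alternatives can still recover when the top word is wrong but a close competitor is right.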
For open vocabulary recognition, such as recognition of names and places, you will need a subword language model.
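Training a subword language model starts with splitting words into units. The sketch below uses greedy longest-match segmentation against a given unit inventory, falling back to single characters; this is just one simple strategy chosen for illustration, and the inventory shown is made up.

```python
def segment(word, units):
    """Split `word` into the longest matching units, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a unit matches.
        for end in range(len(word), start, -1):
            if word[start:end] in units:
                break
        else:
            end = start + 1  # no unit matched: fall back to a single character
        pieces.append(word[start:end])
        start = end
    return pieces


if __name__ == "__main__":
    units = {"wash", "ing", "ton", "new"}  # hypothetical inventory
    print(segment("washington", units))
    print(segment("newton", units))
```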
Text alignment, such as caption synchronization, will require you to build a specialized language model from the reference text to restrict the search.
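The idea of restricting the search with the reference text can be sketched by collecting, for each word, the words that are allowed to follow it. This is only the counting step behind a bigram model, written as an assumption-laden illustration; a real setup would turn such statistics into a model file in whatever format the decoder expects.

```python
from collections import defaultdict


def bigram_successors(reference_lines):
    """Map each word to the set of words that follow it in the reference text."""
    successors = defaultdict(set)
    for line in reference_lines:
        # Sentence boundary markers let the model know where lines start and end.
        words = ["<s>"] + line.lower().split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            successors[current].add(following)
    return dict(successors)


if __name__ == "__main__":
    captions = ["Hello world", "hello there"]
    for word, following in sorted(bigram_successors(captions).items()):
        print(word, sorted(following))
```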
For most tasks above there are published accuracy results. You can find them once you identify the task. Those results may or may not be accurate enough for your users. You might hope to beat the published figures, but it's unlikely that you will do so quickly.
For example, the broadcast news recognition task is done with a 20-25% error rate. If that's not enough for your application, you probably need to consider modifying the application. You might add a hand-correction step or a preliminary adaptation step to improve accuracy. If the accuracy is still not sufficient after that, it's probably better to ask whether you need speech at all. There are other, more reliable interfaces you could use.
For example, though ASR-based IVR systems are fancy and handy, many people still prefer DTMF systems, web-based forms or just email to contact a company. Remember that you need an effective interface, not a fashionable one.
The next issue to consider is the availability of speech material for training, testing and optimizing the system. You need to find out which resources are available to you.
The test set is a critical issue for any speech recognition application. It should be representative enough both acoustically and in terms of language. But the test set doesn't necessarily have to be large; you can spend 10 minutes creating a good one. It might be a set of sample recordings you make yourself.
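A test set is usually scored by word error rate: the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit distance; the function name and the sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("call me back tomorrow", "call be back"))
```

Note that the result can exceed 1.0 when the hypothesis contains many inserted words.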
For the training set and models, you should check the resources that already exist. The increasing interest in speech technology leads people to contribute models for their native languages. In general, you'll have to collect audio material for the language you target. That is actually not so complicated. Audio books, movies and podcasts provide enough recordings to build a very good acoustic model with little effort.
To build a phonetic dictionary you can use existing TTS synthesizers, which nowadays cover a lot of languages. You can also bootstrap a dictionary by hand and then extend it with machine learning tools.
For language models you'll have to find a lot of text for your domain. It might be textbooks, already transcribed recordings or other sources such as website content crawled from the web.
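Collected text usually needs cleaning before it can be used for language model training. The sketch below lowercases the text, drops punctuation and writes one sentence per line wrapped in `<s>` and `</s>` markers. That layout is an assumption: several n-gram toolkits accept it, but check what your toolkit expects. The regular expressions only handle plain ASCII English.

```python
import re


def normalize_sentence(sentence):
    """Lowercase a sentence and keep only letters, digits, apostrophes and spaces."""
    cleaned = re.sub(r"[^a-z0-9' ]+", " ", sentence.lower())
    return " ".join(cleaned.split())


def to_corpus_lines(raw_text):
    """Split raw text into sentences and wrap each one in <s> ... </s> markers."""
    lines = []
    for sentence in re.split(r"[.!?]+", raw_text):
        normalized = normalize_sentence(sentence)
        if normalized:  # skip fragments that were only punctuation or whitespace
            lines.append(f"<s> {normalized} </s>")
    return lines


if __name__ == "__main__":
    for line in to_corpus_lines("Hi, it's John. Please call me back!"):
        print(line)
```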
The third thing to consider is the set of particular technologies you will build on. Although CMUSphinx tries to provide a more or less complete program suite for developing speech applications, you'll sometimes need to use other packages, programming languages or tools. You need to decide for yourself whether you are going to continue with Java, C or any of the scripting languages CMUSphinx supports. The rule for choosing between sphinx4 and pocketsphinx is the following:
Although people often ask which is more accurate, sphinx4 or pocketsphinx, you shouldn't bother with this question at all. Accuracy is not the argument here. Both sphinx4 and pocketsphinx provide acceptable accuracy, and even then it depends on many factors, not just the engine. The point is that the engine is just one part of a system that should include many more components. If we are talking about a large vocabulary decoder, there must be a diarization framework, an adaptation framework and a postprocessing framework. They all need to cooperate somehow. The flexibility of sphinx4 allows you to build such a system quickly. It's easy to embed sphinx4 into a Flash server like red5 to provide web-based recognition, and it's easy to manage many sphinx4 instances doing large-scale decoding on a cluster.
On the other hand, if your system needs to be efficient and reasonably accurate, if you are running on an embedded device, or if you are interested in using the recognizer with some exotic language like Erlang, pocketsphinx is your choice. It's very hard to integrate Java with languages not supported by the JVM; pocketsphinx is far better here.
Another thing you need to consider is the choice of development platform. If you are bound to a particular one, that's an easy question for you. If you can choose, we highly recommend using GNU/Linux as a development platform. We can help you with Windows or Mac issues, but there are no guarantees, since our main development platform is Linux. For many tasks you'll need to run complex scripts written in Perl or Python. On Windows that might be problematic.