Professor Rudnicky recently updated CMUDict with the latest changes, so we now have version cmudict-0.7b, which you are welcome to check out from Subversion and use in your applications. CMUDict, the phonetic dictionary for US English, has been one of the major components of the CMUSphinx toolkit. It has a long history as a unique resource for English pronunciation and is used by many other speech projects, both commercial and open source.
There are a few things in CMUDict that we would love to improve. These improvements would have a large impact on speech recognition research and on speech recognition technology overall:
Phonemic vs phonetic
CMUDict was originally a phonetic dictionary, as opposed to a phonemic one. It contains approximations of word pronunciations: it describes how a native US speaker would pronounce a word in read speech. Under other conditions, even a native US speaker would pronounce the same words differently. For example,
uniform Y UW1 N AH0 F AO2 R M
with AH0 in the middle, is already a reduced form of
uniform Y UW1 N IH0 F AO2 R M
which the speaker would say when pronouncing the word slowly. On the other hand, the dictionary doesn’t have the form
uniform Y UW1 N AH0 F AH0 R M
which a native speaker would use in fast conversational speech. There are many cases like that. Because of this structure the dictionary is ready to use in a speech recognition system, but it is hard to conduct research on real phonetic reduction in various contexts, simply because the dictionary often doesn’t have the original form. In modern systems, where phonetic reduction is becoming more important, we need more information about it in the dictionary. Hopefully, one day we will be able to collect both the original phonemic representation of each word and its possible phonetic representations.
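To make the comparison between forms concrete, here is a minimal Python sketch that parses dictionary-style entries and reports where two pronunciations of the same word differ. It assumes the usual CMUDict text layout (a headword followed by space-separated phones, alternate pronunciations marked with a suffix like "(1)", comment lines starting with ";;;"). The three sample entries are the forms of "uniform" discussed above; their variant numbering is made up for illustration and does not reflect what the dictionary actually contains.

```python
import re
from collections import defaultdict

# Optional variant suffix such as "(1)" or "(2)" at the end of a headword.
_VARIANT = re.compile(r"\(\d+\)$")


def parse_entries(lines):
    """Group pronunciations by headword: {word: [[phone, ...], ...]}."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        word, *phones = line.split()
        entries[_VARIANT.sub("", word).lower()].append(phones)
    return dict(entries)


def differing_positions(a, b):
    """Return (index, phone_a, phone_b) wherever two equal-length forms differ."""
    if len(a) != len(b):
        raise ValueError("pronunciations differ in length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical numbering; only the phone strings come from the text above.
sample = [
    "uniform Y UW1 N IH0 F AO2 R M",
    "uniform(1) Y UW1 N AH0 F AO2 R M",
    "uniform(2) Y UW1 N AH0 F AH0 R M",
]

slow, read, fast = parse_entries(sample)["uniform"]
print(differing_positions(slow, read))  # [(3, 'IH0', 'AH0')]
print(differing_positions(read, fast))  # [(5, 'AO2', 'AH0')]
```

A dictionary that stored all three forms together would let this kind of comparison run over every word, which is roughly what research on reduction needs.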
Newly appeared words
Words that have appeared in the last few years and are now widely used are often missing from CMUDict: “skype”, “ipad”, “spotify”. There are so many important entries to add. Well, “spotify” has to be added to my spellchecker first. Hopefully we can update the dictionary more quickly. A reasonable estimate of the required size is about 200-500 thousand US English words, so the dictionary has to at least double in size. That’s a lot of work to review.
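One simple way to decide which entries to add first is to count, in some recent text, the words the dictionary does not cover. The sketch below is only an illustration: the tokenizer is a crude regular expression, and the tiny vocabulary and sample sentence are invented stand-ins for the real dictionary and a real corpus.

```python
import re
from collections import Counter


def missing_words(text, vocabulary):
    """Count tokens in `text` that are absent from `vocabulary`, most frequent first."""
    vocab = {w.lower() for w in vocabulary}
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer, an assumption
    counts = Counter(t for t in tokens if t not in vocab)
    return counts.most_common()


# Stand-in data: a real run would load the headwords from the dictionary file.
vocabulary = ["i", "use", "on", "my", "and"]
text = "I use Skype on my iPad and Spotify on my iPad"

for word, count in missing_words(text, vocabulary):
    print(word, count)
```

Sorting by frequency keeps the review effort focused on the words that matter most in practice.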
Word origins and morphology
There are many research projects on modeling word pronunciation automatically. Still, for CMUDict the symbol error rate is about 8%, which results in a word error rate of about 30%. Sadly, these projects often try to model words as black boxes without attempting to add any sense to them. It matters a great deal in what context a word is used and what its origin is: is it a surname, an abbreviation, a geographical term or a scientific term? Such information could greatly improve the quality of the dictionary and the accuracy of the prediction.
There is growing interest in supporting other languages in the CMUSphinx toolkit: Spanish, French, British English. One serious problem is that we still lack a lot of data for them, and the dictionary is one big issue. Hopefully, we will be able to build an initial approximation of the dictionary with rule-based systems for at least some of these languages. Such data would enable research on multi-language and language-independent speech recognizers and would greatly benefit speech recognition toolkits.
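The gap between the two error rates is easier to see with a small calculation. The Python sketch below assumes the common definitions: symbol error rate is the edit distance between predicted and reference phones divided by the number of reference phones, and a word counts as wrong if its predicted pronunciation differs from the reference at all. How the figures quoted above were actually measured is not stated here, so treat this as an illustration of the metrics rather than a reproduction of those numbers. The toy data reuses the pronunciations of "uniform" from earlier in this post.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rates(pairs):
    """Return (symbol_error_rate, word_error_rate) for (reference, predicted) pairs."""
    total_phones = sum(len(ref) for ref, _ in pairs)
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    wrong_words = sum(1 for ref, hyp in pairs if ref != hyp)
    return total_edits / total_phones, wrong_words / len(pairs)


reference = "Y UW1 N AH0 F AO2 R M".split()
pairs = [
    (reference, "Y UW1 N AH0 F AO2 R M".split()),  # exact match
    (reference, "Y UW1 N IH0 F AO2 R M".split()),  # one substitution
    (reference, "Y UW1 N AH0 F AH0 R M".split()),  # one substitution
]

symbol_rate, word_rate = error_rates(pairs)
# 2 edits over 24 reference phones, but 2 of 3 words are wrong.
print(f"symbol error rate: {symbol_rate:.1%}, word error rate: {word_rate:.1%}")
```

Because a single wrong symbol spoils the whole word, a symbol error rate under 10% can still leave a large share of words mispronounced.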
Automatic dictionary acquisition
This is still an emerging technology; however, there have already been some advances in automatic dictionary collection with software by LIUM. One can imagine a tool that scans through audio, learns the words it encounters and generates pronunciations for them to add to the dictionary. Hopefully, such tools are not far in the future of speech recognition.
So there is huge room for improvement in CMUDict alone, which matters for speech recognition research regardless of the toolkit or speech recognition implementation. For that reason it is worth noting that CMUDict is also available on GitHub, so you are welcome to clone the repository, make your changes and submit a pull request. That would be very much appreciated!