Help to distribute CMUSphinx data through Bittorrent


Modern speech recognition algorithms require enormous amount of data to estimate speech parameters. Audio recordings, transcriptions, texts for langauge model, pronuncation dictionaries and vocabularies are collected by speech developers. While it's not necessary to be the case in the future and better algorithms might require just a few examples, now you need to process thousands of hours of recordings to build a speech recognition system.

Estimates show that human recieves thousands hours of speech data before it learns to understand speech. Note that human has prior knowledge structure embedded into the brain we are not aware of. Google trains their models on 100 thousands hours of audio recorings and petabytes of transcriptions, still it behind the human performance in speech recognition tasks. For search queries they still have word error rate of 10%, for youtube Google's word error rate is over 40%.

mrflip/CC BY-NC-SA 2.0

While Google has a vast of resources so we do. We definitely can collect, process and share even more data than Google has. The first step in this direction is to create a shared storage for the audio data and CMUSphinx models.

We created a torrent tracker specifically to distribute a legal speech data related to CMUSphinx, speech recognition, speech technologies and natural language processing. Thanks to Elias Majic, the tracker is available at

http://cmusphinx.info

Currently tracker contains torrents for the existing acoustic and language models but new more accurate models for US English and other languages will be released soon.

We encourage you to make other speech-related data available through our tracker. Please contact cmusphinx-devel@lists.sourceforge.net mailing list if you want to add your data set to the tracker.

Please help us to distribute the data, start a client on your host and make the data available to others.

To learn more about BitTorrent visit this link or search in the web, there is a vast amount of resources about it.

You might wonder what is the next step. Pretty soon we will be able to run a distributed acoustic model training system to train the acoustic model using vast amount of distributed data and computing power. With a BOINC-grid computation network of CMUSphinx tools we together will create the most accurate models for speech. Stay tuned.