Help to distribute CMUSphinx data through Bittorrent

Modern speech recognition algorithms require enormous amounts of data to estimate speech parameters. Speech developers collect audio recordings, transcriptions, texts for language models, pronunciation dictionaries and vocabularies. This may not be the case in the future, when better algorithms might need just a few examples, but today you have to process thousands of hours of recordings to build a speech recognition system.

Estimates show that a human receives thousands of hours of speech data before learning to understand speech. Note that humans also have a prior knowledge structure embedded in the brain that we are not aware of. Google trains its models on 100 thousand hours of audio recordings and petabytes of transcriptions, yet it still falls behind human performance in speech recognition tasks. For search queries Google still has a word error rate of 10%; for YouTube its word error rate is over 40%.
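
The word error rate quoted above is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and a reference transcript, divided by the number of words in the reference. A minimal sketch, assuming whitespace tokenization and no text normalization; the function name is illustrative and not part of the CMUSphinx tools:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "one two three four five six seven eight nine ten"
    hypothesis = "one two three four five six seven eight nine then"
    # One substituted word out of ten reference words.
    print(word_error_rate(reference, hypothesis))
```

With one wrong word in a ten-word reference the function returns 0.1, which corresponds to the 10% figure mentioned for search queries.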

Image credit: mrflip / CC BY-NC-SA 2.0

Google has vast resources, but so do we. We definitely can collect, process and share even more data than Google has. The first step in this direction is to create shared storage for the audio data and CMUSphinx models.

We created a torrent tracker specifically to distribute legal speech data related to CMUSphinx, speech recognition, speech technologies and natural language processing. Thanks to Elias Majic, the tracker is available at

http://cmusphinx.info

Currently the tracker contains torrents for the existing acoustic and language models, but new, more accurate models for US English and other languages will be released soon.

We encourage you to make other speech-related data available through our tracker. Please contact the cmusphinx-devel@lists.sourceforge.net mailing list if you want to add your data set to the tracker.

Please help us distribute the data: start a client on your host and make the data available to others.

To learn more about BitTorrent, visit this link or search the web; there is a vast amount of material about it.

You might wonder what the next step is. Pretty soon we will be able to run a distributed acoustic model training system that trains the acoustic model using vast amounts of distributed data and computing power. With a BOINC grid computing network of CMUSphinx tools, together we will create the most accurate models for speech. Stay tuned.

4 Responses to “Help to distribute CMUSphinx data through Bittorrent”

  1. Khoa says:

    why don’t we build a speech recognition service based on available speech data to collect data ? I mean each client which want to use the service, they must accept to provide their data to improve system :)

  2. admin says:

    > I mean each client which want to use the service, they must accept to provide their data to improve system

    Exactly, this is how it will work soon.

  3. Jsmith404 says:

    will seed it.

  4. Jsmith404 says:

    Oh, I detected one quite bad issue: all torrents have the private flag set. This is the dumbest thing you can do with public data. The tracker could fail to respond. A single failure and I have no download. Hence FAIL. But DHT is a really die-hard thing. It does not care about a single point of failure. It will just flow around the failure point and rejoin/converge when possible. Private torrents are inherently unreliable, hard to download and perform poorly. Remove the private flag and try again.
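
The private flag discussed in the last comment is stored in the torrent metainfo as the integer key `private` inside the `info` dictionary. A value of 1 asks clients to use only the tracker and not DHT or peer exchange. A minimal sketch for checking it, assuming a hand-written bencode decoder and made-up sample data rather than a real torrent from the tracker; the function names are illustrative:

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":  # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":  # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():  # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"invalid bencode data at offset {pos}")


def is_private(torrent_bytes: bytes) -> bool:
    """Return True if the metainfo's info dictionary has private set to 1."""
    meta, _ = bdecode(torrent_bytes)
    return meta.get(b"info", {}).get(b"private") == 1


if __name__ == "__main__":
    # Made-up metainfo fragments, not real torrents.
    private_sample = b"d4:infod4:name4:demo7:privatei1eee"
    public_sample = b"d4:infod4:name4:demoee"
    print(is_private(private_sample), is_private(public_sample))
```

To inspect a downloaded file, read it in binary mode and pass the bytes to `is_private`.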