Long Audio Aligner Landed in Trunk

After three years of development we have finally merged an aligner for long audio files into trunk. The aligner takes an audio file and the corresponding text and dumps timestamps for every word in the audio. This functionality is useful for processing transcribed files like podcasts, with further applications such as better support for audio editing or automatic subtitle synchronization. Another important application is acoustic model training: with this new feature you can easily collect databases of thousands of hours for your native language from data on the Internet, like news broadcasts, podcasts and audio books. We therefore expect the list of supported languages to grow very fast.

To try the new feature, check out sphinx4 from Subversion or from our new repository on GitHub http://github.com/cmusphinx/sphinx4 and build the code with Maven using “mvn install”.

For the best accuracy, download the En-US generic acoustic model from downloads, as well as the g2p model for US English.

Then run the alignment:

java -cp sphinx4-samples/target/sphinx4-samples-1.0-SNAPSHOT-jar-with-dependencies.jar \
edu.cmu.sphinx.demo.aligner.AlignerDemo file.wav file.txt en-us-generic \
cmudict-5prealpha.dict cmudict-5prealpha.fst.ser

The result will look like this:

+ of                        [10110:10180]
  there                     [11470:11580]
  are                       [11670:11710]
- missing

Here + denotes an inserted word and - denotes a missing word. The numbers are times in milliseconds.
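If you want to post-process this output, for example to generate subtitles, it is easy to parse. Below is a minimal sketch in Python, assuming only the output format shown above (marker character, word, optional [start:end] timestamps in milliseconds); the function name is our own, not part of sphinx4:

```python
import re

# Matches aligner output lines such as:
#   "+ of        [10110:10180]"   (inserted word, with timestamps)
#   "  there     [11470:11580]"   (matched word)
#   "- missing"                   (missing word, no timestamps)
LINE = re.compile(r"^(.)\s*(\S+)(?:\s+\[(\d+):(\d+)\])?\s*$")

def parse_alignment(lines):
    """Yield (status, word, start_ms, end_ms).

    Timestamps are None for missing words, which were not found
    in the audio and therefore carry no time information.
    """
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        marker, word, start, end = m.groups()
        status = {"+": "inserted", "-": "missing"}.get(marker, "matched")
        yield (status, word,
               int(start) if start is not None else None,
               int(end) if end is not None else None)
```

From the (word, start, end) tuples it is then straightforward to build cue times for a subtitle file or cut points for an audio editor.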

Please remember that the input file must be 16 kHz, 16-bit mono. The text must be preprocessed; the algorithm doesn't handle numbers yet.
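A minimal preprocessing sketch is shown below. It only normalizes case, punctuation and whitespace, and refuses text containing digits so you notice numbers that still need to be spelled out by hand; the function is our own illustration, not part of sphinx4:

```python
import re

def preprocess_transcript(text):
    """Minimal cleanup before alignment: lowercase, strip punctuation,
    collapse whitespace. The aligner does not expand numbers, so digits
    must be spelled out ("42" -> "forty two") before calling this."""
    if re.search(r"\d", text):
        raise ValueError("spell out numbers before aligning")
    text = text.lower()
    # Keep letters, apostrophes and spaces; drop everything else.
    text = re.sub(r"[^a-z'\s]", " ", text)
    return " ".join(text.split())
```

For example, preprocess_transcript("Hello, World!") returns "hello world", which is the form the aligner expects.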

Work on long audio alignment started during a 2011 GSoC project, with a proposal from Bhiksha Raj and James Baker. Apurv Tiwari made a lot of progress on it; however, we were not able to produce a robust alignment algorithm. It still failed in too many cases, and the failures were critical. Finally we changed the algorithm to multipass decoding, and it started to work better and survive gracefully in the presence of transcription errors. Alexander Solovets was responsible for the implementation. The algorithm still doesn't handle some important tokens like numbers or abbreviations, and the speed needs improvement; however, it is already useful, so we can proceed with the next steps of model training. We hope to improve the situation in the near future.

3 Responses to “Long Audio Aligner Landed in Trunk”

  1. Ragini Mankad says:


    Awesome work, guys. I experimented with the Long Audio Aligner and it works great. It gives me accurate results in most cases, wonderful :)

    However, the execution time is very long.

    It takes around 30 min for my 3 min audio to get results. Is 30 min considered okay? If I am doing something wrong, can you please point it out? My audio files will always be between 2-10 minutes, 16-bit mono, 16000 Hz, little endian. My JVM is allotted 1 GB of space and I have 3 GB of RAM in total on an Intel Core 2 Duo CPU at 1.67 GHz.

    It would really be great if I could get the 3 min audio processed in 5 to 10 min.

    Please suggest.


  2. Dmytro says:

    Can it be used to force-align transcripts, i.e. finding the best alternative over pronunciation variants, like sphinx3_align?

  3. admin says:

    No, it has a different purpose