Web Data Collection For Language Modeling – New plan, IRST LM and perplexity

Dr. Tony Robinson, one of the mentors, has come up with a new plan for the project.

The core idea remains the same: Given some audio and its transcripts belonging to some domain, we find additional text on the web which matches our domain. Then we build language models on this obtained text.

We will use podcasts for this. Podcasts can have very different domains and also sometimes come with transcripts. The task therefore becomes building adaptive language models for podcasts.

The difference is that we are going to use Lucene for searching text after obtaining it from a small set of websites that provide high-quality text, such as news websites. This eliminates the problem of search engine automated query policies and also increases the speed of processing data as everything is queried locally.

We chose IRST LM as the language model toolkit. Installation of IRST LM is straightforward. The manual for version 5.60.01 explains the process quite well, with one minor error: caching has to be enabled in the configure step.

I have read some of “Speech and Language Processing” by Jurafsky et al. to get some ideas about n-grams, smoothing and perplexity. Then I ran some experiments on the training data provided with IRST LM. I used your-text-file and the test files in the zip archive to get some results. Punctuation marks and suffixes like 's have already been separated by whitespace in the provided text, so no processing other than adding sentence boundary marks was necessary. I added sentence boundaries using the provided script add-start-end.sh and saved the result as training-text for convenience:

emre@ammit ~/gsoc/irstlm $ add-start-end.sh < your-text-file > training-text

(Note that the IRST LM tools have to be on your PATH if you want to run them without specifying their folders.)

Then I created a language model using trigrams and Witten-Bell smoothing and evaluated it on the given test file:

emre@ammit ~/gsoc/irstlm $ tlm -tr=training-text -n=3 -lm=wb -te=test

which gave n=49984 LP=301734.5406 PP=418.4772517 OVVRate=0.05007602433 as output. PP is the perplexity of the language model on the test set, and OVVRate is the out-of-vocabulary (OOV) rate of the test set. When using modified shift-beta smoothing instead, by setting the -lm parameter to msb,

emre@ammit ~/gsoc/irstlm $ tlm -tr=training-text -n=3 -lm=msb -te=test

the perplexity score seems to be much lower: n=49984 LP=287035.4908 PP=311.8578364 OVVRate=0.05007602433.
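As a sanity check on the two outputs above, the reported PP values are consistent with PP = exp(LP / n). The sketch below assumes that LP is the total negative natural-log probability of the test set and that n is the number of test words; this reading of the fields is inferred from the numbers themselves, not taken from the IRST LM manual.

```python
import math


def perplexity(neg_log_prob, n_words):
    """Perplexity from a total negative natural-log probability.

    Assumption: this is what the LP and n fields in the tlm output mean.
    """
    return math.exp(neg_log_prob / n_words)


# Figures copied from the two tlm runs above.
pp_wb = perplexity(301734.5406, 49984)   # Witten-Bell
pp_msb = perplexity(287035.4908, 49984)  # modified shift-beta

print(f"Witten-Bell PP         ~ {pp_wb:.2f}")
print(f"Modified shift-beta PP ~ {pp_msb:.2f}")
print(f"Relative reduction     ~ {100 * (1 - pp_msb / pp_wb):.1f}%")
```

Because both runs use the same test set and the same n, the drop in LP translates directly into the drop in perplexity.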

The project proposal is going to be updated soon. My next task is to use some actual text and compare perplexities. I also have to read a bit more about entropy, as I am still confused by the idea that the cross entropy of a model of a language is always at least as high as the entropy of the language itself, i.e. the cross entropy of a language model is an upper bound on the entropy of the language.
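The upper-bound idea can at least be checked numerically on a toy example. In the sketch below, p stands in for the "true" distribution of a language and q for a model of it; both distributions are made up for illustration. The gap between cross entropy and entropy is the Kullback-Leibler divergence, which is never negative (Gibbs' inequality), so the cross entropy can only match the entropy when the model equals the true distribution.

```python
import math


def entropy(p):
    """Entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits: the cost of coding p with model q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up distributions over a three-word vocabulary.
p = [0.5, 0.25, 0.25]   # "true" language
q = [1 / 3, 1 / 3, 1 / 3]  # uniform model

h_p = entropy(p)
h_pq = cross_entropy(p, q)

print(f"H(p)    = {h_p:.4f} bits")
print(f"H(p, q) = {h_pq:.4f} bits")
print(f"gap (KL divergence) = {h_pq - h_p:.4f} bits")
```

Since perplexity is just the exponentiated cross entropy per word, a lower perplexity on held-out text means the model is closer to this lower bound.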

- Emre

GSOC 2012 Accepted Projects Announced

We are happy to announce the list of students who will participate in Google Summer of Code 2012 with the CMUSphinx organization:

Letter to Phoneme Conversion in sphinx4

Task

Currently sphinx4 can only work with a predefined dictionary. It is possible to build a phonetic dictionary automatically, but this requires applying machine learning for training, developing a decoder module, and testing. Various language modules need to be trained as well. This work will implement letter-to-sound rules with OpenFST in sphinx4.

Student

John Salatas

Pronunciation Evaluation

Task

Implement a simple reading and pronunciation learning system

Students

Srikanth Ronanki and Troy Lee

Semantic language model

Current language models are very basic, which means they don't really understand what is being transcribed, and that affects the error rate. Create a decoder over the lattices that selects a semantically correct path and produces a perfectly readable result.

Student

Wencan Luo

Postprocessing punctuation and capitalization framework

Create a language-independent postprocessing framework that turns ASR results into something readable, with punctuation, abbreviations and capitalization.

http://www.makapa.de/Paulik_Sent_ICASSP08.pdf

Student

Alexandru-Dan Tomescu

Web Data Collection For Language Modeling

Write a crawler that can collect text data on a certain topic for language model training

Student

Emre Çelikten

We expect great features to be implemented this summer. Please stay tuned, the news will appear here.

Podcast About CMUSphinx History

Hello CMUSphinx Users and Developers

If you are interested in CMUSphinx history, or just want to become more familiar with the core CMUSphinx developers and hear them speak, you now can. The SourceForge team and Rich Bowen have recently made a great podcast with the CMUSphinx team.

Check it out

https://sourceforge.net/blog/podcast-cmusphinx/

CMUSphinx powers mobile dictation application

Sonalight, which showed off its product at this week’s Y Combinator Demo Day, thinks voice tech is better put to use tackling real issues users have with their mobiles in everyday settings, like texting while driving. Sonalight actually employs Google’s own existing voice recognition tech, in combination with the CMU Sphinx open source software, to achieve its results. This is a great use case for CMUSphinx.

Visit

http://sonalight.com/

To try it.

CMUSphinx at GSOC 2012

We are pleased to announce that the CMUSphinx project has been accepted into the Google Summer of Code 2012 program. That will enable us to help several students get started in speech recognition, open source development and CMUSphinx. We are really excited about that.

http://www.google-melange.com/gsoc/org/google/gsoc2012/cmusphinx

If you are interested in participating as a student, the application period will open soon, but it is better to start preparing your application right now. Feel free to contact us with any questions! For more details see:

http://cmusphinx.sourceforge.net/wiki/summerofcodestudents

If you would like to be a mentor, please sign in to the GSoC web application and add your ideas to the ideas list:

http://cmusphinx.sourceforge.net/wiki/summerofcodeideas

We invite you to participate!

Audio tool for displaying spectrogram in real time using Sphinx-4

iSound is a program built with the help of the CMU Sphinx-4 system. It is part of a thesis at the Faculty of Mathematics, Natural Sciences and Information Technologies in Koper, Slovenia. Its main goal is real-time visualization of the audio signal as a spectrogram (also known as a sonogram), which makes it possible to observe the sound visually.

Spectrogram

This could be useful in many areas, such as phonetics, animal sound analysis, music, sonar/radar, speech processing and seismology. In addition to basic spectrogram drawing, the application includes a few features that make it more useful: image freezing, zoom control, signal frequency display, resizing, color scheme selection and contrast adjustment.

Compared to other programs with similar functionality, it shows promising results in CPU, memory and graphics usage. Tests were run on Windows XP, Windows 7, Ubuntu Linux and OS X Lion. You can find the full test results in the diploma thesis, on page 40.

Author’s comment: “For future work i plan to publish research work as an article in the journal. Currently I’m working on idea, how to use similar technologies and develop a tool, which can help persons with hearing handicap.”

You can find useful information about the project on the web page of its author, Irman Abdić:
http://www.irmanabdic.com

Russian Audiobook Morphology-Based Model

In many languages the number of lexical forms is huge due to morphology. Even a simple vocabulary can contain several million forms and variations. Such a big vocabulary is hard to recognize because of the huge search space: the decoder is slow and the language model takes an enormous amount of memory.

Of course the brute-force approach makes sense and is actually quite successful, but better ones have already been suggested. For example, using a morphological segmenter we can build a language model and an acoustic model that describe the same vocabulary with a much smaller number of subword items. Real words are then composed from chunks, which are separate entities in the language model. This way the search space is represented efficiently and the speed is comparable to that of English models.
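To make the idea of subword items concrete, here is a toy sketch. It is not the segmenter used in the project: the ending list, the "+" marker for endings and the English example words are all assumptions chosen for readability, and a real morphological segmenter is considerably more involved.

```python
# Toy illustration only: the ending inventory and the "+" marker are
# hypothetical, not the conventions of any actual CMUSphinx model.
ENDINGS = ("ing", "ed", "s")


def segment(word, endings=ENDINGS, min_stem=3):
    """Split a word into [stem, +ending] when a known ending matches."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return [word[: -len(ending)], "+" + ending]
    return [word]


def join(units):
    """Rebuild surface words from a sequence of subword units."""
    words = []
    for unit in units:
        if unit.startswith("+") and words:
            words[-1] += unit[1:]  # glue the ending onto the previous stem
        else:
            words.append(unit)
    return words


words = ["walked", "walking", "walks", "talked", "talking", "talks"]
units = [u for w in words for u in segment(w)]

print("word vocabulary size:   ", len(set(words)))
print("subword vocabulary size:", len(set(units)))
print("round trip ok:          ", join(units) == words)
```

Even on this tiny list the unit vocabulary is smaller than the word vocabulary, and the saving grows as more stems share the same endings. The language model is then trained on the unit sequence, and recognized units are glued back into words afterwards.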

The tricky part is to segment the words properly. Because the pronunciation of the decomposed parts is not so straightforward, it takes some effort to build the split. We are happy that our contributor Zamir Ostroukhov managed to solve that problem. He created the acoustic model from the audiobooks in the Voxforge database and used large text corpora to create a morphologically segmented language model. This is a very promising approach for morphologically rich languages, so we look forward to seeing this framework as a part of CMUSphinx. Maybe this framework could be extended to a multilevel speech representation that holds both subword and sentence-level items.

Check Zamir’s project
https://github.com/zamiron/ru4sphinx

For more details on the approach please see

Large vocabulary continuous speech recognition of an inflected language using stems and endings by Tomaž Rotovnik et al.

Download Russian audiobook model here, the morphological language model is included:
http://sourceforge.net/projects/cmusphinx/files/Acoustic and Language Models/Russian Audiobook Morphology Zero

For more details see
http://www.cis.hut.fi/projects/morpho/

Long Audio Alignment: Phrase Spotter and the subsequent improvements

Over the last couple of weeks the Long Audio Alignment project has seen a lot of new developments. We realized that alignment accuracy would improve further if approximate time information for certain words were known before the actual alignment. Without such information, the decoder has to go through all the frames of the audio during alignment before it can tell which alignment hypothesis scores best. With approximate timing information, however, the decoder only has to wait until one (or more) of its hypothesis tokens agrees with this additional information, which allows it to prune the rest of the tokens. This helps keep the beam size small.

A highly configurable phrase spotter was therefore implemented to obtain this timing information. The phrase spotter creates a left-to-right, no-skip grammar from the words in the phrase and (like the Aligner) uses a garbage model to model all out-of-grammar utterances. The grammar was chosen this way to ensure that a phrase is recognized only when all of its words are present in the utterance, without a skip. Corresponding changes were made in the linguist to allow, and ensure, that a phone loop is inserted only at the start of a phrase in the search graph.

The Aligner search manager was then designed to exploit the phrase spotter's results when performing audio alignment. As a result, even with a much more complicated grammar, the aligner can now align audio with much better accuracy (almost 0% error). However, the memory required to align very large texts while allowing large skips still poses a problem. For now I am profiling memory usage to locate and tackle this issue.

Long Audio Training – Reduced Baum-Welch Evaluation

In the last post on Long Audio Training it was noted that there were still some problems in the reduced Baum-Welch implementation. Fortunately, these were identified as memory leaks introduced during the optimization and have been fixed. Over the past few days I have run extensive tests, which show that the modified algorithms perform significantly better than the original version of SphinxTrain with respect to memory consumption.

See http://cmusphinx.sourceforge.net/wiki/longaudiotraining for more detailed evaluation.

CMUSphinx and OpenCV

A mixture of cool technologies can help create really innovative applications. Check out this video demonstrating the capabilities of CMUSphinx when it is combined with the OpenCV video recognition library.