Sphinx-4 Application Programmer's Guide

WARNING: THIS TUTORIAL DESCRIBES THE SPHINX-4 API FROM THE 5 PREALPHA RELEASE:

https://sourceforge.net/projects/cmusphinx/files/sphinx4/5%20prealpha/

The API described here is not supported in earlier versions.

Overview

Sphinx-4 is a pure Java speech recognition library. It is very flexible in its configuration, but to carry out a speech recognition job quite a few interdependent objects have to be instantiated; throughout this article we will call them collectively the “object graph”. Fortunately, most of these objects can be instantiated automatically, and for the few that require manual setup Sphinx-4 provides high-level interfaces and a context class, so there is no need to set up each part of the object graph separately.

Using Sphinx-4 in your projects

Sphinx-4 is available as a Maven package in the Sonatype OSS repository. To use Sphinx-4 in your Maven project, specify this repository in your pom.xml:

<project>
...
    <repositories>
        <repository>
            <id>snapshots-repo</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
            <releases><enabled>false</enabled></releases>
            <snapshots><enabled>true</enabled></snapshots>
        </repository>
    </repositories>
...
</project>

Then add sphinx4-core to the project dependencies:

<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-core</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Add sphinx4-data to the dependencies as well if you want to use the default acoustic and language models:

<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-data</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Basic Usage

There are several high-level recognition interfaces in Sphinx-4:

  • LiveSpeechRecognizer
  • StreamSpeechRecognizer
  • SpeechAligner

For most speech recognition jobs the high-level interfaces should be enough. Basically, you only need to set up four attributes:

  • Acoustic model.
  • Dictionary.
  • Grammar/Language model.
  • Source of speech.

The first three attributes are set up using a Configuration object, which is then passed to a recognizer. The way the speech source is specified depends on the concrete recognizer; it is usually passed as a method parameter.

Configuration

Configuration is used to supply the required and optional attributes to the recognizer.

Configuration configuration = new Configuration();
 
// Set path to acoustic model.
configuration.setAcousticModelPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz");
// Set path to dictionary.
configuration.setDictionaryPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d");
// Set language model.
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");
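
Alternatively, if you added the sphinx4-data dependency, you can point the same three attributes at the model bundled in that jar. The resource paths below are an assumption based on the 5 prealpha sphinx4-data package, so verify them against the contents of the jar you actually use:

// Bundled US English model from sphinx4-data (assumed resource layout;
// check the jar contents if these resources are not found).
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");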

LiveSpeechRecognizer

LiveSpeechRecognizer uses the microphone as the speech source.

LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
// Start recognition process pruning previously cached data.
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
// Pause recognition process. It can be resumed then with startRecognition(false).
recognizer.stopRecognition();
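
In a typical live session you keep calling getResult() in a loop and decide yourself when to stop. The sketch below is only an illustration; the exit phrase and the loop structure are not part of the API, and configuration is assumed to be set up as shown above.

LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
recognizer.startRecognition(true);
while (true) {
    SpeechResult result = recognizer.getResult();
    String utterance = result.getHypothesis();
    System.out.println(utterance);
    // "exit" is an arbitrary spoken command chosen for this example.
    if (utterance.equals("exit"))
        break;
}
recognizer.stopRecognition();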

StreamSpeechRecognizer

StreamSpeechRecognizer uses an audio file as the speech source.

StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
recognizer.startRecognition(new FileInputStream(new File("speech.wav")));
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();

You can retrieve multiple results until the end of the file:

while ((result = recognizer.getResult()) != null) {
    System.out.println(result.getHypothesis());
}
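
Putting the pieces together, a minimal complete transcription program could look like the sketch below. The class name and the file name speech.wav are placeholders, the model paths are the WSJ ones from the Configuration section, and the input is expected to match that model (16 kHz, 16-bit mono).

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz");
        configuration.setDictionaryPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        // speech.wav is a placeholder for your own audio file.
        InputStream stream = new FileInputStream(new File("speech.wav"));
        recognizer.startRecognition(stream);

        // Print one hypothesis per utterance until the stream is exhausted.
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}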

SpeechAligner

SpeechAligner time-aligns a text transcription with recorded speech.

SpeechAligner aligner = new SpeechAligner(configuration);
// Align the transcription with the audio and collect per-word results.
List<WordResult> results =
        aligner.align(new File("101-42.wav").toURI().toURL(), "one oh one four two");
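
The per-word results can then be printed to inspect the recovered timings. This is just a sketch that reuses the results list captured above; the default string form of WordResult is used for brevity.

// Print each aligned word together with its time stamps.
for (WordResult wordResult : results) {
    System.out.println(wordResult);
}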

SpeechResult

SpeechResult provides access to various parts of the recognition result, such as recognized utterance, list of words with time stamps, recognition lattice and so forth.

// Print utterance string without filler words.
System.out.println(result.getHypothesis());
// Save lattice in a graphviz format.
result.getLattice().dumpDot("lattice.dot", "lattice");
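
The word-level timings mentioned above can be inspected by iterating over the result word by word. A short sketch, assuming result was obtained from one of the recognizers above and that the 5 prealpha getWords() accessor is available:

// Print individual words together with their time stamps.
for (WordResult word : result.getWords()) {
    System.out.println(word);
}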

API Extension

If you need something more sophisticated than the recognition interfaces provided by Sphinx-4, you can write your own. In that case some of the internal classes can be helpful. Let's quickly go over them.

Context

Context wraps low-level manipulation of the underlying configuration into logical methods. For example, when setting up the acoustic model it is crucial to correctly configure the low- and high-pass filters. Context#setAcousticModel(String) automatically extracts this information from the provided model and makes the necessary changes in the configuration.

Another important function of Context is access to the object graph. It can fetch components of the graph by class. Basically, you will always need a Recognizer instance as the primary class that carries out recognition, plus a few secondary instances responsible for various aspects of recognition, such as the microphone or audio file interface.

Context context = new Context(configuration);
// Use microphone input.
context.useMicrophone();
 
// Get required instances. 
Recognizer recognizer = context.getInstance(Recognizer.class);
Microphone microphone = context.getInstance(Microphone.class);
 
// Start recognition.
recognizer.allocate();
microphone.startRecording();
Result result = recognizer.recognize();
microphone.stopRecording();
recognizer.deallocate();
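
Note that the low-level Recognizer returns a raw Result rather than a SpeechResult. A minimal way to get the recognized text out of it, using the classic result API:

// Print the best final hypothesis without filler words.
System.out.println(result.getBestFinalResultNoFiller());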

AbstractSpeechRecognizer

AbstractSpeechRecognizer contains boilerplate code that is common to existing speech recognizer implementations.

XML Configuration

It is possible to configure low-level components of the application through an XML file, although you should do that ONLY IF you understand what is going on. If you have a custom XML configuration and want to use it, you can provide the path to your configuration as an argument of Context:

Context context = new Context("file:custom.config.xml", configuration);

 