Sphinx-4 Application Programmer's Guide

WARNING: THIS TUTORIAL DESCRIBES SPHINX4 API FROM THE PRE-ALPHA RELEASE:

https://sourceforge.net/projects/cmusphinx/files/sphinx4/5prealpha/

The API described here is not supported in earlier versions

Overview

Sphinx-4 is a pure Java speech recognition library. It is very flexible in its configuration, and carrying out a speech recognition job requires instantiating quite a few interdependent objects; throughout this article we will refer to them collectively as the "object graph". Fortunately, most of these objects can be instantiated automatically, and for the few that require manual setup Sphinx-4 provides high-level interfaces and a context class, which removes the need to set up each parameter of the object graph separately.

Using Sphinx-4 in your projects

Sphinx-4 is available as a Maven package in the Sonatype OSS repository. To use Sphinx-4 in your Maven project, specify this repository in your pom.xml:

<project>
...
    <repositories>
        <repository>
            <id>snapshots-repo</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
            <releases><enabled>false</enabled></releases>
            <snapshots><enabled>true</enabled></snapshots>
        </repository>
    </repositories>
...
</project>

Then add sphinx4-core to the project dependencies:

<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-core</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Add sphinx4-data to dependencies as well if you want to use default acoustic and language models:

<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-data</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Basic Usage

There are several high-level recognition interfaces in Sphinx-4:

  • LiveSpeechRecognizer
  • StreamSpeechRecognizer
  • SpeechAligner

For most speech recognition jobs the high-level interfaces should be enough, and you will basically only have to set up four attributes:

  • Acoustic model.
  • Dictionary.
  • Grammar/Language model.
  • Source of speech.

The first three attributes are set up using a Configuration object, which is then passed to a recognizer. The way to point to the speech source depends on the concrete recognizer and is usually passed as a method parameter.
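Putting these pieces together, a minimal end-to-end transcription program might look like the following sketch. The model paths are the defaults shipped in the sphinx4-data package; "test.wav" is a placeholder name for a 16 kHz, 16-bit mono audio file of your own:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Default models from the sphinx4-data package.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        // "test.wav" is a placeholder; use any 16 kHz, 16-bit mono file.
        InputStream stream = new FileInputStream("test.wav");

        recognizer.startRecognition(stream);
        SpeechResult result;
        // Loop over all utterances in the file.
        while ((result = recognizer.getResult()) != null) {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
```

This example requires both sphinx4-core and sphinx4-data on the classpath.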

Configuration

Configuration is used to supply required and optional attributes to a recognizer.

Configuration configuration = new Configuration();
 
// Set path to acoustic model.
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
// Set path to dictionary.
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
// Set language model.
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");

LiveSpeechRecognizer

LiveSpeechRecognizer uses the microphone as the speech source.

LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
// Start recognition process pruning previously cached data.
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
// Pause recognition process. It can be resumed then with startRecognition(false).
recognizer.stopRecognition();
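In practice a live session usually loops over results until some stop condition. The loop below, which ends when the user says "stop", is an illustrative sketch rather than part of the API, and assumes the recognizer from the snippet above:

```java
recognizer.startRecognition(true);
while (true) {
    // getResult() blocks until a complete utterance has been recognized.
    String utterance = recognizer.getResult().getHypothesis();
    System.out.println(utterance);
    if (utterance.equals("stop")) {
        break;
    }
}
recognizer.stopRecognition();
```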

StreamSpeechRecognizer

StreamSpeechRecognizer uses an InputStream as the speech source. This way you can pass data from a file, from a network socket, or from an existing byte array.

StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
recognizer.startRecognition(new FileInputStream("speech.wav"));
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();

You can retrieve multiple results until the end of the file:

while ((result = recognizer.getResult()) != null) {
    System.out.println(result.getHypothesis());
}

SpeechAligner

SpeechAligner time-aligns text with audio speech.

SpeechAligner aligner = new SpeechAligner(configuration);
aligner.align(new File("101-42.wav").toURI().toURL(), "one oh one four two");
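The align() call returns a list of time-aligned words. Assuming the edu.cmu.sphinx.result.WordResult class from this release, the alignment can be inspected like this:

```java
// align() returns one WordResult per aligned word of the transcript.
List<WordResult> results = aligner.align(
        new File("101-42.wav").toURI().toURL(), "one oh one four two");
for (WordResult word : results) {
    // Each WordResult carries the word and its time frame in the audio.
    System.out.println(word);
}
```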

SpeechResult

SpeechResult provides access to various parts of the recognition result, such as recognized utterance, list of words with time stamps, recognition lattice and so forth.

// Print utterance string without filler words.
System.out.println(result.getHypothesis());
// Save lattice in a graphviz format.
result.getLattice().dumpDot("lattice.dot", "lattice");

XML Configuration

Previously Sphinx-4 was configured by means of complex XML files, which created a lot of problems for our users. Now Sphinx-4 comes with sane defaults and performs well without any configuration. For that reason it is not recommended to use XML files or to modify them.

Additional Information

 
tutorialsphinx4.txt · Last modified: 2015/01/21 15:46 by admin