WARNING: THIS TUTORIAL DESCRIBES THE SPHINX-4 API FROM THE EXPERIMENTAL HL-INTERFACE BRANCH. IT IS NOT AVAILABLE IN A RELEASED VERSION YET. TO FOLLOW THIS TUTORIAL YOU NEED TO CHECK OUT THE HL-INTERFACE BRANCH FROM SUBVERSION:
Sphinx-4 is a pure Java speech recognition library. Its configuration is very flexible: carrying out a speech recognition job requires instantiating quite a few interdependent objects, which throughout this article we will collectively call the “object graph”. Fortunately, most of these objects can be instantiated automatically, and for the few that require manual setup Sphinx-4 provides high-level interfaces and a context class that removes the need to set up each parameter of the object graph separately.
There are several high-level recognition interfaces in Sphinx-4: LiveSpeechRecognizer, BatchSpeechRecognizer, and SpeechAligner.
For most speech recognition jobs the high-level interfaces should be enough, and basically you will only have to set up four attributes: the acoustic model, the dictionary, the language model, and the source of speech.
The first three attributes are set up using a Configuration object, which is then passed to a recognizer. The way the speech source is specified depends on the concrete recognizer and is usually passed as a method parameter.
Configuration is used to supply the required and optional attributes to the recognizer.
```java
Configuration configuration = new Configuration();

// Set path to the acoustic model.
configuration.setAcousticModelPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz");
// Set path to the dictionary.
configuration.setDictionaryPath("resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d");
// Set path to the language model.
configuration.setLanguageModelPath("models/language/en-us.lm.dmp");
```
LiveSpeechRecognizer uses the microphone as the speech source.
```java
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
// Start the recognition process, pruning previously cached data.
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
// Pause the recognition process. It can be resumed later with startRecognition(false).
recognizer.stopRecognition();
```
BatchSpeechRecognizer uses an audio file as the speech source.
```java
BatchSpeechRecognizer recognizer = new BatchSpeechRecognizer(configuration);
recognizer.startRecognition(new File("speech.wav").toURI().toURL());
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();
```
SpeechAligner time-aligns text with audio speech.
```java
SpeechAligner aligner = new SpeechAligner(configuration);
// Note: the URL needs a protocol prefix, otherwise the constructor
// throws MalformedURLException.
aligner.align(new URL("file:101-42.wav"), "one oh one four two");
```
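The align call also returns the timing information. A minimal sketch of consuming it, assuming align returns a list of WordResult objects (the return type is not shown in this article):

```java
// Assumed: align(...) returns a List<WordResult> with per-word time stamps.
List<WordResult> words = aligner.align(new URL("file:101-42.wav"), "one oh one four two");
for (WordResult word : words)
    System.out.println(word);
```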
SpeechResult provides access to various parts of the recognition result, such as the recognized utterance, a list of words with time stamps, the recognition lattice, and so forth.
```java
// Print the utterance string without filler words.
System.out.println(result.getUtterance(false));
// Save the lattice in GraphViz format.
result.getLattice().dumpDot("lattice.dot", "lattice");
```
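The word-level time stamps mentioned above can be read from the result as well; a sketch assuming SpeechResult exposes a getWords() accessor returning WordResult objects (hedged, as the accessor name is not shown in this article):

```java
// Assumed accessor: list of recognized words with their time frames.
for (WordResult word : result.getWords()) {
    System.out.println(word);
}
```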
If you need something more sophisticated than the recognition interfaces provided by Sphinx-4, you can still write your own. In that case some internal classes can be helpful; let's quickly go over them.
Context wraps low-level manipulations of the underlying configuration in logical methods. For example, when setting up the acoustic model it is crucial to correctly set the low- and high-pass filters. Context#setAcousticModel(String) will automatically extract this information from the provided model and make the necessary changes in the configuration.
Another important function of Context is access to the object graph: it can fetch graph components by class. Basically you will always need a Recognizer instance as the primary class that carries out recognition, plus a few secondary instances responsible for various aspects of recognition, such as the microphone or audio file interface.
```java
Context context = new Context(configuration);
// Use microphone input.
context.useMicrophone();
// Get the required instances.
Recognizer recognizer = context.getInstance(Recognizer.class);
Microphone microphone = context.getInstance(Microphone.class);
// Start recognition.
recognizer.allocate();
microphone.startRecording();
Result result = recognizer.recognize();
microphone.stopRecording();
recognizer.deallocate();
```
AbstractSpeechRecognizer contains boilerplate code that is common to existing speech recognizer implementations.
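For instance, a custom recognizer could extend it to plug in its own speech source. The following is a purely hypothetical sketch: the protected context and recognizer fields and the setSpeechSource helper are assumptions for illustration, not a documented API.

```java
// Hypothetical file-based recognizer built on the shared boilerplate.
public class FileSpeechRecognizer extends AbstractSpeechRecognizer {

    public FileSpeechRecognizer(Configuration configuration) throws IOException {
        super(configuration);
    }

    // Assumed helper on Context for pointing the front end at a stream.
    public void startRecognition(InputStream stream) {
        context.setSpeechSource(stream);
        recognizer.allocate();
    }

    public void stopRecognition() {
        recognizer.deallocate();
    }
}
```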
Even though it is still possible to configure the application the old way, XML configuration is deprecated and is subject to removal in future releases. If you have a custom XML configuration and want to switch to the new API, you will have to implement your own interface as described above and provide the path to your configuration as an argument to Context:
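A sketch of what that might look like, assuming Context accepts the configuration path as its first constructor argument (the exact signature is not shown in this article):

```java
// Hypothetical: load a custom XML configuration together with the
// high-level Configuration object.
Context context = new Context("file:my-config.xml", configuration);
Recognizer recognizer = context.getInstance(Recognizer.class);
```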