(author: Alex Tomescu)
The Postprocessing Framework project (part of GSoC 2012) is ready for use.
This project concentrates on capitalization and punctuation recovery, based on capitalized and punctuated language models. The current accuracy is 35% for comma prediction and 39% for period prediction. Capitalization accuracy is around 94%, a figure that benefits from the fact that most words are lower-case.
This project had two main parts: the language model and the main algorithm.
The language model
For the postprocessing task, the language model has to contain capitalized words and punctuation marks as word tokens. In the training data, commas are replaced with <COMMA> and periods are replaced with <PERIOD>. Sentences should also be grouped into paragraphs so that the start-of-sentence and end-of-sentence markers (<s> and </s>) are not too frequent. The language model needs to be converted from ARPA format to DMP format with sphinx_lm_convert (or sphinx3_lm_convert).
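The token replacement described above could be sketched as follows. This is only an illustration, not the project's own preprocessing code: the function names `to_lm_tokens` and `prepare_corpus` are hypothetical, and splitting paragraphs on blank lines is an assumption. It also treats every "." as a sentence period, so abbreviations and decimal numbers would need extra handling in a real corpus.

```python
import re
from typing import List


def to_lm_tokens(paragraph: str) -> str:
    """Rewrite one paragraph so commas and periods become word tokens."""
    text = paragraph.replace(",", " <COMMA> ").replace(".", " <PERIOD> ")
    # Collapse whitespace (including line breaks) to single spaces.
    return " ".join(text.split())


def prepare_corpus(raw_text: str) -> List[str]:
    """Return one training line per paragraph.

    Assumption: paragraphs are separated by blank lines. Keeping a whole
    paragraph on one line means the LM toolkit adds <s> and </s> once per
    paragraph rather than once per sentence.
    """
    paragraphs = re.split(r"\n\s*\n", raw_text)
    return [to_lm_tokens(p) for p in paragraphs if p.strip()]


if __name__ == "__main__":
    sample = "Hello, world. It works.\nStill here.\n\nNew paragraph."
    for line in prepare_corpus(sample):
        print(line)
```

Capitalization is left untouched on purpose, since the model has to learn cased word forms.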
The gutenberg.DMP language model is correctly formatted and can be found in the language model download section on the project’s SourceForge page (https://sourceforge.net/projects/cmusphinx/files/).
The main algorithm
The algorithm iterates through the input word symbols to create word sequences, which are evaluated and put into stacks. When a stack gets full (a maximum capacity is set), it is sorted by sequence probability and the lowest-scoring part is discarded. This way poorly scoring sequences are dropped and only the best ones are kept. The final solution is the sequence that covers the whole input and has the best probability.
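A minimal sketch of this kind of stack-based search is given below. It is not the project's implementation: the candidate set in `variants`, the rule of keeping the best half when a stack overflows, and the `toy_score` function (a hypothetical stand-in for a real language model query) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (log probability, output tokens)


def variants(word: str) -> List[List[str]]:
    """Candidate outputs for one input word (assumed candidate set)."""
    forms = [word.lower(), word.capitalize()]
    marks = [[], ["<COMMA>"], ["<PERIOD>"]]
    return [[form] + mark for form in forms for mark in marks]


def prune(stack: List[Hypothesis], keep: int) -> None:
    """Sort by score, best first, and discard everything after `keep`."""
    stack.sort(key=lambda h: h[0], reverse=True)
    del stack[keep:]


def recover(words: List[str],
            score: Callable[[List[str], str], float],
            capacity: int = 8) -> List[str]:
    """Return the best-scoring cased and punctuated token sequence."""
    # stacks[i] holds hypotheses that cover the first i input words.
    stacks: List[List[Hypothesis]] = [[] for _ in range(len(words) + 1)]
    stacks[0].append((0.0, []))
    for i, word in enumerate(words):
        for logp, tokens in stacks[i]:
            for extension in variants(word):
                new_logp, new_tokens = logp, tokens
                for tok in extension:
                    new_logp += score(new_tokens, tok)
                    new_tokens = new_tokens + [tok]
                stacks[i + 1].append((new_logp, new_tokens))
                if len(stacks[i + 1]) > capacity:
                    # Stack is full: keep only the best half (assumed rule).
                    prune(stacks[i + 1], max(1, capacity // 2))
    return max(stacks[len(words)], key=lambda h: h[0])[1]


def toy_score(history: List[str], token: str) -> float:
    """Hypothetical scorer; a real system would query the language model."""
    prev = history[-1] if history else "<s>"
    at_sentence_start = prev in ("<s>", "<PERIOD>")
    if token == "<PERIOD>":
        return math.log(0.1)
    if token == "<COMMA>":
        return math.log(0.05)
    if token[0].isupper():
        return math.log(0.9 if at_sentence_start else 0.1)
    return math.log(0.1 if at_sentence_start else 0.9)


if __name__ == "__main__":
    print(recover(["hello", "world"], toy_score))
```

Indexing the stacks by the number of input words consumed means hypotheses are only compared against others that cover the same amount of input, which keeps the pruning fair even though punctuation tokens make the output sequences differ in length.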
The project is available for download at:
To compile the project, install Apache Ant and be sure to set the required environment variables. Then type the following:
To postprocess text, use the postprocessing.sh script:
sh ./postprocessing.sh -input_text path_to_file -lm path_to_lm