Troy: GSoC 2012 Pronunciation Evaluation Week 1


The first week of GSoC 2012 has already been a busy summer. Here is what I have accomplished so far:

  1. To measure the Speex recording "quality" parameter (which is set by the client from 0 to 10) I recorded the same Sphinx3 test utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") with the quality varying from 0 to 10. As shown on the graph, the higher the Speex quality parameter, the larger the .FLV file will be. Judging from my own listening, greater quality parameter values do result in better quality, but it is difficult to hear the differences above level 7. I also tried to generate alignment scores to see whether the quality affects the alignment. However, from the results shown in the following graph, the acoustic scores seems essentially identical for the different recordings. But to be on the safe side in case of background and line noise, for now we will use a Speex recording quality parameter of 8.graph
  2. The rtmplite server is now configured to save its uploaded files to the[path_to_webroot]/data directory on the server. The initial audioRecorder applet will place its recordings in the [path_to_webroot]/data/audioRecorder directory, and for each user there will be a separate folder (e.g. [path_to_webroot]/data/audioRecorder/user1). For each recording utterance, the file name is now in the format of [sentence name]_[quality level].flv
  3. The conversion from .FLV Speex uploads to .WAV PCM audio files is done entirely in the rtmplite server using a process spawned by Python's subprocess.Popen() function calling ffmpeg. After the rtmplite closes the FLV file, the conversion is performed immediately and the converted WAV file has exactly the same path and name except the suffix, which is .wav instead of .flv. Guillem suggested the sox command for the conversion, but it doesn't recognize .flv files directly.  Other possibilities included speexdec, but that won't open .flv files either.
  4. In the audioRecorder client, the user interface now waits for NetConnection and NetStream events to open and close successfully before proceeding with other events. And a 0.5 second delay has been inserted at the beginning and end of the recording button click event to avoid inadvertently trimming the front or end of the recording.
My plans for the 2nd week are:
  1. Solve a problem encountered in converting FLV files to WAV using ffmpeg with Python's Popen() function. If the main Python script (call it test.py for example) is run from a terminal as "python test.py", then everything works great. However, if I put it in background and log off the server by doing "python test.py &", everytime when Popen() is invoked, the whole process hangs there with a "Stopped + test.py &" error message. I will try to figure out a way to work around this issue. Maybe if I start the process from cron (after checking to see whether it already running with a process ID number in a .pid text file) then it will start subprocesses without stopping as occurs when it is detached from a terminal.
  2. Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students, we will display from one to five cue phrases below space for a graphic or animation, assuming the smallest screen possible using HTML which would also look good in a larger window. For the exemplar recordings, we just need to display one phrase but we should also have per-upload form fields (e.g., name, age, sex, native speaker (y/n?), where speaker lived ages 6-8 (which determines their accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (perhaps using HTTP cookies.)  I want to integrate those fields with the mysql database running on our server, so I will need to create a SQL schema with some CREATE TABLE statements to hold all those fields, the filenames, maybe recording durations, the date and time, and perhaps other information.
  3. Test the rtmplite upload server to make sure it works correctly and without race conditions during simultaneous uploads from multiple users, and both sequential and simultaneous recording uploads by the same user, just to be on the safe side.
  4. Further milestones are listed at https://cmusphinx.github.io/wiki/pronunciation_evaluation#milestones1