There are various tools to help you extend an existing dictionary with new words or build a new dictionary from scratch. If your language already has a dictionary, it is recommended to use it, since it has been carefully tuned for best performance. If you are starting a new language, you need to account for various reduction and coarticulation effects, which make it very hard to create accurate rules for converting text to sounds. However, practice shows that even a naive conversion can produce good results for speech recognition. For example, many developers have successfully built ASR systems with a simple grapheme-based approach, where each letter is mapped to itself rather than to the corresponding phone.
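As a rough illustration of the grapheme-based approach, the sketch below turns a word list into dictionary lines where every letter acts as its own "phone". It assumes the common one-entry-per-line layout (`word phone phone ...`). The function names are made up for this example and are not part of any Sphinx tool.

```python
def grapheme_pronunciation(word):
    """Return one pseudo-phone per alphabetic character of the word."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_grapheme_dictionary(words):
    """Build sorted, de-duplicated dictionary lines of the form 'word g r a p h e m e s'.

    Assumption: the target format is one entry per line, with the word
    followed by space-separated units.
    """
    entries = []
    for word in sorted({w.lower() for w in words}):
        phones = grapheme_pronunciation(word)
        if phones:  # skip tokens that contain no letters at all
            entries.append(f"{word} {' '.join(phones)}")
    return entries


if __name__ == "__main__":
    for line in build_grapheme_dictionary(["Hello", "world", "hello"]):
        print(line)
```

With such a dictionary the acoustic model has to be trained on the same grapheme units, so the "phoneset" is simply the alphabet of the language.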
For most languages you need to use specialized grapheme-to-phoneme (G2P) code that performs the conversion using machine learning methods and a small existing database. Nowadays the most accurate G2P tools include Phonetisaurus:
Also note that almost every TTS package includes G2P code. For example, you can use the G2P code from FreeTTS, written in Java:
See FreeTTS example in Sphinx4 here
OpenMary Java TTS:
or espeak for C:
Please note that if you use a TTS package you often need to do phoneset conversion, because TTS phonesets are usually more extensive than required for ASR. However, TTS tools have a great advantage: they usually provide more of the required functionality than simple G2P. For example, they perform tokenization, converting numbers and abbreviations to their spoken form.
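Phoneset conversion usually amounts to a lookup table that collapses the richer TTS symbols onto the smaller ASR set. The sketch below shows one possible shape for it. The mapping table is purely hypothetical and does not correspond to the phoneset of any particular TTS engine or acoustic model, so you would replace it with a table for your own tools.

```python
# Hypothetical mapping from a richer TTS phoneset to a smaller ASR phoneset.
# None means "drop this symbol"; a value with spaces expands to several phones.
TTS_TO_ASR = {
    "a:": "AA",   # long vowel merged with the short one
    "a": "AA",
    "t_h": "T",   # aspirated stop merged with the plain stop
    "t": "T",
    "ts": "T S",  # affricate split into two ASR phones
    "?": None,    # glottal stop not modelled in the ASR set
}


def convert_phones(tts_phones, mapping=TTS_TO_ASR):
    """Convert a list of TTS phones to ASR phones using the mapping table.

    Raises KeyError for unmapped symbols so that gaps in the table are
    noticed instead of silently producing a broken dictionary.
    """
    asr_phones = []
    for phone in tts_phones:
        if phone not in mapping:
            raise KeyError(f"no ASR mapping for TTS phone {phone!r}")
        target = mapping[phone]
        if target is not None:
            asr_phones.extend(target.split())
    return asr_phones


if __name__ == "__main__":
    print(" ".join(convert_phones(["t_h", "a:", "?", "ts"])))
```

Failing loudly on unknown symbols is a deliberate choice here: every phone in the resulting dictionary must also exist in the acoustic model.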
For English you can use a simpler option, an online web service:
The online LM Tool produces a dictionary that matches its language model. It uses the latest CMU dictionary as a base and is programmed to guess the pronunciations of words that are not in the existing dictionary. You can look at the log file to find out which words were guessed and make your own corrections if necessary. With the advanced option, LM Tool can use a hand-made dictionary that you specify for your specialized vocabulary, or for your own pronunciations as corrections. The hand dictionary must be in the same format as the main dictionary.
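To make the relationship between the base dictionary and a hand dictionary concrete, here is a minimal sketch of how hand-made entries could override base entries, and how to list the vocabulary words that neither dictionary covers. It is not the LM Tool implementation. It assumes the CMU dictionary text layout (`WORD PH PH ...`, alternate pronunciations marked as `WORD(2)`, comment lines starting with `;;;`), and all function names are invented for the example.

```python
import re

# Matches the "(2)", "(3)", ... suffix used for alternate pronunciations.
_VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_dictionary(lines):
    """Parse dictionary lines into {word: [list of phone lists]}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        head, *phones = line.split()
        word = _VARIANT_SUFFIX.sub("", head)
        entries.setdefault(word, []).append(phones)
    return entries


def merge_dictionaries(base, hand):
    """Return a new dictionary where hand entries replace base entries."""
    merged = dict(base)
    merged.update(hand)
    return merged


def missing_words(vocabulary, dictionary):
    """List vocabulary words that have no pronunciation yet."""
    return sorted(w for w in set(vocabulary) if w not in dictionary)


if __name__ == "__main__":
    base = parse_dictionary(["HELLO HH AH L OW", "HELLO(2) HH EH L OW"])
    hand = parse_dictionary(["SPHINX S F IH NG K S"])
    merged = merge_dictionaries(base, hand)
    print(missing_words(["HELLO", "SPHINX", "WORLD"], merged))
```

The words reported as missing are the ones for which a pronunciation would have to be guessed or written by hand.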
If you want to run lmtool offline, you can check it out from Subversion:
The pronunciation generation code currently only supports US English.