Phonetisaurus
2010-10-27
Josef Robert Novak

UPDATED: 2012-03-13 Josef R. Novak
m2m-aligner is no longer required. The native WFST-based aligner should
be preferred as it is faster and produces more accurate alignments.

UPDATED: 2011-04-07 Josef R. Novak
The Python 'decoder' is now obsolete. Use the C++ tool.

UPDATED: 2011-04-11 Josef R. Novak
Many updates to the C++ 'decoder'. It is now fairly nice.
Also wrote new model-training scripts.
It should be much easier to train and test models now.

REQUIREMENTS
estimate-ngram (part of mitlm) must be accessible from your ${PATH}
variable. If you have not installed it, you can obtain the source
code from the following location.

mitlm:
https://code.google.com/p/mitlm/

m2m-aligner is still supported as an alternative aligner, but that
support is deprecated in favor of the native implementation.

m2m-aligner:
http://code.google.com/p/m2m-aligner/
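
The PATH requirement above can be checked before training. The following
is a minimal sketch, assuming Python 3.3 or later (for shutil.which); the
helper name missing_tools and the REQUIRED_TOOLS tuple are hypothetical and
are not part of Phonetisaurus.

```python
import shutil

# Tools this README says must be reachable from PATH. Only estimate-ngram
# is named there; extend the tuple for your own setup.
REQUIRED_TOOLS = ("estimate-ngram",)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

If estimate-ngram is reported as missing, install mitlm from the location
given above and add its install directory to ${PATH}.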

RECOMMENDED INSTALL:
$ make -j
$ ./mk_swig.sh

ALIGN A DATABASE
$ ./m2m-aligner.py --align data/g014a2.train.bsf -s2 -s1 --write_align script/test/test.corpus -m1 2 -m2 2

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --prefix test/test --order 7 --noalign --palign

TEST A MODEL
$ ./evaluate.py --modelfile test/test.fst --testfile ../data/g014a2.test.tabbed.bsf --prefix test/test

-----
OLD INSTALL
$ make

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --verbose --delX --prefix "test/test"

TEST A MODEL
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -t ../data/g014a2.clean.test

TEST A WORD
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -w airmail

GET NBEST RESULTS
$ ../phonetisaurus-g2p -m test/test.fst -n 7 -w airmail

I've only tested this latest build properly with the NETtalk database.
It might die with something else...
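
The single-word and n-best commands above differ only in the value passed
to -n. A minimal sketch of a Python wrapper follows, assuming Python 3.7 or
later and the -m, -n and -w flags exactly as shown in those commands. The
function names g2p_command and run_g2p are hypothetical, and the decoder's
output is returned unparsed because its format is not described here.

```python
import subprocess


def g2p_command(model, word, nbest=1, binary="../phonetisaurus-g2p"):
    """Build the argv list for a single-word lookup, as in the examples above."""
    if nbest < 1:
        raise ValueError("nbest must be at least 1")
    return [binary, "-m", model, "-n", str(nbest), "-w", word]


def run_g2p(model, word, nbest=1, binary="../phonetisaurus-g2p"):
    """Run the decoder and return its raw stdout (no parsing is attempted)."""
    result = subprocess.run(
        g2p_command(model, word, nbest, binary),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the decoder exits non-zero
    )
    return result.stdout
```

For example, g2p_command("test/test.fst", "airmail", nbest=7) builds the
same argument list as the GET NBEST RESULTS command.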

------------------------------------
OLD STUFF
------------------------------------
Compile:
$ make

Running the tool:
$ ./phonetisaurus <clusters> <testlist> <isyms> <model.fst> <osyms> <n-best>

for example,
$ ./phonetisaurus mymodel.clusters testlist mymodel.isyms mymodel.fst mymodel.osyms 5

This will produce pronunciation (or spelling) hypotheses for each item in
'testlist'. It will produce a maximum of 5 hypotheses for each item.
Depending on the item, it may produce fewer hypotheses.

INSTALL
First, install the following packages.

Python:
sudo easy_install simplejson
sudo easy_install argparse (if using Python version < 2.7)

Other:
OpenFST http://openfst.org/
mitlm https://code.google.com/p/mitlm/

The project is set up to use the open NETtalk database by default.
cd into the phonetisaurus sub-directory and run,
---------------------------------------
phonetisaurus$ ./train-phoneticizer.sh
STARTING TRAINING
Using partitioned dataset: 95% training, 5% testing
Aligning the dictionary...
Iteration 1. Only equal length pairs...
Iteration 2. Only equal length pairs...
Iteration 3...
Iteration 4...
Iteration 5...
Generating LM training corpus...
Training the n-gram model with mitlm, n=6...
Loading corpus models/dev-train-0.95-un.corpus...
Smoothing[1] = ModKN
Smoothing[2] = ModKN
Smoothing[3] = ModKN
Smoothing[4] = ModKN
Smoothing[5] = ModKN
Smoothing[6] = ModKN
Set smoothing algorithms...
Estimating full n-gram model...
Saving LM to models/aligned_corpus-6g.arpa...
Building the phoneticizer FST...
TRAINING FINISHED
Running the test...
Evaluating WER...
WER: 0.361
TESTING FINISHED

In order to run the new models yourself, try something like,
./g2p-en.py -m g-mod.fst -i g-mod.isyms -o g-mod.osyms -w testing
-------------------------------------

That's it!
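
The log above reports a WER figure without saying how it is computed. The
sketch below shows one common definition for grapheme-to-phoneme
evaluation: the fraction of test words whose top hypothesis does not match
the reference pronunciation exactly. This definition is an assumption, not
something this README confirms, and the function name word_error_rate is
hypothetical.

```python
def word_error_rate(references, hypotheses):
    """Fraction of words whose top hypothesis differs from the reference.

    ASSUMPTION: whole-pronunciation exact match; the training script's
    own scoring may differ. Both arguments map a word to a list of
    phoneme symbols. A word missing from `hypotheses` counts as an error.
    """
    if not references:
        raise ValueError("references must not be empty")
    errors = sum(
        1
        for word, phones in references.items()
        if hypotheses.get(word) != phones
    )
    return errors / len(references)
```

Under this definition, a WER of 0.361 would mean that roughly 36% of the
held-out words received an incorrect top pronunciation.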

In order to use your own dictionary, it needs to follow the simple
format where each line contains a single word (no spaces) followed
by a tab, followed by a sequence of phonemes separated by spaces,
for example,
--------------------------
testing	T EH S T IH NG
--------------------------
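
The dictionary format described above can be checked before training. The
following is a minimal sketch of a line parser; the function name
parse_dictionary_line is hypothetical and is not part of Phonetisaurus.

```python
def parse_dictionary_line(line):
    """Split one 'word<TAB>P1 P2 ...' line into (word, [phonemes])."""
    # Everything before the first tab is the word, the rest the pronunciation.
    word, sep, pron = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError("missing tab separator: %r" % line)
    if not word or any(ch.isspace() for ch in word):
        raise ValueError("word must be non-empty with no spaces: %r" % line)
    phonemes = pron.split()
    if not phonemes:
        raise ValueError("no phonemes given: %r" % line)
    return word, phonemes
```

A line that uses a space instead of a tab after the word is rejected, which
catches the most likely formatting mistake.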