https://code.google.com/p/phonetisaurus/
Phonetisaurus
2010-10-27
Josef Robert Novak

UPDATED: 2012-03-13 Josef R. Novak
m2m-aligner is no longer required.  The native WFST-based aligner should
be preferred as it is faster and produces more accurate alignments.

UPDATED: 2011-04-07 Josef R. Novak
The python 'decoder' is now obsolete.  Use the C++ tool.

UPDATED: 2011-04-11 Josef R. Novak
Many updates to the C++ 'decoder'.  It is now fairly nice.
Also wrote new model-training scripts.

It should be much easier to train and test models now.
REQUIREMENTS
estimate-ngram must be accessible from your ${PATH}
variable.  If you have not installed it, you can obtain the source
code from the following locations.

mitlm:
 https://code.google.com/p/mitlm/

The m2m-aligner is also a nice tool and is still supported as an
alternative, but that support is now deprecated in favor of the
native implementation.
m2m-aligner:
 http://code.google.com/p/m2m-aligner/
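
A quick way to confirm that estimate-ngram is actually visible on your
${PATH} (just an illustrative Python snippet, not part of the toolkit):
--------------------------
# Check whether estimate-ngram can be found on ${PATH}.
import os

def on_path(program):
    for d in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(d, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

print("estimate-ngram: %s" % (on_path("estimate-ngram") or "NOT FOUND"))
--------------------------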

RECOMMENDED INSTALL:
$ make -j
$ ./mk_swig.sh

ALIGN A DATABASE
$ ./m2m-aligner.py --align data/g014a2.train.bsf -s2 -s1 --write_align script/test/test.corpus -m1 2 -m2 2

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --prefix test/test --order 7 --noalign --palign

TEST A MODEL
$ ./evaluate.py --modelfile test/test.fst --testfile ../data/g014a2.test.tabbed.bsf --prefix test/test
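
If you want to drive the three steps above from one place, a minimal
Python wrapper like the following will do; the commands and paths are
exactly the ones listed above, and the wrapper itself is not part of
the distribution:
--------------------------
# run_pipeline.py -- chain the align/train/evaluate commands from this
# README via subprocess.  Run it from the top-level source directory.
import subprocess
import sys

STEPS = [
    # ALIGN A DATABASE (top-level directory)
    (".", ["./m2m-aligner.py", "--align", "data/g014a2.train.bsf",
           "-s2", "-s1", "--write_align", "script/test/test.corpus",
           "-m1", "2", "-m2", "2"]),
    # TRAIN A MODEL (script/ directory)
    ("script", ["./train-model.py", "--dict", "../data/g014a2.train.bsf",
                "--prefix", "test/test", "--order", "7",
                "--noalign", "--palign"]),
    # TEST A MODEL (script/ directory)
    ("script", ["./evaluate.py", "--modelfile", "test/test.fst",
                "--testfile", "../data/g014a2.test.tabbed.bsf",
                "--prefix", "test/test"]),
]

for cwd, cmd in STEPS:
    print("Running: " + " ".join(cmd))
    if subprocess.call(cmd, cwd=cwd) != 0:
        sys.exit("Step failed: " + " ".join(cmd))
--------------------------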

-----

OLD INSTALL
$ make

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --verbose --delX --prefix "test/test"

TEST A MODEL
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -t ../data/g014a2.clean.test

TEST A WORD
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -w airmail

GET NBEST RESULTS
$ ../phonetisaurus-g2p -m test/test.fst -n 7 -w airmail
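
The same decoder call is easy to script; for instance (only the -m, -n
and -w flags shown above are used, and the decoder's output is simply
passed through, since its exact format is not documented here):
--------------------------
# g2p_word.py -- run phonetisaurus-g2p for one word from the script/
# directory and let the decoder print its n-best hypotheses.
import subprocess
import sys

word = sys.argv[1] if len(sys.argv) > 1 else "airmail"
nbest = sys.argv[2] if len(sys.argv) > 2 else "7"
subprocess.call(["../phonetisaurus-g2p",
                 "-m", "test/test.fst",
                 "-n", nbest,
                 "-w", word])
--------------------------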

I've only tested this latest build properly with the NETtalk database.
It might die with something else...

------------------------------------
OLD STUFF
------------------------------------

Compile:
$ make

Running the tool:
$ ./phonetisaurus <clusters> <testlist> <isyms> <model.fst> <osyms> <n-best>

for example,
$ ./phonetisaurus mymodel.clusters testlist mymodel.isyms mymodel.fst mymodel.osyms 5

This will produce pronunciation (or spelling) hypotheses for each item
in 'testlist'.  It will produce at most 5 hypotheses per item; depending
on the item, it may produce fewer.

INSTALL
First, install the following packages:

Python:
sudo easy_install simplejson
sudo easy_install argparse (if using python version < 2.7)

Other:
OpenFST  http://openfst.org/
mitlm    https://code.google.com/p/mitlm/

The project is set up to use the open NETtalk db by default;
cd into the phonetisaurus sub-directory and run,

---------------------------------------
phonetisaurus$ ./train-phoneticizer.sh
  STARTING TRAINING
  Using partitioned dataset: 95% training, 5% testing
  Aligning the dictionary...
  Iteration 1. Only equal length pairs...
  Iteration 2. Only equal length pairs...
  Iteration 3...
  Iteration 4...
  Iteration 5...
  Generating LM training corpus...
  Training the n-gram model with mitlm, n=6...
  Loading corpus models/dev-train-0.95-un.corpus...
  Smoothing[1] = ModKN
  Smoothing[2] = ModKN
  Smoothing[3] = ModKN
  Smoothing[4] = ModKN
  Smoothing[5] = ModKN
  Smoothing[6] = ModKN
  Set smoothing algorithms...
  Estimating full n-gram model...
  Saving LM to models/aligned_corpus-6g.arpa...
  Building the phoneticizer FST...
  TRAINING FINISHED
  Running the test...
  Evaluating WER...
  WER: 0.361
  TESTING FINISHED
  In order to run the new models yourself, try something like,
  ./g2p-en.py -m g-mod.fst -i g-mod.isyms -o g-mod.osyms -w testing
---------------------------------------

That's it!
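
The WER figure printed at the end of the run is a standard word error
rate.  Purely to illustrate the metric (this is not the project's
evaluation code), it boils down to edit distance over tokens:
--------------------------
# wer.py -- illustrative word error rate: Levenshtein edits / reference length.
def edit_distance(ref, hyp):
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)]

def wer(pairs):
    errors = sum(edit_distance(r, h) for r, h in pairs)
    tokens = sum(len(r) for r, _ in pairs)
    return float(errors) / tokens if tokens else 0.0

print("WER: %.3f" % wer([(["T", "EH", "S", "T", "IH", "NG"],
                          ["T", "EH", "S", "T", "IH", "N"])]))
--------------------------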

In order to use your own dictionary, it needs to follow a simple
format: each line contains a single word (no spaces), followed
by a tab, followed by a sequence of phonemes separated by spaces,
for example,
--------------------------
testing    T EH S T IH NG
--------------------------
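
If you want to sanity-check a dictionary against that format before
training, a small script along these lines is enough (again, only an
illustration; it is not part of the distribution):
--------------------------
# check_dict.py -- flag lines that do not match: word<TAB>phoneme phoneme ...
import sys

def check_dictionary(path):
    bad = []
    with open(path) as fp:
        for lineno, line in enumerate(fp, 1):
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("\t")
            # exactly one tab, no spaces in the word, at least one phoneme
            if len(parts) != 2 or " " in parts[0] or not parts[1].split():
                bad.append((lineno, line))
    return bad

if __name__ == "__main__":
    for lineno, line in check_dictionary(sys.argv[1]):
        print("bad entry on line %d: %s" % (lineno, line))
--------------------------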