Phonetisaurus
2010-10-27
Josef Robert Novak

UPDATED: 2012-03-13 Josef R. Novak
m2m-aligner is no longer required.  The native WFST-based aligner should
be preferred as it is faster and produces more accurate alignments.

UPDATED: 2011-04-07 Josef R. Novak
The Python 'decoder' is now obsolete.  Use the C++ tool.

UPDATED: 2011-04-11 Josef R. Novak
Many updates to the C++ 'decoder'.  It is now fairly nice.
Also wrote new model-training scripts.  

It should be much easier to train and test models now.

REQUIREMENTS
estimate-ngram must be accessible from your ${PATH}
variable.  If you have not installed it, you can obtain the source
code from the following location.

mitlm:
 https://code.google.com/p/mitlm/

The m2m-aligner is also a nice tool and is still supported as an
alternative aligner, but that support is deprecated in favor of the
native WFST-based implementation.
m2m-aligner:
 http://code.google.com/p/m2m-aligner/
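
Before training, it can help to confirm that estimate-ngram really is
visible on your ${PATH}.  The following is a minimal sketch using only
the Python standard library (Python 3.3 or later).  The function name
find_tool is made up for illustration and is not part of Phonetisaurus.

```python
import shutil


def find_tool(name="estimate-ngram"):
    """Return the full path of `name` if it is on PATH, else None.

    find_tool is a hypothetical helper, not part of Phonetisaurus.
    """
    return shutil.which(name)


path = find_tool()
if path is None:
    print("estimate-ngram not found on PATH; install mitlm first")
else:
    print("estimate-ngram found at", path)
```

If the tool is reported as missing, build mitlm and add the directory
containing estimate-ngram to your ${PATH}.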
  
RECOMMENDED INSTALL
$ make -j
$ ./mk_swig.sh

ALIGN A DATABASE
$ ./m2m-aligner.py --align data/g014a2.train.bsf -s2 -s1 --write_align script/test/test.corpus -m1 2 -m2 2

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --prefix test/test --order 7 --noalign --palign

TEST A MODEL
$ ./evaluate.py --modelfile test/test.fst --testfile ../data/g014a2.test.tabbed.bsf --prefix test/test


-----

OLD INSTALL
$ make

TRAIN A MODEL
$ cd script
$ ./train-model.py --dict ../data/g014a2.train.bsf --verbose --delX --prefix "test/test"

TEST A MODEL
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -t ../data/g014a2.clean.test

TEST A WORD
$ ../phonetisaurus-g2p -m test/test.fst -n 1 -w airmail

GET NBEST RESULTS
$ ../phonetisaurus-g2p -m test/test.fst -n 7 -w airmail


I've only tested this latest build properly with the NETtalk database.
It might die with something else...


------------------------------------
OLD STUFF
------------------------------------

Compile:
$ make

Running the tool:
$ ./phonetisaurus <clusters> <testlist> <isyms> <model.fst> <osyms> <n-best>

for example,
$ ./phonetisaurus mymodel.clusters testlist mymodel.isyms mymodel.fst mymodel.osyms 5

This will produce pronunciation (or spelling) hypotheses for each item in
'testlist', up to a maximum of 5 hypotheses per item.  Depending on the
item, it may produce fewer hypotheses.


INSTALL
First, install the following packages:

Python:
sudo easy_install simplejson
sudo easy_install argparse (if using python version < 2.7)

Other:
OpenFST  http://openfst.org/
mitlm    https://code.google.com/p/mitlm/

The project is set up to use the open NETtalk database by default.
To train and test a model, cd into the phonetisaurus sub-directory and run:

---------------------------------------
phonetisaurus$ ./train-phoneticizer.sh
  STARTING TRAINING
  Using partitioned dataset: 95% training, 5% testing
  Aligning the dictionary...
  Iteration 1. Only equal length pairs...
  Iteration 2. Only equal length pairs...
  Iteration 3...
  Iteration 4...
  Iteration 5...
  Generating LM training corpus...
  Training the n-gram model with mitlm, n=6...
  Loading corpus models/dev-train-0.95-un.corpus...
  Smoothing[1] = ModKN
  Smoothing[2] = ModKN
  Smoothing[3] = ModKN
  Smoothing[4] = ModKN
  Smoothing[5] = ModKN
  Smoothing[6] = ModKN
  Set smoothing algorithms...
  Estimating full n-gram model...
  Saving LM to models/aligned_corpus-6g.arpa...
  Building the phoneticizer FST...
  TRAINING FINISHED
  Running the test...
  Evaluating WER...
  WER: 0.361
  TESTING FINISHED
  In order to run the new models yourself, try something like,
  ./g2p-en.py -m g-mod.fst -i g-mod.isyms -o g-mod.osyms -w testing
-------------------------------------

That's it!
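
The log above reports a WER figure but this README does not say how the
script computes it.  The sketch below assumes the usual
grapheme-to-phoneme convention: a test word counts as an error when its
top hypothesis differs from the reference pronunciation.  The function
name and the dict-based inputs are illustrative assumptions, not the
script's actual code.

```python
def word_error_rate(references, hypotheses):
    """Fraction of reference words whose top hypothesis is wrong.

    Both arguments map a word to a list of phonemes.  A word with no
    hypothesis counts as an error.  This is an assumed definition of
    WER, not taken from the Phonetisaurus scripts.
    """
    if not references:
        raise ValueError("no reference entries")
    errors = sum(
        1 for word, ref in references.items()
        if hypotheses.get(word) != ref
    )
    return float(errors) / len(references)


refs = {"testing": ["T", "EH", "S", "T", "IH", "NG"], "test": ["T", "EH", "S", "T"]}
hyps = {"testing": ["T", "EH", "S", "T", "IH", "NG"], "test": ["T", "IY", "S", "T"]}
print(word_error_rate(refs, hyps))
```

Under this assumption, a WER of 0.361 would mean that roughly 36% of the
test words received an incorrect top pronunciation.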

To use your own dictionary, make sure it follows this simple format:
each line contains a single word (no spaces), followed by a tab,
followed by a sequence of phonemes separated by spaces.
For example,
--------------------------
testing    T EH S T IH NG
--------------------------
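
A quick way to check a dictionary against the format described above is
to parse each line and reject anything that does not match.  This is a
minimal sketch only.  The function name parse_dict_line is made up for
illustration and is not part of Phonetisaurus.

```python
def parse_dict_line(line):
    """Split one dictionary line into (word, [phonemes]).

    Expected format: word<TAB>phoneme phoneme ...
    Raises ValueError if the line does not match.
    parse_dict_line is a hypothetical helper, not part of Phonetisaurus.
    """
    line = line.rstrip("\r\n")
    word, sep, pron = line.partition("\t")
    if not sep:
        raise ValueError("missing tab separator: %r" % line)
    if not word or " " in word:
        raise ValueError("word must be non-empty with no spaces: %r" % line)
    phonemes = pron.split()
    if not phonemes:
        raise ValueError("no phonemes given: %r" % line)
    return word, phonemes


print(parse_dict_line("testing\tT EH S T IH NG\n"))
```

Running this over every line of your dictionary before training should
catch entries that use spaces instead of a tab between the word and its
pronunciation.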