# Beyond English-Centric Multilingual Machine Translation

## Introduction

In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.

If you are new to using fairseq, read the following walkthrough. Otherwise, skip to the sections below.

0. **Generation Data**
To download the generation data, run the commands below. Note that all datasets need to be detokenized *before* applying SPM in the data preprocessing step. If you use these evaluation datasets, please cite their associated papers.
```bash
# WMT - use sacrebleu, example here:
sacrebleu -t wmt14 -l fr-en --echo src > wmt.test.fr-en.fr
sacrebleu -t wmt14 -l fr-en --echo ref > wmt.test.fr-en.en

# WAT
wget http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/wat2019.my-en.zip
unzip wat2019.my-en.zip

# FLORES
# download from: https://github.com/facebookresearch/flores

# TED - need to detokenize with Moses!
# from: https://github.com/neulab/word-embeddings-for-nmt
wget http://phontron.com/data/ted_talks.tar.gz

# Autshumato
# request to download: https://repo.sadilar.org/handle/20.500.12185/397

# Tatoeba Challenge
# available here: https://github.com/Helsinki-NLP/Tatoeba-Challenge
```
1. **Training Data**

To produce the training data, we use a combination of [CCMatrix](https://arxiv.org/abs/1911.04944) and [CCAligned](https://arxiv.org/abs/1911.06154). Check out the instructions [here](https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix) to download the raw data.

2. **Preprocess Data**
After downloading the raw data, you will need to post-process it, then apply SPM, then binarize. Note that it is very important to run the post-processing script, because it removes every instance of the evaluation data from the mined training data.
```bash
# preprocess data

# remove sentences with more than 50% punctuation
python /path/to/fairseq/examples/m2m_100/process_data/remove_too_much_punc.py

# deduplicate training data
paste /path/to/datadir/train.$src /path/to/datadir/train.$tgt | awk '!x[$0]++' > /path/to/datadir/train.dedup
echo "keeping $(wc -l < /path/to/datadir/train.dedup) bitext out of $(wc -l < /path/to/datadir/train.$src)"
cut -f1 /path/to/datadir/train.dedup > /path/to/datadir/train.$src
cut -f2 /path/to/datadir/train.dedup > /path/to/datadir/train.$tgt

# remove all instances of evaluation data from the training data
python /path/to/fairseq/examples/m2m_100/process_data/dedup_data.py

# frequency cleaning
wget https://dl.fbaipublicfiles.com/m2m_100/histograms.tar.gz
tar -xvzf histograms.tar.gz
python /path/to/fairseq/examples/m2m_100/process_data/clean_histogram.py \
    --src $src --tgt $tgt \
    --src-file /path/to/source/file --tgt-file /path/to/target/file \
    --src-output-file source_output.$src --tgt-output-file target_output.$tgt \
    --histograms /path/to/histograms
# apply SPM
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
python /path/to/fairseq/scripts/spm_encode.py \
    --model spm.128k.model \
    --output_format=piece \
    --inputs=/path/to/input/file/here \
    --outputs=/path/to/output/file/here

# length ratio cleaning
perl mosesdecoder/scripts/training/clean-corpus-n.perl --ratio 3 /path/to/training/data/train.spm.$src-$tgt $src $tgt /path/to/output/directory/train.spm.$src-$tgt 1 250

# binarize data
wget https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt
fairseq-preprocess \
    --source-lang $src --target-lang $tgt \
    --testpref spm.$src.$tgt \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt
```
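The `paste | awk '!x[$0]++'` deduplication idiom above can be hard to parse at first glance. The toy run below (file names invented for this example) shows what it does: `paste` joins the parallel files line by line with tabs, the awk filter keeps only the first occurrence of each joined pair, and `cut` splits the surviving pairs back out.

```bash
# Toy illustration of the dedup step; toy.en / toy.fr are made-up file names.
printf 'hello\nhello\ngoodbye\n' > toy.en
printf 'bonjour\nbonjour\nau revoir\n' > toy.fr
paste toy.en toy.fr | awk '!x[$0]++' > toy.dedup   # first occurrence of each pair
cut -f1 toy.dedup > toy.dedup.en
cut -f2 toy.dedup > toy.dedup.fr
wc -l < toy.dedup   # 2 unique pairs remain out of 3
```

Deduplicating on the joined pair (rather than on source or target alone) is deliberate: it keeps legitimate cases where one source line has two different translations.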
3. **Training Scripts**

To reproduce the training of our models, we train with fairseq-py's multilingual translation [task](https://github.com/pytorch/fairseq/tree/master/examples/multilingual). If you are interested in model parallel training, also check out [fairscale](https://github.com/facebookresearch/fairscale).
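As a rough starting point, a training invocation with that task looks like the sketch below. This is not the configuration used for the released models: the data path, language pairs, architecture, and every hyperparameter here are placeholders, and the multilingual example README linked above has the authoritative flag set.

```bash
# Sketch only - all paths, language pairs, and hyperparameters are placeholders.
fairseq-train /path/to/data_bin \
    --task translation_multi_simple_epoch \
    --arch transformer \
    --encoder-langtok src --decoder-langtok \
    --lang-pairs de-fr,fr-de \
    --sampling-method temperature --sampling-temperature 1.5 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --lr 3e-4 --lr-scheduler inverse_sqrt \
    --max-tokens 4096 --update-freq 2
```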
4. **Generation**
To generate from our models, follow the commands in the generation section below.
If you use any of the resources listed here, please cite:

```bibtex
@article{fan2020beyond,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Fan, Angela and Bhosale, Shruti and Schwenk, Holger and Ma, Zhiyi and El-Kishky, Ahmed and Goyal, Siddharth and Baines, Mandeep and Celebi, Onur and Wenzek, Guillaume and Chaudhary, Vishrav and Goyal, Naman and Birch, Tom and Liptchinsky, Vitaliy and Edunov, Sergey and Grave, Edouard and Auli, Michael and Joulin, Armand},
  journal={arXiv preprint},
  year={2020}
}

@article{schwenk2019ccmatrix,
  title={CCMatrix: Mining billions of high-quality parallel sentences on the web},
  author={Schwenk, Holger and Wenzek, Guillaume and Edunov, Sergey and Grave, Edouard and Joulin, Armand},
  journal={arXiv preprint arXiv:1911.04944},
  year={2019}
}

@article{el2019massive,
  title={A Massive Collection of Cross-Lingual Web-Document Pairs},
  author={El-Kishky, Ahmed and Chaudhary, Vishrav and Guzman, Francisco and Koehn, Philipp},
  journal={arXiv preprint arXiv:1911.06154},
  year={2019}
}
```

## Trained Models
More models coming soon.
### 12B Model
12B parameter model trained on many-to-many training data for 100 languages. We include the last checkpoint, the average of the last 5 checkpoints, and the average of the last 10 checkpoints. There is no universally best choice among the three, and all are close in accuracy. You can either sweep over the three checkpoints on a dev set and use the best-performing one for final testing, or simply take the last checkpoint as a reasonable default.
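The dev-set sweep can be scripted as below. The checkpoint names are the released 4-GPU ones, but the BLEU scores in the example file are invented placeholders standing in for whatever your own dev-set evaluation produces.

```bash
# Invented dev-set BLEU scores; in practice each score would come from
# running fairseq-generate + sacrebleu on the dev set with that checkpoint.
printf '12b_last_chk_4_gpus.pt\t27.1\n12b_avg5_chk_4_gpus.pt\t27.4\n12b_avg10_chk_4_gpus.pt\t27.3\n' > dev_bleu.tsv

# keep the checkpoint with the highest dev BLEU (column 2)
best=$(sort -k2,2 -nr dev_bleu.tsv | head -n1 | cut -f1)
echo "$best"
```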
**Model Download Links**

Configuration | 2 32GB GPUs | 4 16GB GPUs | 6 12GB GPUs | 8 8GB GPUs
:--|:--|:--|:--|:--
Last Checkpoint | [12b_last_chk_2_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_2_gpus.pt) | [12b_last_chk_4_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt) | [12b_last_chk_6_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_6_gpus.pt) | [12b_last_chk_8_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_8_gpus.pt)
Average of last 5 checkpoints | [12b_avg5_chk_2_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg5_chk_2_gpus.pt) | [12b_avg5_chk_4_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg5_chk_4_gpus.pt) | [12b_avg5_chk_6_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg5_chk_6_gpus.pt) | [12b_avg5_chk_8_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg5_chk_8_gpus.pt)
Average of last 10 checkpoints | [12b_avg10_chk_2_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg10_chk_2_gpus.pt) | [12b_avg10_chk_4_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg10_chk_4_gpus.pt) | [12b_avg10_chk_6_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg10_chk_6_gpus.pt) | [12b_avg10_chk_8_gpus.pt](https://dl.fbaipublicfiles.com/m2m_100/12b_avg10_chk_8_gpus.pt)

**Generation Arguments**

Configuration | 2 32GB GPUs | 4 16GB GPUs | 6 12GB GPUs | 8 8GB GPUs
:--|:--|:--|:--|:--
`--pipeline-encoder-balance` | `[26]` | `[1,15,10]` | `[1,9,9,7]` | `[1,6,6,6,7]`
`--pipeline-encoder-devices` | `[0]` | `[0,1,0]` | `[0,1,2,0]` | `[0,4,5,1,0]`
`--pipeline-decoder-balance` | `[3,22,1]` | `[3,11,11,1]` | `[3,7,7,8,1]` | `[1,6,6,6,6,1]`
`--pipeline-decoder-devices` | `[0,1,0]` | `[0,2,3,0]` | `[0,3,4,5,0]` | `[0,2,6,7,3,0]`
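A useful sanity check when copying these arguments (an observation derived from the table, not stated in the README text): each balance list partitions the same stack of pipeline modules across GPUs, so every encoder and decoder balance list sums to the same total, 26, whatever the GPU configuration, and each devices list has exactly one entry per balance entry.

```bash
# Verify that every balance list from the table above sums to 26 modules.
for balance in '[26]' '[1,15,10]' '[1,9,9,7]' '[1,6,6,6,7]' \
               '[3,22,1]' '[3,11,11,1]' '[3,7,7,8,1]' '[1,6,6,6,6,1]'; do
    s=$(echo "$balance" | tr -d '[]' | tr ',' '\n' | awk '{n+=$1} END {print n}')
    echo "$balance -> $s"
done
```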
## SentencePiece Model

```bash
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
```

## Generation with M2M-100

### Encode using our SentencePiece Model

Note: Install SentencePiece from [here](https://github.com/google/sentencepiece)

```bash
fairseq=/path/to/fairseq
cd $fairseq

sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model

for lang in de fr ; do
    python scripts/spm_encode.py \
        --model spm.128k.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
```
### Binarization

```bash
wget https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt
```
### Generation for the 12B model

Note that generation can currently be run using 2 32GB / 4 16GB / 6 12GB / 8 8GB GPUs, and the corresponding model checkpoints and pipeline arguments can be found in the [12B Model Section](#12b-model). Generation on CPUs will be added in the future.
```bash
wget https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
wget https://dl.fbaipublicfiles.com/m2m_100/language_pairs.txt
wget https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]' > gen_out
```
## Evaluation with M2M-100

### Tokenization

Note: Refer to tokenizers/README.md for more details on tokenization.

```bash
cd ${fairseq}/examples/m2m_100
cat ${fairseq}/gen_out | grep -P "^H" | sort -V | cut -f 3- | sh tok.sh fr > hyp
cat ${fairseq}/raw_input.de-fr.fr | sh tok.sh fr > ref
```
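The `grep | sort | cut` pipeline above relies on `fairseq-generate`'s output format: hypothesis lines are prefixed `H-<sentence id>`, followed by the model score and the text as tab-separated fields, with sentences in batch order rather than input order. The mock below (with made-up sentences and scores) shows what the pipeline recovers: hypotheses restored to input order, with the prefix and score columns stripped.

```bash
# Mock of fairseq-generate output: S- lines are sources, H- lines hypotheses.
printf 'S-1\tquelle heure est-il ?\nH-1\t-0.4\til est midi\nS-0\tbonjour\nH-0\t-0.2\tsalut\n' > mock_gen_out
grep -P "^H" mock_gen_out | sort -V | cut -f 3- > mock_hyp
cat mock_hyp   # "salut" then "il est midi", back in input order
```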
### BLEU

```bash
sacrebleu -tok 'none' ref < hyp
```