/tools/maf/maf_to_fasta.xml

https://bitbucket.org/cistrome/cistrome-harvard/

<tool id="MAF_To_Fasta1" name="MAF to FASTA" version="1.0.1">
  <description>Converts a MAF formatted file to FASTA format</description>
  <command interpreter="python">
    #if $fasta_target_type.fasta_type == "multiple" #maf_to_fasta_multiple_sets.py $input1 $out_file1 $fasta_target_type.species $fasta_target_type.complete_blocks
    #else #maf_to_fasta_concat.py $fasta_target_type.species $input1 $out_file1
    #end if#
  </command>
  <inputs>
    <param format="maf" name="input1" type="data" label="MAF file to convert"/>
    <conditional name="fasta_target_type">
      <param name="fasta_type" type="select" label="Type of FASTA Output">
        <option value="multiple" selected="true">Multiple Blocks</option>
        <option value="concatenated">One Sequence per Species</option>
      </param>
      <when value="multiple">
        <param name="species" type="select" label="Select species" display="checkboxes" multiple="true" help="checked taxa will be included in the output">
          <options>
            <filter type="data_meta" ref="input1" key="species" />
          </options>
        </param>
        <param name="complete_blocks" type="select" label="Choose to">
          <option value="partial_allowed">include blocks with missing species</option>
          <option value="partial_disallowed">exclude blocks with missing species</option>
        </param>
      </when>
      <when value="concatenated">
        <param name="species" type="select" label="Species to extract" display="checkboxes" multiple="true">
          <options>
            <filter type="data_meta" ref="input1" key="species" />
          </options>
        </param>
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="fasta" name="out_file1" />
  </outputs>
  <tests>
    <test>
      <param name="input1" value="3.maf" ftype="maf"/>
      <param name="fasta_type" value="concatenated"/>
      <param name="species" value="canFam1"/>
      <output name="out_file1" file="cf_maf2fasta_concat.dat" ftype="fasta"/>
    </test>
    <test>
      <param name="input1" value="4.maf" ftype="maf"/>
      <param name="fasta_type" value="multiple"/>
      <param name="species" value="hg17,panTro1,rheMac2,rn3,mm7,canFam2,bosTau2,dasNov1"/>
      <param name="complete_blocks" value="partial_allowed"/>
      <output name="out_file1" file="cf_maf2fasta_new.dat" ftype="fasta"/>
    </test>
  </tests>
  <help>

**Types of MAF to FASTA conversion**

* **Multiple Blocks** converts each MAF block to its own FASTA block. For example, if you have 6 MAF blocks, they will be converted to 6 FASTA blocks.
* **One Sequence per Species** converts MAF blocks to a single aggregated FASTA block. For example, if you have 6 MAF blocks, they will be converted and concatenated into a single FASTA block.

-------

**What it does**

This tool converts MAF blocks to FASTA format and either concatenates them into a single FASTA block or outputs multiple FASTA blocks separated by empty lines.

The interface for this tool contains two pages (steps):

* **Step 1 of 2**. Choose the MAF dataset (multiple alignments) from your history to be converted to FASTA format.
* **Step 2 of 2**. Choose the type of output as well as the species from the alignment to be included in the output.

Multiple Block output has additional options:

* **Choose species** - the tool reads the alignment provided during Step 1 and generates a list of species contained within that alignment. Using checkboxes you can specify taxa to be included in the output (all species are selected by default).
* **Choose to include/exclude blocks with missing species** - if an alignment block is missing at least one of the species you selected in the **Choose species** menu and this option is set to **exclude blocks with missing species**, then that block **will not** be included in the output (see **Example 2b** below). For example, if you want to extract human, mouse, and rat from a series of alignments and one of the blocks does not contain mouse sequence, then this block will not be converted to FASTA and will not be returned.
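
The include/exclude rule described above can be sketched in a few lines of Python. This is only an illustration and not the code in maf_to_fasta_multiple_sets.py: the function name ``keep_block`` and its arguments are hypothetical, and only the option values ``partial_allowed`` and ``partial_disallowed`` are taken from the tool definition.

```python
def keep_block(block_species, selected, complete_blocks="partial_allowed"):
    """Return True if an alignment block should be written to the output.

    block_species: species names present in one alignment block.
    selected: species chosen by the user (hypothetical argument).
    complete_blocks: "partial_allowed" or "partial_disallowed".
    """
    missing = set(selected) - set(block_species)
    if complete_blocks == "partial_disallowed":
        # Exclude the block as soon as one selected species is absent.
        return not missing
    # Otherwise blocks with missing species are still reported.
    return True


# Two blocks shaped like the examples in this help text.
blocks = [
    {"hg18", "panTro2", "rheMac2", "mm8", "canFam2"},
    {"hg18", "panTro2", "rheMac2"},
]
selected = ["hg18", "mm8"]
kept = [i for i, species in enumerate(blocks)
        if keep_block(species, selected, "partial_disallowed")]
print(kept)  # [0]
```

With ``partial_allowed`` the same call keeps both blocks, because the second block is reported even though it lacks mm8.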

-----

**Example 1**:

In the concatenated approach, the following alignment::

  ##maf version=1
  a score=68686.000000
  s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
  s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
  s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

  a score=10289.000000
  s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG

will be converted to (**note** that because mm8 (mouse) and canFam2 (dog) are absent from the second block, they are replaced with gaps after concatenation)::

  &gt;canFam2
  CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C-------------------------------------
  &gt;hg18
  GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  &gt;mm8
  AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC--------------------------------------------
  &gt;panTro2
  GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  &gt;rheMac2
  GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG
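
The gap padding shown in Example 1 can be illustrated with a minimal Python sketch. It assumes each block is already available as a dictionary mapping a species name to its aligned text and that every block has at least one row; the function name ``concat_blocks`` and this data layout are hypothetical and are not taken from maf_to_fasta_concat.py.

```python
def concat_blocks(blocks, species):
    """Concatenate aligned text per species, padding absent species with gaps."""
    parts = {sp: [] for sp in species}
    for block in blocks:
        # All rows of one block have the same aligned width.
        width = len(next(iter(block.values())))
        for sp in species:
            # A species missing from this block contributes only gaps.
            parts[sp].append(block.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Toy data: mm8 is absent from the second block.
blocks = [
    {"hg18": "ACG-", "mm8": "A-GT"},
    {"hg18": "TT"},
]
result = concat_blocks(blocks, ["hg18", "mm8"])
for name in sorted(result):
    print(">" + name)
    print(result[name])
```

Every output sequence has the same length, which is the sum of the block widths.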

------

**Example 2a**: Multiple Block Approach **Include all species** and **include blocks with missing species**:

The following alignment::

  ##maf version=1
  a score=68686.000000
  s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
  s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
  s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

  a score=10289.000000
  s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG

will be converted to::

  &gt;hg18.chr20(+):56827368-56827443|hg18_0
  GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  &gt;panTro2.chr20(+):56528685-56528760|panTro2_0
  GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  &gt;rheMac2.chr10(-):89144112-89144181|rheMac2_0
  GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
  &gt;mm8.chr2(+):173910832-173910893|mm8_0
  AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
  &gt;canFam2.chr24(+):46551822-46551889|canFam2_0
  CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

  &gt;hg18.chr20(+):56827443-56827480|hg18_1
  ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  &gt;panTro2.chr20(+):56528760-56528797|panTro2_1
  ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  &gt;rheMac2.chr10(-):89144181-89144218|rheMac2_1
  ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG
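
The FASTA headers in Example 2a follow a regular pattern: source name, strand, start, start plus size, then the species name and the zero-based block number. The Python sketch below reproduces that pattern from the fields of an "s" line. The function name ``fasta_header`` is hypothetical, and the sketch simply mirrors the headers printed above; the real script may derive coordinates differently, for instance for reverse-strand rows.

```python
def fasta_header(src, start, size, strand, block_index):
    """Build a header like the ones shown in Example 2a."""
    # The species is the part of the source name before the first dot.
    species = src.split(".", 1)[0]
    end = start + size
    return f">{src}({strand}):{start}-{end}|{species}_{block_index}"


# Fields taken from two "s" lines of the first block in the example.
print(fasta_header("hg18.chr20", 56827368, 75, "+", 0))
print(fasta_header("rheMac2.chr10", 89144112, 69, "-", 0))
```

Within the help text itself the leading ``>`` appears escaped because the help is embedded in XML.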

-----

**Example 2b**: Multiple Block Approach **Include hg18 and mm8** and **exclude blocks with missing species**:

The following alignment::

  ##maf version=1
  a score=68686.000000
  s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
  s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
  s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
  s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

  a score=10289.000000
  s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
  s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG

will be converted to (**note** that the second MAF block, which does not have mm8, is not included in the output, and that alignment columns containing only gaps in the two retained species are dropped)::

  &gt;hg18.chr20(+):56827368-56827443|hg18_0
  GACAGGGTGCATCTGGGAGGGCCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC
  &gt;mm8.chr2(+):173910832-173910893|mm8_0
  AGAAGGATCCACCT---------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------
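
In the output of Example 2b the hg18 and mm8 rows are four columns shorter than in the input: the columns that hold a gap in both retained species are gone. A minimal Python sketch of that column filter follows. The function name ``drop_all_gap_columns`` and the dictionary layout are hypothetical, and the sketch is inferred from the example output rather than from the tool's source.

```python
def drop_all_gap_columns(rows):
    """Remove alignment columns in which every retained row has a gap."""
    texts = list(rows.values())
    width = len(texts[0])
    # Keep a column if at least one row has a base in it.
    keep = [i for i in range(width) if any(t[i] != "-" for t in texts)]
    return {sp: "".join(t[i] for i in keep) for sp, t in rows.items()}


# Toy rows: columns 2 and 3 are gaps in both species.
rows = {"hg18": "AC--G-", "mm8": "A---GT"}
trimmed = drop_all_gap_columns(rows)
print(trimmed["hg18"])
print(trimmed["mm8"])
```

Columns where only one of the rows has a gap are kept, so the remaining rows stay aligned to each other.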

------

.. class:: infomark

**About formats**

**MAF format** is the multiple alignment format. It stores multiple alignments at the DNA level between entire genomes.

- The .maf format is line-oriented. Each multiple alignment ends with a blank line.
- Each sequence in an alignment is on a single line.
- Lines starting with # are considered to be comments.
- Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment.
- Some MAF files may contain two optional line types:

  - An "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line;
  - An "e" line containing information about the size of the gap between the alignments that span the current block.
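
The line types listed above are enough to write a small reader. The Python sketch below is a simplified illustration only: the function name ``parse_maf`` and the dictionary keys are hypothetical, "i" and "e" lines are skipped, and no validation is done. For real work a maintained MAF parser is the safer choice.

```python
def parse_maf(lines):
    """Group the "s" lines of a MAF file into a list of alignment blocks."""
    blocks = []
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            # A blank line ends the current alignment block.
            if current:
                blocks.append(current)
            current = None
        elif fields[0].startswith("#"):
            # Comment or header line such as "##maf version=1".
            continue
        elif fields[0] == "a":
            # An "a" line starts a new block.
            if current:
                blocks.append(current)
            current = []
        elif fields[0] == "s" and current is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            current.append({
                "src": src,
                "start": int(start),
                "size": int(size),
                "strand": strand,
                "src_size": int(src_size),
                "text": text,
            })
        # Optional "i" and "e" lines are ignored in this sketch.
    if current:
        blocks.append(current)
    return blocks


maf_text = """##maf version=1
a score=1.0
s hg18.chr20 0 3 + 100 ACG
s mm8.chr2 5 2 + 200 A-G

a score=2.0
s hg18.chr20 3 2 + 100 TT
i hg18.chr20 C 0 C 0
"""
parsed = parse_maf(maf_text.splitlines())
print(len(parsed))
print([row["src"] for row in parsed[0]])
```

The sketch also starts a new block at every "a" line, so input whose blank separator lines were lost is still split correctly.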

------

**Citation**

If you use this tool, please cite `Blankenberg D, Taylor J, Nekrutenko A; The Galaxy Team. Making whole genome multiple alignments usable for biologists. Bioinformatics. 2011 Sep 1;27(17):2426-2428. &lt;http://www.ncbi.nlm.nih.gov/pubmed/21775304&gt;`_

  </help>
</tool>