/tools/maf/maf_to_interval.xml
<tool id="MAF_To_Interval1" name="MAF to Interval" force_history_refresh="True">
  <description>Converts a MAF formatted file to the Interval format</description>
  <command interpreter="python">maf_to_interval.py $input1 $out_file1 $out_file1.id $__new_file_path__ $input1.dbkey $species $input1.metadata.species $complete_blocks $remove_gaps</command>
  <inputs>
    <param format="maf" name="input1" type="data" label="MAF file to convert"/>
    <param name="species" type="select" label="Select additional species" display="checkboxes" multiple="true" help="The species matching the dbkey of the alignment is always included. A separate history item will be created for each species.">
      <options>
        <filter type="data_meta" ref="input1" key="species" />
        <filter type="remove_value" meta_ref="input1" key="dbkey" />
      </options>
    </param>
    <param name="complete_blocks" type="select" label="Exclude blocks which have a species missing">
      <option value="partial_allowed">include blocks with missing species</option>
      <option value="partial_disallowed">exclude blocks with missing species</option>
    </param>
    <param name="remove_gaps" type="select" label="Remove Gap characters from sequences">
      <option value="keep_gaps">keep gaps</option>
      <option value="remove_gaps">remove gaps</option>
    </param>
  </inputs>
  <outputs>
    <data format="interval" name="out_file1" />
  </outputs>
  <tests>
    <test>
      <param name="input1" value="4.maf" dbkey="hg17"/>
      <param name="complete_blocks" value="partial_disallowed"/>
      <param name="remove_gaps" value="keep_gaps"/>
      <param name="species" value="panTro1" />
      <output name="out_file1" file="maf_to_interval_out_hg17.interval"/>
      <output name="out_file1" file="maf_to_interval_out_panTro1.interval"/>
    </test>
  </tests>
  <help>

**What it does**

This tool converts every MAF block to a set of genomic intervals describing the position of that alignment block within a corresponding genome.
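
The coordinate arithmetic behind this conversion can be sketched in Python. This is only an illustration, not the code in maf_to_interval.py: the helper name ``maf_s_line_to_interval`` is hypothetical, and it assumes intervals are reported in forward-strand coordinates, so that a minus-strand start (which MAF counts from the end of the source sequence) is flipped using the source size.

```python
def maf_s_line_to_interval(line):
    """Map one MAF "s" line to (species, chrom, start, end, strand).

    Hypothetical helper for illustration; assumes forward-strand output.
    """
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a MAF 's' line: %r" % line)
    src = fields[1]
    start = int(fields[2])      # zero-based start on the given strand
    size = int(fields[3])       # aligned bases, gaps not counted
    strand = fields[4]
    src_size = int(fields[5])   # full length of the source sequence
    # The source name has the form "species.chrom"; split on the first dot.
    species, chrom = src.split(".", 1)
    if strand == "-":
        # Minus-strand starts are relative to the reverse-complemented
        # source, so convert to a forward-strand start.
        start = src_size - (start + size)
    return species, chrom, start, start + size, strand


print(maf_s_line_to_interval("s hg18.chr20 56827368 75 + 62435964 GACAGG"))
# ('hg18', 'chr20', 56827368, 56827443, '+')
```
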
Sequences from aligning species are also included in the output.

The interface for this tool contains several options:

 * **MAF file to convert**. Choose multiple alignments from history to be converted to Interval format.
 * **Select additional species**. Choose additional species from the alignment to be included in the output.
 * **Exclude blocks which have a species missing**. If an alignment block is missing any one of the species found in the alignment set and this option is set to **exclude blocks with missing species**, then coordinates of such a block **will not** be included in the output (see **Example 2** below).
 * **Remove Gap characters from sequences**. Gaps can be removed from sequences before they are output.


-----

**Example 1**: **Include only reference genome** (hg18 in this case) and **include blocks with missing species**:

For the following alignment::

    ##maf version=1
    a score=68686.000000
    s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
    s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
    s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
    s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
    s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

    a score=10289.000000
    s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
    s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
    s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG

the tool will create **a single** history item containing the following (**note** that the name field is numbered iteratively: hg18_0_0, hg18_1_0, etc., where the first number is the block number and the second number is the iteration through the block; if a species appears twice in a block, that interval will be repeated. Sequences for each species are included in the order specified in the header; the field is left empty when no sequence is available for that species)::

    #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
    chr20 56827368 56827443 + 68686.0 hg18_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
    chr20 56827443 56827480 + 10289.0 hg18_1_0 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG


-----

**Example 2**: **Include hg18 and mm8** and **exclude blocks with missing species**:

For the following alignment::

    ##maf version=1
    a score=68686.000000
    s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
    s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
    s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
    s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
    s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C

    a score=10289.000000
    s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
    s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
    s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG

the tool will create **two** history items (one for hg18 and one for mm8) containing the following (**note** that both history items contain only one line describing the first alignment block. The second MAF block is not included in the output because it does not contain mm8):

History item **1** (for hg18)::

    #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
    chr20 56827368 56827443 + 68686.0 hg18_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------


History item **2** (for mm8)::

    #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
    chr2 173910832 173910893 + 68686.0 mm8_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------


-------

.. class:: infomark

**About formats**

**MAF format** multiple alignment format file. This format stores multiple alignments at the DNA level between entire genomes.

 - The .maf format is line-oriented. Each multiple alignment ends with a blank line.
 - Each sequence in an alignment is on a single line.
 - Lines starting with # are considered to be comments.
 - Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment.
 - Some MAF files may contain two optional line types:

   - An "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line;
   - An "e" line containing information about the size of the gap between the alignments that span the current block.


  </help>
</tool>