/tools/stats/dna_filtering.xml

https://bitbucket.org/cistrome/cistrome-harvard/
<tool id="dna_filter" name="Filter on ambiguities" version="1.0.0">
  <description>in polymorphism datasets</description>
  <command interpreter="python">
    dna_filtering.py
      --input=$input
      --output=$out_file1
      --cond="$cond"
      --n_handling=$n_handling
      --columns=${input.metadata.columns}
      --col_types="${input.metadata.column_types}"
  </command>
  <inputs>
    <param format="tabular" name="input" type="data" label="Filter" help="Dataset missing? See TIP below."/>
    <param name="cond" size="40" type="text" value="c4 == 'G'" label="With following condition" help="Double equal signs, ==, must be used as shown above. To filter for an arbitrary string, use the Select tool.">
      <validator type="empty_field" message="Enter a valid filtering condition, see syntax and examples below."/>
    </param>
    <param name="n_handling" type="select" label="What is the meaning of N" help="Everything matches everything, Unknown matches nothing">
      <option value="all">Everything (A, T, C, G)</option>
      <option value="none">Unknown</option>
    </param>
  </inputs>
  <outputs>
    <data format="input" name="out_file1" metadata_source="input"/>
  </outputs>
  <tests>
    <test>
      <param name="input" ftype="tabular" value="dna_filter_in1.tabular" />
      <param name="cond" value="c8=='G'" />
      <param name="n_handling" value="all" />
      <output name="out_file1" ftype="tabular" file="dna_filter_out1.tabular" />
    </test>
    <test>
      <param name="input" value="dna_filter_in1.tabular" />
      <param name="cond" value="(c10 == c11 or c17 == c18) and c6 != 'C' and c23 == 'R'" />
      <param name="n_handling" value="all" />
      <output name="out_file1" file="dna_filter_out2.tabular" />
    </test>
    <test>
      <param name="input" value="dna_filter_in1.tabular" />
      <param name="cond" value="c4=='B' or c9==c10" />
      <param name="n_handling" value="none" />
      <output name="out_file1" file="dna_filter_out3.tabular" />
    </test>
    <test>
      <param name="input" value="dna_filter_in1.tabular" />
      <param name="cond" value="c1!='chr1' and c7!='Y' and c25!='+'" />
      <param name="n_handling" value="none" />
      <output name="out_file1" file="dna_filter_out4.tabular" />
    </test>
  </tests>
  <help>

.. class:: infomark

**TIP:** If your data is not TAB delimited, use *Text Manipulation-&gt;Convert*

.. class:: warningmark

**TIP:** This tool is intended primarily for comparing column values (such as "c5==c12"), although it is also possible to filter on specific values (like "c6!='G'"). Be aware that when searching for specific values, any possible match is considered. So if you search on "c6!='G'", rows will be excluded when c6 is G, K, R, S, B, V, or D (plus N or X if you set that to equal "Everything"), because any of those values could indicate G.

-----

**What it does**

This tool is written for a very specific case related to an analysis of polymorphism data. Suppose you have a table of SNP data that looks like this::

  chromosome start end patient1 patient2 patient3 patient4
  --------------------------------------------------------
  chr1       100   101 A        M        C        R
  chr1       200   201 T        K        C        C

and you want to select all rows where patient1 has the same base as patient2. Unfortunately you cannot do this with the *Filter and Sort -> Filter* tool because it does not understand DNA ambiguity codes (see below). For example, at position 100 patient1 matches patient2 because M stands for A or C. This tool is designed to make filtering on ambiguities possible.

-----

**Syntax**

The filter tool allows you to restrict the dataset using simple conditional statements:

- Columns are referenced with **c** and a **number**. For example, **c1** refers to the first column of a tab-delimited file (e.g., **c4 == c5**)
- When using the 'equal-to' operator, a **double equal sign '==' must be used** ( e.g., **c1=='chr1'** )
- Non-numerical values must be included in single or double quotes ( e.g., **c6=='C'** )
- The filtering condition can include logical operators, but **make sure operators are all lower case** ( e.g., **(c1!='chrX' and c1!='chrY') or c6=='+'** )

------

**Allowed types of filtering**

The following types of filtering are allowed:

- Testing columns for equality (e.g., c2 == c4 or c2 != c4)
- Testing that a column contains a particular base (e.g., c4 == 'C'). Only bases listed in *DNA Codes* below are allowed.
- Testing that a column represents a plus or a minus strand (e.g., c3 == '+' or c3 != '-')
- Testing that a column is a chromosome (c1 == 'chrX') or a scaffold (c1 == 'scaffold87976')

All other types of filtering should be done with the *Filter and Sort -> Filter* tool.

-----

**DNA Codes**

The following are the DNA codes used for filtering::

  Code   Meaning
  ----   --------------------------
   A     A
   T     T
   U     T
   G     G
   C     C
   K     G or T
   M     A or C
   R     A or G
   Y     C or T
   S     C or G
   W     A or T
   B     C, G or T
   V     A, C or G
   H     A, C or T
   D     A, G or T
   X     A, C, G or T
   N     A, C, G or T
   .     not (A, C, G or T)
   -     gap of indeterminate length

-----

**Example**

- **c8=='A'** selects lines in which the eighth column is A, M, R, W, V, H, or D, or N or X if appropriate
- **c12==c15** selects lines where the value in the twelfth column could be the same as the fifteenth and the fifteenth column could be the same as the twelfth column (based on appropriate codes)
- **c9!=c19** selects lines where column nine could not be the same as column nineteen or column nineteen could not be the same as column nine (using appropriate codes)
- **c4 == 'A' and c4 == c5** selects lines where columns 4 and 5 are both A, M, R, W, V, H, or D, or N or X if appropriate

</help>
</tool>
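<!--
The "any possible match" rule described in the help text can be sketched in Python. This is only an illustration under stated assumptions, not the code of dna_filtering.py: the names CODES, ALL_BASES, bases and could_match are hypothetical, the table is copied from the DNA Codes section, and treating '.' and the gap code as matching nothing is an assumption.

```python
# Hypothetical sketch of ambiguity-aware matching; not taken from dna_filtering.py.
# The table mirrors the "DNA Codes" section of the help text.
CODES = {
    "A": "A", "T": "T", "U": "T", "G": "G", "C": "C",
    "K": "GT", "M": "AC", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "B": "CGT", "V": "ACG", "H": "ACT", "D": "AGT",
}
ALL_BASES = frozenset("ACGT")


def bases(code, n_handling="all"):
    """Return the set of bases that a single code may stand for."""
    code = code.upper()
    if code in ("N", "X"):
        # "all": N and X stand for every base; "none": they match nothing.
        return ALL_BASES if n_handling == "all" else frozenset()
    # Assumption: '.', the gap code and unknown symbols match nothing.
    return frozenset(CODES.get(code, ""))


def could_match(left, right, n_handling="all"):
    """True when the two codes share at least one possible base."""
    return not bases(left, n_handling).isdisjoint(bases(right, n_handling))


# First row of the table in "What it does": patient1 is A, patient2 is M.
row = "chr1\t100\t101\tA\tM\tC\tR".split("\t")
print(could_match(row[3], row[4]))  # the c4 == c5 comparison for that row

# Codes that could indicate G, i.e. the rows a c6 != 'G' filter would exclude.
print(sorted(c for c in list(CODES) + ["N", "X"] if could_match(c, "G")))
```

Under this reading, an equality test asks whether two codes share a possible base and an inequality test is its negation, which is why the second print lists the same codes as the warning TIP in the help text.
-->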