/tools/expression/diffexp.xml
https://bitbucket.org/cistrome/cistrome-harvard/ · XML · 239 lines · 205 code · 33 blank · 1 comment · 0 complexity · 4025db414e20677953e234645944e73b MD5 · raw file
- <?xml version="1.0"?>
- <tool id="differential_expression" name="Calculate differential expression">
- <description>to find differential expressed genes between two conditions.
- </description>
- <code file="diffexp_code.py"/>
- <command interpreter="python">
- #if $howToGetPheno.howToGetPheno_select=="file" #diffexp.py '$rexprIn' '$title' '$outfile' '$rexprIn.name' '$rexprIn.extension' '$howToGetPheno.phecols' '${rexprIn.metadata.dbkey}' '${rexprIn.metadata.pheno}' '$howToGetPheno.groupTwo' '$method' '$permutations' '$padj' '$pthresh' '$mthresh' '$outfile.extra_files_path' '$txtoutfile'
- #else #diffexp.py '$rexprIn' '$title' '$outfile' '$rexprIn.name' '$rexprIn.extension' '$howToGetPheno.phecols' '$howToGetPheno.annotation' '$howToGetPheno.groupOne' '$howToGetPheno.groupTwo' '$method' '$permutations' '$padj' '$pthresh' '$mthresh' '$outfile.extra_files_path' '$txtoutfile'
- #end if
- </command>
- <inputs>
- <page>
- <param name="title" label="Title to label job outputs" type="text" size="80" value="Differential expression analysis" />
- <param name="rexprIn" type="data" format="txt"
- label="Express set text chosen from your history (Required)" optional="false"
- size="120" help="Choose an expression set text having samples in the columns and genes in the rows from your current history" />
- </page>
- <page>
- <conditional name="howToGetPheno">
- <param name="howToGetPheno_select" type="select" label="How to classify dichotomous phenotype factors" dynamic_options="getHasPheno(rexprIn)"/>
- <when value="file">
- <param name="phecols" type="hidden" value="GroupXXX0XXX1"/>
- <param name="groupOne" type="hidden" value=""/>
- <param name="groupTwo" type="hidden" value=""/>
- <param name="annotation" type="hidden" value=""/>
- </when>
- <when value="manual">
- <param name="phecols" type="hidden" value="GroupXXX0XXX1"/>
- <param name="annotation" type="select" label="Annotation">
- <option value="hgu133a" selected="True">Homo sapiens hgu133a</option>
- <option value="hgu133b">Homo sapiens hgu133b</option>
- <option value="hgu133plus2">Homo sapiens hgu133plus</option>
- <option value="hgu95av2">Homo sapiens hgu95av2</option>
-
- <option value="mouse430a2.db">Mouse 430a2</option>
- <option value="celegans.db">C. elengans</option>
- <option value="drosophila2.db">Drosophila2</option>
- <option value="org.Hs.eg">org.Hs.eg</option>
- <option value="org.Mm.eg">org.Mm.eg</option>
- <option value="org.Ce.eg">org.Ce.eg</option>
- <option value="org.Dm.eg">org.Dm.eg</option>
- </param>
- <param name="groupOne" type="select" multiple="true" label="Group 1 samples" dynamic_options="getSamples(rexprIn)"/>
- <param name="groupTwo" type="select" multiple="true" label="Group 2 samples" dynamic_options="getSamples(rexprIn)"/>
- </when>
- <when value="nothing" />
- </conditional>
- <!-- ##note trick needed to extract rexprIn from the conditional on the previous page##-->
-
- <param name="method" label="Regression method" type="select">
- <option value="L" selected="True">Limma (moderated t-statistics and log-odds
- of differential expression by empirical Bayes shrinkage)</option>
- <option value="F">Ordinary least-squares regression</option>
- <option value="P">Permutation by phenotype resampling</option>
- </param>
- <param name="permutations" label="Number of permutations if using permutation method"
- type="integer" value="1000" />
- <param name="padj" label="Type II error control method" type="select"
- help='Control either family wise error or false discovery rate using routines from R p.adjust'>
- <option value="fdr" selected="True">False discovery rate</option>
- <option value="BY">Benjamini-Yukateli FDR</option>
- <option value="BH">Benjamini-Hochberg FDR</option>
- <option value="holm">Holm FWE control</option>
- <option value="hochberg">Hochberg FWE control</option>
- <option value="bonferroni">Bonferroni FWE control (Overconservative if data correlated)</option>
- <option value="none">None (NOT recommended!)</option>
- </param>
- <param name="pthresh" label="Maximum P-value threshold to report (1.0 to report all)" type="float" value="0.01">
- <validator type="in_range" max="1.0" min="0" message="width is out of range, width has to be between 0 to 1.0" />
- </param>
- <param name="mthresh" label="Minimum fold-change value threshold to report" type="float" value="1.2">
- <validator type="in_range" max="10" min="1.2" message="width is out of range, width has to be between 1.2 to 10" />
- </param>
- </page>
- </inputs>
- <outputs>
- <data format='html' name="outfile" />
- <data format='txt' name="txtoutfile" />
- </outputs>
-
- <help>
- **Syntax**
- - **Title** is used to name the output files - so make it meaningful
- - **Data set to use** an expression index text file from your history
- - **Method** LIMMA method - compute moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage ofthe standard errors towards a common value, using ordinary linea regression, permuation test by resampling the phenotype
- - **Permutations** Number of permutations - ignored unless using permutation method
- - **Type II error control** FDR, one of the specific methods from the R stats p.adjust function
- - **P threshold** Sets the adjusted p value below which genes will be reported in the output "top table"
- - **Fold-change threshold** Sets the fold change (it will be log2 transformed) above which genes will be reported in the output "top table"
- -----
- **Summary**
- Given a normalized .expression set as input, this tool will calculate differential expression of genes in 2 groups analysis, group classification will be done by the "Group" column defined in the phenotype file (pheno.txt) provided together with the raw data ( see Upload Affymetrix .CEL or NimbleGen .XYS files from your computer1. Upload Gene Expression data), or by manually selecting which samples belong to each group.
- The user can also upload any standard tab delimited file containing expression values having gene/probes IDs in the rows and samples in the .columns, see example bellow::
- a.cel b.cel c.cel d.cel
- 1053_at 5.9716 5.4552 5.7371 5.6868
- 117_at 6.1563 6.2173 6.3502 6.3966
- 121_at 8.7508 9.2037 8.7661 9.1232
- 1255_g_at 4.3780 4.9647 4.4988 5.0353
- 1294_at 6.4600 6.4432 6.3925 6.7555
- 1316_at 5.1460 5.6332 5.3307 5.6146
- 1320_at 5.0362 5.1392 5.0129 5.1335
- 1405_i_at 3.8925 4.1668 3.9669 4.1081
- 1431_at 3.6796 3.9467 3.7574 4.1612
- 1438_at 7.5160 7.7695 7.3263 7.5756
- 1487_at 7.1885 7.2374 7.2611 7.4751
- 1494_f_at 6.3009 7.1231 6.8411 7.0828
- 1598_g_at 9.1183 9.0740 9.0078 8.9683
- 160020_at 7.4241 8.0714 7.5623 7.8344
- 1729_at 7.0726 7.1000 6.8014 7.1348
- 1773_at 6.0027 5.7266 5.5894 5.6914
- Different regression methods are available to determine differentially expressed genes: Limma moderated t-test, ordinary least-squares and permutation by resampling. Since Microarray experiments generate large multiplicity problems in which thousands of hypothesis (is gene x differentially expressed between the two groups) are tested simultaneously, correction for false positive (type I errors) and false negative (type 2) errors may be performed by using one of the available methods (Bonferroni and Benjamini FDR). The output from this tool will be a text file (.txt) containing a set of differentially expressed genes, fold-change ( Log2 transformed) and p-values of differential expression. This file that can be saved or used as an input for the next step in the pipe-line(e.g. Find highest expressed TFs tool).
- This tool implements the AffyExpress vignette shown below:
-
- ::
- 5 Identifying Differentially Expressed Genes
- Identifying differentially expressed genes depends on a researcher's interest and
- applying correct statistical models during the analysis process. We will illustrate
- a few basic statistical models on this data set.
- 5.1 Single Factor
- Suppose we would like to identify how differentially expressed genes respond to
- estrogen regardless of time period. Analysis on a single categorical variable is
- the same as One Way ANOVA. Since we only have two levels, present and absent,
- for the estrogen variable, this type of analysis is also equivalent to a two-sample t test.
- 5.1.1 Design Matrix and Contrast Matrix
- To run the analysis, we need to create a design and a contrast matrix. One of the
- major strengths of this package that we can use the built-in function to create the
- design matrix and the contrast matrix using standard statistics approaches which
- is different from the design matrix from the LIMMA package. To create a design
- matrix, we will use the make.design function.
- > design = make.design(target = pData(filtered), cov = "estrogen")
- > design
- (Intercept) estrogen/present
- 1 1 0
- 2 1 0
- 3 1 1
- 4 1 1
- 5 1 0
- 6 1 0
- 7 1 1
- 8 1 1
- attr(,"assign")
- [1] 0 1
- attr(,"contrasts")
- attr(,"contrasts")$estrogen
- [1] "contr.treatment"
- Notice that the name of the second column of the design matrix is estrogen/present,
- where estrogen is the name of the variable and present tells us that present
- corresponds to 1. Thus, the design matrix above is equvalent to the equation
- below:
- y = alpha + betaE xE + (1)
- where
- 1 if estrogen = present
- xE =
- 0 if estrogen = absent
- Next we need to create a contrast matrix. Since we are comparing present
- versus absent, we will create the following contrast:
- > contrast = make.contrast(design.matrix = design, compare1 = "present",
- + compare2 = "absent")
- > contrast
- [,1]
- [1,] 0
- [2,] 1
- 5.1.2 Analysis
- Once the design matrix and contrast matrix are created, we can run the analysis
- by using the regress function. There are three types of regression methods that
- are being supported: LIMMA (computing moderated t-statistics and log-odds of
- differential expressions by empirical Bayes shrinkage of the standard errors towards
- a common value), permutation test (resampling the phenotype), and ordinary
- linear regression. Also, we can apply multiple comparison corrections by using the
- adj option, such as fdr. The default value for the adj is none
- > result = regress(object = filtered, design = design, contrast = contrast,
- + method = "L", adj = "fdr")
- Here are the first three genes of the result
- > result[1:3, ]
- ID Log2Ratio.1 F P.Value adj.P.Val
- 1000_at 1000_at -0.23625810 2.9566239 0.1195380 0.4969571
- 1001_at 1001_at 0.12495314 1.0542776 0.3312475 0.7979252
- 1003_s_at 1003_s_at 0.06103375 0.2581607 0.6235681 0.9536450
- 4
- For the next step, we can select differentaly expressed genes based on p value
- and/or fold change. Suppose that we would like to select genes with a p value
- LT 0.05 and a fold change value greater than 1.5.
- > select = select.sig.gene(top.table = result, p.value = 0.05,
- + m.value = log2(1.5))
- [1] "There are 381 differentially expressed genes"
- [1] "based on your selection criteria."
- The select.sig.gene function adds an additional column, significant, which
- gives a value of either TRUE or FALSE indicating whether the gene is differen-
- tially expressed based on your selection criteria. In this example, there are 381
- differentially expressed genes being selected.
- 5.1.3 Output Your Result
- To output the differentially expressed genes along with annotations to an HTML
- file in your current working directory, we can use the result2html function.
- > result2html(cdf.name = annotation(filtered), result = select,
- + filename = "singleFactor")
- 5.1.4 A Wrapper Function
- There is a wrapper function, AffyRegress, that can acomplish all of the above
- steps together including: create a design and contrast matrix, run regression, select
- differentaly expressed genes, and output the differentally expressed gene to an html
- file.
- > result.wrapper = AffyRegress(normal.data = filtered, cov = "estrogen",
- + compare1 = "present", compare2 = "absent", method = "L",
- + adj = "fdr", p.value = 0.05, m.value = log2(1.5))
- [1] "There are 381 differentially expressed genes"
- [1] "based on your selection criteria."
- Note that this tool uses ideas documented at the R and BioConductor manual:
- http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#biocon_affypack and uses standard R (www.r-project.org) and Bioconductor (htt://bioconductor.org) packages.
- Some base code for this tool was designed and written for the Rexpression BioC/Galaxy tools by ross lazarus (ross d0t lazarus at gmail d0t com) August 2008 Copyright Ross Lazarus 2008 Licensed under the LGPL as documented at http://www.gnu.org/licenses/lgpl.html
- </help>
- </tool>