diffexp.xml | searchcode

/tools/expression/diffexp.xml

https://bitbucket.org/cistrome/cistrome-harvard/ · XML · 239 lines · 205 code · 33 blank · 1 comment · 0 complexity · 4025db414e20677953e234645944e73b MD5 · raw file

<?xml version="1.0"?>
<tool id="differential_expression" name="Calculate differential expression">
<description>to find differential expressed genes between two conditions.
</description>
<code file="diffexp_code.py"/> 
<command interpreter="python">
#if $howToGetPheno.howToGetPheno_select=="file" #diffexp.py '$rexprIn' '$title' '$outfile' '$rexprIn.name' '$rexprIn.extension' '$howToGetPheno.phecols' '${rexprIn.metadata.dbkey}' '${rexprIn.metadata.pheno}' '$howToGetPheno.groupTwo' '$method' '$permutations' '$padj' '$pthresh' '$mthresh'  '$outfile.extra_files_path' '$txtoutfile'

#else #diffexp.py '$rexprIn' '$title' '$outfile' '$rexprIn.name' '$rexprIn.extension' '$howToGetPheno.phecols' '$howToGetPheno.annotation' '$howToGetPheno.groupOne' '$howToGetPheno.groupTwo' '$method' '$permutations' '$padj' '$pthresh' '$mthresh'  '$outfile.extra_files_path' '$txtoutfile'

#end if
</command>
<inputs>
  <page>
    <param name="title" label="Title to label job outputs" type="text" size="80" value="Differential expression analysis" />
    <param name="rexprIn" type="data" format="txt" 
	   label="Express set text chosen from your history (Required)" optional="false"
	   size="120" help="Choose an expression set text having samples in the columns and genes in the rows from your current history" />     
  </page>
  <page>
    <conditional name="howToGetPheno">
      <param name="howToGetPheno_select" type="select" label="How to classify dichotomous phenotype factors" dynamic_options="getHasPheno(rexprIn)"/>
      <when value="file">
	<param name="phecols" type="hidden" value="GroupXXX0XXX1"/>
	<param name="groupOne" type="hidden" value=""/>
	<param name="groupTwo" type="hidden" value=""/>
	<param name="annotation" type="hidden" value=""/>
      </when>
      <when value="manual">
	<param name="phecols" type="hidden" value="GroupXXX0XXX1"/>
	<param name="annotation" type="select" label="Annotation">
	  <option value="hgu133a" selected="True">Homo sapiens hgu133a</option>
	  <option value="hgu133b">Homo sapiens hgu133b</option>
	  <option value="hgu133plus2">Homo sapiens hgu133plus</option>
	  <option value="hgu95av2">Homo sapiens hgu95av2</option>
	
	  <option value="mouse430a2.db">Mouse 430a2</option>
	  <option value="celegans.db">C. elengans</option>
	  <option value="drosophila2.db">Drosophila2</option>
	  <option value="org.Hs.eg">org.Hs.eg</option>
	  <option value="org.Mm.eg">org.Mm.eg</option>
	  <option value="org.Ce.eg">org.Ce.eg</option>
	  <option value="org.Dm.eg">org.Dm.eg</option>
	</param>
	<param name="groupOne" type="select" multiple="true" label="Group 1 samples" dynamic_options="getSamples(rexprIn)"/>
	<param name="groupTwo" type="select" multiple="true" label="Group 2 samples" dynamic_options="getSamples(rexprIn)"/>
      </when>
      <when value="nothing" />
    </conditional>
    <!-- ##note trick needed to extract rexprIn from the conditional on the previous page##-->
    
    <param name="method" label="Regression method" type="select">
      <option value="L" selected="True">Limma (moderated t-statistics and log-odds 
      of differential expression by empirical Bayes shrinkage)</option>
      <option value="F">Ordinary least-squares regression</option>
      <option value="P">Permutation by phenotype resampling</option>
    </param>
    <param name="permutations" label="Number of permutations if using permutation method"
	   type="integer" value="1000" />
    <param name="padj" label="Type II error control method" type="select"
	   help='Control either family wise error or false discovery rate using routines from R p.adjust'>
      <option value="fdr" selected="True">False discovery rate</option>
      <option value="BY">Benjamini-Yukateli FDR</option>
      <option value="BH">Benjamini-Hochberg FDR</option>
      <option value="holm">Holm FWE control</option>
      <option value="hochberg">Hochberg FWE control</option>
      <option value="bonferroni">Bonferroni FWE control (Overconservative if data correlated)</option>
      <option value="none">None (NOT recommended!)</option>
    </param>
    <param name="pthresh" label="Maximum P-value threshold to report (1.0 to report all)" type="float" value="0.01">
      <validator type="in_range" max="1.0" min="0" message="width is out of range, width has to be between 0 to 1.0" />
    </param>
    <param name="mthresh" label="Minimum fold-change value threshold to report" type="float" value="1.2">
      <validator type="in_range" max="10" min="1.2" message="width is out of range, width has to be between 1.2 to 10" />
    </param>
  </page>
</inputs>

<outputs>  
    <data format='html' name="outfile" />
    <data format='txt' name="txtoutfile" />
</outputs>

 

<help>


**Syntax**

- **Title** is used to name the output files - so make it meaningful
- **Data set to use** an expression index text file from your history
- **Method** LIMMA method - compute moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage ofthe standard errors towards a common value, using ordinary linea regression, permuation test by resampling the phenotype
- **Permutations** Number of permutations - ignored unless using permutation method
- **Type II error control** FDR, one of the specific methods from the R stats p.adjust function
- **P threshold** Sets the adjusted p value below which genes will be reported in the output "top table"
- **Fold-change threshold** Sets the fold change (it will be log2 transformed) above which genes will be reported in the output "top table"

-----

**Summary**

Given a normalized .expression set as input, this tool will calculate differential expression of genes in 2 groups  analysis, group classification will be done by the "Group" column defined in the phenotype file (pheno.txt) provided together with the raw data ( see Upload Affymetrix .CEL or NimbleGen .XYS files from your computer1. Upload Gene Expression data), or by manually selecting which samples belong to each group.

The user can also upload any standard tab delimited file containing expression values having gene/probes IDs in the rows and samples in the .columns, see example bellow::

                a.cel   b.cel   c.cel   d.cel
  1053_at	5.9716  5.4552  5.7371	5.6868
  117_at	6.1563	6.2173	6.3502	6.3966
  121_at	8.7508	9.2037	8.7661	9.1232
  1255_g_at	4.3780	4.9647	4.4988	5.0353
  1294_at	6.4600	6.4432	6.3925	6.7555
  1316_at	5.1460	5.6332	5.3307	5.6146
  1320_at	5.0362	5.1392	5.0129	5.1335
  1405_i_at	3.8925	4.1668	3.9669	4.1081
  1431_at	3.6796	3.9467	3.7574	4.1612
  1438_at	7.5160	7.7695	7.3263	7.5756
  1487_at	7.1885	7.2374	7.2611	7.4751
  1494_f_at	6.3009	7.1231	6.8411	7.0828
  1598_g_at	9.1183	9.0740	9.0078	8.9683
  160020_at	7.4241	8.0714	7.5623	7.8344
  1729_at	7.0726	7.1000	6.8014	7.1348
  1773_at	6.0027	5.7266	5.5894	5.6914


Different regression methods are available to determine differentially expressed genes: Limma moderated t-test, ordinary least-squares and permutation by resampling. Since Microarray experiments generate large multiplicity problems in which thousands of hypothesis (is gene x differentially expressed between the two groups) are tested simultaneously, correction for false positive (type I errors) and false negative (type 2) errors may be performed by using one of the available methods (Bonferroni and Benjamini FDR). The output from this tool will be a text file (.txt) containing a set of differentially expressed genes, fold-change ( Log2 transformed)  and p-values of differential expression. This file that can be saved or used as an input for the next step in the pipe-line(e.g. Find highest expressed TFs tool).
This tool implements the AffyExpress vignette shown below:
    
::

  5    Identifying Differentially Expressed Genes

  Identifying differentially expressed genes depends on a researcher's interest and 
  applying correct statistical models during the analysis process. We will illustrate 
  a few basic statistical models on this data set.

  5.1     Single Factor
  Suppose we would like to identify how differentially expressed genes respond to 
  estrogen regardless of time period. Analysis on a single categorical variable is 
  the same as One Way ANOVA. Since we only have two levels, present and absent, 
  for the estrogen variable, this type of analysis is also equivalent to a two-sample t test.

    5.1.1    Design Matrix and Contrast Matrix

    To run the analysis, we need to create a design and a contrast matrix. One of the 
  major strengths of this package that we can use the built-in function to create the
  design matrix and the contrast matrix using standard statistics approaches which
  is different from the design matrix from the LIMMA package. To create a design
  matrix, we will use the make.design function.
  > design = make.design(target = pData(filtered), cov = "estrogen")
  > design
      (Intercept) estrogen/present
  1              1                    0
  2              1                    0
  3              1                    1
  4              1                    1
  5              1                    0
  6              1                    0
  7              1                    1
  8              1                    1

  attr(,"assign")
  [1] 0 1
  attr(,"contrasts")
  attr(,"contrasts")$estrogen
  [1] "contr.treatment"
  Notice that the name of the second column of the design matrix is estrogen/present,
  where estrogen is the name of the variable and present tells us that present
  corresponds to 1. Thus, the design matrix above is equvalent to the equation
  below:
                               y = alpha + betaE xE +                                (1)
  where
                                 1 if estrogen = present
                        xE =
                                 0 if estrogen = absent
  Next we need to create a contrast matrix. Since we are comparing present
  versus absent, we will create the following contrast:
  > contrast = make.contrast(design.matrix = design, compare1 = "present",
  +       compare2 = "absent")
  > contrast
         [,1]
  [1,]      0
  [2,]      1

  5.1.2    Analysis
  Once the design matrix and contrast matrix are created, we can run the analysis
  by using the regress function. There are three types of regression methods that
  are being supported: LIMMA (computing moderated t-statistics and log-odds of
  differential expressions by empirical Bayes shrinkage of the standard errors towards
  a common value), permutation test (resampling the phenotype), and ordinary
  linear regression. Also, we can apply multiple comparison corrections by using the
  adj option, such as fdr. The default value for the adj is none
  > result = regress(object = filtered, design = design, contrast = contrast,
  +       method = "L", adj = "fdr")
  Here are the first three genes of the result
  > result[1:3, ]
                       ID Log2Ratio.1              F    P.Value adj.P.Val
  1000_at         1000_at -0.23625810 2.9566239 0.1195380 0.4969571
  1001_at         1001_at 0.12495314 1.0542776 0.3312475 0.7979252
  1003_s_at 1003_s_at 0.06103375 0.2581607 0.6235681 0.9536450
                                            4
    For the next step, we can select differentaly expressed genes based on p value
  and/or fold change. Suppose that we would like to select genes with a p value
  LT 0.05 and a fold change value greater than 1.5.
  > select = select.sig.gene(top.table = result, p.value = 0.05,
  +       m.value = log2(1.5))
  [1] "There are 381 differentially expressed genes"
  [1] "based on your selection criteria."
  The select.sig.gene function adds an additional column, significant, which
  gives a value of either TRUE or FALSE indicating whether the gene is differen-
  tially expressed based on your selection criteria. In this example, there are 381
  differentially expressed genes being selected.

  5.1.3    Output Your Result
  To output the differentially expressed genes along with annotations to an HTML
  file in your current working directory, we can use the result2html function.
  > result2html(cdf.name = annotation(filtered), result = select,
  +       filename = "singleFactor")

  5.1.4    A Wrapper Function
  There is a wrapper function, AffyRegress, that can acomplish all of the above
  steps together including: create a design and contrast matrix, run regression, select
  differentaly expressed genes, and output the differentally expressed gene to an html
  file.

  > result.wrapper = AffyRegress(normal.data = filtered, cov = "estrogen",
  +       compare1 = "present", compare2 = "absent", method = "L",
  +       adj = "fdr", p.value = 0.05, m.value = log2(1.5))
  [1] "There are 381 differentially expressed genes"
  [1] "based on your selection criteria."


Note that this tool uses ideas documented at the R and BioConductor manual:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#biocon_affypack and uses standard R (www.r-project.org) and Bioconductor (htt://bioconductor.org) packages.

Some base code for this tool was designed and written for the Rexpression BioC/Galaxy tools by ross lazarus (ross d0t lazarus at gmail d0t com) August 2008 Copyright Ross Lazarus 2008 Licensed under the LGPL as documented at http://www.gnu.org/licenses/lgpl.html

</help>
</tool>