/tools/rgenetics/rgEigPCA.xml
https://bitbucket.org/cistrome/cistrome-harvard/ · XML · 167 lines · 127 code · 40 blank · 0 comment · 0 complexity · 1d590e22ba9f5a8855232dcf51fd7359 MD5 · raw file
- <tool id="rgEigPCA1" name="Eigensoft:">
- <description>PCA Ancestry using SNP</description>
- <command interpreter="python">
- rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
- "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
- </command>
- <inputs>
- <param name="i" type="data" label="Input genotype data file"
- size="120" format="ldindep" />
- <param name="title" type="text" value="Ancestry PCA" label="Title for outputs from this run"
- size="80" />
- <param name="k" type="integer" value="4" label="Number of principal components to output"
- size="3" />
- <param name="m" type="integer" value="0" label="Max. outlier removal iterations"
- help="To turn on outlier removal, set m=5 or so. Do this if you plan on adjusting any analyses"
- size="3" />
- <param name="t" type="integer" value="5" label="# principal components used for outlier removal"
- size="3" />
- <param name="s" type="integer" value="6" label="#SDs for outlier removal"
- help = "Any individual with SD along one of k top principal components > s will be removed as an outlier."
- size="3" />
- </inputs>
- <outputs>
- <data name="out_file1" format="html" label="${title}_rgEig.html"/>
- <data name="pca" format="txt" label="${title}_rgEig.txt"/>
- </outputs>
- <tests>
- <test>
- <param name='i' value='tinywga' ftype='ldindep' >
- <metadata name='base_name' value='tinywga' />
- <composite_data value='tinywga.bim' />
- <composite_data value='tinywga.bed' />
- <composite_data value='tinywga.fam' />
- <edit_attributes type='name' value='tinywga' />
- </param>
- <param name='title' value='rgEigPCAtest1' />
- <param name="k" value="4" />
- <param name="m" value="2" />
- <param name="t" value="2" />
- <param name="s" value="2" />
- <output name='out_file1' file='rgtestouts/rgEigPCA/rgEigPCAtest1.html' ftype='html' compare='diff' lines_diff='195'>
- <extra_files type="file" name='rgEigPCAtest1_PCAPlot.pdf' value="rgtestouts/rgEigPCA/rgEigPCAtest1_PCAPlot.pdf" compare="sim_size" delta="3000"/>
- </output>
- <output name='pca' file='rgtestouts/rgEigPCA/rgEigPCAtest1.txt' compare='diff'/>
- </test>
- </tests>
- <help>
- **Syntax**
- - **Genotype data** is an input genotype dataset in Plink lped (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml) format. See below for notes
- - **Title** is used to name the output files so you can remember what the outputs are for
- - **Tuning parameters** are documented in the Eigensoft (http://genepath.med.harvard.edu/~reich/Software.htm) documentation - see below
- -----
- **Summary**
- Eigensoft requires ld-reduced genotype data.
- Galaxy has an automatic converter for genotype data in Plink linkage pedigree (lped) format.
- For details of this generic genotype format, please see the Plink documentation at
- http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
- Reading that documentation, you'll see that the linkage pedigree format is really two related files with the same
- file base name - a map and ped file - eg 'mygeno.ped' and 'mygeno.map'.
- The map file has the chromosome, offset, genetic offset and snp name corresponding to each
- genotype stored as separate alleles in the ped file. The ped file has family id, individual id, father id (or 0), mother id
- (or 0), gender (1=male, 2=female, 0=unknown) and affection (1=unaffected, 2=affected, 0=unknown),
- then two separate allele columns for each genotype.
- Once you have your data in the right format, you can upload those into your Galaxy history using the "upload" tool.
- To upload your lped data in the upload tool, choose 'lped' as the 'file format'. The tool form will change to
- allow you to navigate to and select each member of the pair of ped and map files stored on your local computer
- (or available at a public URL for Galaxy to grab).
- Give the dataset a meaningful name (replace rgeneticsData with something more useful!) and click execute.
- When the upload is done, your new lped format dataset will appear in your history and then,
- when you choose the ancestry tool, that history dataset will be available as input.
- **Warning for the Impatient**
- When you execute the tool, it will look like it has not started running for a while as the automatic converter
- reduces the amount of LD - otherwise eigenstrat gives biased results.
- **Attribution**
- This tool runs and relies on the work of many others, including the
- maintainers of the Eigensoft program, and the R and
- Bioconductor projects. For full attribution, source code and documentation, please see
- http://genepath.med.harvard.edu/~reich/Software.htm, http://cran.r-project.org/
- and http://www.bioconductor.org/ respectively
- This implementation is a Galaxy tool wrapper around these third party applications.
- It was originally designed and written for family based data from the CAMP Illumina run of 2007 by
- ross lazarus (ross.lazarus@gmail.com) and incorporated into the rgenetics toolkit.
- copyright Ross Lazarus 2007
- Licensed under the terms of the LGPL as documented http://www.gnu.org/licenses/lgpl.html
- but is about as useful as a sponge boat without EIGENSOFT pca code.
- **README from eigensoft2 distribution at http://genepath.med.harvard.edu/~reich/Software.htm**
- [rerla@beast eigensoft2]$ cat README
- EIGENSOFT version 2.0, January 2008 (for Linux only)
- This is the same as our EIGENSOFT 2.0 BETA release with a few recent changes
- as described at http://genepath.med.harvard.edu/~reich/New_In_EIGENSOFT.htm.
- Features of EIGENSOFT version 2.0 include:
- -- Keeping track of ref/var alleles in all file formats: see CONVERTF/README
- -- Handling data sets up to 8 billion genotypes: see CONVERTF/README
- -- Output SNP weightings of each principal component: see POPGEN/README
- The EIGENSOFT package implements methods from the following 2 papers:
- Patterson N. et al. 2006 PLoS Genetics in press (population structure)
- Price A.L. et al. 2006 NG 38:904-9 (EIGENSTRAT stratification correction)
- See POPGEN/README for documentation of population structure programs.
- See EIGENSTRAT/README for documentation of EIGENSTRAT programs.
- See CONVERTF/README for documentation of programs for converting file formats.
- Executables and source code:
- ----------------------------
- All C executables are in the bin/ directory.
- We have placed source code for all C executables in the src/ directory,
- for users who wish to modify and recompile our programs. For example, to
- recompile the eigenstrat program, type
- "cd src"
- "make eigenstrat"
- "mv eigenstrat ../bin"
- Note that some of our software will only compile if your system has the
- lapack package installed. (This package is used to compute eigenvectors.)
- Some users may need to change "blas-3" to "blas" in the Makefile,
- depending on how blas and lapack are installed.
- If cc is not available on your system, try "cp Makefile.alt Makefile"
- and then recompile.
- If you have trouble compiling and running our code, try compiling and
- running the pcatoy program in the src directory:
- "cd src"
- "make pcatoy"
- "./pcatoy"
- If you are unable to run the pcatoy program successfully, please contact
- your system administrator for help, as this is a systems issue which is
- beyond our scope. Your system administrator will be able to troubleshoot
- your systems issue using this trivial program. [You can also try running
- the pcatoy program in the bin directory, which we have already compiled.]
- </help>
- </tool>