https://gitlab.com/Grouumf/garmire_SNV_calling
# SNV computation pipeline

This package aligns reads from FASTQ files and infers SNVs from RNA-seq datasets. The pipeline is largely inspired by the [GATK variant calling best practices](http://gatkforums.broadinstitute.org/wdl/discussion/3891/calling-variants-in-rnaseq). It can also optionally infer raw gene expression, annotate SNVs, and run Quality Control (QC) checks.

* GATK reference:
  * [From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.](http://www.ncbi.nlm.nih.gov/pubmed/25431634)
* Pipeline schema:

![Pipeline schema:](./img/workflow.png)
## installation (local)

```bash
git clone git@gitlab.com:Grouumf/garmire_SNV_calling.git
cd garmire_SNV_calling
pip install -r requirements.txt --user
```
## Requirements

* The pipeline requires the following programs to be installed:
  * Linux / Unix (not tested) working environment
  * [python 2 (>=2.7)](https://www.python.org/download/releases/2.7.2/)
  * [STAR Aligner](https://github.com/alexdobin/STAR)
  * [GATK](https://software.broadinstitute.org/gatk/download/)
  * [picard-tools](https://broadinstitute.github.io/picard/)
  * [Java (>=1.8)](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
  * [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc) \[OPTIONAL\]
  * [featureCounts](http://subread.sourceforge.net/) \[OPTIONAL\]
  * [snpEff](http://snpeff.sourceforge.net/) \[OPTIONAL\]
    * The appropriate snpEff database must be downloaded and installed (see config.py). This can be done from the snpEff command line (see the snpEff documentation).
* For each sample, the FASTQ files must be inside a dedicated folder, and all the sample folders must be inside a single parent folder (see config.py file).
* A reference genome (.fa file) and a gene annotation file (.gtf) must be provided (see config.py file).
* Reference variant files must also be provided for the SNV calling procedure (see config.py file).
  * \[HUMAN\]:
    * dbSNP can be downloaded here: [ftp://ftp.ncbi.nih.gov/snp/organisms/](ftp://ftp.ncbi.nih.gov/snp/organisms/)
    * additional reference SNV resources can be downloaded here: [ftp://ftp.broadinstitute.org/bundle/2.8/hg19](ftp://ftp.broadinstitute.org/bundle/2.8/hg19)
  * \[MOUSE\]:
    * Mouse reference variant and indel databases can be downloaded here: [ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/](ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/). However, the VCF files will probably need to be re-sorted to match the mouse reference genome, using its sequence dictionary.
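
The folder layout required for the FASTQ files (one folder per sample, all of them under one parent folder) can be checked before launching anything. The snippet below is only a sketch: the function name `find_fastq_samples` and the accepted file suffixes are assumptions made for illustration, not something taken from the pipeline code.

```python
import os


def find_fastq_samples(fastq_root):
    """Map each sample folder name to the sorted FASTQ files it contains."""
    # Assumed suffixes; the pipeline itself may accept a different set.
    fastq_suffixes = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
    samples = {}
    for name in sorted(os.listdir(fastq_root)):
        sample_dir = os.path.join(fastq_root, name)
        if not os.path.isdir(sample_dir):
            continue  # only sub-folders of the parent folder count as samples
        fastqs = sorted(
            f for f in os.listdir(sample_dir)
            if f.lower().endswith(fastq_suffixes)
        )
        if fastqs:
            samples[name] = fastqs
    return samples
```

Calling `find_fastq_samples` on the parent folder declared in config.py should return one entry per sample; an empty result or a missing sample suggests the layout does not match what is described above.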
## configuration

All the environment variables should be set in the config.py file.
## usage

* Once all the environment variables are defined, the test scripts can be run \[optional\]:

```bash
python test/*.py -v
nosetests -v # alternative using nose
pytest test/test.py -v # alternative using pytest
```
* Create a STAR index for the reference genome and the read length in use:

```bash
python garmire_SNV_calling/generate_STAR_genome_index.py
```

* Align the reads:

```bash
python garmire_SNV_calling/deploy_star.py
```

* Infer SNVs:

```bash
python garmire_SNV_calling/process_multiple_snv.py
```
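
The three core steps (index, align, infer) have to be run in that order, so they can be chained from a small driver. This is a minimal sketch and not part of the package: `CORE_STEPS` and `run_core_steps` are hypothetical names, and the script is assumed to be launched from the repository root with `python` pointing at the interpreter used for the pipeline.

```python
import subprocess

# Script paths copied from the usage steps, in the order they are listed.
CORE_STEPS = [
    "garmire_SNV_calling/generate_STAR_genome_index.py",
    "garmire_SNV_calling/deploy_star.py",
    "garmire_SNV_calling/process_multiple_snv.py",
]


def run_core_steps(steps=CORE_STEPS, python="python",
                   runner=subprocess.check_call):
    """Run each step in order, stopping at the first failure.

    With the default runner, subprocess.check_call raises
    CalledProcessError as soon as a script exits with a non-zero status,
    so later steps never run on incomplete output.
    """
    for script in steps:
        runner([python, script])
```

Calling `run_core_steps()` runs the whole chain. The `runner` argument exists only so the command list can be inspected (or logged) without executing anything.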
* Check STAR overall quality (generates a csv file in OUTPUT_PATH with the percentage of uniquely mapped reads for each sample):

```bash
python garmire_SNV_calling/check_star_overall_quality.py
```
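
For reference, STAR reports the uniquely mapped percentage in each sample's `Log.final.out` file, on a line labelled `Uniquely mapped reads %`. The sketch below shows one way to extract that value from the file's text; it is an illustration with a hypothetical function name, not the code used by `check_star_overall_quality.py`.

```python
import re


def unique_mapping_percent(log_text):
    """Return the 'Uniquely mapped reads %' value as a float, or None."""
    # The line looks like: "Uniquely mapped reads % |<tab>90.12%"
    match = re.search(r"Uniquely mapped reads %\s*\|\s*([0-9.]+)%", log_text)
    return float(match.group(1)) if match else None
```

Returning `None` when the line is missing makes it easy to flag samples whose alignment did not finish.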
* Generate a FastQC report for each sample \[argument --nb_threads: number of parallel processes\]:

```bash
python garmire_SNV_calling/process_fastqc_report.py --nb_threads <int>
```

* Use the FastQC reports to generate a csv file in OUTPUT_PATH reporting, for each sample, whether FastQC's [duplicate sequences test](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html) passed:

```bash
python garmire_SNV_calling/check_fastqc_stats.py
```
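
Each FastQC report contains a tab-separated `summary.txt` file with one line per module: status, module name, input file. As an illustration of the check described above (a sketch with a hypothetical function name, not the logic of `check_fastqc_stats.py`), the status of the duplication module can be read like this:

```python
def duplication_status(summary_text):
    """Return PASS, WARN or FAIL for the duplication module, or None."""
    for line in summary_text.splitlines():
        fields = line.split("\t")
        # Each line is: <status> TAB <module name> TAB <input file>
        if len(fields) >= 2 and fields[1] == "Sequence Duplication Levels":
            return fields[0]
    return None  # module not present in this summary
```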
* Generate gene expression matrices (raw counts):

```bash
python garmire_SNV_calling/compute_frequency_matrix.py
```
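
To make the idea of a raw count matrix concrete, the sketch below combines per-sample featureCounts outputs into a genes-by-samples table. It assumes the standard featureCounts layout (a `#` comment line, a `Geneid` header line, then gene rows whose seventh column is the count) and integer counts. Both function names are hypothetical, and how `compute_frequency_matrix.py` actually builds its matrices is not described here.

```python
def read_feature_counts(text):
    """Parse featureCounts output text into {gene_id: raw_count}."""
    counts = {}
    for line in text.splitlines():
        # Skip blank lines, the '#' comment line and the column header.
        if not line or line.startswith("#") or line.startswith("Geneid"):
            continue
        fields = line.split("\t")
        # Columns: Geneid, Chr, Start, End, Strand, Length, <count>
        counts[fields[0]] = int(fields[6])
    return counts


def build_count_matrix(per_sample_counts):
    """Combine {sample: {gene: count}} into (genes, samples, rows).

    rows[i][j] is the count of genes[i] in samples[j]; a gene missing
    from a sample is filled with 0.
    """
    samples = sorted(per_sample_counts)
    genes = sorted(set(g for c in per_sample_counts.values() for g in c))
    rows = [[per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows
```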
* Annotate SNVs: generate new .vcf files with SNV annotations \[argument --nb_threads: number of parallel processes\]:

```bash
python garmire_SNV_calling/process_annotate_snv.py --nb_threads <int>
```
## contact and credits

* Developer: Olivier Poirion (PhD)
* contact: opoirion@hawaii.edu