/TDP_for_Ed_lab.md

https://gitlab.com/tangming2005/Pancancer-SVs · Markdown · 70 lines · 45 code · 25 blank · 0 comment · 0 complexity · f7e014da9da4725710e43bb2b1c1a568 MD5 · raw file

  1. grep out only the duplications for all TCGA SVs
  2. ```bash
  3. find *bedpe | parallel 'cat {} | body grep "DUP" > {.}.DUP.txt'
  4. ```
  5. SNVs for DNA repair genes
  6. ```
  7. ## some of my vcf files (3Gb) are big, I want to sort the bed file and vcf files first
  8. less dna_repair_genes.txt | sed '1d' | cut -f3-5 | sed 's/^chr//' | sort -k1,1 -k2,2n > regions.bed
  9. ```
  10. `vcfsort.sh`:
  11. ```bash
  12. #!/bin/bash
  13. # Faster, but can't handle streams
  14. [ $# -eq 0 ] && { echo "Sorts a VCF file in natural chromosome order";\
  15. echo "Usage: $0 [my.vcf | my.vcf.gz]"; exit 1;
  16. }
  17. # cheers, @michaelhoffman
  18. if (zless $1 | grep ^#; zless $1 | grep -v ^# | LC_ALL=C sort -k1,1 -k2,2n);
  19. then
  20. exit 0
  21. else
  22. printf 'sort failed. Does your version of sort support the -V option?\n'
  23. printf 'If not, you should update sort with the latest from GNU coreutils:\n'
  24. printf 'git clone git://git.sv.gnu.org/coreutils'
  25. fi
  26. ```
  27. **very powerful one liner** :)
  28. use with caution. always specify `-j`. several vcf.gz files are 3Gb. it takes very long time and 30G memory.
  29. Maybe sort them first is not a good idea.
  30. From the bedtools mannual:
  31. >File B is loaded into memory (most of the time).
  32. >the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.
  33. Because the `regions.bed` only contains 200 rows, it is OK to just do:
  34. `bedtools intersect -a myvcf.gz -b regions.bed -wa | sort | uniq`
  35. ```bash
  36. find *gz | parallel -j 6 'bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset'
  37. ## put each command in a txt file
  38. find *gz | parallel -j 6 'echo "bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset" > {/.}.commands'
  39. ## make a pbs file for each command
  40. find *commands | parallel 'makemsub -a {} -m 12g -c 4 -o a -j {/.} > {/.}.pbs
  41. for pbs in *pbs
  42. do
  43. msub $pbs
  44. done
  45. ```