/TDP_for_Ed_lab.md
https://gitlab.com/tangming2005/Pancancer-SVs · Markdown · 70 lines · 45 code · 25 blank · 0 comment · 0 complexity · f7e014da9da4725710e43bb2b1c1a568 MD5 · raw file
- grep out only the duplications for all TCGA SVs
- ```bash
- find *bedpe | parallel 'cat {} | body grep "DUP" > {.}.DUP.txt'
- ```
- SNVs for DNA repair genes
- ```
- ## some of my vcf files (3Gb) are big, I want to sort the bed file and vcf files first
- less dna_repair_genes.txt | sed '1d' | cut -f3-5 | sed 's/^chr//' | sort -k1,1 -k2,2n > regions.bed
- ```
- `vcfsort.sh`:
- ```bash
- #!/bin/bash
- # Faster, but can't handle streams
- [ $# -eq 0 ] && { echo "Sorts a VCF file in natural chromosome order";\
- echo "Usage: $0 [my.vcf | my.vcf.gz]"; exit 1;
- }
- # cheers, @michaelhoffman
- if (zless $1 | grep ^#; zless $1 | grep -v ^# | LC_ALL=C sort -k1,1 -k2,2n);
- then
- exit 0
- else
- printf 'sort failed. Does your version of sort support the -V option?\n'
- printf 'If not, you should update sort with the latest from GNU coreutils:\n'
- printf 'git clone git://git.sv.gnu.org/coreutils'
- fi
- ```
- **very powerful one liner** :)
- use with caution. always specify `-j`. several vcf.gz files are 3Gb. it takes very long time and 30G memory.
- Maybe sort them first is not a good idea.
- From the bedtools mannual:
- >File B is loaded into memory (most of the time).
- >the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.
- Because the `regions.bed` only contains 200 rows, it is OK to just do:
- `bedtools intersect -a myvcf.gz -b regions.bed -wa | sort | uniq`
- ```bash
- find *gz | parallel -j 6 'bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset'
- ## put each command in a txt file
- find *gz | parallel -j 6 'echo "bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset" > {/.}.commands'
- ## make a pbs file for each command
- find *commands | parallel 'makemsub -a {} -m 12g -c 4 -o a -j {/.} > {/.}.pbs
- for pbs in *pbs
- do
- msub $pbs
- done
- ```