TDP_for_Ed_lab.md | searchcode

/TDP_for_Ed_lab.md

https://gitlab.com/tangming2005/Pancancer-SVs · Markdown · 70 lines · 45 code · 25 blank · 0 comment · 0 complexity · f7e014da9da4725710e43bb2b1c1a568 MD5 · raw file


grep out only the duplications for all TCGA SVs

```bash
find *bedpe | parallel 'cat {} | body grep "DUP" > {.}.DUP.txt' 

```


SNVs for DNA repair genes


```
## some of my vcf files (3Gb) are big, I want to sort the bed file and vcf files first

less dna_repair_genes.txt | sed '1d' | cut -f3-5 | sed 's/^chr//' | sort -k1,1 -k2,2n > regions.bed


```

`vcfsort.sh`:

```bash
#!/bin/bash
# Faster, but can't handle streams

[ $# -eq 0 ] && { echo "Sorts a VCF file in natural chromosome order";\
                  echo "Usage: $0 [my.vcf | my.vcf.gz]"; exit 1;
                 }
# cheers, @michaelhoffman
if (zless $1 | grep ^#; zless $1 | grep -v ^# | LC_ALL=C sort -k1,1 -k2,2n);
then
    exit 0
else
    printf 'sort failed. Does your version of sort support the -V option?\n'
    printf 'If not, you should update sort with the latest from GNU coreutils:\n'
    printf 'git clone git://git.sv.gnu.org/coreutils'
fi

```

**very powerful one liner** :)

use with caution. always specify `-j`. several vcf.gz files are 3Gb. it takes very long time and 30G memory.
Maybe sort them first is not a good idea.

From the bedtools mannual:

>File B is loaded into memory (most of the time).
>the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.

Because the `regions.bed` only contains 200 rows, it is OK to just do:  

`bedtools intersect -a myvcf.gz -b regions.bed -wa | sort | uniq`

```bash

find *gz | parallel -j 6 'bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset' 

## put each command in a txt file
find *gz | parallel -j 6 'echo "bedtools intersect -a <(./vcfsort.sh {}) -b regions.bed -wa -sorted | sort | uniq > {/.}.subset" > {/.}.commands' 

## make a pbs file for each command

find *commands | parallel 'makemsub -a {} -m 12g -c 4 -o a -j {/.} > {/.}.pbs

for pbs in *pbs
do
msub $pbs
done
```