Snakemake pipeline for detecting, annotating, filtering and visualising structure variations and gene fusions from WGS/WES.

public 1yr ago Version: v1.0.0 0 bookmarks

View Workflow

Snakemake Workflow for Structure Variations Calling and Gene Fusion

This is a Snakemake workflow for structure variations calling using delly, tiddit, and sniffles. (More structure variations tools will be included).

We also use genefuse for calling gene fusions.

The pipeline uses trimgalore and cutadapt to trim adapters. Align the reads using bwa mem. Based on the SV tool used, we either sort and index or add readgroups and mark duplicates.
The pipeline annotates, filters, and merges samples. As well as visualize the structure variants into nice plots.

Edit the configfile

You will need to edit your config file as described below:

Config Variable	Description
SAMPLES	name of file containing your samples names, default: samples.tsv
TOOLS	Leave as it is if you want to run delly, tiddit, sniffles, or remove unwanted tool
BUILD	For AnnotSV, select genome BUILD required
GENOME	Path to your genome file
COHORT	Name of your Cohort
PAIRED	True if your samples are paired, false otherwise
RG	The READ Group
druggable	druggable file requried by genefuse
EXCL	Excl file required by delly
MIN_BP	Minumum Base Pairs of variant to plot. Default is 20, going lower than 20 will cause plotting issues

The pipeline takes samples with a suffix 'r_1.fq.gz' and 'r_2.fq.gz' if the samples are paired. Or it takes samples with suffix 'fq.gz' if the samples are single-end reads. Regardless your samples are paired or single-ended, SAMPLES should be listed in samples.tsv without the suffix.

Output

The pipeline shows some plots to help visualize the SV as follows:

Deletion Example

Duplication Example

Inversion Example

For each sample you will get a a vcf, annotated tsv file in addition to a sub-folder samplot-out in your working directory which contains all the plots of the SVs in your samples and yoursample.html page listing all the SVs per sample.

The results of gene fusion will be in an html page yoursample_report.html in your working directory.

Run the pipeline

To run the pipeline use:

snakemake -jn where n is the number of cores.

For example for using 10 cores, run:

snakemake -j10

Use Conda

For less frooodiness, to pull automatically the same versions of dependencies use:

snakemake -jn --use-conda

This will pull the same versions of tools we used. Conda has to be installed in your system.

For example, for 10 cores:

snakemake -j10 --use-conda

There is a conda env yaml file for all tools. However, ANNOTSV has to be installed. We couldn't find conda installation for it.

Run rule by rule

You can run a specific rule using the rule output, for example we can call the delly_vcf rule on a specific sample:

snakemake -j1 sample.delly.vcf --use-conda

Dry run

for a dry run use:

snakemake -j1 -n

and you can see the command printed on a dry run using:

snakemake -j1 -n -p

Keep going option

You can try the following to keep going if any issues happen, like no variants is found by one tool:

snakemake -j1 --keep-going

Cite Us

If you use the pipeline, please cite us as follows:

 Sherine Awad. (2022). SherineAwad/StructureVariations: (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.6390009

References

Brouard, Jean-Simon, Flavio Schenkel, Andrew Marete, and Nathalie Bissonnette. "The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments." Journal of animal science and biotechnology 10, no. 1 (2019): 1-6.
Van der Auwera, Geraldine A., Mauricio O. Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy‐Moonshine, Tadeusz Jordan et al. "From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline." Current protocols in bioinformatics 43, no. 1 (2013): 11-10.
Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., ... & Banks, E. (2018). Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178.
Rausch, T., Zichner, T., Schlattl, A., Stütz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18), i333-i339.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
Eisfeldt, J., Vezzi, F., Olason, P., Nilsson, D., & Lindstrand, A. (2017). TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000Research, 6.
Chen, S., Liu, M., Huang, T., Liao, W., Xu, M., & Gu, J. (2018). GeneFuse: detection and visualization of target gene fusions from DNA sequencing data. International journal of biological sciences, 14(8), 843–848. https://doi.org/10.7150/ijbs.24626
Geoffroy, V., Herenger, Y., Kress, A., Stoetzel, C., Piton, A., Dollfus, H., & Muller, J. (2018). AnnotSV: an integrated tool for structural variations annotation. Bioinformatics, 34(20), 3572-3574.
Jeffares, D. C., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., ... & Sedlazeck, F. J. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature communications, 8(1), 1-11.
Sedlazeck, F. J., Rescheneder, P., Smolka, M., Fang, H., Nattestad, M., Von Haeseler, A., & Schatz, M. C. (2018). Accurate detection of complex structural variations using single-molecule sequencing. Nature methods, 15(6), 461-468.

Code Snippets

shell:
    """
    mkdir -p galore
    mkdir -p fastqc
    trim_galore --gzip --retain_unpaired --trim1 --fastqc --fastqc_args "--outdir fastqc" -o galore --paired {input.r1} {input.r2}
    """

SnakeMake FastQC Trim_Galore From line 35 of master/Snakefile

shell:
    "bwa mem {input.genome} {input.r1} {input.r2} > {output}"

SnakeMake BWA From line 49 of master/Snakefile

shell:
    """
    mkdir -p galore
    mkdir -p fastqc
    trim_galore --gzip --retain_unpaired --trim1 --fastqc --fastqc_args "--outdir fastqc" -o galore {input}
    """

SnakeMake FastQC Trim_Galore From line 58 of master/Snakefile

shell:
   "bwa mem {input.genome} {input[0]} > {output}" 

SnakeMake BWA From line 71 of master/Snakefile

shell: 
     """ 
     samtools view -S -b {input} > {output}
     """

SnakeMake SAMtools From line 80 of master/Snakefile

shell: 
    """
    samtools sort {input} -o {output}
    samtools index {output}
    """  

SnakeMake SAMtools From line 91 of master/Snakefile

shell:
    """
     python src/RG.py -s {input[0]} -f {input[1]} -o {output} -p = {params.PL}
    """

SnakeMake From line 107 of master/Snakefile

shell:
   "picard MarkDuplicates I={input} O={output[0]} CREATE_INDEX=true M={output[1]}"

SnakeMake Picard From line 119 of master/Snakefile

shell: 
   "wget https://raw.githubusercontent.com/dellytools/delly/master/excludeTemplates/{params}"

SnakeMake From line 128 of master/Snakefile

shell:
  "delly call -x {input.EXCL}  -o {output} -g {input.genome} {input[0]}"

SnakeMake delly From line 139 of master/Snakefile

shell:
    "bcftools view {input} > {output}"

SnakeMake BCFtools From line 148 of master/Snakefile

shell: 
  """ 
     tiddit --sv --bam {input} -o {params}
  """

SnakeMake TIDDIT From line 160 of master/Snakefile

shell:
    """
    sniffles -i {input} --vcf {output}
    """

SnakeMake sniffles From line 171 of master/Snakefile

shell:
    """
    wget https://raw.githubusercontent.com/OpenGene/GeneFuse/master/genes/{params.druggable}
    genefuse -r {input.genome}  -f {params.druggable} -1 {input.val1} -2 {input.val2} -h {output.html} > {output.result}
    """

SnakeMake From line 187 of master/Snakefile

shell:
    "$ANNOTSV/bin/AnnotSV -SVinputFile ./{input} -outputFile ./{output} -genomeBuild {params.build}"         

SnakeMake AnnotSV From line 200 of master/Snakefile

shell: 
   """
     SURVIVOR merge {params[0]}  1000 10 1 1 0 30 {output[0]}
   """ 

SnakeMake SURVIVOR From line 211 of master/Snakefile

shell: 
   """
   samplot vcf --vcf {input[0]} --min_bp {params} -O png -b {input[1]} && mv samplot-out/index.html {output}   
   """ 

SnakeMake Samplot From line 225 of master/Snakefile

import argparse
import os
import sys
import gzip

def get_RGs(fastq_file):

    with gzip.open(fastq_file,'rb') as fin:        
        for line in fin:        
            line = line.strip().decode( "utf-8" )  
            break
        header = str(line).split(":")
        print(header)
        RG=header[0]+"_"+header[1]+"_"+header[2]+"_"+header[3]
        print(RG)
    SM = fastq_file.split("_")[0]
    LB = fastq_file.split("_")[1]
    LB = SM+"_"+LB 
    PU = RG+"."+LB 
    print("@RGID:",RG, "SM:", SM, "PL:Illumina", "LB:", LB, "PU:", PU)
    return RG, SM, LB, PU 

def run_picard(sam_file, outfile, RG, SM, LB, PU, PL): 
    I = "I=" + sam_file  
    O = "O=" + outfile 
    group = "RGID=" +RG+ " RGSM= " +SM +" RGPL=" + PL+ " RGLB="+ LB+ " RGPU=" +PU + " VALIDATION_STRINGENCY=SILENT" 
    cmd = "picard AddOrReplaceReadGroups " + I +" "+ O +" "+ group 
    os.system(cmd)     

def main(): 
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--samfile',dest= "sam_file", 
                        required=True, help="sam file")
    parser.add_argument("-o", "--outfile", dest="outfile",required=True,
                        help="outfile")
    parser.add_argument("-f", "--fastq", dest="fastq_file", required=True,
                        help="fastq one pair file")
    parser.add_argument("-p", "--PL" , dest ="PL", default="Illumina") 
    args = parser.parse_args()
    fastq_file = args.fastq_file  
    sam_file = args.sam_file 
    outfile = args.outfile
    PL =args.PL 
    RG, SM, LB, PU = get_RGs(fastq_file)
    run_picard(sam_file, outfile, RG, SM, LB, PU, PL) 
if __name__ == "__main__":
    main()