Snakemake pipeline for detecting, annotating, filtering and visualising structure variations and gene fusions from WGS/WES.
Snakemake Workflow for Structure Variations Calling and Gene Fusion
This is a Snakemake workflow for structure variations calling using delly, tiddit, and sniffles. (More structure variations tools will be included).
We also use genefuse for calling gene fusions.
The pipeline uses trimgalore and cutadapt to trim adapters. Align the reads using bwa mem. Based on the SV tool used, we either sort and index or add readgroups and mark duplicates.
The pipeline annotates, filters, and merges samples. As well as visualize the structure variants into nice plots.
Edit the configfile
You will need to edit your config file as described below:
Config Variable | Description |
---|---|
SAMPLES | name of file containing your samples names, default: samples.tsv |
TOOLS | Leave as it is if you want to run delly, tiddit, sniffles, or remove unwanted tool |
BUILD | For AnnotSV, select genome BUILD required |
GENOME | Path to your genome file |
COHORT | Name of your Cohort |
PAIRED | True if your samples are paired, false otherwise |
RG | The READ Group |
druggable | druggable file requried by genefuse |
EXCL | Excl file required by delly |
MIN_BP | Minumum Base Pairs of variant to plot. Default is 20, going lower than 20 will cause plotting issues |
The pipeline takes samples with a suffix 'r_1.fq.gz' and 'r_2.fq.gz' if the samples are paired. Or it takes samples with suffix 'fq.gz' if the samples are single-end reads. Regardless your samples are paired or single-ended, SAMPLES should be listed in samples.tsv without the suffix.
Output
The pipeline shows some plots to help visualize the SV as follows:
Deletion Example
Duplication Example
Inversion Example
For each sample you will get a a vcf, annotated tsv file in addition to a sub-folder samplot-out in your working directory which contains all the plots of the SVs in your samples and yoursample.html page listing all the SVs per sample.
The results of gene fusion will be in an html page yoursample_report.html in your working directory.
Run the pipeline
To run the pipeline use:
snakemake -jn where n is the number of cores.
For example for using 10 cores, run:
snakemake -j10
Use Conda
For less frooodiness, to pull automatically the same versions of dependencies use:
snakemake -jn --use-conda
This will pull the same versions of tools we used. Conda has to be installed in your system.
For example, for 10 cores:
snakemake -j10 --use-conda
There is a conda env yaml file for all tools. However, ANNOTSV has to be installed. We couldn't find conda installation for it.
Run rule by rule
You can run a specific rule using the rule output, for example we can call the delly_vcf rule on a specific sample:
snakemake -j1 sample.delly.vcf --use-conda
Dry run
for a dry run use:
snakemake -j1 -n
and you can see the command printed on a dry run using:
snakemake -j1 -n -p
Keep going option
You can try the following to keep going if any issues happen, like no variants is found by one tool:
snakemake -j1 --keep-going
Cite Us
If you use the pipeline, please cite us as follows:
Sherine Awad. (2022). SherineAwad/StructureVariations: (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.6390009
References
-
Brouard, Jean-Simon, Flavio Schenkel, Andrew Marete, and Nathalie Bissonnette. "The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments." Journal of animal science and biotechnology 10, no. 1 (2019): 1-6.
-
Van der Auwera, Geraldine A., Mauricio O. Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy‐Moonshine, Tadeusz Jordan et al. "From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline." Current protocols in bioinformatics 43, no. 1 (2013): 11-10.
-
Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., ... & Banks, E. (2018). Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178.
-
Rausch, T., Zichner, T., Schlattl, A., Stütz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18), i333-i339.
-
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
-
Eisfeldt, J., Vezzi, F., Olason, P., Nilsson, D., & Lindstrand, A. (2017). TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000Research, 6.
-
Chen, S., Liu, M., Huang, T., Liao, W., Xu, M., & Gu, J. (2018). GeneFuse: detection and visualization of target gene fusions from DNA sequencing data. International journal of biological sciences, 14(8), 843–848. https://doi.org/10.7150/ijbs.24626
-
Geoffroy, V., Herenger, Y., Kress, A., Stoetzel, C., Piton, A., Dollfus, H., & Muller, J. (2018). AnnotSV: an integrated tool for structural variations annotation. Bioinformatics, 34(20), 3572-3574.
-
Jeffares, D. C., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., ... & Sedlazeck, F. J. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature communications, 8(1), 1-11.
-
Sedlazeck, F. J., Rescheneder, P., Smolka, M., Fang, H., Nattestad, M., Von Haeseler, A., & Schatz, M. C. (2018). Accurate detection of complex structural variations using single-molecule sequencing. Nature methods, 15(6), 461-468.
Code Snippets
35 36 37 38 39 40 | shell: """ mkdir -p galore mkdir -p fastqc trim_galore --gzip --retain_unpaired --trim1 --fastqc --fastqc_args "--outdir fastqc" -o galore --paired {input.r1} {input.r2} """ |
49 50 | shell: "bwa mem {input.genome} {input.r1} {input.r2} > {output}" |
58 59 60 61 62 63 | shell: """ mkdir -p galore mkdir -p fastqc trim_galore --gzip --retain_unpaired --trim1 --fastqc --fastqc_args "--outdir fastqc" -o galore {input} """ |
71 72 | shell: "bwa mem {input.genome} {input[0]} > {output}" |
80 81 82 83 | shell: """ samtools view -S -b {input} > {output} """ |
91 92 93 94 95 | shell: """ samtools sort {input} -o {output} samtools index {output} """ |
107 108 109 110 | shell: """ python src/RG.py -s {input[0]} -f {input[1]} -o {output} -p = {params.PL} """ |
119 120 | shell: "picard MarkDuplicates I={input} O={output[0]} CREATE_INDEX=true M={output[1]}" |
128 129 | shell: "wget https://raw.githubusercontent.com/dellytools/delly/master/excludeTemplates/{params}" |
139 140 | shell: "delly call -x {input.EXCL} -o {output} -g {input.genome} {input[0]}" |
148 149 | shell: "bcftools view {input} > {output}" |
160 161 162 163 | shell: """ tiddit --sv --bam {input} -o {params} """ |
171 172 173 174 | shell: """ sniffles -i {input} --vcf {output} """ |
187 188 189 190 191 | shell: """ wget https://raw.githubusercontent.com/OpenGene/GeneFuse/master/genes/{params.druggable} genefuse -r {input.genome} -f {params.druggable} -1 {input.val1} -2 {input.val2} -h {output.html} > {output.result} """ |
200 201 | shell: "$ANNOTSV/bin/AnnotSV -SVinputFile ./{input} -outputFile ./{output} -genomeBuild {params.build}" |
211 212 213 214 | shell: """ SURVIVOR merge {params[0]} 1000 10 1 1 0 30 {output[0]} """ |
225 226 227 228 | shell: """ samplot vcf --vcf {input[0]} --min_bp {params} -O png -b {input[1]} && mv samplot-out/index.html {output} """ |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | import argparse import os import sys import gzip def get_RGs(fastq_file): with gzip.open(fastq_file,'rb') as fin: for line in fin: line = line.strip().decode( "utf-8" ) break header = str(line).split(":") print(header) RG=header[0]+"_"+header[1]+"_"+header[2]+"_"+header[3] print(RG) SM = fastq_file.split("_")[0] LB = fastq_file.split("_")[1] LB = SM+"_"+LB PU = RG+"."+LB print("@RGID:",RG, "SM:", SM, "PL:Illumina", "LB:", LB, "PU:", PU) return RG, SM, LB, PU def run_picard(sam_file, outfile, RG, SM, LB, PU, PL): I = "I=" + sam_file O = "O=" + outfile group = "RGID=" +RG+ " RGSM= " +SM +" RGPL=" + PL+ " RGLB="+ LB+ " RGPU=" +PU + " VALIDATION_STRINGENCY=SILENT" cmd = "picard AddOrReplaceReadGroups " + I +" "+ O +" "+ group os.system(cmd) def main(): parser = argparse.ArgumentParser() parser.add_argument('-s', '--samfile',dest= "sam_file", required=True, help="sam file") parser.add_argument("-o", "--outfile", dest="outfile",required=True, help="outfile") parser.add_argument("-f", "--fastq", dest="fastq_file", required=True, help="fastq one pair file") parser.add_argument("-p", "--PL" , dest ="PL", default="Illumina") args = parser.parse_args() fastq_file = args.fastq_file sam_file = args.sam_file outfile = args.outfile PL =args.PL RG, SM, LB, PU = get_RGs(fastq_file) run_picard(sam_file, outfile, RG, SM, LB, PU, PL) if __name__ == "__main__": main() |
Support
- Future updates
Related Workflows





