Genome Re-sequencing Analysis Snakemake Workflow: De-novo and Variant Calling Modes
This Snakemake workflow is for analysing genome re-sequencing experiments. It features two modes. The **de-novo** mode is used to confirm sample relationships from the raw sequencing reads with [kwip](https://github.com/kdmurray91/kWIP) and [mash](https://github.com/marbl/Mash). The **varcall** mode performs read alignment to one or several reference genomes, followed by variant detection. Read alignments can be performed with [bwa](http://bio-bwa.sourceforge.net/bwa.shtml) and/or NextGenMap.
Usage
- Create a new GitHub repository in your GitHub account, using this workflow as a template.
- Clone your newly created repository to the local system where you want to perform the analysis.
- Set up the software dependencies.
- Configure the workflow for your needs and input files.
- Run the workflow.
- Archive your workflow to document your work and allow easy reproduction.

Some pointers for setting up, configuring, and running the workflow are given below; for details, please consult the technical documentation.
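In shell terms, a typical session might look like the sketch below. The repository URL placeholder and the core count are example values; each step is explained in the sections that follow.

```
git clone https://github.com/your-username/dna-proto-workflow.git   # your copy of the workflow
cd dna-proto-workflow
conda env create --name dna-proto --file envs/condaenv.yml          # install dependencies
conda activate dna-proto
# edit Snakefile, config.yml, metadata/, genomes_and_annotations/, snpEff.config for your project
snakemake -npr                                                      # dry run: check what would be done
snakemake -j 16                                                     # real run; adjust the core count
```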
Setup
An easy way to set up the dependencies is conda.
Get the Miniconda Python 3 distribution:
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
Create an environment with the required software:
NOTE: the conda environment name in these examples is `dna-proto`.
$ conda env create --name dna-proto --file envs/condaenv.yml
Activate the environment:
$ conda activate dna-proto
Additional useful conda commands can be found in the conda documentation.
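For convenience, a few standard conda commands that are often useful here (generic conda CLI, not specific to this workflow):

```
# list available environments
conda env list
# update the dna-proto environment after changes to envs/condaenv.yml
conda env update --name dna-proto --file envs/condaenv.yml
# leave the environment when finished
conda deactivate
```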
----
Check config and metadata
We provide scripts to list metadata and configuration parameters in `utils/`:
python utils/check_metadata.py
python utils/check_config.py
Visualising the workflow
You can inspect the workflow in graphical form by printing the so-called DAG (directed acyclic graph) of jobs.
snakemake --dag -npr -j 1 | dot -Tsvg > dag.svg
eog dag.svg
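For large projects the full job DAG can become hard to read; Snakemake's standard `--rulegraph` option prints a condensed, per-rule graph instead, which can be rendered the same way:

```
snakemake --rulegraph -npr -j 1 | dot -Tsvg > rulegraph.svg
```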
Pretending a run of the workflow
Prior to running the workflow, pretend a run and confirm it will do what is intended.
snakemake -npr
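Here `-n` performs a dry run (nothing is executed), `-p` prints the shell commands that would be run, and `-r` prints the reason each job is scheduled. Once the dry run looks as intended, start the workflow, for example:

```
# real run; the core count is an example value, adjust it to your machine
snakemake -j 16
```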
Data
Main directory content:
.
├── envs
├── genomes_and_annotations
├── metadata
├── output
├── rules
├── scripts
├── utils
├── config.yml
├── Snakefile
├── snpEff.config
NOTE: the `output` directory and some files in the `metadata` directory are/will be generated by the workflow.
You will need to configure the workflow for your specific project. For details see the technical documentation. The following files and directories will need editing:
- Snakefile
- genomes_and_annotations/
- metadata/
- config.yml
- snpEff.config
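For orientation only, the sketch below shows roughly what such a configuration can look like. All keys and paths here are hypothetical placeholders (only the `abra2: temp:` key is actually referenced in the code snippets further down); the shipped `config.yml` and the technical documentation define the real schema.

```
# hypothetical sketch -- consult the shipped config.yml for the actual keys and structure
samples: metadata/samples.tsv                             # placeholder: sample sheet location
references:
  genome_A: genomes_and_annotations/genome_A/genome.fa    # placeholder: reference genome path
abra2:
  temp: /tmp/abra2                                        # temp directory used by the BAM-processing rules
```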
You can download example data for testing the workflow.
--
Clone the repository
Now **clone the forked repository** to your machine. Go to your GitHub account, open the forked repository, click on the clone button and then click the *copy to clipboard* icon. The URL will look like ```https://github.com/your-username/dna-proto-workflow.git```, where `your-username` is your GitHub username.
Open a terminal and run the following git command:
git clone https://github.com/your-username/dna-proto-workflow.git
Once you've cloned your fork, you can edit your local copy. However, if you want to contribute, you'll need to create a new branch.
Create a branch
Change to the repository directory on your computer (if you are not already there):
NOTE: Don't change the name of this directory!
cd dna-proto-workflow
You can check your branches and the active branch using the `git branch` command:
git branch -a
Now create a branch using the `git checkout` command:
git checkout -b new-branch-name
For example:
git checkout -b development
From this point, you are in the new branch and edits only affect your branch. If things go wrong, you can switch back to the `master` branch with `git checkout master` and, if needed, delete your branch with `git branch -d name-of-the-branch`.
Make changes and commit
Once you've modified something, you can confirm that there are changes with `git status` (called from the top-level directory). Add those changes to your branch with `git add`:
git status
git add .
or
git add name_of_the_file_you_modified
Commit those changes with `git commit`:
git commit -m "write a message"
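When you are ready to share your work (for example to open a pull request), push the branch to your fork; `development` below is just the example branch name used above:

```
git push -u origin development
```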
Code Snippets
```
shell:
    "python3 scripts/tidybamstat.py"
    " -o output/alnstats/everything"  # prefix
    " {input}"
    ">'{log}' 2>&1"
```
```
shell:
    "( bwa mem"
    " -p"  # paired input
    " -t {threads}"
    " -R '@RG\\tID:{wildcards.run}_{wildcards.lib}\\tSM:{params.sample}'"
    " {input.ref}"
    " {input.reads}"
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( java"
    " -{params.mem}"
    " -jar {params.abra_release}"
    " --in {input.set}"
    " --out {output}"
    " --ref {params.ref}"
    " --threads {params.threads}"
    " --targets {params.region}"
    " --tmpdir {params.abra_temp}"
    ") >'{log}' 2>&1"
```
```
shell:
    "( samtools fixmate "
    " -m"
    " -@ {threads}"
    " --output-fmt bam,level=0"
    " {input.bam}"
    " -"
    " | samtools sort"
    " -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_sort_$RANDOM"
    " --output-fmt bam,level=0"
    " -@ {threads}"
    " -m 1g"
    " -"
    " | samtools markdup"
    " -T {config[abra2][temp]}/{wildcards.run}_{wildcards.lib}_markdup_$RANDOM"
    " -s"  # report stats
    " -@ {threads}"
    " --output-fmt bam,level=3"
    " -"
    " {output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( samtools merge"
    " -@ {threads}"
    " --output-fmt bam,level=4"
    " {output.bam}"
    " {input}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( unset DISPLAY; qualimap bamqc"
    " --java-mem-size=4G"
    " -bam {input.bam}"
    " -nr 10000"
    " -nt {threads}"
    " -outdir {output}"
    " {input}"
    " ) >'{log}' 2>&1"
```
```
run:
    with open(output[0], "w") as fh:
        for s in input:
            print(s, file=fh)
```
```
shell:
    "( samtools merge"
    " --output-fmt bam,level=7"
    " -@ {threads}"
    " -"
    " {input}"
    " | tee {output.bam}"
    " | samtools index - {output.bai}"  # indexing takes bloody ages, we may as well do this on the fly
    " ) >'{log}' 2>&1"
```
```
shell:
    "(samtools stats -i 5000 -x {input} >{output}) >'{log}'"
```
```
shell:
    "samtools index {input}"
```
```
shell:
    "samtools view -Suh {input} >{output}"
```
```
shell:
    "( ngm"
    " -q {input.reads}"
    " --paired --broken-pairs"
    " -r {input.ref}"
    " -t {threads}"
    " --rg-id {wildcards.run}_{wildcards.lib}"
    " --rg-sm {params.sample}"
    " --sensitivity {params.sensitivity}"  # this is the mean from a bunch of different runs
    "| samtools view -Suh - >{output.bam}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( snpEff ann"
    " -filterInterval {input.bed}"
    " -csvStats {output.csvstats}"
    " -htmlStats {output.htmlStats}"
    " {params.extra}"
    " {params.database}"
    " {input.vcf}"
    " > {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
script:
    "../scripts/pca_mash.R"
```
```
script:
    "../scripts/pca_kwip.R"
```
```
shell:
    " mash sketch"
    " -k {wildcards.ksize}"
    " -s {wildcards.sketchsize}"
    " -p {threads}"
    " -o {output}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "mash dist"
    " -p {threads}"
    " -t"  # tabular format
    " {input} {input}"  # needs input twice
    " >{output}"
    " 2>{log}"
```
```
shell:
    "load-into-counting.py"
    " -N 1"
    " -x {wildcards.sketchsize}"
    " -k {wildcards.ksize}"
    " -b"
    " -f"
    " -s tsv"
    " -T {threads}"
    " {output.ct}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "kwip"
    " -d {output.d}"
    " -k {output.k}"
    " -t {threads}"
    " {input}"
    " >{log} 2>&1"
```
```
shell:
    "( kdm-unique-kmers.py"
    " -t {threads}"
    " -k {params.kmersize}"
    " {input}"
    " >{output}"
    " ) 2>{log}"
```
```
shell:
    "( sourmash compute"
    " --name '{wildcards.sample}'"
    " -k {wildcards.ksize}"
    " -n {wildcards.sketchsize}"
    " -o {output}"
    " {input}"
    ") >{log} 2>&1"
```
```
shell:
    "(sourmash compare -k {wildcards.ksize} -o {output} {input} ) >{log} 2>&1"
```
```
shell:
    "( AdapterRemoval"
    " --file1 {input.r1}"
    " --file2 {input.r2}"
    " --adapter1 {params.adp1}"
    " --adapter2 {params.adp2}"
    " --combined-output"
    " --interleaved-output"
    " --trimns"
    " --trimqualities"
    " --trimwindows 10"
    " --minquality {params.minqual}"
    " --threads 2"
    " --settings {log.settings}"
    " --output1 /dev/stdout"
    " | seqhax pairs"
    " -l 20"
    " -b >(pigz -p 5 >{output.reads})"
    " /dev/stdin"
    ") >'{log.log}' 2>&1"
```
```
shell:
    "( AdapterRemoval"
    " --file1 {input}"
    " --adapter1 {params.adp1}"
    " --adapter2 {params.adp2}"
    " {params.extra}"
    " --minquality {params.minqual}"
    " --threads 2"
    " --settings {log.settings}"
    " --output1 /dev/stdout"
    " | seqhax pairs"
    " -l 20"
    " -b >(pigz -p 5 >{output.reads})"
    " /dev/stdin"
    ") >'{log.log}' 2>&1"
```
```
shell:
    "cat {input} > {output}"
```
```
shell:
    "( seqhax stats"
    " -t {threads}"
    " {input}"
    " >{output}"
    " ) 2>'{log}'"
```
```
shell:
    "( seqhax stats"
    " -t {threads}"
    " {input}"
    " >{output}"
    " ) 2>'{log}'"
```
```
script:
    "../scripts/plot-quals.py"
```
```
shell:
    "( freebayes"
    " --theta {params.theta}"
    " --samples {input.sset}"
    " --ploidy 2"
    " --use-best-n-alleles 3"
    " --min-mapping-quality {params.minmq}"
    " --min-base-quality {params.minbq}"
    " --read-max-mismatch-fraction 0.1"
    " --min-alternate-fraction 0"
    " --min-alternate-count 2"  # per sample
    " --min-alternate-total 5"  # across all samples
    " --min-coverage 10"  # across all samples
    " --prob-contamination 1e-6"
    " --use-mapping-quality"
    " --strict-vcf"
    " --region '{wildcards.region}'"
    " --fasta-reference {input.ref}"
    " {input.bam}"
    " | bcftools view"
    " -O b -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools mpileup"
    " --adjust-MQ 50"
    " --redo-BAQ"
    " --max-depth 20000"  # the default per file max (250x) is insane, i.e. <1x for most sets. new limit of 20000x equates to a max. of 20x across all samples.
    " --min-MQ {params.minmq}"
    " --min-BQ {params.minbq}"
    " --fasta-ref {input.ref}"
    " --samples-file {input.sset}"
    " --annotate FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD"  # output extra tags
    " --region '{wildcards.region}'"
    " --output-type u"  # uncompressed
    " {input.bam}"
    " | bcftools call"
    " --targets '{wildcards.region}'"  # might not be needed
    " --multiallelic-caller"
    " --prior {params.theta}"
    " -O b"
    " -o {output.bcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {params.filtarg}"
    " '{input.bcf}'"
    " -O b -o '{output.bcf}'"
    " ) >'{log}' 2>&1"
```
```
run:
    with open(output[0], "w") as fh:
        for s in sorted(input):
            print(s, file=fh)
```
```
shell:
    "( bcftools concat"
    " --threads {threads}"
    " -O b"
    " -o {output.bcf}"
    " --file-list {input.fofn}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {input.bcf}"
    " -O z"
    " --threads {threads}"
    " -o {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "( bcftools view"
    " {input.bcf}"
    " -O v"
    " --threads {threads}"
    " -o {output.vcf}"
    " ) >'{log}' 2>&1"
```
```
shell:
    "bcftools index -c -f {input} && bcftools index -t -f {input}"
```
```
shell:
    "bcftools stats -s - -d 0,1000,2 --threads {threads} {input} >{output}"
```
```
library(rgl)

print(snakemake@input[[1]])
print(snakemake@output[[1]])

y0 <- read.delim(snakemake@input[[1]], header=T)
data <- as.matrix(y0[, -1])
labels <- colnames(data)

pca <- prcomp(data, scale=F)
print('PCA done!')

plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]], "pdf")
```
```
library(rgl)

print(snakemake@input[[1]])
print(snakemake@output[[1]])

y0 <- read.delim(snakemake@input[[1]], header=T)
data <- as.matrix(y0[, -1])
labels <- as.data.frame(do.call(rbind, strsplit(colnames(data), "[.]")))$V4

pca <- prcomp(data, scale=F)
print('PCA done!')

plot3d(pca$x[,c(1,2,3)], size=10)
text3d(pca$x[,c(1,2,3)], text=labels)
rgl.postscript(snakemake@output[[1]], "pdf")
```
```
import matplotlib
import matplotlib.pyplot as plt
from pysam import VariantFile
import numpy as np

###matplotlib.use("Agg")

print("$ Script: plot-quals")
print("INPUT: ", snakemake.input)

quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
```
```
from sys import stderr, stdout, stdin, argv
from os.path import basename, splitext
import re
import json
import csv
import argparse

try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda x: x


def doSN(line, data):
    if "SN" not in data:
        data["SN"] = {}
    row = line.rstrip('\n').split('\t')
    field = re.sub(r'\(|\)|:|%', '', row[1])
    field = re.sub(r'\s+$|^\s+', '', field)
    field = re.sub(r'\s+', '_', field)
    val = row[2]
    data["SN"][field] = val


# TODO: not handling this right now
# def doTableByCycle(line, data, key):
#     cols = {
#         "FFQ": ["cycle_number", "quality"],
#         "LFQ": ["cycle_number", "quality"],
#     }

TABLE_COLS = {
    "IS": ["insert_size", "pairs", "pairs_in", "pairs_out", "pairs_other"],
    "ID": ["indel_size", "n_insertion", "n_deletion"],
    "GCF": ["gc_content", "n_reads"],
    "GCL": ["gc_content", "n_reads"],
    "GCC": ["cycle_number", "a_pct", "c_pct", "g_pct", "t_pct", "n_pct", None],
    "RL": ["read_length", "n_reads"],
    "COV": ["coverage_bin", None, "genome_bp"],
}


def doTable(line, data):
    key = line.strip().split('\t')[0]
    if key not in data:
        data[key] = []
    vals = line.strip().split("\t")
    valdict = {}
    for i, col in enumerate(TABLE_COLS[key]):
        if col is None:
            continue
        valdict[col] = vals[i+1]
    data[key].append(valdict)


def dump(samples, key, outfile):
    with open(outfile, "w") as fh:
        csvfh = None
        for sample, alldata in samples.items():
            if key not in alldata:
                continue
            data = alldata[key]
            if isinstance(data, dict):
                # for some where rows are split over multiple lines, e.g. SN
                data = [data, ]
            if csvfh is None:
                header = ["sample", ] + list(data[0].keys())
                csvfh = csv.DictWriter(fh, header)
                csvfh.writeheader()
            for row in data:
                row["sample"] = sample
                csvfh.writerow(row)


def main():
    p = argparse.ArgumentParser(prog="tidybamstat")
    p.add_argument("-o", "--outprefix", type=str, required=True,
                   help="Output prefix")
    p.add_argument("input", type=str, nargs='+')
    args = p.parse_args()

    samples = {}
    print("Load alignment stats:", file=stderr)
    for file in tqdm(args.input):
        data = dict()
        with open(file) as fh:
            for line in fh:
                dtype = line.strip().split("\t")[0]
                if line.startswith("#"):
                    continue
                elif line.startswith("SN"):
                    doSN(line, data)
                elif dtype in TABLE_COLS:
                    doTable(line, data)
        sample = splitext(basename(file))[0]
        samples[sample] = data

    allkeys = ["SN"] + list(TABLE_COLS.keys())
    print("Dumping data by data type: ", file=stderr, end='', flush=True)
    for key in allkeys:
        print(f"{key}, ", file=stderr, end='', flush=True)
        outfile = f"{args.outprefix}_{key}.csv"
        dump(samples, key, outfile)
    print(" Done!", file=stderr)


if __name__ == "__main__":
    main()
```
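For reference, `scripts/tidybamstat.py` can also be run by hand on a set of `samtools stats` reports; judging from its argparse definition above, it takes an output prefix and one or more input files. The output prefix matches how the workflow calls it in the first snippet; the input paths below are placeholders.

```
# -o/--outprefix: prefix for the per-table CSV files written by dump()
# positional arguments: samtools stats text reports (placeholder paths)
python3 scripts/tidybamstat.py -o output/alnstats/everything output/alnstats/*.stats.txt
```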
Support
- Future updates