A straightforward bioinformatic pipeline for detecting the presence of a bacterial strain in one or more metagenomes.
StrainSifter is based on Snakemake. The pipeline outputs phylogenetic trees showing strain relatedness of input samples, as well as pairwise counts of single-nucleotide variants (SNVs) between input samples.
Installation
To run StrainSifter, you must have miniconda3 and Snakemake installed.
Install instructions (One time only)
- Download and install miniconda3. For Linux:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
- Clone the StrainSifter workflow to the directory where you wish to run the pipeline:
git clone https://github.com/bhattlab/strainsifter
- Create the new conda environment:

cd strainsifter
conda env create -f envs/environment.yaml
- Install Snakemake:
conda install snakemake -c bioconda -c conda-forge
StrainSifter has been developed and tested with Snakemake version 5.1.4 or higher. Check your version by typing:
snakemake --version

If you are running a version of Snakemake prior to 5.1.4, update to the latest version:
conda update snakemake -c bioconda -c conda-forge
Activate the conda environment (Every time you use StrainSifter)
source activate ssift
Dependencies
We recommend running StrainSifter in the provided conda environment. If you wish to run StrainSifter without using the conda environment, the following tools (as invoked by the workflow rules below) must be installed and in your system PATH: bwa, samtools, bamtools, bedtools, muscle, fasttree, Python 3, and R with the ggtree, phangorn, and ggplot2 packages.
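If you take this route, a quick PATH probe can confirm that each tool is discoverable. This is a minimal sketch (the tool list mirrors the commands used in the workflow rules below; extend it as needed):

for tool in bwa samtools bamtools bedtools muscle fasttree; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done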
Running StrainSifter
Due to the computing demands of the StrainSifter pipeline, we recommend running it on a computing cluster if possible. Instructions for enabling Snakemake to schedule cluster jobs with SLURM can be found at https://github.com/bhattlab/slurm.
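With such a setup, a cluster run might look like the sketch below. This is not the project's documented invocation; the partition, time, and memory values are placeholders to adapt to your cluster:

snakemake {prefix}.tree.pdf --jobs 50 \
    --cluster "sbatch -p <partition> -t 12:00:00 --mem=16G -c {threads}"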
Input files
- Reference genome assembly in fasta format (can be a draft genome or a finished reference genome). Acceptable file extensions: ".fasta", ".fa", ".fna"
- Two or more short read datasets in fastq format (metagenomic reads or isolate reads), optionally gzipped. Acceptable file extensions: ".fq", ".fastq", ".fq.gz", ".fastq.gz"
Short read data can be paired- or single-end.
You will need to list the input files for each sample you wish to run StrainSifter on in the config file. This is described in the Config section below.
Config
You must update the config.yaml file as follows:
reference: Path to reference genome (fasta format)
reads: Samples and the file path(s) to the input reads.
Optionally, you can update the following parameters:
prefix: desired prefix for output file names; if left blank, the name of the reference genome is used
mapq: minimum mapping quality score to evaluate a read alignment
n_mismatches: consider reads with this many mismatches or fewer
min_cvg: minimum read depth to determine the nucleotide at any given position
min_genome_percent: the minimum fraction of bases that must be covered at min_cvg or greater to process a sample
base_freq: minimum frequency of a nucleotide to call a base at any position
Example config.yaml:
##### input files #####

# reference genome (required)
reference: /path/to/ref.fna

# short read data (at least two samples required)
reads:
  sample1:
    [/path/to/sample1_R1.fq,
     /path/to/sample1_R2.fq]
  sample2:
    [/path/to/sample2_R1.fq,
     /path/to/sample2_R2.fq]
  sample3: /path/to/sample3.fq

# prefix for output files (optional - can leave blank)
prefix:

##### workflow parameters #####

# alignment parameters:
mapq: 60
n_mismatches: 5

# variant calling parameters:
min_cvg: 5
min_genome_percent: 0.5
base_freq: 0.8
Running the pipeline
To run StrainSifter, the config file must be present in the directory where you wish to run the workflow. You can then run StrainSifter as follows:
Phylogeny
To generate a phylogenetic tree showing all of the input samples that contain your strain of interest at sufficient coverage to profile:
snakemake {prefix}.tree.pdf
SNV counts
To generate a list of pairwise SNV counts between all input samples:
snakemake {prefix}.dist.tsv
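For example, with the sample config above (prefix left blank, reference /path/to/ref.fna), the outputs take their name from the reference genome. The sketch below assumes the default prefix resolves to "ref"; a dry run (-n) is a convenient way to confirm the resolved target names before committing compute:

snakemake -n ref.tree.pdf ref.dist.tsv   # dry run: list the jobs that would run
snakemake --cores 8 ref.tree.pdf ref.dist.tsv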
FAQ
Q: Can StrainSifter be used for non-bacterial genomes (e.g. yeast)?
A: At present, we recommend StrainSifter for bacteria only. In theory, StrainSifter should work for yeast if a haploid reference genome is provided.
Code Snippets
# scripts/callSNPs.py: call a consensus base at each position of a samtools pileup

import re

DEBUG = 0
ASCII_OFFSET = ord("!")
INDEL_PATTERN = re.compile(r"[+-](\d+)")
INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
ROUND = 3  # places to the right of the decimal point

min_coverage, min_proportion, min_qual = snakemake.params
pileup_file = snakemake.input[0]
out_file = snakemake.output[0]

min_coverage = int(min_coverage)
min_proportion = float(min_proportion)
min_qual = int(min_qual)

## function to process each line of pileup
def parse_pileup(line):
    # read fields from line of pileup
    chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
    # proportion is 0 if no consensus base; top base is N by default
    parsed = {
        "proportion": 0.0,
        "chromosome": chromosome,
        "position": int(position),
        "reference": reference,
        "coverage": int(coverage),
        "pileup": pileup,
        "quality": quality,
        "top_base": "N"}
    # if the base coverage is below the minimum, leave the call as N
    if parsed["coverage"] < min_coverage:
        return parsed
    # uppercase pileup string for processing
    pileup = pileup.upper()
    # remove read start and end markers from the pileup string
    pileup = re.sub(r"\^.|\$", "", pileup)
    # remove indels from the pileup string
    start = 0
    while True:
        match = INDEL_PATTERN.search(pileup, start)
        if match:
            integer = match.group(1)
            pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
            start = match.start()
        else:
            break
    # get total base count and top base count
    total = 0
    top_base = "N"
    top_base_count = 0
    # uppercase reference base for comparison
    reference = reference.upper()
    base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    quality_length = len(quality)
    for i in range(quality_length):
        # convert ASCII character to phred base quality
        base_quality = ord(quality[i]) - ASCII_OFFSET
        # only count high-quality bases
        if base_quality >= min_qual:
            currBase = pileup[i]
            if currBase in base_counts:
                base_counts[currBase] += 1
            else:
                # "." and "," in a pileup match the reference base
                base_counts[reference] += 1
            total += 1
    parsed["total"] = total
    # find the most frequent base
    for base in base_counts:
        if base_counts[base] > top_base_count:
            top_base = base
            top_base_count = base_counts[base]
    # if more than 0 bases were processed, call the top base when it is
    # frequent enough
    if total > 0:
        prop = top_base_count / float(total)
        if prop >= min_proportion:
            parsed["proportion"] = prop
            parsed["top_base"] = top_base
    return parsed

## process pileup file
if not DEBUG:
    out_file = open(out_file, "w")
with open(pileup_file, "rt") as pileup_file:
    for pileup_file_line in pileup_file:
        parsed = parse_pileup(pileup_file_line)
        proportion = parsed["proportion"]
        fields = [parsed["chromosome"], str(parsed["position"]), parsed["reference"],
                  parsed["top_base"], str(round(proportion, ROUND)),
                  parsed["pileup"], parsed["quality"]]
        # print to stdout in debug mode, otherwise to the output file
        if DEBUG:
            print("\t".join(fields))
        else:
            print("\t".join(fields), file=out_file)
if not DEBUG:
    out_file.close()
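To make the consensus logic concrete, consider a hypothetical pileup line (tab-separated fields: chromosome, position, reference base, depth, read bases, base qualities; the values are invented for illustration):

chr1    1042    C    6    TTTTT.    IIIIII

Five reads support T and the "." matches the reference C, all at phred quality 40. With min_cvg: 5 and base_freq: 0.8, depth 6 passes the coverage check and T is called as the consensus base, since 5/6 ≈ 0.83 ≥ 0.8.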
# scripts/coreSNPs2fasta.py: convert the core-SNP table to a multi-FASTA alignment

import sys

HEADER_POS = 0
FASTA_POS = 1

snp_file = snakemake.input[0]
out_file = snakemake.output[0]

# hold sequences as we go
fastas = []

with open(snp_file, "r") as core_snps:
    # process header: one [name, sequence] pair per sample
    header = core_snps.readline().rstrip("\n").split("\t")
    for h in range(len(header)):
        fastas += [[header[h], ""]]
    # append each sample's base at this position to its sequence
    for line in core_snps:
        line = line.rstrip("\n").split("\t")
        for pos in range(len(line)):
            fastas[pos][FASTA_POS] += line[pos]

# redirect print output to the output file
with open(out_file, "w") as sys.stdout:
    for f in range(len(fastas)):
        print(">" + fastas[f][0])
        print(fastas[f][1])
# scripts/findCoreSNPs.py: keep positions called in every sample that are not invariant

import sys

snp_file = snakemake.input[0]
out_file = snakemake.output[0]

# read each line and print it if a base is called for every sample and
# the bases are not the same in every sample
with open(snp_file, "r") as snps:
    # redirect print output to the output file
    with open(out_file, "w") as sys.stdout:
        # print header
        print(snps.readline().rstrip('\n'))
        # output positions with no unknown bases ("N") and where
        # all samples do not have the same base
        for line in snps:
            positions = line.rstrip('\n').split('\t')
            if (len(set(positions)) > 1) and ("N" not in positions):
                print("\t".join(positions))
# scripts/getCoverage.py: summarize depth and breadth of coverage from a
# bedtools genomecov histogram

# snakemake input and output files
sample_file = snakemake.input[0]
out_file = snakemake.output[0]

# minimum coverage threshold -- report the fraction of bases covered at or
# beyond this depth
minCvg = int(snakemake.params[0])

totalBases = 0
coveredBases = 0
weightedAvg = 0

with open(sample_file, 'r') as sample:
    for line in sample:
        # only use the whole-genome summary rows
        if line.startswith("genome"):
            chrom, depth, numBases, size, fraction = line.rstrip('\n').split('\t')
            depth = int(depth)
            numBases = int(numBases)
            totalBases += numBases
            weightedAvg += depth * numBases
            if depth >= minCvg:
                coveredBases += numBases

avgCvg = 0
percCovered = 0

if totalBases > 0:
    avgCvg = float(weightedAvg) / float(totalBases)
    percCovered = float(coveredBases) / float(totalBases)

with open(out_file, 'w') as out:
    print("\t".join([str(round(avgCvg, 2)), str(round(percCovered, 2))]), file=out)
# scripts/pairwiseDist.py: count pairwise SNVs between per-sample consensus sequences

import sys
import itertools
import subprocess
import re

# input and output files from snakemake
in_files = snakemake.input
out_file = snakemake.output[0]

# total positions in each consensus file (minus the header line)
cmd = " ".join(["cat", in_files[0], "| wc -l"])
totalBases = subprocess.check_output(cmd, stdin=subprocess.PIPE, shell=True).decode('ascii').rstrip('\n')
totalBases = int(totalBases) - 1

# for every pairwise combination of files, check SNV distance
with open(out_file, "w") as sys.stdout:
    # print header
    print('\t'.join(["Sample1", "Sample2", "SNVs", "BasesCompared", "TotalBases"]))
    # get all pairwise combinations of input files
    for element in itertools.combinations(in_files, 2):
        file1, file2 = element
        # positions where both samples have a called base
        cmd1 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | wc -l"])
        totalPos = subprocess.check_output(cmd1, stdin=subprocess.PIPE, shell=True).decode('ascii').rstrip('\n')
        # positions where the called bases differ
        cmd2 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | awk '$1 != $2 {print $0}' | wc -l"])
        diffPos = subprocess.check_output(cmd2, stdin=subprocess.PIPE, shell=True).decode('ascii').rstrip('\n')
        # recover sample names from the consensus file paths
        fname1 = re.findall(r'consensus/(.+)\.', file1)[0]
        fname2 = re.findall(r'consensus/(.+)\.', file2)[0]
        print('\t'.join([fname1, fname2, str(diffPos), str(totalPos), str(totalBases)]))
# scripts/renderTree.R: render the phylogenetic tree to a PDF

library(ggtree)
library(phangorn)
library(ggplot2)

# input files
tree_file <- snakemake@input[[1]]
out_file <- snakemake@output[[1]]

# read tree file
tree <- read.tree(tree_file)

# midpoint root the tree if there is more than 1 node
if (tree$Nnode > 1) {
  tree <- midpoint(tree)
}

# draw tree
p <- ggtree(tree, branch.length = 1) +
  geom_tiplab(offset = .05) +
  geom_tippoint(size = 3) +
  xlim(0, 4) +
  geom_treescale(width = 0.1, offset = 0.25)

# save output file, scaling height with the number of tips
ggsave(out_file, plot = p, height = 1 * length(tree$tip.label))
# index the reference genome
shell:
    "bwa index {input}"
# align reads to the reference, filter by mapping quality and mismatch count, and sort
shell:
    "bwa mem -t {threads} {params.ref} {input.r} | "\
    "samtools view -b -q {params.qual} | "\
    "bamtools filter -tag 'NM:<={params.nm}' | "\
    "samtools sort --threads {threads} -o {output}"
# compute a genome coverage histogram from the filtered alignments
shell:
    "bedtools genomecov -ibam {input} > {output}"
# summarize average depth and breadth of coverage
script:
    "scripts/getCoverage.py"
# link samples that pass the coverage thresholds into passed_samples/
run:
    samps = input
    for samp in samps:
        with open(samp) as s:
            cvg, perc = s.readline().rstrip('\n').split('\t')
        if (float(cvg) >= params.min_cvg and float(perc) > params.min_perc):
            # strip the ".cvg" extension to recover the sample name
            # (splitext avoids the character-set pitfall of rstrip(".cvg"))
            sample_name = os.path.splitext(os.path.basename(samp))[0]
            shell("ln -s $PWD/filtered_bam/{s}.filtered.bam passed_samples/{s}.bam".format(s=sample_name))
# index the reference fasta
shell:
    "samtools faidx {input}"
# generate a pileup over every reference position
shell:
    "samtools mpileup -f {input.ref} -B -aa -o {output} {input.bam}"
# call a consensus base at each position
script:
    "scripts/callSNPs.py"
# extract the consensus base column, headed by the sample name
shell:
    "echo {wildcards.sample} > {output}; cut -f4 {input} >> {output}"
# combine per-sample consensus columns into one table
shell:
    "paste {input} > {output}"
# keep positions that are called in all samples and are not invariant
script:
    "scripts/findCoreSNPs.py"
# convert the core-SNP table to a multi-FASTA alignment
script:
    "scripts/coreSNPs2fasta.py"
# align the core-SNP sequences
shell:
    "muscle -in {input} -out {output}"
# build an approximate maximum-likelihood tree from the alignment
shell:
    "fasttree -nt {input} > {output}"
# render the tree to PDF
script:
    "scripts/renderTree.R"
# count pairwise SNVs between samples
script:
    "scripts/pairwiseDist.py"