M. tuberculosis variant identification
Deprecated 2023. Please use: https://github.com/ksw9/mtb-vars
Snakemake pipeline for M. tuberculosis variant identification from short-read data, mapping with bwa, variant identification with GATK.
Usage
Clone repository from GitHub.
git clone https://github.com/ksw9/mtb-call.git
Navigate to your project root directory.
Create a conda environment with Snakemake (installation instructions are in the Snakemake documentation).
module add anaconda/3_2022.05 # On Stanford's SCG.
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Activate the snakemake environment and confirm it works.
source activate snakemake
snakemake --help
Download SnpEff for gene annotation.
Update the config file (config/config.yml) with the correct SnpEff path.
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# Unzip file
unzip snpEff_latest_core.zip
Download the updated M. tuberculosis annotations.
java -jar snpEff.jar download Mycobacterium_tuberculosis_h37rv
Update the config file (config/config.yml) so that Snakemake uses the correct sample list as input. The test sample list is config/test_data.tsv.
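A hypothetical fragment of config/config.yml might look like the following. The scripts_dir and snpeff* keys are referenced in the Code Snippets section; the sample-list key name is illustrative, so check the shipped config for the exact spelling.

```yaml
# Fragment of config/config.yml (values are placeholders).
sample_list: config/test_data.tsv            # key name illustrative
scripts_dir: workflow/scripts/
snpeff: /path/to/snpEff/snpEff.jar           # path from the SnpEff download step
snpeff_db: Mycobacterium_tuberculosis_h37rv
```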
Create a directory to store cluster run outputs for troubleshooting.
mkdir logs
Do a dry run: list the Snakemake jobs that have not yet completed, without executing them.
snakemake -np
Running Snakemake locally will stall at the Kraken step due to its memory requirements; submit to the cluster instead. The --use-conda flag tells Snakemake to use the conda environment specified for each step.
nohup snakemake -j 5 -k --cluster-config config/cluster_config.yml --use-conda --rerun-triggers mtime --rerun-incomplete --cluster \
"sbatch -A {cluster.account} --mem={cluster.memory} -t {cluster.time} --cpus-per-task {threads} --error {cluster.error} --output {cluster.output} " \
> logs/snakemake_test_data.out &
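A hypothetical config/cluster_config.yml might look like this; all values are placeholders, but the keys must match the {cluster.*} fields in the sbatch template above (__default__ is the standard fallback section for cluster configs).

```yaml
# Hypothetical cluster config -- adjust account/memory/time to your cluster.
__default__:
  account: your_slurm_account   # placeholder Slurm account
  memory: 16G
  time: "12:00:00"
  error: logs/slurm_%j.err      # %j expands to the Slurm job ID
  output: logs/slurm_%j.out
```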
Monitor jobs running on the cluster (sq is a site-specific alias; squeue -u $USER works on any Slurm cluster).
sq
Look at the entire output once jobs have completed.
cat logs/snakemake_test_data.out
Each step also produces a log, for troubleshooting a specific step.
cat results/test_data/test/bams/test_bwa_H37Rv_map.log
If snakemake hits an error or a run is terminated prematurely, the working directory stays locked; unlock it with:
snakemake --unlock
Example data
Sample paired-end fastq files are in the test_data directory. An input sample list (.tsv) is located at config/test_data.tsv.
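For reference, the sample sheet is a tab-separated file. A minimal sketch of building one is below; the column names and fastq paths are illustrative, so compare against config/test_data.tsv for the real layout.

```shell
# Build a hypothetical sample sheet -- column names are only illustrative.
printf 'batch\tsample\tfastq_1\tfastq_2\n' > my_samples.tsv
printf 'test_data\ttest\ttest_data/test_R1.fastq.gz\ttest_data/test_R2.fastq.gz\n' >> my_samples.tsv
# Sanity-check: every row has the same number of tab-separated fields.
awk -F'\t' 'NR==1{n=NF} NF!=n{exit 1}' my_samples.tsv && echo "sample sheet OK"
```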
Directory structure
Results will be organized in the results directory, in sub-directories organized by sequencing batch and then sample name.
├── .gitignore
├── README.md
├── workflow
│   ├── rules
│   ├── envs
│   │   ├── bwa.yml
│   │   ├── gatk.yml
│   │   ├── kraken2.yml
│   │   ├── picard.yml
│   │   ├── quanttb.yml
│   │   ├── samtools.yml
│   │   ├── TBprofilerv4.3.0.yml
│   │   ├── trim-galore.yml
│   │   └── conda-meta
│   ├── scripts
│   │   ├── process_stanford_tb.sh
│   │   ├── run_kraken.sh
│   │   ├── run_quanttb.sh
│   │   ├── reads_output.sh
│   │   ├── map_reads.sh
│   │   ├── cov_stats.sh
│   │   ├── mykrobe_predict.sh
│   │   ├── call_varsgatk.sh
│   │   └── vcf2fasta.sh
│   ├── refs
│   │   ├── H37Rv.fa
│   │   ├── H37Rv_ppe.bed.gz
│   │   ├── ppe_hdr.txt
│   │   └── snpEff
│   └── Snakefile
├── config
│   ├── config.yml (run/user specific parameters)
│   └── cluster_config.yml (cluster specific parameters)
├── results
│   ├── test_data/test (example organized by sequencing batch, then sample)
│   │   ├── trim
│   │   ├── kraken
│   │   ├── bams
│   │   ├── vars
│   │   ├── fasta
│   │   └── stats
Code Snippets
shell:
    '''
    # Add test if environment exists already, create if not.
    existing_env=$(conda env list | awk '{{print $1}}' | grep {params.enviro})
    [ -z "$existing_env" ] && mamba create -f {input}
    echo {params.enviro} > {output}
    '''

shell:
    'trim_galore --basename {params.sample} --nextseq 20 --gzip --output_dir {params.output_dir} --paired {input.p1} {input.p2}'

shell:
    '{config[scripts_dir]}run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_report} &>> {log}'

shell:
    '{config[scripts_dir]}run_quanttb.sh {input.kr1} {input.kr2} {output.quanttb_out} &>> {log}'

shell:
    '''
    {config[scripts_dir]}map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}
    sambamba markdup -r -t {threads} --tmpdir={params.tmp_dir} {output.bam} {output.combined_bam}
    '''

shell:
    '''
    {config[scripts_dir]}reads_output.sh {input.p1} {input.trim1} {input.kr1} {output.reads_stats} &>> {log}
    '''

shell:
    '''
    {config[scripts_dir]}cov_stats.sh {input.ref_path} {input.bam} {output.cov_stats} &>> {log}

    #paste {input.kraken_stats} <(sed -n '7,8'p {output.cov_stats} ) > {output.combined_stats}
    # Combine with reads output
    #paste -d ',' <( cat {output.combined_stats} | tr "\\t" "," ) {input.reads_stats} > {output.all_stats}

    # Combine reads output and coverage
    paste -d ',' {input.reads_stats} <(sed -n '7,8'p {output.cov_stats} | tr "\\t" "," ) > {output.combined_stats}
    paste -d ',' {output.combined_stats} {input.quanttb_stats} <( cat {input.kraken_stats} | tr "\\t" ",") > {output.all_stats}

    # Mosdepth coverage along genome (for plotting)
    mosdepth --by 2000 -Q 30 {params.prefix} {input.bam}
    '''
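The paste(1) idiom in this rule, appending selected comma-converted lines of a TSV as extra columns of a CSV row, can be exercised in isolation. The filenames and values below are made up, standing in for {input.reads_stats} and {output.cov_stats}:

```shell
# Toy stand-ins for the reads-stats CSV and coverage-stats TSV.
echo 'sample1,1000,950' > reads_stats.csv
printf 'mean_cov\t42.5\n' > cov_stats.tsv
# Convert tabs to commas, then append as extra CSV columns.
tr '\t' ',' < cov_stats.tsv | paste -d ',' reads_stats.csv - > combined_stats.csv
cat combined_stats.csv   # sample1,1000,950,mean_cov,42.5
```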
shell:
    "cat {input.combined_stats} > {output}"

shell:
    """
    # Make sure tb-profiler reference chromosome names match the name used in your reference FASTA.
    tb-profiler update_tbdb --match_ref {params.ref_path}
    tb-profiler profile --bam {input.combined_bam} --prefix {params.samp} --dir {params.outdir} --csv --spoligotype &>> {log}
    mv {params.tmp_file} {output.profile}
    """

shell:
    "{config[scripts_dir]}mykrobe_predict.sh {input.combined_bam} {output.amr_out} &>> {log}"

shell:
    "{config[scripts_dir]}call_vars_gatk.sh {input.ref_path} {input.combined_bam} {params.ploidy} {output.vcf} &>> {log}"

shell:
    '''
    {config[scripts_dir]}vcf2fasta.sh {input.ref_path} {params.sample_name} {input.unfilt_vcf} {params.bed} {output.fasta} {params.depth} {params.qual} &>> {log}
    '''

shell:
    '''
    # Rename Chromosome to be consistent with snpEff/Ensembl genomes.
    zcat {input.vcf} | sed 's/NC_000962.3/Chromosome/g' | bgzip > {output.rename_vcf}
    tabix {output.rename_vcf}

    # Run snpEff and then rename Chromosome.
    java -jar -Xmx8g {config[snpeff]} eff {config[snpeff_db]} {output.rename_vcf} -config {config[snpeff_configpath]} -dataDir {config[snpeff_datapath]} -noStats -no-downstream -no-upstream -canon | sed 's/Chromosome/NC_000962.3/g' > {output.tmp_vcf}

    # Also use bed file to annotate vcf, zip.
    bcftools annotate -a {params.bed} -h {params.vcf_header} -c CHROM,FROM,TO,FORMAT/PPE {output.tmp_vcf} | bgzip > {output.ann_vcf}
    '''
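The round-trip rename in the annotation rule exists because the Ensembl-style snpEff database names the contig "Chromosome" while the GATK VCFs use the RefSeq accession NC_000962.3. It can be exercised on its own; the VCF record below is a fabricated example:

```shell
# One fabricated VCF body line using the RefSeq contig name.
printf 'NC_000962.3\t100\t.\tA\tG\t.\tPASS\t.\n' > body.vcf
# Rename the contig for snpEff, as the annotation rule does before bgzip.
sed 's/NC_000962.3/Chromosome/g' body.vcf > renamed.vcf
cut -f1 renamed.vcf   # Chromosome
```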
Support
- Future updates