Variant Annotation Workflow for Pearl Millet Using Snakemake and SnpEff
annotate-millet-variants-snakeflow
This is a little Snakemake workflow to annotate some variants from pearl millet. This task came up during a bioinformatics course, BZ582A, at CSU in the spring of 2022. SnpEff is pretty easy to use when there is a precompiled database for your organism; it is a little clunkier when you need to build a database from a GFF, and this workflow shows one approach to that. (Note that it would probably be good to check the results by comparing them to some coding sequences, which we have not done here yet.)
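Since the database-from-GFF step is the clunky part, here is a minimal sketch of the idea; the genome name `millet_v1` and the paths are illustrative assumptions, and the workflow's actual rules appear in the Code Snippets section below. SnpEff needs an entry for the genome in `snpEff.config`, plus a data directory containing `sequences.fa` and `genes.gff`, after which `snpEff build` can be run:

```python
# A minimal sketch, not the workflow's exact rule: stage the FASTA and GFF
# where SnpEff expects them, then build the database. The genome name
# "millet_v1" and these paths are illustrative assumptions; the config must
# already contain a line like "millet_v1.genome : Pennisetum glaucum".
rule build_snpeff_db_sketch:
    input:
        fasta="resources/genome.fasta",
        gff="resources/genome.gff",
    output:
        directory("resources/SnpEff/data/millet_v1"),
    shell:
        "mkdir -p {output} && "
        "cp {input.fasta} {output}/sequences.fa && "
        "cp {input.gff} {output}/genes.gff && "
        "snpEff build -gff3 -c resources/SnpEff/snpEff.config -v millet_v1"
```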
The purpose of this little workflow is to demonstrate how a Snakefile can be put together for this task, and also to show how to use a SLURM profile derived from jdblischak's smk-simple-slurm profile. The profile included in the repository at `/hpcc-profiles/slurm-summit` is tailored to the SUMMIT supercomputer.

Here is a DAG of this workflow, condensed using SnakemakeDagR:

(DAG figure)
To try this out on SUMMIT:

- From `scompile`, clone this repository from GitHub into your scratch directory:

  ```sh
  git clone https://github.com/eriqande/annotate-millet-variants-snakeflow.git
  ```
- Then, on the `scompile` node, first do a dry run:

  ```sh
  cd annotate-millet-variants-snakeflow
  conda activate snakemake
  snakemake -np
  ```
The output of that should include a jobs table that looks like this:
  ```
  Job stats:
  job                         count    min threads    max threads
  ------------------------  -------  -------------  -------------
  add_to_snpeff_config            1              1              1
  all                             1              1              1
  annotate_each_chromosome        7              1              1
  build_snpeff_database           1              1              1
  catenate_anno_chroms            1              1              1
  convert_to_vcf                  7              1              1
  download_genome                 1              1              1
  download_genotype_readme        1              1              1
  download_genotypes              7              1              1
  download_gff                    1              1              1
  headerize_vcfs                  7              1              1
  make_fai                        1              1              1
  total                          36              1              1
  ```
- Use this line to install the conda environments:

  ```sh
  snakemake --use-conda --conda-create-envs-only --cores 1
  ```
- Now, we will just have Snakemake launch each job on SLURM using `sbatch`, by way of the slurm profile (except for the rules marked as `localrules` at the top of the Snakefile; those rules, which are simple text manipulations or downloads, will be run locally on `scompile`; see the sketch after this list). Make sure you do this next step in a tmux session on `scompile`, because this has to keep running, even after you log off.

  ```sh
  snakemake --profile hpcc-profiles/slurm-summit
  ```
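For reference, marking rules as local looks like this at the top of a Snakefile. This is a hedged sketch: the rule names are drawn from the jobs table above, but the exact set declared in this repository's Snakefile may differ.

```python
# localrules tells Snakemake to run these rules on the submitting node
# (here, scompile) rather than dispatching them to SLURM via sbatch.
# This particular list of rules is an illustrative assumption.
localrules: all, download_genome, download_gff, download_genotype_readme, add_to_snpeff_config
```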
While these things are running or are in the SLURM queue, you can see them by opening up another shell (with tmux, for example) and doing this:

```sh
squeue -u $(whoami) -o "%.12i %.9P %.50j %.10u %.2t %.15M %.6D %.18R %.5C %.12m"
```

If you need more space for the job names, change the `50` above to a larger number.
In the end, the annotated VCF file is in the directory `results/vcf`, and all the other intermediate (VCF) files have been deleted. Additionally, the snpEff reports (HTML files and TSV files of genes) are all in the directory `snpeff_reports`.
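For orientation, here is a hedged sketch, not the repository's actual code, of what a top-level target tying those final outputs together could look like, assuming illustrative path names and the seven chromosomes implied by the jobs table:

```python
# Illustrative only: the file names and the CHROMS values are assumptions,
# not the repository's actual paths.
CHROMS = [f"chr{i}" for i in range(1, 8)]

rule all:
    input:
        "results/vcf/all-chroms-ann.vcf.gz",
        expand("snpeff_reports/{chrom}.html", chrom=CHROMS),
```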
A few things could be done better here. We don't need to compress after reheadering, because the file just gets decompressed again in order to concatenate it. Also, it might be better to create the whole VCF before running it through snpEff, because then you would get just a single summary report; but that would take a little longer, since you would lose the parallelizability over chromosomes. A sketch of that alternative appears below.
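Here is a hedged sketch of that alternative, assuming illustrative rule, path, and genome-version names; the per-chromosome annotation command it is modeled on appears in the Code Snippets section below:

```python
# Illustrative alternative: annotate once, after concatenating the
# per-chromosome VCFs, so snpEff emits a single summary report.
# The rule name, paths, and genome version here are assumptions.
rule annotate_whole_vcf:
    input:
        vcf="results/vcf/all-chroms.vcf.gz",
        config="resources/SnpEff/snpEff.config",
    output:
        vcf="results/vcf/all-chroms-ann.vcf",
        html="snpeff_reports/all-chroms.html",
    params:
        genome_version="millet_v1",
    log:
        "results/logs/annotate_whole_vcf.log",
    shell:
        "snpEff ann -s {output.html} -c {input.config} "
        "{params.genome_version} {input.vcf} > {output.vcf} 2> {log}"
```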
At any rate, this shows a simple Snakefile that has made this analysis easily reproducible.
Code Snippets
```sh
GENOME_VERSION_NAME=$1
SPECIES_NAME_STRING=$2
OUTPUT=$3

if [ $# -ne 3 ]; then
  echo "expected 3 arguments" > /dev/stderr
  exit 1;
fi

# get the directory that the alias lives in
FIRSTDIR=$(dirname $(dirname $(which snpEff))) &&

# get the directory of the value of the alias
TMP=$(dirname $(readlink -s `which snpEff`)) &&

# combine those to get the directory where the config file is
SNPEFFDIR=${FIRSTDIR}${TMP/../} &&

mkdir -p resources/SnpEff &&

# copy the config file
cp $SNPEFFDIR/snpEff.config $OUTPUT &&

# add an entry for pearl millet
echo "
# added by snakemake rule add_to_snpeff_config
$GENOME_VERSION_NAME.genome : $SPECIES_NAME_STRING
" >> $OUTPUT
```
```python
shell:
    "wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
```
```python
shell:
    "wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
```
```python
shell:
    " (wget --no-check-certificate -O {output}.gz {params.url}; "
    " gunzip {output}.gz) > {log} 2>&1;"
```
```python
shell:
    "(wget --no-check-certificate -O {output}.gz {params.url}; "
    " gunzip {output}.gz) > {log} 2>&1; "
```
```python
shell:
    """
    ( awk 'NR>=5 {{printf("SAMPLE %s\\n", $2);}}' {input.readme};
      echo xxxxxxxxxxxxxxxxx;
      zcat {input.geno}
    ) | awk -f script/genotypes2vcf.awk > {output} 2>{log}
    """
```
```python
shell:
    " samtools faidx {input.fasta} 2> {log} "
```
```python
shell:
    " bcftools reheader -f {input.fai} {input.vcf} | "
    " bcftools view -Oz - > {output} 2> {log};"
```
```python
shell:
    "./script/add_to_config.sh {params.genome_version} {params.species_name_string} {output} 2> {log}"
```
```python
shell:
    "( mkdir -p {output} && "
    " cp resources/genome.fasta {output}/sequences.fa && "
    " cp resources/genome.gff {output}/genes.gff && "
    " snpEff build -Xmx4g -noCheckCds -noCheckProtein -gff3 "
    " -c resources/SnpEff/snpEff.config -v {params.genome_version} ) > {log} 2>&1"
```
```python
shell:
    "snpEff ann -s {output.html} -c {input.config} {params.genome_version} {input.vcf} > {output.vcf} 2> {log} "
```
```python
shell:
    " (bcftools concat {input} | bcftools view -Oz > {output} && "
    " bcftools index -t {output} ) 2> {log} "
```