Variant Annotation Workflow for Pearl Millet Using Snakemake and SnpEff
annotate-millet-variants-snakeflow
This is a little Snakemake workflow to annotate some variants from pearl millet. This task came up during a bioinformatics course, BZ582A, at CSU in the spring of 2022. SnpEff is pretty easy to use when there is a precompiled database for your organism; it is a little clunkier when you need to build a database from a GFF, and this workflow shows one approach to that. (Note that it would probably be good to check the results by comparing them to some coding sequences, which we have not done here yet.)
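Since the database-from-GFF step is the clunky part, here is a minimal sketch of the idea; the genome name `millet_v1` and the paths are illustrative assumptions, and the workflow's actual rules appear in the Code Snippets section below. SnpEff needs an entry for the genome in `snpEff.config`, plus a data directory containing `sequences.fa` and `genes.gff`, after which `snpEff build` can be run:

```python
# A minimal sketch, not the workflow's exact rule: stage the FASTA and GFF
# where SnpEff expects them, then build the database. The genome name
# "millet_v1" and these paths are illustrative assumptions; the config must
# already contain a line like "millet_v1.genome : Pennisetum glaucum".
rule build_snpeff_db_sketch:
    input:
        fasta="resources/genome.fasta",
        gff="resources/genome.gff",
    output:
        directory("resources/SnpEff/data/millet_v1"),
    shell:
        "mkdir -p {output} && "
        "cp {input.fasta} {output}/sequences.fa && "
        "cp {input.gff} {output}/genes.gff && "
        "snpEff build -gff3 -c resources/SnpEff/snpEff.config -v millet_v1"
```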
The purpose of this little workflow is to demonstrate how a Snakefile can be put together for this task, and also to show how to use a SLURM profile derived from jdblischak's smk-simple-slurm profile. The profile included in the repository at `/hpcc-profiles/slurm-summit` is tailored to the SUMMIT supercomputer.

Here is a DAG of this workflow, condensed using SnakemakeDagR:

(DAG figure)
To try this out on SUMMIT:

- From `scompile`, clone this repository from GitHub into your scratch directory:

  ```sh
  git clone https://github.com/eriqande/annotate-millet-variants-snakeflow.git
  ```
- Then, on the `scompile` node, first do a dry run:

  ```sh
  cd annotate-millet-variants-snakeflow
  conda activate snakemake
  snakemake -np
  ```
The output of that should include a jobs table that looks like this:
  ```
  Job stats:
  job                         count    min threads    max threads
  ------------------------  -------  -------------  -------------
  add_to_snpeff_config            1              1              1
  all                             1              1              1
  annotate_each_chromosome        7              1              1
  build_snpeff_database           1              1              1
  catenate_anno_chroms            1              1              1
  convert_to_vcf                  7              1              1
  download_genome                 1              1              1
  download_genotype_readme        1              1              1
  download_genotypes              7              1              1
  download_gff                    1              1              1
  headerize_vcfs                  7              1              1
  make_fai                        1              1              1
  total                          36              1              1
  ```
- Use this line to install the conda environments:

  ```sh
  snakemake --use-conda --conda-create-envs-only --cores 1
  ```
- Now, we will just have Snakemake launch each job on SLURM using `sbatch`, by way of the slurm profile (except for the rules marked as `localrules` at the top of the Snakefile; those rules, which are simple text manipulations or downloads, will be run locally on `scompile`; see the sketch after this list). Make sure you do this next step in a tmux session on `scompile`, because this has to keep running, even after you log off.

  ```sh
  snakemake --profile hpcc-profiles/slurm-summit
  ```
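For reference, marking rules as local looks like this at the top of a Snakefile. This is a hedged sketch: the rule names are drawn from the jobs table above, but the exact set declared in this repository's Snakefile may differ.

```python
# localrules tells Snakemake to run these rules on the submitting node
# (here, scompile) rather than dispatching them to SLURM via sbatch.
# This particular list of rules is an illustrative assumption.
localrules: all, download_genome, download_gff, download_genotype_readme, add_to_snpeff_config
```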
While these things are running or are in the SLURM queue, you can see them by opening up another shell (with tmux, for example) and doing this:

```sh
squeue -u $(whoami) -o "%.12i %.9P %.50j %.10u %.2t %.15M %.6D %.18R %.5C %.12m"
```

If you need more space for the job names, change the `50` above to a larger number.
In the end, the annotated VCF file is in the directory `results/vcf`, and all the other intermediate (VCF) files have been deleted. Additionally, the snpEff reports (HTML files and TSV files of genes) are all in the directory `snpeff_reports`.
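For orientation, here is a hedged sketch, not the repository's actual code, of what a top-level target tying those final outputs together could look like, assuming illustrative path names and the seven chromosomes implied by the jobs table:

```python
# Illustrative only: the file names and the CHROMS values are assumptions,
# not the repository's actual paths.
CHROMS = [f"chr{i}" for i in range(1, 8)]

rule all:
    input:
        "results/vcf/all-chroms-ann.vcf.gz",
        expand("snpeff_reports/{chrom}.html", chrom=CHROMS),
```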
A few things could be done better here. We don't need to compress after reheadering, because the file just gets decompressed again in order to concatenate it. Also, it might be better to create the whole VCF before running it through snpEff, because then you would get just a single summary report; but that would take a little longer, since you would lose the parallelizability over chromosomes. A sketch of that alternative appears below.
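Here is a hedged sketch of that alternative, assuming illustrative rule, path, and genome-version names; the per-chromosome annotation command it is modeled on appears in the Code Snippets section below:

```python
# Illustrative alternative: annotate once, after concatenating the
# per-chromosome VCFs, so snpEff emits a single summary report.
# The rule name, paths, and genome version here are assumptions.
rule annotate_whole_vcf:
    input:
        vcf="results/vcf/all-chroms.vcf.gz",
        config="resources/SnpEff/snpEff.config",
    output:
        vcf="results/vcf/all-chroms-ann.vcf",
        html="snpeff_reports/all-chroms.html",
    params:
        genome_version="millet_v1",
    log:
        "results/logs/annotate_whole_vcf.log",
    shell:
        "snpEff ann -s {output.html} -c {input.config} "
        "{params.genome_version} {input.vcf} > {output.vcf} 2> {log}"
```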
At any rate, this shows a simple Snakefile that has made this analysis easily reproducible.
Code Snippets
```sh
GENOME_VERSION_NAME=$1
SPECIES_NAME_STRING=$2
OUTPUT=$3

if [ $# -ne 3 ]; then
  echo "expected 3 arguments" > /dev/stderr
  exit 1;
fi

# get the directory that the alias lives in
FIRSTDIR=$(dirname $(dirname $(which snpEff))) &&

# get the directory of the value of the alias
TMP=$(dirname $(readlink -s `which snpEff`)) &&

# combine those to get the directory where the config file is
SNPEFFDIR=${FIRSTDIR}${TMP/../} &&

mkdir -p resources/SnpEff &&

# copy the config file
cp $SNPEFFDIR/snpEff.config $OUTPUT &&

# add an entry for pearl millet
echo "
# added by snakemake rule add_to_snpeff_config
$GENOME_VERSION_NAME.genome : $SPECIES_NAME_STRING
" >> $OUTPUT
```
```python
shell:
    "wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
```
```python
shell:
    "wget --no-check-certificate -O {output} {params.url} > {log} 2>&1"
```
```python
shell:
    " (wget --no-check-certificate -O {output}.gz {params.url}; "
    " gunzip {output}.gz) > {log} 2>&1;"
```
```python
shell:
    "(wget --no-check-certificate -O {output}.gz {params.url}; "
    " gunzip {output}.gz) > {log} 2>&1; "
```
```python
shell:
    """
    ( awk 'NR>=5 {{printf("SAMPLE %s\\n", $2);}}' {input.readme};
      echo xxxxxxxxxxxxxxxxx;
      zcat {input.geno}
    ) | awk -f script/genotypes2vcf.awk > {output} 2>{log}
    """
```
```python
shell:
    " samtools faidx {input.fasta} 2> {log} "
```
```python
shell:
    " bcftools reheader -f {input.fai} {input.vcf} | "
    " bcftools view -Oz - > {output} 2> {log};"
```
```python
shell:
    "./script/add_to_config.sh {params.genome_version} {params.species_name_string} {output} 2> {log}"
```
```python
shell:
    "( mkdir -p {output} && "
    " cp resources/genome.fasta {output}/sequences.fa && "
    " cp resources/genome.gff {output}/genes.gff && "
    " snpEff build -Xmx4g -noCheckCds -noCheckProtein -gff3 "
    " -c resources/SnpEff/snpEff.config -v {params.genome_version} ) > {log} 2>&1"
```
```python
shell:
    "snpEff ann -s {output.html} -c {input.config} {params.genome_version} {input.vcf} > {output.vcf} 2> {log} "
```
```python
shell:
    " (bcftools concat {input} | bcftools view -Oz > {output} && "
    " bcftools index -t {output} ) 2> {log} "
```