AugusMake: Snakemake-Based Gene Annotation Pipeline with Augustus
Guided by a genome file, AugusMake de novo assembles a transcriptome with Trinity. It does so using RNA-Seq data provided by the user, or it automatically downloads paired-end (PE) reads from NCBI's SRA database given an accession number. PE reads are trimmed with Trimmomatic and their quality is assessed pre- and post-trimming with FastQC. Alternatively, the user can provide an already-assembled transcriptome. The completeness of the transcriptome is checked using BUSCO. Gene prediction can be done ab initio using just a genome file, i.e. no transcriptome is required, and/or with extrinsic hints derived from transcriptomic data. Both can use pre-trained parameters shipped with Augustus, or the user can choose to train the program for a new species. Results from training should be used with caution and carefully checked by the user after the analyses are completed.
The rules for Augustus follow the protocol of Hoff & Stanke (2018), the developers of Augustus.
System requirements
Local machine
I recommend running the workflow on an HPC system, as the analyses are resource- and time-consuming.
- If you don't have conda or miniconda yet, install it on your machine by following the official instructions.
- After you are all set with conda, I highly (highly!) recommend installing mamba, a much faster package manager that replaces conda:
  - First, activate your conda base:
conda activate base
  - Then, type:
conda install -n base -c conda-forge mamba
- Likewise, install Git if you don't have it yet.
HPC system
Follow the instructions from your cluster administrator regarding loading of modules, such as loading a root distribution of Conda. For example, modules might be used to set up environment variables, which first have to be loaded within the job scripts.
e.g.:
module load anaconda3/2022.05
Normally, the user doesn't have sudo rights to install anything to the root of the cluster. You might therefore want to work with a more up-to-date distribution of conda as your "new", local base, and (ideally) install and use mamba to replace conda as a package manager. To do so, first load the anaconda module and then create a new environment, here called localconda:
- module load anaconda3/2022.05
- conda create -n localconda -c conda-forge conda=22.9.0
- conda install -n localconda -c conda-forge mamba
- conda activate localconda
If you run conda env list, you'll probably see something like this:
/home/myusername/.conda/envs/localconda/
Installation
- Clone this repository
git clone https://github.com/juliawiggeshoff/AugusMake.git
- Activate your conda base
conda activate base
- If you are working on a cluster or have your own "local", isolated environment you want to activate instead, use its name:
conda activate localconda
- Install AugusMake into an isolated software environment by navigating to the directory where this repo was cloned and running:
conda env create --file environment.yaml
If you followed what was recommended in System requirements, run this instead:
mamba env create --file environment.yaml
This creates the AugusMake environment.
- Always activate the environment before running the workflow
On a local machine:
conda activate AugusMake
If you are on a cluster and/or created the environment "within" another environment, you want to run this first:
conda env list
You will probably see something like this among your environments:
/home/myusername/.conda/envs/localconda/envs/AugusMake
From now on, you have to give this full path when activating the environment prior to running the workflow:
conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake
Input data requirements
- Configuration file: config/config.yaml
- Species information table: config/species_table.tsv
Detailed information on the required input files, including all three Augustus processing alternatives, is found in config/README.md
Run AugusMake
Remember to always activate the environment first:
conda activate AugusMake
or
conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake
Before running the workflow locally or on an HPC system, make sure to do a "dry run" (--dry-run or -n) to see how many jobs will be run and whether any errors are flagged. Use --printshellcmds or -p for a full description of the rules, or --quiet or -q to just print a summary of the jobs:
snakemake --use-conda --cores 51 -q -n
Building DAG of jobs...
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
aa2nonred 2 20 20
all 1 1 1
all_busco_flag 1 1 1
all_fasterq_dump_done 1 1 1
augustus_ab_initio 1 1 1
augustus_hints 4 1 1
augustus_training 2 1 1
...
prefetch 1 1 1
raw_fastqc 2 2 2
samtools 2 5 5
trimmed_fastqc 8 2 2
trimmomatic 2 15 15
total 69 1 20
You can use the total number of jobs to "guide" the value for --jobs, while keeping in mind not to "allow" too many jobs at once; a value of 25 to 30 jobs might be the safest choice. See Cluster execution below.
Local machine execution
Not recommended if you don't have a lot of storage and CPUs available (and time to wait...). Nevertheless, you can simply run it like this:
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --cores 31 --max-threads 25 --rerun-incomplete > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &
Modify number of cores accordingly.
Cluster execution
Tailored to work with an SGE job scheduler. Modify as needed; see the Snakemake documentation for help.
Before the first execution of the workflow, you need to create the environments, otherwise Snakemake fails:
snakemake --cores 8 --use-conda --conda-create-envs-only
To run the workflow:
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --jobs 15 --cores 51 --local-cores 15 --max-threads 25 --rerun-incomplete --cluster "qsub -terse -V -b y -j y -o snakejob_logs/ -cwd -pe smp {threads} -q small.q,medium.q,large.q -M user.email@gmail.com -m be" > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &
Remember to:
- Create snakejob_logs in the working directory
- Modify user.email@gmail.com
- Change values for --jobs, --cores, --local-cores, and --max-threads accordingly
- Change the -q queue names
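If your cluster runs SLURM instead of SGE, the same pattern applies with a different --cluster string. A minimal sketch, assuming a generic sbatch setup (the partition name, log path, and notification flags are illustrative placeholders, not values from this repository):
nohup snakemake --keep-going --use-conda --printshellcmds --nolock --jobs 15 --cores 51 --local-cores 15 --max-threads 25 --rerun-incomplete --cluster "sbatch --cpus-per-task={threads} --partition=medium --mail-user=user.email@gmail.com --mail-type=BEGIN,END --output=snakejob_logs/%j.out" > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &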
Make sure you set a low value for --local-cores so you don't take up too many resources on the host node. Likewise, it is not recommended to allow a lot of --jobs to run in parallel, for similar reasons. Lastly, I like to set a --max-threads limit to ensure no rule "hogs" too many threads. This is often necessary because per-rule thread usage is set as a percentage of the total resources provided by the user. For example, if you can and want to use 110 threads, a rule like genome_guided_trinity will "take" 40% of these, i.e. 44 threads, which is often overkill. By limiting the maximum number of threads to 25, other jobs can run simultaneously, while Trinity still gets a "decent" number of threads to use.
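To make the percentage-based threading concrete, here is a minimal Snakefile sketch; the 40% figure comes from the example above, but the rule body, the use of workflow.cores, and the min() cap are illustrative assumptions, not AugusMake's actual implementation:
# Hypothetical sketch: give a rule a fraction of the total cores passed
# via --cores, capped at 25 so no single rule hogs the resources.
rule genome_guided_trinity_sketch:
    output: "trinity_threads.txt"
    threads: min(25, max(1, int(workflow.cores * 0.40)))
    shell: "echo 'Trinity would get {threads} threads' > {output}"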
If you named your configuration file differently, e.g. NEW_NAME.yaml, include the option --configfile config/NEW_NAME.yaml
Main results
report.zip
A report.zip file is automatically generated once the workflow finishes successfully. ALWAYS include --nolock when running snakemake; otherwise, the report is not automatically created and an error is returned. If that happens, the user can still generate the report manually with snakemake --report report.zip afterward, but the automatic generation is obviously preferred.
Charts visually representing the BUSCO results are included in report.zip. Each species has its own individual chart, and one chart combines all species to allow comparing the results between assemblies. Similarly, the FastQC report files are included for each species, pre- and post-trimming.
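If you want to check which files ended up in the archive without unpacking it, standard zip tooling works; for example (assuming unzip is available):
unzip -l report.zip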
To be included: Augustus results
Code Snippets
shell:
    "(prefetch {wildcards.accession} --output-directory {params.outdir_sra} && "
    "echo \"{output} DONE\") &> {log}"

shell:
    "(fasterq-dump {input} --outdir {params.outdir} --threads {threads} && "
    "echo \"fasterq-dump\" finished extracting {input} && "
    "pigz -p {threads} {params.outdir}/{wildcards.accession}*.fastq && "
    "echo \"pigz finished compressing {output.fwd} {output.rev}\") &> {log}"
# note: the second redirection appends (>>) so the forward-read path is not overwritten
shell:
    "echo {input.fwd} > {output} && echo {input.rev} >> {output}"
shell:
    "echo {input} > {output.all_SRA_downloads}"

shell:
    "(zcat {input.fwd} | fastqc stdin:{wildcards.species}_fwd -o {params.fastqc_pretrim_outdir} -t {threads} && "
    "zcat {input.rev} | fastqc stdin:{wildcards.species}_rev -o {params.fastqc_pretrim_outdir} -t {threads}) &> {log}"

shell:
    "trimmomatic PE -threads {threads} -phred33 {input.fwd} {input.rev} {output.fwd_p} {output.fwd_u} "
    "{output.rev_p} {output.rev_u} {params.clip}{params.adapter}{params.trim_params} {params.info_len} &> {log}"

shell:
    "(zcat {input.trimmed} | fastqc stdin:{wildcards.species}_trimmed_{wildcards.suffix} "
    "-o {params.fastqc_posttrim_outdir} -t {threads}) &> {log}"

shell:
    "gmap_build --dir {params.out_dir} {input.genome} --genomedb {params.prefix} &> {log}"

shell:
    "(gsnap --gunzip --dir {params.out_dir} --db {params.prefix} "
    "{input.fwd_p} {input.rev_p} --nthreads {threads} --format sam --novelsplicing=1 "
    "--nofails --output-file {output.sam} --npaths=20) &> {log}"

shell:
    "(samtools view -bo {output.bam} {input.sam} -@ {threads} && "
    "echo Converted sam file to bam && "
    "samtools sort -o {output.coordsortbam} {output.bam} -@ {threads} && "
    "echo Converted read alignments to a coordinate-sorted bam file) &> {log}"

shell:
    "(Trinity --genome_guided_bam {input.coordsortbam} --genome_guided_max_intron 10000 "
    "--max_memory 50G --CPU {threads} --output {params.out_dir}) &> {log}"

shell:
    "mkdir -p {params.flag_dir} && "
    "cp {input} {params.all_species_dir}"

shell:
    "ls {input} > {output}"

shell:
    "mkdir -p {params.outdir}; "
    "sed \"s/<__DATABASE__>/{params.path_db}/\" {params.template_config} | "
    "sed \"s/<__MIN_PERCENT_ALIGNED__>/0.8/\" | "
    "sed \"s/<__MIN_AVG_PER_ID__>/0.9/\" > {output.path_species_config}; "

shell:
    "(grep \"complete\" {input.cds} | perl -pe 's/>(\S+).*/$1/' > {output.complete_orfs} && "
    "echo \"Created a list of complete ORFs: {output.complete_orfs}\" && "
    "grep -F -f {output.complete_orfs} {input.gff3} | "
    "grep -P \"(\tCDS\t|\texon\t)\" | perl -pe 's/cds\.//; s/\.exon\d+//;' > {output.training_set_complete_gff3} && "
    "echo \"Used the list with complete ORFs to search for them in the pasa filtered transcriptome mapping file, "
    "keeping only the exon and CDS entries; renamed exon entries to have the same identifier as CDS entries, "
    "generating training gene candidates: {output.training_set_complete_gff3}\" && "
    "cat {output.training_set_complete_gff3} | perl -pe 's/\t\S*(asmbl_\d+).*/\t$1/' | "
    "sort -n -k 4 | sort -s -k 9 | sort -s -k 1,1 > {output.bonafide_gtf} && "
    "echo \"Sorted the training gene candidates: {output.bonafide_gtf}\") &> {log}"

shell:
    "(computeFlankingRegion.pl {input.bonafide_gtf} && "
    "grep \"flanking_DNA\" {log} | xargs echo > {output.flanking_length} && "
    "echo \"Flanking region length for training gene structures listed in the GTF file: {output.flanking_length}\") &> {log}"

shell:
    "(length=$(cut -d \" \" -f5 {input.flanking_length}) && "
    "gff2gbSmallDNA.pl {input.bonafide_gtf} {input.genome} \"$length\" {output.bonafide_gbff} && "
    "echo \"Converted genome file and gtf file into GenBank flatfile format using the flanking "
    "length \"$length\": {output.bonafide_gbff}\") &> {log}"

shell:
    "(echo \"Peptide sequences for the final candidate ORFs generated by TransDecoder within the PASApipeline: {input.pep}\" && "
    "cat {input.pep} | grep -Pzo \">.*complete.*\\n[^\\n]*\\n(?:[^>].*\\n)*\" | sed -E 's/^>([^[:space:]]+).*/>\\1/' | "
    "sed '/^$/d' > {output.mod_headers_pep}; "
    "echo \"Filtered the peptide file to keep sequences corresponding to complete ORFs, "
    "while also simplifying the sequence headers: {output.mod_headers_pep}\" && "
    "aa2nonred.pl {output.mod_headers_pep} {output.nonred_pep} --cores {threads} && "
    "grep \">\" {output.nonred_pep} | sed \"s/^>//g\" > {output.nonred_pep_headers} && "
    "echo \"Generated a fasta file containing protein sequences that are less than 80% redundant "
    "with any other sequence in the set: {output.nonred_pep}\" && "
    "echo \"Stored information on the headers from the non-redundant sequences: {output.nonred_pep_headers}\") &> {log}"

shell:
    "(cat {input.bonafide_gbff} | perl -ne 'if($_ =~ m/LOCUS\s+(\S+)\s/) {{$txLocus = $1;}} "
    "elsif($_ =~ m/\/gene=\\\"(\S+)\\\"/) {{$txInGb3{{$1}} = $txLocus}} if(eof()){{foreach(keys %txInGb3) "
    "{{print \"$_\\t$txInGb3{{$_}}\\n\";}}}}' > {output.gene_loci_tsv} && "
    "echo \"Generated a tab-separated file with information on gene names and their corresponding loci names: {output.gene_loci_tsv}\" && "
    "grep -f {input.nonred_pep_headers} {output.gene_loci_tsv} | cut -f2 > {output.nonred_loci_lst} && "
    "echo \"List of loci without redundant sequences on the amino acid level: {output.nonred_loci_lst}\" && "
    "filterGenesIn.pl {output.nonred_loci_lst} {input.bonafide_gbff} > {output.filtered_bonafide_gbff} && "
    "loci_orig=$(grep -c \"LOCUS\" {input.bonafide_gbff}) && "
    "loci_filt=$(grep -c \"LOCUS\" {output.filtered_bonafide_gbff}) && "
    "rmvd_loci=$(($loci_orig - $loci_filt)) && "
    "echo \"Removed \"$rmvd_loci\" genes from the original gbff file due to sequence redundancy on the amino acid level "
    "and generated a filtered gbff file: {output.filtered_bonafide_gbff}\" && "
    "echo \"Now ready to start training Augustus using \"$loci_filt\" genes from the species {wildcards.species}\") &> {log}"

shell:
    "(if ! [ -f {params.augustus_config} ]; then "
    "new_species.pl --species={wildcards.species}; "
    "echo \"Created subdirectory for the new species {wildcards.species} with template parameter files: {params.augustus_config_dir}\"; "
    "fi && "
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} && "
    "err_count=$(grep -c \"Variable stopCodonExcludedFromCDS set right\" {log}) && "
    "loci_filt=$(grep -c \"LOCUS\" {input.filtered_bonafide_gbff}) && "
    "cutoff=$(echo \"\"$loci_filt\"*0.50\" | bc -l | xargs printf \"%.0f\") && "
    "if [ \"$err_count\" -gt \"$cutoff\" ]; then "
    "echo \"More than half of the genes ($err_count) returned an error during etraining\"; "
    "echo \"Modifying value stopCodonExcludedFromCDS to true from {params.augustus_config}\"; "
    "sed -i \"s/stopCodonExcludedFromCDS false/stopCodonExcludedFromCDS true/\" {params.augustus_config}; "
    "else "
    "echo \"More than half of the genes are adequate to train augustus, with a total of $err_count displaying an error\"; "
    "echo \"Keeping value stopCodonExcludedFromCDS as false in {params.augustus_config}\"; "
    "fi && "
    "echo \"Running etraining again\" && "
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} 2>&1 | grep \"in sequence\" | "
    "perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > {output.error_genes_lst} && "
    "echo \"List with genes displaying errors, which will be removed from gbff file: {output.error_genes_lst}\" && "
    "filterGenes.pl {output.error_genes_lst} {input.filtered_bonafide_gbff} > {output.noerror_genes_gbff} && "

shell:
    "(augustus --species={params.sp_name} {input.genome} --gff3=on --outfile={output.predicted_genes_gff3} --errfile={output.error_out} && "
    "echo \"Genes predicted ab initio: {output.predicted_genes_gff3}\" && "
    "echo \"Error messages: {output.error_out}\") &> {log}"

shell:
    "(blat -noHead -minIdentity=92 {input.genome} {input.transcriptome} {output.psl} && "
    "echo \"cDNA aligned against genome with BLAT: {output.psl}\" && "
    "pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap {output.psl} {output.filt_psl} && "
    "echo \"Alignments filtered to obtain those that are potentially most useful for gene prediction: {output.filt_psl}\" && "
    "cat {output.filt_psl} | sort -n -k 16,16 | sort -s -k 14,14 > {output.sort_filt_psl} && "
    "echo \"Sorted filtered alignments according to start position and target sequence name: {output.sort_filt_psl}\") &> {log}"

shell:
    "(blat2hints.pl --in={input.sort_filt_psl} --out={output.hints} --minintronlen=35 --trunkSS && "
    "echo \"Convert sort-filtered blat alignments in psl format to hints: {output.hints}\" ) &> {log}"

shell:
    "(cp {params.augustus_orig_config} {output.augustus_new_config} && "
    "echo \"Augustus configuration file: {output.augustus_new_config}\" && "
    "augustus --species={params.sp_name} --extrinsicCfgFile={output.augustus_new_config} "
    "--hintsfile={input.hints} --allow_hinted_splicesites=atac "
    "--softmasking=off {input.genome} > {output.predicted_with_hints} && "
    "echo \"Genes predicted with extrinsic hints: {output.predicted_with_hints}\") &> {log}"
Support
- Future updates