AugusMake: Snakemake-Based Gene Annotation Pipeline with Augustus
Guided by a genome file, AugusMake de novo assembles a transcriptome with Trinity. It does so using RNA-Seq data provided by the user, or it automatically downloads paired-end (PE) reads from NCBI's SRA database given an accession number. PE reads are trimmed with Trimmomatic and their quality is assessed pre- and post-trimming with FastQC. Alternatively, the user can provide an already-assembled transcriptome. The completeness of the transcriptome is checked using BUSCO. Gene prediction can be done ab initio using just a genome file, i.e. no transcriptome is required, and/or with extrinsic hints derived from transcriptomic data. Both can use pre-trained parameters shipped with Augustus, or the user can choose to train the program for a new species. Results from training should be used with caution and carefully checked by the user after the analyses are completed.
The rules for Augustus follow the protocol of Hoff & Stanke (2018), the developers of Augustus.
System requirements
Local machine
I recommend running the workflow on an HPC system, as the analyses are resource- and time-consuming.
- If you don't have conda or miniconda yet, install it on your machine by following the official instructions.
- After you are all set with conda, I highly (highly!) recommend installing mamba, a much faster package manager that replaces conda:
  - First, activate your conda base:
conda activate base
  - Then, type:
conda install -n base -c conda-forge mamba
- Likewise, install Git if you don't have it yet.
HPC system
Follow the instructions from your cluster administrator regarding loading of modules, such as loading a root distribution of Conda. For example, modules might be used to set up environment variables, which first have to be loaded within the job scripts.
e.g.:
module load anaconda3/2022.05
Normally, the user doesn't have sudo rights to install anything to the root of the cluster. You might therefore want to work with a more up-to-date distribution of conda as your "new", local base, and (ideally) install and use mamba to replace conda as a package manager. To do so, first load the anaconda module and then create a new environment, here called localconda:
- module load anaconda3/2022.05
- conda create -n localconda -c conda-forge conda=22.9.0
- conda install -n localconda -c conda-forge mamba
- conda activate localconda
If you run conda env list, you'll probably see something like this:
/home/myusername/.conda/envs/localconda/
Installation
- Clone this repository
git clone https://github.com/juliawiggeshoff/AugusMake.git
- Activate your conda base
conda activate base
- If you are working on a cluster or have your own "local", isolated environment you want to activate instead, use its name:
conda activate localconda
- Install AugusMake into an isolated software environment by navigating to the directory where this repo was cloned and running:
conda env create --file environment.yaml
If you followed what was recommended in System requirements, run this instead:
mamba env create --file environment.yaml
This creates the AugusMake environment.
- Always activate the environment before running the workflow
On a local machine:
conda activate AugusMake
If you are on a cluster and/or created the environment "within" another environment, you want to run this first:
conda env list
You will probably see something like this among your environments:
/home/myusername/.conda/envs/localconda/envs/AugusMake
From now on, you have to give this full path when activating the environment prior to running the workflow:
conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake
Input data requirements
- Configuration file: config/config.yaml
- Species information table: config/species_table.tsv
Detailed information on the required input files, including all three Augustus processing alternatives, is found in config/README.md
Run AugusMake
Remember to always activate the environment first:
conda activate AugusMake
or
conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake
Before running the workflow locally or on an HPC system, make sure to do a "dry run" (--dry-run or -n) to see how many jobs will be run and whether any errors are flagged. Use --printshellcmds or -p for a full description of the rules, or --quiet or -q to just print a summary of the jobs:
snakemake --use-conda --cores 51 -q -n
Building DAG of jobs...
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
aa2nonred 2 20 20
all 1 1 1
all_busco_flag 1 1 1
all_fasterq_dump_done 1 1 1
augustus_ab_initio 1 1 1
augustus_hints 4 1 1
augustus_training 2 1 1
...
prefetch 1 1 1
raw_fastqc 2 2 2
samtools 2 5 5
trimmed_fastqc 8 2 2
trimmomatic 2 15 15
total 69 1 20
You can use the total number of jobs to "guide" the value for --jobs, while keeping in mind not to "allow" too many jobs at once; a value of 25 to 30 jobs might be the safest choice. See Cluster execution below.
Local machine execution
Not recommended if you don't have a lot of storage and CPUs available (and time to wait...). Nevertheless, you can simply run it like this:
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --cores 31 --max-threads 25 --rerun-incomplete > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &
Modify number of cores accordingly.
Cluster execution
Tailored to work with an SGE job scheduler. Modify as needed; see the Snakemake documentation for help.
Before the first execution of the workflow, you need to create the environments, otherwise Snakemake fails:
snakemake --cores 8 --use-conda --conda-create-envs-only
To run the workflow:
nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --jobs 15 --cores 51 --local-cores 15 --max-threads 25 --rerun-incomplete --cluster "qsub -terse -V -b y -j y -o snakejob_logs/ -cwd -pe smp {threads} -q small.q,medium.q,large.q -M user.email@gmail.com -m be" > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &
Remember to:
- Create snakejob_logs in the working directory
- Modify user.email@gmail.com
- Change values for --jobs, --cores, --local-cores, and --max-threads accordingly
- Change the -q queue names
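If your cluster runs SLURM instead of SGE, the same pattern applies with a different --cluster string. A minimal sketch, assuming a generic sbatch setup (the partition name, log path, and notification flags are illustrative placeholders, not values from this repository):
nohup snakemake --keep-going --use-conda --printshellcmds --nolock --jobs 15 --cores 51 --local-cores 15 --max-threads 25 --rerun-incomplete --cluster "sbatch --cpus-per-task={threads} --partition=medium --mail-user=user.email@gmail.com --mail-type=BEGIN,END --output=snakejob_logs/%j.out" > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &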
Make sure you set a low value for --local-cores so you don't take up too many resources on the host node. Likewise, it is not recommended to allow a lot of --jobs to run in parallel, for similar reasons. Lastly, I like to set a --max-threads limit to ensure no rule "hogs" too many threads. This is often necessary because per-rule thread usage is set as a percentage of the total resources provided by the user. For example, if you can and want to use 110 threads, a rule like genome_guided_trinity will "take" 40% of these, i.e. 44 threads, which is often overkill. By limiting the maximum number of threads to 25, other jobs can run simultaneously, while Trinity still gets a "decent" number of threads to use.
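To make the percentage-based threading concrete, here is a minimal Snakefile sketch; the 40% figure comes from the example above, but the rule body, the use of workflow.cores, and the min() cap are illustrative assumptions, not AugusMake's actual implementation:
# Hypothetical sketch: give a rule a fraction of the total cores passed
# via --cores, capped at 25 so no single rule hogs the resources.
rule genome_guided_trinity_sketch:
    output: "trinity_threads.txt"
    threads: min(25, max(1, int(workflow.cores * 0.40)))
    shell: "echo 'Trinity would get {threads} threads' > {output}"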
If you named your configuration file differently, e.g. NEW_NAME.yaml, include the option --configfile config/NEW_NAME.yaml
Main results
report.zip
A report.zip file is automatically generated once the workflow finishes successfully. ALWAYS include --nolock when running snakemake; otherwise, the report is not automatically created and an error is returned. If that happens, the user can still generate the report manually with snakemake --report report.zip afterward, but the automatic generation is obviously preferred.
Charts visually representing the BUSCO results are included in report.zip. Each species has its own individual chart, and one chart combines all species to allow comparing the results between assemblies. Similarly, the FastQC report files are included for each species, pre- and post-trimming.
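If you want to check which files ended up in the archive without unpacking it, standard zip tooling works; for example (assuming unzip is available):
unzip -l report.zip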
To be included: Augustus results
Code Snippets
shell:
    "(prefetch {wildcards.accession} --output-directory {params.outdir_sra} && "
    "echo \"{output} DONE\") &> {log}"

shell:
    "(fasterq-dump {input} --outdir {params.outdir} --threads {threads} && "
    "echo \"fasterq-dump\" finished extracting {input} && "
    "pigz -p {threads} {params.outdir}/{wildcards.accession}*.fastq && "
    "echo \"pigz finished compressing {output.fwd} {output.rev}\") &> {log}"
# note: the second redirection appends (>>) so the forward-read path is not overwritten
shell:
    "echo {input.fwd} > {output} && echo {input.rev} >> {output}"
shell:
    "echo {input} > {output.all_SRA_downloads}"

shell:
    "(zcat {input.fwd} | fastqc stdin:{wildcards.species}_fwd -o {params.fastqc_pretrim_outdir} -t {threads} && "
    "zcat {input.rev} | fastqc stdin:{wildcards.species}_rev -o {params.fastqc_pretrim_outdir} -t {threads}) &> {log}"

shell:
    "trimmomatic PE -threads {threads} -phred33 {input.fwd} {input.rev} {output.fwd_p} {output.fwd_u} "
    "{output.rev_p} {output.rev_u} {params.clip}{params.adapter}{params.trim_params} {params.info_len} &> {log}"

shell:
    "(zcat {input.trimmed} | fastqc stdin:{wildcards.species}_trimmed_{wildcards.suffix} "
    "-o {params.fastqc_posttrim_outdir} -t {threads}) &> {log}"

shell:
    "gmap_build --dir {params.out_dir} {input.genome} --genomedb {params.prefix} &> {log}"

shell:
    "(gsnap --gunzip --dir {params.out_dir} --db {params.prefix} "
    "{input.fwd_p} {input.rev_p} --nthreads {threads} --format sam --novelsplicing=1 "
    "--nofails --output-file {output.sam} --npaths=20) &> {log}"

shell:
    "(samtools view -bo {output.bam} {input.sam} -@ {threads} && "
    "echo Converted sam file to bam && "
    "samtools sort -o {output.coordsortbam} {output.bam} -@ {threads} && "
    "echo Converted read alignments to a coordinate-sorted bam file) &> {log}"

shell:
    "(Trinity --genome_guided_bam {input.coordsortbam} --genome_guided_max_intron 10000 "
    "--max_memory 50G --CPU {threads} --output {params.out_dir}) &> {log}"

shell:
    "mkdir -p {params.flag_dir} && "
    "cp {input} {params.all_species_dir}"

shell:
    "ls {input} > {output}"

shell:
    "mkdir -p {params.outdir}; "
    "sed \"s/<__DATABASE__>/{params.path_db}/\" {params.template_config} | "
    "sed \"s/<__MIN_PERCENT_ALIGNED__>/0.8/\" | "
    "sed \"s/<__MIN_AVG_PER_ID__>/0.9/\" > {output.path_species_config}; "

shell:
    "(grep \"complete\" {input.cds} | perl -pe 's/>(\S+).*/$1/' > {output.complete_orfs} && "
    "echo \"Created a list of complete ORFs: {output.complete_orfs}\" && "
    "grep -F -f {output.complete_orfs} {input.gff3} | "
    "grep -P \"(\tCDS\t|\texon\t)\" | perl -pe 's/cds\.//; s/\.exon\d+//;' > {output.training_set_complete_gff3} && "
    "echo \"Used the list with complete ORFs to search for them in the pasa filtered transcriptome mapping file, "
    "keeping only the exon and CDS entries; renamed exon entries to have the same identifier as CDS entries, "
    "generating training gene candidates: {output.training_set_complete_gff3}\" && "
    "cat {output.training_set_complete_gff3} | perl -pe 's/\t\S*(asmbl_\d+).*/\t$1/' | "
    "sort -n -k 4 | sort -s -k 9 | sort -s -k 1,1 > {output.bonafide_gtf} && "
    "echo \"Sorted the training gene candidates: {output.bonafide_gtf}\") &> {log}"

shell:
    "(computeFlankingRegion.pl {input.bonafide_gtf} && "
    "grep \"flanking_DNA\" {log} | xargs echo > {output.flanking_length} && "
    "echo \"Flanking region length for training gene structures listed in the GTF file: {output.flanking_length}\") &> {log}"

shell:
    "(length=$(cut -d \" \" -f5 {input.flanking_length}) && "
    "gff2gbSmallDNA.pl {input.bonafide_gtf} {input.genome} \"$length\" {output.bonafide_gbff} && "
    "echo \"Converted genome file and gtf file into GenBank flatfile format using the flanking "
    "length \"$length\": {output.bonafide_gbff}\") &> {log}"

shell:
    "(echo \"Peptide sequences for the final candidate ORFs generated by TransDecoder within the PASApipeline: {input.pep}\" && "
    "cat {input.pep} | grep -Pzo \">.*complete.*\\n[^\\n]*\\n(?:[^>].*\\n)*\" | sed -E 's/^>([^[:space:]]+).*/>\\1/' | "
    "sed '/^$/d' > {output.mod_headers_pep}; "
    "echo \"Filtered the peptide file to keep sequences corresponding to complete ORFs, "
    "while also simplifying the sequence headers: {output.mod_headers_pep}\" && "
    "aa2nonred.pl {output.mod_headers_pep} {output.nonred_pep} --cores {threads} && "
    "grep \">\" {output.nonred_pep} | sed \"s/^>//g\" > {output.nonred_pep_headers} && "
    "echo \"Generated a fasta file containing protein sequences that are less than 80% redundant "
    "with any other sequence in the set: {output.nonred_pep}\" && "
    "echo \"Stored information on the headers from the non-redundant sequences: {output.nonred_pep_headers}\") &> {log}"

shell:
    "(cat {input.bonafide_gbff} | perl -ne 'if($_ =~ m/LOCUS\s+(\S+)\s/) {{$txLocus = $1;}} "
    "elsif($_ =~ m/\/gene=\\\"(\S+)\\\"/) {{$txInGb3{{$1}} = $txLocus}} if(eof()){{foreach(keys %txInGb3) "
    "{{print \"$_\\t$txInGb3{{$_}}\\n\";}}}}' > {output.gene_loci_tsv} && "
    "echo \"Generated a tab-separated file with information on gene names and their corresponding loci names: {output.gene_loci_tsv}\" && "
    "grep -f {input.nonred_pep_headers} {output.gene_loci_tsv} | cut -f2 > {output.nonred_loci_lst} && "
    "echo \"List of loci without redundant sequences on the amino acid level: {output.nonred_loci_lst}\" && "
    "filterGenesIn.pl {output.nonred_loci_lst} {input.bonafide_gbff} > {output.filtered_bonafide_gbff} && "
    "loci_orig=$(grep -c \"LOCUS\" {input.bonafide_gbff}) && "
    "loci_filt=$(grep -c \"LOCUS\" {output.filtered_bonafide_gbff}) && "
    "rmvd_loci=$(($loci_orig - $loci_filt)) && "
    "echo \"Removed \"$rmvd_loci\" genes from the original gbff file due to sequence redundancy on the amino acid level "
    "and generated a filtered gbff file: {output.filtered_bonafide_gbff}\" && "
    "echo \"Now ready to start training Augustus using \"$loci_filt\" genes from the species {wildcards.species}\") &> {log}"

shell:
    "(if ! [ -f {params.augustus_config} ]; then "
    "new_species.pl --species={wildcards.species}; "
    "echo \"Created subdirectory for the new species {wildcards.species} with template parameter files: {params.augustus_config_dir}\"; "
    "fi && "
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} && "
    "err_count=$(grep -c \"Variable stopCodonExcludedFromCDS set right\" {log}) && "
    "loci_filt=$(grep -c \"LOCUS\" {input.filtered_bonafide_gbff}) && "
    "cutoff=$(echo \"\"$loci_filt\"*0.50\" | bc -l | xargs printf \"%.0f\") && "
    "if [ \"$err_count\" -gt \"$cutoff\" ]; then "
    "echo \"More than half of the genes ($err_count) returned an error during etraining\"; "
    "echo \"Modifying value stopCodonExcludedFromCDS to true from {params.augustus_config}\"; "
    "sed -i \"s/stopCodonExcludedFromCDS false/stopCodonExcludedFromCDS true/\" {params.augustus_config}; "
    "else "
    "echo \"More than half of the genes are adequate to train augustus, with a total of $err_count displaying an error\"; "
    "echo \"Keeping value stopCodonExcludedFromCDS as false in {params.augustus_config}\"; "
    "fi && "
    "echo \"Running etraining again\" && "
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} 2>&1 | grep \"in sequence\" | "
    "perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > {output.error_genes_lst} && "
    "echo \"List with genes displaying errors, which will be removed from gbff file: {output.error_genes_lst}\" && "
    "filterGenes.pl {output.error_genes_lst} {input.filtered_bonafide_gbff} > {output.noerror_genes_gbff} && "

shell:
    "(augustus --species={params.sp_name} {input.genome} --gff3=on --outfile={output.predicted_genes_gff3} --errfile={output.error_out} && "
    "echo \"Genes predicted ab initio: {output.predicted_genes_gff3}\" && "
    "echo \"Error messages: {output.error_out}\") &> {log}"

shell:
    "(blat -noHead -minIdentity=92 {input.genome} {input.transcriptome} {output.psl} && "
    "echo \"cDNA aligned against genome with BLAT: {output.psl}\" && "
    "pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap {output.psl} {output.filt_psl} && "
    "echo \"Alignments filtered to obtain those that are potentially most useful for gene prediction: {output.filt_psl}\" && "
    "cat {output.filt_psl} | sort -n -k 16,16 | sort -s -k 14,14 > {output.sort_filt_psl} && "
    "echo \"Sorted filtered alignments according to start position and target sequence name: {output.sort_filt_psl}\") &> {log}"

shell:
    "(blat2hints.pl --in={input.sort_filt_psl} --out={output.hints} --minintronlen=35 --trunkSS && "
    "echo \"Convert sort-filtered blat alignments in psl format to hints: {output.hints}\" ) &> {log}"

shell:
    "(cp {params.augustus_orig_config} {output.augustus_new_config} && "
    "echo \"Augustus configuration file: {output.augustus_new_config}\" && "
    "augustus --species={params.sp_name} --extrinsicCfgFile={output.augustus_new_config} "
    "--hintsfile={input.hints} --allow_hinted_splicesites=atac "
    "--softmasking=off {input.genome} > {output.predicted_with_hints} && "
    "echo \"Genes predicted with extrinsic hints: {output.predicted_with_hints}\") &> {log}"
Support
- Future updates