Trinity assembly pipeline for BioCommons / USydney Informatics Hub
The pipeline requires Nextflow to run. DSL2 syntax is used, so Nextflow version 20.07.1 or higher is required.
NOTE: this project was carried out in the context of workflow automation, reproducibility and scalability. The scope was to port an existing bash pipeline to Nextflow and, in doing so, to investigate a few points, namely:
- packing of multiple serial analyses into a single process;
- option to leverage node-local disks;
- option to leverage overlayFS in Singularity;
- ease of adding configuration files for more computing clusters (Gadi was tested in this case).
The first three items tackle scalability, in that they make it possible to process large input datasets; the fourth one is about portability.
Pipeline and requirements
This pipeline is based on SIH-Raijin-Trinity, with scheduler parameters updated following Gadi-Trinity:

Jellyfish -> Inchworm -> Chrysalis -> Butterfly mini-assemblies -> Aggregate

There are two software requirements:
- Trinity, the main bioinformatics package; tests have been run with Trinity version 2.8.6 (official container);
- GNU Parallel, to orchestrate mini-assemblies within each compute node; version 20191022 has been tested.
Basic usage
```
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -profile zeus --slurm_account='<Your Pawsey Project>'
```
The flag --reads is required to specify the name of the pair of input read files.
Note some syntax requirements:
- encapsulate the file name specification between single quotes;
- within a file pair, use names that differ only by a single character, which distinguishes the two files, in this case 1 or 2;
- use curly brackets to specify the wild character within the file pair, e.g. {1,2};
- the prefix to the wild character serves as the sample ID, e.g. reads_.
The flag -profile (note the single dash) selects the appropriate profile for the machine in use, Zeus in this case. On Zeus, use the flag --slurm_account to set your Pawsey account; on Gadi (NCI), use the flag --pbs_account instead, as in the sketch below.
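For instance, a Gadi run might look like the following (a minimal sketch; the project placeholder is illustrative, by analogy with the Zeus example above):

```
# Sketch: the same pipeline launched on Gadi at NCI, with the PBS project
# code passed in place of the Slurm account
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -profile gadi --pbs_account='<Your NCI Project>'
```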
The pipeline will output two files prefixed by the sample ID, in this case reads_Trinity.fasta and reads_Trinity.fasta.gene_trans_map. By default, they are saved in the same directory as the input read files.
Multiple inputs at once
The pipeline allows feeding in multiple datasets at once. You can use input file name patterns to this end:
- for multiple input read pairs in the same directory, e.g. sample1_R{1,2}.fq, sample2_R{1,2}.fq and so on, use --reads='sample*{1,2}.fq';
- for multiple read pairs in distinct directories, e.g. sample1/R{1,2}.fq, sample2/R{1,2}.fq and so on, use --reads='sample*/R{1,2}.fq'.

A complete multi-sample launch is sketched below.
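For example, with read pairs named sample1_R1.fq/sample1_R2.fq, sample2_R1.fq/sample2_R2.fq and so on in the current directory (a sketch combining the patterns above with the basic Zeus invocation):

```
# Sketch: assemble several read pairs in one go using a wildcard pattern;
# each pair is processed as a separate sample
nextflow run marcodelapierre/trinity-nf \
  --reads='sample*{1,2}.fq' \
  -profile zeus --slurm_account='<Your Pawsey Project>'
```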
Major options
The pipeline can be used with the additional profile localdisk, for instance -profile zeus,localdisk, to run I/O intensive processes on node-local disks; a configuration parameter defines the naming convention for the corresponding node-local scratch directories.
Alternatively, the pipeline can be used with the additional profile overlay, as in -profile zeus,overlay, to enable execution inside an overlayFS (a virtual filesystem in a file) and mitigate I/O intensive analyses. This option requires the use of Singularity. A configuration parameter defines the size of the overlay files (one file per concurrent task). Both options are sketched below.
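On Zeus, for instance, these profiles combine with the base profile as follows (a sketch re-using the flags shown in the basic example):

```
# Sketch: run I/O intensive steps on node-local scratch directories
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile zeus,localdisk --slurm_account='<Your Pawsey Project>'

# Sketch: run inside an overlayFS instead (requires Singularity)
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile zeus,overlay --slurm_account='<Your Pawsey Project>'
```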
On Gadi at NCI, you can use -profile gadi,localdisk to run I/O intensive processes on node-local disks (JOBFS). The default Gadi profile uses environment modules to provide the required packages; to switch to a Singularity container instead, use -profile gadi,singularity. See the sketch below.
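The corresponding Gadi launch commands might look like this (a sketch, with a placeholder project code):

```
# Sketch: Gadi run using node-local JOBFS for I/O intensive steps
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile gadi,localdisk --pbs_account='<Your NCI Project>'

# Sketch: Gadi run using a Singularity container instead of environment modules
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile gadi,singularity --pbs_account='<Your NCI Project>'
```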
Usage on different systems
The main pipeline file, main.nf, contains the pipeline logic and is almost completely machine independent. All system-specific information is contained in configuration files under the config directory, which are included from nextflow.config.
Examples are provided for Zeus and Nimbus at Pawsey, and Gadi at NCI; you can use them as templates for other systems.
Typical information to be specified includes scheduler configuration (including project name), software availability (containers, conda, modules, ...), and possibly other specifics such as the location of the work directory at runtime, filesystem options (e.g. setting the cache mode to lenient when using parallel filesystems), and pipeline configuration (e.g. local directory naming for localdisk, size of overlay files). One way to bootstrap a configuration for a new system is sketched below.
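For a system without a ready-made profile, a practical starting point is to copy one of the provided files under config, edit the scheduler, account and software settings, and pass it at launch time with Nextflow's standard -c option (a sketch; config/mycluster.config is a hypothetical file name):

```
# Sketch: launch with a hypothetical site configuration file derived
# from one of the examples under config/
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -c config/mycluster.config
```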
NOTE on Gadi: it is assumed that all scripts and data required at runtime can be found in directories belonging to the PBS project specified in gadi.config and in the pipeline submission script (the two must match).
Additional resources
The extra directory contains an example Slurm script, job_zeus.sh, to run on Zeus, and an example PBS script, job_gadi.sh, to run on Gadi at NCI. There is also a sample script, log.sh, that takes a run name as input and displays formatted runtime information.
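Under the usual scheduler conventions these might be used as follows (a sketch; the submission commands are standard Slurm/PBS ones, not taken from the repository documentation):

```
# Sketch: submit the example launcher scripts, then inspect a run by name
sbatch extra/job_zeus.sh        # Zeus (Slurm)
qsub extra/job_gadi.sh          # Gadi (PBS)
bash extra/log.sh <run_name>    # formatted runtime info for a given run
```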
This directory also contains scripts that can be used to install a patched version of Nextflow on Gadi at NCI, required to comply with its PBS configuration: hack-nextflow-pbs-gadi.sh, which in turn requires patch.PbsProExecutor.groovy.
The test directory contains a small input dataset and launching scripts for quick testing of the pipeline (both for Zeus and Gadi), with a total runtime of a few minutes.
Code Snippets
```
"""
# create an ext3 image file (size in MB from params.overlay_size_mb_one)
# to be used as a Singularity overlayFS
singularity exec docker://ubuntu:18.04 bash -c ' \
  out_file=\"${params.overfileprefix}one\" && \
  mkdir -p overlay_tmp/upper overlay_tmp/work && \
  dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_one} bs=1M && \
  mkfs.ext3 -d overlay_tmp \${out_file} && \
  rm -rf overlay_tmp \
  '
"""
```
```
"""
# create one ext3 overlay file per chunk of read partitions
# (name derived from the input tarball, size in MB from params.overlay_size_mb_many)
singularity exec docker://ubuntu:18.04 bash -c ' \
  out_file=\"${params.overfileprefix}${reads_fa.toString().minus('.tgz')}\" && \
  mkdir -p overlay_tmp/upper overlay_tmp/work && \
  dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_many} bs=1M && \
  mkfs.ext3 -d overlay_tmp \${out_file} && \
  rm -rf overlay_tmp \
  '
"""
```
```
"""
# convert the task memory limit into the form expected by Trinity, e.g. '120 GB' -> '120G'
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# first Trinity phase: stop after the Jellyfish k-mer counting stage (--no_run_inchworm)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_run_inchworm
"""
```
```
"""
# convert the task memory limit into Trinity format (e.g. '120 GB' -> '120G')
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# second Trinity phase: run Inchworm, stop before Chrysalis (--no_run_chrysalis)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --inchworm_cpu ${task.cpus} \
  --no_run_chrysalis
"""
```
```
"""
# optionally stage inputs to a node-local scratch directory
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cp -r \$( readlink $read1 ) ${params.localdir}/
  cp -r \$( readlink $read2 ) ${params.localdir}/
  cp -r \$( readlink ${params.taskoutdir} ) ${params.localdir}/
  cd ${params.localdir}
fi

# convert the task memory limit into Trinity format (e.g. '120 GB' -> '120G')
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# third Trinity phase: run Chrysalis and partition the reads,
# skipping the distributed Butterfly mini-assemblies (--no_distributed_trinity_exec)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_distributed_trinity_exec

# when using node-local disks, pack the read partitions into chunked tarballs
# (params.bf_collate partitions per chunk) and copy them back to the work directory
if [ "${params.localdisk}" == "true" ] ; then
  find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" >output_list
  split -l ${params.bf_collate} -a 4 output_list chunk
  for f in chunk* ; do
    tar -cz -h -f \${f}.tgz -T \${f}
  done
  cd \$here
  cp ${params.localdir}/chunk*.tgz .
  rm -r ${params.localdir}
fi
"""
```
```
"""
# optionally stage the chunk of read partitions to a node-local scratch directory
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cp -r \$( readlink $reads_fa ) ${params.localdir}/
  cd ${params.localdir}
fi

# convert the Butterfly memory limit into Trinity format
mem='${params.bf_mem}'
mem=\${mem%B}
export mem=\${mem// /}

# helper script: run one Butterfly mini-assembly on a single read-partition fasta
cat << "EOF" >trinity.sh
Trinity \
  --single \${1} \
  --run_as_paired \
  --seqType fa \
  --verbose \
  --no_version_check \
  --workdir trinity_workdir \
  --output \${1}.out \
  --max_memory \${mem} \
  --CPU ${params.bf_cpus} \
  --trinity_complete \
  --full_cleanup \
  --no_distributed_trinity_exec
EOF
chmod +x trinity.sh

# run the mini-assemblies across read partitions with GNU Parallel
if [ "${params.localdisk}" == "true" ] ; then
  tar xzf ${reads_fa}
  find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" | parallel -j ${task.cpus} ./trinity.sh {}
  find ${params.taskoutdir}/read_partitions -name "*inity.fasta" | tar -cz -h -f out_${reads_fa} -T -
  cd \$here
  cp ${params.localdir}/out_chunk*.tgz .
  rm -r ${params.localdir}
else
  ls *inity.reads.fa | parallel -j ${task.cpus} ./trinity.sh {}
fi
"""
```
```
"""
# locate the Trinity installation directory, to access its utility scripts
my_trinity=\$(which Trinity)
my_trinity=\$(dirname \$my_trinity)

# gather the Butterfly outputs, unpacking chunk tarballs on node-local disk if enabled
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cd ${params.localdir}
  for f in ${reads_fasta} ; do
    cp \$( readlink \$here/\$f ) .
    tar xzf \${f}
  done
  find ${params.taskoutdir}/read_partitions -name "*inity.fasta" >input_list
else
  ls *inity.fasta >input_list
fi

# aggregate the mini-assemblies into the final Trinity.fasta
# and generate the gene-to-transcript map
cat input_list | \${my_trinity}/util/support_scripts/partitioned_trinity_aggregator.pl \
  --token_prefix TRINITY_DN --output_prefix Trinity.tmp
mv Trinity.tmp.fasta Trinity.fasta

\${my_trinity}/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta > Trinity.fasta.gene_trans_map

# copy results back from the node-local directory, if used
if [ "${params.localdisk}" == "true" ] ; then
  cd \$here
  cp ${params.localdir}/Trinity.fasta* .
  rm -r ${params.localdir}
fi
"""
```
Support
- Future updates