Trinity assembly pipeline for BioCommons / USydney Informatics Hub
The pipeline requires Nextflow to run. DSL2 syntax is used, so Nextflow version 20.07.1 or higher is required.
NOTE: this project was carried out in the context of workflow automation, reproducibility and scalability. The scope was to port an existing bash pipeline to Nextflow and, in doing so, to investigate a few points, namely:
- packing of multiple serial analyses into a single process;
- option to leverage node-local disks;
- option to leverage overlayFS in Singularity;
- ease of adding configuration files for more computing clusters (Gadi was tested in this case).
The first three items tackle scalability, in that they make it possible to process large input datasets; the fourth one is about portability.
Pipeline and requirements
This pipeline is based on SIH-Raijin-Trinity, with scheduler parameters updated following Gadi-Trinity:

Jellyfish -> Inchworm -> Chrysalis -> Butterfly mini-assemblies -> Aggregate

There are two software requirements:
- Trinity, the main bioinformatics package; tests have been run with Trinity version 2.8.6 (official container);
- GNU Parallel, to orchestrate mini-assemblies within each compute node; version 20191022 has been tested.
Basic usage
```
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -profile zeus --slurm_account='<Your Pawsey Project>'
```
The flag --reads is required to specify the name of the pair of input read files.
Note some syntax requirements:
- encapsulate the file name specification between single quotes;
- within a file pair, use names that differ only by a single character, which distinguishes the two files, in this case 1 or 2;
- use curly brackets to specify the wild character within the file pair, e.g. {1,2};
- the prefix to the wild character serves as the sample ID, e.g. reads_.
The flag -profile (note the single dash) selects the appropriate profile for the machine in use, Zeus in this case. On Zeus, use the flag --slurm_account to set your Pawsey account; on Gadi (NCI), use the flag --pbs_account instead, as in the sketch below.
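For instance, a Gadi run might look like the following (a minimal sketch; the project placeholder is illustrative, by analogy with the Zeus example above):

```
# Sketch: the same pipeline launched on Gadi at NCI, with the PBS project
# code passed in place of the Slurm account
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -profile gadi --pbs_account='<Your NCI Project>'
```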
The pipeline will output two files prefixed by the sample ID, in this case reads_Trinity.fasta and reads_Trinity.fasta.gene_trans_map. By default, they are saved in the same directory as the input read files.
Multiple inputs at once
The pipeline allows feeding in multiple datasets at once. You can use input file name patterns to this end:
- for multiple input read pairs in the same directory, e.g. sample1_R{1,2}.fq, sample2_R{1,2}.fq and so on, use --reads='sample*{1,2}.fq';
- for multiple read pairs in distinct directories, e.g. sample1/R{1,2}.fq, sample2/R{1,2}.fq and so on, use --reads='sample*/R{1,2}.fq'.

A complete multi-sample launch is sketched below.
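For example, with read pairs named sample1_R1.fq/sample1_R2.fq, sample2_R1.fq/sample2_R2.fq and so on in the current directory (a sketch combining the patterns above with the basic Zeus invocation):

```
# Sketch: assemble several read pairs in one go using a wildcard pattern;
# each pair is processed as a separate sample
nextflow run marcodelapierre/trinity-nf \
  --reads='sample*{1,2}.fq' \
  -profile zeus --slurm_account='<Your Pawsey Project>'
```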
Major options
The pipeline can be used with the additional profile localdisk, for instance -profile zeus,localdisk, to run I/O intensive processes on node-local disks; a configuration parameter defines the naming convention for the corresponding node-local scratch directories.
Alternatively, the pipeline can be used with the additional profile overlay, as in -profile zeus,overlay, to enable execution inside an overlayFS (a virtual filesystem in a file) and mitigate I/O intensive analyses. This option requires the use of Singularity. A configuration parameter defines the size of the overlay files (one file per concurrent task). Both options are sketched below.
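On Zeus, for instance, these profiles combine with the base profile as follows (a sketch re-using the flags shown in the basic example):

```
# Sketch: run I/O intensive steps on node-local scratch directories
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile zeus,localdisk --slurm_account='<Your Pawsey Project>'

# Sketch: run inside an overlayFS instead (requires Singularity)
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile zeus,overlay --slurm_account='<Your Pawsey Project>'
```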
On Gadi at NCI, you can use -profile gadi,localdisk to run I/O intensive processes on node-local disks (JOBFS). The default Gadi profile uses environment modules to provide the required packages; to switch to a Singularity container instead, use -profile gadi,singularity. See the sketch below.
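The corresponding Gadi launch commands might look like this (a sketch, with a placeholder project code):

```
# Sketch: Gadi run using node-local JOBFS for I/O intensive steps
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile gadi,localdisk --pbs_account='<Your NCI Project>'

# Sketch: Gadi run using a Singularity container instead of environment modules
nextflow run marcodelapierre/trinity-nf --reads='reads_{1,2}.fq.gz' \
  -profile gadi,singularity --pbs_account='<Your NCI Project>'
```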
Usage on different systems
The main pipeline file, main.nf, contains the pipeline logic and is almost completely machine independent. All system-specific information is contained in configuration files under the config directory, which are included from nextflow.config.
Examples are provided for Zeus and Nimbus at Pawsey, and Gadi at NCI; you can use them as templates for other systems.
Typical information to be specified includes scheduler configuration (including project name), software availability (containers, conda, modules, ...), and possibly other specifics such as the location of the work directory at runtime, filesystem options (e.g. setting the cache mode to lenient when using parallel filesystems), and pipeline configuration (e.g. local directory naming for localdisk, size of overlay files). One way to bootstrap a configuration for a new system is sketched below.
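For a system without a ready-made profile, a practical starting point is to copy one of the provided files under config, edit the scheduler, account and software settings, and pass it at launch time with Nextflow's standard -c option (a sketch; config/mycluster.config is a hypothetical file name):

```
# Sketch: launch with a hypothetical site configuration file derived
# from one of the examples under config/
nextflow run marcodelapierre/trinity-nf \
  --reads='reads_{1,2}.fq.gz' \
  -c config/mycluster.config
```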
NOTE on Gadi: it is assumed that all scripts and data required at runtime can be found in directories belonging to the PBS project specified in gadi.config and in the pipeline submission script (the two must match).
Additional resources
The extra directory contains an example Slurm script, job_zeus.sh, to run on Zeus, and an example PBS script, job_gadi.sh, to run on Gadi at NCI. There is also a sample script, log.sh, that takes a run name as input and displays formatted runtime information.
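Under the usual scheduler conventions these might be used as follows (a sketch; the submission commands are standard Slurm/PBS ones, not taken from the repository documentation):

```
# Sketch: submit the example launcher scripts, then inspect a run by name
sbatch extra/job_zeus.sh        # Zeus (Slurm)
qsub extra/job_gadi.sh          # Gadi (PBS)
bash extra/log.sh <run_name>    # formatted runtime info for a given run
```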
This directory also contains scripts that can be used to install a patched version of Nextflow on Gadi at NCI, required to comply with its PBS configuration: hack-nextflow-pbs-gadi.sh, which in turn requires patch.PbsProExecutor.groovy.
The test directory contains a small input dataset and launching scripts for quick testing of the pipeline (both for Zeus and Gadi), with a total runtime of a few minutes.
Code Snippets
```
"""
# create an ext3 image file (size in MB from params.overlay_size_mb_one)
# to be used as a Singularity overlayFS
singularity exec docker://ubuntu:18.04 bash -c ' \
  out_file=\"${params.overfileprefix}one\" && \
  mkdir -p overlay_tmp/upper overlay_tmp/work && \
  dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_one} bs=1M && \
  mkfs.ext3 -d overlay_tmp \${out_file} && \
  rm -rf overlay_tmp \
  '
"""
```
```
"""
# create one ext3 overlay file per chunk of read partitions
# (name derived from the input tarball, size in MB from params.overlay_size_mb_many)
singularity exec docker://ubuntu:18.04 bash -c ' \
  out_file=\"${params.overfileprefix}${reads_fa.toString().minus('.tgz')}\" && \
  mkdir -p overlay_tmp/upper overlay_tmp/work && \
  dd if=/dev/zero of=\${out_file} count=${params.overlay_size_mb_many} bs=1M && \
  mkfs.ext3 -d overlay_tmp \${out_file} && \
  rm -rf overlay_tmp \
  '
"""
```
```
"""
# convert the task memory limit into the form expected by Trinity, e.g. '120 GB' -> '120G'
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# first Trinity phase: stop after the Jellyfish k-mer counting stage (--no_run_inchworm)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_run_inchworm
"""
```
```
"""
# convert the task memory limit into Trinity format (e.g. '120 GB' -> '120G')
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# second Trinity phase: run Inchworm, stop before Chrysalis (--no_run_chrysalis)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --inchworm_cpu ${task.cpus} \
  --no_run_chrysalis
"""
```
```
"""
# optionally stage inputs to a node-local scratch directory
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cp -r \$( readlink $read1 ) ${params.localdir}/
  cp -r \$( readlink $read2 ) ${params.localdir}/
  cp -r \$( readlink ${params.taskoutdir} ) ${params.localdir}/
  cd ${params.localdir}
fi

# convert the task memory limit into Trinity format (e.g. '120 GB' -> '120G')
mem='${task.memory}'
mem=\${mem%B}
mem=\${mem// /}

# third Trinity phase: run Chrysalis and partition the reads,
# skipping the distributed Butterfly mini-assemblies (--no_distributed_trinity_exec)
Trinity \
  --left $read1 \
  --right $read2 \
  --seqType fq \
  --no_normalize_reads \
  --verbose \
  --no_version_check \
  --output ${params.taskoutdir} \
  --max_memory \${mem} \
  --CPU ${task.cpus} \
  --no_distributed_trinity_exec

# when using node-local disks, pack the read partitions into chunked tarballs
# (params.bf_collate partitions per chunk) and copy them back to the work directory
if [ "${params.localdisk}" == "true" ] ; then
  find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" >output_list
  split -l ${params.bf_collate} -a 4 output_list chunk
  for f in chunk* ; do
    tar -cz -h -f \${f}.tgz -T \${f}
  done
  cd \$here
  cp ${params.localdir}/chunk*.tgz .
  rm -r ${params.localdir}
fi
"""
```
```
"""
# optionally stage the chunk of read partitions to a node-local scratch directory
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cp -r \$( readlink $reads_fa ) ${params.localdir}/
  cd ${params.localdir}
fi

# convert the Butterfly memory limit into Trinity format
mem='${params.bf_mem}'
mem=\${mem%B}
export mem=\${mem// /}

# helper script: run one Butterfly mini-assembly on a single read-partition fasta
cat << "EOF" >trinity.sh
Trinity \
  --single \${1} \
  --run_as_paired \
  --seqType fa \
  --verbose \
  --no_version_check \
  --workdir trinity_workdir \
  --output \${1}.out \
  --max_memory \${mem} \
  --CPU ${params.bf_cpus} \
  --trinity_complete \
  --full_cleanup \
  --no_distributed_trinity_exec
EOF
chmod +x trinity.sh

# run the mini-assemblies across read partitions with GNU Parallel
if [ "${params.localdisk}" == "true" ] ; then
  tar xzf ${reads_fa}
  find ${params.taskoutdir}/read_partitions -name "*inity.reads.fa" | parallel -j ${task.cpus} ./trinity.sh {}
  find ${params.taskoutdir}/read_partitions -name "*inity.fasta" | tar -cz -h -f out_${reads_fa} -T -
  cd \$here
  cp ${params.localdir}/out_chunk*.tgz .
  rm -r ${params.localdir}
else
  ls *inity.reads.fa | parallel -j ${task.cpus} ./trinity.sh {}
fi
"""
```
```
"""
# locate the Trinity installation directory, to access its utility scripts
my_trinity=\$(which Trinity)
my_trinity=\$(dirname \$my_trinity)

# gather the Butterfly outputs, unpacking chunk tarballs on node-local disk if enabled
if [ "${params.localdisk}" == "true" ] ; then
  here=\$PWD
  rm -rf ${params.localdir}
  mkdir ${params.localdir}
  cd ${params.localdir}
  for f in ${reads_fasta} ; do
    cp \$( readlink \$here/\$f ) .
    tar xzf \${f}
  done
  find ${params.taskoutdir}/read_partitions -name "*inity.fasta" >input_list
else
  ls *inity.fasta >input_list
fi

# aggregate the mini-assemblies into the final Trinity.fasta
# and generate the gene-to-transcript map
cat input_list | \${my_trinity}/util/support_scripts/partitioned_trinity_aggregator.pl \
  --token_prefix TRINITY_DN --output_prefix Trinity.tmp
mv Trinity.tmp.fasta Trinity.fasta

\${my_trinity}/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta > Trinity.fasta.gene_trans_map

# copy results back from the node-local directory, if used
if [ "${params.localdisk}" == "true" ] ; then
  cd \$here
  cp ${params.localdir}/Trinity.fasta* .
  rm -r ${params.localdir}
fi
"""
```
Support
- Future updates