Metagenome and metatranscriptome assembly in CWL
This repository contains two workflows for metagenome and metatranscriptome assembly of short-read data. metaSPAdes is the default assembler for paired-end data, and MEGAHIT for single-end data and co-assemblies. MEGAHIT can be set as the default assembler in the yaml file if preferred. Steps include:
- QC: removal of short reads, low-quality regions, and adapters, plus host decontamination
- Assembly: with metaSPAdes or MEGAHIT
- Post-assembly: host and PhiX decontamination, contig length filtering (500 bp), stats generation
Databases
You will need to pre-download fasta files for host decontamination and generate the following databases accordingly:
- bwa index
- blast index
Specify the locations in the yaml file when running the pipeline.
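As a sketch of the database preparation, the standard index builders for the two tools the workflows use are `bwa-mem2 index` and BLAST+'s `makeblastdb`. The file names below are examples only; substitute your real host genome fasta. The indexing commands are skipped when the tools are not on PATH:

```shell
set -eu
HOST_FASTA=host.fa

# Toy FASTA so this sketch is self-contained; use your real host genome instead.
printf '>chr1\nACGTACGTACGT\n' > "$HOST_FASTA"

# bwa index (produces .amb .ann .pac .0123 .bwt.2bit.64 alongside the fasta)
if command -v bwa-mem2 >/dev/null 2>&1; then
  bwa-mem2 index "$HOST_FASTA"
fi

# blast index (nucleotide database for blastn's -db flag)
if command -v makeblastdb >/dev/null 2>&1; then
  makeblastdb -in "$HOST_FASTA" -dbtype nucl -out host_blast_db
fi
```

The resulting index locations are what you then point the yaml file at.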
Main pipeline executables
- src/workflows/metagenome_pipeline.cwl
- src/workflows/metatranscriptome_pipeline.cwl
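The workflows are run with a CWL runner and a yaml job file. A hypothetical job-file sketch follows; the input ids below are illustrative only, not the actual ids defined in the workflow CWL, so check the `inputs` section of each workflow before use:

```yaml
# job.yml — illustrative only; the real input ids live in the workflow's inputs section
forward_reads:
  class: File
  path: sample_R1.fastq.gz
reverse_reads:
  class: File
  path: sample_R2.fastq.gz
assembler: megahit        # e.g. override the metaSPAdes default
host_ref:
  class: File
  path: databases/host.fa # bwa-mem2 index files alongside
```

With cwltool, a run would then look like `cwltool src/workflows/metagenome_pipeline.cwl job.yml`.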
Code Snippets
```yaml
baseCommand: ["blastn"]

arguments:
  - prefix: -task
    position: 1
    valueFrom: 'megablast'
  - prefix: -word_size
    position: 2
    valueFrom: '28'
  - prefix: -best_hit_overhang
    position: 3
    valueFrom: '0.1'
  - prefix: -best_hit_score_edge
    position: 4
    valueFrom: '0.1'
  - prefix: -dust
    position: 5
    valueFrom: 'yes'
  - prefix: -evalue
    position: 6
    valueFrom: '0.0001'
  - prefix: -min_raw_gapped_score
    position: 7
    valueFrom: '100'
  - prefix: -penalty
    position: 8
    valueFrom: '-5'
  - prefix: -perc_identity
    position: 9
    valueFrom: '80.0'
  - prefix: -soft_masking
    position: 10
    valueFrom: 'true'
  - prefix: -window_size
    position: 11
    valueFrom: '100'
  - prefix: -outfmt
    position: 12
    valueFrom: '6 qseqid ppos'

inputs:
  query_seq:
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      prefix: "-query"
  blastdb_dir:
    type: Directory
  database_flag:
    type: string
    inputBinding:
      prefix: "-db"
      valueFrom: $(inputs.blastdb_dir.path)/$(inputs.database_flag)
```
```yaml
baseCommand: ['/opt/miniconda/bin/python', '/data/trim_fasta.py']

inputs:
  name:
    type: string
    label: prefix for fasta file
    inputBinding:
      position: 1
      prefix: --run_id
  contigs:
    type: File
    format: edam:format_1929  # FASTA
    label: assembly contig file
    inputBinding:
      position: 2
      prefix: --contig_file
  min_length:
    type: int?
    default: 500
    label: contig length threshold
    inputBinding:
      position: 3
      prefix: --threshold
  assembler:
    type: string
    label: assembler used
    inputBinding:
      position: 4
      prefix: --assembler
  blastn:
    type: File
    label: concatenated blastn output against contaminant dbs
    inputBinding:
      position: 5
      prefix: '--blast'
```
```yaml
baseCommand: [ 'map_host.sh' ]

arguments:
  - -t
  - $(runtime.cores)
  - -o
  - $(runtime.outdir)

inputs:
  name:
    type: string
    label: prefix for fastq files
  ref:
    type: File?
    secondaryFiles:
      - '.amb'
      - '.ann'
      - '.pac'
      - '.0123'
      - '.bwt.2bit.64'
    label: host genome fasta file
    inputBinding:
      prefix: -c
      position: 1
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -f
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: fastp trimmed reverse file
    inputBinding:
      position: 3
      prefix: -r
  coassembly:
    type: string
```
```yaml
baseCommand: [ fastp ]

arguments:
  - valueFrom: $(runtime.cores)
    prefix: -w
  - valueFrom: $(inputs.name)_fastp.qc.json
    prefix: --json
  - valueFrom: $(inputs.name)_fastp.qc.html
    prefix: --html
  - valueFrom: |
      ${
        var ext = "";
        if (inputs.reads2) {
          ext = inputs.name + "_fastp_1.fastq.gz";
        } else {
          ext = inputs.name + "_fastp.fastq.gz";
        }
        return ext;
      }
    prefix: --out1
  - valueFrom: |
      ${
        var ext = "";
        if (inputs.reads2) {
          ext = inputs.name + "_fastp_2.fastq.gz";
        } else {
          ext = null;
        }
        return ext;
      }
    prefix: --out2

inputs:
  name:
    type: string
    label: prefix for fastq files
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward fastq file
    inputBinding:
      position: 1
      prefix: --in1
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: reverse fastq file
    inputBinding:
      position: 2
      prefix: --in2
  minLength:
    type: int?
    default: 50
    label: filter reads shorter than this value
    inputBinding:
      position: 3
      prefix: -l
  polya_trim:
    type: int?
    label: additional polyA tail trimming for metatranscriptomes
    inputBinding:
      position: 4
      prefix: '--trim_poly_x --poly_x_min_len'
```
```yaml
baseCommand: [ 'megahit' ]

inputs:  # arrays allow for co-assembly
  memory:
    type: [ int?, string? ]
    label: >
      Memory to run assembly. When 0 < -m < 1, a fraction of all available
      memory of the machine is used; otherwise it specifies the memory in bytes.
    default: '5000000000'
    inputBinding:
      position: 4
      prefix: "--memory"
  reads:
    type:
      - File[]
      - type: array
        items: File
    inputBinding:
      prefix: "-r"
      itemSeparator: ","
      position: 4
  reads2:
    type: File[]?
    label: >
      reads in place for assembly.cwl conditional to check reverse reads
      don't exist. Should always be null
```
```yaml
baseCommand: [ metaspades.py ]

arguments:
  - valueFrom: $(runtime.outdir)
    prefix: -o
  - valueFrom: '8'
    prefix: -t
  - --only-assembler

inputs:
  memory:
    type: int
    default: 150
    label: memory in GB
    inputBinding:
      prefix: -m
      position: 4
  forward_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward file after qc
    inputBinding:
      prefix: "-1"
  reverse_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: reverse file after qc
    inputBinding:
      prefix: "-2"
```
```yaml
baseCommand: [ 'pigz' ]

arguments:
  - valueFrom: $(inputs.raw_reads)
    position: 2
    prefix: '-dc'
    shellQuote: false
  - valueFrom: '|'
    shellQuote: false
    position: 3
  - valueFrom: 'awk'
    shellQuote: false
    position: 4
  - valueFrom: 'NR%4==2{c++; l+=length($0)} END { print c; print l }'
    shellQuote: true
    position: 5
```
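The arguments above assemble the shell pipeline `pigz -dc <reads> | awk '…'`: every fourth line of a FASTQ record (starting at line 2) is a sequence, so the awk program counts reads and sums their bases. A minimal sketch of the same awk program on a toy uncompressed FASTQ:

```shell
# Two records: one 4 bp read and one 8 bp read.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > toy.fastq

# Same program as in the CWL tool, minus the pigz decompression step.
awk 'NR%4==2{c++; l+=length($0)} END { print c; print l }' toy.fastq
# prints 2 (reads), then 12 (total bases)
```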
```yaml
baseCommand: [ 'bwa-mem2', 'mem' ]

inputs:
  min_std_max_min:
    type: 'int[]?'
    inputBinding:
      position: 1
      prefix: '-I'
      itemSeparator: ','
  minimum_seed_length:
    type: int?
    inputBinding:
      position: 1
      prefix: '-k'
    doc: '-k INT minimum seed length [19]'
  output_filename:
    type: string?
    default: 'aln-se.sam'
  reads:
    type: File[]
    inputBinding:
      position: 3
  reference:
    type: File
    inputBinding:
      position: 2
    secondaryFiles:
      - '.amb'
      - '.ann'
      - '.pac'
      - '.0123'
      - '.bwt.2bit.64'
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-t'
    doc: '-t INT number of threads [1]'
```
```yaml
baseCommand: [ jgi_summarize_bam_contig_depths ]

inputs:
  input:
    type: File
    inputBinding:
      position: 1
    doc: |
      One or more bam files
  outputDepth:
    type: string
    inputBinding:
      prefix: --outputDepth
    doc: |
      The file to put the contig by bam depth matrix (default: STDOUT)
  percentIdentity:
    type: int?
    inputBinding:
      prefix: --percentIdentity
    doc: |
      The minimum end-to-end % identity of qualifying reads (default: 97)
  pairedContigs:
    type: File?
    inputBinding:
      prefix: --pairedContigs
    doc: |
      The file to output the sparse matrix of contigs which paired reads span (default: none)
  unmappedFastq:
    type: string?
    inputBinding:
      prefix: --unmappedFastq
    doc: |
      The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
  noIntraDepthVariance:
    type: boolean?
    inputBinding:
      prefix: --noIntraDepthVariance
    doc: |
      Do not include variance from mean depth along the contig
  showDepth:
    type: boolean?
    inputBinding:
      prefix: --showDepth
    doc: |
      Output a .depth file per bam for each contig base
  minMapQual:
    type: int?
    inputBinding:
      prefix: --minMapQual
    doc: |
      The minimum mapping quality necessary to count the read as mapped (default: 0)
  weightMapQual:
    type: float?
    inputBinding:
      prefix: --weightMapQual
    doc: |
      Weight per-base depth based on the MQ of the read (i.e. uniqueness) (default: 0.0 (disabled))
  includeEdgeBases:
    type: boolean?
    inputBinding:
      prefix: --includeEdgeBases
    doc: |
      When calculating depth & variance, include the 1-readlength edges (off by default)
  maxEdgeBases:
    type: int?
    inputBinding:
      prefix: --maxEdgeBases
    doc: |
      When calculating depth & variance, and not --includeEdgeBases, the maximum length (default: 75)

  # Following options require --referenceFasta
  outputGC:
    type: File?
    inputBinding:
      prefix: --outputGC
    doc: |
      The file to print the gc coverage histogram
  gcWindow:
    type: int?
    inputBinding:
      prefix: --gcWindow
    doc: |
      The sliding window size for GC calculations
  outputReadStats:
    type: File?
    inputBinding:
      prefix: --outputReadStats
    doc: |
      The file to print the per read statistics
  outputKmers:
    type: int?
    inputBinding:
      prefix: --outputKmers
    doc: |
      The file to print the perfect kmer counts

  # Options to control shredding contigs that are under-represented by the reads
  shredLength:
    type: int?
    inputBinding:
      prefix: --shredLength
    doc: |
      The maximum length of the shreds
  shredDepth:
    type: int?
    inputBinding:
      prefix: --shredDepth
    doc: |
      The depth to generate overlapping shreds
  minContigLength:
    type: int?
    inputBinding:
      prefix: --minContigLength
    doc: |
      The minimum length of contig to include for mapping and shredding
  minContigDepth:
    type: int?
    inputBinding:
      prefix: --minContigDepth
    doc: |
      The minimum depth along contig at which to break the contig
```
```yaml
baseCommand: [ 'samtools', 'view', '-uS' ]

inputs:
  bedoverlap:
    type: File?
    inputBinding:
      position: 1
      prefix: '-L'
    doc: |
      only include reads overlapping this BED FILE [null]
  cigar:
    type: int?
    inputBinding:
      position: 1
      prefix: '-m'
    doc: |
      only include reads with number of CIGAR operations consuming query sequence >= INT [0]
  collapsecigar:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-B'
    doc: |
      collapse the backward CIGAR operation
    default: false
  count:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-c'
    doc: |
      print only the count of matching records
    default: false
  fastcompression:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-1'
    doc: |
      use fast BAM compression (implies -b)
    default: false
  input:
    type: File
    inputBinding:
      position: 4
    doc: |
      Input bam file.
  isbam:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-b'
    doc: |
      output in BAM format
    default: false
  iscram:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-C'
    doc: |
      output in CRAM format
    default: false
  output_name:
    type: string
    inputBinding:
      position: 2
      prefix: '-o'
  randomseed:
    type: float?
    inputBinding:
      position: 1
      prefix: '-s'
    doc: |
      integer part sets seed of random number generator [0]; rest sets fraction of templates to subsample [no subsampling]
  readsingroup:
    type: string?
    inputBinding:
      position: 1
      prefix: '-r'
    doc: |
      only include reads in read group STR [null]
  readsingroupfile:
    type: File?
    inputBinding:
      position: 1
      prefix: '-R'
    doc: |
      only include reads with read group listed in FILE [null]
  readsinlibrary:
    type: string?
    inputBinding:
      position: 1
      prefix: '-l'
    doc: |
      only include reads in library STR [null]
  readsquality:
    type: int?
    inputBinding:
      position: 1
      prefix: '-q'
    doc: |
      only include reads with mapping quality >= INT [0]
  readswithbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-f'
    doc: |
      only include reads with all bits set in INT set in FLAG [0]
  readswithoutbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-F'
    doc: |
      only include reads with none of the bits set in INT set in FLAG [0]
  readtagtostrip:
    type: 'string[]?'
    inputBinding:
      position: 1
    doc: |
      read tag to strip (repeatable) [null]
  referencefasta:
    type: File?
    inputBinding:
      position: 1
      prefix: '-T'
    doc: |
      reference sequence FASTA FILE [null]
  region:
    type: string?
    inputBinding:
      position: 5
    doc: |
      [region ...]
  samheader:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-h'
    doc: |
      include header in SAM output
    default: false
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-@'
    doc: |
      number of BAM compression threads [0]
  uncompressed:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-u'
    doc: |
      uncompressed BAM output (implies -b)
```
```yaml
baseCommand: ['/opt/miniconda/bin/python', '/data/gen_stats_report.py']

inputs:
  sequences:
    type: File
    label: cleaned contig file
    inputBinding:
      position: 2
      prefix: --sequences
  coverage_file:
    type: File?
    label: coverage depth file
    inputBinding:
      position: 3
      prefix: --coverage_file
  assembler:
    type: string
    label: assembler used (metaspades, spades or megahit)
    inputBinding:
      position: 4
      prefix: --assembler
  assembly_log:
    type: File
    label: logfile from assembly
    inputBinding:
      position: 5
      prefix: --logfile
  base_count:
    type: File[]
    label: raw reads base count output of readfq
    inputBinding:
      position: 6
      prefix: --base_count
```
```yaml
baseCommand: [ 'count_fastq.sh' ]

inputs:
  rawreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: raw forward file
    inputBinding:
      position: 1
      prefix: -f
  trimmedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -g
  cleanedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: host removed forward file
    inputBinding:
      position: 3
      prefix: -h
```
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/EBI-Metagenomics/CWL-assembly.git
Name: metagenome-and-metatranscriptome-assembly-in-cwl
Version: master @ 39efebc
Downloaded: 0
Copyright: Public Domain
License: None
Keywords:
- Future updates