Metagenome and metatranscriptome assembly in CWL

Version: master @ 39efebc

This repository contains two workflows, one for metagenome and one for metatranscriptome assembly of short-read data. metaSPAdes is used by default for paired-end data, and MEGAHIT for single-end data and co-assemblies. MEGAHIT can be set as the default assembler in the yaml file if preferred. Steps include:

  • QC: removal of short reads, low-quality regions and adapters, followed by host decontamination
  • Assembly: with metaSPAdes or MEGAHIT
  • Post-assembly: host and PhiX decontamination, contig length filter (500 bp), stats generation

Databases

You will need to pre-download fasta files for host decontamination and generate the following databases accordingly:

  • bwa index
  • blast index

Specify the locations in the yaml file when running the pipeline.
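
As a rough sketch of the database preparation, the helper below only assembles the two indexing command lines for a host FASTA file; it does not run them. It assumes the bwa index is built with bwa-mem2 (the tool wrappers in this repository call `bwa-mem2 mem` and expect `.0123` and `.bwt.2bit.64` index files) and that the BLAST database is a nucleotide database built with `makeblastdb`. The function name, file name and database name are hypothetical.

```python
from pathlib import Path


def index_commands(host_fasta, blast_db_name):
    """Return the argv lists for indexing a host FASTA (nothing is executed)."""
    fasta = Path(host_fasta)
    # bwa-mem2 writes its index files next to the FASTA (e.g. host.fasta.0123).
    bwa_cmd = ["bwa-mem2", "index", str(fasta)]
    # Nucleotide BLAST database, written next to the FASTA under blast_db_name.
    blast_cmd = [
        "makeblastdb",
        "-in", str(fasta),
        "-dbtype", "nucl",
        "-out", str(fasta.with_name(blast_db_name)),
    ]
    return [bwa_cmd, blast_cmd]


if __name__ == "__main__":
    for cmd in index_commands("host.fasta", "host_db"):
        print(" ".join(cmd))
```

The resulting paths are what would then be entered in the yaml file; the exact yaml keys depend on the workflow inputs and are not shown here.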

Main pipeline executables

  • src/workflows/metagenome_pipeline.cwl
  • src/workflows/metatranscriptome_pipeline.cwl

Code Snippets

baseCommand: ["blastn"]

arguments:
  - prefix: -task
    position: 1
    valueFrom: 'megablast'
  - prefix: -word_size
    position: 2
    valueFrom: '28'
  - prefix: -best_hit_overhang
    position: 3
    valueFrom: '0.1'
  - prefix: -best_hit_score_edge
    position: 4
    valueFrom: '0.1'
  - prefix: -dust
    position: 5
    valueFrom: 'yes'
  - prefix: -evalue
    position: 6
    valueFrom: '0.0001'
  - prefix: -min_raw_gapped_score
    position: 7
    valueFrom: '100'
  - prefix: -penalty
    position: 7
    valueFrom: '-5'
  - prefix: -perc_identity
    position: 8
    valueFrom: '80.0'
  - prefix: -soft_masking
    position: 9
    valueFrom: 'true'
  - prefix: -window_size
    position: 10
    valueFrom: '100'
  - prefix: -outfmt
    position: 11
    valueFrom: '6 qseqid ppos'

inputs:

  query_seq:
    type: File
    format: edam:format_1929 # FASTA
    inputBinding:
      prefix: "-query"

  blastdb_dir:
    type: Directory

  database_flag:
    type: string
    inputBinding:
      prefix: "-db"
      valueFrom: $(inputs.blastdb_dir.path)/$(inputs.database_flag)
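
The `-outfmt '6 qseqid ppos'` setting above produces tab-separated lines with two columns per hit: the query (contig) ID and the percentage of positive-scoring matches. A minimal sketch of reading such output follows. The function name and sample data are hypothetical, and this is not the logic of the repository's `trim_fasta.py`, which is not shown on this page.

```python
def contigs_with_hits(blast_lines):
    """Collect query IDs from 'qseqid<TAB>ppos' lines, keeping the best ppos seen."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qseqid, ppos = line.split("\t")[:2]
        best[qseqid] = max(best.get(qseqid, 0.0), float(ppos))
    return best


if __name__ == "__main__":
    sample = ["contig_1\t98.50", "contig_1\t99.10", "contig_7\t87.00"]
    print(contigs_with_hits(sample))
```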
baseCommand: ['/opt/miniconda/bin/python', '/data/trim_fasta.py']

inputs:
  name:
    type: string
    label: prefix for fasta file
    inputBinding:
      position: 1
      prefix: --run_id
  contigs:
    type: File
    format: edam:format_1929  # FASTA
    label: assembly contig file
    inputBinding:
      position: 2
      prefix: --contig_file
  min_length:
    type: int?
    default: 500
    label: contig length threshold
    inputBinding:
      position: 3
      prefix: --threshold
  assembler:
    type: string
    label: assembler used
    inputBinding:
      position: 4
      prefix: --assembler
  blastn:
    type: File
    label: concatenated blastn output against contaminant dbs
    inputBinding:
      position: 5
      prefix: '--blast'
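
To illustrate the kind of filtering this tool wrapper configures (a length threshold defaulting to 500 plus removal of contigs flagged by blastn), here is a small standalone sketch. It is not the repository's `trim_fasta.py`; the function names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=500, contaminants=()):
    """Keep contigs at least min_length long whose ID is not in contaminants."""
    blocked = set(contaminants)
    kept = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        contig_id = fields[0] if fields else ""
        # Assumption: the threshold is inclusive.
        if len(seq) >= min_length and contig_id not in blocked:
            kept.append((header, seq))
    return kept
```

Here `contaminants` would be the set of contig IDs that appear in the concatenated blastn output.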
baseCommand: [ 'map_host.sh' ]

arguments:
- -t
- $(runtime.cores)
- -o
- $(runtime.outdir)

inputs:
  name:
    type: string
    label: prefix for fastq files
  ref:
    type: File?
    secondaryFiles:
        - '.amb'
        - '.ann'
        - '.pac'
        - '.0123'
        - '.bwt.2bit.64'
    label: host genome fasta file
    inputBinding:
        prefix: -c
        position: 1
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -f
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: fastp trimmed reverse file
    inputBinding:
      position: 3
      prefix: -r
  coassembly:
    type: string
CWL fastp From line 18 of bwa/bwa.cwl
baseCommand: [ fastp ]

arguments:
  - valueFrom: $(runtime.cores)
    prefix: -w
  - valueFrom: $(inputs.name)_fastp.qc.json
    prefix: --json
  - valueFrom: $(inputs.name)_fastp.qc.html
    prefix: --html
  - valueFrom: |
      ${ var ext = "";
      if (inputs.reads2) { ext = inputs.name + "_fastp_1.fastq.gz"; }
      else { ext = inputs.name + "_fastp.fastq.gz"; }
      return ext; }
    prefix: --out1
  - valueFrom: |
      ${ var ext = "";
      if (inputs.reads2) { ext = inputs.name + "_fastp_2.fastq.gz"; }
      else { ext = null ; }
      return ext; }
    prefix: --out2

inputs:
  name:
    type: string
    label: prefix for fastq files
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward fastq file
    inputBinding:
      position: 1
      prefix: --in1
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: reverse fastq file
    inputBinding:
      position: 2
      prefix: --in2
  minLength:
    type: int?
    default: 50
    label: filter reads shorter than this value
    inputBinding:
      position: 3
      prefix: -l
  polya_trim:
    type: int?
    label: additional polyA tail trimming to metatranscriptomes
    inputBinding:
      position: 4
      prefix: '--trim_poly_x --poly_x_min_len'
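
The two JavaScript expressions in the `arguments` section above choose the output file names depending on whether a reverse read file is supplied. The same rule, restated as a short Python sketch (the function name is hypothetical and is not part of the repository):

```python
def fastp_output_names(name, paired):
    """Mirror the --out1/--out2 naming used by the fastp tool wrapper."""
    if paired:
        return f"{name}_fastp_1.fastq.gz", f"{name}_fastp_2.fastq.gz"
    # Single-end: one output file, and no --out2 argument is emitted.
    return f"{name}_fastp.fastq.gz", None
```

In the CWL expression, a `null` result for `--out2` means the argument is left off the command line for single-end data.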
baseCommand: [ 'megahit' ]

inputs:
  #arrays allow for co-assembly
  memory:
    type: [ int?, string? ]
    label: Memory to run assembly. When 0 < -m < 1, fraction of all available memory of the machine is used, otherwise it specifies the memory in BYTE.
    default: '5000000000'
    inputBinding:
      position: 4
      prefix: "--memory"

  reads:
    type:
      - File[]
      - type: array
        items: File
    inputBinding:
      prefix: "-r"
      itemSeparator: ","
      position: 4

  reads2:
    type: File[]?
    label: reads in place for assembly.cwl conditional to check reverse reads don't exist. Should always be null
baseCommand: [ metaspades.py ]

arguments:
  - valueFrom: $(runtime.outdir)
    prefix: -o
  - valueFrom: '8'
    prefix: -t
  - --only-assembler

inputs:
  memory:
    type: int
    default: 150
    label: memory in gb
    inputBinding:
      prefix: -m
      position: 4
  forward_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward file after qc
    inputBinding:
      prefix: "-1"
  reverse_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: reverse file after qc
    inputBinding:
      prefix: "-2"
baseCommand: [ 'pigz' ]

arguments:
  - valueFrom: $(inputs.raw_reads)
    position: 2
    prefix: '-dc'
    shellQuote: false
  - valueFrom: '|'
    shellQuote: false
    position: 3
  - valueFrom: 'awk'
    shellQuote: false
    position: 4
  - valueFrom: 'NR%4==2{c++; l+=length($0)} END { print c; print l }'
    shellQuote: true
    position: 5
CWL From line 23 of stats/base_count.cwl
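
The awk program in this snippet looks at every second line of each four-line FASTQ record (the sequence line), counting reads and summing their lengths. A Python equivalent, given as an illustrative sketch with hypothetical function names:

```python
import gzip


def count_reads_and_bases(fastq_lines):
    """Count sequence lines (reads) and their total length (bases)."""
    reads = 0
    bases = 0
    for line_number, line in enumerate(fastq_lines, start=1):
        if line_number % 4 == 2:  # same test as NR%4==2 in awk
            reads += 1
            bases += len(line.rstrip("\n"))
    return reads, bases


def count_gzipped_fastq(path):
    """Apply the count to a gzip-compressed FASTQ file, like `pigz -dc | awk`."""
    with gzip.open(path, "rt") as handle:
        return count_reads_and_bases(handle)
```

Like the awk version, this assumes strictly four lines per record, so multi-line sequences would be miscounted.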
baseCommand: [ 'bwa-mem2', 'mem' ]

inputs:
  min_std_max_min:
    type: 'int[]?'
    inputBinding:
      position: 1
      prefix: '-I'
      itemSeparator: ','
  minimum_seed_length:
    type: int?
    inputBinding:
      position: 1
      prefix: '-k'
    doc: '-k INT        minimum seed length [19]'
  output_filename:
    type: string?
    default: 'aln-se.sam'
  reads:
    type: File[]
    inputBinding:
      position: 3
  reference:
    type: File
    inputBinding:
      position: 2
    secondaryFiles:
      - '.amb'
      - '.ann'
      - '.pac'
      - '.0123'
      - '.bwt.2bit.64'
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-t'
    doc: '-t INT        number of threads [1]'
baseCommand: [ jgi_summarize_bam_contig_depths ]

inputs:
  input:
    type: File
    inputBinding:
      position: 1
    doc: |
      One or more bam files
  outputDepth:
    type: string
    inputBinding:
      prefix: --outputDepth
    doc: |
      The file to put the contig by bam depth matrix (default: STDOUT)
  percentIdentity:
    type: int?
    inputBinding:
      prefix: --percentIdentity
    doc: |
      The minimum end-to-end % identity of qualifying reads (default: 97)
  pairedContigs:
    type: File?
    inputBinding:
      prefix: --pairedContigs
    doc: |
      The file to output the sparse matrix of contigs which paired reads span (default: none)
  unmappedFastq:
    type: string?
    inputBinding:
      prefix: --unmappedFastq
    doc: |
      The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
  noIntraDepthVariance:
    type: boolean?
    inputBinding:
      prefix: --noIntraDepthVariance
    doc: |
      Do not include variance from mean depth along the contig
  showDepth:
    type: boolean?
    inputBinding:
      prefix: --showDepth
    doc: |
      Output a .depth file per bam for each contig base
  minMapQual:
    type: int?
    inputBinding:
      prefix: --minMapQual
    doc: |
      The minimum mapping quality necessary to count the read as mapped (default: 0)
  weightMapQual:
    type: float?
    inputBinding:
      prefix: --weightMapQual
    doc: |
      Weight per-base depth based on the MQ of the read (i.e uniqueness) (default: 0.0 (disabled))
  includeEdgeBases:
    type: boolean?
    inputBinding:
      prefix: --includeEdgeBases
    doc: |
      When calculating depth & variance, include the 1-readlength edges (off by default)
  maxEdgeBases:
    type: int?
    inputBinding:
      prefix: --maxEdgeBases
    doc: |
      When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)

# Following options require --referenceFasta
  outputGC:
    type: File?
    inputBinding:
      prefix: --outputGC
    doc: |
      The file to print the gc coverage histogram
  gcWindow:
    type: int?
    inputBinding:
      prefix: --gcWindow
    doc: |
      The sliding window size for GC calculations
  outputReadStats:
    type: File?
    inputBinding:
      prefix: --outputReadStats
    doc: |
      The file to print the per read statistics
  outputKmers:
    type: File?
    inputBinding:
      prefix: --outputKmers
    doc: |
      The file to print the perfect kmer counts
# Options to control shredding contigs that are under represented by the reads
  shredLength:
    type: int?
    inputBinding:
      prefix: --shredLength
    doc: |
      The maximum length of the shreds
  shredDepth:
    type: int?
    inputBinding:
      prefix: --shredDepth
    doc: |
      The depth to generate overlapping shreds
  minContigLength:
    type: int?
    inputBinding:
      prefix: --minContigLength
    doc: |
      The minimum length of contig to include for mapping and shredding
  minContigDepth:
    type: int?
    inputBinding:
      prefix: --minContigDepth
    doc: |
      The minimum depth along contig at which to break the contig
baseCommand: [ 'samtools', 'view', '-uS' ]

inputs:
  bedoverlap:
    type: File?
    inputBinding:
      position: 1
      prefix: '-L'
    doc: |
      only include reads overlapping this BED FILE [null]
  cigar:
    type: int?
    inputBinding:
      position: 1
      prefix: '-m'
    doc: |
      only include reads with number of CIGAR operations
      consuming query sequence >= INT [0]
  collapsecigar:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-B'
    doc: |
      collapse the backward CIGAR operation
    default: false
  count:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-c'
    doc: |
      print only the count of matching records
    default: false
  fastcompression:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-1'
    doc: |
      use fast BAM compression (implies -b)
    default: false
  input:
    type: File
    inputBinding:
      position: 4
    doc: |
      Input bam file.
  isbam:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-b'
    doc: |
      output in BAM format
    default: false
  iscram:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-C'
    doc: |
      output in CRAM format
    default: false
  output_name:
    type: string
    inputBinding:
      position: 2
      prefix: '-o'
  randomseed:
    type: float?
    inputBinding:
      position: 1
      prefix: '-s'
    doc: |
      integer part sets seed of random number generator [0];
      rest sets fraction of templates to subsample [no subsampling]
  readsingroup:
    type: string?
    inputBinding:
      position: 1
      prefix: '-r'
    doc: |
      only include reads in read group STR [null]
  readsingroupfile:
    type: File?
    inputBinding:
      position: 1
      prefix: '-R'
    doc: |
      only include reads with read group listed in FILE [null]
  readsinlibrary:
    type: string?
    inputBinding:
      position: 1
      prefix: '-l'
    doc: |
      only include reads in library STR [null]
  readsquality:
    type: int?
    inputBinding:
      position: 1
      prefix: '-q'
    doc: |
      only include reads with mapping quality >= INT [0]
  readswithbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-f'
    doc: |
      only include reads with all bits set in INT set in FLAG [0]
  readswithoutbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-F'
    doc: |
      only include reads with none of the bits set in INT set in FLAG [0]
  readtagtostrip:
    type: 'string[]?'
    inputBinding:
      position: 1
    doc: |
      read tag to strip (repeatable) [null]
  referencefasta:
    type: File?
    inputBinding:
      position: 1
      prefix: '-T'
    doc: |
      reference sequence FASTA FILE [null]
  region:
    type: string?
    inputBinding:
      position: 5
    doc: |
      [region ...]
  samheader:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-h'
    doc: |
      include header in SAM output
    default: false
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-@'
    doc: |
      number of BAM compression threads [0]
  uncompressed:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-u'
    doc: |
      uncompressed BAM output (implies -b)
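
The `randomseed` input above packs two settings into one number: the integer part is the random seed and the fractional part is the fraction of templates to keep. A small sketch of how such a value decomposes (the function name is hypothetical; `Decimal` is used only to avoid floating-point noise when splitting):

```python
from decimal import Decimal


def split_subsample_value(value):
    """Split a samtools -s style value into (seed, fraction)."""
    number = Decimal(str(value))
    seed = int(number)          # integer part: random seed
    fraction = number - seed    # remainder: fraction of templates kept
    return seed, float(fraction)
```

For example, a value of `42.1` corresponds to seed 42 with 10% of templates retained.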
baseCommand: ['/opt/miniconda/bin/python', '/data/gen_stats_report.py']

inputs:
  sequences:
    type: File
    label: cleaned contig file
    inputBinding:
      position: 2
      prefix: --sequences
  coverage_file:
    type: File?
    label: coverage depth file
    inputBinding:
      position: 3
      prefix: --coverage_file
  assembler:
    type: string
    label: assembler used (metaspades, spades or megahit)
    inputBinding:
      position: 4
      prefix: --assembler
  assembly_log:
    type: File
    label: logfile from assembly
    inputBinding:
      position: 5
      prefix: --logfile
  base_count:
    type: File[]
    label: raw reads base count output of readfq
    inputBinding:
      position: 6
      prefix: --base_count
baseCommand: [ 'count_fastq.sh' ]

inputs:
  rawreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: raw forward file
    inputBinding:
      position: 1
      prefix: -f
  trimmedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -g
  cleanedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: host removed forward file
    inputBinding:
      position: 3
      prefix: -h

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/EBI-Metagenomics/CWL-assembly.git
Copyright: Public Domain
License: None