Metagenome and metatranscriptome assembly in CWL

Version: master @ 39efebc

This repository contains two workflows for metagenome and metatranscriptome assembly of short-read data. metaSPAdes is used by default for paired-end data, and MEGAHIT for single-end data and co-assemblies. MEGAHIT can be set as the default assembler in the YAML file if preferred. Steps include:

QC : removal of short reads, low-quality regions and adapters; host decontamination
Assembly : with metaSPAdes or MEGAHIT
Post-assembly : host and PhiX decontamination, contig length filter (500 bp), stats generation

Databases

Download the FASTA files for host decontamination in advance and build the following indexes from them:

bwa index
blast index

Specify their locations in the YAML file when running the pipeline.
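
The two index-building steps could be scripted roughly as follows. This is only a sketch: it assumes `bwa-mem2` and the BLAST+ `makeblastdb` tool are on the PATH (bwa-mem2 is assumed because the snippets further down list `.0123` and `.bwt.2bit.64` index files), and the function names, FASTA path and database name are hypothetical.

```python
import subprocess


def index_commands(fasta, blast_db_name):
    """Return the commands that index a host FASTA for bwa-mem2 and BLAST."""
    return [
        # bwa-mem2 writes its index files next to the FASTA by default.
        ["bwa-mem2", "index", fasta],
        # Nucleotide BLAST database for the blastn contamination search.
        ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", blast_db_name],
    ]


def build_indexes(fasta, blast_db_name):
    """Run each indexing command, stopping on the first failure."""
    for cmd in index_commands(fasta, blast_db_name):
        subprocess.run(cmd, check=True)
```

Calling `build_indexes("host.fasta", "host_db")` would run both tools in turn; `index_commands` can be used on its own to inspect the commands first.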

Main pipeline executables

src/workflows/metagenome_pipeline.cwl
src/workflows/metatranscriptome_pipeline.cwl

Code Snippets

baseCommand: ["blastn"]

arguments:
  - prefix: -task
    position: 1
    valueFrom: 'megablast'
  - prefix: -word_size
    position: 2
    valueFrom: '28'
  - prefix: -best_hit_overhang
    position: 3
    valueFrom: '0.1'
  - prefix: -best_hit_score_edge
    position: 4
    valueFrom: '0.1'
  - prefix: -dust
    position: 5
    valueFrom: 'yes'
  - prefix: -evalue
    position: 6
    valueFrom: '0.0001'
  - prefix: -min_raw_gapped_score
    position: 7
    valueFrom: '100'
  - prefix: -penalty
    position: 7
    valueFrom: '-5'
  - prefix: -perc_identity
    position: 8
    valueFrom: '80.0'
  - prefix: -soft_masking
    position: 9
    valueFrom: 'true'
  - prefix: -window_size
    position: 10
    valueFrom: '100'
  - prefix: -outfmt
    position: 11
    valueFrom: '6 qseqid ppos'

inputs:

  query_seq:
    type: File
    format: edam:format_1929 # FASTA
    inputBinding:
      prefix: "-query"

  blastdb_dir:
    type: Directory

  database_flag:
    type: string
    inputBinding:
      prefix: "-db"
      valueFrom: $(inputs.blastdb_dir.path)/$(inputs.database_flag)

CWL BLAST From line 15 of assembly-qc/blast.cwl
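
With `-outfmt '6 qseqid ppos'`, blastn writes one tab-separated line per hit, with the query ID in the first column. How the pipeline consumes this output is not shown here, so the helper below is a hypothetical sketch of one way to collect the IDs of contigs that hit a contaminant database.

```python
def contaminated_ids(blast_lines):
    """Collect the query IDs (first column) from tabular blastn output."""
    ids = set()
    for line in blast_lines:
        line = line.strip()
        # Skip blank lines and any comment lines.
        if not line or line.startswith("#"):
            continue
        qseqid = line.split("\t")[0]
        ids.add(qseqid)
    return ids
```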

baseCommand: ['/opt/miniconda/bin/python', '/data/trim_fasta.py']

inputs:
  name:
    type: string
    label: prefix for fasta file
    inputBinding:
      position: 1
      prefix: --run_id
  contigs:
    type: File
    format: edam:format_1929  # FASTA
    label: assembly contig file
    inputBinding:
      position: 2
      prefix: --contig_file
  min_length:
    type: int?
    default: 500
    label: contig length threshold
    inputBinding:
      position: 3
      prefix: --threshold
  assembler:
    type: string
    label: assembler used
    inputBinding:
       position: 4
       prefix: --assembler
  blastn:
    type: File
    label: concatenated blastn output against contaminant dbs
    inputBinding:
        position: 5
        prefix: '--blast'

CWL BLAST From line 16 of assembly-qc/fasta-trimming.cwl
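
The source of `trim_fasta.py` is not shown, so the following is not that script. It is a minimal sketch of the kind of filtering its inputs suggest: keep contigs that reach a length threshold and are not on a list of contaminant IDs. The function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=500, exclude=()):
    """Keep contigs at least min_length long whose ID is not excluded."""
    exclude = set(exclude)
    kept = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        # The contig ID is the header text up to the first whitespace.
        contig_id = fields[0] if fields else ""
        if len(seq) >= min_length and contig_id not in exclude:
            kept.append((header, seq))
    return kept
```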

baseCommand: [ 'map_host.sh' ]

arguments:
- -t
- $(runtime.cores)
- -o
- $(runtime.outdir)

inputs:
  name:
    type: string
    label: prefix for fastq files
  ref:
    type: File?
    secondaryFiles:
        - '.amb'
        - '.ann'
        - '.pac'
        - '.0123'
        - '.bwt.2bit.64'
    label: host genome fasta file
    inputBinding:
        prefix: -c
        position: 1
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -f
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: fastp trimmed reverse file
    inputBinding:
      position: 3
      prefix: -r
  coassembly:
    type: string

CWL fastp From line 18 of bwa/bwa.cwl

baseCommand: [ fastp ]

arguments:
  - valueFrom: $(runtime.cores)
    prefix: -w
  - valueFrom: $(inputs.name)_fastp.qc.json
    prefix: --json
  - valueFrom: $(inputs.name)_fastp.qc.html
    prefix: --html
  - valueFrom: |
      ${ var ext = "";
      if (inputs.reads2) { ext = inputs.name + "_fastp_1.fastq.gz"; }
      else { ext = inputs.name + "_fastp.fastq.gz"; }
      return ext; }
    prefix: --out1
  - valueFrom: |
      ${ var ext = "";
      if (inputs.reads2) { ext = inputs.name + "_fastp_2.fastq.gz"; }
      else { ext = null ; }
      return ext; }
    prefix: --out2

inputs:
  name:
    type: string
    label: prefix for fastq files
  reads1:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward fastq file
    inputBinding:
      position: 1
      prefix: --in1
  reads2:
    type: File?
    format: edam:format_1930  # FASTQ
    label: reverse fastq file
    inputBinding:
      position: 2
      prefix: --in2
  minLength:
    type: int?
    default: 50
    label: filter reads shorter than this value
    inputBinding:
      position: 3
      prefix: -l
  polya_trim:
    type: int?
    label: additional polyA tail trimming to metatranscriptomes
    inputBinding:
      position: 4
      prefix: '--trim_poly_x --poly_x_min_len'

CWL fastp From line 16 of fastp/fastp.cwl
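
The two JavaScript expressions in the snippet above choose the output names depending on whether `reads2` is supplied. The same rule, restated as a small Python sketch (the function name is hypothetical):

```python
def fastp_output_names(name, paired):
    """Return (out1, out2) file names following the rule in fastp.cwl."""
    if paired:
        return f"{name}_fastp_1.fastq.gz", f"{name}_fastp_2.fastq.gz"
    # Single-end runs get one output file and no second output.
    return f"{name}_fastp.fastq.gz", None
```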

baseCommand: [ 'megahit' ]

inputs:
  #arrays allow for co-assembly
  memory:
    type: [ int?, string? ]
    label: Memory for the assembly. When 0 < -m < 1, that fraction of the machine's total memory is used; otherwise the value is the memory in bytes.
    default: '5000000000'
    inputBinding:
      position: 4
      prefix: "--memory"

  reads:
    type:
      - File[]
      - type: array
        items: File
    inputBinding:
      prefix: "-r"
      itemSeparator: ","
      position: 4

  reads2:
    type: File[]?
    label: placeholder so the conditional in assembly.cwl can check that no reverse reads exist. Should always be null

CWL MEGAHIT From line 17 of megahit/megahit_single.cwl
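
The `memory` label above describes two interpretations of the value. The sketch below restates that rule in Python to make it concrete. It mirrors the label only, not MEGAHIT's own code, and the function name and the `total_bytes` parameter are hypothetical.

```python
def resolve_memory_bytes(memory, total_bytes):
    """Interpret a --memory value as described in the label above."""
    value = float(memory)
    if 0 < value < 1:
        # A fraction of the machine's total memory.
        return int(value * total_bytes)
    # Otherwise the value is already a byte count.
    return int(value)
```

With the default of `'5000000000'` the result is 5,000,000,000 bytes regardless of the machine's total memory.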

baseCommand: [ metaspades.py ]

arguments:
  - valueFrom: $(runtime.outdir)
    prefix: -o
  - valueFrom: '8'
    prefix: -t
  - --only-assembler

inputs:
  memory:
    type: int
    default: 150
    label: memory in GB
    inputBinding:
      prefix: -m
      position: 4
  forward_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: forward file after qc
    inputBinding:
      prefix: "-1"
  reverse_reads:
    type: File
    format: edam:format_1930  # FASTQ
    label: reverse file after qc
    inputBinding:
      prefix: "-2"

CWL metaspades From line 11 of metaspades/metaspades.cwl

baseCommand: [ 'pigz' ]

arguments:
  - valueFrom: $(inputs.raw_reads)
    position: 2
    prefix: '-dc'
    shellQuote: false
  - valueFrom: '|'
    shellQuote: false
    position: 3
  - valueFrom: 'awk'
    shellQuote: false
    position: 4
  - valueFrom: 'NR%4==2{c++; l+=length($0)} END { print c; print l }'
    shellQuote: true
    position: 5

CWL From line 23 of stats/base_count.cwl
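
The awk program above looks at the second line of every four-line FASTQ record (the sequence line), counting the records and summing the sequence lengths. An equivalent Python sketch, with hypothetical function names, is:

```python
import gzip


def count_reads_and_bases(lines):
    """Count reads and bases from an iterable of FASTQ lines."""
    reads = 0
    bases = 0
    for line_number, line in enumerate(lines, start=1):
        # Line 2 of each 4-line record holds the sequence.
        if line_number % 4 == 2:
            reads += 1
            bases += len(line.rstrip("\n"))
    return reads, bases


def count_fastq_gz(path):
    """Apply the same count to a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return count_reads_and_bases(handle)
```

Like the awk version, this assumes every record spans exactly four lines.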

baseCommand: [ 'bwa-mem2', 'mem' ]

inputs:
  min_std_max_min:
    type: 'int[]?'
    inputBinding:
      position: 1
      prefix: '-I'
      itemSeparator: ','
  minimum_seed_length:
    type: int?
    inputBinding:
      position: 1
      prefix: '-k'
    doc: '-k INT        minimum seed length [19]'
  output_filename:
    type: string?
    default: 'aln-se.sam'
  reads:
    type: File[]
    inputBinding:
      position: 3
  reference:
    type: File
    inputBinding:
      position: 2
    secondaryFiles:
      - '.amb'
      - '.ann'
      - '.pac'
      - '.0123'
      - '.bwt.2bit.64'
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-t'
    doc: '-t INT        number of threads [1]'

CWL Bwa-mem2 From line 16 of stats/bwa-mem.cwl

baseCommand: [ jgi_summarize_bam_contig_depths ]

inputs:
  input:
    type: File
    inputBinding:
      position: 1
    doc: |
      One or more bam files
  outputDepth:
    type: string
    inputBinding:
      prefix: --outputDepth
    doc: |
      The file to put the contig by bam depth matrix (default: STDOUT)
  percentIdentity:
    type: int?
    inputBinding:
      prefix: --percentIdentity
    doc: |
      The minimum end-to-end % identity of qualifying reads (default: 97)
  pairedContigs:
    type: File?
    inputBinding:
      prefix: --pairedContigs
    doc: |
      The file to output the sparse matrix of contigs which paired reads span (default: none)
  unmappedFastq:
    type: string?
    inputBinding:
      prefix: --unmappedFastq
    doc: |
      The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
  noIntraDepthVariance:
    type: boolean?
    inputBinding:
      prefix: --noIntraDepthVariance
    doc: |
      Do not include variance from mean depth along the contig
  showDepth:
    type: boolean?
    inputBinding:
      prefix: --showDepth
    doc: |
      Output a .depth file per bam for each contig base
  minMapQual:
    type: int?
    inputBinding:
      prefix: --minMapQual
    doc: |
      The minimum mapping quality necessary to count the read as mapped (default: 0)
  weightMapQual:
    type: float?
    inputBinding:
      prefix: --weightMapQual
    doc: |
      Weight per-base depth based on the MQ of the read (i.e uniqueness) (default: 0.0 (disabled))
  includeEdgeBases:
    type: boolean?
    inputBinding:
      prefix: --includeEdgeBases
    doc: |
      When calculating depth & variance, include the 1-readlength edges (off by default)
  maxEdgeBases:
    type: int?
    inputBinding:
      prefix: --maxEdgeBases
    doc: |
      When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)

# Following options require --referenceFasta
  outputGC:
    type: File?
    inputBinding:
      prefix: --outputGC
    doc: |
      The file to print the gc coverage histogram
  gcWindow:
    type: int?
    inputBinding:
      prefix: --gcWindow
    doc: |
      The sliding window size for GC calculations
  outputReadStats:
    type: File?
    inputBinding:
      prefix: --outputReadStats
    doc: |
      The file to print the per read statistics
  outputKmers:
    type: File?
    inputBinding:
      prefix: --outputKmers
    doc: |
      The file to print the perfect kmer counts
# Options to control shredding contigs that are under represented by the reads
  shredLength:
    type: int?
    inputBinding:
      prefix: --shredLength
    doc: |
      The maximum length of the shreds
  shredDepth:
    type: int?
    inputBinding:
      prefix: --shredDepth
    doc: |
      The depth to generate overlapping shreds
  minContigLength:
    type: int?
    inputBinding:
      prefix: --minContigLength
    doc: |
      The minimum length of contig to include for mapping and shredding
  minContigDepth:
    type: int?
    inputBinding:
      prefix: --minContigDepth
    doc: |
      The minimum depth along contig at which to break the contig

CWL From line 11 of stats/metabat-jgi-summarise.cwl

baseCommand: [ 'samtools', 'view', '-uS' ]

inputs:
  bedoverlap:
    type: File?
    inputBinding:
      position: 1
      prefix: '-L'
    doc: |
      only include reads overlapping this BED FILE [null]
  cigar:
    type: int?
    inputBinding:
      position: 1
      prefix: '-m'
    doc: |
      only include reads with number of CIGAR operations
      consuming query sequence >= INT [0]
  collapsecigar:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-B'
    doc: |
      collapse the backward CIGAR operation
    default: false
  count:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-c'
    doc: |
      print only the count of matching records
    default: false
  fastcompression:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-1'
    doc: |
      use fast BAM compression (implies -b)
    default: false
  input:
    type: File
    inputBinding:
      position: 4
    doc: |
      Input bam file.
  isbam:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-b'
    doc: |
      output in BAM format
    default: false
  iscram:
    type: boolean
    inputBinding:
      position: 2
      prefix: '-C'
    doc: |
      output in CRAM format
    default: false
  output_name:
    type: string
    inputBinding:
      position: 2
      prefix: '-o'
  randomseed:
    type: float?
    inputBinding:
      position: 1
      prefix: '-s'
    doc: |
      integer part sets seed of random number generator [0];
      rest sets fraction of templates to subsample [no subsampling]
  readsingroup:
    type: string?
    inputBinding:
      position: 1
      prefix: '-r'
    doc: |
      only include reads in read group STR [null]
  readsingroupfile:
    type: File?
    inputBinding:
      position: 1
      prefix: '-R'
    doc: |
      only include reads with read group listed in FILE [null]
  readsinlibrary:
    type: string?
    inputBinding:
      position: 1
      prefix: '-l'
    doc: |
      only include reads in library STR [null]
  readsquality:
    type: int?
    inputBinding:
      position: 1
      prefix: '-q'
    doc: |
      only include reads with mapping quality >= INT [0]
  readswithbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-f'
    doc: |
      only include reads with all bits set in INT set in FLAG [0]
  readswithoutbits:
    type: int?
    inputBinding:
      position: 1
      prefix: '-F'
    doc: |
      only include reads with none of the bits set in INT set in FLAG [0]
  readtagtostrip:
    type: 'string[]?'
    inputBinding:
      position: 1
      prefix: '-x'
    doc: |
      read tag to strip (repeatable) [null]
  referencefasta:
    type: File?
    inputBinding:
      position: 1
      prefix: '-T'
    doc: |
      reference sequence FASTA FILE [null]
  region:
    type: string?
    inputBinding:
      position: 5
    doc: |
      [region ...]
  samheader:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-h'
    doc: |
      include header in SAM output
    default: false
  threads:
    type: int?
    inputBinding:
      position: 1
      prefix: '-@'
    doc: |
      number of BAM compression threads [0]
  uncompressed:
    type: boolean
    inputBinding:
      position: 1
      prefix: '-u'
    doc: |
      uncompressed BAM output (implies -b)
    default: false

CWL SAMtools From line 14 of stats/samtools-view.cwl

baseCommand: ['/opt/miniconda/bin/python', '/data/gen_stats_report.py']

inputs:
  sequences:
    type: File
    label: cleaned contig file
    inputBinding:
      position: 2
      prefix: --sequences
  coverage_file:
    type: File?
    label: coverage depth file
    inputBinding:
      position: 3
      prefix: --coverage_file
  assembler:
    type: string
    label: assembler used (metaspades, spades or megahit)
    inputBinding:
      position: 4
      prefix: --assembler
  assembly_log:
    type: File
    label: logfile from assembly
    inputBinding:
       position: 5
       prefix: --logfile
  base_count:
    type: File[]
    label: raw reads base count output of readfq
    inputBinding:
      position: 6
      prefix: --base_count

CWL MEGAHIT From line 14 of stats/stats-report.cwl

baseCommand: [ 'count_fastq.sh' ]

inputs:
  rawreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: raw forward file
    inputBinding:
      position: 1
      prefix: -f
  trimmedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: fastp trimmed forward file
    inputBinding:
      position: 2
      prefix: -g
  cleanedreads:
    type: File
    format: edam:format_1930  # FASTQ
    label: host removed forward file
    inputBinding:
      position: 3
      prefix: -h