MGnify (http://www.ebi.ac.uk/metagenomics) provides a free-to-use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in particular environments. Over the past two years, MGnify (formerly EBI Metagenomics) has more than doubled the number of publicly available analysed datasets held within the resource. Recently, an updated approach to data analysis has been unveiled (version 5.0), replacing the previous single pipeline with multiple analysis pipelines that are tailored according to the input data, and that are formally described using the Common Workflow Language, enabling greater provenance, reusability and reproducibility. MGnify's new analysis pipelines offer additional approaches for taxonomic assertions based on ribosomal internal transcribed spacer regions (ITS1/2) and expanded protein functional annotations. Biochemical pathway and system predictions have also been added for assembled contigs. MGnify's growing focus on the assembly of metagenomic data has also seen the number of datasets it has assembled and analysed increase six-fold. The non-redundant protein database constructed from the proteins encoded by these assemblies now exceeds 1 billion sequences. Meanwhile, a newly developed contig viewer provides fine-grained visualisation of the assembled contigs and their enriched annotations.
Documentation: https://docs.mgnify.org/en/latest/analysis.html#raw-reads-analysis-pipeline
Code Snippets
baseCommand: [ run_antismash_short.sh ]

baseCommand: [ change_antismash_output.py ]

baseCommand: [ change_geneclusters_ctg.py ]
baseCommand: [antismash_to_gff.py]
inputs:
  antismash_geneclus:
    type: File
    inputBinding:
      prefix: -g
  antismash_embl:
    type: File
    inputBinding:
      prefix: -e
  output_name:
    type: string
    inputBinding:
      prefix: -o
baseCommand: [reformat_antismash.py]
inputs:
  glossary:
    type: string
    inputBinding:
      position: 1
      prefix: -g
  geneclusters:
    type: File
    inputBinding:
      position: 2
      prefix: -a
baseCommand: [ antismash_rename_contigs.py ]

baseCommand: [move_antismash_summary.py]
baseCommand:
  - diamond
  - blastp
inputs:
  - id: blockSize
    type: float?
    inputBinding:
      position: 0
      prefix: '--block-size'
    label: sequence block size in billions of letters (default=2.0)
  - id: databaseFile
    type: string
    inputBinding:
      position: 0
      prefix: '--db'
    label: DIAMOND database input file
    doc: Path to the DIAMOND database file.
  - id: outputFormat
    type: string?  # Diamond-output_formats.yaml#output_formats?
    inputBinding:
      position: 0
      prefix: '--outfmt'
    label: Format of the output file
    doc: |-
      0 = BLAST pairwise
      5 = BLAST XML
      6 = BLAST tabular
      100 = DIAMOND alignment archive (DAA)
      101 = SAM
      Value 6 may be followed by a space-separated list of these keywords
  - id: queryGeneticCode
    type: int?
    inputBinding:
      position: 0
      prefix: '--min-orf'
    label: Genetic code used for the translation of the query sequences
    doc: >
      Ignore translated sequences that do not contain an open reading frame
      of at least this length. By default this feature is disabled for
      sequences of length below 30, set to 20 for sequences of length below
      100, and set to 40 otherwise. Setting this option to 1 will disable
      this feature.
  - id: queryInputFile
    format: edam:format_1929
    type: File
    inputBinding:
      position: 0
      prefix: '--query'
    label: Query input file in FASTA
    doc: >
      Path to the query input file in FASTA or FASTQ format (may be gzip
      compressed). If this parameter is omitted, the input will be read from
      stdin
  - id: strand
    type: string?  # Diamond-strand_values.yaml#strand?
    inputBinding:
      position: -3
      prefix: '--strand'
    label: Set strand of query to align for translated searches
    doc: >-
      Set strand of query to align for translated searches. By default both
      strands are searched. Valid values are {both, plus, minus}
  - id: taxonList
    type: 'int[]?'
    inputBinding:
      position: 0
      prefix: '--taxonlist'
    label: Protein accession to taxon identifier NCBI mapping file
    doc: >
      Comma-separated list of NCBI taxonomic IDs to filter the database by.
      Any taxonomic rank can be used, and only reference sequences matching
      one of the specified taxon ids will be searched against. Using this
      option requires setting the --taxonmap and --taxonnodes parameters for
      makedb.
  - id: threads
    type: int?
    inputBinding:
      position: 0
      prefix: '--threads'
    label: Number of CPU threads
    doc: >
      Number of CPU threads. By default, the program will auto-detect and
      use all available virtual cores on the machine.
  - id: maxTargetSeqs
    type: int?
    inputBinding:
      position: 0
      prefix: '--max-target-seqs'
    label: Max number of target sequences per query
    doc: >
      The maximum number of target sequences per query to report alignments
      for (default=25). Setting this to 0 will report all alignments that
      were found.
  - id: top
    type: int?
    inputBinding:
      position: 0
      prefix: '--top'
    label: Percentage range of the top alignment score
    doc: >
      Report alignments within the given percentage range of the top
      alignment score for a query (overrides --max-target-seqs option). For
      example, setting this to 10 will report all alignments whose score is
      at most 10% lower than the best alignment score for a query.
outputs:
  - id: matches
    type: File
    outputBinding:
      glob: $(inputs.queryInputFile.basename).diamond_matches
    format: edam:format_2333
doc: |
  DIAMOND is a sequence aligner for protein and translated DNA searches,
  designed for high performance analysis of big sequence data.

  The key features are:
  + Pairwise alignment of proteins and translated DNA at 500x-20,000x speed of BLAST.
  + Frameshift alignments for long read analysis.
  + Low resource requirements and suitable for running on standard desktops or laptops.
  + Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

  Please visit https://github.com/bbuchfink/diamond for full documentation.

  Releases can be downloaded from https://github.com/bbuchfink/diamond/releases
label: Aligns DNA query sequences against a protein reference database
arguments:
  - position: 0
    prefix: '--out'
    valueFrom: $(inputs.queryInputFile.basename).diamond_matches
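For illustration, the CWL expression `$(inputs.queryInputFile.basename).diamond_matches`, used both for `--out` and for the output glob, can be mirrored in Python (the function name is ours, purely illustrative):

```python
import os

def diamond_matches_name(query_path: str) -> str:
    # CWL `basename` is the file name with its extension kept, so the
    # output name is simply the query name plus a fixed suffix.
    return os.path.basename(query_path) + ".diamond_matches"

print(diamond_matches_name("/data/sample1.faa"))  # sample1.faa.diamond_matches
```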
baseCommand: [diamond_post_run_join.sh]
inputs:
  input_diamond:
    format: edam:format_2333
    type: File
    inputBinding:
      separate: true
      prefix: -i
  input_db:
    type: string
    inputBinding:
      separate: true
      prefix: -d
  filename: string
baseCommand: [emapper_wrapper.sh]
inputs:
  fasta_file:
    format: edam:format_1929  # FASTA
    type: File?
    inputBinding:
      separate: true
      prefix: -i
    label: Input FASTA file containing query sequences
  db:
    type: string?  # data/eggnog.db
    inputBinding:
      prefix: --database
    label: specify the target database for sequence searches (euk,bact,arch, host:port, local hmmpressed database)
  db_diamond:
    type: string?  # data/eggnog_proteins.dmnd
    inputBinding:
      prefix: --dmnd_db
    label: Path to DIAMOND-compatible database
  data_dir:
    type: string?  # data/
    inputBinding:
      prefix: --data_dir
    label: Directory to use for DATA_PATH
  mode:
    type: string?
    inputBinding:
      prefix: -m
    label: hmmer or diamond
  no_annot:
    type: boolean?
    inputBinding:
      prefix: --no_annot
    label: Skip functional annotation, reporting only hits
  no_file_comments:
    type: boolean?
    inputBinding:
      prefix: --no_file_comments
    label: No header lines nor stats are included in the output files
  cpu:
    type: int?
    inputBinding:
      prefix: --cpu
  annotate_hits_table:
    type: File?
    inputBinding:
      prefix: --annotate_hits_table
    label: Annotate TSV formatted table of query->hits
  output:
    type: string?
    inputBinding:
      prefix: -o
baseCommand: [assign_genome_properties.pl]  # without docker
arguments:
  - position: 1
    valueFrom: "-all"
  - position: 2
    valueFrom: "table"
    prefix: "-outfiles"
  - position: 4
    valueFrom: "summary"
    prefix: "-outfiles"
  - position: 3
    valueFrom: "web_json"
    prefix: "-outfiles"
inputs:
  input_tsv_file:
    type: File
    format: edam:format_3475
    inputBinding:
      separate: true
      prefix: "-matches"
  flatfiles_path:
    type: string
    inputBinding:
      prefix: "-gpdir"
  GP_txt:
    type: string
    inputBinding:
      prefix: "-gpff"
  out_dir:
    type: string?
    inputBinding:
      prefix: "-outdir"
  name:
    type: string?
    inputBinding:
      prefix: "-name"
baseCommand: [ build_assembly_gff.py ]
inputs:
  ips_results:
    type: File
    format: edam:format_3475
    inputBinding:
      prefix: -i
  eggnog_results:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: -e
  input_faa:
    format: edam:format_1929
    type: File
    inputBinding:
      prefix: -f
  output_name:
    type: string
    inputBinding:
      prefix: -o
arguments: ["-n", $(inputs.fasta.basename)]
baseCommand: [ "run_samtools.sh" ]
baseCommand: [give_pathways.py]
inputs:
  input_table:
    format: edam:format_3475  # TXT
    type: File
    inputBinding:
      separate: true
      prefix: -i
  graphs:
    type: string
    inputBinding:
      prefix: -g
  pathways_names:
    type: string
    inputBinding:
      prefix: -n
  pathways_classes:
    type: string
    inputBinding:
      prefix: -c
  outputname:
    type: string
    inputBinding:
      prefix: -o
baseCommand: ['parsing_hmmscan.py']
inputs:
  table:
    format: edam:format_3475
    type: File
    inputBinding:
      separate: true
      prefix: -i
  fasta:
    type: File
    inputBinding:
      separate: true
      prefix: -f
baseCommand: [ esl-ssplit.sh ]
arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
baseCommand: [ split_to_chunks.py ]
baseCommand: [ run_FGS.sh ]
arguments:
inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
  output:
    type: string
    inputBinding:
      separate: true
      prefix: "-o"
  seq_type:
    type: string
    inputBinding:
      separate: true
      prefix: "-s"
  train:
    type: string
    inputBinding:
      separate: true
      prefix: "-t"
    default: "illumina_5"
baseCommand: [ unite_protein_predictions.py ]
inputs:
  masking_file:
    type: File
    inputBinding:
      prefix: "--mask"
  predicted_proteins_prodigal_out:
    type: File?
    inputBinding:
      prefix: "--prodigal-out"
  predicted_proteins_prodigal_ffn:
    type: File?
    inputBinding:
      prefix: "--prodigal-ffn"
  predicted_proteins_prodigal_faa:
    type: File?
    inputBinding:
      prefix: "--prodigal-faa"
  predicted_proteins_fgs_out:
    type: File
    inputBinding:
      prefix: "--fgs-out"
  predicted_proteins_fgs_ffn:
    type: File
    inputBinding:
      prefix: "--fgs-ffn"
  predicted_proteins_fgs_faa:
    inputBinding:
      prefix: "--fgs-faa"
    type: File
  basename:
    inputBinding:
      prefix: "--name"
    type: string
  genecaller_order:
    inputBinding:
      prefix: "--caller-priority"
    type: string
baseCommand: [ prodigal ]
arguments:
  - valueFrom: "sco"
    prefix: "-f"
  - valueFrom: "meta"
    prefix: "-p"
  - valueFrom: $(inputs.input_fasta.basename).prodigal
    prefix: "-o"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.ffn
    prefix: "-d"
  - valueFrom: $(inputs.input_fasta.basename).prodigal.faa
    prefix: "-a"
inputs:
  input_fasta:
    format: 'edam:format_1929'
    type: File
    inputBinding:
      separate: true
      prefix: "-i"
baseCommand: ["go_summary_pipeline-1.0.py"]
baseCommand: [ hmmscan_tab.py ]  # old was with sed
arguments:
  - valueFrom: $(inputs.input_table.nameroot).tsv
    prefix: -o
baseCommand: ["hmmsearch"]
arguments:
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
  - prefix: --domtblout
    valueFrom: $(inputs.seqfile.nameroot)_hmmsearch.tbl
    position: 2
  - prefix: --cpu
    valueFrom: '4'
  - prefix: -o
    valueFrom: '/dev/null'
inputs:
  omit_alignment:
    type: boolean?
    inputBinding:
      position: 1
      prefix: "--noali"
  gathering_bit_score:
    type: boolean?
    inputBinding:
      position: 4
      prefix: "--cut_ga"
  path_database:
    type: string
    inputBinding:
      position: 5
  seqfile:
    format: edam:format_1929  # FASTA
    type: File
    inputBinding:
      position: 6
      separate: true
baseCommand: interproscan.sh
inputs:
  - id: inputFile
    type: File
    format: edam:format_1929
    inputBinding:
      position: 8
      prefix: '--input'
    label: Input file path
    doc: >-
      Optional, path to fasta file that should be loaded on Master startup.
      Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
  - id: applications
    type: string[]?
    inputBinding:
      position: 9
      itemSeparator: ','
      prefix: '--applications'
    label: Analysis
    doc: >-
      Optional, comma separated list of analyses. If this option is not set,
      ALL analyses will be run.
  - id: outputFormat
    type: string[]
    inputBinding:
      position: 10
      itemSeparator: ','
      prefix: '--formats'
    label: output format
    doc: >-
      Optional, case-insensitive, comma separated list of output formats.
      Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for
      protein sequences are TSV, XML and GFF3, or for nucleotide sequences
      GFF3 and XML.
  - id: databases
    type: string?  # Directory?
  - id: disableResidueAnnotation
    type: boolean?
    inputBinding:
      position: 11
      prefix: '--disable-residue-annot'
    label: Disables residue annotation
    doc: 'Optional, excludes sites from the XML, JSON output.'
  - id: seqtype
    type:
      - 'null'
      - type: enum
        symbols:
          - p
          - n
        name: seqtype
    inputBinding:
      position: 12
      prefix: '--seqtype'
    label: Sequence type
    doc: >-
      Optional, the type of the input sequences (dna/rna (n) or protein (p)).
      The default sequence type is protein.
outputs:
  - id: i5Annotations
    format: edam:format_3475
    type: File
    outputBinding:
      glob: $(inputs.inputFile.nameroot).f*.tsv
doc: >-
  InterProScan is the software package that allows sequences (protein and
  nucleic) to be scanned against InterPro's signatures. Signatures are
  predictive models, provided by several different databases, that make up
  the InterPro consortium. This tool description is using a Docker container
  tagged as version v5.30-69.0. Documentation on how to run InterProScan 5
  can be found here: https://github.com/ebi-pf-team/interproscan/wiki/HowToRun
label: 'InterProScan: protein sequence classifier'
arguments:
  - position: 0
    valueFrom: '--disable-precalc'
  - position: 1
    valueFrom: '--goterms'
  - position: 2
    valueFrom: '--pathways'
  - position: 3
    prefix: '--tempdir'
    valueFrom: $(runtime.tmpdir)
baseCommand: [bedtools, maskfasta]
arguments:
  - valueFrom: ITS_masked.fasta
    prefix: -fo
baseCommand: [format_bedfile]
# reverse start and end where start < end (i.e. neg strand)
baseCommand: [ its-length-new.py ]
baseCommand: ["run_quality_filtering.py"]
inputs:
  seq_file:
    type: File
    # format: edam:format_1929  # FASTA
    inputBinding:
      position: 1
    label: 'Trimmed sequence file'
    doc: >
      Trimmed and FASTQ to FASTA converted sequences file.
  submitted_seq_count:
    type: int
    label: 'Number of submitted sequences'
    doc: >
      Number of originally submitted sequences as in the user submitted
      FASTQ file - single end FASTQ or pair end merged FASTQ file.
  stats_file_name:
    type: string
    default: stats_summary
    label: 'Post QC stats output file name'
    doc: >
      Give a name for the file which will hold the stats after QC.
  min_length:
    type: int
    default: 100  # For assemblies we need to set this in the input YAML to 500
    label: 'Minimum read or contig length'
    doc: >
      Specify the minimum read or contig length for sequences to pass QC
      filtering.
  input_file_format: string
outputs:
  filtered_file:
    label: Filtered output file
    format: edam:format_1929  # FASTA
    type: File
    outputBinding:
      glob: $(inputs.seq_file.nameroot).fasta
  stats_summary_file:
    label: Stats summary output file
    type: File
    outputBinding:
      glob: $(inputs.stats_file_name)
arguments:
  - position: 2
    valueFrom: $(inputs.seq_file.nameroot).fasta
  - position: 3
    valueFrom: $(inputs.stats_file_name)
  - position: 4
    valueFrom: $(inputs.submitted_seq_count)
  - position: 5
    prefix: '--min_length'
    valueFrom: $(inputs.min_length)
  - position: 6
    prefix: '--extension'
    valueFrom: $(inputs.input_file_format)
baseCommand: ["MGRAST_base.py"]
inputs:
  QCed_reads:
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      prefix: -i
  length_sum:
    label: Prefix for the files associated with sequence length distribution
    type: string
    default: seq-length.out
  gc_sum:
    label: Prefix for the files associated with GC distribution
    type: string
    default: GC-distribution.out
  nucleotide_distribution:
    label: Prefix for the files associated with nucleotide distribution
    type: string
    default: nucleotide-distribution.out
  summary:
    label: File names for summary of sequences, e.g. number, min/max length etc.
    type: string
    default: summary.out
  max_seq:
    label: Maximum number of sequences to sub-sample
    type: int?
    default: 2000000
  out_dir_name:
    label: Specifies output subdirectory
    type: string
    default: qc-statistics
  sequence_count:
    label: Specifies the number of sequences in the input read file (FASTA formatted)
    type: int
outputs:
  output_dir:
    label: Contains all stats output files
    type: Directory
    outputBinding:
      glob: $(inputs.out_dir_name)
  summary_out:
    label: Contains the summary statistics for the input sequence file
    type: File
    format: iana:text/plain
    outputBinding:
      glob: $(inputs.out_dir_name)/$(inputs.summary)
arguments:
  - position: 1
    prefix: '-o'
    valueFrom: $(inputs.out_dir_name)/$(inputs.summary)
  - position: 2
    prefix: '-d'
    valueFrom: |
      ${
        var suffix = '.full';
        if (inputs.sequence_count > inputs.max_seq) { suffix = '.sub-set'; }
        return "".concat(inputs.out_dir_name, '/', inputs.nucleotide_distribution, suffix);
      }
  - position: 3
    prefix: '-g'
    valueFrom: |
      ${
        var suffix = '.full';
        if (inputs.sequence_count > inputs.max_seq) { suffix = '.sub-set'; }
        return "".concat(inputs.out_dir_name, '/', inputs.gc_sum, suffix);
      }
  - position: 4
    prefix: '-l'
    valueFrom: |
      ${
        var suffix = '.full';
        if (inputs.sequence_count > inputs.max_seq) { suffix = '.sub-set'; }
        return "".concat(inputs.out_dir_name, '/', inputs.length_sum, suffix);
      }
  - position: 5
    valueFrom: ${ if (inputs.sequence_count > inputs.max_seq) { return '-m '.concat(inputs.max_seq); } else { return ''; } }
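The three JavaScript `valueFrom` blocks above share one rule: if the input holds more sequences than `max_seq`, the statistics are computed on a sub-sample and the output files get a `.sub-set` suffix, otherwise `.full`. A minimal Python sketch of that rule (the function name is ours, not part of the pipeline):

```python
def stats_output_path(out_dir: str, prefix: str,
                      sequence_count: int, max_seq: int = 2000000) -> str:
    # Mirrors the CWL JavaScript: sub-sampled runs are tagged '.sub-set',
    # runs analysed in full are tagged '.full'.
    suffix = ".sub-set" if sequence_count > max_seq else ".full"
    return "{}/{}{}".format(out_dir, prefix, suffix)

print(stats_output_path("qc-statistics", "GC-distribution.out", 1500))
# qc-statistics/GC-distribution.out.full
```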
baseCommand: [clean_motus_output.sh]

baseCommand: [motus]
arguments: [profile, -c, -q]
baseCommand: [ "biom-convert.sh" ]
inputs:
  biom:
    type: File?
    format: edam:format_3746  # BIOM
    inputBinding:
      prefix: --input-fp
  table_type:
    type: string?  # biom-convert-table.yaml#table_type?
    inputBinding:
      prefix: --table-type  # --table-type= <- worked for cwlexec
      separate: true  # false <- worked for cwlexec
      valueFrom: $(inputs.table_type)  # $('"' + inputs.table_type + '"') <- worked for cwlexec
  json:
    type: boolean?
    label: Output as JSON-formatted table.
    inputBinding:
      prefix: --to-json
  hdf5:
    type: boolean?
    label: Output as HDF5-formatted table.
    inputBinding:
      prefix: --to-hdf5
  tsv:
    type: boolean?
    label: Output as TSV-formatted (classic) table.
    inputBinding:
      prefix: --to-tsv
  header_key:
    type: string?
    doc: |
      The observation metadata to include from the input BIOM table file
      when creating a tsv table file. By default no observation metadata
      will be included.
    inputBinding:
      prefix: --header-key
arguments:
  - valueFrom: |
      ${
        var ext = "";
        if (inputs.json) { ext = "_json.biom"; }
        if (inputs.hdf5) { ext = "_hdf5.biom"; }
        if (inputs.tsv)  { ext = "_tsv.biom"; }
        var pre = inputs.biom.nameroot.split('.');
        pre.pop();
        return pre.join('.') + ext;
      }
    prefix: --output-fp
  - valueFrom: "--collapsed-observations"
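The `--output-fp` expression picks a suffix from the requested output format and drops the last dotted component of the input's `nameroot`; a hedged Python sketch of the same logic (function name ours):

```python
def biom_output_name(input_name: str, json=False, hdf5=False, tsv=False) -> str:
    # Choose the extension exactly as the CWL JavaScript does:
    # later flags win if several are set.
    ext = ""
    if json:
        ext = "_json.biom"
    if hdf5:
        ext = "_hdf5.biom"
    if tsv:
        ext = "_tsv.biom"
    # CWL `nameroot` is the file name minus its last extension.
    nameroot = input_name.rsplit(".", 1)[0]
    parts = nameroot.split(".")
    parts.pop()  # drop the last dotted component, as pre.pop() does
    return ".".join(parts) + ext

print(biom_output_name("sample.otu_table.biom", hdf5=True))  # sample_hdf5.biom
```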
baseCommand: [ cmsearch-deoverlap.pl ]
inputs:
  - id: clan_information
    type: string?
    inputBinding:
      position: 0
      prefix: '--clanin'
    label: clan information on the models provided
    doc: Not all models provided need to be a member of a clan
  - id: cmsearch_matches
    type: File
    format: edam:format_3475
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
baseCommand:
  - cmsearch
inputs:
  - id: covariance_model_database
    type: [ string, File ]
    inputBinding:
      position: 1
  - id: cpu
    type: int?
    inputBinding:
      position: 0
      prefix: '--cpu'
    label: Number of parallel CPU workers to use for multithreads
  - default: false
    id: cut_ga
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--cut_ga'
    label: use CM's GA gathering cutoffs as reporting thresholds
  - id: omit_alignment_section
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--noali'
    label: Omit the alignment section from the main output.
    doc: This can greatly reduce the output volume.
  - default: false
    id: only_hmm
    type: boolean?
    inputBinding:
      position: 0
      prefix: '--hmmonly'
    label: 'Only use the filter profile HMM for searches, do not use the CM'
    doc: |
      Only filter stages F1 through F3 will be executed, using strict
      P-value thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3).
      Additionally a bias composition filter is used after the F1 stage
      (with P=0.02 survival threshold). Any hit that survives all stages
      and has an HMM E-value or bit score above the reporting threshold
      will be output.
  - id: query_sequences
    type: File
    format: edam:format_1929  # FASTA
    inputBinding:
      position: 2
    # streamable: true
  - id: search_space_size
    type: int
    inputBinding:
      position: 0
      prefix: '-Z'
    label: search space size in *Mb* to <x> for E-value calculations
outputs:
  - id: matches
    doc: 'http://eddylab.org/infernal/Userguide.pdf#page=60'
    label: 'target hits table, format 2'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name = inputs.query_sequences.basename + "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch_matches.tbl";
          } else {
            name = inputs.query_sequences.basename + "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch_matches.tbl";
          }
          return name;
        }
  - id: programOutput
    label: 'direct output to file, not stdout'
    type: File
    format: edam:format_3475
    outputBinding:
      glob: |
        ${
          var name = "";
          if (typeof inputs.covariance_model_database == "string") {
            name = inputs.query_sequences.basename + "." +
              inputs.covariance_model_database.split("/").slice(-1)[0] +
              ".cmsearch.out";
          } else {
            name = inputs.query_sequences.basename + "." +
              inputs.covariance_model_database.nameroot +
              ".cmsearch.out";
          }
          return name;
        }
doc: >
  Infernal ("INFERence of RNA ALignment") is for searching DNA sequence
  databases for RNA structure and sequence similarities. It is an
  implementation of a special case of profile stochastic context-free
  grammars called covariance models (CMs). A CM is like a sequence profile,
  but it scores a combination of sequence consensus and RNA secondary
  structure consensus, so in many cases, it is more capable of identifying
  RNA homologs that conserve their secondary structure more than their
  primary sequence.

  Please visit http://eddylab.org/infernal/ for full documentation.

  Version 1.1.2 can be downloaded from
  http://eddylab.org/infernal/infernal-1.1.2.tar.gz
label: Search sequence(s) against a covariance model database
arguments:
  - position: 0
    prefix: '--tblout'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name = inputs.query_sequences.basename + "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch_matches.tbl";
        } else {
          name = inputs.query_sequences.basename + "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch_matches.tbl";
        }
        return name;
      }
  - position: 0
    prefix: '-o'
    valueFrom: |
      ${
        var name = "";
        if (typeof inputs.covariance_model_database == "string") {
          name = inputs.query_sequences.basename + "." +
            inputs.covariance_model_database.split("/").slice(-1)[0] +
            ".cmsearch.out";
        } else {
          name = inputs.query_sequences.basename + "." +
            inputs.covariance_model_database.nameroot +
            ".cmsearch.out";
        }
        return name;
      }
  - valueFrom: '> /dev/null'
    shellQuote: false
    position: 10
  - valueFrom: '2> /dev/null'
    shellQuote: false
    position: 11
hints:
  - class: SoftwareRequirement
    packages:
      infernal:
        specs:
          - 'https://identifiers.org/rrid/RRID:SCR_011809'
        version:
          - 1.1.2
  - class: DockerRequirement
    dockerPull: 'quay.io/biocontainers/infernal:1.1.2--h470a237_1'
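Because `covariance_model_database` may be either a string path or a `File`, the same naming expression appears four times (two output globs, `--tblout`, `-o`). Its logic can be sketched once in Python (names ours, purely illustrative):

```python
def cmsearch_output_name(query_basename, cm_database, suffix):
    # A string database keeps its final path component; a File-like
    # database (here: any object with a `nameroot` attribute) uses nameroot.
    if isinstance(cm_database, str):
        db_part = cm_database.split("/")[-1]
    else:
        db_part = cm_database.nameroot
    return "{}.{}.{}".format(query_basename, db_part, suffix)

print(cmsearch_output_name("reads.fasta", "/refs/ribo.cm",
                           "cmsearch_matches.tbl"))
# reads.fasta.ribo.cm.cmsearch_matches.tbl
```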
baseCommand: [ esl-index.sh ]

baseCommand: [ esl-sfetch ]

baseCommand: awk_tool

baseCommand: get_subunits_coords.py

baseCommand: get_subunits.py
baseCommand: ktImportText
arguments:
  - valueFrom: "krona.html"
    prefix: -o
baseCommand: ['mapseq2biom.pl']
arguments:
  - valueFrom: $(inputs.query.basename).tsv
    prefix: --outfile
  - valueFrom: $(inputs.query.basename).txt
    prefix: --krona
  - valueFrom: $(inputs.query.basename).notaxid.tsv
    prefix: --notaxidfile
baseCommand: mapseq
arguments: ['-nthreads', '8', '-tophits', '80', '-topotus', '40', '-outfmt', 'simple']

baseCommand: [pull_ncrnas.sh]
baseCommand: SeqPrep
arguments:
  - "-1"
  - forward_unmerged.fastq.gz
  - "-2"
  - reverse_unmerged.fastq.gz
  - valueFrom: |
      ${ return inputs.namefile.nameroot.split('_')[0] + '_MERGED.fastq.gz'; }
    prefix: "-s"
  # - "-3"
  # - forward_discarded.fastq.gz
  # - "-4"
  # - reverse_discarded.fastq.gz
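The merged-file name is built from the `nameroot` of `namefile` by keeping everything before the first underscore. In Python (function name and the example accession are ours, purely illustrative):

```python
def merged_name(namefile_nameroot: str) -> str:
    # A paired-end nameroot such as 'SRR123456_1' collapses to the
    # accession plus a fixed _MERGED suffix, as in the CWL valueFrom.
    return namefile_nameroot.split("_")[0] + "_MERGED.fastq.gz"

print(merged_name("SRR123456_1"))  # SRR123456_MERGED.fastq.gz
```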
baseCommand: [functional_stats.py]

baseCommand: [write_summaries.py]
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 | baseCommand: [ trimmomatic.sh ] inputs: phred: type: string? #trimmomatic-phred.yaml#phred? inputBinding: prefix: -phred separate: false position: 4 label: 'quality score format' doc: > Either PHRED "33" or "64" specifies the base quality encoding. Default: 64 tophred64: type: boolean? inputBinding: position: 12 prefix: TOPHRED64 separate: false label: 'quality score conversion to phred64' doc: > This (re)encodes the quality part of the FASTQ file to base 64. headcrop: type: int? inputBinding: position: 13 prefix: 'HEADCROP:' separate: false label: 'read head trimming' doc: > Removes the specified number of bases, regardless of quality, from the beginning of the read. The numbser specified is the number of bases to keep, from the start of the read. tophred33: type: boolean? inputBinding: position: 12 prefix: TOPHRED33 separate: false label: 'quality score conversion to phred33' doc: > This (re)encodes the quality part of the FASTQ file to base 33. minlen: type: int? 
  # [snippet begins part-way through an input definition; the input's name is truncated in the source]
    inputBinding:
      position: 100
      prefix: 'MINLEN:'
      separate: false
    label: 'minimum length read filter'
    doc: >
      This module removes reads that fall below the specified minimum length.
      If required, it should normally be placed after all other processing
      steps. Reads removed by this step will be counted and included in the
      "dropped reads" count presented in the Trimmomatic summary.
  java_opts:
    type: string?
    inputBinding:
      position: 1
      shellQuote: false
    doc: >
      JVM arguments should be a quoted, space-separated list
      (e.g. "-Xms128m -Xmx512m")
  leading:
    type: int?
    inputBinding:
      position: 14
      prefix: 'LEADING:'
      separate: false
    label: 'read head trimming'
    doc: >
      Remove low quality bases from the beginning. As long as a base has
      a value below this threshold the base is removed and the next base
      will be investigated.
  slidingwindow:
    type: string?  #trimmomatic-sliding_window.yaml#slidingWindow?
    inputBinding:
      position: 15
      prefix: 'SLIDINGWINDOW:'
      separate: false
    label: 'read filtering sliding window'
    doc: >
      Perform a sliding window trimming, cutting once the average quality
      within the window falls below a threshold. By considering multiple
      bases, a single poor quality base will not cause the removal of high
      quality data later in the read.
      <windowSize> specifies the number of bases to average across;
      <requiredQuality> specifies the average quality required.
  illuminaClip:
    type: File?  #trimmomatic-illumina_clipping.yaml#illuminaClipping?
    inputBinding:
      valueFrom: |
        ${ if ( self ) {
             return "ILLUMINACLIP:" + inputs.illuminaClip.adapters.path + ":"
               + self.seedMismatches + ":" + self.palindromeClipThreshold + ":"
               + self.simpleClipThreshold + ":" + self.minAdapterLength + ":"
               + self.keepBothReads;
           } else {
             return self;
           }
         }
      position: 11
    label: 'sequencing adapter removal'
    doc: >
      Cut adapter and other Illumina-specific sequences from the read.
  crop:
    type: int?
    inputBinding:
      position: 13
      prefix: 'CROP:'
      separate: false
    label: 'read cropping'
    doc: >
      Removes bases regardless of quality from the end of the read, so that
      the read has maximally the specified length after this step has been
      performed. Steps performed after CROP might of course further shorten
      the read. The value is the number of bases to keep, from the start of
      the read.
  reads2:
    type: File?
    inputBinding:
      position: 6
    label: 'FASTQ read file 2'
    doc: >
      FASTQ file of R2 reads in Paired End mode
  reads1:
    type: File
    inputBinding:
      position: 5
    label: 'FASTQ read file 1'
    doc: >
      FASTQ file of reads (R1 reads in Paired End mode)
  avgqual:
    type: int?
    inputBinding:
      position: 101
      prefix: 'AVGQUAL:'
      separate: false
    label: 'minimum average quality required'
    doc: >
      Drop the read if the average quality is below the specified level
  trailing:
    type: int?
    inputBinding:
      position: 14
      prefix: 'TRAILING:'
      separate: false
    label: 'read tail quality filtering'
    doc: >
      Remove low quality bases from the end. As long as a base has a value
      below this threshold the base is removed and the next base (which, as
      Trimmomatic starts from the 3' end, would be the base preceding the
      just-removed base) will be investigated. This approach can be used to
      remove the special Illumina "low quality segment" regions (which are
      marked with a quality score of 2), but we recommend Sliding Window or
      MaxInfo instead.
  maxinfo:
    type: int?  #trimmomatic-max_info.yaml#maxinfo?
    inputBinding:
      position: 15
      valueFrom: |
        ${ if ( self ) {
             return "MAXINFO:" + self.targetLength + ":" + self.strictness;
           } else {
             return self;
           }
         }
    label: 'maxinfo: read score quality filtering'
    doc: >
      Performs an adaptive quality trim, balancing the benefits of retaining
      longer reads against the costs of retaining bases with errors.
      <targetLength>: specifies the read length which is likely to allow the
      location of the read within the target sequence to be determined.
      <strictness>: a value between 0 and 1 specifying the balance between
      preserving as much read length as possible vs. removal of incorrect
      bases. A low value of this parameter (<0.2) favours longer reads,
      while a high value (>0.8) favours read correctness.
  end_mode:
    type: string  #trimmomatic-end_mode.yaml#end_mode
    inputBinding:
      position: 3
    label: 'read end-mode format'
    doc: >
      Single End (SE) or Paired End (PE) mode

outputs:
  reads1_trimmed:
    type: File
    format: edam:format_1930  # FASTQ
    outputBinding:
      glob: $(inputs.reads1.nameroot).trimmed
  log_file:
    type: File
    outputBinding:
      glob: 'trim.log'
    label: Log file
    doc: |
      Log of all read trimmings, indicating the following details:
        the read name
        the surviving sequence length
        the location of the first surviving base, i.e. the amount trimmed from the start
        the location of the last surviving base in the original read
        the amount trimmed from the end
  reads1_trimmed_unpaired:
    type: File?
    format: edam:format_1930  # FASTQ
    outputBinding:
      glob: $(inputs.reads1.basename).trimmed.unpaired.fastq

arguments:
  - valueFrom: trim.log
    prefix: -trimlog
    position: 4
  - valueFrom: $(runtime.cores)
    position: 4
    prefix: -threads
  - valueFrom: $(inputs.reads1.nameroot).trimmed
    position: 7
  #- valueFrom: |
  #    ${
  #      if (inputs.end_mode == "PE" && inputs.reads2) {
  #        return inputs.reads1.nameroot + '.trimmed.unpaired.fastq';
  #      } else {
  #        return null;
  #      }
  #    }
  #  position: 8
  #- valueFrom: |
  #    ${
  #      if (inputs.end_mode == "PE" && inputs.reads2) {
  #        return inputs.reads2.nameroot + '.trimmed.fastq';
  #      } else {
  #        return null;
  #      }
  #    }
  #  position: 9
  #- valueFrom: |
  #    ${
  #      if (inputs.end_mode == "PE" && inputs.reads2) {
  #        return inputs.reads2.nameroot + '.trimmed.unpaired.fastq';
  #      } else {
  #        return null;
  #      }
  #    }
  #  position: 10
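To illustrate how the `position`, `prefix`, and `separate: false` bindings above turn into Trimmomatic step arguments, here is a hypothetical Python sketch (the function name and signature are ours, not part of the pipeline); sorting by `position` reproduces CWL's ordering, which is why MINLEN (100) and AVGQUAL (101) land after the quality-trimming steps, as the docs recommend:

```python
def trimmomatic_step_args(leading=None, trailing=None, minlen=None, avgqual=None):
    """Sketch: each (position, prefix) pair mirrors an inputBinding above."""
    bindings = [
        (14, "LEADING:", leading),
        (14, "TRAILING:", trailing),
        (100, "MINLEN:", minlen),
        (101, "AVGQUAL:", avgqual),
    ]
    # separate: false means the value is concatenated directly onto its prefix
    return [prefix + str(value)
            for position, prefix, value in sorted(bindings, key=lambda b: (b[0], b[1]))
            if value is not None]
```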
baseCommand: [ add_header ]

inputs:
  input_table:
    # format: [edam:format_3475, edam:format_2333]
    type: File
    inputBinding:
      prefix: -i
  header:
    type: string
    inputBinding:
      prefix: -h
baseCommand: [ count_lines.py ]

inputs:
  sequences:
    type: File
    inputBinding:
      prefix: -f
  number:
    type: int
    inputBinding:
      prefix: -n
baseCommand: [ bash ]

arguments:
  - valueFrom: |
      expr \$(cat $(inputs.input_file.path) | wc -l)
    prefix: -c
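The `count_lines.py` tool and the inline `bash -c "expr $(cat … | wc -l)"` snippet above both reduce to counting lines. A minimal Python equivalent (the function names are illustrative, not the actual script) shows why the line count matters: fixed-width record formats such as FASTQ store a known number of lines per record, so the record count falls out by division:

```python
def count_lines(lines):
    """Equivalent of `wc -l` over any iterable of lines (e.g. an open file)."""
    return sum(1 for _ in lines)

def count_records(lines, lines_per_record=4):
    """count_lines.py-style record count: a FASTQ record spans 4 lines."""
    return count_lines(lines) // lines_per_record
```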
baseCommand: [ fastp ]

inputs:
  fastq1:
    type: File
    format:
      - edam:format_1930  # FASTQ
      - edam:format_1929  # FASTA
    inputBinding:
      prefix: -i
  fastq2:
    type: File?
    format:
      - edam:format_1930  # FASTQ
      - edam:format_1929  # FASTA
    inputBinding:
      prefix: -I
  threads:
    type: int?
    default: 1
    inputBinding:
      prefix: --thread
  qualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --qualified_quality_phred
  unqualified_phred_quality:
    type: int?
    default: 20
    inputBinding:
      prefix: --unqualified_percent_limit
  min_length_required:
    type: int?
    default: 50
    inputBinding:
      prefix: --length_required
  force_polyg_tail_trimming:
    type: boolean?
    inputBinding:
      prefix: --trim_poly_g
  disable_trim_poly_g:
    type: boolean?
    default: true
    inputBinding:
      prefix: --disable_trim_poly_g
  base_correction:
    type: boolean?
    default: true
    inputBinding:
      prefix: --correction

arguments:
  - prefix: -o
    valueFrom: $(inputs.fastq1.nameroot).fastp.fastq
  - |
    ${ if (inputs.fastq2) { return '-O'; } else { return ''; } }
  - |
    ${ if (inputs.fastq2) { return inputs.fastq2.nameroot + ".fastp.fastq"; } else { return ''; } }
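The conditional `arguments` block above only emits the paired-end output flags (`-I`/`-O`) when a second read file is supplied. A hedged Python sketch of the same command-line assembly (our own helper, not part of MGnify; only a subset of the flags is shown):

```python
def fastp_command(fastq1, fastq2=None, threads=1, qualified_phred_quality=20,
                  min_length_required=50):
    """Build a fastp invocation; -I/-O appear only in paired-end mode."""
    stem1 = fastq1.rsplit(".", 1)[0]  # rough stand-in for CWL's $(inputs.fastq1.nameroot)
    cmd = ["fastp", "-i", fastq1,
           "--thread", str(threads),
           "--qualified_quality_phred", str(qualified_phred_quality),
           "--length_required", str(min_length_required),
           "-o", stem1 + ".fastp.fastq"]
    if fastq2:
        stem2 = fastq2.rsplit(".", 1)[0]
        cmd += ["-I", fastq2, "-O", stem2 + ".fastp.fastq"]
    return cmd
```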
baseCommand: [ fastq_to_fasta.py ]

arguments:
  - valueFrom: $(inputs.fastq.nameroot).unclean
    prefix: '-o'
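The `fastq_to_fasta.py` script itself is not shown above; as a rough sketch of what such a conversion does (a FASTQ record spans four lines, and FASTA keeps only the header and sequence), assuming in-memory lists of lines rather than the script's actual file handling:

```python
def fastq_to_fasta(records):
    """Minimal FASTQ -> FASTA conversion over a list of FASTQ lines."""
    out = []
    for i in range(0, len(records), 4):
        header, seq = records[i], records[i + 1]
        out.append(">" + header.lstrip("@"))  # '@name' becomes '>name'
        out.append(seq)                       # quality lines are dropped
    return out
```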
baseCommand: [ generate_checksum.py ]
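Only the `baseCommand` of `generate_checksum.py` appears above. A hedged sketch of what a checksum helper like this typically does, streaming the file in chunks so large archives do not need to fit in memory (SHA-1 is an assumption here; the hash the real script uses is not shown):

```python
import hashlib

def generate_checksum(stream, chunk_size=8192):
    """Hash a binary file-like object in fixed-size chunks."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```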
baseCommand: [ make_csv.py ]

inputs:
  tab_sep_table:
    format: edam:format_3475
    type: File
    inputBinding:
      prefix: '-i'
  output_name:
    type: string
    inputBinding:
      prefix: '-o'
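The `make_csv.py` tool takes a tab-separated table (edam:format_3475) and writes it out under a new name. A minimal Python sketch of that conversion, working on strings rather than the files the real script handles (its exact quoting and header behaviour may differ):

```python
import csv
import io

def make_csv(tab_sep_text):
    """Re-serialise tab-separated text as comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(
        csv.reader(io.StringIO(tab_sep_text), delimiter="\t"))
    return out.getvalue()
```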
baseCommand: [ gunzip, -c ]
baseCommand: [ pigz ]

arguments: ["-p", "16", "-c"]
baseCommand: [ run_result_file_chunker.py ]