1 - Download this repository
Download this repository to get the files describing the pipeline rules:
wget https://github.com/arthurvinx/Medusa/archive/refs/heads/main.zip
unzip main.zip
cd Medusa-main/
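Alternatively, if git is installed, cloning the repository achieves the same result (note that the clone is created in a folder named Medusa rather than Medusa-main):
git clone https://github.com/arthurvinx/Medusa.git
cd Medusa/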
2 - Create the expected folder structure
Go to the folder containing the Snakefile (via command line) and create the expected folder structure with:
mkdir -p ./Pipeline/{result,data/{merged,assembled,collapsed,removal/{index,reference},raw,trimmed},alignment/{db,index},taxonomic/db,functional/db}
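This single command expands the brace patterns into the full tree expected by the pipeline:
Pipeline/
├── result/
├── data/
│   ├── merged/
│   ├── assembled/
│   ├── collapsed/
│   ├── removal/
│   │   ├── index/
│   │   └── reference/
│   ├── raw/
│   └── trimmed/
├── alignment/
│   ├── db/
│   └── index/
├── taxonomic/
│   └── db/
└── functional/
    └── db/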
3 - Install the conda package manager
An Anaconda environment was created to ease the installation of the software required by this pipeline. The environment recipe is available at the Anaconda cloud: https://anaconda.org/arthurvinx/medusapipeline.
A conda package manager, such as miniconda, must be installed to get this environment and to install the Snakemake workflow management system.
Download and install the latest miniconda release for Linux (adapt the commands if needed):
cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
chmod u+x Miniconda3-py39_4.9.2-Linux-x86_64.sh
./Miniconda3-py39_4.9.2-Linux-x86_64.sh
Close the terminal after the installation, open it again, and then check if the installation was successful with:
conda -V
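If the installation succeeded, this prints the installed version, which for the release above should be something like:
conda 4.9.2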
4 - Get the pipeline environment
Get the pipeline environment from the Anaconda cloud:
conda activate base
conda install anaconda-client -y
conda env create arthurvinx/medusaPipeline
conda activate medusaPipeline
pip3 install -U plyvel --no-cache-dir --no-deps --force-reinstall
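To double-check that the environment was created, list the available environments; the active one is marked with an asterisk:
conda env list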
5 - Install Snakemake
The recommended way to install Snakemake is via the conda package manager. The following commands will create a conda environment with the full version of Snakemake. More details can be found in the Snakemake installation guide.
conda activate base
conda install -c conda-forge mamba -y
mamba create -c bioconda -c conda-forge -n snakemake snakemake
Check whether your installation succeeded by typing:
conda activate snakemake
snakemake --help
6 - Move the input files to the expected location
Move your raw FASTQ files to the inputDIR specified in the Snakefile. By default, the inputDIR is "Pipeline/data/raw", and paired-end filenames are expected to have the suffixes "_1.fastq" and "_2.fastq". Alternatively, you may change the inputDIR by editing the Snakefile. Note that all paths in the Snakefile are interpreted relative to the directory Snakemake is executed in.
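For example, a paired-end sample with the hypothetical name sample01 would be placed as:
Pipeline/data/raw/sample01_1.fastq
Pipeline/data/raw/sample01_2.fastq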
7 - Run snakemake
To start this pipeline, go to the folder containing the Snakefile (via command line) and run:
snakemake --cores
This will use all available cores whenever possible. Alternatively, you may set the number of cores explicitly, as described in the Snakemake command line interface guide.
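For example, to limit the run to 8 cores:
snakemake --cores 8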
Code Snippets
# createDictionary.R
# Builds an identifier-to-GO dictionary from two key-value mapping tables.
# args[1]: output file; args[2], args[3]: input mappings; args[4]: threads.

args <- commandArgs(trailingOnly = TRUE)

library(dplyr)
library(tidyr)
library(data.table)

# Split "; "-collapsed keys into one row per key and swap the column order
fixCollapsed <- function(df){
  colnames(df) <- c("key", "value")
  df <- df %>% mutate(key = strsplit(key, "; ")) %>% unnest(key)
  df <- df[, c(2, 1)]
  return(df)
}

# Merge duplicated keys, collapsing their values into a unique "; "-separated list
fixDuplicated <- function(df){
  df <- df %>% group_by(key) %>% summarise(value = paste(value, collapse = "; "))
  values <- strsplit(df$value, "; ")
  values <- lapply(values, unique)
  values <- sapply(values, paste, collapse = "; ")
  df$value <- values
  return(df)
}

# Drop entries whose key starts with "-" (unknown identifiers)
removeUnknown <- function(df){
  idx <- grepl("^-", df$key)
  df <- df[!idx,]
  return(df)
}

# Process the first input mapping and write the dictionary
df <- fread(args[2], stringsAsFactors = FALSE, header = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>% fixCollapsed() %>% fixDuplicated() %>% removeUnknown() %>%
  fwrite(file = args[1], sep = "\t", nThread = as.integer(args[4]))

# Process the second input mapping and append it to the same dictionary
df <- fread(args[3], stringsAsFactors = FALSE, header = FALSE, nThread = as.integer(args[4]))
df <- as.data.frame(df)
df %>% fixCollapsed() %>% fixDuplicated() %>% removeUnknown() %>%
  fwrite(file = args[1], sep = "\t", append = TRUE, nThread = as.integer(args[4]))

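For reference, the Snakefile calls this script from the functional annotation rule shown below, passing the output dictionary, the two input mappings, and a thread count. Assuming {functionalDIR} resolves to Pipeline/functional (matching the folder structure above), the invocation is equivalent to:
Rscript createDictionary.R Pipeline/functional/NR2GO.txt \
    Pipeline/functional/genbank2GO.txt Pipeline/functional/refseq2GO.txt 4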
# Quality trimming of single-end reads with fastp
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input} -o {output.trimmed} -q {phredQuality} -w {threads} -h {output.html} -j {output.json}"

# Quality trimming of paired-end reads with fastp, with adapter detection
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {output.f} \
    -O {output.r} -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json}"

# Download the GRCh38 human reference genome from Ensembl and decompress it
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {referenceDIR} \
    ftp://ftp.ensembl.org/pub/release-{ensemblRelease}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
    && pigz -d {output} -p {threads}"

# Build a Bowtie2 large index from the reference genome
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2-build --large-index {input} {params.indexPrefix} --threads {threads} \
    && rm {input}"

# Remove host reads (single-end): align to the host index and keep only unaligned reads
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -U {input.trimmed} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 4 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {input.trimmed}"

# Remove host reads (paired-end): keep pairs where both mates are unaligned, then split the interleaved FASTQ back into mate files
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && bowtie2 -x {params.indexPrefix} -1 {input.f} -2 {input.r} -S {removalDIR}/{wildcards.id}.sam -p {threads} \
    && samtools view -bS {removalDIR}/{wildcards.id}.sam > {removalDIR}/{wildcards.id}.bam \
    && samtools view -b -f 12 -F 256 {removalDIR}/{wildcards.id}.bam > {removalDIR}/{wildcards.id}_unaligned.bam \
    && samtools sort -n {removalDIR}/{wildcards.id}_unaligned.bam -o {removalDIR}/{wildcards.id}_unaligned_sorted.bam \
    && samtools bam2fq {removalDIR}/{wildcards.id}_unaligned_sorted.bam > {removalDIR}/{wildcards.id}_unaligned.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/1$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_1.fastq \
    && cat {removalDIR}/{wildcards.id}_unaligned.fastq | grep '^@.*/2$' -A 3 --no-group-separator > {removalDIR}/{wildcards.id}_unaligned_2.fastq \
    && rm {removalDIR}/{wildcards.id}.sam {removalDIR}/{wildcards.id}_unaligned.bam {removalDIR}/{wildcards.id}_unaligned_sorted.bam {removalDIR}/{wildcards.id}_unaligned.fastq {input.f} {input.r}"

# Merge overlapping read pairs with fastp (-m); unmerged mates are written separately
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastp -i {input.f} -I {input.r} -o {mergedDIR}/{wildcards.id}_unmerged_1.fastq \
    -O {mergedDIR}/{wildcards.id}_unmerged_2.fastq -q {phredQuality} -w {threads} \
    --detect_adapter_for_pe -h {output.html} -j {output.json} \
    -m --merged_out {output.merged}"

# Assemble reads into contigs with MEGAHIT (single-end or merged reads)
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -r {input.reads} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"

# Assemble paired-end reads into contigs with MEGAHIT
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && megahit -1 {input.f} -2 {input.r} --force -o {assembledDIR}/{wildcards.id} -t {threads} -m {megahitMemory}"

# Kaiju taxonomic classification with named ranks and a Krona chart
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

# Kaiju taxonomic classification of paired-end reads, with named ranks and a Krona chart
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.f} -j {input.r} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

# Kaiju taxonomic classification of single-end input, with named ranks and a Krona chart
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju -t {input.nodes} -f {input.fmi} -i {input.reads} -o {resultDIR}/kaiju.out -z {threads} \
    && kaiju-addTaxonNames -t {input.nodes} -n {input.names} -r superkingdom,phylum,class,order,family,genus,species -i {resultDIR}/kaiju.out -o {output.ranks} \
    && kaiju2krona -t {input.nodes} -n {input.names} -i {resultDIR}/kaiju.out -o {resultDIR}/kaiju_krona \
    && ktImportText -o {output.html} {resultDIR}/kaiju_krona \
    && rm {resultDIR}/kaiju.out {resultDIR}/kaiju_krona"

# Collapse identical sequences with fastx_collapser
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"

# Collapse identical sequences with fastx_collapser
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && fastx_collapser -i {input} -o {output}"

# Download the NCBI nr protein database and decompress it
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {NRDIR} \
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
    && pigz -d {output} -p {threads}"

# Build a DIAMOND database from the nr FASTA
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && diamond makedb --in {input} -d {output} --threads {threads}"

# Download NCBI taxonomy files and convert nr into a Kaiju-compatible FASTA
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
    && tar -xf {taxonomicDIR}/db/taxdump.tar.gz nodes.dmp names.dmp \
    && mv nodes.dmp {taxonomicDIR}/db && mv names.dmp {taxonomicDIR}/db \
    && rm {taxonomicDIR}/db/taxdump.tar.gz \
    && wget -P {taxonomicDIR}/db ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz \
    && pigz -d {taxonomicDIR}/db/prot.accession2taxid.gz -p {threads} \
    && kaiju-convertNR -t {output.nodes} -g {taxonomicDIR}/db/prot.accession2taxid -e $(conda info --base)/envs/medusaPipeline/bin/kaiju-excluded-accessions.txt -a -o {output.convertedNR} -i {input} \
    && rm {taxonomicDIR}/db/prot.accession2taxid {input}"

# Build the Burrows-Wheeler transform of the converted nr for the Kaiju index
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkbwt -n {threads} -a ACDEFGHIKLMNPQRSTVWY -o {taxonomicDIR}/db/kaijuNR {input} \
    && rm {input}"

# Build the Kaiju FM-index and remove the intermediate BWT files
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && kaiju-mkfmi {taxonomicDIR}/db/kaijuNR \
    && rm {input.bwt} {input.sa}"

# DIAMOND blastx against nr, keeping hits within the top 3 and the unaligned sequences
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 --un {output.unaligned} --threads {threads}"

# Frameshift-aware DIAMOND blastx (-F 15 --range-culling), suited to long sequences such as contigs
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && touch {output.unaligned} \
    && diamond blastx -d {input.index} -q {input.reads} -o {output.matches} --top 3 -F 15 --range-culling --un {output.unaligned} --threads {threads}"

# Download the UniProt ID mapping table and decompress it
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && wget -P {functionalDIR} \
    ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz \
    && pigz -d {output} -p {threads}"

# Extract GenBank (EMBL-CDS) to GO and RefSeq to GO mappings, merge them into the
# NR2GO dictionary with createDictionary.R, and index the dictionary with annotate
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && awk -F \"\t\" '{{if(($7!=\"\") && ($18!=\"\")){{print $18\"\t\"$7}}}}' {input} > {functionalDIR}/genbank2GO.txt \
    && awk -F \"\t\" '{{if(($4!=\"\") && ($7!=\"\")){{print $4\"\t\"$7}}}}' {input} > {functionalDIR}/refseq2GO.txt \
    && Rscript createDictionary.R {functionalDIR}/NR2GO.txt {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt {threads} \
    && annotate createdb {functionalDIR}/NR2GO.txt NR2GO 0 1 -d {functionalDIR}/db \
    && rm {functionalDIR}/genbank2GO.txt {functionalDIR}/refseq2GO.txt"

# Map DIAMOND matches to GO terms with the NR2GO database
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matches} {output.GO} NR2GO -l 1 -d {functionalDIR}/db"

# Map contig matches to GO terms with the NR2GO database
shell:
    "set +eu \
    && . $(conda info --base)/etc/profile.d/conda.sh && conda activate medusaPipeline \
    && annotate idmapping {input.matchesContigs} {output.contigsGO} NR2GO -l 1 -d {functionalDIR}/db"