A bioBakery workflow pipeline for whole metagenome shotgun (wmgx) data processing, implemented with Snakemake rather than AnADAMA2. Tools supported so far are KneadData, MetaPhlAn, and HUMAnN2.
https://bitbucket.org/biobakery/biobakery_workflows/wiki/Home
How to Run
The workflow is set up to run on the example data in the example-data directory. To process real data, the inputs need to be changed, as sketched below.
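For instance, the simplest adaptation is to stage real paired-end FASTQs where the example data lives. A minimal sketch, assuming the Snakefile discovers samples in example-data/ by an _R1/_R2 naming scheme (inferred from the KneadData rule's input.R1/input.R2 below; paths are placeholders):

```bash
# Assumed layout: paired-end samples in example-data/ named *_R1.fastq / *_R2.fastq
cp /path/to/real/sampleA_R1.fastq example-data/
cp /path/to/real/sampleA_R2.fastq example-data/
```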
To run on an HPC cluster:
$ sh run_biobakery_workflow.sh
To run locally (for testing purposes; warning: computationally heavy):
$ /shares/hii/sw/snakemake/latest/bin/snakemake
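Before a full run, standard Snakemake flags help verify the setup; for example, a dry run prints the planned jobs without executing anything:

```bash
# dry run: show the planned jobs without running them
/shares/hii/sw/snakemake/latest/bin/snakemake --dry-run

# local run capped at 4 cores (adjust to your machine)
/shares/hii/sw/snakemake/latest/bin/snakemake --cores 4
```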
The Singularity image can be found at:
/shares/hii/images/morenoj/biobakery
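The rules invoke each tool through a {SINGULARITY} prefix; outside the workflow, the same image can be used with the standard Singularity CLI (the kneaddata call here is just an arbitrary check):

```bash
# run any bundled tool inside the container
singularity exec /shares/hii/images/morenoj/biobakery kneaddata --version
```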
Code Snippets
biobakery-scripts/count_features.py — counts the features present in each sample of a merged table:

```python
import sys
import os
import argparse

# This script counts all features for each sample. It can ignore stratification
# by species and also "un"-features if these options are set by the user.

def parse_arguments(args):
    """ Parse the arguments from the user """
    parser = argparse.ArgumentParser(
        description="Count features for each species\n",
        formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument(
        "-i", "--input",
        help="the feature table\n[REQUIRED]",
        metavar="<input.tsv>",
        required=True)
    parser.add_argument(
        "-o", "--output",
        help="file to write counts table\n[REQUIRED]",
        metavar="<output>",
        required=True)
    parser.add_argument(
        "--reduce-sample-name",
        help="remove the extra strings from the sample names",
        action='store_true')
    parser.add_argument(
        "--ignore-un-features",
        help="do not count UNMAPPED, UNGROUPED, or UNINTEGRATED features",
        action='store_true')
    parser.add_argument(
        "--ignore-stratification",
        help="only count the main feature not the stratified instances",
        action='store_true')
    parser.add_argument(
        "--include",
        help="only count features with this string included")
    parser.add_argument(
        "--filter",
        help="do not count features with this string included")

    return parser.parse_args()

def main():
    args = parse_arguments(sys)

    data = []
    samples = []
    with open(args.input) as file_handle:
        # remove extra strings (e.g. RPK suffixes) from the sample names if requested
        if args.reduce_sample_name:
            samples = [sample.replace("_Abundance-RPKs", "")
                             .replace("_genefamilies_Abundance", "")
                             .replace("_Abundance", "")
                             .replace("_taxonomic_profile", "")
                       for sample in file_handle.readline().rstrip().split("\t")[1:]]
        else:
            samples = file_handle.readline().rstrip().split("\t")[1:]

        for line in file_handle:
            # skip species-stratified rows if requested
            if "|" in line and args.ignore_stratification:
                store = False
            else:
                store = True

            # parentheses added: "and" binds tighter than "or", so the original
            # dropped UNMAPPED/UNGROUPED rows even when the flag was not set
            if ("UNMAPPED" in line or "UNGROUPED" in line or "UNINTEGRATED" in line) and args.ignore_un_features:
                store = False

            if args.include and args.include not in line:
                store = False

            if args.filter and args.filter in line:
                store = False

            if store:
                data.append(line.rstrip().split("\t")[1:])

    try:
        file_handle = open(args.output, "w")
    except EnvironmentError:
        sys.exit("Error: Unable to open output file: " + args.output)

    # write out the header
    file_handle.write("\t".join(["# samples", "total features"]) + "\n")

    # count the features with nonzero abundance for each sample
    for i, sample in enumerate(samples):
        features = 0
        for row in data:
            if float(row[i]) > 0:
                features += 1
        file_handle.write(sample + "\t" + str(features) + "\n")

    file_handle.close()

if __name__ == "__main__":
    main()
```
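The script is invoked in the HUMAnN2 post-processing rule further down; a typical call, and the shape of its output (the counts shown are illustrative, not real results):

```bash
python biobakery-scripts/count_features.py \
    -i output/humann2_analysis/geneFamilies_Norm/geneFam_merged_tables.tsv \
    --reduce-sample-name --ignore-un-features --ignore-stratification \
    -o vis_inputs/humann2/merged/geneFamilies_counts.tsv

# geneFamilies_counts.tsv then looks like (values illustrative):
# # samples       total features
# sampleA         1523
# sampleB         1489
```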
biobakery-scripts/get_counts_from_humann2_logs.py — tabulates read and species counts from the HUMAnN2 logs:

```python
import sys
import os
import argparse

TOTAL_COUNT_TAG = "reads; of these:"
NUCLEOTIDE_COUNT_TAG = "Unaligned reads after nucleotide alignment:"
TRANSLATED_COUNT_TAG = "Unaligned reads after translated alignment:"
SPECIES_COUNT_TAG = "Total species selected from prescreen:"

def parse_arguments(args):
    """ Parse the arguments from the user """
    parser = argparse.ArgumentParser(
        description="Reads the HUMAnN2 logs and prints a table of read counts\n",
        formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument(
        "-i", "--input",
        help="the folder of log files\n[REQUIRED]",
        metavar="<input_folder>",
        required=True)
    parser.add_argument(
        "-o", "--output",
        help="file to write counts table\n[REQUIRED]",
        metavar="<output>",
        required=True)

    return parser.parse_args()

def main():
    # parse arguments from the user
    args = parse_arguments(sys)

    # search subfolders to find the log files
    log_files = []
    for path, directories, files in os.walk(args.input):
        for file in filter(lambda x: x.endswith(".log"), files):
            log_files.append(os.path.join(path, file))

    try:
        file_handle = open(args.output, "w")
    except EnvironmentError:
        sys.exit("Error: Unable to open output file: " + args.output)

    # write out the header
    file_handle.write("\t".join(["# samples", "total reads", "total nucleotide aligned",
                                 "total translated aligned", "total species"]) + "\n")

    for file in log_files:
        sample = os.path.basename(file).split(".log")[0]
        data = [sample, "NA", "NA", "NA", "NA"]
        for line in open(file):
            if TOTAL_COUNT_TAG in line:
                # the total read count sits at a fixed token position in the log line
                data[1] = int(line.split()[7])
            elif NUCLEOTIDE_COUNT_TAG in line:
                # the log reports the percent unaligned; convert to reads aligned
                try:
                    data[2] = int(data[1] * ((100 - float(line.split()[-2])) / 100.0))
                except (ValueError, TypeError):
                    print("Warning: Unable to compute nucleotide reads aligned from log line: " + line)
            elif TRANSLATED_COUNT_TAG in line:
                try:
                    data[3] = int(data[1] * ((100 - float(line.split()[-2])) / 100.0))
                except (ValueError, TypeError):
                    print("Warning: Unable to compute translated reads aligned from log line: " + line)
            elif SPECIES_COUNT_TAG in line:
                data[4] = line.split()[-1]
        file_handle.write("\t".join([str(i) for i in data]) + "\n")

    file_handle.close()
    print("Read table written to file: " + args.output)

if __name__ == "__main__":
    main()
```
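This script is called at the end of the HUMAnN2 post-processing rule below:

```bash
python biobakery-scripts/get_counts_from_humann2_logs.py \
    --input output/humann2_analysis/logs \
    --output vis_inputs/humann2/counts/humann2_read_and_species_count_table.tsv
```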
KneadData rule — quality control and host read removal:

```
shell:
    """
    {SINGULARITY} kneaddata \
        --input {input.R1} \
        --input {input.R2} \
        --reference-db {BOWTIE_GRCH38_BUILD} \
        --bowtie2 {BOWTIE_DIR} \
        --trimmomatic {TRIMMOMATIC_DIR} \
        --output {params.directory}

    # concatenate the cleaned paired reads into the single FASTQ later rules consume
    cat {params.directory}/*_paired_*.fastq >> {output}
    """
```
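With the wildcards expanded for a hypothetical sample, the rule amounts to the following (sample names and database paths are placeholders, not the workflow's actual values):

```bash
kneaddata --input sampleA_R1.fastq --input sampleA_R2.fastq \
    --reference-db /path/to/GRCh38_bowtie2_index \
    --output output/kneaddata_output/sampleA
# merge the surviving paired reads into one FASTQ
cat output/kneaddata_output/sampleA/*_paired_*.fastq >> sampleA_kneaddata.fastq
```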
KneadData read count table rule:

```
shell:
    """
    # collect the kneaddata logs into a single folder
    find output/kneaddata_output/ -name "*.log" | while read i; do \
        mv $i output/kneaddata_output/logs/; done

    {SINGULARITY} kneaddata_read_count_table \
        --input {KNEADDATA_LOGS} \
        --output {output}
    """
```
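Equivalently, outside Snakemake (assuming KNEADDATA_LOGS resolves to the logs folder populated above; the output file name is a placeholder):

```bash
kneaddata_read_count_table --input output/kneaddata_output/logs \
    --output output/kneaddata_read_count_table.tsv
```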
MetaPhlAn rule — taxonomic profiling of each cleaned sample:

```
shell:
    """
    {SINGULARITY} python {METAPHLAN_PY} {input} \
        --input_type fastq \
        --bowtie2_exe {BOWTIE2_EXEC} \
        --bowtie2_build {BOWTIE2_BUILD} \
        --samout output/metaphlan_analysis/$(basename {input}).sam \
        --output {output}
    """
```
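Expanded for one cleaned sample — a sketch, assuming METAPHLAN_PY points at metaphlan2.py and with the bowtie2 executables as placeholder paths:

```bash
python metaphlan2.py sampleA_kneaddata.fastq \
    --input_type fastq \
    --bowtie2_exe /path/to/bowtie2 \
    --bowtie2_build /path/to/bowtie2-build \
    --samout output/metaphlan_analysis/sampleA_kneaddata.fastq.sam \
    --output sampleA_taxonomic_profile.tsv
```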
MetaPhlAn merge rule:

```
shell:
    """
    {SINGULARITY} python {METAPHLAN_DIR}/utils/merge_metaphlan_tables.py {input} \
        > {output}
    """
```
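Expanded, the merge gathers every per-sample profile into a single table (file names are placeholders):

```bash
python merge_metaphlan_tables.py \
    sampleA_taxonomic_profile.tsv sampleB_taxonomic_profile.tsv \
    > metaphlan_merged_profiles.tsv
```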
HUMAnN2 rule — functional profiling:

```
shell:
    """
    {SINGULARITY} humann2 -i {input.fastq} \
        --bowtie2 {BOWTIE_DIR} \
        --metaphlan {METAPHLAN_DIR} \
        --taxonomic-profile {input.profile} \
        --nucleotide-database {NUCLEO_DB} \
        --protein-database {PROTEIN_DB} \
        --threads 4 \
        -o {params}
    """
```
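Expanded for one sample (the nucleotide and protein databases are HUMAnN2's ChocoPhlAn and UniRef databases; all paths here are placeholders):

```bash
humann2 -i sampleA_kneaddata.fastq \
    --taxonomic-profile sampleA_taxonomic_profile.tsv \
    --nucleotide-database /path/to/chocophlan \
    --protein-database /path/to/uniref \
    --threads 4 \
    -o output/humann2_analysis/sampleA
```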
HUMAnN2 post-processing rule — normalization, regrouping, merging, and count tables:

```
shell:
    """
    # renormalize pathway abundances to relative abundance, then merge across samples
    find output/humann2_analysis/ -name "*_pathabundance.tsv" | while read i; do \
        {SINGULARITY} humann2_renorm_table -i $i \
            --units relab \
            --output {params}/$(basename $i); \
    done

    {SINGULARITY} humann2_join_tables \
        --input output/humann2_analysis/path_Abundance_Norm \
        --output output/humann2_analysis/path_Abundance_Norm/pathAbun_merged_tables.tsv

    # regroup gene families to ECs using the utility mapping database
    find output/humann2_analysis/ -name "*_genefamilies.tsv" | while read i; do \
        {SINGULARITY} humann2_regroup_table -i $i \
            -c {UTILITY_MAPPING} \
            --output output/humann2_analysis/geneFamilies_Grouped/$(basename $i)_ecs.tsv; \
    done

    # renormalize gene family tables to relative abundance
    find output/humann2_analysis/ -name "*_genefamilies.tsv" | while read i; do \
        {SINGULARITY} humann2_renorm_table -i $i \
            --units relab \
            --output output/humann2_analysis/geneFamilies_Norm/$(basename $i)_genefamilies_norm.tsv; \
    done

    # renormalize the regrouped EC tables
    find output/humann2_analysis/ -name "*ecs.tsv" | while read i; do \
        {SINGULARITY} humann2_renorm_table -i $i \
            --units relab \
            --output output/humann2_analysis/geneFamilies_Grouped_Norm/$(basename $i)_ecs_norm.tsv; \
    done

    # merge the normalized EC and gene family tables across samples
    {SINGULARITY} humann2_join_tables \
        --input output/humann2_analysis/geneFamilies_Grouped_Norm \
        --output output/humann2_analysis/geneFamilies_Grouped_Norm/EC_merged.tsv

    {SINGULARITY} humann2_join_tables \
        --input output/humann2_analysis/geneFamilies_Norm \
        --output output/humann2_analysis/geneFamilies_Norm/geneFam_merged_tables.tsv

    # count features per sample for each merged table
    {SINGULARITY} python biobakery-scripts/count_features.py \
        -i output/humann2_analysis/geneFamilies_Norm/geneFam_merged_tables.tsv \
        --reduce-sample-name --ignore-un-features --ignore-stratification \
        -o vis_inputs/humann2/merged/geneFamilies_counts.tsv

    {SINGULARITY} python biobakery-scripts/count_features.py \
        -i output/humann2_analysis/geneFamilies_Grouped_Norm/EC_merged.tsv \
        --reduce-sample-name --ignore-un-features --ignore-stratification \
        -o vis_inputs/humann2/merged/EC_counts.tsv

    {SINGULARITY} python biobakery-scripts/count_features.py \
        -i output/humann2_analysis/path_Abundance_Norm/pathAbun_merged_tables.tsv \
        --reduce-sample-name --ignore-un-features --ignore-stratification \
        -o vis_inputs/humann2/merged/pathabundance_relab.tsv

    # join the per-table counts into one feature count table
    {SINGULARITY} humann2_join_tables -i vis_inputs/humann2/merged \
        --file_name _counts.tsv \
        -o vis_inputs/humann2/counts/humann2_feature_counts.tsv

    # collect pathway coverage tables and logs
    find output/humann2_analysis/ -name "*_pathcoverage.tsv" | while read i; do \
        cp $i output/humann2_analysis/path_Coverage/; done

    touch vis_inputs/anadama.log

    find output/humann2_analysis/ -name "*.log" | while read i; do \
        cp $i output/humann2_analysis/logs/; done

    # tabulate read and species counts from the HUMAnN2 logs
    {SINGULARITY} python biobakery-scripts/get_counts_from_humann2_logs.py \
        --input output/humann2_analysis/logs \
        --output vis_inputs/humann2/counts/humann2_read_and_species_count_table.tsv
    """
```
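The net effect of this rule is a set of merged, normalized tables plus the count summaries used for visualization:

```bash
# key outputs (paths as written in the commands above)
ls output/humann2_analysis/path_Abundance_Norm/pathAbun_merged_tables.tsv \
   output/humann2_analysis/geneFamilies_Norm/geneFam_merged_tables.tsv \
   output/humann2_analysis/geneFamilies_Grouped_Norm/EC_merged.tsv \
   vis_inputs/humann2/merged/geneFamilies_counts.tsv \
   vis_inputs/humann2/merged/EC_counts.tsv \
   vis_inputs/humann2/merged/pathabundance_relab.tsv \
   vis_inputs/humann2/counts/humann2_feature_counts.tsv \
   vis_inputs/humann2/counts/humann2_read_and_species_count_table.tsv
```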
Metadata
Maintainers: public
URL: https://github.com/j2moreno/biobakery
Name: biobakery
Version: 1
Copyright: Public Domain
License: GNU General Public License v3.0