Reproducible Scalable Pipeline For Amplicon-based Metagenomics (RSP4ABM) is a bioinformatics pipeline designed for convenient, reproducible and scalable amplicon-based metagenomics.
Overview
This is a comprehensive pipeline for amplicon-based metagenomics that integrates the best functions of many tools into a Snakemake workflow. It enables performant and reproducible processing of 16S rRNA or ITS Illumina paired-end reads. The whole process is covered, from local .fastq files or SRA-deposited runs to the generation of basic visualization plots, including quality-control plots of intermediate steps.
The Snakemake pipeline can be executed by cloning this repository and relying on conda environments (Method 1), Singularity (Method 2) or Docker (Method 3).
################### TO BE UPDATED #######################
Method 1 - Snakemake with conda environments
Allows flexibility, with the possibility to easily modify and personalize the pipeline. However, there is a risk of errors or inconsistent results due to changes in package versions. Furthermore, simulate_PCR must be installed independently for the in silico validation, since it is not available through conda.
Requirements:
Computer
A Linux machine is best (it should also work on macOS, but this has not been tested). At least 16 GB of RAM are needed, and more may be required with larger datasets and depending on the classifier used (RDP requires more RAM than DECIPHER).
Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM
Cloned pipeline
git clone https://github.com/metagenlab/microbiome16S_pipeline.git
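The execution command shown under "Use" below refers to the cloned repository as ${pipeline_folder}. A minimal way to define this variable right after cloning (assuming you cloned into the current directory; the variable name is only an example) is:
# Hypothetical variable holding the path of the cloned repository,
# reused as ${pipeline_folder} in the execution command below.
pipeline_folder="$(pwd)/microbiome16S_pipeline"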
Miniconda3
Installed following the developers' recommendations, with the relevant channels added by running the following commands in a terminal:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set restore_free_channel true
Tested with version 4.6.14
Snakemake
Installed in a dedicated snakemake environment with:
conda create -n snakemake snakemake=5.6.0
Tested with version 5.6.0
Use
snakemake --snakefile ${pipeline_folder}/Snakefile --use-conda --conda-prefix ${conda_path} --cores {threads_number} --configfile {config_file_name} --resources max_copy={max_number_of_files_copied_at_the_same_time}
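For illustration, a complete invocation with the placeholders filled in could look as follows (a sketch only: the config file name, conda prefix, core count and copy limit are hypothetical values, not defaults shipped with the pipeline):
# Activate the dedicated environment created above, then run the pipeline.
conda activate snakemake
snakemake \
    --snakefile ${pipeline_folder}/Snakefile \
    --use-conda \
    --conda-prefix $HOME/snakemake_conda_envs \
    --cores 4 \
    --configfile config.yaml \
    --resources max_copy=4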
Method 2 - With Singularity
Relatively easy to set up and easier to work with than Docker, thanks to simpler user-rights management. We take advantage of Singularity's ability to run the Docker container prepared for this pipeline. Containerization ensures software stability, since all dependencies are contained within the container.
Requirements:
Computer
As for Method 1, but only suited to Linux (an alpha version of Singularity exists for macOS).
Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM
Singularity
Singularity is a system enabling the use of Singularity or Docker containers. It should be installed as indicated here.
Tested with version 3.0.1
Commands:
## Run the container interactively
singularity shell docker://metagenlab/amplicon_pipeline:v.0.9.13
## Run the pipeline from within the container.
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time} mem_mb={available_memory}
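Alternatively, the same command can be run non-interactively with singularity exec (a sketch assuming the same image tag as above; the core count, memory, and config file name are example values to adapt to your setup):
singularity exec docker://metagenlab/amplicon_pipeline:v.0.9.13 \
    snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile \
    --use-conda --conda-prefix /opt/conda/ \
    --cores 4 --configfile config.yaml \
    --resources max_copy=4 mem_mb=32000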
Method 3 - With Docker
Computer
Works on Windows, macOS and Linux. Tested on Linux, Windows 10 and macOS.
User settings:
Our Docker image is set up for a user called "pipeline_user" with UID 1080. It is advised to create this user on your computer before using the Docker image to run your analysis:
sudo useradd -G docker,sudo -u 1080 pipeline_user
sudo mkdir /home/pipeline_user/
sudo chown pipeline_user -R /home/pipeline_user/
sudo passwd pipeline_user
Alternatively, you can run the Docker container as root (--user root), but the created folders will then belong to your computer's root user.
Docker
Install the CE version following these instructions for Ubuntu. Also make sure you have created the docker group and that you can run Docker without sudo, following these instructions. If you cannot access the internet from inside a Docker container, apply these changes.
Use
Connected as pipeline_user:
docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13
and then
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}
or directly
docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13 \
sh -c 'snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}'
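For instance, a one-shot run from the current working directory could look like this (the core count, copy limit and config file name are hypothetical examples to adapt to your analysis):
docker run -it --rm \
    --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind \
    metagenlab/amplicon_pipeline:v.0.9.13 \
    sh -c 'snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile \
        --use-conda --conda-prefix /opt/conda/ \
        --cores 4 --configfile config.yaml \
        --resources max_copy=4'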
References
-
Snakemake
Köster, J., & Rahmann, S. (2012). Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
-
FASTQC
Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
-
MultiQC
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
-
DADA2
Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. https://doi.org/10.1038/nmeth.3869
-
VSEARCH
Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. https://doi.org/10.7717/peerj.2584
-
Qiime
Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., … Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336. https://doi.org/10.1038/nmeth.f.303
-
Qiime2
Bolyen, E., Dillon, M., Bokulich, N., Abnet, C., Al-Ghalith, G., Alexander, H., … Caporaso, G. (2018). QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295
-
RDP
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07
-
IDTAXA in Decipher
Murali, A., Bhargava, A., & Wright, E. S. (2018). IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome. https://doi.org/10.1186/s40168-018-0521-5
-
EzBioCloud
Yoon, S.-H., Ha, S.-M., Kwon, S., Lim, J., Kim, Y., Seo, H., & Chun, J. (2017). Introducing EzBioCloud: a taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. International Journal of Systematic and Evolutionary Microbiology, 67(5), 1613–1617. https://doi.org/10.1099/ijsem.0.001755
-
Silva
Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., … Glöckner, F. O. (2013). The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research. https://doi.org/10.1093/nar/gks1219
-
phyloseq
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4), e61217. https://doi.org/10.1371/journal.pone.0061217
-
Krona
Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12(1), 385. https://doi.org/10.1186/1471-2105-12-385
-
ALDex2
Fernandes, A. D., Reid, J. N. S., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13. https://doi.org/10.1186/2049-2618-2-15
-
Vegan
Oksanen, J., Kindt, R., Legendre, P., O’Hara, B., Simpson, G. L., Solymos, P. M., … & Wagner, H. (2008). The vegan package. Community Ecology Package, (May 2014), 190. Retrieved from https://bcrc.bio.umass.edu/biometry/images/8/85/Vegan.pdf
-
metagenomeSeq
Joseph, A., Paulson, N., Olson, N. D., Wagner, J., Talukder, H., & Corrada, H. (2019). Package ‘metagenomeSeq’.
-
edgeR
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
-
Simulate_PCR
Gardner, S. N., & Slezak, T. (2014). Simulate_PCR for amplicon prediction and annotation from multiplex, degenerate primers and probes. BMC Bioinformatics, 15(1), 2–7. https://doi.org/10.1186/1471-2105-15-237
Code Snippets
shell:
    """
    if gzip -t {input[0]}
    then
        cp {input[0]} {output[0]}
    else
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi
    if gzip -t {input[1]}
    then
        cp {input[1]} {output[1]}
    else
        cp {input[1]} $(dirname {output[1]})
        gzip -qc $(dirname "{output[1]}")/$(basename "{input[1]}") > {output[1]}
        rm $(dirname "{output[1]}")/$(basename "{input[1]}")
    fi
    """
shell:
    """
    if gzip -t {input[0]}
    then
        cp {input[0]} {output[0]}
    else
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi
    """
shell:
    """
    #cache_dir=$(vdb-config --cfg -o n | grep "/repository/user/main/public/root" | cut -f2 -d'=' | sed "s/\\"//g")
    vdb-config --restore-defaults
    prefetch -v {wildcards.sample} --max-size 100000000 &> {log}
    """
shell:
    """
    fastq-dump --split-3 --gzip --log-level 1 --disable-multithreading --minReadLen 0 --outdir $(dirname {output[0]}) {input[0]} &> {log}
    """
shell:
    """
    fastq-dump --gzip --outdir $(dirname {output[0]}) --log-level 0 --disable-multithreading --minReadLen 0 {wildcards.sample} &> {log}
    """
shell:
    """
    fastqc -t {threads} {input} -o $(dirname {output[0]}) --nogroup 2> {log[0]}
    """
shell:
    """
    multiqc --interactive -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | from shutil import copyfile import sys import datetime import os import getpass ## Define log directory path if "logging_folder" not in config.keys(): logging_folder = "logs/" else: logging_folder=config["logging_folder"] if not logging_folder.endswith("/"): logging_folder = logging_folder + "/" ## Generate a day_hour folder to store execution specific parameters (command, config fie, sample sheet) today = datetime.datetime.now() date = today.strftime("%Y_%m_%d") time = today.strftime('%H_%M_%S_%f')[:-4] time_tag = str(date + "-" + time) timed_logging_folder = logging_folder + time_tag + "/" ## Except for dryrun execution, create a log folder. This will be filled by execution parameters and be provided to all rules to record their standard ouput or specific log files. if "--dryrun" not in sys.argv and "-n" not in sys.argv: os.makedirs(logging_folder, exist_ok=True) os.makedirs(timed_logging_folder, exist_ok=True) ## Copy the command line, the config file and the sample sheet in a subdirectory of the log directory ### Log command line cmd_file = timed_logging_folder + "cmd.txt" with open(cmd_file, "w") as f: f.write(" ".join(sys.argv)+"\n") ### Log config file if workflow.overwrite_configfiles[0] is not None: copyfile(workflow.overwrite_configfiles[0], timed_logging_folder+"config.yaml") ### Log local or SRA sample sheet if "local_samples" in config.keys(): copyfile(config["local_samples"], timed_logging_folder+"local_samples.tsv") if "sra_samples" in config.keys(): copyfile(config["sra_samples"], timed_logging_folder+"sra_samples.tsv") ### Log git hash of the pipeline git_log_path = timed_logging_folder + "git.txt" with open(git_log_path, "w") as devnull: subprocess.run(args=["git -C " + workflow.basedir + " rev-parse HEAD"], shell=True, stdout=devnull) ### Log user ID user_path = timed_logging_folder + "user.txt" user_cred = getpass.getuser() with open(user_path, "w") as f: f.write(user_cred +"\n") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 | import pandas import re import itertools ## Define a function to extract the .fastq name pattern def get_read_naming_patterns(directory): result = [] extension= {} for fname in sorted(os.listdir(directory)): if fname.endswith("fastq.gz") or fname.endswith("fq.gz") or fname.endswith("fastq") or fname.endswith("fq"): regex_str = '(_L0+[1-9]+)?_(R)?(1|2)(\.|_)' #regex for finding R1 and R2, if L001 is present before, it is also included regex = re.compile(regex_str) ext = re.search(regex, fname) if ext is None: ext = re.search(r'f(?:ast)?q(?:\.gz)?', fname) samp = re.sub("\.$", "", re.search(r'^([^\.]*)\.*', fname).group(0)) if samp in extension.keys(): if ext.group(0).endswith(".gz"): extension[samp] = [ext.group(0)] else: extension[samp] = [ext.group(0)] else: regex_after = re.compile(regex_str+".*") regex_before = re.compile(".*"+regex_str) read = re.compile(re.search(regex_after, fname).group(0)) samp = re.sub(regex, '', re.search(regex_before, fname).group(0)) extension.setdefault(samp, []) extension[samp].append(re.sub("^_", "", read.pattern)) return(extension) ## Check for the presence of a directory for .fastq in config. If none, will be "links". if "link_directory" in config.keys(): link_directory = config["link_directory"] if not link_directory.endswith("/"): link_directory = link_directory + "/" else: link_directory = "links/" ## Check that "tax_DB_path" ends with "/". Otherwise, correct. if not config["tax_DB_path"].endswith("/"): config["tax_DB_path"] = config["tax_DB_path"] + "/" ## Check for the existence/accessbility of tax_DB_path if os.path.isdir(config['tax_DB_path']) is False: raise IOError("Cannot access to '{}' variable.".format('tax_DB_path')) ## Check for the presence of a metadata table in in config, either for local reads ("local_samples") or distant reads ("sra_samples") if "local_samples" not in config.keys() and "sra_samples" not in config.keys(): raise ValueError("No samples defined in the config file. local_samples or sra_samples must be defined in config") ## Check the local_samples/sra_samples tables for the presence of following variables in config: 'sample_label', 'grouping_column', 'run_column'. 
### read sample dataframe from config file if "local_samples" in config.keys(): metadata = config["local_samples"] elif "sra_samples" in config.keys(): metadata = config["sra_samples"] df = pandas.read_csv(metadata, sep="\t", index_col=0) df_colnames = set(df.columns) ### Get values for which presence should be tested in dataframe to_check = ['sample_label', 'grouping_column', 'run_column'] # a list of config values describing columns that should be in the dataframe to_check_dict = (dict((k, config[k]) for k in to_check)) ### Set a function to compare list def list_diff(list1, list2): return (list(set(list1) - set(list2))) ### Loop over the values of config variable to check for key in to_check_dict.keys(): ## Extract values from dictionnary to_check_values = to_check_dict[key] ## If it is a single value, test that it is in the columns of the metadata if isinstance(to_check_values, str): ## if the column name list is not in the column names of df dataframe then raise an error if to_check_values not in df_colnames: message = f"{' and '.join(to_check_values)} not available in the sample TSV file." raise IOError(message) ## If it is a list, check that it contained in metadata elif isinstance(to_check_values, list): diff = [] diff = list_diff(to_check_values, df_colnames) ## if the set of column names in to_check list is not in the column names of df dataframe then raise an error if diff: for item in diff: """if the set of column names in to_check list is not a subset of column names of df dataframe then raise an error""" message = f"{' and '.join(diff)} not available in the sample TSV file." raise IOError(message) else: message = f"{key} should be a string or a list of string" raise IOError(message) ## Generate a list of input files, either local (when "local_samples" is in config, or SRA-hosted (when "sra_samples" is in config)) ### Create the final pandas dataframe that will be completed all_samples=pandas.DataFrame() ### Create a set of dictionaries to store sample characteristics sras_ext = {} reads_sra = {} reads_local = {} original_names = {} reads_ext = {} paths = {} layout = {} ### In case of local samples, work our way through the local_samples table to extract read paths (if indicated in the R1/R2 columns) or extract match it with .fastq found in the "links" directory. if "local_samples" in config.keys(): ## Read the metadata table local_data = pandas.read_csv(config["local_samples"], sep="\t", index_col=0) local_data.index = [str(x) for x in local_data.index] all_local_sample_names = "".join(list(local_data.index)) ## Check for forbidden characters in sample names if "(" in all_local_sample_names or ")" in all_local_sample_names or "_-_" in all_local_sample_names: raise ValueError("Forbidden character in sample name in sample name file") ## Extract path if indicated in "R1" column if "R1" in list(local_data): for sample, sample_data in local_data.iterrows(): if sample in paths: ## Check for sample names to be unique raise IOError("Identical sample name used multiple times: %s" % sample_name) paths[sample] =[sample_data.loc["R1"]] reads_ext[sample] = "single" layout[sample] = "single" if 'R2' in local_data.columns.values: if "R1" in str(sample_data.loc["R2"]): raise IOError("ATTENTION! 
R1 flag within R2 filename: %s", sample_data.loc["R2"]) if (str(sample_data.loc["R2"]).strip()) != "nan": paths[sample].append(sample_data.loc["R2"]) reads_ext[sample] = ["R1", "R2"] layout[sample] = "paired" all_samples = local_data paths = {**paths} ## In absence of a "R1" column indicating file paths, try to match the sample names with the content of the "links" folder else: reads_local = get_read_naming_patterns(link_directory) original_names = { x : x for x in reads_local.keys() } read_correct = {} original_correct = {} ## In absence of a "OldSampleName", the leftmost column is used to match with "links" content if "OldSampleName" not in list(local_data): for sample in list(local_data.index): regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % sample) # this regex ensures that the matching of the sample names end at the end of the str, to prevent S1 matching S10 for instance match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))] if sum(match) != 1: #there must be one and only one entry matching one sample name raise ValueError("Problem matching SampleName to read file names: " + sample) read_name = str(sorted(list(original_names.keys()))[match.index(True)]) # Sample name with _SX original_correct[sample] = original_names[read_name] read_correct[sample] = reads_local[read_name] paths[sample] = expand(link_directory + read_name + "_{reads}" , reads = read_correct[sample]) ## In presence of a "LibraryLayout" column, samples can be specifid to be "single" or "paired". if "LibraryLayout" in list(local_data): if local_data.loc[sample, "LibraryLayout"].lower()=="paired": reads_ext[sample]=["R1", "R2"] layout[sample] = "paired" elif local_data.loc[sample, "LibraryLayout"].lower()=="single": reads_ext[sample]=["single"] layout[sample] = "single" else: raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined") ## Otherwise, "paired" is assumed else: reads_ext[sample]=["R1", "R2"] layout[sample] = "paired" ## In presence of a "OldSampleName", the this column is used to match with "links" content else: for Old in list(local_data["OldSampleName"]): regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % Old) match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))] if sum(match) != 1: raise ValueError("Problem matching OldSampleName to read file names : " + Old) read_name=str(sorted(list(original_names.keys()))[match.index(True)]) # Sample with have SX to unify with above sample = str(local_data.index[local_data['OldSampleName'] == Old][0]) original_correct[sample] = original_names[read_name] read_correct[sample] = reads_local[read_name] paths[sample] = expand(link_directory + read_name + "_{reads}" , reads = read_correct[sample]) ## In presence of a "LibraryLayout" column, samples can be specifid to be "single" or "paired". if "LibraryLayout" in list(local_data): if local_data.loc[sample, "LibraryLayout"].lower()=="paired": reads_ext[sample]=["R1", "R2"] layout[sample] = "paired" elif local_data.loc[sample, "LibraryLayout"].lower()=="single": reads_ext[sample]=["single"] layout[sample] = "single" else: raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined") ## Otherwise, "paired" is assumed else: reads_ext[sample]=["R1", "R2"] layout[sample] = "paired" original_names = original_correct reads_local = read_correct reads_ext = reads_ext all_samples=local_data ### In case of sra-hosted samples, work our way through the "sra_table" to extract LibraryLayout and clean sample names. 
if "sra_samples" in config.keys(): sra_data = pandas.read_csv(config["sra_samples"], sep="\t", index_col=0).drop_duplicates() all_sra_sample_names = "".join(list(sra_data.index)) if "(" in all_sra_sample_names or ")" in all_sra_sample_names or "_-_" in all_sra_sample_names: raise ValueError("Forbidden character in sample name in sra file") for sra_sample in list(sra_data.index): sample_name = str(sra_sample).replace(" ", "_").replace("&", "and").replace(":", "-") if sample_name in reads_sra.keys(): # if the sample name is already used, add _(n+1) at the end sample_name = sample_name+"_"+str(list(reads_sra.keys()).count(sample_name)) reads_sra[sample_name]=str(sra_sample) if sra_data.loc[sra_sample, "LibraryLayout"].lower()=="paired": sras_ext[sample_name]=["1.fastq.gz", "2.fastq.gz"] reads_ext[sample_name]=["R1", "R2"] elif sra_data.loc[sra_sample, "LibraryLayout"].lower()=="single": sras_ext[sample_name] = ["fastq.gz"] reads_ext[sample_name]=["single"] else: raise ValueError("Problem in the sra file, LibraryLayout badly defined") all_samples=sra_data config["local_samples"] = config["sra_samples"] # all_samples.loc[sample_name, "Replicate"]=sra_data.loc[i, "Replicate"] read_naming = {**reads_local, **sras_ext} original_names = {**original_names, **reads_sra} |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | def get_grouping_key(column_of_interest): file_list = [] ## Convert to list if provided as a string in config if isinstance(column_of_interest,str): column_of_interest=[column_of_interest] unique_col=list(set(column_of_interest)) for i in unique_col: combined_values = expand("{column_of_interest}/{column_values}", column_of_interest = i, column_values = list(set(all_samples[i]))) file_list.extend(combined_values) return(file_list) ### Define rarefaction levels (values in config + no rarefaction) def get_rarefaction_key(rarefaction_values): file_list = [] for i in set(rarefaction_values): combined_values = expand("rarefaction_{rarefaction_values}", rarefaction_values = i) file_list = file_list + combined_values file_list = file_list + ["norarefaction"] return(file_list) # Transform the collapse levels from the plotting taxonomic rank def get_taxa_collapse_level_key(collapse_level): file_list = [] for i in set(collapse_level): if i == "OTU": value = 'no_collapse' elif i == "Species" : value = 'collap_7' elif i == "Genus" : value = 'collap_6' elif i == "Family" : value = 'collap_5' elif i == "Order" : value = 'collap_4' elif i == "Class" : value = 'collap_3' elif i == "Phylum" : value = 'collap_2' elif i == "Kingdom" : value = 'collap_1' else : raise ValueError("Forbidden value in taxa level for barplots") file_list.append(value) file_list = file_list + ["no_collapse"] return(file_list) ## Set of function to generate list of output from config ############################################################## ### Light output, including QC, count table, consensus sequences and taxonomic assignement def light_output_list(): output = [] output = MultiQC output.append(light_output) if "DADA2" in config["denoiser"]: output.append("DADA2/2_denoised/DADA2_denoising_stats.tsv") return(output) ### Basic output, diagnostic plots, KRONA plots and rarefaction curves def basic_plots_list(): output = [] output = light_output_list() output.append(basic_plots) return(output) ### Complete set of phyloseq, in option including transposed count table and metadata (wide to long) def phyloseq_output_list(): output = [] output = basic_plots_list() output.append(phyloseq) if config["melted_phyloseq"] == True: output.append(phyloseq_melted) if config["transposed_tables"] == True: output.append(transposed_output) return(output) ### Qiime2 outputs, offering interactive visualization of the data def Qiime2_output_list(): output = [] output = basic_plots_list() if config["Qiime2_basic_output_visualization"] == True: output.append(Qiime2_vis_qzv) return(output) ### PICRUSt2 outputs, to predict genomic content based on taxonomic profiles def PICRUSt2_list(): output = [] output = basic_plots_list() output.append(picrust2) return(output) ### Rule all, the default rule including all default output (not PICRUSt2, since long to compute) def rule_all_list(): output = [] output.append(basic_plots_list()) output.append(Qiime2_vis_qzv) output.append(phyloseq_output_list()) return(output) |
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 | include: "make_output_fcts.py" ### Basic output #### MuliQC MultiQC = expand("QC/{RUN}_multiqc_raw_reads_report.html", RUN = set(all_samples[config["run_column"]])) MultiQC.append("QC/multiqc_raw_reads_report.html") #### Light pipeline output light_output = expand("{denoiser}/2_denoised/dna-sequences.fasta", denoiser = config["denoiser"]) light_output.append(expand("{denoiser}/2_denoised/count_table.tsv", denoiser = config["denoiser"])) light_output.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv", classifier = config["classifier"], denoiser = config["denoiser"], tax_DB = config["tax_DB_name"])) #### DADA2 stat table DADA2_stats_tables = "DADA2/2_denoised/DADA2_denoising_stats.tsv" ### Basic evaluation plots basic_plots = expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_filtered.png", classifier = config["classifier"], denoiser = config["denoiser"], tax_DB = config["tax_DB_name"]) basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_{filter_lineage}_in_{filter_tax_rank}_filtered.png", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"])) basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/nofiltering/{rarefaction_value}_rarefaction_curve_{grouping_column}.png", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), grouping_column = config["grouping_column"][0])) basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}_rarefaction_curve_{grouping_column}.png", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], grouping_column = config["grouping_column"][0])) basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/KRONA/{grouping_key}.html", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], grouping_key = get_grouping_key(config["grouping_column"]))) ### Advances Phyloseq set of output #### Not filtered phyloseq = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds", denoiser = config["denoiser"], 
classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"])) phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))) phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) #### Tax filtered phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"])) phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"])) phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) ### Potentially spurious ASVs phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/norarefaction/no_collapse/base_export/tree_treeshrink/output.txt", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) #### Melted Phyloseq set of output #### Not filtered phyloseq_melted = 
expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"])) phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))) phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) #### Tax filtered phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"])) phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"])) phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) #### Transposed output tables #### Count-table transposed format - filtered transposed_output = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = 
get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"]) #### Count-table transposed format - not filtered transposed_output.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]), rarefaction_value = get_rarefaction_key(config["rarefaction_value"]), filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], norm_value = config["normalization"], abund_value = config["min_abundance"], prev_value = config["min_prevalence"])) ### Qiime2 output #### Qiime2 interactive visualization Qiime2_vis_qzv = expand("{denoiser}/2_denoised/dna-sequences.fasta", denoiser = config["denoiser"]) Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/count-table.qzv", denoiser = config["denoiser"])) Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/rep-seqs.qzv", denoiser = config["denoiser"])) Qiime2_vis_qzv.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"])) ### Picrust2 picrust2 = expand("{denoiser}/6_picrust2/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/picrust/", denoiser = config["denoiser"], classifier = config["classifier"], tax_DB = config["tax_DB_name"], filter_tax_rank = config["filter_tax_rank"], filter_lineage = config["filter_lineage"], rarefaction_value = get_rarefaction_key(config["rarefaction_value"]) ) |
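Most of the outputs listed above are driven by keys of the config file passed with --configfile. As a rough sketch (the key names come from the snippets above, but every value shown here is a hypothetical example and not a default shipped with the pipeline), a config could be written as:
# Illustrative config only -- adapt all values to your data and databases.
cat > config.yaml <<'EOF'
local_samples: local_samples.tsv
link_directory: links/
sample_label: label
grouping_column: [group]
run_column: run
denoiser: [DADA2]
classifier: [rdp]
tax_DB_path: /path/to/tax_DB/
tax_DB_name: [ezbiocloud]
rarefaction_value: [10000]
filter_tax_rank: Kingdom
filter_lineage: Bacteria
phyloseq_tax_ranks: [OTU, Genus]
normalization: [none]
min_abundance: [0]
min_prevalence: [0]
melted_phyloseq: false
transposed_tables: false
Qiime2_basic_output_visualization: true
EOF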
script: "scripts/1_cutadapt_paired.py"
script: "scripts/1_cutadapt_single.py"
script: "scripts/1_DADA2_q_filtering_paired.R"
script: "scripts/1_DADA2_q_filtering_single.R"
script: "scripts/2a_big_data_DADA2_learn_errors_paired.R"
script: "scripts/2a_big_data_DADA2_learn_errors_single.R"
script: "scripts/2b_big_data_DADA2_infer_ASV_paired.R"
script: "scripts/2b_big_data_DADA2_infer_ASV_single.R"
script: "scripts/2c_big_data_DADA2_merge_ASV.R"
script: "scripts/2d_big_data_DADA2_merge_chimera.R"
script: "scripts/2e_big_data_DADA2_filtering_stats.R"
shell:
    """
    fastqc {input} -o $(dirname {output[0]}) 2> {log[0]}
    """
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | from Bio.Seq import Seq from Bio.Alphabet import IUPAC from snakemake import shell if snakemake.params['amplicon_type'] == "ITS": print("ITS Trimming") forward_primer_compl = Seq.reverse_complement(Seq(snakemake.params['forward_primer'], IUPAC.ambiguous_dna)) reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna)) shell("""cutadapt \ --cores {snakemake.threads} \ --error-rate 0.1 \ --times 2 \ --overlap 3 \ -o {snakemake.output[R1_trimmed_reads]} \ -p {snakemake.output[R2_trimmed_reads]} \ -g '{snakemake.params[forward_primer]}' \ -a '{reverse_primer_compl}' \ -G '{snakemake.params[reverse_primer]}' \ -A '{forward_primer_compl}' \ --match-read-wildcards \ --discard-untrimmed \ {snakemake.input[R1_raw_reads]} \ {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""") elif snakemake.params['amplicon_type'] == "16S": print("16S Trimming") shell("""cutadapt \ --cores {snakemake.threads} \ --error-rate 0.1 \ --times 1 \ --overlap 3 \ -o {snakemake.output[R1_trimmed_reads]} \ -p {snakemake.output[R2_trimmed_reads]} \ -g '{snakemake.params[forward_primer]}' \ -G '{snakemake.params[reverse_primer]}' \ --match-read-wildcards \ --discard-untrimmed \ {snakemake.input[R1_raw_reads]} \ {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | from Bio.Seq import Seq from Bio.Alphabet import IUPAC from snakemake import shell if snakemake.params['amplicon_type'] == "ITS": print("ITS Trimming") forward_primer_compl = Seq.reverse_complement(Seq(snakemake.params['forward_primer'], IUPAC.ambiguous_dna)) shell("""cutadapt \ --cores {snakemake.threads} \ --error-rate 0.1 \ --times 2 \ --overlap 3 \ -o {snakemake.output[R1_trimmed_reads]} \ -g '{snakemake.params[forward_primer]}' \ -a '{reverse_primer_compl}' \ --match-read-wildcards \ --discard-untrimmed \ {snakemake.input[R1_raw_reads]} >> {snakemake.log[0]}""") elif snakemake.params['amplicon_type'] == "16S": print("16S Trimming") reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna)) shell("""cutadapt \ --cores {snakemake.threads} \ --error-rate 0.1 \ --times 1 \ --overlap 3 \ -o {snakemake.output[R1_trimmed_reads]} \ -g '{snakemake.params[forward_primer]}' \ -a '{reverse_primer_compl}' \ --match-read-wildcards \ --discard-untrimmed \ {snakemake.input[R1_raw_reads]} >> {snakemake.log[0]}""") |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input fnFs <- snakemake@input[[1]] fnRs <- snakemake@input[[2]] ## Output q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]] q_score_filtered_R <- snakemake@output[["q_score_filtered_R"]] q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]] ## Parameters F_length <- snakemake@params[["F_reads_length_trim"]] R_length <- snakemake@params[["R_reads_length_trim"]] F_extected_error <- snakemake@params[["F_reads_extected_error"]] R_extected_error <- snakemake@params[["R_reads_extected_error"]] sample_name <- snakemake@params[["sample_name"]] ## Load needed libraries library(dada2);packageVersion("dada2") ## Filter and trim ### Reads are filtered based on the number of errors expected for the read (integration of the qscore and the length of the read). All reads with uncalled nucleotide (N) are removed too. Remaining phiX reads will be removed too. Finally, reads are cut at a length set in config. ### Filter and trim. The filtered reads are directly written while the filtering stats are being save for later compilation. filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, rev = fnRs, filt.rev = q_score_filtered_R, truncLen=c(F_length,R_length), maxN=0, maxEE=c(F_extected_error,R_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE) filtering_stats <- as.data.frame(filtering_stats) filtering_stats$Sample <- sample_name ### Save the stats for this sample in a R object saveRDS(filtering_stats, file = q_filtering_stats_path) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input fnFs <- snakemake@input[[1]] ## Output q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]] q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]] ## Parameters F_length <- snakemake@params[["F_reads_length_trim"]] F_extected_error <- snakemake@params[["F_reads_extected_error"]] sample_name <- snakemake@params[["sample_name"]] ## Load needed libraries library(dada2);packageVersion("dada2") ## Filter and trim ### Reads are filtered based on the number of errors expected for the read (integration of the qscore and the length of the read). All reads with uncalled nucleotide (N) are removed too. Remaining phiX reads will be removed too. Finally, reads are cut at a length set in config. ### Filter and trim. The filtered reads are directly written while the filtering stats are being save for later compilation. filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, truncLen=c(F_length), maxN=0, maxEE=c(F_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE) filtering_stats <- as.data.frame(filtering_stats) filtering_stats$Sample <- sample_name ### Save the stats for this sample in a R object saveRDS(filtering_stats, file = q_filtering_stats_path) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]] q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]] ## Output error_profile_F <- snakemake@output[["error_profile_F"]] error_profile_R <- snakemake@output[["error_profile_R"]] error_plot_F <- snakemake@output[["error_profile_F_plot"]] error_plot_R <- snakemake@output[["error_profile_R_plot"]] ## Load needed libraries library(dada2); packageVersion("dada2") ## Learn error rates errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1) errR <- learnErrors(q_score_filtered_R, nbases=1e8, multithread = snakemake@threads, verbose = 1) ## Write these error profiles saveRDS(object = errF, file = error_profile_F) saveRDS(object = errR, file = error_profile_R) ## Write error_plots png(error_plot_F) plotErrors(errF, nominalQ=TRUE) png(error_plot_R) plotErrors(errR, nominalQ=TRUE) dev.off() |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]] ## Output error_profile_F <- snakemake@output[["error_profile_F"]] error_plot_F <- snakemake@output[["error_profile_F_plot"]] ## Load needed libraries library(dada2); packageVersion("dada2") ## Learn error rates errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1) ## Write these error profiles saveRDS(object = errF, file = error_profile_F) ## Write error_plots png(error_plot_F) plotErrors(errF, nominalQ=TRUE) dev.off() |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]] q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]] error_profile_F <- snakemake@input[["error_profile_F"]] error_profile_R <- snakemake@input[["error_profile_R"]] ## Output infer_stats <- snakemake@output[["infer_stats"]] sample_seq_tab <- snakemake@output[["sample_seq_tab"]] ## Parameters sample_name <- snakemake@params[["sample_name"]] run <- snakemake@params[["run"]] x_column_value <- snakemake@params[["x_column_value"]] min_overlap <- snakemake@params[["min_overlap"]] print(paste("min_overlap is :", min_overlap)) ## Load needed libraries library(dada2); packageVersion("dada2") ## Create a useful function to count the number of sequences getN <- function(x) sum(getUniques(x)) ## File renaming names(q_score_filtered_F) <- sample_name names(q_score_filtered_R) <- sample_name ## Read error rates errF <- readRDS(error_profile_F) errR <- readRDS(error_profile_R) ## Prepare named vector mergers <- vector("list", 1) names(mergers) <- sample_name ## Sample inference and merger of paired-end reads cat("Processing:", sample_name, "\n") ## Forward derepF <- derepFastq(q_score_filtered_F, verbose = TRUE) ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE) ## Reverse derepR <- derepFastq(q_score_filtered_R, verbose = TRUE) ddR <- dada(derepR, err=errR, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE) ## Merge merger <- mergePairs(dadaF = ddF, derepF= derepF, dadaR = ddR, derepR = derepR, verbose = TRUE, minOverlap = min_overlap, maxMismatch = 0) mergers[[sample_name]] <- merger ## Save the dereplicated, corrected and merged sequences for this sample saveRDS(mergers, file = sample_seq_tab) ## For statistics record the number of reads ### Write the statistics in a dataframe infer <- data.frame(denoisedF = getN(ddF)) infer$denoisedR <- getN(ddR) infer$merged_pairs <- getN(merger) infer$Sample <- sample_name infer$label <- x_column_value infer$RUN <- run ### Save the sequences stats for this sample saveRDS(infer, file = infer_stats) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]] error_profile_F <- snakemake@input[["error_profile_F"]] ## Output infer_stats <- snakemake@output[["infer_stats"]] sample_seq_tab <- snakemake@output[["sample_seq_tab"]] ## Parameters sample_name <- snakemake@params[["sample_name"]] run <- snakemake@params[["run"]] x_column_value <- snakemake@params[["x_column_value"]] min_overlap <- snakemake@params[["min_overlap"]] print(paste("min_overlap is :", min_overlap)) ## Load needed libraries library(dada2); packageVersion("dada2") ## Create a useful function to count the number of sequences getN <- function(x) sum(getUniques(x)) ## File renaming names(q_score_filtered_F) <- sample_name ## Read error rates errF <- readRDS(error_profile_F) ## Prepare named vector mergers <- vector("list", 1) names(mergers) <- sample_name ## Sample inference cat("Processing:", sample_name, "\n") ## Forward derepF <- derepFastq(q_score_filtered_F, verbose = TRUE) ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE) mergers[[sample_name]] <- ddF ## Save the dereplicated, corrected sequences for this sample saveRDS(mergers, file = sample_seq_tab) ## For statistics record the number of reads ### Write the statistics in a dataframe infer <- data.frame(denoisedF = getN(ddF)) infer$denoisedR <- NA infer$merged_pairs <- infer$denoisedF infer$Sample <- sample_name infer$label <- x_column_value infer$RUN <- run ### Save the sequences stats for this sample saveRDS(infer, file = infer_stats) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Input infer_stats <- snakemake@input[["infer_stats"]] sample_seq_table <- snakemake@input[["sample_seq_tab"]] ## Output run_stats <- snakemake@output[["run_stats"]] run_seq_table <- snakemake@output[["run_seq_table"]] ## Load needed libraries library(dada2); packageVersion("dada2") ## Merge and write paired-end merge reads of the run input_M <- sapply(sample_seq_table, readRDS, simplify = TRUE, USE.NAMES = FALSE) st.all <- makeSequenceTable(input_M) saveRDS(object = st.all, file = run_seq_table) ## Merge and write forward reads stats of the run stats <- do.call("rbind", lapply(infer_stats, readRDS)) saveRDS(object = stats, file = run_stats) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") # Input seq_tab <- snakemake@input[["seq_tab"]] # Output no_chim <- snakemake@output[["no_chim"]] length_filtered <- snakemake@output[["length_filtered"]] rep_seqs <- snakemake@output[["rep_seqs"]] count_table <- snakemake@output[["count_table"]] length_histo <- snakemake@output[["length_histo"]] # Parameters merged_min_length <- snakemake@params[["merged_min_length"]] merged_max_length <- snakemake@params[["merged_max_length"]] # Load needed libraries library(dada2); packageVersion("dada2") library(ggplot2); packageVersion("ggplot2") # Merge data from multiple runs (if necessary) if (length(seq_tab) == 1){ print("Unique RUN, no merging of seq_tabl") st.all <- readRDS(seq_tab) }else{ print("Multiple RUN, merging") st.all <- do.call("mergeSequenceTables", lapply(seq_tab, readRDS)) } # Remove chimeras print("Filter chimera") seqtab <- removeBimeraDenovo(st.all, method="consensus", multithread=snakemake@threads, verbose=TRUE) print("Chimera filtered") # Sequences length inspection and filtration #### That's the little added trick, the reason why we are using this script and not the one in Qiime2. Indeed we typically are here keeping only sequences between 390 and 500 bp of length after merging. This tcorresponds to the expected length of the V3V4 region of the 16S rRNA gene. ## Inspect distribution of sequence lengths unfiltered_length_table <- tapply(colSums(seqtab), factor(nchar(getSequences(seqtab))), sum) unfiltered_length_table unfiltered_length_table_df <- data.frame(length=as.numeric(names(unfiltered_length_table)), count=unfiltered_length_table) png(length_histo) p <- ggplot(data=unfiltered_length_table_df, aes(x=length, y=count)) + geom_col(binwidth=20) print(p) dev.off() # Filter based on length seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(merged_min_length, merged_max_length)] # Inspect distribution of sequence lengths after filtration table(nchar(getSequences(seqtab2))) # Export reads and count #### We are writing in files the product of this DADA2 process. These are one .fasta file contanining the dereplicated, errors corrected, paired-end merged representative sequences and one .txt file indicating the prevalence of sequencne in each sample (this is the result of dereplication). asv_seqs <- colnames(seqtab2) ## giving our seq headers more manageable names (ASV_1, ASV_2...) asv_headers <- vector(dim(seqtab2)[2], mode="character") for (i in 1:dim(seqtab2)[2]) { asv_headers[i] <- paste(">ASV", i, sep="_") } ## making and writing out a fasta of our final ASV seqs: asv_fasta <- c(rbind(asv_headers, asv_seqs)) write(asv_fasta, rep_seqs) ## count table: print("Create count table") asv_tab <- t(seqtab2) ## Renamed row.names(asv_tab) <- sub(">", "", asv_headers) write.table(asv_tab, count_table , sep="\t", quote=F) ## Write sequences objects in .rds for use in statistics ## Before length filtration saveRDS(seqtab, no_chim) ## After length filtration saveRDS(seqtab2, length_filtered) |
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
q_filtering_stats <- snakemake@input[["q_filtering_stats"]]
run_stats <- snakemake@input[["run_stats"]]
no_chim <- snakemake@input[["no_chim"]]
l_filtered <- snakemake@input[["length_filtered"]]

## Output
filtering_stats <- snakemake@output[["filtering_stats"]]

## Load needed libraries
library(dplyr); packageVersion("dplyr")

## Load the Q-score filtration stats
filtration <- do.call("rbind", lapply(q_filtering_stats, readRDS))

## Load the forward, reverse and paired reads correction stats for each run
dada_infer <- do.call("rbind", lapply(run_stats, readRDS))

## Load read counts after chimera filtering
no_chimera <- do.call("rowSums", lapply(no_chim, readRDS))
no_chimera <- as.data.frame(no_chimera)
no_chimera$Sample <- row.names(no_chimera)

## Load read counts after length filtering
length_filtered <- do.call("rowSums", lapply(l_filtered, readRDS))
length_filtered <- as.data.frame(length_filtered)
length_filtered$Sample <- row.names(length_filtered)

## Merge all dataframes together
all_stats <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Sample", all.x = TRUE),
                    list(filtration, dada_infer, no_chimera, length_filtered))

## Put Sample, label and RUN at the beginning of the table
all_stats <- all_stats %>% select(Sample, label, RUN, everything())

## Compute the % of passed reads
all_stats$passed_reads_pct <- (100*all_stats$length_filtered)/all_stats$reads.in

## Write the stats table
write.table(x = all_stats, file = filtering_stats, sep = "\t", row.names = FALSE)
shell:
    '''
    pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -p {params[forward_primer]} \
        -q {params[reverse_primer]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
    '''

shell:
    '''
    pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
    '''

shell:
    '''
    if [ -s {input} ]
    then
        vsearch --derep_fulllength {input} \
            --sizeout \
            --output {output[0]} \
            --uc {output[1]} 2> {log}
    else
        echo "{input} is empty"
        echo "> \n" > {output[0]}
        touch {output[1]}
    fi
    '''

shell:
    '''
    cat {input} >> {output}
    '''

shell:
    '''
    vsearch --derep_fulllength {input} \
        --sizeout \
        --minuniquesize 2 \
        --output {output} 2> {log}
    '''

shell:
    '''
    vsearch --cluster_size {input} \
        --sizein \
        --id 0.97 \
        --centroids {output[centroid]} \
        --consout {output[consout]} \
        --profile {output[profile]} \
        --threads {threads} \
        2> {log}
    '''

shell:
    '''
    vsearch --uchime_denovo {input} \
        --abskew 2 \
        --sizein \
        --nonchimeras {output[non_chimeras]} \
        --borderline {output[borderline]} \
        --uchimeout {output[chimera_out]} 2> {log}
    '''

shell:
    '''
    vsearch --derep_fulllength {input} \
        --sizein \
        --relabel OTU_ \
        --xsize \
        --output {output} \
        2> {log}
    '''

shell:
    '''
    vsearch --usearch_global {input[samples]} \
        -otutabout {output} \
        -id 0.97 \
        -strand plus \
        --db {input[rep_seq]} \
        --threads {threads} \
        2> {log}
    '''

script:
    "scripts/create_count_table_from_vsearch.R"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Load needed libraries
library(dplyr); packageVersion("dplyr")
library(reshape2); packageVersion("reshape2")
library(magrittr); packageVersion("magrittr")

## Input
count_table_samples <- snakemake@input[["count_table_samples"]]

## Output
count_table <- snakemake@output[["count_table"]]

## Reformat
otus_table <- data.frame(array(dim=c(0,3)))

### Loop over each sample file. If a file is empty, only add the sample name to the factor levels so that the sample still appears in the final table
for (file_path in count_table_samples){
  sample_name <- gsub("_count_table.tsv", "", basename(file_path))
  sample_otu_table <- read.table(file = file_path, sep="\t", as.is=T, check.names = F, header=T, comment.char = "", skipNul = TRUE)
  if (nrow(sample_otu_table)>0){
    sample_otu_table <- cbind("Sample"=sample_name, sample_otu_table)
    otus_table <- rbind(otus_table, sample_otu_table)
  }else if (nrow(sample_otu_table) == 0){
    levels(otus_table$Sample) <- c(levels(otus_table$Sample), sample_name)
  }
}

colnames(otus_table) <- c("Sample", "OTU_ID", "counts")

## Transform this table into a wide format with one column per sample
transf_vsearch_table <- otus_table %>%
  dplyr::group_by(Sample, OTU_ID, .drop = FALSE) %>%
  dplyr::summarise(counts = sum(counts)) %>%
  reshape2::dcast(OTU_ID ~ Sample) %>%
  dplyr::filter(!is.na(OTU_ID))

## Set OTU as rownames
transf_vsearch_table <- set_rownames(x = transf_vsearch_table, value = transf_vsearch_table$OTU_ID)
transf_vsearch_table[,1] <- NULL

## Write output
write.table(x = transf_vsearch_table, file = count_table, sep="\t", quote=F, na = "0")
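To make the long-to-wide reshaping above concrete, here is a small self-contained sketch with invented sample and OTU names (not pipeline data), using the same dplyr and reshape2 calls as the script.

library(dplyr); library(reshape2)

## Toy long-format table: one row per (sample, OTU) pair
otus_table <- data.frame(Sample = c("S1", "S1", "S2"),
                         OTU_ID = c("OTU_1", "OTU_2", "OTU_1"),
                         counts = c(10, 3, 7))

## Wide format: one row per OTU, one column per sample (NA where the OTU was not observed)
wide <- otus_table %>%
  dplyr::group_by(Sample, OTU_ID) %>%
  dplyr::summarise(counts = sum(counts)) %>%
  reshape2::dcast(OTU_ID ~ Sample)
wide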
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
seqs <- snakemake@input[["seqs"]]
King_to_Species <- snakemake@input[["King_to_Species"]]
King_to_Genus <- snakemake@input[["King_to_Genus"]]
Genus_species <- snakemake@input[["Genus_species"]]

## Output
tax <- snakemake@output[["tax"]]

## Parameters
cpu <- snakemake@threads

## Load needed libraries
library(dada2); packageVersion("dada2")
library(dplyr); packageVersion("dplyr")
library(tidyr); packageVersion("tidyr")
library(Biostrings); packageVersion("Biostrings")

## Read data
fastaFile <- readDNAStringSet(seqs)
seq.name = names(fastaFile)
sequence = paste(fastaFile)
fasta <- data.frame(seq.name, sequence)

## Format seqs
seq_table <- as.character(fasta$sequence)
names(seq_table) <- as.character(fasta$seq.name)

## Assign taxonomy
print("Assigning")
taxa <- assignTaxonomy(seqs = seq_table,
                       refFasta = King_to_Species,
                       taxLevels = c("Kingdom","Phylum","Class","Order","Family","Genus","Species"),
                       multithread=cpu,
                       tryRC = TRUE,
                       minBoot = 50,
                       verbose = TRUE,
                       outputBootstraps = TRUE)

#taxa <- addSpecies(taxtab = taxa, refFasta = Genus_species, verbose=TRUE, allowMultiple = TRUE, tryRC = TRUE) # Not working for some reason

## Format and write output
taxa_table <- data.frame(taxa)
taxa_table <- data.frame(cbind(Row.Names = names(rownames(unlist(taxa$tax))), taxa_table))
taxa_table <- taxa_table %>% unite(taxonomy, starts_with(match = "tax."), sep = ";", remove = TRUE)
taxa_table <- taxa_table %>% unite(boot, starts_with(match = "boot"), sep = ";", remove = TRUE)
#taxa_table$Row.Names <- NULL

write.table(taxa_table, tax, row.names = FALSE, sep = "\t", col.names = FALSE, quote = FALSE)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
trained_tax_path <- snakemake@input[["trained_tax"]]
seq_path <- snakemake@input[["seq"]]

## Output
tax_path <- snakemake@output[["tax"]]
tax_plot_path <- snakemake@output[["tax_plot"]]

## Parameters
cpu <- snakemake@threads

## Load needed libraries
library(DECIPHER); packageVersion("DECIPHER")
library(dplyr); packageVersion("dplyr")
library(tidyr); packageVersion("tidyr")

# Load a training set object (trainingSet)
# see http://DECIPHER.codes/Downloads.html
train_set <- readRDS(trained_tax_path)
query_seqs <- readDNAStringSet(seq_path)

# Classify the sequences
ids <- IdTaxa(query_seqs,
              train_set,
              strand="both",   # or "top" if same as trainingSet
              threshold=60,    # 60 (very high) or 50 (high)
              processors=cpu)  # use all available processors

# Look at the results
pdf(file = tax_plot_path)
plot(ids)
dev.off()

## Format the taxonomy and write it
taxid <- data.frame(t(sapply(ids, function(x){
  taxa <- c(unlist(x$taxon)[2:8], last(x$confidence))
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
})))

taxid <- unite(data = taxid, 1:7, col = "taxonomy", sep = ";", remove = TRUE)

write.table(x = taxid, tax_path, row.names = TRUE, sep = "\t", col.names = FALSE, quote = FALSE)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
rdp_tax <- snakemake@input[["RDP_output"]]

## Output
simplified_tax <- snakemake@output[["formatted_output"]]

## Load needed libraries
library(dplyr); packageVersion("dplyr")
library(tidyr); packageVersion("tidyr")

## Read data
tax_table <- read.table(rdp_tax, sep="\t", stringsAsFactors=FALSE)

## Format table
### Set rownames
row.names(tax_table) <- tax_table$V1
### Filter out unnecessary columns
tax_table[,c("V1", "V2", "V3", "V4", "V5")] <- NULL

## Set a table for confidence score filtering
### Keep the score columns
score_table <- tax_table[,seq(from = 3, to = 21, by = 3)]
### Rename columns
colnames(score_table) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
### Create a boolean dataframe for filtering based on confidence
score_table_bool <- score_table < 0.5

## Set a table with taxa names
### Keep the name columns
names_tables <- tax_table[,seq(from = 1, to = 19, by = 3)]
### Rename columns
colnames(names_tables) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

## Filter taxa names based on the boolean table
taxa_names_filtered <- replace(x = names_tables, list = score_table_bool, values = NA)
## Filter scores based on the boolean table
score_table_filtered <- replace(x = score_table, list = score_table_bool, values = NA)

## Unite taxonomy
taxa_names_filtered_unite <- unite(data = taxa_names_filtered, col = "", sep = ";", na.rm = TRUE, remove = TRUE)

## Add confidence
taxa_names_filtered_unite$confidence <- apply(score_table_filtered, MARGIN = 1, FUN = min, na.rm = TRUE)

## Write formatted table
write.table(taxa_names_filtered_unite, sep = "\t", simplified_tax, row.names = TRUE, quote = FALSE, col.names = FALSE)
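The confidence filtering above relies on a boolean mask built from the RDP scores. A toy illustration with invented names and scores (cutoff 0.5, as in the script):

library(tidyr)

## Toy name and score tables for two features
names_tables <- data.frame(Genus = c("Escherichia", "Staphylococcus"),
                           Species = c("coli", "aureus"))
score_table <- data.frame(Genus = c(0.99, 0.95),
                          Species = c(0.80, 0.30))

## TRUE where the assignment is below the confidence cutoff
score_table_bool <- score_table < 0.5

## Mask low-confidence ranks, then collapse the remaining ranks into one taxonomy string
taxa_names_filtered <- replace(x = names_tables, list = score_table_bool, values = NA)
unite(taxa_names_filtered, col = "taxonomy", sep = ";", na.rm = TRUE)  # second row keeps only "Staphylococcus"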
shell:
    '''
    export RDP_JAR_PATH=$(command -v rdp_classifier-2.2.jar);
    assign_path=$(which assign_taxonomy.py)
    python $assign_path \
        -i {input[0]} \
        -r {input[1]} \
        -t {input[2]} \
        -m rdp \
        -o $(dirname {output[0]}) \
        -c 0.5 \
        --rdp_max_memory {resources[mem_mb]} 2> {log[0]}
    '''

script:
    "scripts/dada2_rdp_tax_assign.R"

shell:
    '''
    classifier -Xmx30g -XX:ConcGCThreads=1 classify \
        -c 0.5 \
        -f allrank \
        -t {input[0]} \
        -o {output[0]} {input[1]} 2> {log[0]}
    '''

script:
    "scripts/format_RDP_output.R"

script:
    "scripts/decipher_assign_tax.R"

script:
    "scripts/physeq_initial_import.R"

script:
    "scripts/physeq_export.R"

script:
    "scripts/physeq_filter_taxa.R"

script:
    "scripts/physeq_rarefy.R"

script:
    "scripts/physeq_add_new_tree.R"

script:
    "scripts/physeq_melt_table.R"

script:
    "scripts/physeq_collapse_taxa.R"

script:
    "scripts/transpose_and_add_meta_count_table.R"

script:
    "scripts/transf_norm.R"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]
new_tree <- snakemake@input[["new_tree"]]

## Output
updated_phyloseq_path <- snakemake@output[["phyloseq_object"]]

## Load libraries
library(phyloseq); packageVersion("phyloseq")

## Read the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Remove taxa not present in at least one sample

## Read the new tree
PHY <- read_tree(new_tree)

## Assign the new tree to the phyloseq object
phy_tree(phyloseq_obj) <- PHY

## Write the new phyloseq object
saveRDS(object = phyloseq_obj, file = updated_phyloseq_path)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_filtered_object <- snakemake@output[[1]]

## Parameters
collapse_level <- snakemake@params[["collapse_level"]]

## Load needed libraries
library(phyloseq); packageVersion("phyloseq")

## Import the phyloseq object
phyloseq_object <- readRDS(phyloseq_object)

## Convert the numeric taxonomic level into its rank name
rank_names(phyloseq_object)[[as.numeric(collapse_level)]]

## Collapse taxa
collapsed_physeq <- tax_glom(phyloseq_object,
                             taxrank=rank_names(phyloseq_object)[[as.numeric(collapse_level)]],
                             NArm=TRUE,
                             bad_empty=c(NA, "", " ", "\t"))
collapsed_physeq <- prune_taxa(taxa_sums(collapsed_physeq) > 0, collapsed_physeq) ## Remove taxa not present in at least one sample

# Write the new phyloseq object
saveRDS(object = collapsed_physeq, file = phyloseq_filtered_object)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
tree_path <- snakemake@output[["tree_path"]]
meta_path <- snakemake@output[["meta_path"]]
taxonomy_path <- snakemake@output[["taxonomy_path"]]
OTU_path <- snakemake@output[["OTU_path"]]
rep_seq_path <- snakemake@output[["rep_seq_path"]]

## Load needed libraries
library(phyloseq); packageVersion("phyloseq")
library(dplyr); packageVersion("dplyr")
library(stringr); packageVersion("stringr")
library(Biostrings); packageVersion("Biostrings")

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Remove taxa not present in at least one sample

## Write the tree of the phyloseq object
tree <- phy_tree(phyloseq_obj)
ape::write.tree(tree, tree_path)

## Write the metadata table of the phyloseq object
metadata <- (sample_data(phyloseq_obj))
metadata$Sample <- rownames(metadata)
metadata <- metadata %>% select(Sample, everything())
write.table(metadata, meta_path, sep="\t", row.names = FALSE)

## Write the taxonomy table of the phyloseq object
taxa_df <- as.data.frame(tax_table(phyloseq_obj), stringsAsFactors = F)
taxa_df$taxonomy <- apply(taxa_df, 1, function(x) str_c(x[!is.na(x)], collapse = ";"))
taxa_df <- taxa_df %>% select(taxonomy)
write.table(taxa_df, taxonomy_path, sep="\t", quote=F, col.names = F)

## Write the OTU table of the phyloseq object
write.table(otu_table(phyloseq_obj), OTU_path, sep="\t", quote=F)

## Write the representative sequences
writeXStringSet(refseq(phyloseq_obj), rep_seq_path, append=FALSE, compress=FALSE, format="fasta")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_filtered_object <- snakemake@output[["phyloseq_filtered_object"]]

## Parameters
tax_rank <- snakemake@params[["filter_tax_rank"]]
lineage <- snakemake@params[["filter_lineage"]]
filter_out_tax_rank <- snakemake@params[["filter_out_tax_rank"]]
filter_out_lineage <- snakemake@params[["filter_out_lineage"]]

## Load needed libraries
library(phyloseq); packageVersion("phyloseq")
library(dplyr); packageVersion("dplyr")

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Filter taxa
filtered_taxa <- subset_taxa(phyloseq_obj, get(tax_rank) == as.character(lineage))
filtered_taxa <- subset_taxa(filtered_taxa, get(filter_out_tax_rank) != as.character(filter_out_lineage))
filtered_taxa <- prune_taxa(taxa_sums(filtered_taxa) > 0, filtered_taxa) ## Remove taxa not present in at least one sample

## Recompute alpha diversity indices after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]

### Add alpha diversity indices to the metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa), alpha_div)

### In addition, add the "Observed over 1%" metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those taxa
physeqaF <- prune_taxa(physeqrF, filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa), alpha_div_1)

## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_filtered_object)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
count_table <- snakemake@input[["count_table"]]
Metadata_table <- snakemake@input[["Metadata_table"]]
taxonomy_table <- snakemake@input[["taxonomy_table"]]
rep_seqs <- snakemake@input[["rep_seqs"]]
tax_tree <- snakemake@input[["tax_tree"]]

## Output
phyloseq_object <- snakemake@output[["phyloseq_object"]]

## Parameters
replace_empty_tax <- snakemake@params[["viz_replace_empty_tax"]]

## Load libraries
library(dplyr); packageVersion("dplyr")
library(tibble); packageVersion("tibble")
library(tidyr); packageVersion("tidyr")
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")

## Import data
### Read count table
print("reading count table")
count_table <- read.table(file = count_table, header = TRUE, check.names=FALSE)

### Read sample data
print("reading metadata")
metadata <- read.delim(file = Metadata_table, sep = "\t", header = TRUE, na.strings = "NA")

### Read taxonomic tree
print("reading taxonomic tree")
PHY <- read_tree(tax_tree)

### Read representative sequences
print("importing representative sequences from fasta")
SEQS <- readDNAStringSet(rep_seqs)

### Read and format the taxonomy table
print("reading taxonomy table")
taxonomy_table <- read.table(file = taxonomy_table, header = FALSE, sep = "\t")

### Split the taxonomy string into one column per rank
taxonomy_table <- taxonomy_table %>% as_tibble() %>% separate(V2, sep=";", c("Kingdom","Phylum","Class","Order","Family","Genus","Species"))

### Rename the remaining default headers into proper ones
colnames(taxonomy_table)[colnames(taxonomy_table)=="V1"] <- "Feature.ID"
colnames(taxonomy_table)[colnames(taxonomy_table)=="V3"] <- "Confidence"

### Convert taxonomic levels to character (needed for the next steps)
taxonomy_table$Kingdom <- as.character(taxonomy_table$Kingdom)
taxonomy_table$Phylum <- as.character(taxonomy_table$Phylum)
taxonomy_table$Class <- as.character(taxonomy_table$Class)
taxonomy_table$Order <- as.character(taxonomy_table$Order)
taxonomy_table$Family <- as.character(taxonomy_table$Family)
taxonomy_table$Genus <- as.character(taxonomy_table$Genus)
taxonomy_table$Species <- as.character(taxonomy_table$Species)

print(paste("replace empty taxonomy :", replace_empty_tax))

if(isTRUE(replace_empty_tax)) {
  ### Replace NA by the closest assigned higher rank plus a placeholder suffix, for each taxonomic level
  taxonomy_table$Kingdom[is.na(taxonomy_table$Kingdom)] <- (("Unkown_Kingdom")[is.na(taxonomy_table$Kingdom)])
  taxonomy_table$Phylum[is.na(taxonomy_table$Phylum)] <- ((paste(taxonomy_table$Kingdom,"_phy",sep=""))[is.na(taxonomy_table$Phylum)])
  taxonomy_table$Class[is.na(taxonomy_table$Class)] <- ((paste(taxonomy_table$Phylum,"_clas",sep=""))[is.na(taxonomy_table$Class)])
  taxonomy_table$Order[is.na(taxonomy_table$Order)] <- ((paste(taxonomy_table$Class,"_ord",sep=""))[is.na(taxonomy_table$Order)])
  taxonomy_table$Family[is.na(taxonomy_table$Family)] <- ((paste(taxonomy_table$Order,"_fam",sep=""))[is.na(taxonomy_table$Family)])
  taxonomy_table$Genus[is.na(taxonomy_table$Genus)] <- ((paste(taxonomy_table$Family,"_gen",sep=""))[is.na(taxonomy_table$Genus)])
  taxonomy_table$Species[is.na(taxonomy_table$Species)] <- ((paste(taxonomy_table$Genus,"_sp",sep=""))[is.na(taxonomy_table$Species)])
  print("taxonomy NAs replaced by placeholders")
}else{
  print("taxonomy NAs NOT replaced by placeholders")
}

## Import each component as a phyloseq object where needed
OTU <- otu_table(count_table, taxa_are_rows = TRUE)
TAX <- taxonomy_table %>% column_to_rownames("Feature.ID") %>% as.matrix() %>% tax_table()
META <- metadata %>% as.data.frame() %>% column_to_rownames("Sample") %>% sample_data()

## Merge all components into a phyloseq object
phyloseq_obj <- phyloseq(OTU, TAX, META, PHY, SEQS)

## Compute alpha diversity indices
### Remove any previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]

### Add alpha diversity indices to the metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj), alpha_div)

### In addition, add the "Observed over 1%" metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those taxa
physeqaF <- prune_taxa(physeqrF, phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj), alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Remove taxa not present in at least one sample

## Write the phyloseq object
saveRDS(object = phyloseq_obj, file = phyloseq_object)
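The placeholder logic above propagates the last assigned rank downwards with a rank-specific suffix. A one-row toy example (hypothetical values) of the pattern applied at the Class rank:

taxonomy_table <- data.frame(Kingdom = "Bacteria", Phylum = "Firmicutes", Class = NA, stringsAsFactors = FALSE)
## Same replacement pattern as above, applied to the Class rank
taxonomy_table$Class[is.na(taxonomy_table$Class)] <- (paste(taxonomy_table$Phylum, "_clas", sep = ""))[is.na(taxonomy_table$Class)]
taxonomy_table$Class  # "Firmicutes_clas"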
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_melted_table <- snakemake@output[[1]]

## Load needed libraries
library(phyloseq); packageVersion("phyloseq")

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Remove taxa not present in at least one sample

### Melt the phyloseq object into a dataframe with one row per feature ID and per sample
physeq_df <- psmelt(phyloseq_obj)

### Write this table in tsv format
write.table(x = physeq_df, file = phyloseq_melted_table, append = F, sep = "\t", eol = "\n", row.names = F, col.names = T)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefied_phyloseq <- snakemake@output[["phyloseq_object"]]

## Parameters
rarefy_value <- snakemake@params[["rarefaction_value"]]

## Load libraries
library(vegan); packageVersion("vegan")
library(phyloseq); packageVersion("phyloseq")
library(dplyr); packageVersion("dplyr")

## Set seed for reproducibility
set.seed(1)

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Rarefy the count table
otu_table(phyloseq_obj) <- t(rrarefy(t(otu_table(phyloseq_obj)), sample = as.numeric(rarefy_value)))

## Compute alpha diversity indices after this rarefaction
### Remove the previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]

### Add alpha diversity indices to the metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj), alpha_div)

### In addition, add the "Observed over 1%" metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those taxa
physeqaF <- prune_taxa(physeqrF, phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj), alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Remove taxa not present in at least one sample

# Write the phyloseq object
saveRDS(object = phyloseq_obj, file = rarefied_phyloseq)
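The rarefaction above uses vegan::rrarefy, which subsamples every sample (the rows of the matrix it receives, hence the t() calls since phyloseq stores taxa as rows) down to a fixed depth. A minimal sketch on invented counts with an arbitrary depth of 10:

library(vegan)

set.seed(1)  # same seed as in the script, for reproducibility
counts <- matrix(c(50, 30, 20, 5, 80, 15), nrow = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), c("OTU_1", "OTU_2", "OTU_3")))
rarefied <- rrarefy(counts, sample = 10)
rowSums(rarefied)  # each sample now totals exactly 10 reads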
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_norm <- snakemake@output[["phyloseq_norm"]]

## Parameters
### lower-case the normalization name to ease matching
normalization <- tolower(snakemake@params[["normalization"]])
### keep the raw values here; they are checked against "none" before numeric conversion below
min_abundance <- snakemake@params[["min_abundance_value"]]
min_prevalence <- snakemake@params[["min_prevalence_value"]]
print(paste("normalization setting is", normalization))

## Load needed libraries
library("phyloseq"); packageVersion("phyloseq")
library("vegan"); packageVersion("vegan")
library("edgeR"); packageVersion("edgeR")
library("metagenomeSeq"); packageVersion("metagenomeSeq")

## Import phyloseq
physeq <- readRDS(phyloseq_object)

## Recover the OTU table
OTU <- data.frame(otu_table(physeq), check.names = FALSE)

## List the methods computed by vegan
vegan_methods <- c("total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger", "log")

## CLR must be computed sample-wise, on rows. In phyloseq we have taxa_are_rows = TRUE, so it must be computed on the t() of the OTU table
if (normalization == "clr"){
  print("clr normalization")
  OTU1 <- OTU + 1
  reads_trfs <- t(chemometrics::clr(t(OTU1)))

## vegan decostand normalizations (with a double t() to account for the transposed structure of the counts used here, modified from https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/decostand)
}else if (normalization %in% vegan_methods){
  print(paste(normalization, "normalization by vegan"))
  reads_trfs <- t(vegan::decostand(t(OTU), method = normalization))

## Percent normalization with base R
}else if (normalization == "pct"){
  print("Percent normalization with base R")
  reads_trfs <- sapply(OTU, function(x) 100*x/sum(x))
  rownames(reads_trfs) <- rownames(OTU)

## CSS normalization with metagenomeSeq (modified from https://bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/inst/doc/metagenomeSeq.pdf)
}else if (normalization == "css"){
  print("css normalization by metagenomeseq")
  OTU_mtgs <- metagenomeSeq::newMRexperiment(counts = OTU, phenoData = NULL, featureData = NULL, libSize = NULL, normFactors = NULL)
  p <- metagenomeSeq::cumNormStat(OTU_mtgs)
  OTU_norm <- metagenomeSeq::cumNorm(OTU_mtgs, p = p)
  reads_trfs = as.data.frame(metagenomeSeq::MRcounts(OTU_norm, norm = TRUE, log = FALSE))

## TMM normalization with edgeR (modified from https://joey711.github.io/phyloseq-extensions/edgeR.html)
} else if (normalization == "tmm"){
  print("TMM normalization by edgeR")
  OTU1 <- OTU + 1
  group <- rownames(get_variable(physeq))
  tax <- tax_table(physeq, errorIfNULL=FALSE)
  y <- edgeR::DGEList(counts=OTU1, group=group, genes=tax, remove.zeros = TRUE)
  z = edgeR::calcNormFactors(y, method="TMM")
  reads_trfs <- data.frame(z$counts, check.names = FALSE)
  #reads_trfs[reads_trfs==0.5] <- 0

} else if (normalization == "none"){
  print('No normalization (normalization = "none")')
  reads_trfs <- OTU

}else{
  stop(paste('"normalization" was', normalization, '. Must be one of: "clr", "pct", "css", "tmm", "total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger", "log" or "none".'))
}

## Filter based on counts and prevalence
### Filter based on the cumulated abundance
### (check for "none" before coercing to numeric, since as.numeric("none") would give NA)
if (min_abundance == "none"){
  print("NO filtering based on cumulated abundance")
  filtered_data <- reads_trfs
}else if (!is.na(suppressWarnings(as.numeric(min_abundance)))){
  min_abundance <- as.numeric(min_abundance)
  print(paste("filter based on cumulated min_abundance:", min_abundance))
  filtered_data <- reads_trfs[,colSums(reads_trfs) > min_abundance, drop = FALSE]
}else{
  stop(paste(min_abundance, 'must be a numeric value or "none"'))
}

### Filter based on prevalence
if (min_prevalence == "none"){
  print("NO filtering based on prevalence")
}else if (!is.na(suppressWarnings(as.numeric(min_prevalence)))){
  min_prevalence <- as.numeric(min_prevalence)
  print(paste("filter based on prevalence:", min_prevalence))
  prevalence_cutoff <- (min_prevalence/100) * nsamples(physeq)
  print(paste("Features must be found in more than", prevalence_cutoff, "samples to be retained"))
  filtered_data <- filtered_data[,colSums(filtered_data != 0) > prevalence_cutoff, drop = FALSE]
}else{
  stop(paste(min_prevalence, 'must be a numeric value or "none"'))
}

## Repopulate the phyloseq object (same code as in the taxa filtering script)
physeq_filtered <- physeq
sample_names(physeq_filtered)
head(filtered_data)
otu_table(physeq_filtered) <- otu_table(round(filtered_data, digits = 0), taxa_are_rows = TRUE)
filtered_taxa <- prune_taxa(taxa_sums(physeq_filtered) > 0, physeq_filtered) ## Remove taxa not present in at least one sample

## Recompute alpha diversity indices after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]

### Add alpha diversity indices to the metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa), alpha_div)

### In addition, add the "Observed over 1%" metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those taxa
physeqaF <- prune_taxa(physeqrF, filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa), alpha_div_1)

## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_norm)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
count_table <- snakemake@input[["count_table"]]
print(count_table)
meta <- snakemake@input[["meta"]]

## Output
transposed_table <- snakemake@output[["transposed_table"]]
merged_meta <- snakemake@output[["merged_meta"]]

## Load the count table and the metadata
table <- read.table(count_table, sep = "\t", header = TRUE)
meta <- read.delim(meta, sep = "\t", header = TRUE, na.strings = "NA")

## Transpose the count table
table_t <- t(table)

## Bind the tables
binded_tables <- cbind(meta, table_t)

### Write these tables in tsv files since they are the basis for coming analyses and slow to compute
write.table(x = table_t, file = transposed_table, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T)

write.table(x = binded_tables, file = merged_meta, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T)
shell:
    """
    run_treeshrink.py -o $(dirname {output[0]}) -q {params[quantile]} -t {input}
    """

script:
    "scripts/raw_to_processed_reads_stats.R"

script:
    "scripts/raw_to_processed_reads_plot.R"

script:
    "scripts/raw_to_tax_filtered_reads_stats.R"

script:
    "scripts/raw_to_processed_reads_plot.R"

script:
    "scripts/KRONA_plots.R"

script:
    "scripts/rarefaction_curve.R"

script:
    "scripts/create_biom_from_count_table.R"

shell:
    '''
    qiime tools import \
        --input-path {input[0]} \
        --type 'FeatureTable[Frequency]' \
        --input-format BIOMV100Format \
        --output-path {output[0]}
    '''

shell:
    '''
    qiime feature-table summarize \
        --i-table {input[0]} \
        --o-visualization {output[0]}
    '''

shell:
    '''
    awk '/^>/ {{print($0)}}; /^[^>]/ {{print(toupper($0))}}' {input[0]} > {output[0]}
    '''

shell:
    '''
    qiime tools import \
        --input-path {input[0]} \
        --output-path {output[0]} \
        --type 'FeatureData[Sequence]'
    '''

shell:
    '''
    qiime phylogeny align-to-tree-mafft-fasttree \
        --p-n-threads {threads} \
        --i-sequences {input[0]} \
        --o-alignment {output[aligned]} \
        --o-masked-alignment {output[masked]} \
        --o-tree {output[unrooted]} \
        --o-rooted-tree {output[rooted]}
    '''

shell:
    '''
    qiime tools export --input-path {input} --output-path $(dirname {output[0]})
    '''

shell:
    '''
    qiime feature-table tabulate-seqs \
        --i-data {input[0]} \
        --o-visualization {output[0]}
    '''

shell:
    '''
    qiime tools import \
        --type FeatureData[Taxonomy] \
        --input-path {input[0]} \
        --input-format HeaderlessTSVTaxonomyFormat \
        --output-path {output[0]}
    '''

shell:
    '''
    qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}
    '''
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
count_table <- snakemake@input[["count_table"]]

## Output
biom_count <- snakemake@output[["biom_count"]]

## Libraries
library(biomformat); packageVersion("biomformat")
library(phyloseq); packageVersion("phyloseq")

## Reformat the count table as a biom object
asv_tab <- read.table(file = count_table)
otu <- as(otu_table(asv_tab, taxa_are_rows = TRUE), "matrix")
otu_biom <- make_biom(data = otu)

## Write
write_biom(x = otu_biom, biom_file = biom_count)
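For reference, a minimal standalone sketch of the same biomformat calls on a toy count matrix (invented values and a hypothetical output file name):

library(biomformat)

otu <- matrix(c(5, 0, 3, 2), nrow = 2,
              dimnames = list(c("OTU_1", "OTU_2"), c("S1", "S2")))
otu_biom <- make_biom(data = otu)
write_biom(x = otu_biom, biom_file = "toy_counts.biom")  # hypothetical file name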
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_melted_table <- snakemake@input[["phyloseq_melted_table"]]

## Output
output_folder <- snakemake@output[["output"]]
output_folder <- (dirname(output_folder)[1])

## Parameters
sample_label <- snakemake@params[["sample_label"]]
grouping_column <- snakemake@params[["grouping_column"]]
grouping_filter_column_value <- snakemake@params[["grouping_col_value"]]

## Load needed library
library(dplyr); packageVersion("dplyr")

## Read data
melted_dataframe <- read.csv(file.path(phyloseq_melted_table), header = TRUE, sep = "\t")

## Create the KRONA input
df <- filter(melted_dataframe, melted_dataframe[[grouping_column]] == grouping_filter_column_value)
df <- filter(df, df[["Abundance"]] != 0)
df <- df[, c("Abundance", sample_label, "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "OTU")]
df <- as.data.frame(unclass(df))
df[, 2] <- gsub(" |\\(|\\)", "", df[, 2])
df[, 2] <- as.factor(df[, 2])

dir.create(file.path(output_folder, "/", grouping_filter_column_value))
for (lvl in levels(df[, 2])) {
  write.table(unique(df[which(df[, 2] == lvl), -2]),
              file = paste0(output_folder, "/", grouping_filter_column_value, "/", lvl, "taxonomy.txt"),
              sep = "\t", row.names = F, col.names = F, na = "", quote = F)
}

# Some negative controls or samples can have no reads at all: if the entire sample group is empty, write a log file stating that it has no reads; otherwise build the Krona plot.
if (all(df[,2] == 0) == 1){
  dir.create(file.path(output_folder, "/", grouping_filter_column_value))
  cat(format(Sys.time(), "%a %b %d %Y %X TZ(%z)"), " ", "All samples have 0 abundance for this sample group.",
      file = paste0(output_folder, "/", grouping_filter_column_value, ".html"))
} else {
  dir.create(file.path(output_folder, "/", grouping_filter_column_value))
  for (lvl in levels(df[, 2])) {
    write.table(unique(df[which(df[, 2] == lvl), -2]),
                file = paste0(output_folder, "/", grouping_filter_column_value, "/", lvl, "taxonomy.txt"),
                sep = "\t", row.names = F, col.names = F, na = "", quote = F)
  }
  krona_args <- paste0(output_folder, "/", grouping_filter_column_value, "/", levels(df[,2]), "taxonomy.txt,", levels(df[, 2]), collapse = " ")
  output <- paste0(output_folder, "/", grouping_filter_column_value, ".html")
  system(paste("ktImportText", krona_args, "-o", output, sep = " "))
}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefaction_curve <- snakemake@output[["rarefaction_curve"]]

## Parameters
color_factor <- snakemake@params[["grouping_column"]]

## Load libraries
library('phyloseq')
library('ggplot2')
library('plyr')     # ldply
library('reshape2') # melt
library('vegan')

## Load the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Modified version of the estimate_richness function of phyloseq, to accept sample names containing only numbers
modified_estimated_richness <- function (physeq, split = TRUE, measures = NULL) {
  if (!any(otu_table(physeq) == 1)) {
    warning("The data you have provided does not have\n",
            "any singletons. This is highly suspicious. Results of richness\n",
            "estimates (for example) are probably unreliable, or wrong, if you have already\n",
            "trimmed low-abundance taxa from the data.\n", "\n",
            "We recommended that you find the un-trimmed data and retry.")
  }
  if (!split) {
    OTU <- taxa_sums(physeq)
  }
  else if (split) {
    OTU <- as(otu_table(physeq), "matrix")
    if (taxa_are_rows(physeq)) {
      OTU <- t(OTU)
    }
  }
  renamevec = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher")
  names(renamevec) <- c("S.obs", "S.chao1", "S.ACE", "shannon", "simpson", "invsimpson", "fisher")
  if (is.null(measures)) {
    measures = as.character(renamevec)
  }
  if (any(measures %in% names(renamevec))) {
    measures[measures %in% names(renamevec)] <- renamevec[names(renamevec) %in% measures]
  }
  if (!any(measures %in% renamevec)) {
    stop("None of the `measures` you provided are supported. Try default `NULL` instead.")
  }
  outlist = vector("list")
  estimRmeas = c("Chao1", "Observed", "ACE")
  if (any(estimRmeas %in% measures)) {
    outlist <- c(outlist, list(t(data.frame(estimateR(OTU), check.names = FALSE))))
  }
  if ("Shannon" %in% measures) {
    outlist <- c(outlist, list(shannon = diversity(OTU, index = "shannon")))
  }
  if ("Simpson" %in% measures) {
    outlist <- c(outlist, list(simpson = diversity(OTU, index = "simpson")))
  }
  if ("InvSimpson" %in% measures) {
    outlist <- c(outlist, list(invsimpson = diversity(OTU, index = "invsimpson")))
  }
  if ("Fisher" %in% measures) {
    fisher = tryCatch(fisher.alpha(OTU, se = TRUE), warning = function(w) {
      warning("phyloseq::estimate_richness: Warning in fisher.alpha(). See `?fisher.fit` or ?`fisher.alpha`. Treat fisher results with caution")
      suppressWarnings(fisher.alpha(OTU, se = TRUE)[, c("alpha", "se")])
    })
    if (!is.null(dim(fisher))) {
      colnames(fisher)[1:2] <- c("Fisher", "se.fisher")
      outlist <- c(outlist, list(fisher))
    }
    else {
      outlist <- c(outlist, Fisher = list(fisher))
    }
  }
  out = do.call("cbind", outlist)
  namechange = intersect(colnames(out), names(renamevec))
  colnames(out)[colnames(out) %in% namechange] <- renamevec[namechange]
  colkeep = sapply(paste0("(se\\.){0,}", measures), grep, colnames(out), ignore.case = TRUE)
  out = out[, sort(unique(unlist(colkeep))), drop = FALSE]
  out <- as.data.frame(out)
  return(out)
}

# Rarefaction curve function, replicated from: https://github.com/joey711/phyloseq/issues/143
set.seed(42)

calculate_rarefaction_curves <- function(psdata, measures, depths) {
  estimate_rarified_richness <- function(psdata, measures, depth) {
    if(max(sample_sums(psdata)) < depth) return()
    psdata <- prune_samples(sample_sums(psdata) >= depth, psdata)
    rarified_psdata <- rarefy_even_depth(psdata, depth, verbose = FALSE)
    alpha_diversity <- modified_estimated_richness(rarified_psdata, measures = measures)
    # as.matrix forces the use of melt.array, which includes the sample names (rownames)
    molten_alpha_diversity <- melt(as.matrix(alpha_diversity), varnames = c('Sample', 'Measure'), value.name = 'Alpha_diversity')
    molten_alpha_diversity
  }
  names(depths) <- depths # this enables automatic addition of the Depth to the output by ldply
  rarefaction_curve_data <- ldply(depths, estimate_rarified_richness, psdata = psdata, measures = measures, .id = 'Depth', .progress = ifelse(interactive(), 'text', 'none'))
  # convert Depth from factor to numeric
  rarefaction_curve_data$Depth <- as.numeric(levels(rarefaction_curve_data$Depth))[rarefaction_curve_data$Depth]
  rarefaction_curve_data
}

rarefaction_curve_data <- calculate_rarefaction_curves(psdata = phyloseq_obj,
                                                       measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson"),
                                                       depth = rep(c(1, 10, 100, 1000, 1:100 * 10000), each = 10))
summary(rarefaction_curve_data)

#### Summarize the replicates (mean and sd of each alpha diversity measure)
rarefaction_curve_data_summary <- ddply(rarefaction_curve_data, c('Depth', 'Sample', 'Measure'), summarise,
                                        Alpha_diversity_mean = mean(Alpha_diversity),
                                        Alpha_diversity_sd = sd(Alpha_diversity))

#### Add the sample metadata
rarefaction_curve_data_summary_verbose <- merge(rarefaction_curve_data_summary, data.frame(sample_data(phyloseq_obj)), by.x = 'Sample', by.y = 'row.names')
rarefaction_curve_data_summary_verbose

p <- ggplot(data = rarefaction_curve_data_summary_verbose,
            mapping = aes(x = Depth,
                          y = Alpha_diversity_mean,
                          ymin = Alpha_diversity_mean - Alpha_diversity_sd,
                          ymax = Alpha_diversity_mean + Alpha_diversity_sd,
                          colour = get(color_factor),
                          group = get("Sample"))) +
  labs(col = "Sample type") +
  geom_line(size = 0.15) +
  #geom_pointrange(size = 0.2) +
  facet_wrap(facets = ~ Measure, scales = 'free_y')

ggsave(filename = rarefaction_curve, plot = p, width = 10)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
raw_to_filtered_reads_stats_path <- snakemake@input[["raw_to_filtered_reads_stats"]]
raw_to_filtered_reads_stats <- read.table(file = raw_to_filtered_reads_stats_path, sep = "\t", header = TRUE)
metadata_path <- snakemake@input[["Metadata_table"]]
metadata <- read.delim(file = metadata_path, sep = "\t", header = TRUE)
multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Output
reads_plot_with_filtered <- snakemake@output[["reads_plot_with_filtered"]]

## Parameters
x_axis_column <- snakemake@params[["sample_label"]]

## Load needed libraries
library("phyloseq"); packageVersion("phyloseq")
library("data.table"); packageVersion("data.table")
library("dplyr"); packageVersion("dplyr")
library("ggplot2"); packageVersion("ggplot2")
library("RColorBrewer"); packageVersion("RColorBrewer")

## Set theme
theme_set(theme_bw())

### Record the distribution of the number of reads (useful later to scale the plot axes)
smin <- min(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
print(smin)
smean <- mean(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
print(smean)
smax <- max(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
print(smax)

### Order the x axis as in the metadata table
raw_to_filtered_reads_stats[[x_axis_column]] = factor(raw_to_filtered_reads_stats[[x_axis_column]], levels = unique(metadata[[x_axis_column]]), ordered = TRUE)
#raw_to_filtered_reads_stats[[grouping_column]] = factor(raw_to_filtered_reads_stats[[grouping_column]], levels = unique(metadata[[grouping_column]]), ordered = TRUE)

### Order the read categories logically
ordered_factors <- c("Tax filtered reads", "Reads processing", "Maintained reads")
raw_to_filtered_reads_stats$Reads <- factor(raw_to_filtered_reads_stats$Reads, ordered = TRUE, levels = ordered_factors)

### Generate a colors palette
getPalette <- colorRampPalette(brewer.pal(n=9, "Set3"))
ColList <- ordered_factors
ColPalette = getPalette(length(ColList))
names(ColPalette) = ColList
colors_palette <- ColPalette

overall_reads_barplot <- ggplot(raw_to_filtered_reads_stats, aes(x = get(x_axis_column), y = Count, fill = Reads)) +
  geom_col() +
  scale_fill_manual(values = colors_palette) +
  labs(x = x_axis_column, y = "Reads") +
  ggtitle(paste("Reads counts overall")) +
  scale_x_discrete(drop = TRUE) + # Drop unused factor levels. Alternative: keep all groups with (drop = FALSE)
  scale_y_continuous(labels = scales::comma, limits = c(0, smax)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 7)) #+
  # guides(fill=guide_legend(title=filling_column))

### Save it
w <- 7 + 0.07 * (length(unique(raw_to_filtered_reads_stats[[x_axis_column]])))
ggsave(overall_reads_barplot, filename = reads_plot_with_filtered, width = w, height = 5)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]
multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Output
raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]

## Load needed libraries
library("phyloseq"); packageVersion("phyloseq")
library("data.table"); packageVersion("data.table")
library("dplyr"); packageVersion("dplyr")

## Load the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Create a table with the number of raw reads and filtered reads
### Filtered reads
reads_counts_df <- data.table(as(sample_data(phyloseq_obj), "data.frame"), TotalReads = sample_sums(phyloseq_obj), keep.rownames = TRUE)
setnames(reads_counts_df, "rn", "Sample") # Rename the first column of this new dataframe -> Sample

### Raw reads
multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total of raw sequences. Since it is counted twice (R1, R2), keep only R1.
multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1"

### Join the two dataframes: the filtered reads from phyloseq and the raw reads
reads_counts_df_with_raw <- merge(multi_QC_report, reads_counts_df, by=c("Sample"))

### Calculate the difference in reads between the raw and the filtered reads
reads_counts_df_with_raw <- mutate(reads_counts_df_with_raw, filtered = FastQC_mqc.generalstats.fastqc.total_sequences - TotalReads)

### Keep two versions of the table: one with the filtered reads in "Count" and one with the difference from the raw reads in the same Count column
reads_counts_df_raw_count_only <- reads_counts_df_with_raw %>% select(-"filtered") %>% rename("Count" = "TotalReads")
reads_counts_df_raw_count_only$Reads <- "Maintained reads"

reads_counts_df_raw_filtered_only <- reads_counts_df_with_raw %>% select(-"TotalReads") %>% rename("Count" = "filtered")
reads_counts_df_raw_filtered_only$Reads <- "Reads processing"

### Bind the rows of the two just prepared tables
melted_reads_count <- rbind(reads_counts_df_raw_count_only, reads_counts_df_raw_filtered_only)

### Write this table
write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
read_filtering <- snakemake@input[["read_filtering"]]
taxonomic_filtering <- snakemake@input[["taxonomic_filtering"]]
multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]

## Output
raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]

## Load needed libraries
library("phyloseq"); packageVersion("phyloseq")
library("data.table"); packageVersion("data.table")
library("dplyr"); packageVersion("dplyr")

## Read files
read_filtering_physeq <- readRDS(read_filtering)
tax_filtering_physeq <- readRDS(taxonomic_filtering)
multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Create a table with the number of raw reads and filtered reads
### Raw reads
multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total of raw sequences. Since it is counted twice (R1, R2), keep only R1.
multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1" in the label

### Filtered reads
filtered_reads_counts_df <- data.table(as(sample_data(read_filtering_physeq), "data.frame"), TotalReads_processing = sample_sums(read_filtering_physeq), keep.rownames = TRUE)
setnames(filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new dataframe -> Sample

### Taxonomically filtered reads
tax_filtered_reads_counts_df <- data.table(as(sample_data(tax_filtering_physeq), "data.frame"), TotalReads_taxonomy = sample_sums(tax_filtering_physeq), keep.rownames = TRUE)
setnames(tax_filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new dataframe -> Sample
tax_filtered_reads_counts_df <- select(tax_filtered_reads_counts_df, c("TotalReads_taxonomy","Sample"))

### Join the columns of interest
merged_columns <- merge(filtered_reads_counts_df, multi_QC_report, by=c("Sample"))
merged_columns <- merge(merged_columns, tax_filtered_reads_counts_df, by = c("Sample"))

### Calculate the differences
merged_columns$reads_processing <- merged_columns$FastQC_mqc.generalstats.fastqc.total_sequences - merged_columns$TotalReads_processing
merged_columns$tax_filtration <- merged_columns$TotalReads_processing - merged_columns$TotalReads_taxonomy

### Keep three versions of the table, where "Count" is respectively the final number of reads, the reads lost during processing and the reads removed by taxonomic filtering
#### Keep the final count of reads, remove the two other columns
final_reads_count <- merged_columns %>% rename("Count" = "TotalReads_taxonomy")
final_reads_count$Reads <- "Maintained reads"
final_reads_count[ ,c("reads_processing", "tax_filtration")] <- list(NULL)

#### Keep the reads filtered during processing, remove the two other columns
processing_reads_count <- merged_columns %>% rename("Count" = "reads_processing")
processing_reads_count$Reads <- "Reads processing"
processing_reads_count[ ,c("tax_filtration", "TotalReads_taxonomy")] <- list(NULL)

#### Keep the reads removed by taxonomic filtering, remove the two other columns
tax_filtered_reads_count <- merged_columns %>% rename("Count" = "tax_filtration")
tax_filtered_reads_count$Reads <- "Tax filtered reads"
tax_filtered_reads_count[ ,c("reads_processing", "TotalReads_taxonomy")] <- list(NULL)

#### Bind the rows of the three just prepared tables
melted_reads_count <- rbind(final_reads_count, tax_filtered_reads_count)
melted_reads_count <- rbind(melted_reads_count, processing_reads_count)

# Write this table
write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
shell:
    '''
    picrust2_pipeline.py \
        --in_traits COG,EC,KO,PFAM,TIGRFAM \
        -s {input[seqs]} \
        -i {input[table]} \
        -o {output}/output \
        -p {threads}
    '''
Support
- Future updates