Repository to process ITS amplicon sequencing data and produce figures for Schneider et al (2021)
This repository contains code to analyse ITS amplicon sequencing data for comparison with RNA-seq data from the same samples. It uses the data collected for Haas et al (2018), but processes it with DADA2 denoising and Swarm clustering instead of OTU clustering.
Repository contents
The repository consists of two units that are run separately, and an additional folder with R scripts for analysis and plotting of the results:
1 demultiplex_wf -> download and demultiplex raw data
The folder contains instructions on how to run the workflow as a Docker container (a sketch of the container invocation follows this list). It will download the raw data from the ENA and demultiplex the sequences into per-sample files, as well as concatenate technical replicates.
2 workflow -> ITS amplicon sequencing data preprocessing workflow
The folder contains a snakemake workflow that will reproduce the preprocessing of the demultiplexed ITS amplicon sequencing data as used in the study.
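For orientation, a minimal sketch of what the container invocation could look like, assuming the folder ships a Dockerfile; the image tag (its_demux) and the mount path are placeholders, so follow the instructions inside demultiplex_wf for the exact commands:

# Build the demultiplexing image from the demultiplex_wf folder (tag is hypothetical)
docker build -t its_demux demultiplex_wf/
# Run it with the data directory mounted into the container (paths are placeholders)
docker run --rm -v $(pwd)/demultiplex_wf/data:/data its_demux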
How to use it
The workflow is run through snakemake, from the root folder of the repository (where this readme sits). To continue with the demultiplexed data, we first need to move it out of the demultiplex_wf subfolder:
mkdir $(pwd)/data
mv $(pwd)/demultiplex_wf/data/* $(pwd)/data/
Next we will create a conda environment ("its_wf") needed to execute the snakemake workflow, and then activate it:
conda env create -n its_wf -f $(pwd)/environment.yml
conda activate its_wf
Once the conda environment has been created successfully, we can execute the workflow with the following command:
snakemake -s $(pwd)/workflow/Snakefile -pr -j 4 --use-conda
This will output the final count matrix and other results (such as sequences for every Swarm OTU and taxonomic assignments) into the results/ folder.
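If you want to verify what snakemake will do before committing to the full run, the -n flag performs a dry run that lists the planned jobs without executing anything:

# Dry run: print the planned jobs and the reasons they are scheduled, then exit
snakemake -s $(pwd)/workflow/Snakefile -pr -j 4 --use-conda -n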
3 analysis -> R scripts to analyse ITS and RNA data
The folder contains scripts to reproduce the figures in the publication.
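The scripts are invoked with Rscript from the repository root; a hedged example, where the script name is a placeholder rather than a real file in this repository:

# Hypothetical invocation; see the analysis/ folder for the actual script names
Rscript --vanilla analysis/<script_name>.R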
Code Snippets
require(dada2, quietly=TRUE)

# Parameters: fall back to dada2 defaults when not set in the Snakemake config
if (is.null(snakemake@params$maxN)) {snakemake@params$maxN <- 0}
if (is.null(snakemake@params$truncQ)) {snakemake@params$truncQ <- 2}
if (is.null(snakemake@params$truncLen)) {snakemake@params$truncLen <- 0}
if (is.null(snakemake@params$maxEE_R1)) {snakemake@params$maxEE_R1 <- Inf}
if (is.null(snakemake@params$maxEE_R2)) {snakemake@params$maxEE_R2 <- Inf}
if (is.null(snakemake@params$rm_phix)) {snakemake@params$rm_phix <- TRUE}
if (is.null(snakemake@params$minLen)) {snakemake@params$minLen <- 0}
maxEE <- c(snakemake@params$maxEE_R1, snakemake@params$maxEE_R2)

# Performance
threads <- snakemake@threads

# Quality-filter and trim the paired-end reads
out <- filterAndTrim(fwd = snakemake@input$R1, filt = snakemake@output$R1,
                     rev = snakemake@input$R2, filt.rev = snakemake@output$R2,
                     maxN = snakemake@params$maxN, truncQ = snakemake@params$truncQ,
                     truncLen = snakemake@params$truncLen,
                     maxEE = maxEE, rm.phix = snakemake@params$rm_phix,
                     minLen = snakemake@params$minLen, multithread = threads,
                     verbose = TRUE)

saveRDS(out, snakemake@output$rds)
itsx_in <- snakemake@input$its_in
itsx_out <- snakemake@input$its_out
sq.nc <- snakemake@input$seqtab
out <- snakemake@output$seqtab
out_mock <- snakemake@output$mock
threads <- snakemake@threads
mock <- snakemake@params$mock
mock_samples <- strsplit(mock, ",")[[1]]

library(dada2); packageVersion("dada2")
library(dplyr)
library(Biostrings)

# Read the sequences before and after ITSx extraction
dna <- readDNAStringSet(itsx_in)
dna_itsx <- as.character(readDNAStringSet(itsx_out))

# Keep only sequences for which ITSx found an ITS region, and replace the
# full-length sequences with the extracted ITS sequences
seqtab.nc <- readRDS(sq.nc)
colnames(seqtab.nc) <- names(dna)
seqtab.nc <- seqtab.nc[, colnames(seqtab.nc) %in% names(dna_itsx)]
colnames(seqtab.nc) <- dna_itsx
seqtab.nc <- t(seqtab.nc)

# Remove sequences below 50 bp
if (any(nchar(getSequences(t(seqtab.nc))) < 50)) {
  dna_itsx <- dna_itsx[!(nchar(dna_itsx) < 50)]
  seqtab.nc <- seqtab.nc[nchar(rownames(seqtab.nc)) >= 50, ]
}

# At this point we can summarise counts over identical (extracted) sequences
seqtab.nc2 <- cbind.data.frame(sequence = rownames(seqtab.nc), seqtab.nc)
seqtab.nc3 <- group_by(seqtab.nc2, sequence) %>%
  summarise(across(everything(), sum))

# Split out mock-community samples, if any are present
if (length(intersect(colnames(seqtab.nc3), mock_samples)) > 0) {
  mock_cols <- append(c("sequence"), intersect(colnames(seqtab.nc3), mock_samples))
  mock <- seqtab.nc3 %>% select(all_of(mock_cols))
  seqtab.nc4 <- seqtab.nc3 %>% select(!any_of(mock_cols))
  saveRDS(mock, out_mock)
} else {
  seqtab.nc4 <- seqtab.nc3[, colnames(seqtab.nc3) != "sequence"]
}

rownames(seqtab.nc4) <- seqtab.nc3$sequence
seqtab.nc4 <- data.matrix(t(seqtab.nc4))

saveRDS(seqtab.nc4, out)
from argparse import ArgumentParser

from Bio.Seq import reverse_complement


def revcomp(seq):
    return reverse_complement(seq)


def main(args):
    r = revcomp(args.seq)
    print(r, end="")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("seq", type=str, help="Input sequence string")
    args = parser.parse_args()
    main(args)
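As a quick sanity check, the script above can be called directly from the shell; it writes the reverse complement of its argument to stdout, without a trailing newline:

python workflow/scripts/revcomp.py ACGTAACG
# prints: CGTTACGT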
shell:
    """
    fastq-dump {params.spots} --split-3 --gzip -O {params.data_dir} \
        {params.acc} > {log} 2>&1
    mv {params.data_dir}/{params.acc}_1.fastq.gz {output.R1}
    mv {params.data_dir}/{params.acc}_2.fastq.gz {output.R2}
    """
run:
    symlink(input[0], output[0])
script:
    "scripts/filter.R"
shell:
    """
    python workflow/scripts/revcomp.py {params.fwd} > {output.fwd_rc}
    python workflow/scripts/revcomp.py {params.rev} > {output.rev_rc}
    """
shell:
    """
    A=$(cat {input.fwd_rc})
    a=$(cat {input.rev_rc})
    cutadapt -g {params.FWD} -a $a -G {params.REV} -A $A -j {threads} \
        -n {params.n} -o {output.R1} -p {output.R2} --minimum-length {params.min_len} \
        {input.R1} {input.R2} > {log} 2>&1
    """
script:
    "scripts/filter.R"
shell:
    """
    Rscript --vanilla workflow/scripts/runDada2.R {params.fw_dir} {params.rv_dir} \
        {params.out_dir} {threads} > {log} 2>&1
    """
shell:
    """
    ITSx -i {input} -o {params.prefix}/itsx --cpu {threads} --multi_thread T \
        --preserve T --partial 50 --minlen 50 > {log} 2>&1
    """
script:
    "scripts/ProcessITSx.R"
shell:
    """
    Rscript --vanilla workflow/scripts/MergeLibraries.R {params.output_dir} \
        {params.dirnames} > {log} 2>&1
    """
shell:
    """
    swarm -t {threads} -d 3 -z --output-file {output.txt} \
        --seeds {output.seeds} {input} > {log} 2>&1
    """
shell:
    """
    Rscript --vanilla workflow/scripts/processSwarm.R {params.swarm_res} \
        {params.swarm_in} {params.swarm_res} > {log} 2>&1
    """
shell:
    """
    # Download and unpack the UNITE reference, then concatenate the fastas
    curl -o {params.dir}/unite.zip -L {params.unite_url} > {log} 2>&1
    unzip -d {params.dir} {params.dir}/unite.zip >> {log} 2>&1
    cat {params.dir}/sh*.fasta > {params.dir}/unite
    rm -rf {params.dir}/developer {params.dir}/unite.zip {params.dir}/*.fasta
    mv {params.dir}/unite {params.dir}/unite.fasta
    """
shell:
    """
    Rscript --vanilla workflow/scripts/rundada2TAX.R {input[0]} {input[1]} \
        {output[0]} {threads} > {log} 2>&1
    """