A Snakemake workflow to download and/or align reads to targets and produce useful outputs. With hisss, you can:
- Download reads for all samples from a study archived in NCBI's Sequence Read Archive (optional). If you'd like to ONLY do this, check out grabseqs, an easy-to-use utility that spun off from this project.
- Stringently align .fastq files to one or more target sequences using bowtie2
- Generate coverage maps for each positive sample
- Generate standardized summary tables with information on depth and breadth of coverage for each positive sample
Installing
Getting it up and running is hopefully simple:
git clone https://github.com/louiejtaylor/hisss
cd hisss
We use conda to handle dependencies; you can install Miniconda from here. Make a new conda environment, then install dependencies from requirements.txt like so:
conda create -n hisss
conda activate hisss
conda install -c bioconda -c conda-forge -c louiejtaylor --file requirements.txt
Conda is great for managing dependencies and environments without requiring any admin privileges.
Configuration
Hisss can run on both local and remote fastqs, either paired or unpaired. The options in config_template.yml should be self-explanatory; just replace the placeholders with the relevant paths to your samples and alignment targets.
The final step in setting up your config is to add your samples. We include two utilities to simplify adding samples to your config file, depending on where your data are located: list_SRA.py and list_samples.py.
SRA or MG-RAST data
If you're using data from the SRA, grabbing all the samples from a study is as simple as passing the project identifier (SRP#) to list_SRA.py like so:
python ./scripts/list_SRA.py SRP####### >> my_config.yml
This command appends nicely-formatted sample names to my_config.yml, along with some metadata of questionable utility. It also saves the full SRA metadata file as a .csv (so we recommend running this in your project directory). You also don't need to know beforehand whether the reads are paired- or single-end; as long as that information is in the SRA metadata, it'll be included.
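The paired/single-end detection comes from the SRA metadata itself. As an illustration only (this is not the actual list_SRA.py, but a minimal sketch assuming the standard SRA RunInfo CSV columns `Run` and `LibraryLayout`):

```python
import csv
import io

# Illustrative sketch: read SRA RunInfo-style CSV metadata and report
# each run's library layout (PAIRED vs SINGLE). "Run" and "LibraryLayout"
# are standard SRA RunInfo column names.
def layouts_from_runinfo(csv_text):
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Run"]: row["LibraryLayout"] for row in reader}

runinfo = """Run,LibraryLayout
SRR0000001,PAIRED
SRR0000002,SINGLE
"""
print(layouts_from_runinfo(runinfo))
# {'SRR0000001': 'PAIRED', 'SRR0000002': 'SINGLE'}
```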
A similar script is available for MG-RAST data, which can be used like so:
python ./scripts/list_MGRAST.py mgp###### >> my_config.yml
Local data
If you're running on local samples, use list_samples.py. Let's say your fastqs are paired, located in /project/fastq/, and are named like Sample_[sample_name]_R[pair].fastq:
python ./scripts/list_samples.py -pattern "Sample_{sample}_R{rp}.fastq" /project/fastq/ >> my_config.yml
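Conceptually, a pattern like this can be turned into a regular expression that pulls the sample name and read pair out of each filename. A minimal sketch of that idea (an illustration, not the actual list_samples.py):

```python
import re

# Hypothetical illustration of -pattern matching: convert the "{sample}"
# and "{rp}" placeholders into named regex groups, then extract both
# fields from a filename.
def parse_filename(pattern, filename):
    regex = re.escape(pattern)
    regex = regex.replace(r"\{sample\}", r"(?P<sample>.+)")
    regex = regex.replace(r"\{rp\}", r"(?P<rp>[12])")
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None

print(parse_filename("Sample_{sample}_R{rp}.fastq", "Sample_mouse01_R1.fastq"))
# {'sample': 'mouse01', 'rp': '1'}
```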
Running
To run, simply execute the following in the hisss root dir. The -p flag will print out the shell commands that will be executed. If you'd like to do a dry run (see the commands without running them), pass -np instead of -p.
snakemake -p --configfile path/to/my_config.yml all
If you're running on SRA data, we recommend using --restart-times, since we've encountered issues with downloads randomly failing:
snakemake -p --restart-times 5 --configfile path/to/my_config.yml all
And if you'd like to use hisss only to grab the data from SRA, pass the --notemp flag and specify download_only as the target rule:
snakemake -p --restart-times 5 --notemp --configfile path/to/my_config.yml download_only
There are many more useful options you can pass to snakemake that are beyond the scope of this tutorial; read more about them in the Snakemake documentation.
When you're done, leave the conda environment with:
conda deactivate
Troubleshooting
To run the dummy data (which should complete very quickly):
snakemake -p --configfile test_data/test_config.yml all
If you want to run the dummy data again after tinkering with the Snakefile or rules, you can clean up the test output like so:
cd test_data
bash clean_test.sh
Feel free to open an issue or tweet @Louviridae or @A2_Insanity if you have problems. Good luck!
Current workflow
(workflow DAG image, generated by graphviz)
Code Snippets
shell:
    """
    samtools depth -a {input} > {output.cov}
    if [ -s {output.cov} ]; then
        Rscript scripts/plot_coverage.R {output.cov} {params.plot};
    else
        echo "No valid alignments detected";
    fi
    """
shell:
    """
    cat {params.plot_dir}*.cov.depth.txt > {output}
    """
shell:
    "bowtie2-build --threads {threads} -f {input} {input}"
shell:
    """
    samtools view -bT {params.target} {input} > {output.bam}
    samtools sort -o {output.sorted} {output.bam}
    samtools index {output.sorted} {output.bai}
    """
shell:
    """
    samtools idxstats {input.bam} > {output}
    """
shell:
    """
    samtools view -b {input.bam} | genomeCoverageBed -ibam stdin | grep -v 'genome' | perl scripts/coverage_counter.pl > {output}
    """
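The coverage_counter.pl script isn't shown here, but conceptually this step turns genomeCoverageBed's per-depth histogram (columns: target, depth, positions at that depth, target length, fraction of target at that depth) into breadth of coverage per target by summing the fractions with depth greater than zero. A rough Python sketch of that calculation, as an illustration only:

```python
# Illustration only (the workflow uses scripts/coverage_counter.pl):
# compute breadth of coverage per target from genomeCoverageBed
# histogram lines: target, depth, n_positions, target_length, fraction.
def breadth_of_coverage(lines):
    covered = {}
    for line in lines:
        target, depth, _count, _length, fraction = line.split("\t")
        if int(depth) > 0:  # skip the depth-0 (uncovered) row
            covered[target] = covered.get(target, 0.0) + float(fraction)
    return covered

hist = [
    "phiX\t0\t1000\t5386\t0.185666",
    "phiX\t1\t2000\t5386\t0.371333",
    "phiX\t2\t2386\t5386\t0.443000",
]
print(breadth_of_coverage(hist))
```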
shell:
    """
    Rscript scripts/summarize_alignments.R {input.cov} {input.stats} {output}
    """
shell:
    """
    echo -e "Sample\tAlignTarget\tFractionCoverage\tTargetLength\tMappedReads" > {output}
    cat {params.align_dir}*.align.summary.txt >> {output}
    """
library(methods)
library(ggplot2)
library(magrittr)
library(reshape2)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()
#print(args)
cov <- args[6]
output <- args[7]

if(!file_test("-f", cov)) {
  stop("Coverage file not defined.")
}
if(is.na(output)) {
  stop("Output file not defined.")
}

#Convert to a dataframe if a positive hit--------------------------------------
if(!file.size(cov) == 0){
  cov_df <- read.table(cov, sep = "\t")
} else {
  quit()
}
colnames(cov_df) <- c("AlignTarget", "Position", "Count")

#Coverage Maps-----------------------------------------------------------------
##Function for creating plots.
plot_nuc_cov <- function(cov_df) {
  p.list <- lapply(sort(unique(cov_df$AlignTarget)), function(i) {
    ggplot(cov_df[cov_df$AlignTarget==i,], aes(x = Position, y = Count), fill = "black") +
      geom_col() +
      facet_wrap(~AlignTarget, scales = "free", ncol = 1) +
      ggtitle(NULL) +
      xlab("Position") +
      ylab("Coverage") +
      theme_bw(base_size = 14) +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.spacing = unit(1.2, "lines"),
            strip.background = element_rect(colour="white", fill="white"),
            legend.position="right",
            axis.title = element_text(size = rel(1.5)),
            axis.text = element_text(size = rel(1.25), color = "black"),
            strip.text = element_text(size = rel(1.25), color = "black"))
  })
  return(p.list)
}

#Plot alignments---------------------------------------------------------------
plots <- plot_nuc_cov(cov_df)
pdf(output, onefile = TRUE)
invisible(lapply(plots, print))
dev.off()
library(methods)

#Set up argument parser and QC-------------------------------------------------
args <- commandArgs()
#print(args)
cov <- args[6]
read <- args[7]
output <- args[8]

if(is.na(cov)) {
  stop("Coverage file not defined.")
}
if(is.na(read)) {
  stop("Mapped read file not defined.")
}
if(is.na(output)) {
  stop("Output file not defined")
}

#Define sample name------------------------------------------------------------
# cov <- "test-A.genomecoverage.txt"
# read <- "test-A.sorted.idxstats.tsv"
sample_name = sub(".genomecoverage.txt", "", cov)
sample_name = sub(".*\\/", "", sample_name)

##Build a dataframe of coverage and mapped reads-------------------------------
cov_df <- read.delim(cov, sep = "\t", header=F)
cov_df$V3 <- sample_name
colnames(cov_df) <- c("AlignTarget", "Coverage", "Sample")

read_df <- read.delim(read, sep = "\t", header=F)
read_df$V5 <- sample_name
colnames(read_df) <- c("AlignTarget", "TargetLength", "MappedReads",
                       "UnmappedReads", "Sample")
read_df <- subset(read_df, AlignTarget != "*",
                  select = c(Sample, AlignTarget, TargetLength, MappedReads))

#Merge data. Keep all observations
all_align_data <- merge(cov_df, read_df, by = c("Sample", "AlignTarget"),
                        all.x = TRUE, all.y = TRUE)

##Write output
write.table(all_align_data, sep = "\t", file = output, quote = FALSE,
            col.names = FALSE, row.names = FALSE)
Support
- Future updates