Schneider NET-Seq pipeline

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This pipeline is inteded for the use in the Schneider lab on UAB's Cheaha HPC cluster. It utilizes snakemake to organize and execute the pipeline, allowing for the parallelization.

File Overview

In the main pipeline folder you will see several files and a folder labeled data. Files and their uses are as listed:

Snakefile
- This file contains the rules that the pipeline will apply to your sequencing reads. This should not need to be modified unless you need to change steps to the pipeline or maybe if you start to tweak parameters for specific tools. It is possible if the sequencing adapters sequences change you may have to make those changes in this file. But for the most part this file should stay same.
cluster_config.yaml
- This file contains the parameters to run your pipeline on Cheaha. This file and config yaml will be the most modified. I will go into detail on what changes may be necessary later in this document.
config.yaml
- This file contains the names and locations for your fastq files. The default config file has names of dummy samples that aren't include. The names of the files will have to be changed before running the pipeline.
reset.sh
- This is just a handy script to clear the output folder and any .err and .out files from Cheaha. It is not necessary, but it can be helpful if you are making significant changes and are debugging as it resets the folder back to default.
runpipeline.sh
- This script runs the pipeline. It calls snakemake to use the Snakefile with the cluster_config.yaml settings. Only one modification is necessary in this file and that would be to possibly change the number of jobs.
tempdirclear.sh
- This script clears temporary directories for steps ca3 and ca5. Snakemake actually can delete the temp .fastqs itself but can't delete the folders so thats all this does. It is not necessary, but it can help keep your output folder cleaner and more organized.

There is one directory called data with three directories inside:

gtf
- In this folder there is genes.gtf which contains the gene annotations for the reference genome this pipeline uses (basically the locations of genes in the raw sequence). Additionally there is a notes.txt file which contains the specific modifications made to this .gtf file.
ref
- SacCer3onerDNArepeat.fna is the reference genome used for this pipeline. Significant edits are made in this genome basically masking out repeated regions of the rDNA with Ns. This allows for the rDNA reads to be uniquely mapped according to the aligner. Details on modifications are on the notes.txt file in this folder.
- STARindex.job and STARindex.sh are the files used to index the genome. You should only have to run these scripts once. I typically split my scripts that I plan on submitting through sbatch into a .sh file (which has the base command I want to run) and a .job file (which simply has the slurm parameters and calls the .sh file). I'm not sure if this is the best practice but it works. Details about how to run these will be in the initial set up section.
samples
- This folder is empty except for a placeholder file. This is where your .fastq's will go.

Initial Set Up

To begin, download the folder to your user data storage in Cheaha. To do this you can either download the ZIP file from this page or with the command
wget https://github.com/nathanbellis/SchneiderNETSeqPipeline/archive/refs/heads/main.zip
from there you can unzip the zipfile with unzip main.zip Since all of our samples are done in S. cerevisiae , you should only need to index the genome once. The yeast genome is small enough it should handle copying and pasting to and from scratch pretty well.

Indexing the Genome

Navigate to the data/ref/ folder and run the STARindex.job with the following command.
sbatch STARindex.job
This should only need to be run once unless you make modifications to the genome file or use a different genome.

Misc. Changes

In the cluster_config.yaml you will want to go in and change the account parameter from my username nfbellis to yours.
You can delete the placeholder file in the data/samples folder
You may have to change the permissions to all the .sh files to make them executable. For this use the command: chmod +x filename.sh

Running the Pipeline

STEP 1:
Copy the main folder and all files and sub-folders to scratch space.
STEP 2:
Copy your fastq files into the data/samples folder.

STEP 3:
Edit config.yaml

samples:
 SRR13733831: data/samples/SRR13733831.fastq
 SRR13733832: data/samples/SRR13733832.fastq
 SRR13733833: data/samples/SRR13733833.fastq
 SRR13733834: data/samples/SRR13733834.fastq
 SRR13733835: data/samples/SRR13733835.fastq
 SRR13733836: data/samples/SRR13733836.fastq

samples:
 yoursamplename1: data/samples/yoursamplename1.fastq
 yoursamplename2: data/samples/yoursamplename2.fastq
 yoursamplename3: data/samples/yoursamplename3.fastq
 yoursamplename4: data/samples/yoursamplename4.fastq
 yoursamplename5: data/samples/yoursamplename5.fastq
 yoursamplename6: data/samples/yoursamplename6.fastq

You can include as many or as few samples as you if you are running it for the first time it may be wise to just run one to make sure everything runs well.

STEP 4:
Edit cluster_config.yaml There shouldn't need to be too many modifications made in this file. All of the jobs get submitted to the express partition with one CPU unless they specifically are allocated below in this file. All jobs are given one hour which is usually more than enough. The htseq count step typically runs in 20-30 minutes so it is possible that for larger files you may want to add the following addition to cluster_config.yaml
```
htseq_count:
 time: 2:00:00 # Or however long you think it may need. 
```
This can also be applied to any steps where one hour is not enough.

Additionally fqtrim needs additional memory. For most of the steps 1 gigabyte (1G in the cluster_config.yaml) is enough, however fqtrim reads the whole file into memory so I have been changing that parameter to the size of your largest file. This pipeline was developed using a max fastq size of 9.75GB, hence the 10G in
```
fqtrim_collapse:
 mem-per-cpu: 10G # Change this to be larger than your largest fastq
```
STEP 5:
Edit runpipeline.sh Only one modification really needs to be made here and that is changes the --jobs parameter. I developed this pipeline primarily by running 6 fastq's at a time, so I chose 12 jobs as that is the theoretical max number of jobs that could be run at one time according to the Snakefile rules. Typically I would just set this to double the number of fastq files you plan on running at a time
```
snakemake --snakefile Snakefile \
--cluster "sbatch -A {cluster.account} \ -t {cluster.time} \ -p {cluster.partition} \ -N {cluster.nodes} \ -c {cluster.cpus-per-task} \ -o {cluster.output} \ -e {cluster.error} \ -n {cluster.ntasks} \ --mem-per-cpu {cluster.mem-per-cpu}" \
--cluster-config cluster_config.yaml \
--jobs 12 # Change this to double the number of fastqs you want to run
```
STEP 6:
Run the pipeline
First load snakemake with:
module load snakemake
Make sure you are in the main folder with the Snakemake and yaml files. Before I run the pipeline I like to do a dry run with the command:
snakemake -np
This will use the Snakefile to piece together the pipeline and give you a roadmap of the entire workflow and all the jobs with out actually running the commands. This is especially useful if you make any modifications in the Snakefile like adding or removing a rule

If everything looks good you are ready to run the pipeline with the following:
./runpipeline.sh
The ./ basically just lets the prompt know that this script is in the current directory. If it doesn't run make sure the file is executable. Look at the initial set up section for more information about changing the permissions.
STEP 7:
Monitor the pipeline
While the jobs are submitted to the cluster via SLURM and don't require the session to stay open, snakemake is running in the foreground submitting multiple jobs as they are needed. For this reason the window running the pipeline will need to be kept open. I suggest that you keep the window open the first time you run the pipe line and periodically check on it to make sure the jobs are running and being submitted successfully. You can check on the jobs through the jobs tab on the research computing homepage or by opening a new shell window and using squeue -u USERNAME command. I found the best way to monitor was to use the job tab to monitor jobs as it provided more information, and then have another window open where I would check the output files.

If all runs smoothly there are ways to run Snakemake in the background, or another window. If this is something you find you would like to learn. Look into the linux commands & , nohup , or tmux , which all provide ways of running snakemake in the background or even when you are logged out.
STEP 8 :
Copy the output file
The pipeline will put all resulting files into a folder labeled output. When the full pipeline is run and there are no more jobs running, copy the output folder and move it to your USER_DATA folder for more long term storage. From here you can take the files and do further analysis in R. Once data has been moved and you are satisfied with the results, you can then delete the main folder from scratch keeping the scratch space clean.

Step by step breakdown

rule all

rule all:
input: expand("output/sorted_reads/{sample}sorted.bam.bai", sample=config["samples"]), expand("output/rDNAcounts/{sample}plus.bed", sample=config["samples"]), expand("output/rDNAcounts/{sample}minus.bed", sample=config["samples"]), expand("output/htseq/{sample}counts.txt", sample=config["samples"]), expand("output/fastqcreports/{sample}_fastqc.html", sample=config["samples"]), expand("output/fastqcreports/{sample}fqt_fastqc.html", sample=config["samples"]), expand("output/fastqcreports/{sample}ca3_fastqc.html", sample=config["samples"]), expand("output/fastqcreports/{sample}ca5_fastqc.html", sample=config["samples"])

This rule tells snakemake what files should be created at the end of the pipeline. It then backtracks and works out what rules it needs to run based on matching input and output. You want to make sure that all files that all files that have a terminal endpoint are included in this rule. For example the first rule points to the {sample}sorted.bam.bai. No other rule uses this bai file so I need to tell snakemake to make it. However you will see I don't need to call the .bam file because it's creation and use is implicit as a necessary step to created the .bed files.

rule fqtrim_collapse
```
rule fqtrim_collapse:
 input:
 "data/samples/{sample}.fastq"
 output:
 temp("output/fqtctemp/{sample}fqt.fastq")
 params:
 prefix="{sample}",
 folder="output/fqtctemp"
 shell:
 """
 module load fqtrim
 fqtrim -C -o fastq --outdir {params.folder} {input}
 mv output/fqtctemp/{params.prefix}.fastq output/fqtctemp/{params.prefix}fqt.fastq
 """
```
This rule works on the original fastq files to collapse duplicate reads due to PCR. It relies on 5' adapters with UMI so it must be done before they are removed. -C is the parameter to indicate collapse. This is mentioned in the running the pipeline, but the -C command reads the entire file into memory so you need to request slightly more memory than your largest file via SLURM
fqtrim documentation
rule cutadapt_5
```
rule cutadapt_5:
input:
 "output/fqtctemp/{sample}fqt.fastq"
output:
 temp("output/ca5temp/{sample}ca5.fastq")
threads:
 8
shell:
 """
 module load cutadapt
 cutadapt -g AGNNNNNNNNTG --no-indels -j {threads} -o {output} {input}
 """
```
This rule trims the 5' adapter according to the sequence AGNNNNNNNNTG, where N is any nucleotide. -g indicates 5' adapter, and --no-indels means when matching no insertions or deletions will be allow. Unlike fqtrim cutadapt can be multithreaded to speed up performance so I have told cutadapt to use 8 cores, this must also be requested in the SLURM request via the cluster_config.yaml file or else it will not multithread. This applies to all other rules that use multithreading as well.
cutadapt documentation
rule cutadapt_3_1
```
rule cutadapt_3_1:
 input:
 "output/ca5temp/{sample}ca5.fastq"
 output:
 temp("output/ca3temp/{sample}ca31.fastq")
 threads:
 8
 shell:
 """
 module load cutadapt
 cutadapt -a CTGTAGGCACCAT -j {threads} --no-indels -e 0.10 -m 12 -o {output} {input}
 """
```
This rule is similar to cutadapt_5 except it trims the 3' adapter. The sequence for this is CTGTAGGCACCAT. If the 3' adapter sequence changes, this will need to be changed. Like the previous step --no-indels is used. -e is the error rate. -m 12 is the minimum read length, meaning if a sequence is trimmed to shorter than 12 nts it is discarded.
rule cutadapt_3_2
```
rule cutadapt_3_2:
 input:
 "output/ca3temp/{sample}ca31.fastq"
 output:
 temp("output/ca3temp/{sample}ca3.fastq")
 threads:
 8
 shell:
 """
 module load cutadapt
 cutadapt -a TAGGCACCATCAA -j {threads} --no-indels -e 0.10 -m 12 -o {output} {input}
 """
```
This is the same rule and settings as the step above however it runs a second 3' trim on the sequence minus the first two nts. The reason for this is that the 5' adapter ends in NTG so occasionally when the 5' adapter is trimmed it will also remove the beginning of the 3' adapter " CTG TAGGCACCAT" if the read is just adapters. This means that the subsequent 3' trimming will not recognize the 5' adapter. This step is not necessary but it does help clean up some off target reads.

rule fastqc_original

rule fastqc_original:
 input:
 "data/samples/{sample}.fastq"
 output:	
 zip="output/fastqcreports/{sample}_fastqc.zip",
 html="output/fastqcreports/{sample}_fastqc.html"
 params:
 path ="output/fastqcreports/"
 shell:
 """
 module load FastQC
 fastqc -o {params.path} {input}
 """

This runs a fastqc command on the original fastqs. This is a helpful tool for gauging the quality of your initial reads.
fastqc documentation

rule fastqc_fqt
This runs the same rule as fastqc_original but on the fastq after the fqtrim -C rule. It is helpful to have the fastqc's to track the changes being made to the fastq before aligning.
rule fastqc_ca5
This runs the same rule as fastqc_original but on the fastq after the cutadapt_5 rule It is helpful to have the fastqc's to track the changes being made to the fastq before aligning.
rule fastqc_ca3
This runs the same rule as fastqc_original but on the fastq after the cutadapt_3_2 rule It is helpful to have the fastqc's to track the changes being made to the fastq before aligning.

rule star_align

 rule star_align:
 input:
 fastq="output/ca3temp/{sample}ca3.fastq"
 params:
 starfolder="data/ref/",
 prefix="output/STAR_output/{sample}"
 output:
 bam="output/STAR_output/{sample}Aligned.sortedByCoord.out.bam",
 lfinal="output/STAR_output/{sample}Log.final.out",
 lprog="output/STAR_output/{sample}Log.progress.out",
 lout="output/STAR_output/{sample}Log.out",
 lsj="output/STAR_output/{sample}SJ.out.tab"
 threads:
 8
 shell:	
 """
 module load STAR
 STAR --runThreadN {threads} \
 --genomeDir {params.starfolder} \
 --runMode alignReads \
 --limitBAMsortRAM 40000000000 \
 --alignIntronMax 1 \
 --outFilterMultimapNmax 20 \
 --outFilterMismatchNoverLmax 0.05 \
 --outFilterScoreMinOverLread 0 \
 --outFilterMatchNminOverLread 0 \
 --readFilesIn {input} \
 --outFileNamePrefix {params.prefix} \
 --outSAMtype BAM SortedByCoordinate
 """

This is the alignment step. STAR aligner is used to map reads to the yeast genome. The genome file parameter is the starfolder in params:. This would need to be changed if you planned on using a different genome. There are many parameters all of which are explained in the STAR documentation. STAR documentation

rule move_bam

rule move_bam: input: "output/STAR_output/{sample}Aligned.sortedByCoord.out.bam" output: "output/sorted_reads/{sample}sorted.bam" shell: """ mv {input} {output} ""

This step simply moves the bam file output by STAR to a new folder for less clutter.

rule samtools_index
```
 rule samtools_index:
 input:
 "output/sorted_reads/{sample}sorted.bam"
 output:
 "output/sorted_reads/{sample}sorted.bam.bai"
 threads:
 8
 shell:
 """
 module load SAMtools
 samtools index -@ {threads} {input}
 """
```
This step creates an index file for the bam file created by STAR. It is not necessary for any downstream rules in this pipeline, however it is necessary for many tools that use .bam files. bam files must be sorted prior to indexing, however STAR can sort as it aligns which is why there is not a separate samtools_sort step. While not a major part of the pipeline, I have included documentation for both samtools and an overview of .bam and .sam file format. This is useful if you want to access the quality of the alignment at a single read level.
samtools documentation
sam/bam file format
rule htseq_gene_counts
```
rule htseq_gene_counts:
 input:
 "output/sorted_reads/{sample}sorted.bam"
 output:
 "output/htseq/{sample}counts.txt"
 shell:
 """
 module load HTSeq
 htseq-count -s reverse {input} ./data/gtf/genes.gtf > {output}
 """
```
This step creates gene counts for your aligned reads as if you were doing mRNA seq. This will not be a heavily used step for analysis but I find it's helpful data for quality control. For good Pol I NET-seq reads, the vast majority of the reads should be coming from rRNA regions (25S, 18S, 5.8S, ETS1/2, ITS1/2) if there are alot of protein coding genes or if the ITS and ETS regions have low reads this could be an indication of poor IP. A helpful command for looking at these reads is sort -nrk 2 file.counts | less which sorts the counts by highest value.
rule bedtools_gc_plus
```
rule bedtools_gc_plus:
 input:
 "output/sorted_reads/{sample}sorted.bam"
 output:
 "output/counts/{sample}plus.bed"
 shell:
 """
 module load BEDTools
 bedtools genomecov -d -5 -strand + -ibam {input} > {output}
 """
```
This rule creads a .bed file that counts where the 5' end is at an individual nucleotide leve over the entire genome, set by the -5 parameter. This rule specifically runs for the plus strand.
bedtools genomecov documentation

rule bedtools_gc_minus

rule bedtools_gc_plus:
 input:
 "output/sorted_reads/{sample}sorted.bam"
 output:
 "output/counts/{sample}plus.bed"
 shell:
 """
 module load BEDTools
 bedtools genomecov -d -5 -strand + -ibam {input} > {output}
 """

This rule is the exact same as above except the + is change to a - for the minus strand.

rule awk_rDNA_trim
```
rule awk_rDNA_trim: input: "output/counts/{sample}.bed" output: "output/rDNAcounts/{sample}.bed" shell: """ awk '($1 == "XII" && $2 >= 451575 && $2 < 458433)' {input} > {output} """
```
This rule trims the bed files to just the rDNA region. The .bed file for the entire genome is relatively large and quite unwieldy in terms of data so this step trims the data to just the region we are interested (Chromosome XII from position 451575 to 458433) making it easier to import for downstream analysis on R, which can be done on a local computer. This step can also be done on R from the full .bed files, which are also included in the final output folder. This step just captures the transcribed region so if you want the NTS reads use the full .bed file and trim in R or change these parameters to the coordinates you want.

Code Snippets

	shell:
		"""
		module load fqtrim
              	fqtrim -C -o fastq --outdir {params.folder} {input}
		mv output/fqtctemp/{params.prefix}.fastq output/fqtctemp/{params.prefix}fqt.fastq
              	"""

SnakeMake From line 22 of main/Snakefile

shell:
	"""
	module load cutadapt
	cutadapt -g AGNNNNNNNNTG --no-indels -j {threads} -o {output} {input}
	"""

SnakeMake Cutadapt From line 36 of main/Snakefile

shell:
	"""
	module load cutadapt
	cutadapt -a CTGTAGGCACCAT -j {threads} --no-indels -e 0.10 -m 12 -o {output} {input}
	"""

SnakeMake Cutadapt From line 49 of main/Snakefile

shell:
        """
        module load cutadapt
        cutadapt -a TAGGCACCATCAA -j {threads} --no-indels -e 0.10 -m 12 -o {output} {input}
        """

SnakeMake Cutadapt From line 62 of main/Snakefile

shell:
	"""
	module load FastQC
	fastqc -o {params.path} {input}
	"""

SnakeMake FastQC From line 77 of main/Snakefile

shell:
	"""
	module load FastQC
	fastqc -o {params.path} {input}
	"""

SnakeMake FastQC From line 91 of main/Snakefile

shell:
	"""
	module load FastQC
	fastqc -o {params.path} {input}
	"""

SnakeMake FastQC From line 106 of main/Snakefile

shell:
	"""
	module load FastQC
	fastqc -o {params.path} {input}
	"""

SnakeMake FastQC From line 121 of main/Snakefile

shell:	
	"""
	module load STAR
	STAR --runThreadN {threads} \
	--genomeDir {params.starfolder} \
	--runMode alignReads \
	--limitBAMsortRAM 40000000000 \
	--alignIntronMax 1 \
	--outFilterMultimapNmax 20 \
	--outFilterMismatchNoverLmax 0.05 \
	--outFilterScoreMinOverLread 0 \
	--outFilterMatchNminOverLread 0 \
	--readFilesIn {input} \
	--outFileNamePrefix {params.prefix} \
	--outSAMtype BAM SortedByCoordinate
	"""

SnakeMake STAR From line 141 of main/Snakefile

shell:
	"""
	mv {input} {output}
	"""

SnakeMake From line 163 of main/Snakefile

shell:
	"""
	module load SAMtools
	samtools index -@ {threads} {input}
	"""

SnakeMake SAMtools From line 175 of main/Snakefile

shell:
	"""
	module load HTSeq
	htseq-count -s reverse {input} ./data/gtf/genes.gtf > {output}
	"""

SnakeMake htseqcount HTSeq From line 186 of main/Snakefile

shell:
	"""
	module load BEDTools
	bedtools genomecov -d -5 -strand + -ibam {input} > {output}
	"""

SnakeMake BEDTools From line 198 of main/Snakefile

shell:
	"""
	module load BEDTools
	bedtools genomecov -d -5 -strand - -ibam {input} > {output}
	"""

SnakeMake BEDTools From line 209 of main/Snakefile

shell:
	"""
	awk '($1 == "XII" && $2 >= 458433 && $2 <= 467569)' {input} > {output}
	"""

SnakeMake From line 220 of main/Snakefile

ShowHide 9 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/nathanbellis/SchneiderNETSeqPipeline

Name: schneidernetseqpipeline

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

BAM SAM BEDTools Cutadapt FastQC HTSeq htseqcount SAMtools Snakemake STAR

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free