Snakemake Workflow: De Novo Transcriptome Assembly with HISAT2 and StringTie

snakemake-hisat2-stringtie

Snakemake workflow: De novo transcriptome assembly with HISAT2 and StringTie.

This pipeline implements all steps of the 'new Tuxedo suite' workflow published by Pertea et al. (1). It performs genome-guided de novo transcriptome assembly on bulk RNA-seq data with default parameters; the downstream analysis in R is not included. In addition, the pipeline performs:

adapter and quality trimming
read quality control with FastQC
generation of a representative alignment file and BED file for visualizing read coverage.

(1) Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M.; Leek, Jeffrey T.; Salzberg, Steven L. (2016): Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. In: Nature Protocols 11, 1650-1667.

Code Snippets

__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "bow@bow.web.id"
__license__ = "BSD"

from snakemake.shell import shell

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Run log
log = snakemake.log_fmt_shell()

# Input file wrangling
reads = snakemake.input.get("reads")
if isinstance(reads, str):
    input_flags = "-U {0}".format(reads)
elif len(reads) == 1:
    input_flags = "-U {0}".format(reads[0])
elif len(reads) == 2:
    input_flags = "-1 {0} -2 {1}".format(*reads)
else:
    raise RuntimeError(
        "Reads parameter must contain at least 1 and at most 2"
        " input files.")

# Executed shell command
shell(
    "(hisat2 {extra} --threads {snakemake.threads}"
    " -x {snakemake.params.idx} {input_flags}"
    " | samtools view -Sbh -o {snakemake.output[0]} -)"
    " {log}")

Python Snakemake SAMtools From line 1 of scripts/hisat2_wrapper0.34.0.py
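The input handling in the wrapper above can be isolated into a small function to make the single-end and paired-end cases easier to see. This is a minimal sketch, not part of the original wrapper: the function name `hisat2_input_flags` and the example file names are hypothetical, while `-U`, `-1` and `-2` are the HISAT2 options the wrapper already uses.

```python
def hisat2_input_flags(reads):
    """Build the HISAT2 read arguments for single-end or paired-end input.

    Hypothetical helper that mirrors the branching in the wrapper above.
    """
    # A bare string is treated as one single-end file.
    if isinstance(reads, str):
        reads = [reads]
    if len(reads) == 1:
        return "-U {0}".format(reads[0])
    if len(reads) == 2:
        return "-1 {0} -2 {1}".format(*reads)
    raise RuntimeError(
        "Reads parameter must contain at least 1 and at most 2 input files.")


# Example file names are made up for illustration.
print(hisat2_input_flags("sample.fastq.gz"))
print(hisat2_input_flags(["sample_R1.fastq.gz", "sample_R2.fastq.gz"]))
```

Any other number of files raises a `RuntimeError`, as in the wrapper.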

wrapper:
	"0.35.0/bio/trimmomatic/pe"

SnakeMake From line 74 of master/Snakefile

script:
	"scripts/hisat2_wrapper0.34.0.py"

SnakeMake From line 94 of master/Snakefile

wrapper:
	"0.35.1/bio/samtools/sort"

SnakeMake From line 107 of master/Snakefile

wrapper:
	"0.35.1/bio/samtools/index"

SnakeMake From line 115 of master/Snakefile

shell:
	"stringtie -v --merge -p {threads} -G {input.anno} -o {output} {input.gtf} 2> {log}"

SnakeMake StringTie From line 150 of master/Snakefile

shell:
	"gffcompare -G -r {input.anno} -o output/gffcompare/GFFcompare {input.st_transcripts}"

SnakeMake gffcompare From line 164 of master/Snakefile

shell:
	"stringtie -e -B -p {threads} -G {input.anno} -o {output} {input.sbam}"

SnakeMake StringTie From line 178 of master/Snakefile

shell:
	"""
	cd output
	prepDE.py -i ballgown
	cd ..
	"""

SnakeMake From line 190 of master/Snakefile

wrapper:
	"0.34.0/bio/fastqc"

SnakeMake FastQC From line 212 of master/Snakefile

wrapper:
	"0.34.0/bio/fastqc"

SnakeMake FastQC From line 226 of master/Snakefile

wrapper:
	"0.35.1/bio/multiqc"

SnakeMake MultiQC From line 239 of master/Snakefile

wrapper:
	"0.34.0/bio/fastqc"

SnakeMake FastQC From line 257 of master/Snakefile

wrapper:
	"0.35.1/bio/multiqc"

SnakeMake MultiQC From line 274 of master/Snakefile

shell:
    """
    nreads=$(samtools view -c {input})
    rate=$(echo "scale=5;100000/$nreads" | bc)
    sambamba view -f bam -t 5 --subsampling-seed=42 -s $rate {input} | samtools sort -m 4G -@ 8 -T - > {output} 2> {log}
    """

SnakeMake SAMtools Sambamba From line 294 of master/Snakefile
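The subsampling rule above derives a sampling fraction of 100000 divided by the read count and passes it to `sambamba view -s`. The arithmetic can be sketched in Python as follows. The function name `subsample_fraction` is hypothetical, and two behaviours are assumptions that the shell snippet does not implement: the fraction is capped at 1.0 for files with fewer reads than the target, and an empty file raises an error instead of dividing by zero. The sketch also does not reproduce the truncation to five decimals that `bc` applies with `scale=5`.

```python
def subsample_fraction(nreads, target=100000):
    """Return the fraction of reads to keep so that roughly `target` remain.

    Hypothetical helper; the cap at 1.0 and the empty-input check are
    assumptions that are not present in the shell command above.
    """
    if nreads <= 0:
        raise ValueError("The alignment file contains no reads.")
    return min(1.0, target / nreads)


# A file with 2.5 million reads keeps about 4 % of them.
print(subsample_fraction(2500000))
# A file below the target size is kept whole.
print(subsample_fraction(50000))
```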

wrapper:
    "0.35.1/bio/samtools/merge"

SnakeMake From line 311 of master/Snakefile

wrapper:
    "0.35.1/bio/samtools/index"

SnakeMake From line 323 of master/Snakefile

shell:
    "bamCoverage -b {input.bam} -o {output}"

SnakeMake DeepTools From line 336 of master/Snakefile

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc in a temporary directory: multiple jobs writing to the
# same fastqc output dir can cause race conditions.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")

Python Snakemake FastQC From line 3 of fastqc/wrapper.py
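The wrapper above locates the FastQC report files by stripping the extension from the input file name and appending `_fastqc.html` and `_fastqc.zip`. The sketch below reproduces that name derivation on its own so it can be tried without running FastQC. `basename_without_ext` is copied from the wrapper; `expected_fastqc_outputs`, the directory name and the file names are hypothetical and used only for illustration.

```python
from os import path


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    # Drop two suffixes for gzipped files (e.g. ".fastq.gz"), otherwise one.
    split_ind = 2 if base.endswith(".gz") else 1
    return ".".join(base.split(".")[:-split_ind])


def expected_fastqc_outputs(input_path, outdir):
    """Hypothetical helper: paths where the wrapper looks for the reports."""
    base = basename_without_ext(input_path)
    return (path.join(outdir, base + "_fastqc.html"),
            path.join(outdir, base + "_fastqc.zip"))


# Example paths are made up for illustration.
print(basename_without_ext("reads/sample_R1.fastq.gz"))
print(expected_fastqc_outputs("reads/sample_R1.fastq.gz", "tmpdir"))
```

Only the last suffix (or the last two for `.gz` files) is removed, so `sample.trimmed.fastq.gz` yields `sample.trimmed`.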

__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (1 + gzipped_input_files + gzipped_output_files)
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        pigz_output_threads = \
            (available_threads - pigz_input_threads * gzipped_input_files) // (1 + gzipped_output_files) \
                if gzipped_output_files != 0 else 0
        trimmomatic_threads = available_threads - pigz_input_threads * gzipped_input_files - \
                              pigz_output_threads * gzipped_output_files
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads,
            filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads,
            compression_level=compression_level,
            filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [snakemake.output.r1, snakemake.output.r1_unpaired,
                snakemake.output.r2, snakemake.output.r2_unpaired]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files,
    output_files,
    snakemake.threads
)

input_r1, input_r2 = [compose_input_gz(filename, input_threads) for filename in input_files]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level) for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)

Python Snakemake Trimmomatic From line 12 of pe/wrapper.py
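To see how the Trimmomatic wrapper above splits threads between Trimmomatic and the `pigz` processes, the function can be run on a few inputs. The sketch below copies the logic of `distribute_threads` from the wrapper and only adds example calls; the file names and thread counts are made up for illustration.

```python
def distribute_threads(input_files, output_files, available_threads):
    """Same logic as in the wrapper above, reproduced for experimentation."""
    gz_in = sum(1 for f in input_files if f.endswith(".gz"))
    gz_out = sum(1 for f in output_files if f.endswith(".gz"))
    per_process = available_threads // (1 + gz_in + gz_out)
    if per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_in = min(4, per_process) if gz_in != 0 else 0
        pigz_out = ((available_threads - pigz_in * gz_in) // (1 + gz_out)
                    if gz_out != 0 else 0)
        trimmomatic = available_threads - pigz_in * gz_in - pigz_out * gz_out
    else:
        # not enough threads for pigz
        pigz_in = 0
        pigz_out = 0
        trimmomatic = available_threads
    return trimmomatic, pigz_in, pigz_out


# Hypothetical file names: two gzipped inputs, four gzipped outputs.
inputs = ["s_R1.fastq.gz", "s_R2.fastq.gz"]
outputs = ["t_R1.fastq.gz", "t_R1_unp.fastq.gz",
           "t_R2.fastq.gz", "t_R2_unp.fastq.gz"]

print(distribute_threads(inputs, outputs, 16))  # (4, 2, 2)
print(distribute_threads(inputs, outputs, 8))   # (2, 1, 1)
print(distribute_threads(inputs, outputs, 4))   # (4, 0, 0)
```

With fewer threads than processes (4 threads for 7 processes here), every thread goes to Trimmomatic and the `compose_*` helpers fall back to plain file names.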

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")

Python Snakemake MultiQC From line 3 of multiqc/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of index/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell

# Samtools takes additional threads through its option -@
# One thread for samtools merge
# Other threads are *additional* threads passed to the '-@' argument
threads = (
    "" if snakemake.threads <= 1
    else " -@ {} ".format(snakemake.threads - 1)
)

shell("samtools merge {threads} {snakemake.params} "
      "{snakemake.output[0]} {snakemake.input}")

Python Snakemake SAMtools From line 1 of merge/wrapper.py
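The thread handling in the samtools wrapper above relies on `-@` meaning additional threads, so one is subtracted from the Snakemake thread count. A minimal sketch of that conversion, with the hypothetical function name `samtools_threads_flag`:

```python
def samtools_threads_flag(threads):
    """Translate a total thread count into samtools' '-@' argument.

    Hypothetical helper mirroring the expression in the wrapper above:
    one thread is used by samtools itself, the rest are passed as
    *additional* threads.
    """
    return "" if threads <= 1 else " -@ {} ".format(threads - 1)


# One thread: no flag. Four threads: three additional threads.
print(repr(samtools_threads_flag(1)))
print(repr(samtools_threads_flag(4)))
```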

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = (
    "" if snakemake.threads <= 1
    else " -@ {} ".format(snakemake.threads - 1)
)

shell(
    "samtools sort {snakemake.params} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")