RNA-Seq Analysis Workflow for Single-End Data using qproject
RNA-Seq workflow for single-end data. The workflow can be downloaded and run on a cluster environment using the tool qproject, provided on GitHub: https://github.com/qbicsoftware/qproject
The workflow uses a module system to load the required software. Be sure to check the jobscript.sh file to see which software modules are required. The modules are loaded automatically when using qproject run to start the workflow; otherwise they have to be loaded manually.
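For a quick overview of the required software, one can list the module lines directly, assuming they appear as plain "module load" statements in the jobscript (the file's location within the project is not fixed by this document):

grep '^module load' jobscript.sh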
Use qproject to download the files. This also creates all folders necessary for the workflow.
qproject create -t . -w github:qbicsoftware/rnaseq
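The resulting directory layout, inferred from the folders referenced throughout this document (the exact set may depend on the qproject version), looks roughly like this:

data/   raw input files (.fastq, .fastq.gz)
etc/    configuration, e.g. params.json
ref/    reference files (gtf, bowtie2 index)
src/    the snakemake workflow
logs/   log files such as snake.err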
Be sure to add a config file "params.json" in etc, which should look like this:
{
"gtf": "path/to/gtf",
"indexed_genome": "path/to/genome/basename",
"stranded": "no",
"overlap_mode": "union",
"feature_type": "exon",
"gff_attribute": "gene_id",
"normalize_counts": "deseq2"
}
where indexed_genome and gtf are paths relative to ref. indexed_genome is the basename of a bowtie2 index, and gtf is the .gtf file of the reference genome.
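If no index exists yet, one can be built inside ref with bowtie2-build; the file name and module name below are hypothetical examples, not part of the workflow:

module load qbic/bowtie2   # assumed module name; adjust to your cluster
bowtie2-build ref/genome.fa ref/genome

With this layout, "indexed_genome" would be set to "genome" in params.json.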
The parameters stranded, overlap_mode, feature_type and gff_attribute are explained in the htseq documentation.
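These values are passed straight through to htseq-count. With the example params.json above, the call the workflow assembles (see the counting rule in the Code Snippets section below; the tophat output directory is shown as a placeholder) is equivalent to:

samtools view <tophat_output>/accepted_hits.bam | htseq-count -i gene_id -t exon -m union -s no - path/to/gtf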
Members of QBiC can download the data for analysis using qpostman: https://github.com/qbicsoftware/postman-cli
It should be installed on the computing stations and can be loaded with:
module load qbic/qpostman/0.1.2.3
To download the data, navigate to the data folder and either provide a QBiC ID:
java -jar qpostman.jar -i <QBiC ID> -u <your_qbic_username>
or a file containing the project QBiC IDs:
postman-0.1.2.3 -f sample_IDs.csv -u user-name
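The file passed with -f is simply a list of QBiC IDs, one per line; the IDs below are made up for illustration:

QABCD001A0
QABCD002B1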
If you're not using qpostman, just put the relevant files in the data folder (supported formats: .fastq, .fastq.gz).
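For example (the source path is hypothetical):

cp /path/to/sequencing_run/*.fastq.gz data/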
To run the workflow, navigate to the src folder. Using snakemake -n one can display the operations the workflow will perform. Using the --dag parameter and piping the output to dot, one can create a .pdf version of the directed acyclic graph snakemake uses, to inspect the behaviour of the workflow on a local machine:
module load qbic/anaconda
cd src/
snakemake -n
snakemake --dag | dot -Tpdf > dag.pdf
To run the workflow:
qproject run -t ..
While the workflow is running, one can inspect the log files (e.g. in a different screen session) for the progress and errors generated by the workflow:
cd logs/
tail -f snake.err
And to check the jobs on the computing cluster one can use qstat.
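For example, to restrict the listing to your own jobs (a standard qstat option on SGE/PBS clusters):

qstat -u "$USER"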
Alternatively to qproject run, one could use snakemake -j to run the workflow. In that case, be sure to check jobscript.sh in order to load the required modules manually, and note that the jobs will then not be submitted via qsub.
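A minimal manual invocation could look like this; the job count is an example, and the module list must be taken from jobscript.sh:

module load qbic/anaconda    # plus any other modules listed in jobscript.sh
cd src/
snakemake -j 8               # 8 parallel local jobs, no qsub submission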
Code Snippets
run:
    out = os.path.abspath(str(output))
    if glob.glob(data("*.sha256sum")):
        shell("cd %s; "
              "sha256sum -c *.sha256sum && "
              "touch %s" % (data('.'), out))
    else:
        shell("touch %s" % out)
shell: "ln -s {input} {output}"
shell: "zcat {input} > {output}"
run:
    with open(str(input)) as infile, open(str(output), 'w') as outfile:
        num_ok = 0
        num_filtered = 0
        line = infile.readline()
        while line:
            assert line.startswith('@')
            # a FASTQ record is four lines: header, sequence, '+', quality
            body = ''.join(infile.readline() for _ in range(3))
            # keep the read unless its header is flagged ':Y:'
            # (failed the Illumina chastity filter)
            if all(':Y:' not in part for part in line.split(' ')[1:]):
                num_ok += 1
                outfile.write(line)
                outfile.write(body)
            else:
                num_filtered += 1
            line = infile.readline()
        num_all = num_ok + num_filtered
        if num_filtered / num_all > .1:
            raise ValueError("More than 10% of reads were filtered in "
                             "PreFilterReads. This probably indicates a bug.")
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
run:
    f = open(str(input) + "/" + wildcards['name'] + "_fastqc/fastqc_data.txt")
    out = open(str(output), "w")
    sw = False
    # copy everything between ">>Overrepresented" and ">>END_MODULE"
    for line in f:
        if line.startswith(">>Overrepresented"):
            sw = True
        if not line.startswith(">>END_MODULE") and sw:
            out.write(line)
        else:
            sw = False
    f.close()
    out.close()
run:
    f = open(str(input))
    out = open(str(output), "w")
    for line in f:
        if not (line.startswith("#") or line.startswith(">>")):
            tokens1 = line.split("\t")
            print(tokens1)
            fseq = tokens1[0]
            fheader = '>' + tokens1[3]
            # the header token is the last column, so it still ends with '\n'
            out.write(fheader)
            out.write(fseq + '\n')
    f.close()
    out.close()
shell: "cat {input} > {output}"
shell:
    """
    awk '/^>/ {{P=index($0,"No Hit")==0}} {{if(P) print}} ' {input} > {output}
    """
run:
    # if the adapter file is empty there is nothing to trim; just link the input
    with open(str(input[0])) as f:
        skip = not bool(f.read(1))
    if skip:
        os.symlink(os.path.abspath(str(input[1])), str(output))
    else:
        shell('cutadapt --discard-trimmed -a file:{input[0]} -o {output} {input[1]}')
run:
    gtf = parameters['gtf']
    genome = parameters['indexed_genome']
    shell('tophat --no-coverage-search -o {output} -p 2 -G %s %s {input}'
          % (gtf, genome))
run:
    sam_command = "samtools view {input}/accepted_hits.bam"
    htseq = ("htseq-count -i {gff_attribute} -t {feature_type} "
             "-m {overlap_mode} -s {stranded} - {gtf}").format(**parameters)
    shell("%s | %s > {output}" % (sam_command, htseq))
run:
    import pandas as pd
    import re
    pattern = r"HTSeqCounts_([0-9a-zA-Z_\- ]*).txt"
    names = [re.search(pattern, str(name)).groups()[0] for name in input]
    data = {}
    for name, file in zip(names, input):
        file = str(file)
        # note: pd.Series.from_csv was removed in pandas 1.0; on recent pandas
        # use pd.read_csv(file, sep='\t', header=None, index_col=0).squeeze("columns")
        data[name] = pd.Series.from_csv(file, sep='\t')
    df = pd.DataFrame(data)
    df.index.name = parameters['gff_attribute']
    df.to_csv(str(output))
shell: "samtools index {input}/accepted_hits.bam && mv -f {input}/accepted_hits.bam.bai {output}"
shell: "cp {input}/align_summary.txt {output}"
shell: "samtools depth {input}/accepted_hits.bam > {output}"
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''

"""
Rule to get the software versions of the programs used in the workflow. The
rule either calls a program with the --version flag where possible, or runs
it without parameters and selects the output row containing the version
information with the unix tail command. It then redirects everything to
Summary/software_versions.txt.
"""
run:
    shell("anaconda --version > Summary/software_versions.txt")
    shell("conda --version >> Summary/software_versions.txt")
    shell("fastqc --version >> Summary/software_versions.txt")
    shell("htseq-count -h | tail -1 >> Summary/software_versions.txt")
    shell("bowtie2 --version | head -1 >> Summary/software_versions.txt")
    shell("tophat2 --version >> Summary/software_versions.txt")
    shell("samtools --version | head -2 >> Summary/software_versions.txt")