RNA-Seq Analysis Workflow for Single-End Data using qproject
RNA-Seq workflow for single-end data. The workflow can be downloaded and run on a cluster environment using the tool qproject, provided on GitHub: https://github.com/qbicsoftware/qproject
The workflow uses a module system to load the required software. Be sure to check the jobscript.sh file to see which software modules are required. The modules are loaded automatically when using qproject run to start the workflow; otherwise they have to be loaded manually.
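For a quick overview of the required software, one can list the module lines directly, assuming they appear as plain "module load" statements in the jobscript (the file's location within the project is not fixed by this document):

grep '^module load' jobscript.sh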
Use qproject to download the files. This also creates all folders necessary for the workflow.
qproject create -t . -w github:qbicsoftware/rnaseq
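The resulting directory layout, inferred from the folders referenced throughout this document (the exact set may depend on the qproject version), looks roughly like this:

data/   raw input files (.fastq, .fastq.gz)
etc/    configuration, e.g. params.json
ref/    reference files (gtf, bowtie2 index)
src/    the snakemake workflow
logs/   log files such as snake.err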
Be sure to add a config file "params.json" in etc, which should look like this:
{
"gtf": "path/to/gtf",
"indexed_genome": "path/to/genome/basename",
"stranded": "no",
"overlap_mode": "union",
"feature_type": "exon",
"gff_attribute": "gene_id",
"normalize_counts": "deseq2"
}
where indexed_genome and gtf are paths relative to ref. indexed_genome is the basename of a bowtie2 index, and gtf is the .gtf file of the reference genome.
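If no index exists yet, one can be built inside ref with bowtie2-build; the file name and module name below are hypothetical examples, not part of the workflow:

module load qbic/bowtie2   # assumed module name; adjust to your cluster
bowtie2-build ref/genome.fa ref/genome

With this layout, "indexed_genome" would be set to "genome" in params.json.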
The parameters stranded, overlap_mode, feature_type and gff_attribute are explained in the htseq documentation.
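These values are passed straight through to htseq-count. With the example params.json above, the call the workflow assembles (see the counting rule in the Code Snippets section below; the tophat output directory is shown as a placeholder) is equivalent to:

samtools view <tophat_output>/accepted_hits.bam | htseq-count -i gene_id -t exon -m union -s no - path/to/gtf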
Members of QBiC can download the data for analysis using qpostman: https://github.com/qbicsoftware/postman-cli
It should be installed on the computing stations and can be loaded with:
module load qbic/qpostman/0.1.2.3
To download the data, navigate to the data folder and either provide a QBiC ID:
java -jar qpostman.jar -i <QBiC ID> -u <your_qbic_username>
or a file containing the project QBiC IDs:
postman-0.1.2.3 -f sample_IDs.csv -u user-name
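The file passed with -f is simply a list of QBiC IDs, one per line; the IDs below are made up for illustration:

QABCD001A0
QABCD002B1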
If you're not using qpostman, just put the relevant files in the data folder (supported formats: .fastq, .fastq.gz).
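For example (the source path is hypothetical):

cp /path/to/sequencing_run/*.fastq.gz data/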
To run the workflow, navigate to the src folder. Using snakemake -n one can display the operations the workflow will perform. Using the --dag parameter and piping the output to dot, one can create a .pdf version of the directed acyclic graph snakemake uses, to inspect the behaviour of the workflow on a local machine:
module load qbic/anaconda
cd src/
snakemake -n
snakemake --dag | dot -Tpdf > dag.pdf
To run the workflow:
qproject run -t ..
While the workflow is running, one can inspect the log files (e.g. in a different screen session) for the progress and errors generated by the workflow:
cd logs/
tail -f snake.err
And to check the jobs on the computing cluster one can use qstat.
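For example, to restrict the listing to your own jobs (a standard qstat option on SGE/PBS clusters):

qstat -u "$USER"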
Alternatively to qproject run, one could use snakemake -j to run the workflow. In that case, be sure to check jobscript.sh in order to load the required modules manually, and note that the jobs will then not be submitted via qsub.
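A minimal manual invocation could look like this; the job count is an example, and the module list must be taken from jobscript.sh:

module load qbic/anaconda    # plus any other modules listed in jobscript.sh
cd src/
snakemake -j 8               # 8 parallel local jobs, no qsub submission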
Code Snippets
run:
    out = os.path.abspath(str(output))
    if glob.glob(data("*.sha256sum")):
        shell("cd %s; "
              "sha256sum -c *.sha256sum && "
              "touch %s" % (data('.'), out))
    else:
        shell("touch %s" % out)
shell: "ln -s {input} {output}"
shell: "zcat {input} > {output}"
run:
    with open(str(input)) as infile, open(str(output), 'w') as outfile:
        num_ok = 0
        num_filtered = 0
        line = infile.readline()
        while line:
            assert line.startswith('@')
            # a FASTQ record is four lines: header, sequence, '+', quality
            body = ''.join(infile.readline() for _ in range(3))
            # keep the read unless its header is flagged ':Y:'
            # (failed the Illumina chastity filter)
            if all(':Y:' not in part for part in line.split(' ')[1:]):
                num_ok += 1
                outfile.write(line)
                outfile.write(body)
            else:
                num_filtered += 1
            line = infile.readline()
        num_all = num_ok + num_filtered
        if num_filtered / num_all > .1:
            raise ValueError("More than 10% of reads were filtered in "
                             "PreFilterReads. This probably indicates a bug.")
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
shell: 'mkdir -p {output} && (fastqc {input} -o {output} --extract || (rm -rf {output} && exit 1))'
shell: "cp {input}/{wildcards.name}_fastqc.zip {output}"
run:
    f = open(str(input) + "/" + wildcards['name'] + "_fastqc/fastqc_data.txt")
    out = open(str(output), "w")
    sw = False
    # copy everything between ">>Overrepresented" and ">>END_MODULE"
    for line in f:
        if line.startswith(">>Overrepresented"):
            sw = True
        if not line.startswith(">>END_MODULE") and sw:
            out.write(line)
        else:
            sw = False
    f.close()
    out.close()
run:
    f = open(str(input))
    out = open(str(output), "w")
    for line in f:
        if not (line.startswith("#") or line.startswith(">>")):
            tokens1 = line.split("\t")
            print(tokens1)
            fseq = tokens1[0]
            fheader = '>' + tokens1[3]
            # the header token is the last column, so it still ends with '\n'
            out.write(fheader)
            out.write(fseq + '\n')
    f.close()
    out.close()
shell: "cat {input} > {output}"
shell:
    """
    awk '/^>/ {{P=index($0,"No Hit")==0}} {{if(P) print}} ' {input} > {output}
    """
run:
    # if the adapter file is empty there is nothing to trim; just link the input
    with open(str(input[0])) as f:
        skip = not bool(f.read(1))
    if skip:
        os.symlink(os.path.abspath(str(input[1])), str(output))
    else:
        shell('cutadapt --discard-trimmed -a file:{input[0]} -o {output} {input[1]}')
run:
    gtf = parameters['gtf']
    genome = parameters['indexed_genome']
    shell('tophat --no-coverage-search -o {output} -p 2 -G %s %s {input}'
          % (gtf, genome))
run:
    sam_command = "samtools view {input}/accepted_hits.bam"
    htseq = ("htseq-count -i {gff_attribute} -t {feature_type} "
             "-m {overlap_mode} -s {stranded} - {gtf}").format(**parameters)
    shell("%s | %s > {output}" % (sam_command, htseq))
run:
    import pandas as pd
    import re
    pattern = r"HTSeqCounts_([0-9a-zA-Z_\- ]*).txt"
    names = [re.search(pattern, str(name)).groups()[0] for name in input]
    data = {}
    for name, file in zip(names, input):
        file = str(file)
        # note: pd.Series.from_csv was removed in pandas 1.0; on recent pandas
        # use pd.read_csv(file, sep='\t', header=None, index_col=0).squeeze("columns")
        data[name] = pd.Series.from_csv(file, sep='\t')
    df = pd.DataFrame(data)
    df.index.name = parameters['gff_attribute']
    df.to_csv(str(output))
shell: "samtools index {input}/accepted_hits.bam && mv -f {input}/accepted_hits.bam.bai {output}"
shell: "cp {input}/align_summary.txt {output}"
shell: "samtools depth {input}/accepted_hits.bam > {output}"
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''
shell: '''dc -e "$(wc -l {input} | cut -f1 -d' ') 4 / p" > {output}'''

"""
Rule to get the software versions of the programs used in the workflow. The
rule either calls a program with the --version flag where possible, or runs
it without parameters and selects the output row containing the version
information with the unix tail command. It then redirects everything to
Summary/software_versions.txt.
"""
run:
    shell("anaconda --version > Summary/software_versions.txt")
    shell("conda --version >> Summary/software_versions.txt")
    shell("fastqc --version >> Summary/software_versions.txt")
    shell("htseq-count -h | tail -1 >> Summary/software_versions.txt")
    shell("bowtie2 --version | head -1 >> Summary/software_versions.txt")
    shell("tophat2 --version >> Summary/software_versions.txt")
    shell("samtools --version | head -2 >> Summary/software_versions.txt")