PPCG RNA working group
This repository contains the pipeline used by the Pan Prostate Cancer Group.
Requirements
All tools used in this pipeline are installed at runtime through a Singularity container. The only requirements are those needed to run the Snakemake workflow itself:
- mamba >= 1.2 (optional, for faster installation)
- Snakemake >= 6.2
- Singularity >= 3.7 (should work on previous versions, although not tested)
- Cookiecutter >= 1.7.2
With the exception of Singularity, these can be easily installed using conda. Please do not install Singularity through conda. Singularity must be owned by root in order to run this workflow; if you do not have access to a Singularity installation owned by root, ask your system administrator to provide one.
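A minimal sketch of one way to install the conda-installable requirements, assuming the conda-forge and bioconda channels are available (the environment name is arbitrary):
mamba create -n ppcg_rna -c conda-forge -c bioconda "snakemake>=6.2" "cookiecutter>=1.7.2"
conda activate ppcg_rna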
Before you begin
When executing Snakemake, avoid canceling it with ctrl + c. If an error occurs, Snakemake will clean up the files from the failed step before shutting down; forcing it to shut down can leave corrupted files behind, potentially causing trouble later.
On the same note, avoid moving, creating, or removing files under the results/ folder. Snakemake relies on the time stamps of the files to coordinate the order of execution of each step. If you must remove files, only remove those produced by the last step that ran; otherwise Snakemake will reprocess every step that depends on the removed files.
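If timestamps do end up out of sync (for example after files were restored from a backup) and you are confident the outputs themselves are intact, Snakemake's --touch option can mark existing outputs as up to date instead of recomputing them. This is only a sketch, not part of the published instructions:
# Mark existing output files as up to date without re-running any rules
snakemake --touch --cores 1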
How to run
- Download the snakemake workflow.
wget https://github.com/panprostate/RNA/releases/download/v1.2.4/PPCG_RNA_v1.2.4.tar.gz
tar -xzvf PPCG_RNA_v1.2.4.tar.gz
- This workflow requires a sample metadata table (tab separated) containing the following fields:
- sample_name = The sample name, by default anything preceding the "_L00" pattern.
- unit_name = The unit/lane name, by default it matches the pattern ".+_(L00.).+".
- fq1 = Absolute path to the R1 file of the fastq pair.
- fq2 = Absolute path to the R2 file of the fastq pair. For single-ended libraries this field should contain NA.
Create this table under the config/ folder. For reference, a helper script createTable.sh is included under config/ to create this table. Make sure the table format is correct before proceeding; an example is shown below.
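A minimal sketch of such a table for a paired-end sample S1 split across lanes L001 and L002 (tab-separated; the header row and the file paths are illustrative assumptions):
sample_name	unit_name	fq1	fq2
S1	L001	/data/fastq/S1_L001_R1.fastq.gz	/data/fastq/S1_L001_R2.fastq.gz
S1	L002	/data/fastq/S1_L002_R1.fastq.gz	/data/fastq/S1_L002_R2.fastq.gz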
- If your computing environment is managed by Slurm, a cluster profile is included at slurm_profile/. There are two files that can be adjusted:
- slurm_profile/config.yaml: most likely you will only want to adjust two settings in this file:
# Number of cores used by Snakemake to manage the workflow - if increasing, be careful to not have your job killed if not running in an interactive session.
local-cores: 1
# Number of maximum simultaneous jobs that can be run.
jobs: 5
- slurm_profile/settings.json: this file can be used to include center-specific parameters for your submissions, for example to specify a partition named panda for all job submissions:
"SBATCH_DEFAULTS": "partition=panda job-name={rule}_{wildcards} output=job_logs/slurm-%j.out"
- Edit the amount of resources requested and other general settings in config/config.yaml (an example snippet follows this list). Some of the options are:
- library_type: (new) process single-ended libraries. Please note that a mixed sample table is not supported at the moment; make sure all samples are SE or PE in your samples table.
- samples: path to the metadata table described above.
- buildIndex: True or False, whether to build indexes from the reference files or download pre-built indexes.
- adapters: adapter sequences used by cutadapt.
- mem_*: amount of RAM requested for the job, in MB.
- ncpus_*: number of cores requested for the job.
- rt_*: maximum time a job is allowed to run before it is killed.
- compression: compression level from 1 to 9 for intermediate steps. Final outputs are always compressed at the default level.
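A sketch of how these options might look in config/config.yaml; the exact per-rule resource key names (the mem_*, ncpus_* and rt_* entries) and all values below are illustrative only, so check the comments in the shipped config file for the real names and defaults:
samples: "config/samples.tsv"
library_type: "PE"
buildIndex: False
adapters: "-a AGATCGGAAGAGC -A AGATCGGAAGAGC"
compression: 6
# Illustrative per-rule resources (RAM in MB, cores, and maximum runtime)
mem_star: 40000
ncpus_star: 8
rt_star: "24:00:00"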
- From the base directory execute Snakemake:
snakemake --use-conda --use-singularity --profile slurm_profile
The workflow will automatically pull and create a Singularity container from
docker://condaforge/mambaforge:4.10.1-0
and create environments with the necessary packages for each module.
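If you are not running under Slurm, the profile can be dropped and Snakemake given a core budget directly; a minimal sketch (the core count is illustrative):
snakemake --use-conda --use-singularity --cores 16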
- If debugging is necessary, the logs/ folder contains the error log generated by the software run in each step. In some cases the errors are not caused directly by a tool but by the execution of a script or by the cluster resource manager; in those cases the error will be written to job_logs/ with the name slurm-<job-number>.out.
Troubleshooting
Depending on the number of samples and read length, you might end up with an excessive number of splice junctions (SJs) detected. In this case you will encounter the error below:
Fatal LIMIT error: the number of junctions to be inserted on the fly =XXXXXXX is larger than the limitSjdbInsertNsj=1000000
While you can increase this limit, doing so might cause you to encounter another error:
EXITING because of FATAL ERROR: cannot insert junctions on the fly because of strand GstrandBit problem
The solution is to decrease the number of SJs being inserted on the fly. You can do this by turning on the SJ filter in the config.yaml file, which by default removes SJs supported by only 1 uniquely mapped read:
SJ_filter: True
SJ_minCounts: 1
Workflow overview
The workflow overview assumes a sample named S1 that has been split into two files by lane (L001 and L002).
Code Snippets
shell:
    """
    set -e
    echo "Downloading resources..."
    [[ -f {output.dbSNP} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf {output.dbSNP}
    [[ -f {output.dbSNPidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dbsnp138.vcf.idx {output.dbSNPidx}
    [[ -f {output.knowIndels} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf {output.knowIndels}
    [[ -f {output.knowIndelsidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx {output.knowIndelsidx}
    [[ -f {output.reference} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta {output.reference}
    [[ -f {output.faidx} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta.fai {output.faidx}
    [[ -f {output.dictn} ]] || gsutil cp gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.dict {output.dictn}
    [[ -f {output.paSites} ]] || wget -O resources/pasites_hg19.bed.gz https://www.dropbox.com/s/i55pzcuh6113vvq/pasites_hg19.bed.gz
    [[ -f {output.tx2gene} ]] || wget -O resources/tx2gene.tsv.gz https://www.dropbox.com/s/ays94bytt7r02ex/tx2gene.tsv.gz
    [[ -f {output.fc} ]] || wget -O resources/FANTOM_CAT.lv3_robust.gtf.gz https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz
    [[ -f {output.gtf} ]] || wget -O resources/gencode.v38lift37.annotation.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.annotation.gtf.gz
    [[ -f {output.transcripts} ]] || wget -O resources/gencode.v38lift37.transcripts.fa.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh37_mapping/gencode.v38lift37.transcripts.fa.gz
    [[ -f {output.blacklist} ]] || wget -O resources/arriba_v2.1.0.tar.gz https://github.com/suhrig/arriba/releases/download/v2.1.0/arriba_v2.1.0.tar.gz
    tar -xvf resources/arriba_v2.1.0.tar.gz --directory resources/
    mv resources/arriba_v2.1.0/database/blacklist_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/known_fusions_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz resources/arriba_v2.1.0/database/protein_domains_hg19_hs37d5_GRCh37_v2.1.0.gff3 resources/
    rm -rf resources/arriba_v2.1.0/ resources/arriba_v2.1.0.tar.gz
    echo "Unzipping some files..."
    gunzip resources/gencode.v38lift37.annotation.gtf.gz resources/gencode.v38lift37.transcripts.fa.gz resources/FANTOM_CAT.lv3_robust.gtf.gz resources/pasites_hg19.bed.gz
    sed -i 's/^chr//g' resources/gencode.v38lift37.annotation.gtf
    sed -i 's/^chr//g' resources/FANTOM_CAT.lv3_robust.gtf
    echo "Done!"
    """
shell:
    """
    set -e
    wget -O resources/STARindex_hg19.tar https://www.dropbox.com/s/7zf1ad0hokpizfi/STARindex_hg19.tar
    tar -xvf resources/STARindex_hg19.tar --directory resources/
    """

shell:
    """
    set -e
    wget -O resources/salmon_hg19.tar https://www.dropbox.com/s/mybii6z71v9pt57/salmon_hg19.tar
    tar -xvf resources/salmon_hg19.tar --directory resources/
    """

shell:
    """
    wget -O resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz
    tar -xzvf resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.plug-n-play.tar.gz --directory resources/
    """
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output}
    """

shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"

shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"

shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"

shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"

shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"

shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"

shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"

shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"

shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """

shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -p {output.fastq2}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"

shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"

shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """

shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """

shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """

shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} {input.f2} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """

shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        for file in $files; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 | head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """

shell:
    "salmon quant -p {threads} -i {params.idx} -l A -1 {input.f1} -2 {input.f2} --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"

shell:
    """
    featureCounts -p -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -p -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -p -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -p -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """

script:
    "../../scripts/txImport.R"

shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """

shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """

shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
script:
    "../../scripts/createInterval.R"

shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -F2 {input.f2} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """

shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """

shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """

shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """

shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """

shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """
shell:
    """
    fastqc -o {params.dir} {input.bam}
    mv results/qc/fastqc/{wildcards.sample}/{wildcards.sample}_{wildcards.unit}.Aligned.sortedByCoord.out_fastqc.zip {output}
    """

shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"

shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"

shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"

shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"

shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"

shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"

shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"

shell:
    "multiqc"
    " --force"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"

shell:
    """
    set -e
    multiqc --force -o {params.outDir} -n {params.geneCount_gc} {input.gcg}
    multiqc --force -o {params.outDir} -n {params.geneCount_fc} {input.gcf}
    multiqc --force -o {params.outDir} -n {params.exonCount_gc} {input.ecg}
    multiqc --force -o {params.outDir} -n {params.exonCount_fc} {input.ecf}
    multiqc --force -o {params.outDir} -n {params.salmon} {input.salmon} {input.salmon_length}
    """

shell:
    "multiqc"
    " --force -fp"
    " -o {params.outDir}"
    " -n {params.name}"
    " {input}"
shell:
    "cutadapt"
    " {params.adapters}"
    " -m 31"
    " --compression-level {params.compression}"
    " -o {output.fastq1}"
    " -j {threads}"
    " {input} > {output.qc} 2>>{log}"

shell:
    "STAR"
    " --runMode genomeGenerate"
    " --genomeDir {params.idx}"
    " --genomeFastaFiles {input.reference}"
    " --sjdbOverhang 250"
    " --sjdbGTFfile {input.gtf}"
    " --runThreadN {threads} 2>&1 | tee -a {log}"

shell:
    """
    set -e
    cat {input.transcripts} {input.reference} > resources/gentrome.fa
    grep "^>" {input.reference} | cut -d " " -f 1 > resources/decoys.txt
    sed -i -e 's/>//g' resources/decoys.txt
    salmon index -t resources/gentrome.fa -i {params.idx} -p {threads} --decoys resources/decoys.txt --gencode -k 31 2>&1 | tee -a {log}
    rm -f resources/gentrome.fa resources/decoys.txt
    """

shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterType BySJout \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --outSAMtype None \
    --outSAMmode None \
    --outFileNamePrefix results/STAR_1p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    set -e
    awk -F'\t' '$7 > {params.minCount}' {input.SJs} > {output.filtered}
    """

shell:
    """
    touch results/index/dummy.fastq
    mkdir -p results/index/STARindex_hg19_SJ
    cd results/index/STARindex_hg19_SJ
    ln -sf ../../../resources/STARindex_hg19/* .
    rm -f sjdbList.out.tab sjdbInfo.txt
    cd ../../../
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn results/index/dummy.fastq \
    --runThreadN 2 \
    --sjdbFileChrStartEnd {input.SJ} \
    --outFileNamePrefix results/index/ \
    --sjdbInsertSave Basic 2>&1 | tee -a {log}
    cp results/index/_STARgenome/sjdbList.out.tab results/index/STARindex_hg19_SJ/
    cp results/index/_STARgenome/sjdbInfo.txt results/index/STARindex_hg19_SJ/
    rm -rf results/index/_STARgenome
    """

shell:
    """
    STAR \
    --genomeDir {params.idx} \
    --readFilesIn {input.f1} \
    --runThreadN {threads} \
    --readFilesCommand zcat \
    --outBAMcompression {params.compression} \
    --outFilterType BySJout \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --alignIntronMax 500000 \
    --alignIntronMin 20 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outSAMtype BAM Unsorted \
    --alignSJDBoverhangMin 5 \
    --genomeLoad NoSharedMemory \
    --outFilterMatchNminOverLread 0.1 \
    --outFilterScoreMinOverLread 0.1 \
    --sjdbOverhang 250 \
    --outSAMstrandField intronMotif \
    --outSAMattrRGline ID:{params.RG} SM:{params.SM} PL:illumina \
    --peOverlapNbasesMin 10 \
    --alignSplicedMateMapLminOverLmate 0.5 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentMin 10 \
    --chimOutJunctionFormat 1 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --chimSegmentReadGapMax 3 \
    --chimMultimapNmax 20 \
    --outFileNamePrefix results/STAR_2p/{wildcards.sample}_{wildcards.unit} 2>&1 | tee -a {log}
    """
shell:
    """
    samtools sort -@ {threads} {input.bam} -o {output.sbam} 2>>{log}
    samtools index {output.sbam}
    """

shell:
    """
    set -e
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
        cp {input.bams}.bai {output.mbam}.bai
        cp {input.chims} {output.chim}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
        samtools index {output.mbam}
        # Also merge the chimeric files for star-fusion
        cat {input.chims[0]} | head -n -2 > {output.chim}
        files=({input.chims})
        unset files[0]
        files=("${{files[@]}}")
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
        touch results/mergedBam/{wildcards.sample}chims
        cat {input.chims[0]} | tail -n 1 > results/mergedBam/{wildcards.sample}counts
        for file in $files; do
            cat $file | tail -n +2 | head -n -2 >> results/mergedBam/{wildcards.sample}chims
            cat $file | tail -n 1 >> results/mergedBam/{wildcards.sample}counts
        done
        cat results/mergedBam/{wildcards.sample}chims >> {output.chim}
        cat {input.chims[0]} | tail -n 2 | head -n 1 >> {output.chim}
        awk -F' ' '{{Nreads+=$3; NreadsUnique+=$5; NreadsMulti+=$7}}END{{print "# Nreads " Nreads "\t" "NreadsUnique " NreadsUnique "\t" "NreadsMulti " NreadsMulti}}' results/mergedBam/{wildcards.sample}counts >> {output.chim}
        rm -f results/mergedBam/{wildcards.sample}chims results/mergedBam/{wildcards.sample}counts
    fi
    """

shell:
    "salmon quant -p {threads} -i {params.idx} -l A -r {input.f1} --fldMean 275 --fldSD 50 --validateMappings --rangeFactorizationBins 4 --gcBias -o {params.dir} 2>&1 | tee -a {log}"

shell:
    """
    featureCounts -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts} {input.bam}
    featureCounts -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.gene_counts_fc} {input.bam}
    featureCounts -f -O -a {input.gtf} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts} {input.bam}
    featureCounts -f -O -a {input.fc} -T {threads} -s {params.strand} -t exon -g gene_id -o {output.exon_counts_fc} {input.bam}
    """

script:
    "../../scripts/txImport.R"

shell:
    """
    arriba \
    -x {input.bam} \
    -a {input.reference} \
    -g {input.gtf} \
    -b {input.bl} \
    -k {input.kf} \
    -t {input.kf} \
    -p {input.pd} \
    -o {output.fusion} -O results/fusion/arriba/{wildcards.sample}_discarded.tsv 2>&1 | tee -a {log}
    gzip -9 -c results/fusion/arriba/{wildcards.sample}_discarded.tsv > {output.discarded}
    rm -f results/fusion/arriba/{wildcards.sample}_discarded.tsv
    """

shell:
    """
    workflow/scripts/fixChim.awk < {input.cj} > {input.cj}.fix
    mv -f {input.cj}.fix {input.cj}
    STAR-Fusion --genome_lib_dir {params.ctat_path} \
    -J {input.cj} \
    --output_dir results/fusion/STAR_fusion/{wildcards.sample}
    """

shell:
    """
    megadepth {input.bam} --annotation {input.bed} --op sum > results/paQuant/{wildcards.sample}_paQuant.tsv
    gzip results/paQuant/{wildcards.sample}_paQuant.tsv
    """
script:
    "../../scripts/createInterval.R"

shell:
    """
    gatk FastqToSam \
    -F1 {input.f1} \
    -O {output.ubam} \
    -RG {params.RG} \
    -SM {params.SM} \
    -PL illumina \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    gatk \
    MergeBamAlignment \
    --REFERENCE_SEQUENCE {input.reference} \
    --UNMAPPED_BAM {input.ubam} \
    --ALIGNED_BAM {input.bam} \
    --OUTPUT {output.merged} \
    --INCLUDE_SECONDARY_ALIGNMENTS false \
    --VALIDATION_STRINGENCY SILENT \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    INPUT=({input.bams})
    if ((${{#INPUT[@]}} == 1)); then
        cp {input.bams} {output.mbam}
    else
        samtools merge -f -1 {output.mbam} {input.bams} 2>>{log}
    fi
    """

shell:
    """
    gatk \
    MarkDuplicates \
    --INPUT {input.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics} \
    --OUTPUT {output.dedup} \
    --TMP_DIR {params.tmp_dir} \
    --COMPRESSION_LEVEL {params.compression} 2>> {log}
    """

shell:
    """
    gatk \
    SplitNCigarReads \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    -O {output.sacr} 2>> {log}
    """

shell:
    """
    gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
    -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -Xms4000m" \
    BaseRecalibrator \
    -R {input.reference} \
    -I {input.bam} \
    --use-original-qualities \
    -O {output.table} \
    --tmp-dir {params.tmp_dir} \
    -known-sites {input.dbSNP} \
    -known-sites {input.knowIndels} 2>> {log}
    """

shell:
    """
    gatk \
    --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails -XX:ParallelGCThreads=1 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
    ApplyBQSR \
    --add-output-sam-program-record \
    -R {input.reference} \
    -I {input.bam} \
    --tmp-dir {params.tmp_dir} \
    --use-original-qualities \
    -O {output.recalbam} \
    --bqsr-recal-file {input.table} 2>> {log}
    """

shell:
    """
    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R {input.reference} \
    -I {input.bam} \
    -L {input.intervals} \
    --tmp-dir {params.tmp_dir} \
    -O {output.vcf} \
    -dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    --dbsnp {input.dbSNP} 2>> {log}
    """

shell:
    """
    gatk \
    VariantFiltration \
    --R {input.reference} \
    --V {input.vcf} \
    --tmp-dir {params.tmp_dir} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    -O {output.vcf_f}
    """
options(scipen=999)
gtf <- read.table(snakemake@input[["gtf"]], sep="\t")
gtf <- subset(gtf, V3 == "exon")
gtf <- data.frame(chrom=gtf[,'V1'], start=gtf[,'V4']-1, end=gtf[,'V5'])
gtf$chrom <- gsub("^chr", "", gtf$chrom)
gtf <- gtf[gtf$chrom != "M",]
write.table(gtf, file=snakemake@output[["tmp"]], quote = F, sep="\t", col.names = F, row.names = F)
system(paste0("gatk BedToIntervalList -I ", snakemake@output[["tmp"]], " -O ", snakemake@output[["intervals"]], " -SD ", snakemake@input[["refDict"]]))

import gffutils

db = gffutils.create_db(
    snakemake.input[0],
    dbfn=snakemake.output.db,
    force=True,
    keep_order=True,
    merge_strategy="merge",
    sort_attribute_values=True,
    disable_infer_genes=True,
    disable_infer_transcripts=True,
)

with open(snakemake.output.bed, "w") as outfileobj:
    for tx in db.features_of_type("transcript", order_by="start"):
        bed = [s.strip() for s in db.bed12(tx).split("\t")]
        bed[3] = tx.id
        outfileobj.write("{}\n".format("\t".join(bed)))

options(scipen=999)
require(tximport)
require(GenomicFeatures)
transcripts <- read.table(snakemake@input[["quant"]], sep ="\t", header = T)
write.csv(transcripts[,c("Name", "NumReads")], file=snakemake@output[["transcript_raw"]], col.names = T, row.names = F)
write.csv(transcripts[,c("Name", "TPM")], file=snakemake@output[["transcript_TPM"]], col.names = T, row.names = F)
tx2gene <- read.table(snakemake@input[["tx2gene"]])
### Read counts and summarize to gene level
txi <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene)
txiScaled <- tximport(snakemake@input[["quant"]], type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
### Rename columns
nms <- basename(gsub("\\/quant.sf", "", snakemake@input[["quant"]]))
mat <- txi$counts
matScaled <- txiScaled$counts
colnames(mat) <- nms
colnames(matScaled) <- nms
write.table(mat, file=snakemake@output[["gene_raw"]], row.names = T, col.names = T, sep = "\t")
write.table(matScaled, file=snakemake@output[["gene_scaled"]], row.names = T, col.names = T, sep="\t")
shell:
    "touch {output.complete}"

shell:
    "tar -czvf {output.tarfile} {input}"

shell:
    """
    set -e
    mv {params.map_gene_gc_stat} {output.map_gene_gc_stat}
    mv {params.map_gene_fc_stat} {output.map_gene_fc_stat}
    mv {params.map_exon_gc_stat} {output.map_exon_gc_stat}
    mv {params.map_exon_fc_stat} {output.map_exon_fc_stat}
    tar -czvf {output.tarfile} {input} {output.map_gene_gc_stat} {output.map_gene_fc_stat} {output.map_exon_gc_stat} {output.map_exon_fc_stat}
    """
Support
- Future updates