A Snakemake pipeline that improves gene annotation for cross-species analysis of single-cell RNA-Seq
General
A Snakemake pipeline that improves gene annotation for cross-species analysis of single-cell RNA-Seq.
Droplet-based single-cell RNA-Seq protocols such as 10X Genomics Chromium and CEL-Seq2 are widely used because they dramatically increase the throughput of cell detection. Since these methods only enrich cDNA fragments close to the polyadenylation (polyA) tail, the data they generate are strongly biased toward the 3’ end of transcripts, where the 3’UTR is normally located. The sensitivity and specificity with which scRNA-Seq detects expressed genes are therefore confounded by the quality of the 3’UTR annotation.
Here, we implemented a computational pipeline that improves tissue- or species-specific 3’UTR annotation by leveraging 1) de novo assembly of the transcriptome from bulk RNA-Seq data and 2) orthologous 3’UTRs from a well-annotated species. We show that ~40%-70% more UMIs can be assigned to genes after applying this approach to 10X scRNA-Seq data generated from pig retina, whose 3’UTRs are poorly annotated.
For version history, see the change log.
Authors
- Yang Shen (yang_3.shen@boehringer-ingelheim.com)
Requirements
- Miniconda
Miniconda can be installed under a regular user account:
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
- Git (>=2.22.0)
conda install -c conda-forge git
- Snakemake (>=5.32.0)
Install Snakemake using conda; for installation details, see the instructions in the Snakemake documentation.
conda install snakemake
Alternatively, the pipeline can be run via Singularity.
- Singularity
Please check whether Singularity is available:
singularity run docker://godlovedc/lolcow
_________________________________________
/ Learn to pause -- or nothing worthwhile \
\ can catch up to you. /
-----------------------------------------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
Installing Singularity requires root permission; please follow the instructions on the Singularity website.
Introduction
Two approaches for improving gene annotation
DAG of workflow
Without bulk RNA-Seq data
The GTF file can be improved by ortholog mapping.
With bulk RNA-Seq data
The pipeline can take mapped BAM files as inputs; the GTF file will then be improved by de novo assembly.
If the scRNA-Seq reads are provided
The pipeline can then quantify the provided reads with STARSolo using the improved GTF file. The gene count matrix is saved in 10X h5 format, which can be loaded by other scRNA-Seq analysis packages (e.g., Seurat).
Performance
By applying the workflow to pig retina data, the overall exon length is extended twofold on average. Improvements were also observed in the total UMIs detected per sample, the number of genes detected per cell, and the number of UMIs per cell.
Usage
Setup
Step 1: Obtain a copy of this workflow
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis:
mkdir ~/src/
cd ~/src/
git -c http.sslVerify="false" -c http.proxy= clone https://github.com/bi-compbio/scrnax.git
cd scrnax/resources/
tar xvfz tests.tar.gz
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.selftest.json to configure the workflow execution.
Step 3: Check if Snakemake is available
snakemake -v
Step 4: Execute workflow with self-test data
Test your configuration by performing a dry-run via
cd ~/src/scrnax/
snakemake --profile config/selftest/ --use-conda -n
Execute the workflow locally via
snakemake --profile config/selftest/ --use-conda --cores $N
using $N cores, or run it in an HPC environment (Slurm) via
snakemake --profile config/slurm_selftest/ --use-conda
If you want to fix not only the software stack but also the underlying OS, use
snakemake --profile config/selftest/ --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
If the files defined in the config file are located outside of the current working directory, the folder (or a parent folder) has to be mounted when running Singularity. For example, in the conf.json shown below, the files are all located in subfolders of /data/.
{ "resultdir": "pigGTF/", "goodGTF": "/data/cbsync/referencedata/annotations/homo_sapiens/ensembl/95/Homo_sapiens.GRCh38.95.gtf", "refGTF": "/data/cbsync/referencedata/annotations/sus_scrofa/ensembl/86/Sus_scrofa.Sscrofa10.2.86.gtf", "liftChain": "/data/cbsync/referencedata/annotations/homo_sapiens/ucsc/liftOver/hg38ToSusScr3.over.chain", "liftMinMatch": 0.7
}
Therefore, /data/ should be mounted by adding --singularity-args "-B /data" at the end of the command:
snakemake --profile config/selftest/ --use-conda --use-singularity --singularity-args "-B /data"
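If input files live under more than one root, each path can be bound; Singularity's -B accepts repeated bind specifications. A sketch with hypothetical paths /data and /scratch:
snakemake --profile config/selftest/ --use-conda --use-singularity --singularity-args "-B /data -B /scratch"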
Input
- GTF gene annotation file of the poorly annotated species the samples are from (e.g., pig)
- GTF gene annotation file of a well-annotated species that is evolutionarily close to the species of interest, which can be downloaded from Ensembl (e.g., human)
- LiftOver chain file defining the ortholog synteny blocks by whole-genome alignment, which can be downloaded from UCSC (e.g., hg38)
- BAM files of bulk RNA-Seq (optional); these files will be used for de novo assembly
Config file
After downloading the aforementioned files, put their paths into conf.json . The parameters defined in conf.json are:
Parameters for GTF improvement
- resultdir : the root path of outputs
- goodGTF : the GTF file of the well-annotated species (e.g., human)
- refGTF : the GTF file of the poorly annotated species the samples are from (e.g., pig)
- liftChain : the chain file for liftOver
- liftMinMatch : a liftOver-specific parameter defining the minimum ratio of bases that must remap; ranges from 0 to 1
- bamDir (optional): if bulk RNA-Seq data is available, bamDir defines the path of the folder containing the BAM files
Parameters for quantification using STARSolo with the improved GTF
- StarsoloGenome : the path of the genome index for STARSolo
- whitelist : the path of the barcode whitelist for correcting cell barcodes. For 10X data, the whitelist can be obtained from the cellRanger package: V2 (https://github.com/10XGenomics/supernova/raw/master/tenkit/lib/python/tenkit/barcodes/737K-august-2016.txt) and V3 (local path: cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/barcodes/3M-february-2018.txt.gz)
STARSolo can handle any data in which the cell barcode + UMI (read 1 for 10X) and the cDNA (read 2 for 10X) are on separate reads. The user needs to define the actual cell barcode length and the UMI start position and length when using STARSolo; please refer to the STARSolo manual for details.
- CBLen : cell barcode length, which is 16 for 10X data
- UMIStart : start position of the UMI
- UMILen : length of the UMI
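For example, common 10X chemistry settings (the V2 values match the example config later in this document; the 12 bp UMI for V3 is that chemistry's standard layout):
- 10X V2 (737K-august-2016.txt whitelist): CBLen = 16, UMIStart = 17, UMILen = 10
- 10X V3 (3M-february-2018.txt.gz whitelist): CBLen = 16, UMIStart = 17, UMILen = 12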
Run pipeline
Conda
snakemake --profile config/local/ --use-conda --cores 16 --configfile conf.json --config bamDir='bams/'
Conda + Singularity
snakemake --profile config/local/ --use-conda --use-singularity --cores 16 --configfile conf.json --config bamDir='bams/'
Output
Depending on the inputs, two or three refined GTF files are produced:
- {resultdir}/ortholog/fixed.gtf : GTF refined by ortholog mapping
- {resultdir}/denovo/fixed.gtf : GTF refined by de novo assembly (generated only if bulk RNA-Seq BAM files are provided)
- {resultdir}/combined.fixed.gtf : GTF refined by combining ortholog mapping and de novo assembly
For example:
pigGTF/
├── combined.fixed.gtf
├── denovo
│ └── fixed.gtf
└── ortholog
└── fixed.gtf
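As a quick sanity check (a sketch using the example paths above), you can count the exon records in each refined GTF:
# number of exon records in the combined and ortholog-refined GTFs
awk -F'\t' '$3 == "exon"' pigGTF/combined.fixed.gtf | wc -l
awk -F'\t' '$3 == "exon"' pigGTF/ortholog/fixed.gtf | wc -l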
Example
In this example, the pig Ensembl annotation is refined using the human annotation and bulk RNA-Seq matched to the scRNA-Seq (assume the working directory is ~/src/scrnax).
If both the BAM files of bulk RNA-Seq and the fastq files of scRNA-Seq are available, you can skip step 4 and step 5; all analyses, including GTF improvement and gene quantification, can be done in step 6.
- Step 1. Download the pipeline code of scrnax:
mkdir ~/src/
cd ~/src/
git -c http.sslVerify="false" -c http.proxy= clone https://git.eu.boehringer.com/bibc_compbio/scrnax.git
cd scrnax/resources/
tar xvfz tests.tar.gz
- Step 2. Download the pig and human GTFs from Ensembl and the chain file from UCSC:
cd ~/src/scrnax/
# tar xvfz test.tar.gz
mkdir gtfs
cd gtfs
wget -c http://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
wget -c http://ftp.ensembl.org/pub/release-97/gtf/sus_scrofa_usmarc/Sus_scrofa_usmarc.USMARCv1.0.97.gtf.gz
gzip -d *.gz
cd ../
mkdir liftchain
cd liftchain
wget -c http://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToSusScr11.over.chain.gz
gzip -d *.gz
- Step 3. Prepare conf.json by filling in the paths of the GTFs and the liftOver chain file:
cd ~/src/scrnax/
cat <<EOT >conf.json
{
"resultdir": "pigGTF/",
"goodGTF": "gtfs/Homo_sapiens.GRCh38.97.gtf",
"refGTF": "gtfs/Sus_scrofa_usmarc.USMARCv1.0.97.gtf",
"liftChain": "liftchain/hg38ToSusScr11.over.chain",
"liftMinMatch": 0.7
}
EOT
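Before launching the real run, a dry run can confirm that the config resolves correctly (a sketch; the -n flag only prints the planned jobs, using the same local profile as the commands below):
snakemake -n --profile config/local/ --configfile conf.json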
- Step 4. Launch the pipeline for ortholog mapping only.
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.json
The final improved GTF files are:
pigGTF/
├── combined.fixed.gtf
└── ortholog
└── fixed.gtf
- Step 5. If BAM files are available, they can be added by either 1) defining them in conf.json or 2) passing "bamDir" when running snakemake. First, some small BAM files are provided as an example:
cd ~/src/scrnax/
ls resources/tests/bams/
Define bamDir in the json file (note: bamDir must end with "/"):
cd ~/src/scrnax/
cat <<EOT >conf.json
{
"resultdir": "pigGTF/",
"goodGTF": "gtfs/Homo_sapiens.GRCh38.97.gtf",
"refGTF": "gtfs/Sus_scrofa_usmarc.USMARCv1.0.97.gtf",
"liftChain": "liftchain/hg38ToSusScr11.over.chain",
"liftMinMatch": 0.7,
"bamDir": "resources/tests/bams/"
}
EOT
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.json
Or it can be provided as an additional parameter, bamDir :
cd ~/src/scrnax/
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.json --config bamDir=resources/tests/bams/
Alternatively, the pipeline can be run on an HPC cluster:
snakemake --use-singularity --use-conda --profile config/slurm/ --configfile conf.json --config bamDir=resources/tests/bams/
Now combined.fixed.gtf and denovo/fixed.gtf should be updated/added in the result folder.
pigGTF/
├── combined.fixed.gtf
├── denovo
│ └── fixed.gtf
└── ortholog
└── fixed.gtf
- Step 6. If the fastq files of scRNA-Seq are available, the pipeline can perform gene-level quantification using STARSolo with the improved GTF. To provide fastq files, the parameter fastqDir needs to be set either in the conf.json file or as an additional parameter when running the pipeline. By default, the combined GTF is used for quantification. Gene expression can also be quantified against another GTF file by setting countBy , which can be set to: ortholog, denovo, or combined.
Generate a new json config file:
cd ~/src/scrnax/
cat <<EOT >conf.fastq.json
{
"resultdir": "pigGTF/",
"goodGTF": "gtfs/Homo_sapiens.GRCh38.97.gtf",
"refGTF": "gtfs/Sus_scrofa_usmarc.USMARCv1.0.97.gtf",
"liftChain": "liftchain/hg38ToSusScr11.over.chain",
"liftMinMatch": 0.7,
"_comment_": "starsolo",
"StarsoloGenome": "resources/tests/Sscrofa10.2.chr12_star2.7.1a/",
"whitelist": "resources/tests/737K-august-2016.txt",
"CBLen": 16,
"UMIStart": 17,
"UMILen": 10
}
EOT
Quantification with combined GTF:
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.fastq.json --config bamDir=resources/tests/bams/ fastqDir=resources/tests/fastqs/ countBy=combined
Quantification with ortholog GTF:
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.fastq.json --config bamDir=resources/tests/bams/ fastqDir=resources/tests/fastqs/ countBy=ortholog
Quantification with the de novo assembled GTF:
snakemake --use-singularity --use-conda --profile config/local/ --configfile conf.fastq.json --config bamDir=resources/tests/bams/ fastqDir=resources/tests/fastqs/ countBy=denovo
Command for submitting the jobs to an HPC cluster:
snakemake --use-singularity --use-conda --profile config/slurm/ --configfile conf.fastq.json --config bamDir=resources/tests/bams/ fastqDir=resources/tests/fastqs/ countBy=combined
The gene count matrices (h5 format) will be stored in subfolders of the featureCount folder.
pigGTF/featureCount/
├── combined
│ ├── 736_001.Aligned.sortedByCoord.out.bam.featureCounts.bam
│ ├── 736_001.counts.txt
│ ├── 736_001.counts.txt.summary
│ ├── 736_001.h5
│ ├── 736_001.tsv.gz
│ ├── 736_008.Aligned.sortedByCoord.out.bam.featureCounts.bam
│ ├── 736_008.counts.txt
│ ├── 736_008.counts.txt.summary
│ ├── 736_008.h5
│ └── 736_008.tsv.gz
├── denovo
│ ├── 736_001.Aligned.sortedByCoord.out.bam.featureCounts.bam
│ ├── 736_001.counts.txt
│ ├── 736_001.counts.txt.summary
│ ├── 736_001.h5
│ ├── 736_001.tsv.gz
│ ├── 736_008.Aligned.sortedByCoord.out.bam.featureCounts.bam
│ ├── 736_008.counts.txt
│ ├── 736_008.counts.txt.summary
│ ├── 736_008.h5
│ └── 736_008.tsv.gz
└── ortholog
├── 736_001.Aligned.sortedByCoord.out.bam.featureCounts.bam
├── 736_001.counts.txt
├── 736_001.counts.txt.summary
├── 736_001.h5
├── 736_001.tsv.gz
├── 736_008.Aligned.sortedByCoord.out.bam.featureCounts.bam
├── 736_008.counts.txt
├── 736_008.counts.txt.summary
├── 736_008.h5
└── 736_008.tsv.gz
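The h5 files can be loaded directly into R; a minimal sketch, assuming Seurat and hdf5r are installed and using one of the sample outputs above:
# read the count matrix produced with the combined GTF and build a Seurat object
Rscript -e 'library(Seurat); counts <- Read10X_h5("pigGTF/featureCount/combined/736_001.h5"); pig <- CreateSeuratObject(counts); print(pig)'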
Troubleshoot
Unlock directory
If you get an error like:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
Please rerun the snakemake command with the --unlock option:
snakemake --use-singularity --singularity-args "-B /data" --profile config/local/ --configfile conf.json --config fastqDir=fastqsslim --unlock
Running the pipeline outside the repository
You can execute the pipeline in any directory, but you have to specify the path of the pipeline script with -s . For example:
snakemake --use-singularity --singularity-args "-B /data" --profile config/local/ -s ~/src/scrnax/workflow/Snakefile --configfile conf.json --config fastqDir=fastqsslim
Change log
- v1.0.0: Initial release
- v1.1.0: Gene-level quantification via STARSolo with the improved GTF
Code Snippets
rule assemblGtf:
    input:
        config['bamDir'] + "{sample}.bam"
    output:
        fullgtf = config["outputDir"] + "denovo/{sample}.scallop.gtf",
        posGtf = config["outputDir"] + "denovo/{sample}.scallop.pos.gtf",
        negGtf = config["outputDir"] + "denovo/{sample}.scallop.neg.gtf"
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        scallop --min_transcript_coverage 10 --min_num_hits_in_bundle 100 --min_splice_bundary_hits 10 \
            --library_type first -i {input} -o {output.fullgtf} --verbose 0 > /dev/null
        mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="-") print; }}' {output.fullgtf} > {output.negGtf}
        mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="+") print; }}' {output.fullgtf} > {output.posGtf}
        """

rule mergeDevoGtf:
    input:
        inGtfs = expand(config["outputDir"] + "denovo/{sample}.scallop.{{strand}}.gtf", sample = IDS),
        refGtf = config["outputDir"] + "reference.{strand}.gtf"
    output:
        mergedDenovoGtf = config["outputDir"] + "denovo/merged.{strand}.gtf"
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        stringtie --merge -g 250 {input.refGtf} {input.inGtfs} -l DENO | \
            grep -v '^MT' > {output}
        """

rule annoDevoGtf:
    input:
        inGtf = rules.mergeDevoGtf.output.mergedDenovoGtf,
        refGtf = config["outputDir"] + "reference.{strand}.gtf"
    output:
        config["outputDir"] + "denovo/anno.{strand}.annotated.gtf"
    params:
        outprefix = lambda wildcards: config["outputDir"] + "denovo/anno." + wildcards.strand
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        gffcompare -o {params.outprefix} -r {input.refGtf} {input.inGtf}
        """
from snakemake.utils import R, validate
from itertools import repeat
from os.path import join
import os
import shutil
import pandas as pd
import jsonschema
import math

# this container defines the underlying OS for each job when using the workflow
# with --use-conda --use-singularity
singularity: "docker://continuumio/miniconda3"

##### load config and sample sheets #####
configfile: "config/config.yaml"
validate(config, schema="../schemas/config.schema.yaml")

samples = pd.read_csv(config["samples"], sep="\t").set_index("sample", drop=False)
samples.index.names = ["sample_id"]
validate(samples, schema="../schemas/samples.schema.yaml")
rule mapFastq:
    input:
        R1 = config["fastqDir"] + "{fastq}_R1.fastq.gz",
        R2 = config["fastqDir"] + "{fastq}_R2.fastq.gz"
    output:
        countMtx = config["outputDir"] + "starSolo/{fastq}/{fastq}.Solo.out/Gene/raw/matrix.mtx",
        sortBam = config["outputDir"] + "starSolo/{fastq}/{fastq}.Aligned.sortedByCoord.out.bam"
    params:
        outputPrefix = lambda wildcards: config["outputDir"] + "starSolo/" + wildcards.fastq + "/" + wildcards.fastq + ".",
        whitelist = config["whitelist"],
        soloType = config["STARArgs"].get("soloType","Droplet"),
        barcodeLen = 0,
        CBLen = config["CBLen"],
        UMIStart = config["UMIStart"],
        UMILen = config["UMILen"],
        genome = config["StarsoloGenome"],
        readFilesCommand = config["STARArgs"].get("readFilesCommand","zcat"),
        outSAMtype = config["STARArgs"].get("outSAMtype","BAM SortedByCoordinate"),
        soloStrand = config["STARArgs"].get("soloStrand","Forward"),
        winAnchorMultimapNmax = config["STARArgs"].get("winAnchorMultimapNmax",2000),
        outFilterMultimapNmax = config["STARArgs"].get("outFilterMultimapNmax",2000),
        outSAMprimaryFlag = config["STARArgs"].get("outSAMprimaryFlag","AllBestScore"),
        outSAMmultNmax = config["STARArgs"].get("outSAMmultNmax",1),
        limitBAMsortRAM = config["STARArgs"].get("limitBAMsortRAM",40000000000),
        limitOutSJoneRead = config["STARArgs"].get("limitOutSJoneRead",10000),
        limitOutSJcollapsed = config["STARArgs"].get("limitOutSJcollapsed",3000000),
        limitIObufferSize = config["STARArgs"].get("limitIObufferSize",300000000),
        outSAMattributes = config["STARArgs"].get("outSAMattributes"),
        additionalArguments = config["STARArgs"].get("additionalArguments", ""),
        starOptions = starsoloParameters
    threads: math.ceil(config["STARArgs"].get("threads",16) * scaleDownThreads)
    conda:
        "../envs/generateH5.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * config["STARArgs"].get("memory",40000)
    shell:
        """
        STAR \
            --genomeDir {params.genome} \
            --runThreadN {threads} \
            --readFilesIn {input.R2} {input.R1} \
            --readFilesCommand {params.readFilesCommand} \
            --outSAMtype {params.outSAMtype} \
            --winAnchorMultimapNmax {params.winAnchorMultimapNmax} \
            --outFilterMultimapNmax {params.outFilterMultimapNmax} \
            --outSAMprimaryFlag {params.outSAMprimaryFlag} \
            --outSAMmultNmax {params.outSAMmultNmax} \
            --limitBAMsortRAM {params.limitBAMsortRAM} \
            --limitOutSJoneRead {params.limitOutSJoneRead} \
            --limitOutSJcollapsed {params.limitOutSJcollapsed} \
            --limitIObufferSize {params.limitIObufferSize} \
            --outFileNamePrefix {params.outputPrefix} \
            --outSAMattributes {params.outSAMattributes} \
            {starsoloParameters} \
            {params.additionalArguments}
        """

rule featureCount:
    input:
        bamfile = rules.mapFastq.output.sortBam,
        gtf = config["featureCountGTF"]
    output:
        temp(config["outputDir"] + "featureCount/" + config["countBy"] + "/{fastq}.Aligned.sortedByCoord.out.bam.featureCounts.bam")
    params:
        outDir = config["outputDir"] + "featureCount/" + config["countBy"] + "/",
        strand = config["featureCountArgs"].get("strand",1)
    threads: math.ceil(config["featureCountArgs"].get("threads",16) * scaleDownThreads)
    conda:
        "../envs/generateH5.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * config["featureCountArgs"].get("memory",10000)
    shell:
        """
        featureCounts --fraction -M -s {params.strand} -T {threads} -t exon -g gene_name \
            -a {input.gtf} -o {params.outDir}{wildcards.fastq}.counts.txt {input.bamfile} \
            -R BAM --Rpath {params.outDir}
        """

rule featureCountToDT:
    input:
        rules.featureCount.output
    output:
        config["outputDir"] + "featureCount/" + config["countBy"] + "/" + "{fastq}.tsv.gz"
    params:
        countScript = srcdir("../scripts/countBamFromSTARSolo.sh"),
        outDir = config["outputDir"] + "featureCount/",
        numHit = config['numHit']
    threads: math.ceil(config["featureCountArgs"].get("threads",8) * scaleDownThreads)
    conda:
        "../envs/generateH5.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * config["featureCountArgs"].get("memory",4000)
    shell:
        """
        samtools view {input} | NUMHIT={params.numHit} bash {params.countScript} | pigz -p {threads} -c > {output}
        """

rule DTToH5:
    input:
        rules.featureCountToDT.output
    output:
        countMtx = config["outputDir"] + "featureCount/" + config["countBy"] + "/{fastq}.h5",
        molInfo = config["outputDir"] + "featureCount/" + config["countBy"] + "/{fastq}.molInfo.h5"
    params:
        writeToh5 = srcdir("../scripts/writeHDF5.R"),
        whitelist = config["whitelist"]
    threads: math.ceil(config["DTToH5Args"].get("threads",16) * scaleDownThreads)
    conda:
        "../envs/generateH5.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * config["DTToH5Args"].get("memory",4000)
    shell:
        """
        Rscript {params.writeToh5} {input} {params.whitelist} {output.countMtx} {threads} {wildcards.fastq}
        """
rule extractUTR:
    input:
        config["goodGTF"]
    output:
        temp(config["outputDir"] + "ortholog/" + "threeUTR.gtf")
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($3=="three_prime_utr") {{$3 = "exon"; print "chr"$0;}} }}' {input} > {output}
        """

rule mapUTR:
    input:
        utrgtf = rules.extractUTR.output,
        liftChain = config["liftChain"]
    output:
        tmpMapGtf = temp(config["outputDir"] + "ortholog/" + "mapUTR.tmp.gtf"),
        mapGtfPos = config["outputDir"] + "ortholog/" + "mapUTR.pos.gtf",
        mapGtfNeg = config["outputDir"] + "ortholog/" + "mapUTR.neg.gtf"
    params:
        minMatch = config['liftMinMatch'],
        rtracklayer = srcdir("../scripts/rtracklayer.R")
    threads: 16
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 30000
    shell:
        """
        # liftOver -gff -minMatch={params.minMatch} {input.utrgtf} {input.liftChain} {output.tmpMapGtf} unmap.bed
        Rscript {params.rtracklayer} {input.utrgtf} {input.liftChain} {output.tmpMapGtf}
        # fix short exons
        mawk 'BEGIN{{FS="\\t";OFS="\\t"}}{{if($5-$4 > 50 && $7 == "+") {{if($1 ~ /^chr/) $1 = substr($1, 4, length($1)); print $0; $3 = "transcript"; print $0;}}}}' {output.tmpMapGtf} > {output.mapGtfPos}
        mawk 'BEGIN{{FS="\\t";OFS="\\t"}}{{if($5-$4 > 50 && $7 == "-") {{if($1 ~ /^chr/) $1 = substr($1, 4, length($1)); print $0; $3 = "transcript"; print $0;}}}}' {output.tmpMapGtf} > {output.mapGtfNeg}
        """

rule mergeOrthologGtf:
    input:
        config["outputDir"] + "ortholog/" + "mapUTR.{strand}.gtf",
        config["outputDir"] + "reference.{strand}.gtf"
    output:
        mergedGtf = config["outputDir"] + "ortholog/" + "merged.{strand}.gtf"
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        stringtie -f 0 -T 0 -F 0 -l ORTH --merge -g 250 -o {output.mergedGtf} {input}
        """

rule annoOrthologGtf:
    input:
        mergedGtf = rules.mergeOrthologGtf.output.mergedGtf,
        refGtf = config["outputDir"] + "reference.{strand}.gtf"
    output:
        annoGtf = config["outputDir"] + "ortholog/" + "anno.{strand}.annotated.gtf"
    params:
        outprefix = lambda wildcards: config["outputDir"] + "ortholog/" + "anno." + wildcards.strand
    conda:
        "../envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        gffcompare -o {params.outprefix} -r {input.refGtf} {input.mergedGtf} -K
        """
rule prepareNoneMTRef:
    input:
        config["refGTF"]
    output:
        posGtf = config["outputDir"] + "reference.pos.gtf",
        negGtf = config["outputDir"] + "reference.neg.gtf"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        grep -v '^MT' {input} | mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="+") print; }}' > {output.posGtf}
        grep -v '^MT' {input} | mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="-") print; }}' > {output.negGtf}
        """

rule prepareMTRef:
    input:
        config["refGTF"]
    output:
        posGtf = config["outputDir"] + "mt.pos.gtf",
        negGtf = config["outputDir"] + "mt.neg.gtf"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        grep '^MT' {input} | mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="+") print; }}' > {output.posGtf} || true
        grep '^MT' {input} | mawk 'BEGIN{{FS="\\t";OFS="\\t"}} {{if($7=="-") print; }}' > {output.negGtf} || true
        """
from snakemake.utils import R
from itertools import repeat
from os.path import join
import os
import shutil
import pandas
import json
import jsonschema
import sys
import math

# report: "report/workflow.rst"

# this container defines the underlying OS for each job when using the workflow
# with --use-conda --use-singularity
# singularity: "docker://512303535801.dkr.ecr.eu-west-1.amazonaws.com/scrnax_conda_env:v1.1.1"
container: "docker://continuumio/miniconda3:4.7.12"

##### load config and sample sheets #####
# validate(config, schema="../schemas/config.schema.yaml")
# samples = pd.read_csv(config["samples"], sep="\t").set_index("sample", drop=False)
# samples.index.names = ["sample_id"]
# validate(samples, schema="../schemas/samples.schema.yaml")

# Globals ---------------------------------------------------------------------
scaleDownThreads = config.get("scaleDownThreads", 1)

# Selftest ---------------------------------------------------------------------
if config.get('example', "") == "true":
    configfile: srcdir("../config/conf.selftest.json")
else:
    configfile: srcdir("../config/conf.default.json")

# Initial optional parameters ---------------------------------------------------
# optionalKeys = {
#     "numHit": 10000,
#     "featureCountArgs": {},
#     "featureCountToDTArgs": {},
#     "STARArgs": {},
#     "DTToH5Args": {}
# }
# for tmpKey, tmpValue in optionalKeys.items():
#     if tmpKey not in config:
#         config[tmpKey] = tmpValue

# Prepare outputs ---------------------------------------------------------------
resultFiles = [config["outputDir"] + "ortholog/fixed.gtf"]
if "bamDir" in config:
    if config['bamDir'][len(config['bamDir'])-1] != "/":
        config['bamDir'] = config['bamDir'] + "/"
    IDS, = glob_wildcards(config['bamDir'] + "{sample}.bam")
    resultFiles = [
        resultFiles,
        config["outputDir"] + "denovo/fixed.gtf",
        config["outputDir"] + "combined.fixed.gtf"
    ]
else:
    IDS = ""
    config["bamDir"] = ""

# How the fastqs shall be counted, i.e., by original GTF, or from denovo assembled GTF,
# or from ortholog GTF, or combined GTF.
if "countBy" in config:
    if config["countBy"] == "combined":
        config["featureCountGTF"] = config["outputDir"] + "combined.fixed.gtf"
    else:
        config["featureCountGTF"] = config["outputDir"] + config["countBy"] + "/fixed.gtf"
elif config["bamDir"] != "":
    config["countBy"] = "combined"
    config["featureCountGTF"] = config["outputDir"] + "combined.fixed.gtf"
else:
    config["countBy"] = "ortholog"
    config["featureCountGTF"] = config["outputDir"] + "ortholog/fixed.gtf"

# define the parameters for STARSolo
# Generate count matrix if fastq files are provided.
if "fastqDir" in config:
    if config['fastqDir'][len(config['fastqDir'])-1] != "/":
        config['fastqDir'] = config['fastqDir'] + "/"
    FASTQS, = glob_wildcards(config['fastqDir'] + "{fastq}_R1.fastq.gz")
    resultFiles = resultFiles + [config["outputDir"] + "featureCount/" + config["countBy"] + "/" + str(i) + ".h5" for i in FASTQS]
    # And validate the json config file
    jsS = open(srcdir("schemas/jsonSchemaStarSolo.json")).read()
else:
    jsS = open(srcdir("schemas/jsonSchemaWithoutStarSolo.json")).read()
    config["fastqDir"] = ""

# validate json config file
jsSchema = json.loads(jsS)
try:
    jsonschema.validate(config, jsSchema)
except jsonschema.exceptions.ValidationError as e:
    print("well-formed but invalid JSON:", e)
    sys.exit()
except json.decoder.JSONDecodeError as e:
    print("poorly-formed text, not JSON:", e)
    sys.exit()

# set default parameters for STARSolo
starsoloParameters = '''
    --soloType {solotype} \
    --soloCBwhitelist {whitelist} \
    --soloCBlen {CBLen} \
    --soloUMIstart {UMIStart} \
    --soloUMIlen {UMILen} \
    --soloBarcodeReadLength 0 \
    --soloStrand {soloStrand} \
    --soloFeatures {soloFeatures} \
'''.format(solotype = config["STARArgs"].get("soloType","CB_UMI_Simple"),
           whitelist = config['whitelist'],
           CBLen = config['CBLen'],
           UMIStart = config["UMIStart"],
           UMILen = config["UMILen"],
           soloStrand = config["STARArgs"].get("soloStrand","Forward"),
           soloFeatures = config["STARArgs"].get("soloFeatures","Gene GeneFull SJ Velocyto"))

config["STARArgs"]["outSAMattributes"] = "NH HI nM AS CR UR CB UB GX GN sS sQ sM"

# print(resultFiles)

rule all:
    input:
        resultFiles

include: "rules/prepareOriginalGTF.py"
include: "rules/assembleGTF.py"
include: "rules/orthologGTF.py"
include: "rules/countFastq.py"

rule fixOriginGtf:
    input:
        config['refGTF']
    output:
        fixedGtf = config["outputDir"] + "origin/" + "fixed.gtf",
    params:
        fixGTFRscript = srcdir("scripts/fixGffcompareGTF.R")
    conda:
        "envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        Rscript {params.fixGTFRscript} {input} {output.fixedGtf}
        """

rule fixOrthologGtf:
    input:
        annoGtf = rules.annoOrthologGtf.output.annoGtf,
        mtGtf = config["outputDir"] + "mt.{strand}.gtf"
    output:
        fixedGtf = config["outputDir"] + "ortholog/" + "fixed.{strand}.gtf",
        tmpGtf = temp(config["outputDir"] + "ortholog/anno.{strand}.combined.fixed.tmp.gtf")
    params:
        fixGTFRscript = srcdir("scripts/fixGffcompareGTF.R")
    conda:
        "envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        cat {input.annoGtf} > {output.tmpGtf}
        cat {input.mtGtf} >> {output.tmpGtf}
        Rscript {params.fixGTFRscript} {output.tmpGtf} {output.fixedGtf}
        """

rule mergeStrandedOrthologGtf:
    input:
        config["outputDir"] + "ortholog/" + "fixed.pos.gtf",
        config["outputDir"] + "ortholog/" + "fixed.neg.gtf"
    output:
        config["outputDir"] + "ortholog/" + "fixed.gtf"
    conda:
        "envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        cat {input} > {output}
        """

rule fixDenovoGtf:
    input:
        annoGtf = rules.annoDevoGtf.output,
        mtGtf = config["outputDir"] + "mt.{strand}.gtf"
    output:
        fixedGtf = config["outputDir"] + "denovo/fixed.{strand}.gtf",
        tmpGtf = temp(config["outputDir"] + "denovo/anno.{strand}.combined.fixed.tmp.gtf")
    params:
        fixGTFRscript = srcdir("scripts/fixGffcompareGTF.R")
    conda:
        "envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        cat {input.annoGtf} > {output.tmpGtf}
        cat {input.mtGtf} >> {output.tmpGtf}
        Rscript {params.fixGTFRscript} {output.tmpGtf} {output.fixedGtf}
        """

rule mergeStrandedDenovoGtf:
    input:
        config["outputDir"] + "denovo/" + "fixed.pos.gtf",
        config["outputDir"] + "denovo/" + "fixed.neg.gtf"
    output:
        config["outputDir"] + "denovo/" + "fixed.gtf"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 1
    shell:
        """
        cat {input} > {output}
        """

rule combineAndFixGtf:
    input:
        # rules.fixDenovoGtf.output.fixedGtf,
        # rules.fixOrthologGtf.output.fixedGtf
        denoAnnoGtf = rules.annoDevoGtf.output,
        orthAnnoGtf = rules.annoOrthologGtf.output,
        refGtf = config["outputDir"] + "reference.{strand}.gtf",
        mtGtf = config["outputDir"] + "mt.{strand}.gtf"
    output:
        combinedMergedGtf = config["outputDir"] + "merged.{strand}.gtf",
        combinedMergedGtfNR = config["outputDir"] + "merged.{strand}.nr.gtf",
        tmpGtf = config["outputDir"] + "tmp.{strand}.annotated.gtf",
        fixedGtf = config["outputDir"] + "combined.fixed.{strand}.gtf"
        # gtflist = config["outputDir"] + "gtflist.txt"
    params:
        fixGTFRscript = srcdir("scripts/fixGffcompareGTF.R"),
        # annoOrthPath = rules.annoOrthologGtf.output,
        # annoDevoPath = rules.annoDevoGtf.output,
        outprefix = lambda wildcards: config["outputDir"] + "tmp." + wildcards.strand,
        mergeGtfDir = config["outputDir"]
    conda:
        "envs/refineUTR.yaml"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 4000
    threads: 8
    shell:
        """
        cat {input.denoAnnoGtf} {input.orthAnnoGtf} > {output.combinedMergedGtf}
        stringtie --merge -l COMB -g -1000000000 -i -F 0 -T 0 -f 0 {output.combinedMergedGtf} > {output.combinedMergedGtfNR}
        gffcompare -o {params.outprefix} -r {input.refGtf} {output.combinedMergedGtfNR} -K
        cat {input.mtGtf} >> {output.tmpGtf}
        Rscript {params.fixGTFRscript} {output.tmpGtf} {output.fixedGtf}
        """

rule mergeStrandedCombinedGtf:
    input:
        config["outputDir"] + "combined.fixed.pos.gtf",
        config["outputDir"] + "combined.fixed.neg.gtf"
    output:
        config["outputDir"] + "combined.fixed.gtf"
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 1000
    threads: 1
    shell:
        """
        cat {input} > {output}
        """
Support
- Future updates