Polyadenylation aware read trimming and alternative polyadenylation site analysis
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Description
Purpose of the work flow is to run RNA-seq trimming and alignment to a given reference steps and perform initial polyadenylation sites' analysis.
Dependencies
-
julia v0.6.1
-
snakemake 5.2.2
-
BBMAP 38.22
-
STAR 2.6.1a
-
seqkit 0.8.1
-
sambamba 0.6.6
You can install the dependencies manually or through conda environment as indicated below. If you choose to install the required software manually please directly to the step 6 of the Workflow setup.
Workflow setup
-
Install
conda
:
wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh &&
chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh &&
./miniconda/Miniconda3-latest-Linux-x86_64.sh
-
Add path of
miniconda
to.bashrc
if not selected option to add automatically during installation:
cd ~/ && pth=$(pwd) &&
echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc
- Add channels to conda:
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels anaconda
conda config --add channels defaults
-
Create your
conda
environment:
conda env create -f envs/3EndD.yaml -n 3EndD
- Activate created environment:
source activate 3EndD
- Install the required julia packages:
bash envs/build_julia_pkgs.sh
Analysis
-
Run
snakemake
as follows:
snakemake --configfile config.yaml -j 24 -k -p
This would ran an analysis on a small dataset matching human brain RNA-seq data of 21th chromosome.
PolyAAnalysis tests
-
Run
julia
:
source activate 3EndD;
julia PolyAAnalysis.jl/tests/runtests.jl
Notes for analysis configuration
config.yaml
- Configuration of your working environment:
INPUT-DIR: dir with fastq files
SAMPLES: sample names, part of a file name. See example bellow.
LANE-REGEX: python regex for finding same sample files split by lanes. eg. use "L\\d\\d\\d_" to find any SAMPLE1_R[1,2]_L00[1,2,3]_001.fastq.gz SAMPLE2_R[1,2]_L00[1,2,3]_001.fastq.gz in INPUT-DIR. Sample names should be unique.
memory_java: for human 60 GB is recommended.
threads_julia: specify threads for `julia`. Usually no more then 8. Some scripts scales only reading GZ | BAM files.
threads_star: specify threads for `STAR` aligner.
threads_sambamba: specify threads for sambamba tools.
SCRATCH: location of tmp directory for faster computation (SSD).
REFERENCE: genome in fasta format.
GFF3: annotation files, should be the same version.
GTF: annotation files, should be the same version.
TRANSCRIPTS: Optional, generated from fasta and annotations if not provided.
- Supported file names:
SAMPLE1_L001_R1_001.fastq.gz
SAMPLE1_L001_R2_001.fastq.gz
SAMPLE1_L001_1.fq.gz
SAMPLE1_L001_2.fq.gz
SAMPLE1_L001_R1_001.sfq
SAMPLE1_L001_R2_001.sfq
SAMPLE1_L001_1.sfq
SAMPLE1_L001_1.sfq
SAMPLES and LANE-REGEX options in your config would be:
SAMPLES: SAMPLE1
LANE-REGEX: "L\\d\\d\\d_"
- If you want to normalize all samples by lowest read number found:
SUBSAMPLING:
run: true
- If you want to subsample to specific read number:
SUBSAMPLING:
run: true
subsample_to: N
- Additional parameters which are not listed in config can be passed as a string for BBDUK, STAR and PolyA annotator programs.
additional_params: "passed as a string for specified programs."
- Specific parameters for polyA | termination sites annotator:
ANNOTATE-TS:
k: distance from the cluster center allowed.
m: minimum distance between clusters allowed. Two adjacent clusters with distance <= m will be merged.
additional_params: pass -c|--cluster to run clustering, pass -v|--verbose to print proceeding of clustering.
mappingquality: Only reads with greater than this mapping value will pass.
strandedness: List your SAMPLES bellow providing for each strandedness.
HBR_100_Collibri_chr21small_A: If R1 read is the same as a gene sequence: "+". If R2 read is the same as a gene sequence: "-"
Annotated BED files for termination sites
Workflow generates annotated transcription termination sites in
ANNOTATE-POLYA
directory.
For
*annotated_polyA_clusters.bed
files:
Column Nr | Field Name | Explanation |
---|---|---|
1 | Chr | Chromosome |
2 | Start | Start (0-based) |
3 | End | End |
4 | GeneName | Gene Name |
5 | ClusterSize | Detected number of reads representing this termination site. |
6 | Strand | Strand |
7 | Feature | Feature |
8 | ClusterCenter | Position of cluster center. 1-based. |
9 | Biotype | Biotype |
10 | ClusterMedian | Median of frequencies of reads representing each termination site in cluster. |
11 | ClusterMean | Mean of frequencies of reads representing each termination site in cluster. |
12 | ClusterMin | Min of frequencies of reads representing each termination site in cluster. |
13 | ClusterMax | Max of frequencies of reads representing each termination site in cluster. |
14 | Cluster1stQuartile | 1st quartile of frequencies of reads representing each termination site in cluster. |
15 | Cluster3rdQuartile | 3rd quartile of frequencies of reads representing each termination site in cluster. |
16 | TSMedian | Median length polyA tail. |
17 | TSMean | Mean length polyA tail. |
18 | TSMin | Min length polyA tail. |
19 | TSMax | Max length polyA tail. |
20 | TS1stQuartile | 1st quartile of lengths of polyA tail. |
21 | TS3rdQuartile | 3rd quartile of lengths of polyA tail. |
In addition for
*annotated_polyA_clusters_plus_coverage.bed
files
Column Nr | Field Name | Explanation |
---|---|---|
22 | - | Sambamba added Read count at Chr:Start-End |
23 | - | Sambamba added mean coverage at Chr:Start-End |
24 | - | Sambamba added Sample Name if any. |
For
*annotated_polyA.bed
files:
Column Nr | Field Name | Explanation |
---|---|---|
1 | Chr | Chromosome |
2 | Start | Start (0-based) |
3 | End | End |
4 | GeneName | Gene Name |
5 | Count | Detected number of reads representing this termination site. |
6 | Strand | Strand |
7 | Feature | Feature |
8 | Median | Median length polyA tail. |
9 | Min | Min length polyA tail. |
10 | Max | Max length polyA tail. |
11 | Biotype | Biotype |
Code Snippets
24 25 26 27 28 29 30 | for fastq in "$@"; do { zcat $fastq 2>/dev/null || cat $fastq; } | \ #sed -n '1~4p' | \ sed -n '1,${p;n;n;n;}' | \ awk '{if (substr($0,0,1)!="@") {exit 1}; nreads+=1;} END {printf nreads " "}' done |
53 54 | shell: "echo $(git log) | cut -d ' ' -f 2 &>> {output}" |
59 60 61 | run: with open(logs + "/config.yaml", 'w') as outfile: yaml.dump(CONFIG, outfile, default_flow_style=False) |
8 9 | run: install_slimfastq() |
17 18 | run: get_sample_files_named_pipe_script(wildcards.stem) |
26 27 | shell: "./{input} > {output}" |
35 36 | shell: "./{input} > {output}" |
18 19 20 | shell: "fastqc {input} -o {params.out_dir} -t {threads} " + "-k {params.kmer_size} 2> {log}" |
30 31 | shell: "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_raw_reads" |
41 42 | shell: "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_trimmed_reads" |
52 53 | shell: "multiqc {input} -o "+MULTIQC_DIR+" -n fastqc_report_processed_polyA_reads --force " |
30 31 32 33 34 35 36 37 38 39 40 | shell: "bbduk.sh in={input[0]} out={output[1]} " + "in2={input[1]} out2={output[2]} " + "ref={params.ref} threads={threads} " + "ktrim={params.ktrim} k={params.k} " + "mink={params.mink} hdist={params.hdist} " + "minlength={params.minlength} maxns={params.maxns} " + "qtrim={params.qtrim} trimq={params.trimq} " + "{params.add} threads={threads} " + "stats={output[0]} overwrite=t " + "-Xmx{params.m}g 2> {log}" |
63 64 65 66 67 68 69 70 | shell: "minimum={params.target_count}; " + "seqkit sample -s 11 -j {threads} --two-pass " + "-n {params.target_count} {input[0]} -o {output[0]} " + "2> {log[0]}; " + "seqkit sample -s 11 -j {threads} --two-pass " + "-n {params.target_count} {input[1]} -o {output[1]} " + "2> {log[1]}" |
78 79 | shell: "cat {input} > {output}" |
86 87 | shell: "cp {input} {output}" |
99 100 101 | shell: "./scripts/fastq_num_reads.sh {input} > {output}; " + "sed -i -e 's/^/{params.prefix}/' {output} ; echo >> {output}" |
120 121 | shell: """minimum=`python ./scripts/""" + |
21 22 23 24 | shell: "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " + "-a {input[0]} -b {input[1]} -o {params.output_stem} " + "-r {params.gz} 2>&1 | tee -a {log}" |
45 46 47 48 | shell: "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " + "-a {input[0]} -b {input[1]} -o {params.output_stem} " + "-g {params.ref} -f {params.gff} 2>&1 | tee -a {log}" |
68 69 70 71 | shell: "julia --depwarn=no PolyAAnalysis.jl/scripts/mark_poly_A.jl -i -p {threads} " + "-a {input[0]} -b {input[1]} -o {params.output_stem} " + "-g {params.ref} 2>&1 | tee -a {log}" |
20 21 22 23 24 25 26 27 | shell: "if [ ! -d {params.genomeDir} ]; then " + " mkdir {params.genomeDir}; fi && " + " STAR --runMode genomeGenerate " + "--runThreadN {threads} " + "--genomeDir {params.genomeDir} " + "--genomeFastaFiles {input.ref} " + "--sjdbGTFfile {input.gtf} 2> {log}" |
49 50 51 52 53 54 55 56 57 58 59 60 61 62 | shell: "STAR --outTmpDir {params.tmp} --runThreadN {threads} " + "--genomeDir {input.genomeDir} --runMode alignReads " + "--outSAMunmapped Within " + "--outFileNamePrefix {params.prefix} " + "--readFilesIn {input[0]} {input[1]} --readFilesCommand zcat " + "--genomeLoad NoSharedMemory " + "--outSAMstrandField intronMotif --outSAMattributes All " + "--outSAMtype BAM Unsorted " + "--chimSegmentMin {params.chimSegmentMin} " + "{params.additional}; " + "mv {params.starPrefix}.bam {output[0]}; " + "mv {params.prefix}Log.final.out {output[1]}; " + "mv {params.prefix}Log.out {log}" |
77 78 79 | shell: "sambamba sort --tmpdir={params.temp_dir_sambamba} -p " + "-t{threads} -o {output[0]} {input[0]} 2> {log}" |
92 93 | shell: "sambamba index -p -t{threads} {input[0]} 2> {log}" |
26 27 28 29 30 | shell: "export JULIA_NUM_THREADS={threads}; julia --depwarn=no " + "PolyAAnalysis.jl/scripts/annotate_polyA.jl -b {input} -o {params.pref} " + "-g {params.gff} -k {params.k} -m {params.m} -q {params.q} " + "-s {params.s} {params.add_params} &> {log}" |
45 46 47 48 | shell: "sambamba depth region -F 'mapping_quality > {params.q}' " + "-t {threads} -L {input[1]} " + "{input[0]} | sed '/^#/ d' > {output}" |
Support
- Future updates
Related Workflows





