This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Prediction of RNA-RNA contacts from RIC-seq data. Developed by Sergei Margasyuk (smargasyuk@gmail.com) and Dmitri Pervouchine (pervouchine@gmail.com).
Description
This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data (Cai et al., 2020 ).
Usage
Step 1: Obtain a copy of this workflow
Clone this repository to your local system, into the place where you want to perform the data analysis.
git clone https://github.com/smargasyuk/RNAcontacts.git
cd RNAcontacts
git checkout v0.2.4
Step 2: Configure workflow
Configure the workflow according to your needs via editing the files in the
config/
folder. Adjust
config.yaml
to configure the workflow execution, and
samples.tsv
to specify your sample setup (described in
Settings
).
Step 3: Install Snakemake
Install Snakemake using conda :
# install mamba package manager if you don't have it
conda install -n base -c conda-forge mamba
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation .
Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
See the Snakemake documentation for further details.
Test run
To make a test run, type
snakemake --use-conda --configfile config/config.test.yaml -c8 test
The rule will download a toy dataset (sample sheet, truncated fastq files, genome, and genome annotation confined to the first 100MB of chr1), unpack, and execute the pipeline. The output files in
results/test_hg19/test/contacts
will be compared to those provided in the archive.
Run on full RIC-seq data
Download the RIC-seq files for HeLa cell line from GEO repository GSE127188 . Download the control RNA-seq files from ENCODE consortium webpage .
The files are as follows:
RNASeq_HeLa_total_rep1: - ENCFF000FOM.fastq.gz - ENCFF000FOV.fastq.gz RNASeq_HeLa_total_rep2: - ENCFF000FOK.fastq.gz - ENCFF000FOY.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep1: - SRR8632820_1.fastq.gz - SRR8632820_2.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep2: - SRR8632821_1.fastq.gz - SRR8632821_2.fastq.gz
Put the files somewhere, fill the sample sheet (described in Settings ), and run the pipeline.
Step 5: Investigate results
The output of the pipeline consists of the following files:
-
results/{genome}/{project}/{sample}/contacts
is the list of contacts and their respective read counts in tsv format (columns 1-3 and 4-6 are the contacting coordinates, column 7 is read count). -
results/{genome}/{project}/views/global/contacts.bed
is the BED12 file with contacts on the same chromosome and length less than the threshold defined inconfig
.
These files for HeLa experiment are available at 10.5281/zenodo.6511342 .
Code Snippets
15 16 | wrapper: "0.84.0/bio/star/index" |
33 34 | wrapper: "0.84.0/bio/star/align" |
42 43 44 45 | shell: """ awk -v 'OFS="\t"' '$5 == 1' {input} > {output} """ |
64 65 | wrapper: "0.84.0/bio/star/align" |
2 3 4 5 | shell: """ find results/*/*/*/ -maxdepth 0 -not -name "bam" -exec rm -rf -v {{}} \; rm -rf -v results/trackhub """ |
6 7 8 9 10 | shell: """ mkdir -p $(dirname {output}) cut -f1,2,3 {input.control_jxn} {input.star_ref_dir}/sjdbList.out.tab | sort -u > {output} """ |
19 20 21 22 23 24 25 | shell: """ mkdir -p $(dirname {output}) sort --parallel=8 -S4G -k9,9 <(cut -f1-6,10 {input.mate0} | awk -v OFS="\t" 'BEGIN{{s["+"]="-";s["-"]="+";}}{{print $4, $5, $5+1, s[$6], $1, $2, $2+1, s[$3], $7, 0}}') \ <(cut -f1-6,10 {input.mate1} | awk -v OFS="\t" '{{print $1, $2, $2+1, $3, $4, $5, $5+1, $6, $7, 1}}') | \ grep -v GL > {output} """ |
35 36 37 38 39 40 41 | shell: """ mkdir -p $(dirname {output}) sort -k9,9 -S4G --parallel=8 <(samtools view {input.mate0} | perl workflow/scripts/neo.pl {input.junctions} 0) \ <(samtools view {input.mate1} | perl workflow/scripts/neo.pl {input.junctions} 1) | grep -v GL |\ awk -v 'OFS=\\t' '{{print $1,$2,$2+1,$3,$4,$5,$5+1,$6,$7,$8}}' > {output} """ |
51 52 53 54 55 | shell: """ cut -f1-4 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.donors} cut -f5-8 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.acceptors} """ |
67 68 69 70 71 72 73 | shell: """ mkdir -p $(dirname {output}) sort-bed {input.junctions} | \ intersectBed -a stdin -b {input.donors} -wa -wb | cut -f5- | sort-bed - | \ intersectBed -a stdin -b {input.acceptors} -wa -wb | cut -f5- | sort -k1,1 > {output} """ |
81 82 83 84 85 | shell: """ mkdir -p $(dirname {output}) awk '{{n[$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8]++}}END{{for(j in n){{print j"\t"n[j]}}}}' {input} | sort > {output} """ |
96 97 98 99 100 101 102 103 104 105 106 107 108 | shell: """ mkdir -p $(dirname {output[0]}) cat {input} |\ cut -f 3- |\ awk -v 'OFS=\t' '$1==$4' |\ awk -v 'OFS=\t' '{{if ($2>$6){{s1=$2;s2=$3;$2=$5;$3=$6;$5=s1;$6=s2}};print}}' |\ awk -v 'OFS=\t' '$2<$6' |\ awk -v 'OFS=\t' '($3<$5)' |\ awk -v 'OFS=\t' '$6-$2<{params.max_size}' |\ awk -v 'OFS=\t' '{{n[$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6]++}}END{{for(j in n){{print j"\t"n[j]}}}}' |\ awk -v 'OFS=\t' '{{print $1,$2,$6,"id_"NR",reads="$7,$7,"+",$2,$6,"0,0,0",2,$3-$2","$6-$5,0","$5-$2}}' | sort-bed - > {output} """ |
117 118 119 120 121 122 123 124 | shell: """ mkdir -p $(dirname {output}) echo 'track name="{wildcards.project} contacts" visibility=2 itemRgb="On"' > {output} python workflow/scripts/bed12_merge_color.py {input} | \ sort-bed - |\ awk -v 'OFS=\t' '$4="id_"NR",count="$5' >> {output} """ |
26 27 28 29 | shell: """ mkdir -p $(dirname {output}) cp {input} {output} """ |
35 36 37 38 | shell: """ mkdir -p $(dirname {output}) tail -n +2 {input} > {output} """ |
47 48 49 50 | shell: """ mkdir -p $(dirname {output}) bedClip {input.bed} {input.chromsizes} {output} """ |
56 57 58 59 | shell: """ mkdir -p $(dirname {output}) fetchChromSizes {wildcards.genome} > {output} """ |
66 67 68 69 | shell: """ mkdir -p $(dirname {output}) awk -v 'OFS=\\t' '$5=$5<999?$5:999' {input} | sort-bed - > {output} """ |
78 79 80 81 | shell: """ mkdir -p $(dirname {output}) bedToBigBed {input.bed} {input.chromsizes} {output} """ |
87 88 89 90 | run: with open(output[0], 'w') as handle: for i in input: handle.write(hub_template.format(Path(i).stem)) |
96 97 98 99 | run: with open(output[0], 'w') as handle: for g in get_all_genomes(): handle.write(genome_template.format(g)) |
105 106 107 108 | shell: """ mkdir -p $(dirname {output}) cp {input} {output} """ |
8 9 10 11 | shell:""" mkdir -p $(dirname {output}) awk -v 'OFS=\\t' '$3=$3+1' {input} | sort-bed - > {output} """ |
20 21 22 23 24 25 26 27 28 | shell: """ mkdir -p $(dirname {output.bed}) cat {input.tsv} | awk -v 'OFS=\\t' '$1==$5' | \ cut -f1,2,4,7,9,10 | \ awk -v 'OFS=\\t' -v 'name_prefix={wildcards.id}_{wildcards.jtype}' '{{print $1,$2,$4,name_prefix"_"NR,1,$3,$5,$6}}' | \ awk -v 'OFS=\\t' '{{if ($2 > $3){{t=$3;$3=$2;$2=t}}; print}}' |\ awk -v 'OFS=\\t' '$2<$3' |\ sort-bed - > {output.bed} """ |
37 38 39 40 | shell: """ mkdir -p $(dirname {output}) cat {input.contacts} | python workflow/scripts/bedj_filter_junctions.py --sjdb {input.junctions} > {output} """ |
47 48 49 50 | shell: """ mkdir -p $(dirname {output}) awk -v 'OFS=\\t' '$3-$2<{wildcards.range}' {input} > {output} """ |
59 60 61 62 | shell: """ mkdir -p $(dirname {output}) bedops -e {input.contacts} {input.junctions} > {output} """ |
71 72 73 | shell: """ bedops -u {input.neo} {input.chim} > {output} """ |
80 81 82 83 84 | shell: """ cut -f 1,2,3 {input} | sort | uniq -c | sed -r 's/([0-9]) /\\1\\t/' |\ awk -v 'OFS=\\t' -v 'name_prefix=id' '{{print $2,$3,$4,name_prefix"_"NR,$1,"+"}}' |\ sort-bed - > {output} """ |
91 92 93 94 | shell: """ mkdir -p $(dirname {output}) cat {input} | python workflow/scripts/bedj_to_bed12.py > {output} """ |
101 102 103 | shell: """ python workflow/scripts/bed12_merge_color.py {input} | sort-bed - | python workflow/scripts/bed_enumerate.py > {output} """ |
9 10 11 | shell: """ genomepy install -g resources/genomes/ -p {params.provider} --annotation {wildcards.assembly} """ |
21 22 23 24 25 | shell: """ mkdir -p $(dirname {output.fasta}) cp {input.fasta} {output.fasta} cp {input.annotation} {output.gtf} """ |
5 | shell: "wget https://zenodo.org/record/6475703/files/RICseq_toy_data.tgz" |
16 | shell: "tar -xf RICseq_toy_data.tgz" |
25 26 27 28 29 | run: for f1, f2 in zip(input["pipeline_out"], input["ref_out"]): if not filecmp.cmp(f1, f2, shallow = False): print(f"Files {f1} and {f2} are different") raise ValueError |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | import click import distinctipy @click.command() @click.argument("files", nargs=-1) def main(files): colors = distinctipy.get_colors(len(files), pastel_factor=0.7) color_strings = [ ",".join([str(int(i * 255)) for i in rgb_frac_tuple]) for rgb_frac_tuple in colors ] for file_idx, bed in enumerate(files): with open(bed) as f: for line in f.readlines(): fields = line.strip().split("\t") fields[8] = color_strings[file_idx] print("\t".join(fields)) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 | import sys def main(): for idx, line in enumerate(sys.stdin): fields = line.strip().split("\t") fields[3] = f"id={idx},count={fields[4]}" print("\t".join(fields)) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | import itertools from typing import Any, List, Tuple import sys import click import pipe from intervaltree import Interval, IntervalTree def pairwise(iterable): # s -> (s0,s1), (s1,s2), (s2, s3), ... a, b = itertools.tee(iterable) next(b, None) return itertools.islice(zip(a, b), None, None, 1) def parse_SJDB(SJDB_file, radius): tree_dict = {} with open(SJDB_file) as f: for line in f.readlines(): fields = line.split() if len(fields) < 3: continue l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2]) interval1, interval2 = Interval(l_start - radius, l_start + radius), Interval(l_end - radius, l_end + radius) if l_chrom in tree_dict: tree_dict[l_chrom].add(interval1) tree_dict[l_chrom].add(interval2) else: tree_dict[l_chrom] = IntervalTree([interval1, interval2]) return tree_dict def is_new_junction(l, parsed_junctions): fields = l.split() l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2]) if l_chrom not in parsed_junctions: return False return (not parsed_junctions[l_chrom].overlaps(l_start)) and (not parsed_junctions[l_chrom].overlaps(l_end)) @click.command() @click.option("--sjdb", help="BED3 file with junctions") @click.option("--radius", help="radius from known junction", type=int, default=25) def main(sjdb, radius): parsed_junctions = parse_SJDB(sjdb, radius) filtered_entries = ( sys.stdin | pipe.where(lambda l: is_new_junction(l, parsed_junctions)) ) for junction in filtered_entries: print(junction.strip()) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | import sys arm_len = 20 color = "255,255,255" def generate_view(line): fields = line.strip().split("\t") fields = fields[:6] fields[3] = f"{fields[3]};count={fields[4]}" fields += [fields[1], fields[2], color, "2"] fields.append(f"{arm_len},{arm_len}") fields.append(f"0,{int(fields[2]) - int(fields[1]) + arm_len}") fields[1], fields[2] = str(int(fields[1]) - arm_len), str(int(fields[2]) + arm_len) return "\t".join(fields) def main(): for junction in sys.stdin: print(generate_view(junction)) if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) fq1 = snakemake.input.get("fq1") assert fq1 is not None, "input-> fq1 is a required input parameter" fq1 = ( [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1 ) fq2 = snakemake.input.get("fq2") if fq2: fq2 = ( [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2 ) assert len(fq1) == len( fq2 ), "input-> equal number of files required for fq1 and fq2" input_str_fq1 = ",".join(fq1) input_str_fq2 = ",".join(fq2) if fq2 is not None else "" input_str = " ".join([input_str_fq1, input_str_fq2]) if fq1[0].endswith(".gz"): readcmd = "--readFilesCommand zcat" else: readcmd = "" if "SortedByCoordinate" in extra: bamprefix = "Aligned.sortedByCoord.out." else: bamprefix = "Aligned.out." outprefix = snakemake.output[0].split(bamprefix)[0] if outprefix == os.path.dirname(snakemake.output[0]): outprefix += "/" shell( "STAR " "{extra} " "--runThreadN {snakemake.threads} " "--genomeDir {snakemake.params.index} " "--readFilesIn {input_str} " "{readcmd} " "--outFileNamePrefix {outprefix} " "--outStd Log " "{log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | __author__ = "Thibault Dayris" __copyright__ = "Copyright 2019, Dayris Thibault" __email__ = "thibault.dayris@gustaveroussy.fr" __license__ = "MIT" from snakemake.shell import shell from snakemake.utils import makedirs log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params.get("extra", "") sjdb_overhang = snakemake.params.get("sjdbOverhang", "100") gtf = snakemake.input.get("gtf") if gtf is not None: gtf = "--sjdbGTFfile " + gtf sjdb_overhang = "--sjdbOverhang " + sjdb_overhang else: gtf = sjdb_overhang = "" makedirs(snakemake.output) shell( "STAR " # Tool "--runMode genomeGenerate " # Indexation mode "{extra} " # Optional parameters "--runThreadN {snakemake.threads} " # Number of threads "--genomeDir {snakemake.output} " # Path to output "--genomeFastaFiles {snakemake.input.fasta} " # Path to fasta files "{sjdb_overhang} " # Read-len - 1 "{gtf} " # Highly recommended GTF "{log}" # Logging ) |
Support
- Future updates
Related Workflows





