This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data.

public public 1yr ago Version: v0.2.4 0 bookmarks

Prediction of RNA-RNA contacts from RIC-seq data. Developed by Sergei Margasyuk (smargasyuk@gmail.com) and Dmitri Pervouchine (pervouchine@gmail.com).

Description

This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data (Cai et al., 2020 ).

Usage

Step 1: Obtain a copy of this workflow

Clone this repository to your local system, into the place where you want to perform the data analysis.

git clone https://github.com/smargasyuk/RNAcontacts.git
cd RNAcontacts
git checkout v0.2.4

Step 2: Configure workflow

Configure the workflow according to your needs via editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup (described in Settings ).

Step 3: Install Snakemake

Install Snakemake using conda :

# install mamba package manager if you don't have it
conda install -n base -c conda-forge mamba
conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation .

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

See the Snakemake documentation for further details.

Test run

To make a test run, type

snakemake --use-conda --configfile config/config.test.yaml -c8 test

The rule will download a toy dataset (sample sheet, truncated fastq files, genome, and genome annotation confined to the first 100MB of chr1), unpack, and execute the pipeline. The output files in results/test_hg19/test/contacts will be compared to those provided in the archive.

Run on full RIC-seq data

Download the RIC-seq files for HeLa cell line from GEO repository GSE127188 . Download the control RNA-seq files from ENCODE consortium webpage .

The files are as follows:

 RNASeq_HeLa_total_rep1: - ENCFF000FOM.fastq.gz - ENCFF000FOV.fastq.gz RNASeq_HeLa_total_rep2: - ENCFF000FOK.fastq.gz - ENCFF000FOY.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep1: - SRR8632820_1.fastq.gz - SRR8632820_2.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep2: - SRR8632821_1.fastq.gz - SRR8632821_2.fastq.gz

Put the files somewhere, fill the sample sheet (described in Settings ), and run the pipeline.

Step 5: Investigate results

The output of the pipeline consists of the following files:

  • results/{genome}/{project}/{sample}/contacts is the list of contacts and their respective read counts in tsv format (columns 1-3 and 4-6 are the contacting coordinates, column 7 is read count).

  • results/{genome}/{project}/views/global/contacts.bed is the BED12 file with contacts on the same chromosome and length less than the threshold defined in config .

These files for HeLa experiment are available at 10.5281/zenodo.6511342 .

Code Snippets

15
16
wrapper:
    "0.84.0/bio/star/index"
33
34
wrapper:
    "0.84.0/bio/star/align"
42
43
44
45
    shell:
        """
awk -v 'OFS="\t"' '$5 == 1' {input} > {output}
"""
64
65
wrapper:
    "0.84.0/bio/star/align"  
2
3
4
5
    shell: """
find results/*/*/*/  -maxdepth 0 -not -name "bam" -exec rm -rf -v {{}} \;
rm -rf -v results/trackhub    
"""
 6
 7
 8
 9
10
    shell:
        """
mkdir -p $(dirname {output})
cut -f1,2,3 {input.control_jxn} {input.star_ref_dir}/sjdbList.out.tab | sort -u > {output}
"""
19
20
21
22
23
24
25
    shell:
        """
mkdir -p $(dirname {output})
sort --parallel=8 -S4G -k9,9 <(cut -f1-6,10 {input.mate0} | awk -v OFS="\t" 'BEGIN{{s["+"]="-";s["-"]="+";}}{{print $4, $5, $5+1, s[$6], $1, $2, $2+1, s[$3], $7, 0}}') \
           <(cut -f1-6,10 {input.mate1} | awk -v OFS="\t" '{{print $1, $2, $2+1, $3, $4, $5, $5+1, $6, $7, 1}}') | \
           grep -v GL > {output}
"""
35
36
37
38
39
40
41
    shell:
        """
mkdir -p $(dirname {output})
sort -k9,9  -S4G --parallel=8 <(samtools view {input.mate0} | perl workflow/scripts/neo.pl {input.junctions} 0) \
           <(samtools view {input.mate1} | perl workflow/scripts/neo.pl {input.junctions} 1) | grep -v GL |\
            awk -v 'OFS=\\t' '{{print $1,$2,$2+1,$3,$4,$5,$5+1,$6,$7,$8}}' > {output}       
"""
51
52
53
54
55
    shell:
        """
cut -f1-4 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.donors}
cut -f5-8 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.acceptors}
"""
67
68
69
70
71
72
73
    shell:
        """
mkdir -p $(dirname {output})
sort-bed {input.junctions} | \
intersectBed -a stdin -b {input.donors} -wa -wb | cut -f5- | sort-bed - | \
intersectBed -a stdin -b {input.acceptors} -wa -wb | cut -f5- | sort -k1,1 > {output}
"""
81
82
83
84
85
    shell:
        """
mkdir -p $(dirname {output})
awk '{{n[$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8]++}}END{{for(j in n){{print j"\t"n[j]}}}}' {input} | sort > {output}
"""
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
    shell:
        """
mkdir -p $(dirname {output[0]})
cat {input} |\
cut -f 3- |\
awk -v 'OFS=\t' '$1==$4' |\
awk -v 'OFS=\t' '{{if ($2>$6){{s1=$2;s2=$3;$2=$5;$3=$6;$5=s1;$6=s2}};print}}' |\
awk -v 'OFS=\t' '$2<$6' |\
awk -v 'OFS=\t' '($3<$5)' |\
awk -v 'OFS=\t' '$6-$2<{params.max_size}' |\
awk -v 'OFS=\t' '{{n[$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6]++}}END{{for(j in n){{print j"\t"n[j]}}}}' |\
awk -v 'OFS=\t' '{{print $1,$2,$6,"id_"NR",reads="$7,$7,"+",$2,$6,"0,0,0",2,$3-$2","$6-$5,0","$5-$2}}' | sort-bed - > {output}
"""
117
118
119
120
121
122
123
124
    shell:
        """
mkdir -p $(dirname {output})
echo 'track name="{wildcards.project} contacts" visibility=2 itemRgb="On"' > {output}
python workflow/scripts/bed12_merge_color.py {input} | \
sort-bed - |\
awk -v 'OFS=\t' '$4="id_"NR",count="$5' >> {output}
"""
26
27
28
29
    shell: """
mkdir -p $(dirname {output})
cp {input} {output}
"""
SnakeMake From line 26 of rules/hub.smk
35
36
37
38
    shell: """
mkdir -p $(dirname {output})
tail -n +2 {input} > {output}
"""
SnakeMake From line 35 of rules/hub.smk
47
48
49
50
    shell: """
mkdir -p $(dirname {output})
bedClip {input.bed} {input.chromsizes} {output}
"""
56
57
58
59
    shell: """
mkdir -p $(dirname {output})
fetchChromSizes {wildcards.genome} > {output}
"""
SnakeMake From line 56 of rules/hub.smk
66
67
68
69
    shell: """
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$5=$5<999?$5:999' {input} | sort-bed - > {output}    
"""
SnakeMake From line 66 of rules/hub.smk
78
79
80
81
    shell: """
mkdir -p $(dirname {output})
bedToBigBed {input.bed} {input.chromsizes} {output}
"""		
SnakeMake From line 78 of rules/hub.smk
87
88
89
90
run:
    with open(output[0], 'w') as handle:
        for i in input:
            handle.write(hub_template.format(Path(i).stem))
SnakeMake From line 87 of rules/hub.smk
96
97
98
99
run:
    with open(output[0], 'w') as handle:
        for g in get_all_genomes():
            handle.write(genome_template.format(g))
SnakeMake From line 96 of rules/hub.smk
105
106
107
108
    shell: """
mkdir -p $(dirname {output})
cp {input} {output}
"""
SnakeMake From line 105 of rules/hub.smk
 8
 9
10
11
    shell:"""
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$3=$3+1' {input} | sort-bed - > {output}    
"""
20
21
22
23
24
25
26
27
28
    shell: """
mkdir -p $(dirname {output.bed})
cat {input.tsv} | awk -v 'OFS=\\t' '$1==$5' | \
cut -f1,2,4,7,9,10 |  \
awk -v 'OFS=\\t' -v 'name_prefix={wildcards.id}_{wildcards.jtype}' '{{print $1,$2,$4,name_prefix"_"NR,1,$3,$5,$6}}' | \
awk -v 'OFS=\\t' '{{if ($2 > $3){{t=$3;$3=$2;$2=t}}; print}}' |\
awk -v 'OFS=\\t' '$2<$3' |\
sort-bed - > {output.bed}
"""
37
38
39
40
    shell: """
mkdir -p $(dirname {output})
cat {input.contacts} | python workflow/scripts/bedj_filter_junctions.py --sjdb {input.junctions} > {output}
"""
47
48
49
50
    shell: """
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$3-$2<{wildcards.range}'  {input} > {output}
"""
59
60
61
62
    shell: """
mkdir -p $(dirname {output})    
bedops -e {input.contacts} {input.junctions} > {output}
"""
71
72
73
    shell: """
bedops -u {input.neo} {input.chim} > {output}
"""
80
81
82
83
84
    shell: """
cut -f 1,2,3 {input} | sort | uniq -c | sed -r 's/([0-9]) /\\1\\t/' |\
awk -v 'OFS=\\t' -v 'name_prefix=id' '{{print $2,$3,$4,name_prefix"_"NR,$1,"+"}}' |\
sort-bed - > {output}
"""
91
92
93
94
    shell: """
mkdir -p $(dirname {output})  
cat {input} | python workflow/scripts/bedj_to_bed12.py > {output}
"""
101
102
103
    shell: """
python workflow/scripts/bed12_merge_color.py {input} | sort-bed - | python workflow/scripts/bed_enumerate.py > {output}
"""
 9
10
11
    shell: """
genomepy install -g resources/genomes/ -p {params.provider} --annotation {wildcards.assembly}
"""
21
22
23
24
25
    shell: """
mkdir -p $(dirname {output.fasta})   
cp {input.fasta} {output.fasta}
cp {input.annotation} {output.gtf}
"""
SnakeMake From line 21 of rules/ref.smk
5
shell: "wget https://zenodo.org/record/6475703/files/RICseq_toy_data.tgz"
16
shell: "tar -xf RICseq_toy_data.tgz"
SnakeMake From line 16 of rules/test.smk
25
26
27
28
29
run:
    for f1, f2 in zip(input["pipeline_out"], input["ref_out"]):
        if not filecmp.cmp(f1, f2, shallow = False):
            print(f"Files {f1} and {f2} are different")
            raise ValueError
SnakeMake From line 25 of rules/test.smk
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import click
import distinctipy


@click.command()
@click.argument("files", nargs=-1)
def main(files):

    colors = distinctipy.get_colors(len(files), pastel_factor=0.7)
    color_strings = [
        ",".join([str(int(i * 255)) for i in rgb_frac_tuple])
        for rgb_frac_tuple in colors
    ]

    for file_idx, bed in enumerate(files):
        with open(bed) as f:
            for line in f.readlines():
                fields = line.strip().split("\t")
                fields[8] = color_strings[file_idx]
                print("\t".join(fields))


if __name__ == "__main__":
    main()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import sys


def main():
    for idx, line in enumerate(sys.stdin):
        fields = line.strip().split("\t")
        fields[3] = f"id={idx},count={fields[4]}"
        print("\t".join(fields))


if __name__ == "__main__":
    main()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import itertools
from typing import Any, List, Tuple
import sys

import click
import pipe
from intervaltree import Interval, IntervalTree


def pairwise(iterable):
    # s -> (s0,s1), (s1,s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return itertools.islice(zip(a, b), None, None, 1)


def parse_SJDB(SJDB_file, radius):
    tree_dict = {}
    with open(SJDB_file) as f:
        for line in f.readlines():
            fields = line.split()
            if len(fields) < 3:
                continue
            l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2])
            interval1, interval2  = Interval(l_start - radius, l_start + radius), Interval(l_end - radius, l_end + radius)
            if l_chrom in tree_dict:
                tree_dict[l_chrom].add(interval1)
                tree_dict[l_chrom].add(interval2)
            else:
                tree_dict[l_chrom] = IntervalTree([interval1, interval2])
    return tree_dict

def is_new_junction(l, parsed_junctions):
    fields = l.split()
    l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2])
    if l_chrom not in parsed_junctions:
        return False
    return (not parsed_junctions[l_chrom].overlaps(l_start)) and (not parsed_junctions[l_chrom].overlaps(l_end))

@click.command()
@click.option("--sjdb", help="BED3 file with junctions")     
@click.option("--radius", help="radius from known junction", type=int, default=25)  
def main(sjdb, radius):
    parsed_junctions = parse_SJDB(sjdb, radius)
    filtered_entries = (
        sys.stdin 
        | pipe.where(lambda l: is_new_junction(l, parsed_junctions)) 
    )
    for junction in filtered_entries:
        print(junction.strip())


if __name__ == "__main__":
    main()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import sys


arm_len = 20
color = "255,255,255"


def generate_view(line):
    fields = line.strip().split("\t")
    fields = fields[:6]
    fields[3] = f"{fields[3]};count={fields[4]}"
    fields += [fields[1], fields[2], color, "2"]
    fields.append(f"{arm_len},{arm_len}")
    fields.append(f"0,{int(fields[2]) - int(fields[1]) + arm_len}")
    fields[1], fields[2] = str(int(fields[1]) - arm_len), str(int(fields[2]) + arm_len)
    return "\t".join(fields)


def main():
    for junction in sys.stdin:
        print(generate_view(junction))


if __name__ == "__main__":
    main()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

if "SortedByCoordinate" in extra:
    bamprefix = "Aligned.sortedByCoord.out."
else:
    bamprefix = "Aligned.out."

outprefix = snakemake.output[0].split(bamprefix)[0]

if outprefix == os.path.dirname(snakemake.output[0]):
    outprefix += "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
ShowHide 35 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/smargasyuk/RNAcontacts
Name: rnacontacts
Version: v0.2.4
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...