This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data.

public 1yr ago Version: v0.2.4 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Prediction of RNA-RNA contacts from RIC-seq data. Developed by Sergei Margasyuk (smargasyuk@gmail.com) and Dmitri Pervouchine (pervouchine@gmail.com).

Description

This package contains a pipeline for prediction of RNA-RNA contacts from RIC-seq data (Cai et al., 2020 ).

Usage

Step 1: Obtain a copy of this workflow

Clone this repository to your local system, into the place where you want to perform the data analysis.

git clone https://github.com/smargasyuk/RNAcontacts.git
cd RNAcontacts
git checkout v0.2.4

Step 2: Configure workflow

Configure the workflow according to your needs via editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup (described in Settings ).

Step 3: Install Snakemake

Install Snakemake using conda :

# install mamba package manager if you don't have it
conda install -n base -c conda-forge mamba
conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation .

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

snakemake --use-conda --drmaa --jobs 100

See the Snakemake documentation for further details.

Test run

To make a test run, type

snakemake --use-conda --configfile config/config.test.yaml -c8 test

The rule will download a toy dataset (sample sheet, truncated fastq files, genome, and genome annotation confined to the first 100MB of chr1), unpack, and execute the pipeline. The output files in results/test_hg19/test/contacts will be compared to those provided in the archive.

Run on full RIC-seq data

Download the RIC-seq files for HeLa cell line from GEO repository GSE127188 . Download the control RNA-seq files from ENCODE consortium webpage .

The files are as follows:

 RNASeq_HeLa_total_rep1: - ENCFF000FOM.fastq.gz - ENCFF000FOV.fastq.gz RNASeq_HeLa_total_rep2: - ENCFF000FOK.fastq.gz - ENCFF000FOY.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep1: - SRR8632820_1.fastq.gz - SRR8632820_2.fastq.gz RIC-seq_HeLa_rRNA_depleted_rep2: - SRR8632821_1.fastq.gz - SRR8632821_2.fastq.gz

Put the files somewhere, fill the sample sheet (described in Settings ), and run the pipeline.

Step 5: Investigate results

The output of the pipeline consists of the following files:

results/{genome}/{project}/{sample}/contacts is the list of contacts and their respective read counts in tsv format (columns 1-3 and 4-6 are the contacting coordinates, column 7 is read count).
results/{genome}/{project}/views/global/contacts.bed is the BED12 file with contacts on the same chromosome and length less than the threshold defined in config .

These files for HeLa experiment are available at 10.5281/zenodo.6511342 .

Code Snippets

wrapper:
    "0.84.0/bio/star/index"

SnakeMake From line 15 of rules/align.smk

wrapper:
    "0.84.0/bio/star/align"

SnakeMake From line 33 of rules/align.smk

    shell:
        """
awk -v 'OFS="\t"' '$5 == 1' {input} > {output}
"""

SnakeMake From line 42 of rules/align.smk

wrapper:
    "0.84.0/bio/star/align"  

SnakeMake From line 64 of rules/align.smk

    shell: """
find results/*/*/*/  -maxdepth 0 -not -name "bam" -exec rm -rf -v {{}} \;
rm -rf -v results/trackhub    
"""

SnakeMake From line 2 of rules/clean.smk

    shell:
        """
mkdir -p $(dirname {output})
cut -f1,2,3 {input.control_jxn} {input.star_ref_dir}/sjdbList.out.tab | sort -u > {output}
"""

SnakeMake From line 6 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output})
sort --parallel=8 -S4G -k9,9 <(cut -f1-6,10 {input.mate0} | awk -v OFS="\t" 'BEGIN{{s["+"]="-";s["-"]="+";}}{{print $4, $5, $5+1, s[$6], $1, $2, $2+1, s[$3], $7, 0}}') \
           <(cut -f1-6,10 {input.mate1} | awk -v OFS="\t" '{{print $1, $2, $2+1, $3, $4, $5, $5+1, $6, $7, 1}}') | \
           grep -v GL > {output}
"""

SnakeMake From line 19 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output})
sort -k9,9  -S4G --parallel=8 <(samtools view {input.mate0} | perl workflow/scripts/neo.pl {input.junctions} 0) \
           <(samtools view {input.mate1} | perl workflow/scripts/neo.pl {input.junctions} 1) | grep -v GL |\
            awk -v 'OFS=\\t' '{{print $1,$2,$2+1,$3,$4,$5,$5+1,$6,$7,$8}}' > {output}       
"""

SnakeMake From line 35 of rules/contacts.smk

    shell:
        """
cut -f1-4 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.donors}
cut -f5-8 {input} | sort-bed - | bedops -m --range {params.radius} - > {output.acceptors}
"""

SnakeMake Bedops From line 51 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output})
sort-bed {input.junctions} | \
intersectBed -a stdin -b {input.donors} -wa -wb | cut -f5- | sort-bed - | \
intersectBed -a stdin -b {input.acceptors} -wa -wb | cut -f5- | sort -k1,1 > {output}
"""

SnakeMake BEDTools From line 67 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output})
awk '{{n[$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8]++}}END{{for(j in n){{print j"\t"n[j]}}}}' {input} | sort > {output}
"""

SnakeMake From line 81 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output[0]})
cat {input} |\
cut -f 3- |\
awk -v 'OFS=\t' '$1==$4' |\
awk -v 'OFS=\t' '{{if ($2>$6){{s1=$2;s2=$3;$2=$5;$3=$6;$5=s1;$6=s2}};print}}' |\
awk -v 'OFS=\t' '$2<$6' |\
awk -v 'OFS=\t' '($3<$5)' |\
awk -v 'OFS=\t' '$6-$2<{params.max_size}' |\
awk -v 'OFS=\t' '{{n[$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6]++}}END{{for(j in n){{print j"\t"n[j]}}}}' |\
awk -v 'OFS=\t' '{{print $1,$2,$6,"id_"NR",reads="$7,$7,"+",$2,$6,"0,0,0",2,$3-$2","$6-$5,0","$5-$2}}' | sort-bed - > {output}
"""

SnakeMake From line 96 of rules/contacts.smk

    shell:
        """
mkdir -p $(dirname {output})
echo 'track name="{wildcards.project} contacts" visibility=2 itemRgb="On"' > {output}
python workflow/scripts/bed12_merge_color.py {input} | \
sort-bed - |\
awk -v 'OFS=\t' '$4="id_"NR",count="$5' >> {output}
"""

SnakeMake From line 117 of rules/contacts.smk

    shell: """
mkdir -p $(dirname {output})
cp {input} {output}
"""

SnakeMake From line 26 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
tail -n +2 {input} > {output}
"""

SnakeMake From line 35 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
bedClip {input.bed} {input.chromsizes} {output}
"""

SnakeMake ucsc-bedclip From line 47 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
fetchChromSizes {wildcards.genome} > {output}
"""

SnakeMake From line 56 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$5=$5<999?$5:999' {input} | sort-bed - > {output}    
"""

SnakeMake From line 66 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
bedToBigBed {input.bed} {input.chromsizes} {output}
"""		

SnakeMake From line 78 of rules/hub.smk

run:
    with open(output[0], 'w') as handle:
        for i in input:
            handle.write(hub_template.format(Path(i).stem))

SnakeMake From line 87 of rules/hub.smk

run:
    with open(output[0], 'w') as handle:
        for g in get_all_genomes():
            handle.write(genome_template.format(g))

SnakeMake From line 96 of rules/hub.smk

    shell: """
mkdir -p $(dirname {output})
cp {input} {output}
"""

SnakeMake From line 105 of rules/hub.smk

    shell:"""
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$3=$3+1' {input} | sort-bed - > {output}    
"""

SnakeMake From line 8 of rules/junctions-view.smk

    shell: """
mkdir -p $(dirname {output.bed})
cat {input.tsv} | awk -v 'OFS=\\t' '$1==$5' | \
cut -f1,2,4,7,9,10 |  \
awk -v 'OFS=\\t' -v 'name_prefix={wildcards.id}_{wildcards.jtype}' '{{print $1,$2,$4,name_prefix"_"NR,1,$3,$5,$6}}' | \
awk -v 'OFS=\\t' '{{if ($2 > $3){{t=$3;$3=$2;$2=t}}; print}}' |\
awk -v 'OFS=\\t' '$2<$3' |\
sort-bed - > {output.bed}
"""

SnakeMake From line 20 of rules/junctions-view.smk

    shell: """
mkdir -p $(dirname {output})
cat {input.contacts} | python workflow/scripts/bedj_filter_junctions.py --sjdb {input.junctions} > {output}
"""

SnakeMake From line 37 of rules/junctions-view.smk

    shell: """
mkdir -p $(dirname {output})
awk -v 'OFS=\\t' '$3-$2<{wildcards.range}'  {input} > {output}
"""

SnakeMake From line 47 of rules/junctions-view.smk

    shell: """
mkdir -p $(dirname {output})    
bedops -e {input.contacts} {input.junctions} > {output}
"""

SnakeMake Bedops From line 59 of rules/junctions-view.smk

    shell: """
bedops -u {input.neo} {input.chim} > {output}
"""

SnakeMake Bedops From line 71 of rules/junctions-view.smk

    shell: """
cut -f 1,2,3 {input} | sort | uniq -c | sed -r 's/([0-9]) /\\1\\t/' |\
awk -v 'OFS=\\t' -v 'name_prefix=id' '{{print $2,$3,$4,name_prefix"_"NR,$1,"+"}}' |\
sort-bed - > {output}
"""

SnakeMake From line 80 of rules/junctions-view.smk

    shell: """
mkdir -p $(dirname {output})  
cat {input} | python workflow/scripts/bedj_to_bed12.py > {output}
"""

SnakeMake From line 91 of rules/junctions-view.smk

    shell: """
python workflow/scripts/bed12_merge_color.py {input} | sort-bed - | python workflow/scripts/bed_enumerate.py > {output}
"""

SnakeMake From line 101 of rules/junctions-view.smk

    shell: """
genomepy install -g resources/genomes/ -p {params.provider} --annotation {wildcards.assembly}
"""

SnakeMake genomepy From line 9 of rules/ref.smk

    shell: """
mkdir -p $(dirname {output.fasta})   
cp {input.fasta} {output.fasta}
cp {input.annotation} {output.gtf}
"""

SnakeMake From line 21 of rules/ref.smk

shell: "wget https://zenodo.org/record/6475703/files/RICseq_toy_data.tgz"

SnakeMake From line 5 of rules/test.smk

shell: "tar -xf RICseq_toy_data.tgz"

SnakeMake From line 16 of rules/test.smk

run:
    for f1, f2 in zip(input["pipeline_out"], input["ref_out"]):
        if not filecmp.cmp(f1, f2, shallow = False):
            print(f"Files {f1} and {f2} are different")
            raise ValueError

SnakeMake From line 25 of rules/test.smk

import click
import distinctipy


@click.command()
@click.argument("files", nargs=-1)
def main(files):

    colors = distinctipy.get_colors(len(files), pastel_factor=0.7)
    color_strings = [
        ",".join([str(int(i * 255)) for i in rgb_frac_tuple])
        for rgb_frac_tuple in colors
    ]

    for file_idx, bed in enumerate(files):
        with open(bed) as f:
            for line in f.readlines():
                fields = line.strip().split("\t")
                fields[8] = color_strings[file_idx]
                print("\t".join(fields))


if __name__ == "__main__":
    main()

Python CLICK distinctipy From line 1 of scripts/bed12_merge_color.py

import sys


def main():
    for idx, line in enumerate(sys.stdin):
        fields = line.strip().split("\t")
        fields[3] = f"id={idx},count={fields[4]}"
        print("\t".join(fields))


if __name__ == "__main__":
    main()

Python From line 1 of scripts/bed_enumerate.py

import itertools
from typing import Any, List, Tuple
import sys

import click
import pipe
from intervaltree import Interval, IntervalTree


def pairwise(iterable):
    # s -> (s0,s1), (s1,s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return itertools.islice(zip(a, b), None, None, 1)


def parse_SJDB(SJDB_file, radius):
    tree_dict = {}
    with open(SJDB_file) as f:
        for line in f.readlines():
            fields = line.split()
            if len(fields) < 3:
                continue
            l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2])
            interval1, interval2  = Interval(l_start - radius, l_start + radius), Interval(l_end - radius, l_end + radius)
            if l_chrom in tree_dict:
                tree_dict[l_chrom].add(interval1)
                tree_dict[l_chrom].add(interval2)
            else:
                tree_dict[l_chrom] = IntervalTree([interval1, interval2])
    return tree_dict

def is_new_junction(l, parsed_junctions):
    fields = l.split()
    l_chrom, l_start, l_end = fields[0], int(fields[1]), int(fields[2])
    if l_chrom not in parsed_junctions:
        return False
    return (not parsed_junctions[l_chrom].overlaps(l_start)) and (not parsed_junctions[l_chrom].overlaps(l_end))

@click.command()
@click.option("--sjdb", help="BED3 file with junctions")     
@click.option("--radius", help="radius from known junction", type=int, default=25)  
def main(sjdb, radius):
    parsed_junctions = parse_SJDB(sjdb, radius)
    filtered_entries = (
        sys.stdin 
        | pipe.where(lambda l: is_new_junction(l, parsed_junctions)) 
    )
    for junction in filtered_entries:
        print(junction.strip())


if __name__ == "__main__":
    main()

Python CLICK intervaltree From line 1 of scripts/bedj_filter_junctions.py

import sys


arm_len = 20
color = "255,255,255"


def generate_view(line):
    fields = line.strip().split("\t")
    fields = fields[:6]
    fields[3] = f"{fields[3]};count={fields[4]}"
    fields += [fields[1], fields[2], color, "2"]
    fields.append(f"{arm_len},{arm_len}")
    fields.append(f"0,{int(fields[2]) - int(fields[1]) + arm_len}")
    fields[1], fields[2] = str(int(fields[1]) - arm_len), str(int(fields[2]) + arm_len)
    return "\t".join(fields)


def main():
    for junction in sys.stdin:
        print(generate_view(junction))


if __name__ == "__main__":
    main()

Python From line 1 of scripts/bedj_to_bed12.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

if "SortedByCoordinate" in extra:
    bamprefix = "Aligned.sortedByCoord.out."
else:
    bamprefix = "Aligned.out."

outprefix = snakemake.output[0].split(bamprefix)[0]

if outprefix == os.path.dirname(snakemake.output[0]):
    outprefix += "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)

Python Snakemake STAR From line 1 of align/wrapper.py

__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)