Snakemake based analysis pipeline to identify m6As from eCLIP data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Overview of meCLIP
meCLIP is method to identify m6As residues at single-nucleotide resolution using eCLIP. It uses the Snakemake workflow management system to handle analysis of the resulting sequencing reads. The meCLIP analysis pipeline requires minimal input from the user once executed and automatically generates a list of identified m6As along with a report summarizing the analysis. The following steps outline installation of Snakemake and execution of the meCLIP workflow.
Requirements
If installed as recommended, the pipeline automatically handles the installation of required software. For reference, the following programs are used by meCLIP:
Snakemake (v7.24.2)
STAR (v2.7.10b)
SeqKit (v2.3.0)
BEDOPS (v2.4.41)
Perl (v5.32.1)
FastQC (v0.11.7)
UMI-tools (v1.1.4)
cutadapt (v4.2)
samtools (v1.16.1)
Java (OpenJDK v11.0.15)
bedtools (v2.30.0)
R (v4.2.2)
"scales" and "ggplot2" packages
MultiQC (v1.14)
The meCLIP pipeline should work on all modern operating systems and has been specifically tested on Ubuntu (Bionic, Focal, Jammy), OS X (10.15), and Windows 10/11 using WSL (for a great guide on installing WSL in Windows, see [here][id2]).
The pipeline itself is not very resource intensive, although providing more cores / threads does speed up execution (for reference, running on a human sample with 11 threads takes ~6-8 hours). The biggest bottleneck is the genome indexing step, which requires at least 16Gb of memory (see Reference Genome section below if this is an issue). The aligner indices, genomes, and corresponding annotations can also be quite large (45Gb for human genome), so please ensure you have sufficient disk space to store the files before executing the pipeline.
Conda Installation
The recommended way to setup the meCLIP pipeline is via conda because it also enables any software dependencies to be easily installed.
First, download the latest version [here][id] (making sure to download the Python3 version) and then execute the following command:
bash Miniconda3-latest-Linux-x86_64.sh
Answer 'yes' to the question about whether conda shall be put into your PATH. You can check that the installation was successful by running:
conda list
For a successful installation, a list of installed packages appears.
Prepare Working Directory
The meCLIP pipeline creates a number of intermediate and temporary files as different underlying tools are executed. In order to keep these files organized and separated from other meCLIP instances, we recommend creating an independent working directory for each meCLIP experiment (where a single meCLIP experiment would consist of an 'IP' sample with a corresponding 'Input' sample). While there are several ways to accomplish this, the easiest is to simply clone the meCLIP files from GitHub into a new directory for each independent meCLIP experiment.
First, create a new directory in a reasonable place to use as your working directory.
mkdir experiment-name
Next, clone the meCLIP files from GitHub into that directory:
git clone https://github.com/ajlabuc/meCLIP.git
After the pipeline finishes running the meCLIP files themselves can be safely deleted, leaving an independent directory containing all the relevant files from the analysis of that particular experiment.
Create Conda Environment
The environment.yaml file that was downloaded from GitHub can be used to install all the software required by meCLIP into an isolated Conda environment. This ensures that the correct version of the software is utilized and any other dependencies are reconciled.
The default Conda solver is a bit slow and sometimes has issues with selecting the latest package releases. Therefore, we recommend to install Mamba as a drop-in replacement via:
conda install -c conda-forge mamba
Then, to create an environment with the required software:
mamba env create --name meCLIP --file workflow/envs/environment.yaml
Finally, to activate the meCLIP environment, execute:
conda activate meCLIP
Be sure to exit the environment once the analysis is complete.
Customize Configuration File
One of the few steps in the meCLIP analysis pipeline that actually requires opening a file is customizing the configuration. This is where you inform the pipeline where relevant files are on your system, namely the sequencing reads and reference genomes. A sample configuration file is included in the downloaded files and detailed below.
sample_name: meCLIP
resources:
threads: 11
ram: 29
reads:
ip:
ip_read1: reads/ip_read_1.fq
ip_read2: reads/ip_read_2.fq
input:
input_read1: reads/input_read_1.fq
input_read2: reads/input_read_2.fq
species:
name: homo_sapiens
assembly: GRCh38
release: 109
adapters:
dna:
randomer_size: 10
rna:
name: X1A
sequence: AUAUAGGNNNNNAGAUCGGAAGAGCGUCGUGUAG
-
sample_name: easily identifiable name of the experiment (this will be included in most of the file names and plot titles)
-
threads: defines the number of threads available to the pipeline (we recommend one less than the total number of usable CPU cores)
-
ram: defines the amount of RAM available to the pipeline (mainly used for genome indexing)
-
reads: specifies the locations / filenames of the paired sequencing read files for the IP and INPUT samples (relative to working directory)
-
species: describes the reference species to identify m6A residues in (see 'Reference Genome' below for details)
-
adapters: details the specific adapters used in the experiment (see 'Adapters' below for details)
Reference Genome
The pipeline will attempt to download the appropriate reference genome automatically based on the information provided in the species section of the configuration file. The script attempts to download the 'primary assembly' reference and reverts to the 'top level' reference if there is no primary assembly for that species. Automatic downloading is currently only supported for vertebrates listed in the Ensembl database (see [here][id3]).
-
The name should be all lowercase with underscores for spaces (see Ensembl link above for proper formatting)
-
The assembly should be the appropriate reference genome assembly for the indicated species (i.e., GRCh38 for humans)
-
The release should be the desired Ensembl release number (new releases occur every few months, see [here][id4] for info)
The pipeline will parse the Ensembl FTP site to obtain the reference genome FASTA file and associated GTF annotation. As previously mentioned, automatic downloading is currently only available for vertebrate species. To utilize other organisms (i.e., bacteria, fungi, plants, etc.), reference genomes and annotations can be manually provided using the 'manual-reference-genome.yaml' configuration file located in the 'config' directory.
resources:
threads: 11
ram: 29
genome:
fasta: <url-to-fasta-file>
gtf: <url-to-gtf-file>
-
resources: same as main configuration file, specify available memory and threads for genome indexing / annotation
-
genome: URL links (i.e., from FTP site) where the FASTA and GTF files can be downloaded (see sample config file for example)
Once the configuration file has been updated with the desired genome information, the files can be downloaded and annotated with the following command:
snakemake --snakefile workflow/rules/genome-download.smk --cores N
The script will download the defined reference files and index / annotate them so that they can be used in the main pipeline.
Adapters
The meCLIP library preparation consists of adding two separate adapters to the m6A containing fragment. The first is a (possibly indexed) RNA adapter that is ligated while the transcript is still on the immunoprecipitation beads, and the second is a ssDNA adapter ssDNA adapter that is ligated to the cDNA following reverse transcription. The ssDNA adapter contains a randomized UMI (either N5 or N10) to determine whether identical reads are unique transcripts or PCR duplicates of the same RNA fragment. Therefore, the following information should be provided in the adapter section of the configuration file:
-
The randomer should indicate the length of the random UMI sequence of the ssDNA adapter (either 5 or 10)
-
The name should be the given name of the RNA adapter used in the library prep (see meCLIP / eCLIP paper)
-
The sequence should be the complete sequence of the RNA adapter used in the library prep
The strategy outlined in the original meCLIP paper was based on the eCLIP protocol which recommend using two different RNA adapters to ensure proper balancing as on the Illumina HiSeq sequencer. Given load balancing is not as much of an issue on more recent instruments, we find it is sufficient to only use a single adapter which is what is assumed by the current version of this pipeline.
The ends of each adapter are complimentary to the Illumina TruSeq adapters and therefore the resulting reads generally have the following structure where 'Read 1' begins with the RNA adapter and 'Read 2' (corresponding to the sense strand) begins with the UMI followed by a sequence corresponding to the 5′ end of the original RNA fragment.
Read 1
[RNA Adapter (Reverse)] [Sequenced Fragment (Reverse)] [DNA Adapter]
Read 2
[DNA Adapter (Reverse)] [Sequenced Fragment] [RNA Adapter]
Execute the Analysis Pipeline
Once the location of the reads and genomes are saved into the configuration file, the analysis pipeline can then be executed simply by executing the following command within the working directory (where 'N' is the number of CPU cores available to the pipeline):
snakemake --use-conda --cores N
This will identify m6As in the IP and INPUT samples, cross-reference the two sets to remove any overlaps (to account for mutations not induced by the m6A antibody), and then annotate the resulting m6As and generate a report summarizing the analysis.
Final Remarks
The goal of the meCLIP analysis pipeline is to simplify the identification of m6As from sequencing reads by streamlining the workflow so that the vast majority of steps are automatically executed and relevant output files are automatically generated. We hope this pipeline will become a valuable tool for researchers interesting in identifying m6As.
Citation
If you use this pipeline to identify m6A sites, please cite the following manuscript:
“Identification of m6A residues at single-nucleotide resolution using eCLIP and an accessible custom analysis pipeline.”
Justin T. Roberts, Allison M. Porman, Aaron M. Johnson.
RNA. 2020 Dec 29;27(4):527-541. doi:10.1261/rna.078543.120
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | suppressMessages(if (!require(scales)) install.packages('scales', repos = "http://cran.us.r-project.org")) suppressMessages(library("scales")) ## Read in distance measure file dist <- read.delim (snakemake@input[["txt"]], header = T) title <- snakemake@params[["sample_name"]] ## Select the longest isoforms trx_len <- dist$utr5_size + dist$cds_size + dist$utr3_size dist <- dist[order(dist$gene_name, trx_len),] # sort by gene name, then transcript length dist <- dist[duplicated(dist$gene_name),] # select the longest isoform ## Rescale regions and determine scale utr5.SF <- median(dist$utr5_size, na.rm = T)/median(dist$cds_size, na.rm = T) utr3.SF <- median(dist$utr3_size, na.rm = T)/median(dist$cds_size, na.rm = T) # Assign the regions to new dataframes utr5.dist <- dist[dist$rel_location < 1, ] utr3.dist <- dist[dist$rel_location >= 2, ] cds.dist <- dist [dist$rel_location < 2 & dist$rel_location >= 1, ] # Rescale 5'UTR and 3'UTR utr5.dist$rel_location <- rescale(utr5.dist$rel_location, to = c(1-utr5.SF, 1), from = c(0,1)) utr3.dist$rel_location <- rescale(utr3.dist$rel_location, to = c(2, 2+utr3.SF), from = c(2,3)) # Combine the regions in a new dataframe and plot all.regions <- c(utr5.dist$rel_location, cds.dist$rel_location, utr3.dist$rel_location) ## A smooth density plot png(filename=snakemake@output[["png"]]) plot(density(all.regions), main=title, xaxt='n', xlab="") abline (v = 1, lty = 1, col = "red") abline (v = 2, lty = 1, col = "red") axis(at=c(0.6,1.5,2.5), labels=c("5'UTR","CDS","3'UTR"), side=1) |
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | shell: """ name="{params.name}"; \ name="${{name^}}"; \ ram="$(({params.ram}*1000000000))"; \ curl -s --list-only {params.url_fasta}/ | if grep -q "primary_assembly"; then wget -P {output} {params.url_fasta}/${{name}}.{params.assembly_fasta}.dna.primary_assembly.fa.gz; else wget -P {output} {params.url_fasta}/${{name}}.{params.assembly_fasta}.dna.toplevel.fa.gz; fi; \ wget -P {output} {params.url_gtf}/${{name}}.{params.assembly_gtf}.gtf.gz; \ gzip -dr {output}; \ echo "$(date +%b' '%d' '%H:%M:%S) Indexing genome (takes a while)..."; \ STAR --runThreadN {threads_max} --limitGenomeGenerateRAM ${{ram}} --runMode genomeGenerate --genomeSAsparseD 2 --genomeFastaFiles {output}/*.fa --sjdbGTFfile {output}/*.gtf --genomeDir {output}/index > {log} && \ echo "$(date +%b' '%d' '%H:%M:%S) Annotating genome (takes a while)..." && \ seqkit split {output}/*.fa -i --by-id-prefix "" --id-regexp "([^\s]+)" -O {output}/chroms --quiet && \ awk -F'"' -v OFS='"' '{{for(i=2; i<=NF; i+=2) gsub(";", "-", $i)}} 1' {output}/${{name}}.{params.assembly_gtf}.gtf | gtf2bed --attribute-key=gene_name - > {output}/${{name}}.{params.assembly_gtf}.geneNames.bed && \ gtfToGenePred -genePredExt {output}/${{name}}.{params.assembly_gtf}.gtf {output}/${{name}}.{params.assembly_gtf}.genePred && \ perl workflow/scripts/metaPlotR/make_annot_bed.pl --genomeDir {output}/chroms/ --genePred {output}/${{name}}.{params.assembly_gtf}.genePred > {output}/${{name}}.{params.assembly_gtf}.annotated.bed && \ echo "$(date +%b' '%d' '%H:%M:%S) Sorting annotation..." && \ sort -k1,1 -k2,2n {output}/${{name}}.{params.assembly_gtf}.annotated.bed > {output}/${{name}}.{params.assembly_gtf}.annotated.sorted.bed && \ rm {output}/${{name}}.{params.assembly_gtf}.annotated.bed && \ echo "$(date +%b' '%d' '%H:%M:%S) Calculating size of transcript regions (i.e. 5'UTR, CDS and 3'UTR)..." && \ perl workflow/scripts/metaPlotR/size_of_cds_utrs.pl --annot {output}/${{name}}.{params.assembly_gtf}.annotated.sorted.bed > {output}/${{name}}.{params.assembly_gtf}.region_sizes.txt """ |
58 59 60 61 62 63 64 65 66 67 | shell: """ mkdir -p {output} && \ rna_seq="{params.rna_seq}"; \ rna_seq=`echo -e ">seq\n${{rna_seq}}" | seqkit subseq -r 1:7 | seqkit seq -s -t RNA --rna2dna --quiet`; \ rna_seq_rc=`echo -e ">seq\n${{rna_seq}}" | seqkit seq -p -r -s -t DNA --quiet`; \ echo -e "TRUSEQ_I5\t{params.truseq_read2}\nTRUSEQ_I7\t{params.truseq_read1}\n{params.rna_name}\t${{rna_seq}}\n{params.rna_name}_RC\t${{rna_seq_rc}}" > config/adapterList.txt && \ fastqc --quiet -t {threads} -a config/adapterList.txt -o {output} {input.read2} && \ mv {output}/*.zip {output}/{params.sample_name}_pre_fastqc.zip """ |
83 84 85 86 87 88 | shell: """ randomer_size="{params.randomer_size}"; \ randomer_size=`echo "{{"${{randomer_size}}"}}"`; \ umi_tools extract --extract-method=regex --bc-pattern="^(?P<umi_1>.${{randomer_size}})" -I {input.read2} --read2-in={input.read1} --stdout={output.read2} --read2-out={output.read1} --log={log} """ |
107 108 109 110 111 112 113 114 | shell: """ rna_seq="{params.rna_seq}"; \ rna_seq_rc=`echo -e ">seq\n${{rna_seq}}" | seqkit seq -p -r -s -t RNA --rna2dna --quiet`; \ randomer_size="{params.randomer_size}"; \ randomer_size=`echo "{{"${{randomer_size}}"}}"`; \ cutadapt {params.args} -j {threads} -a "ssDNA=${{rna_seq_rc}};min_overlap=5...N${{randomer_size}}{params.truseq_read1};min_overlap=15" -A ssRNA={params.rna_seq} -A TruSeq={params.truseq_read2} -o {output.read1} -p {output.read2} {input} > {log} """ |
125 126 127 128 129 130 | shell: """ mkdir -p {output} && \ fastqc --quiet -t {threads} -a config/adapterList.txt -o {output} {input} && \ mv {output}/*.zip {output}/{params.sample_name}_post_fastqc.zip """ |
144 145 146 147 148 | shell: """ STAR --runThreadN {threads} --runMode alignReads --genomeDir resources/index --readFilesIn {input.read1} {input.read2} --outFileNamePrefix {params.prefix} {params.args} > {output.bam} && \ samtools index -@ {threads} {output.bam} """ |
160 161 162 163 164 165 166 | shell: """ umi_tools dedup -I {input.bam} --paired -S {output.bam} --log={log} && \ samtools index -@ {threads} {output.bam} && \ samtools view -hb -f 128 {output.bam} > {output.r2} && \ samtools index -@ {threads} {output.r2} """ |
176 177 | shell: "samtools mpileup -B -f resources/*.fa {input.bam} > {output.mpileup} 2> {log}" |
187 188 189 190 191 192 193 194 195 196 | shell: """ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MpileupParser {input.mpileup} {params.prefix} C T 0.025 0.5 2 0 && \ seqkit subseq --quiet --bed results/mpileup/IP/{params.prefix}_motifList_positive.bed resources/*.fa > results/mpileup/IP/motifList_positive.fa && \ seqkit subseq --quiet --bed results/mpileup/IP/{params.prefix}_motifList_negative.bed resources/*.fa > results/mpileup/IP/motifList_negative.fa && \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MotifFrequencyCalculator results/mpileup/IP/motifList_positive.fa results/mpileup/IP/{params.prefix}_MpileupParser_positive.xls [AG]AC && \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MotifFrequencyCalculator results/mpileup/IP/motifList_negative.fa results/mpileup/IP/{params.prefix}_MpileupParser_negative.xls GT[TC] && \ cat <(echo -e "Chr \tM6A \tRef \tFreq \tMutationCount \tReadCount \tMotifStart \tMotifEnd \tMotif") results/mpileup/IP/{params.prefix}_MpileupParser_positive_motifFrequency.xls results/mpileup/IP/{params.prefix}_MpileupParser_negative_motifFrequency.xls > {output.xls} && \ awk '{{if ($9!="No") print $0}}' {output.xls} > {output.xls}.temp && mv {output.xls}.temp {output.xls} """ |
208 209 210 211 212 213 | shell: """ mkdir -p {output} && \ fastqc --quiet -t {threads} -a config/adapterList.txt -o {output} {input.read2} && \ mv {output}/*.zip {output}/{params.sample_name}_pre_fastqc.zip """ |
229 230 231 232 233 234 | shell: """ randomer_size="{params.randomer_size}"; \ randomer_size=`echo "{{"${{randomer_size}}"}}"`; \ umi_tools extract --extract-method=regex --bc-pattern="^(?P<umi_1>.${{randomer_size}})" -I {input.read2} --read2-in={input.read1} --stdout={output.read2} --read2-out={output.read1} --log={log} """ |
253 254 255 256 257 258 259 260 | shell: """ rna_seq="{params.rna_seq}"; \ rna_seq_rc=`echo -e ">seq\n${{rna_seq}}" | seqkit seq -p -r -s -t RNA --rna2dna --quiet`; \ randomer_size="{params.randomer_size}"; \ randomer_size=`echo "{{"${{randomer_size}}"}}"`; \ cutadapt {params.args} -j {threads} -a "ssDNA=${{rna_seq_rc}};min_overlap=5...N${{randomer_size}}{params.truseq_read1};min_overlap=15" -A ssRNA={params.rna_seq} -A TruSeq={params.truseq_read2} -o {output.read1} -p {output.read2} {input} > {log} """ |
271 272 273 274 275 276 | shell: """ mkdir -p {output} && \ fastqc --quiet -t {threads} -a config/adapterList.txt -o {output} {input} && \ mv {output}/*.zip {output}/{params.sample_name}_post_fastqc.zip """ |
290 291 292 293 294 | shell: """ STAR --runThreadN {threads} --runMode alignReads --genomeDir resources/index --readFilesIn {input.read1} {input.read2} --outFileNamePrefix {params.prefix} {params.args} > {output.bam} && \ samtools index -@ {threads} {output.bam} """ |
306 307 308 309 310 311 312 313 314 315 | shell: """ umi_tools dedup -I {input.bam} --paired -S {output.bam} --log={log} && \ samtools index -@ {threads} {output.bam} && \ samtools view -hb -f 128 {output.bam} > {output.r2} && \ samtools index -@ {threads} {output.r2} && \ cp results/STAR/IP/*.Log.final.out results/logs && \ cp results/STAR/INPUT/*.Log.final.out results/logs && \ rm -R results/STAR/ """ |
325 326 | shell: "samtools mpileup -B -f resources/*.fa {input.bam} > {output.mpileup} 2> {log}" |
336 337 338 339 340 341 342 343 344 345 346 347 | shell: """ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MpileupParser {input.mpileup} {params.prefix} C T 0.025 0.5 2 0 && \ seqkit subseq --quiet --bed results/mpileup/INPUT/{params.prefix}_motifList_positive.bed resources/*.fa > results/mpileup/INPUT/motifList_positive.fa && \ seqkit subseq --quiet --bed results/mpileup/INPUT/{params.prefix}_motifList_negative.bed resources/*.fa > results/mpileup/INPUT/motifList_negative.fa && \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MotifFrequencyCalculator results/mpileup/INPUT/motifList_positive.fa results/mpileup/INPUT/{params.prefix}_MpileupParser_positive.xls [AG]AC && \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.MotifFrequencyCalculator results/mpileup/INPUT/motifList_negative.fa results/mpileup/INPUT/{params.prefix}_MpileupParser_negative.xls GT[TC] && \ cat <(echo -e "Chr \tM6A \tRef \tFreq \tMutationCount \tReadCount \tMotifStart \tMotifEnd \tMotif") results/mpileup/INPUT/{params.prefix}_MpileupParser_positive_motifFrequency.xls results/mpileup/INPUT/{params.prefix}_MpileupParser_negative_motifFrequency.xls > {output.xls} && \ awk '{{if ($9!="No") print $0}}' {output.xls} > {output.xls}.temp && mv {output.xls}.temp {output.xls} && \ rm -R results/mpileup/ rm -R results/umitools/ """ |
356 357 | shell: "java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.InputParser {input.ip_sample} {input.input_sample} {output.xls}" |
369 370 371 372 373 374 375 376 377 378 | shell: """ preCount=`wc -l < results/*_IP.MpileupParser_MotifFrequency.xls`; postCount=`wc -l < results/*.m6aList.xls`; finalCount=`echo "scale = 3; 100 - ((($postCount - 1) / ($preCount - 1)) * 100)" | bc`; \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.ConfidenceCategorizer {input.xls} {params.name} {params.prefix}.m6aList ${{finalCount}} && \ sort -k1,1 -k2,2n {params.prefix}.m6aList.bed > {output.bed} && \ bedtools intersect -a {output.bed} -b resources/*.geneNames.bed -wb -s -sorted > {params.prefix}.m6aList.sorted.annotated.bed && \ java -cp workflow/scripts/meclip/target/meCLIP.jar com.github.ajlabuc.meclip.BedAnnotator {params.prefix}.m6aList.sorted.annotated.bed {output.xls} && \ rm {params.prefix}.m6aList.sorted.annotated.bed && \ rm results/*_MotifFrequency.xls """ |
388 389 390 391 392 | shell: """ bedtools intersect -a {input.bed} -b resources/*.annotated.sorted.bed > results/metaPlotR/annot_m6a.sorted.bed -wo -s -sorted && \ perl workflow/scripts/metaPlotR/rel_and_abs_dist_calc.pl --bed results/metaPlotR/annot_m6a.sorted.bed --regions resources/*.region_sizes.txt > {output.txt} 2> {log} """ |
402 403 | script: "scripts/metaPlotR/visualize_metagenes.R" |
412 413 414 415 416 | shell: """ multiqc results --config config/multiqc_config.yaml -n meCLIP_report -o results/report && \ rm -R results/metaPlotR/ """ |
Support
- Future updates
Related Workflows





