MIRFLOWZ is a [Snakemake][snakemake] workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.
Installation
The workflow lives inside this repository and will be available for you to run after following the installation instructions laid out in this section.
Cloning the repository
Navigate to the desired path on your file system, then clone the repository and change into it with:
git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz
Dependencies
For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by [Slurm][slurm], all steps of the workflow run inside their own containers. As a consequence, running this workflow has only a few individual dependencies. These are managed by the package manager [Conda][conda], which needs to be installed on your system before proceeding.
If you do not already have [Conda][conda] installed globally on your system, we recommend that you install [Miniconda][miniconda-installation]. For faster creation of the environment (and of Conda environments in general), you can also install [Mamba][mamba] on top of Conda. In that case, replace `conda` with `mamba` in the commands below (particularly in `conda env create`).
Setting up the virtual environment
Create and activate the environment with necessary dependencies with Conda:
conda env create -f environment.yml
conda activate mirflowz
If you plan to run MIRFLOWZ via Singularity and do not already have it installed globally on your system, you must further update the Conda environment with the `environment.root.yml` file, using the command below. Mind that you must have the environment activated to update it.

conda env update -f environment.root.yml
Note that you will need to have root permissions on your system to be able to install Singularity. If you want to run MIRFLOWZ on an HPC cluster (recommended in almost all cases), ask your systems administrator about Singularity.
If you would like to contribute to MIRFLOWZ development, you may find it useful to further update your environment with the development dependencies:
conda env update -f environment.dev.yml
Testing your installation
Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflow is ready to use.
Run test workflow on local machine
Execute one of the following commands to run the test workflow on your local machine:
- Test workflow on local machine with Singularity:
bash test/test_workflow_local_with_singularity.sh
- Test workflow on local machine with Conda:
bash test/test_workflow_local_with_conda.sh
Run test workflow on a cluster via SLURM
Execute one of the following commands to run the test workflow on a [Slurm][slurm]-managed high-performance computing (HPC) cluster:
- Test workflow with Singularity:
bash test/test_workflow_slurm_with_singularity.sh
- Test workflow with Conda:
bash test/test_workflow_slurm_with_conda.sh
Rule graph
Execute the following command to generate a rule graph image for the workflow. The output will be found in the `images/` directory in the repository root.

bash test/test_rule_graph.sh

You can see the rule graph below, in the workflow description section.
Clean up test results
After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:
bash test/test_cleanup.sh
Usage
Now that your virtual environment is set up and the workflow is deployed and tested, you can go ahead and run the workflow on your samples.
Preparing inputs
We suggest keeping all input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the MIRFLOWZ root directory. This makes it easier to keep the data together, reproduce an analysis, and set up Singularity access to the files.
1. Prepare a sample table
touch path/to/your/sample/table.csv
Fill the sample table according to the following requirements:

- `sample`. This column contains the library name.
- `sample_file`. In this column, you must provide the path to the library file. The path must be relative to the working directory.
- `adapter`. This field must contain the adapter sequence in capital letters.
- `format`. In this field you must state the library format: either `fa` if providing a FASTA file, or `fastq` if the library is a FASTQ file.

You can look at `test/test_files/sample_table.csv` to see what this file must look like, or use it as a template.
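For orientation, a table satisfying the requirements above could be created as follows. All values shown here are placeholders: the library name, file path, and adapter sequence are hypothetical and must be replaced with the details of your own data.

```shell
# Hypothetical sample table (comma-separated, as the .csv extension of
# the test file suggests). Replace every value with your own data.
cat > sample_table.csv <<'EOF'
sample,sample_file,adapter,format
my_library,inputs/my_library.fastq,TGGAATTCTCGGGTGCCAAGG,fastq
EOF
```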
2. Prepare genome resources
There are 4 files you must provide:

- A `gzip`ped FASTA file containing reference sequences, typically the genome of the source/organism from which the library was extracted.
- A `gzip`ped GTF file with matching gene annotations for the reference sequences above.
MIRFLOWZ expects both the reference sequence and gene annotation files to follow [Ensembl][ensembl] style/formatting. If you obtained these files from a source other than Ensembl, you may first need to convert them to the expected style to avoid issues!
- An uncompressed GFF3 file with microRNA annotations for the reference sequences above.
MIRFLOWZ expects the miRNA annotations to follow [miRBase][mirbase] style/formatting. If you obtained this file from a source other than miRBase, you may first need to convert it to the expected style to avoid issues!
- An uncompressed tab-separated file with a mapping between the reference names used in the miRNA annotation file (column 1; "UCSC style") and in the gene annotations and reference sequence files (column 2; "Ensembl style"). Values in column 1 are expected to be unique, no header is expected, and any additional columns will be ignored. [This resource][chrMap] provides such files for various organisms, and in the expected format.
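As a concrete illustration of the expected layout, such a mapping file could look like the following. The chromosome names shown are examples for a human assembly; generate or download the file appropriate for your organism.

```shell
# Hypothetical reference-name mapping: column 1 = "UCSC style" (as used
# in the miRNA annotations), column 2 = "Ensembl style" (as used in the
# gene annotations and reference sequences). Tab-separated, no header;
# additional columns would be ignored.
printf 'chr1\t1\nchr2\t2\nchrX\tX\nchrM\tMT\n' > chr_map.tsv
```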
General note: If you want to process the genome resources before use (e.g., filtering), you can do that, but make sure the formats of any modified resource files meet the formatting expectations outlined above!
3. Prepare a configuration file
We recommend creating a copy of the configuration file template:
cp path/to/config_template.yaml path/to/config.yaml
Open the new copy in your editor of choice and adjust the configuration parameters to your liking. The template explains what each of the parameters means and how you can meaningfully adjust them.
Running the workflow
With all the required files in place, you can now run the workflow locally with the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--use-singularity \
--singularity-args "--bind ${PWD}/../" \
--printshellcmds \
--rerun-incomplete \
--verbose
NOTE: Depending on your working directory, you may not need the `--snakefile` and `--configfile` parameters. For instance, if the `Snakefile` is in the current directory, or the `workflow/` directory is beneath the current working directory, there is no need for the `--snakefile` parameter. Refer to the [Snakemake documentation][snakemakeDocu] for more information.
After successful execution of the workflow, results and logs will be found in the `results/` and `logs/` directories, respectively.
Creating a Snakemake report
Snakemake provides the option to generate a detailed HTML report on runtime statistics, workflow topology and results. If you want to create a Snakemake report, you must run the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--configfile="path/to/config.yaml" \
--report="snakemake_report.html"
NOTE: The report must be created after running the workflow, so that the runtime statistics and results are available.
Workflow description
The MIRFLOWZ workflow first processes and indexes the user-provided genome resources. Afterwards, the user-provided short-read small RNA-seq libraries are aligned separately against the genome and transcriptome. For increased fidelity, two separate aligners, [Segemehl][segemehl] and our in-house tool [Oligomap][oligomap], are used. All resulting alignments are merged such that only the best alignments of each read are kept (smallest edit distance). Finally, alignments are intersected with the user-provided, pre-processed miRNA annotation file using [`bedtools`][bedtools]. Counts are tabulated separately for reads consistent with miRNA precursors, mature miRNAs and/or isomiRs.
The schema below is a visual representation of the individual workflow steps and how they are related:
![rule-graph][rule-graph]
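The "best alignments only" step described above can be illustrated with a toy example. This is a simplified sketch using `awk` on a two-column table, not the workflow's actual filtering script; column 2 stands in for the SAM edit-distance tag.

```shell
# Toy data: read name and edit distance for each alignment.
printf 'read1\t2\nread1\t0\nread2\t1\nread2\t1\n' > aln.tsv
# Pass 1 records the minimum edit distance per read; pass 2 keeps only
# the alignments that reach that minimum.
awk 'NR==FNR {if (!($1 in m) || $2 < m[$1]) m[$1] = $2; next}
     $2 == m[$1]' aln.tsv aln.tsv > best.tsv
```

Here `best.tsv` retains the single best alignment for `read1` and both equally good alignments for `read2`.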
Contributing
MIRFLOWZ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes.
License
This project is covered by the MIT License.
Contact
Do not hesitate to contact us via [email][email] for any inquiries about MIRFLOWZ. Please mention the name of the tool.
Code Snippets
shell:
    "(zcat {input.reads} > {output.reads}) &> {log}"

shell:
    "(fastq_quality_filter \
    -v \
    -q {params.q} \
    -p {params.p} \
    -i {input.reads} \
    > {output.reads} \
    ) &> {log}"

shell:
    "(fastq_to_fasta -r -n -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(fasta_formatter -w 0 -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(cutadapt \
    -a {params.adapter} \
    --error-rate {params.error_rate} \
    --minimum-length {params.minimum_length} \
    --overlap {params.overlap} \
    --trim-n \
    --max-n {params.max_n} \
    --cores {resources.threads} \
    -o {output.reads} {input.reads}) &> {log}"

shell:
    "(fastx_collapser -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(segemehl.x \
    -i {input.genome_index_segemehl} \
    -d {input.genome} \
    -t {threads} \
    -e \
    -q {input.reads} \
    -outfile {output.gmap} \
    ) &> {log}"

shell:
    "(segemehl.x \
    -i {input.transcriptome_index_segemehl} \
    -d {input.transcriptome} \
    -t {threads} \
    -q {input.reads} \
    -e \
    -outfile {output.tmap} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -r {params.max_length_reads} \
    -i {input.reads} \
    -o {output.reads} \
    ) &> {log}"

shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -r {output.report} \
    > {output.gmap} \
    ) &> {log}"

shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.gmap}) &> {log}"

shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -s \
    -r {output.report} \
    > {output.tmap} \
    ) &> {log}"

shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.tmap} \
    ) &> {log}"

shell:
    "(cat {input.gmap1} {input.gmap2} > {output.gmaps}) &> {log}"

shell:
    "(cat {input.tmap1} {input.tmap2} > {output.tmaps}) &> {log}"

shell:
    "(python {input.script} \
    {input.gmaps} \
    {params.nh} \
    {output.gmaps} \
    ) &> {log}"

shell:
    "(python {input.script} \
    {input.tmaps} \
    {params.nh} \
    {output.tmaps} \
    ) &> {log}"

shell:
    "samtools view {input.gmap} > {output.gmap}"

shell:
    "samtools view {input.tmap} > {output.tmap}"

shell:
    "(perl {input.script} \
    --in {input.tmap} \
    --exons {input.exons} \
    --out {output.genout} \
    ) &> {log}"

shell:
    "(cat {input.gmap1} {input.gmap2} > {output.catmaps}) &> {log}"

shell:
    "(cat {input.header} {input.catmaps} > {output.concatenate}) &> {log}"

shell:
    "(samtools sort -n -o {output.sort} {input.concatenate}) &> {log}"

shell:
    "(perl {input.script} \
    --print-header \
    --keep-mm \
    --in {input.sort} \
    --out {output.remove_inf} \
    ) &> {log}"

shell:
    "(python {input.script} \
    {input.sam} \
    --nh \
    > {output.sam} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(zcat {input.genome} | {input.script} > {output.genome}) &> {log}"

shell:
    "(zcat {input.gtf} | gffread -w {output.fasta} -g {input.genome}) &> {log}"

shell:
    "(cat {input.fasta} | {input.script} > {output.fasta}) &> {log}"

shell:
    "(segemehl.x -x {output.idx} -d {input.fasta}) &> {log}"

shell:
    "(segemehl.x -x {output.idx} -d {input.genome}) &> {log}"

shell:
    "(bash \
    {input.script} \
    -f {input.gtf} \
    -c 3 \
    -p exon \
    -o {output.exons} \
    ) &> {log}"

shell:
    "(Rscript \
    {input.script} \
    --gtf {input.exons} \
    -o {output.exons} \
    ) &> {log}"

shell:
    "(samtools dict -o {output.header} --uri=NA {input.genome}) &> {log}"

shell:
    "(perl {input.script} \
    {input.anno} \
    {params.column} \
    {params.delimiter} \
    {input.map_chr} \
    {output.gff} \
    ) &> {log}"

shell:
    "(samtools faidx {input.genome}) &> {log}"

shell:
    "(cut -f1,2 {input.genome} > {output.chrsize}) &> {log}"

shell:
    "(python {input.script} \
    {input.gff3} \
    --chr {input.chrsize} \
    --extension {params.extension} \
    --outdir {params.out_dir} \
    ) &> {log}"

shell:
    "(bedtools intersect \
    -wb \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.primir} \
    -bed \
    > {output.intersect} \
    ) &> {log}"

shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort -n {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(bedtools intersect \
    -wo \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.mirna} \
    -bed \
    > {output.intersect} \
    ) &> {log}"

shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"

shell:
    "(python {input.script} \
    --bed {input.intersect} \
    --sam {input.alignments} \
    --extension {params.extension} \
    > {output.sam} \
    ) &> {log}"

shell:
    "(samtools sort -t YW -O SAM {input.sam} > {output.sam}) &> {log}"

shell:
    "(python \
    {input.script} \
    {input.alignments} \
    --collapsed \
    --nh \
    --lib {params.library} \
    --outdir {params.out_dir} \
    --mir-list {params.mir_list} \
    ) &> {log}"

shell:
    "(python \
    {input.script} \
    {input.intersect} \
    --collapsed \
    --nh \
    > {output.table} \
    ) &> {log}"

shell:
    "(Rscript \
    {input.script} \
    --input_dir {params.input_dir} \
    --output_file {output.table} \
    --prefix {params.prefix} \
    --verbose \
    ) &> {log}"

shell:
    "(perl {input.script} \
    --suffix \
    --in {input.maps} \
    --out {output.maps} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"