MIRFLOWZ is a [Snakemake][snakemake] workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.
Installation
The workflow lives inside this repository and will be available for you to run after following the installation instructions laid out in this section.
Cloning the repository
Navigate to the desired path on your file system, then clone the repository and change into it with:
git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz
Dependencies
For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by [Slurm][slurm], all steps of the workflow run inside their own containers. As a consequence, running this workflow has only a few individual dependencies. These are managed by the package manager [Conda][conda], which needs to be installed on your system before proceeding.
If you do not already have [Conda][conda] installed globally on your system, we recommend that you install [Miniconda][miniconda-installation]. For faster creation of the environment (and of Conda environments in general), you can also install [Mamba][mamba] on top of Conda. In that case, replace `conda` with `mamba` in the commands below (particularly in `conda env create`).
Setting up the virtual environment
Create and activate the environment with necessary dependencies with Conda:
conda env create -f environment.yml
conda activate mirflowz
If you plan to run MIRFLOWZ via Singularity and do not already have it installed globally on your system, you must further update the Conda environment with the `environment.root.yml` file, using the command below. Mind that you must have the environment activated to update it.

conda env update -f environment.root.yml
Note that you will need to have root permissions on your system to be able to install Singularity. If you want to run MIRFLOWZ on an HPC cluster (recommended in almost all cases), ask your systems administrator about Singularity.
If you would like to contribute to MIRFLOWZ development, you may find it useful to further update your environment with the development dependencies:
conda env update -f environment.dev.yml
Testing your installation
Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflow is ready to use.
Run test workflow on local machine
Execute one of the following commands to run the test workflow on your local machine:
- Test workflow on local machine with Singularity:
bash test/test_workflow_local_with_singularity.sh
- Test workflow on local machine with Conda:
bash test/test_workflow_local_with_conda.sh
Run test workflow on a cluster via SLURM
Execute one of the following commands to run the test workflow on a [Slurm][slurm]-managed high-performance computing (HPC) cluster:
- Test workflow with Singularity:
bash test/test_workflow_slurm_with_singularity.sh
- Test workflow with Conda:
bash test/test_workflow_slurm_with_conda.sh
Rule graph
Execute the following command to generate a rule graph image for the workflow. The output will be found in the `images/` directory in the repository root.

bash test/test_rule_graph.sh

You can see the rule graph below, in the workflow description section.
Clean up test results
After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:
bash test/test_cleanup.sh
Usage
Now that your virtual environment is set up and the workflow is deployed and tested, you can go ahead and run the workflow on your samples.
Preparing inputs
We suggest keeping all input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the MIRFLOWZ root directory. This makes it easier to keep the data together, reproduce an analysis, and set up Singularity access to the files.
1. Prepare a sample table
touch path/to/your/sample/table.csv
Fill the sample table according to the following requirements:

- `sample`. This column contains the library name.
- `sample_file`. In this column, you must provide the path to the library file. The path must be relative to the working directory.
- `adapter`. This field must contain the adapter sequence in capital letters.
- `format`. In this field you must state the library format: either `fa` if providing a FASTA file, or `fastq` if the library is a FASTQ file.

You can look at `test/test_files/sample_table.csv` to see what this file must look like, or use it as a template.
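For orientation, a table satisfying the requirements above could be created as follows. All values shown here are placeholders: the library name, file path, and adapter sequence are hypothetical and must be replaced with the details of your own data.

```shell
# Hypothetical sample table (comma-separated, as the .csv extension of
# the test file suggests). Replace every value with your own data.
cat > sample_table.csv <<'EOF'
sample,sample_file,adapter,format
my_library,inputs/my_library.fastq,TGGAATTCTCGGGTGCCAAGG,fastq
EOF
```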
2. Prepare genome resources
There are 4 files you must provide:

- A `gzip`ped FASTA file containing reference sequences, typically the genome of the source/organism from which the library was extracted.
- A `gzip`ped GTF file with matching gene annotations for the reference sequences above.
MIRFLOWZ expects both the reference sequence and gene annotation files to follow [Ensembl][ensembl] style/formatting. If you obtained these files from a source other than Ensembl, you may first need to convert them to the expected style to avoid issues!
- An uncompressed GFF3 file with microRNA annotations for the reference sequences above.
MIRFLOWZ expects the miRNA annotations to follow [miRBase][mirbase] style/formatting. If you obtained this file from a source other than miRBase, you may first need to convert it to the expected style to avoid issues!
- An uncompressed tab-separated file with a mapping between the reference names used in the miRNA annotation file (column 1; "UCSC style") and in the gene annotations and reference sequence files (column 2; "Ensembl style"). Values in column 1 are expected to be unique, no header is expected, and any additional columns will be ignored. [This resource][chrMap] provides such files for various organisms, and in the expected format.
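As a concrete illustration of the expected layout, such a mapping file could look like the following. The chromosome names shown are examples for a human assembly; generate or download the file appropriate for your organism.

```shell
# Hypothetical reference-name mapping: column 1 = "UCSC style" (as used
# in the miRNA annotations), column 2 = "Ensembl style" (as used in the
# gene annotations and reference sequences). Tab-separated, no header;
# additional columns would be ignored.
printf 'chr1\t1\nchr2\t2\nchrX\tX\nchrM\tMT\n' > chr_map.tsv
```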
General note: If you want to process the genome resources before use (e.g., filtering), you can do that, but make sure the formats of any modified resource files meet the formatting expectations outlined above!
3. Prepare a configuration file
We recommend creating a copy of the configuration file template:
cp path/to/config_template.yaml path/to/config.yaml
Open the new copy in your editor of choice and adjust the configuration parameters to your liking. The template explains what each of the parameters means and how you can meaningfully adjust them.
Running the workflow
With all the required files in place, you can now run the workflow locally with the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--use-singularity \
--singularity-args "--bind ${PWD}/../" \
--printshellcmds \
--rerun-incomplete \
--verbose
NOTE: Depending on your working directory, you may not need the `--snakefile` and `--configfile` parameters. For instance, if the `Snakefile` is in the current directory, or the `workflow/` directory is beneath the current working directory, there is no need for the `--snakefile` parameter. Refer to the [Snakemake documentation][snakemakeDocu] for more information.
After successful execution of the workflow, results and logs will be found in the `results/` and `logs/` directories, respectively.
Creating a Snakemake report
Snakemake provides the option to generate a detailed HTML report on runtime statistics, workflow topology and results. If you want to create a Snakemake report, you must run the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--configfile="path/to/config.yaml" \
--report="snakemake_report.html"
NOTE: The report must be created after running the workflow, so that the runtime statistics and results are available.
Workflow description
The MIRFLOWZ workflow first processes and indexes the user-provided genome resources. Afterwards, the user-provided short-read small RNA-seq libraries are aligned separately against the genome and transcriptome. For increased fidelity, two separate aligners, [Segemehl][segemehl] and our in-house tool [Oligomap][oligomap], are used. All resulting alignments are merged such that only the best alignments of each read are kept (smallest edit distance). Finally, alignments are intersected with the user-provided, pre-processed miRNA annotation file using [`bedtools`][bedtools]. Counts are tabulated separately for reads consistent with miRNA precursors, mature miRNAs and/or isomiRs.
The schema below is a visual representation of the individual workflow steps and how they are related:
![rule-graph][rule-graph]
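The "best alignments only" step described above can be illustrated with a toy example. This is a simplified sketch using `awk` on a two-column table, not the workflow's actual filtering script; column 2 stands in for the SAM edit-distance tag.

```shell
# Toy data: read name and edit distance for each alignment.
printf 'read1\t2\nread1\t0\nread2\t1\nread2\t1\n' > aln.tsv
# Pass 1 records the minimum edit distance per read; pass 2 keeps only
# the alignments that reach that minimum.
awk 'NR==FNR {if (!($1 in m) || $2 < m[$1]) m[$1] = $2; next}
     $2 == m[$1]' aln.tsv aln.tsv > best.tsv
```

Here `best.tsv` retains the single best alignment for `read1` and both equally good alignments for `read2`.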
Contributing
MIRFLOWZ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes.
License
This project is covered by the MIT License.
Contact
Do not hesitate to contact us via [email][email] for any inquiries about MIRFLOWZ. Please mention the name of the tool.
Code Snippets
shell:
    "(zcat {input.reads} > {output.reads}) &> {log}"

shell:
    "(fastq_quality_filter \
    -v \
    -q {params.q} \
    -p {params.p} \
    -i {input.reads} \
    > {output.reads} \
    ) &> {log}"

shell:
    "(fastq_to_fasta -r -n -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(fasta_formatter -w 0 -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(cutadapt \
    -a {params.adapter} \
    --error-rate {params.error_rate} \
    --minimum-length {params.minimum_length} \
    --overlap {params.overlap} \
    --trim-n \
    --max-n {params.max_n} \
    --cores {resources.threads} \
    -o {output.reads} {input.reads}) &> {log}"

shell:
    "(fastx_collapser -i {input.reads} > {output.reads}) &> {log}"

shell:
    "(segemehl.x \
    -i {input.genome_index_segemehl} \
    -d {input.genome} \
    -t {threads} \
    -e \
    -q {input.reads} \
    -outfile {output.gmap} \
    ) &> {log}"

shell:
    "(segemehl.x \
    -i {input.transcriptome_index_segemehl} \
    -d {input.transcriptome} \
    -t {threads} \
    -q {input.reads} \
    -e \
    -outfile {output.tmap} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -r {params.max_length_reads} \
    -i {input.reads} \
    -o {output.reads} \
    ) &> {log}"

shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -r {output.report} \
    > {output.gmap} \
    ) &> {log}"

shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.gmap}) &> {log}"

shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -s \
    -r {output.report} \
    > {output.tmap} \
    ) &> {log}"

shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"

shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.tmap} \
    ) &> {log}"

shell:
    "(cat {input.gmap1} {input.gmap2} > {output.gmaps}) &> {log}"

shell:
    "(cat {input.tmap1} {input.tmap2} > {output.tmaps}) &> {log}"

shell:
    "(python {input.script} \
    {input.gmaps} \
    {params.nh} \
    {output.gmaps} \
    ) &> {log}"

shell:
    "(python {input.script} \
    {input.tmaps} \
    {params.nh} \
    {output.tmaps} \
    ) &> {log}"

shell:
    "samtools view {input.gmap} > {output.gmap}"

shell:
    "samtools view {input.tmap} > {output.tmap}"

shell:
    "(perl {input.script} \
    --in {input.tmap} \
    --exons {input.exons} \
    --out {output.genout} \
    ) &> {log}"

shell:
    "(cat {input.gmap1} {input.gmap2} > {output.catmaps}) &> {log}"

shell:
    "(cat {input.header} {input.catmaps} > {output.concatenate}) &> {log}"

shell:
    "(samtools sort -n -o {output.sort} {input.concatenate}) &> {log}"

shell:
    "(perl {input.script} \
    --print-header \
    --keep-mm \
    --in {input.sort} \
    --out {output.remove_inf} \
    ) &> {log}"

shell:
    "(python {input.script} \
    {input.sam} \
    --nh \
    > {output.sam} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(zcat {input.genome} | {input.script} > {output.genome}) &> {log}"

shell:
    "(zcat {input.gtf} | gffread -w {output.fasta} -g {input.genome}) &> {log}"

shell:
    "(cat {input.fasta} | {input.script} > {output.fasta}) &> {log}"

shell:
    "(segemehl.x -x {output.idx} -d {input.fasta}) &> {log}"

shell:
    "(segemehl.x -x {output.idx} -d {input.genome}) &> {log}"

shell:
    "(bash \
    {input.script} \
    -f {input.gtf} \
    -c 3 \
    -p exon \
    -o {output.exons} \
    ) &> {log}"

shell:
    "(Rscript \
    {input.script} \
    --gtf {input.exons} \
    -o {output.exons} \
    ) &> {log}"

shell:
    "(samtools dict -o {output.header} --uri=NA {input.genome}) &> {log}"

shell:
    "(perl {input.script} \
    {input.anno} \
    {params.column} \
    {params.delimiter} \
    {input.map_chr} \
    {output.gff} \
    ) &> {log}"

shell:
    "(samtools faidx {input.genome}) &> {log}"

shell:
    "(cut -f1,2 {input.genome} > {output.chrsize}) &> {log}"

shell:
    "(python {input.script} \
    {input.gff3} \
    --chr {input.chrsize} \
    --extension {params.extension} \
    --outdir {params.out_dir} \
    ) &> {log}"

shell:
    "(bedtools intersect \
    -wb \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.primir} \
    -bed \
    > {output.intersect} \
    ) &> {log}"

shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort -n {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(bedtools intersect \
    -wo \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.mirna} \
    -bed \
    > {output.intersect} \
    ) &> {log}"

shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"

shell:
    "(python {input.script} \
    --bed {input.intersect} \
    --sam {input.alignments} \
    --extension {params.extension} \
    > {output.sam} \
    ) &> {log}"

shell:
    "(samtools sort -t YW -O SAM {input.sam} > {output.sam}) &> {log}"

shell:
    "(python \
    {input.script} \
    {input.alignments} \
    --collapsed \
    --nh \
    --lib {params.library} \
    --outdir {params.out_dir} \
    --mir-list {params.mir_list} \
    ) &> {log}"

shell:
    "(python \
    {input.script} \
    {input.intersect} \
    --collapsed \
    --nh \
    > {output.table} \
    ) &> {log}"

shell:
    "(Rscript \
    {input.script} \
    --input_dir {params.input_dir} \
    --output_file {output.table} \
    --prefix {params.prefix} \
    --verbose \
    ) &> {log}"

shell:
    "(perl {input.script} \
    --suffix \
    --in {input.maps} \
    --out {output.maps} \
    ) &> {log}"

shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"

shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"