CAGE-sequencing analysis pipeline with trimming, alignment and counting of CAGE tags.
Introduction
nf-core/cageseq is a bioinformatics analysis pipeline for CAGE-seq data.
The pipeline takes raw demultiplexed FASTQ files as input and includes steps for linker and artefact trimming (cutadapt), rRNA removal (SortMeRNA), alignment to a reference genome (STAR or bowtie1), and CAGE tag counting and clustering (paraclu). Additionally, several quality control steps (FastQC, RSeQC, MultiQC) are included to allow for easy verification of the results after a run.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
Quick Start
- Install nextflow.
- Install any of Docker, Singularity or Podman for full pipeline reproducibility (please only use Conda as a last resort; see docs).
- Download the pipeline and test it on a minimal dataset with a single command:
  nextflow run nf-core/cageseq -profile test,<docker/singularity/podman/conda/institute>
  Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
- Start running your own analysis!
  nextflow run nf-core/cageseq -profile <docker/singularity/podman/conda/institute> --input '*_R1.fastq.gz' --aligner <'star'/'bowtie1'> --genome GRCh38
See usage docs for all of the available options when running the pipeline.
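As a small illustration of how the placeholders in the command above are filled in, the following sketch assembles the command line and prints it instead of running it, so it works without Nextflow installed. The helper name `build_cageseq_cmd` is hypothetical and not part of the pipeline; only the flags shown in the Quick Start are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of nf-core/cageseq): prints the command
# line rather than executing it.
build_cageseq_cmd() {
    profile="$1"   # docker, singularity, podman, conda or an institute profile
    input="$2"     # glob matching the demultiplexed FASTQ files
    aligner="$3"   # star or bowtie1, the two aligners named in the Quick Start
    genome="$4"    # reference genome key, e.g. GRCh38

    case "$aligner" in
        star|bowtie1) ;;
        *) echo "unsupported aligner: $aligner" >&2; return 1 ;;
    esac

    printf "nextflow run nf-core/cageseq -profile %s --input '%s' --aligner %s --genome %s\n" \
        "$profile" "$input" "$aligner" "$genome"
}

# Example: Docker profile, STAR aligner, GRCh38 reference.
build_cageseq_cmd docker '*_R1.fastq.gz' star GRCh38
```

Quoting the input glob keeps the shell from expanding it before Nextflow sees it, which is why the Quick Start command wraps it in single quotes.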
Pipeline Summary
By default, the pipeline currently performs the following:
- Input read QC (FastQC)
- Adapter + EcoP15 + 5'G trimming (cutadapt)
- (optional) rRNA filtering (SortMeRNA)
- Trimmed and filtered read QC (FastQC)
- Read alignment to a reference genome (STAR or bowtie1)
- CAGE tag counting and clustering (paraclu)
- CAGE tag clustering QC (RSeQC)
- Present QC and visualisation for raw read, alignment and clustering results (MultiQC)
Documentation
The nf-core/cageseq pipeline comes with documentation about the pipeline: usage and output.
Credits
nf-core/cageseq was originally written by Kevin Menden (@KevinMenden) and Tristan Kast (@TrisKast) and updated by Matthias Hörtenhuber (@mashehu).
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #cageseq channel (you can join with this invite).
Citations
If you use nf-core/cageseq for your analysis, please cite it using the following doi: 10.5281/zenodo.4095105
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
In addition, references of tools and data used in this pipeline are as follows:
Nextflow
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
Pipeline tools
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4. PMID: 19261174; PMCID: PMC2690996.
- Martin, M., 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp.10-12.
- Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genome Res. 2008 Jan;18(1):1-12. doi: 10.1101/gr.6831208. Epub 2007 Nov 21. PMID: 18032727; PMCID: PMC2134772.
- Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012 Aug 15;28(16):2184-5. doi: 10.1093/bioinformatics/bts356. Epub 2012 Jun 27. PubMed PMID: 22743226.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
- Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012 Dec 15;28(24):3211-7. doi: 10.1093/bioinformatics/bts611. Epub 2012 Oct 15. PubMed PMID: 23071270.
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PubMed PMID: 23104886; PubMed Central PMCID: PMC3530905.
- Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010 Sep 1;26(17):2204-7. doi: 10.1093/bioinformatics/btq351. Epub 2010 Jul 17. PubMed PMID: 20639541; PubMed Central PMCID: PMC2922891.
Software packaging/containerisation tools
- Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
- Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.
- da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.
- Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
Code Snippets
"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
multiqc --version > v_multiqc.txt
STAR --version > v_star.txt
bowtie --version > v_bowtie.txt
cutadapt --version > v_cutadapt.txt
samtools --version > v_samtools.txt
bedtools --version > v_bedtools.txt
read_distribution.py --version > v_rseqc.txt
sortmerna --version > v_sortmerna.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""

"""
gtf2bed.pl $gtf > ${gtf.baseName}.bed
"""

'''
cat !{fasta} | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom_sizes.txt
'''

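The awk step above derives a chromosome sizes table from the reference FASTA: for each header line it prints the record name, then sums the lengths of the sequence lines that follow. The sketch below is a standalone rewrite of the same idea for experimenting on a toy file. The function name `chrom_sizes` and the two-record FASTA are assumptions made for illustration, and the header test is anchored to the start of the line (`/^>/`), which is slightly stricter than the original match.

```shell
#!/bin/sh
# Toy stand-in for the awk step above: prints "<name><TAB><length>" per FASTA record.
chrom_sizes() {
    awk '
        /^>/ {
            if (NR > 1) print len              # close the previous record
            len = 0
            printf "%s\t", substr($0, 2, 100)  # header without ">", capped at 100 chars
            next
        }
        { len += length($0) }                  # accumulate sequence length
        END { print len }
    ' "$1"
}

# Hypothetical two-record FASTA used only for this demonstration.
demo_fasta="$(mktemp)"
printf '>chr1\nACGT\nAC\n>chr2\nGGG\n' > "$demo_fasta"
chrom_sizes "$demo_fasta"    # prints chr1<TAB>6 and chr2<TAB>3
rm -f "$demo_fasta"
```

Like the original, this keeps any description text that follows the sequence name in the header, so FASTA files with long headers may need the name trimmed before the table is passed to tools that expect bare chromosome names.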
"""
fastqc --quiet --threads $task.cpus $reads
"""

"""
mkdir star
STAR \\
    --runMode genomeGenerate \\
    --runThreadN $task.cpus \\
    --sjdbGTFfile $gtf \\
    --genomeDir star/ \\
    --genomeFastaFiles $fasta \\
    $avail_mem
"""

"""
bowtie-build --threads $task.cpus ${fasta} ${fasta.baseName}.index
"""

"""
cutadapt -a ^${params.eco_site}...${params.linker_seq} \\
    --match-read-wildcards \\
    --minimum-length 15 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""

"""
mkdir trimmed
cutadapt -g ^${params.eco_site} \\
    -e 0 \\
    --match-read-wildcards \\
    --minimum-length 20 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""

"""
mkdir trimmed
cutadapt -a ${params.linker_seq}\$ \\
    -e 0 \\
    --match-read-wildcards \\
    --minimum-length 20 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""

"""
cutadapt -g ^G \\
    -e 0 --match-read-wildcards \\
    --cores=$task.cpus \\
    -o "${name}".g_trimmed.fastq.gz \\
    $reads \\
    > "${name}".g_trimming.output.txt
"""

"""
cutadapt -a file:$artifacts_3end \\
    -g file:$artifacts_5end -e 0.1 --discard-trimmed \\
    --match-read-wildcards -m 15 -O 19 \\
    --cores=$task.cpus \\
    -o "${name}".artifacts_trimmed.fastq.gz \\
    $reads \\
    > ${reads.baseName}.artifacts_trimming.output.txt
"""

"""
sortmerna ${Refs} \\
    --reads ${reads} \\
    --num_alignments 1 \\
    --threads $task.cpus \\
    --workdir . \\
    --fastx \\
    --aligned rRNA-reads \\
    --other non-rRNA-reads \\
    -v
gzip --force < non-rRNA-reads.fastq > ${name}.fq.gz
mv rRNA-reads.log ${name}_rRNA_report.txt
"""

"""
fastqc -q $reads
"""

"""
STAR --genomeDir $index \\
    --sjdbGTFfile $gtf \\
    --readFilesIn $reads \\
    --runThreadN $task.cpus \\
    --outSAMtype BAM SortedByCoordinate \\
    --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 \\
    --seedSearchStartLmax 20 \\
    --outFilterMismatchNmax 1 \\
    --readFilesCommand zcat \\
    --runDirPerm All_RWX \\
    --outFileNamePrefix $name \\
    --outFilterMultimapNmax 1
"""

"""
bowtie --sam \\
    -m 1 \\
    --best \\
    --strata \\
    -k 1 \\
    --tryhard \\
    --threads $task.cpus \\
    --phred33-quals \\
    --chunkmbs 64 \\
    --seedmms 2 \\
    --seedlen 20 \\
    --maqerr 70 \\
    ${index} \\
    -q ${reads} \\
    --un ${reads.baseName}.unAl > ${name}.sam 2> ${name}.out
samtools sort -@ $task.cpus -o ${name}.bam ${name}.sam
"""

"""
samtools idxstats $bam_count > ${bam_count}.idxstats
"""

'''
make_ctss.sh -q 20 -i !{bam_count.baseName} -n !{name}
'''

"""
bedtools genomecov -bg -i ${name}.ctss.bed -g ${chrom_sizes} > ${name}.bedgraph
sort -k1,1 -k2,2n ${name}.bedgraph > ${name}_sorted.bedgraph
bedGraphToBigWig ${name}_sorted.bedgraph ${chrom_sizes} ${name}.ctss.bw
"""

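The `sort -k1,1 -k2,2n` call in the step above orders the bedGraph by chromosome name and then numerically by start position, which is the ordering bedGraphToBigWig expects. A minimal sketch of just that sort on toy data follows. The function name `sort_bedgraph` and the sample rows are hypothetical, and pinning `LC_ALL=C` is an added assumption (the snippet above does not set it) so that the ordering does not depend on the local collation settings.

```shell
#!/bin/sh
# Sort a bedGraph by chromosome (byte order), then by start position (numeric).
sort_bedgraph() {
    LC_ALL=C sort -k1,1 -k2,2n "$1"
}

# Hypothetical unsorted bedGraph rows: chrom, start, end, value.
demo_bg="$(mktemp)"
printf 'chr2\t5\t6\t1\nchr1\t100\t101\t3\nchr1\t9\t10\t2\n' > "$demo_bg"
sort_bedgraph "$demo_bg"    # chr1:9 first, then chr1:100, then chr2:5
rm -f "$demo_bg"
```

Without the `n` modifier on the second key, position 100 would sort before position 9, because the comparison would be done character by character.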
'''
process_ctss.sh -t !{params.tpm_cluster_threshold} !{ctss}

paraclu !{params.min_cluster} "ctss_all_pos_4Ps" > "ctss_all_pos_clustered"
paraclu !{params.min_cluster} "ctss_all_neg_4Ps" > "ctss_all_neg_clustered"

paraclu-cut "ctss_all_pos_clustered" > "ctss_all_pos_clustered_simplified"
paraclu-cut "ctss_all_neg_clustered" > "ctss_all_neg_clustered_simplified"

cat "ctss_all_pos_clustered_simplified" "ctss_all_neg_clustered_simplified" > "ctss_all_clustered_simplified"
awk -F '\t' '{print $1"\t"$3"\t"$4"\t"$1":"$3".."$4","$2"\t"$6"\t"$2}' "ctss_all_clustered_simplified" > "ctss_all_clustered_simplified.bed"
'''

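The final awk call in the step above reshapes the simplified paraclu clusters into a six-column BED-like file, with a `chrom:start..end,strand` label in the name column. The sketch below applies the same field mapping to one toy row. It assumes the usual paraclu column order (sequence name, strand, start, end, number of sites, summed value, minimum density, maximum density); the function name `paraclu_to_bed` and the sample row are hypothetical.

```shell
#!/bin/sh
# Map paraclu columns to: chrom, start, end, "chrom:start..end,strand", summed value, strand.
paraclu_to_bed() {
    awk -F '\t' '{ print $1 "\t" $3 "\t" $4 "\t" $1 ":" $3 ".." $4 "," $2 "\t" $6 "\t" $2 }' "$1"
}

# Hypothetical single cluster on the plus strand of chr1.
demo_clusters="$(mktemp)"
printf 'chr1\t+\t100\t120\t5\t42\t0.25\t10\n' > "$demo_clusters"
paraclu_to_bed "$demo_clusters"    # prints chr1 100 120 chr1:100..120,+ 42 + (tab-separated)
rm -f "$demo_clusters"
```

The label in the fourth column is what later steps use as the row identifier when the per-sample counts are combined.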
'''
intersectBed -a !{clusters} -b !{ctss} -loj -s > !{ctss}_counts_tmp
echo !{name} > !{ctss}_counts.txt

bedtools groupby -i !{ctss}_counts_tmp -g 1,2,3,4,6 -c 11 -o sum > !{ctss}_counts.bed

awk -v OFS='\t' '{if($6=="-1") $6=0; print $6 }' !{ctss}_counts.bed >> !{ctss}_counts.txt
'''

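The last line of the step above turns the grouped per-cluster sums into a single column of counts headed by the sample name, replacing a sum of `-1` with `0`. The script appears to treat `-1` as the marker for a cluster with no overlapping tags after the left outer join; that reading is an inference from the code, not something the source states. The sketch below reproduces only this final step on toy data, with a hypothetical function name and input rows.

```shell
#!/bin/sh
# Emit the sample name, then one count per cluster; a summed value of -1 becomes 0.
counts_column() {
    sample="$1"
    grouped_bed="$2"
    echo "$sample"
    awk -v OFS='\t' '{ if ($6 == "-1") $6 = 0; print $6 }' "$grouped_bed"
}

# Hypothetical grouped output: five grouping columns plus the summed count.
demo_counts="$(mktemp)"
printf 'chr1\t100\t120\tchr1:100..120,+\t+\t7\nchr1\t300\t310\tchr1:300..310,-\t-\t-1\n' > "$demo_counts"
counts_column sampleA "$demo_counts"    # prints sampleA, 7, 0 on separate lines
rm -f "$demo_counts"
```

Keeping one row per cluster, including the zero rows, is what allows the per-sample columns to be pasted side by side later without any join.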
'''
echo 'coordinates' > coordinates
awk '{ print $4}' !{clusters} >> coordinates
paste -d "\t" coordinates !{counts} >> count_table.tsv
'''

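The step above builds the final count table by pasting a column of cluster labels next to the per-sample count columns. This only works because every count file lists the clusters in the same order as the cluster BED file. A minimal sketch with two toy samples follows; the function name `build_count_table`, the file names and the values are hypothetical, and it assumes that `!{counts}` expands to one count file per sample.

```shell
#!/bin/sh
# Paste the cluster labels (BED column 4) next to one count column per sample.
build_count_table() {
    clusters_bed="$1"
    shift
    coords="$(mktemp)"
    echo 'coordinates' > "$coords"                    # header for the label column
    awk '{ print $4 }' "$clusters_bed" >> "$coords"   # one label per cluster
    paste -d '\t' "$coords" "$@"                      # remaining arguments: count files
    rm -f "$coords"
}

# Hypothetical inputs: two clusters and two samples.
workdir="$(mktemp -d)"
printf 'chr1\t100\t120\tchr1:100..120,+\t42\t+\nchr1\t300\t310\tchr1:300..310,-\t9\t-\n' > "$workdir/clusters.bed"
printf 'sampleA\n7\n0\n' > "$workdir/sampleA_counts.txt"
printf 'sampleB\n3\n5\n' > "$workdir/sampleB_counts.txt"
build_count_table "$workdir/clusters.bed" "$workdir/sampleA_counts.txt" "$workdir/sampleB_counts.txt"
rm -rf "$workdir"
```

Because `paste` aligns rows purely by line number, a count file with a missing or reordered row would silently shift values against the wrong cluster, so checking that all inputs have the same number of lines is a cheap safeguard.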
"""
bedtools bedtobam -i $clusters -g $chrom_sizes > ${clusters.baseName}.bam
read_distribution.py -i ${clusters.baseName}.bam -r $gtf > ${clusters.baseName}.read_distribution.txt
"""

"""
multiqc . -f $rtitle $rfilename $custom_config_file
"""

"""
markdown_to_html.py $output_docs -o results_description.html
"""