Introduction
nf-core/mnaseseq is a bioinformatics analysis pipeline used for DNA sequencing data obtained via micrococcal nuclease (MNase) digestion.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
Pipeline summary
- Raw read QC (FastQC)
- Adapter trimming (Trim Galore!)
- Alignment (BWA)
- Mark duplicates (picard)
- Merge alignments from multiple libraries of the same sample (picard)
  - Re-mark duplicates (picard)
  - Filtering to remove:
    - reads mapping to blacklisted regions (SAMtools, BEDTools)
    - reads that are marked as duplicates (SAMtools)
    - reads that are not marked as primary alignments (SAMtools)
    - reads that are unmapped (SAMtools)
    - reads that map to multiple locations (SAMtools)
    - reads containing > 4 mismatches (BAMTools)
    - reads that are soft-clipped (BAMTools)
    - reads that have an insert size outside the specified range (BAMTools; paired-end only)
    - reads that map to different chromosomes (Pysam; paired-end only)
    - reads that aren't in FR orientation (Pysam; paired-end only)
    - reads where only one read of the pair fails the above criteria (Pysam; paired-end only)
  - Alignment-level QC and estimation of library complexity (picard, Preseq)
  - Create normalised bigWig files scaled to 1 million mapped reads (BEDTools, bedGraphToBigWig)
  - Calculate genome-wide coverage (deepTools)
  - Call nucleosome positions and generate smoothed, normalised coverage bigWig files that can be used to generate occupancy profile plots between samples across features of interest (DANPOS2)
  - Generate gene-body meta-profile from DANPOS2 smoothed bigWig files (deepTools)
- Merge filtered alignments across replicates (picard)
  - Re-mark duplicates (picard)
  - Remove duplicate reads (SAMtools)
  - Create normalised bigWig files scaled to 1 million mapped reads (BEDTools, wigToBigWig)
  - Call nucleosome positions and generate smoothed, normalised coverage bigWig files that can be used to generate occupancy profile plots between samples across features of interest (DANPOS2)
  - Generate gene-body meta-profile from DANPOS2 smoothed bigWig files (deepTools)
- Create IGV session file containing bigWig tracks for data visualisation (IGV)
- Present QC for raw read and alignment results (MultiQC)
Quick Start
i. Install nextflow
ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)
iii. Download the pipeline and test it on a minimal dataset with a single command
nextflow run nf-core/mnaseseq -profile test,<docker/singularity/conda/institute>
Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
iv. Start running your own analysis!
nextflow run nf-core/mnaseseq -profile <docker/singularity/conda/institute> --input design.csv --genome GRCh37
See usage docs for all of the available options when running the pipeline.
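The design file supplied via --input is a comma-separated sample sheet. As a rough sketch (the column layout below assumes the group/replicate/fastq_1/fastq_2 format used by nf-core pipelines of this generation, and the sample and file names are purely illustrative; see the usage docs for the exact format expected by your version):

group,replicate,fastq_1,fastq_2
wt,1,wt_rep1_R1.fastq.gz,wt_rep1_R2.fastq.gz
wt,2,wt_rep2_R1.fastq.gz,wt_rep2_R2.fastq.gz
mnase_treated,1,treated_rep1_R1.fastq.gz,treated_rep1_R2.fastq.gz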
Documentation
The nf-core/mnaseseq pipeline comes with documentation about the pipeline, found in the docs/ directory:
- Pipeline configuration
Credits
The pipeline was originally written by The Bioinformatics & Biostatistics Group for use at The Francis Crick Institute, London.
The pipeline was developed by Harshil Patel .
Many thanks to others who have helped out along the way too, including (but not limited to): @crickbabs.
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).
Citation
If you use nf-core/mnaseseq for your analysis, please cite it using the following doi: 10.5281/zenodo.6581372.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
ReadCube: Full Access Link
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
Code Snippets
Lines 288-290 of main.nf:

"""
check_design.py $design design_reads.csv
"""

Lines 344-347 of main.nf:

"""
bwa index -a bwtsw $fasta
mkdir BWAIndex && mv ${fasta}* BWAIndex
"""

Lines 367-369 of main.nf:

"""
gtf2bed $gtf > ${gtf.baseName}.bed
"""
Lines 388-390 of main.nf:

"""
cat $bed | awk -v FS='\t' -v OFS='\t' '{ if(\$6=="+") \$3=\$2+1; else \$2=\$3-1; print \$1, \$2, \$3, \$4, \$5, \$6;}' > ${bed.baseName}.tss.bed
"""
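For illustration, this awk command collapses each interval in the gene BED file to a single-base TSS interval on the appropriate strand (the input lines below are hypothetical):

chr1  1000  5000  geneA  0  +   becomes   chr1  1000  1001  geneA  0  +
chr1  2000  8000  geneB  0  -   becomes   chr1  7999  8000  geneB  0  -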
Lines 418-422 of main.nf:

"""
samtools faidx $fasta
cut -f 1,2 ${fasta}.fai > ${fasta}.sizes
$blacklist_filter > ${fasta}.include_regions.bed
"""

Lines 456-459 of main.nf:

"""
[ ! -f ${name}.fastq.gz ] && ln -s $reads ${name}.fastq.gz
fastqc -q -t $task.cpus ${name}.fastq.gz
"""

Lines 461-466 of main.nf:

"""
[ ! -f ${name}_1.fastq.gz ] && ln -s ${reads[0]} ${name}_1.fastq.gz
[ ! -f ${name}_2.fastq.gz ] && ln -s ${reads[1]} ${name}_2.fastq.gz
fastqc -q -t $task.cpus ${name}_1.fastq.gz
fastqc -q -t $task.cpus ${name}_2.fastq.gz
"""

Lines 524-527 of main.nf:

"""
[ ! -f ${name}.fastq.gz ] && ln -s $reads ${name}.fastq.gz
trim_galore --cores $cores --fastqc --gzip $c_r1 $tpc_r1 $nextseq ${name}.fastq.gz
"""

Lines 529-533 of main.nf:

"""
[ ! -f ${name}_1.fastq.gz ] && ln -s ${reads[0]} ${name}_1.fastq.gz
[ ! -f ${name}_2.fastq.gz ] && ln -s ${reads[1]} ${name}_2.fastq.gz
trim_galore --cores $cores --paired --fastqc --gzip $c_r1 $c_r2 $tpc_r1 $tpc_r2 $nextseq ${name}_1.fastq.gz ${name}_2.fastq.gz
"""
Lines 566-574 of main.nf:

"""
bwa mem \\
    -t $task.cpus \\
    -M \\
    -R $rg \\
    ${index}/${bwa_base} \\
    $reads \\
    | samtools view -@ $task.cpus -b -h -F 0x0100 -O BAM -o ${prefix}.bam -
"""
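Note that the -F 0x0100 flag passed to samtools view excludes secondary alignments, so only primary alignments are carried forward into the initial BAM file.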
Lines 602-608 of main.nf:

"""
samtools sort -@ $task.cpus -o ${prefix}.sorted.bam -T $name $bam
samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

Lines 660-682 of main.nf:

"""
picard -Xmx${avail_mem}g MergeSamFiles \\
    ${'INPUT='+bam_files.join(' INPUT=')} \\
    OUTPUT=${name}.sorted.bam \\
    SORT_ORDER=coordinate \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${name}.sorted.bam

picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${name}.sorted.bam \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=false \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${prefix}.sorted.bam

samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

Lines 684-698 of main.nf:

"""
picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${bam_files[0]} \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=false \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${prefix}.sorted.bam
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""
Lines 737-758 of main.nf:

"""
sed 's/MIN_INSERT_SIZE/${params.min_insert}/g' <$bamtools_filter_config >bamtools_filter.json
sed -i -e 's/MAX_INSERT_SIZE/${params.max_insert}/g' bamtools_filter.json
sed -i -e 's/MAX_MISMATCH/${params.max_mismatch}/g' bamtools_filter.json

samtools view \\
    $filter_params \\
    $dup_params \\
    $multimap_params \\
    $blacklist_params \\
    -b ${bam[0]} \\
    | bamtools filter \\
        -out ${prefix}.sorted.bam \\
        -script bamtools_filter.json

samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats

$name_sort_bam
"""
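The sed commands above inject the insert-size and mismatch thresholds into the BAMTools filter configuration, so these cut-offs can be adjusted via the pipeline's --min_insert, --max_insert and --max_mismatch parameters. A purely illustrative invocation (the values shown are hypothetical, not defaults):

nextflow run nf-core/mnaseseq -profile docker --input design.csv --genome GRCh37 --min_insert 100 --max_insert 200 --max_mismatch 4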
Lines 807-815 of main.nf:

"""
bampe_rm_orphan.py ${bam[0]} ${prefix}.bam --only_fr_pairs

samtools sort -@ $task.cpus -o ${prefix}.sorted.bam -T $prefix ${prefix}.bam
samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

Lines 846-848 of main.nf:

"""
preseq lc_extrap -v -output ${prefix}.ccurve.txt -bam ${bam[0]}
"""

Lines 883-890 of main.nf:

"""
picard -Xmx${avail_mem}g CollectMultipleMetrics \\
    INPUT=${bam[0]} \\
    OUTPUT=${prefix}.CollectMultipleMetrics \\
    REFERENCE_SEQUENCE=$fasta \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
"""

Lines 925-932 of main.nf:

"""
picard -Xmx${avail_mem}g CollectMultipleMetrics \\
    INPUT=${bam[0]} \\
    OUTPUT=${prefix}.CollectMultipleMetrics \\
    REFERENCE_SEQUENCE=$fasta \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
"""
Lines 961-969 of main.nf:

"""
SCALE_FACTOR=\$(grep 'mapped (' $flagstat | awk '{print 1000000/\$1}')
echo \$SCALE_FACTOR > ${prefix}.scale_factor.txt
genomeCoverageBed -ibam ${bam[0]} -bg -scale \$SCALE_FACTOR $pe_fragment $extend | sort -T '.' -k1,1 -k2,2n > ${prefix}.bedGraph

bedGraphToBigWig ${prefix}.bedGraph $sizes ${prefix}.bigWig

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedLibrary/bigwig/"{}"\\t0,0,178" \\; > ${prefix}.bigWig.igv.txt
"""
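The scale factor here is simply 1,000,000 divided by the number of mapped reads taken from the samtools flagstat output; for example, a library with 20,000,000 mapped reads would be scaled by 0.05, which is what makes the resulting bigWig tracks comparable across samples.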
Lines 993-1004 of main.nf:

"""
plotFingerprint \\
    --bamfiles ${bam[0]} \\
    --plotFile ${prefix}.plotFingerprint.pdf \\
    $extend \\
    --labels $prefix \\
    --outRawCounts ${prefix}.plotFingerprint.raw.txt \\
    --outQualityMetrics ${prefix}.plotFingerprint.qcmetrics.txt \\
    --skipZeros \\
    --numberOfProcessors $task.cpus \\
    --numberOfSamples $params.fingerprint_bins
"""

Lines 1034-1036 of main.nf:

"""
bamToBed -i $ibam > ${prefix}.bed
"""
Lines 1068-1086 of main.nf:

"""
danpos.py dpos \\
    $bed \\
    --span 1 \\
    --smooth_width 20 \\
    --width 40 \\
    --count 1000000 \\
    --out ./result/ \\
    $pe_params
mv ./result/*/* .

wigToBigWig -clip ${prefix}.Fnor.smooth.wig $sizes ${prefix}.Fnor.smooth.bigWig

awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$2-1, \$3, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.bed
awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$4-1, \$4, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.summit.bed

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedLibrary/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bigWig.igv.txt
find * -type f -name "*.bed" -exec echo -e "bwa/mergedLibrary/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bed.igv.txt
"""
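A note on the DANPOS2 call: --count 1000000 appears to normalise each dataset to one million reads, in line with the "scaled to 1 million mapped reads" convention used for the bigWig tracks, and --smooth_width sets the smoothing window for the occupancy signal; these readings are based on DANPOS2's documented options, so consult the DANPOS2 manual for authoritative definitions.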
Lines 1110-1126 of main.nf:

"""
computeMatrix scale-regions \\
    --regionsFileName $bed \\
    --scoreFileName $bigwig \\
    --outFileName ${prefix}.computeMatrix.mat.gz \\
    --outFileNameMatrix ${prefix}.computeMatrix.vals.mat.gz \\
    --regionBodyLength 1000 \\
    --beforeRegionStartLength 3000 \\
    --afterRegionStartLength 3000 \\
    --skipZeros \\
    --samplesLabel $name \\
    --numberOfProcessors $task.cpus

plotProfile --matrixFile ${prefix}.computeMatrix.mat.gz \\
    --outFileName ${prefix}.plotProfile.pdf \\
    --outFileNameData ${prefix}.plotProfile.tab
"""

Lines 1182-1204 of main.nf:

"""
picard -Xmx${avail_mem}g MergeSamFiles \\
    ${'INPUT='+bam_files.join(' INPUT=')} \\
    OUTPUT=${name}.sorted.bam \\
    SORT_ORDER=coordinate \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${name}.sorted.bam

picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${name}.sorted.bam \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=true \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${prefix}.sorted.bam

samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

Lines 1206-1213 of main.nf:

"""
ln -s ${bams[0]} ${prefix}.sorted.bam
ln -s ${bams[1]} ${prefix}.sorted.bam.bai
touch ${prefix}.MarkDuplicates.metrics.txt
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

Lines 1238-1240 of main.nf:

"""
samtools sort -n -@ $task.cpus -o ${prefix}.bam -T $prefix ${bam[0]}
"""

Lines 1281-1289 of main.nf:

"""
SCALE_FACTOR=\$(grep 'mapped (' $flagstat | awk '{print 1000000/\$1}')
echo \$SCALE_FACTOR > ${prefix}.scale_factor.txt
genomeCoverageBed -ibam ${bam[0]} -bg -scale \$SCALE_FACTOR $pe_fragment $extend | sort -T '.' -k1,1 -k2,2n > ${prefix}.bedGraph

bedGraphToBigWig ${prefix}.bedGraph $sizes ${prefix}.bigWig

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedReplicate/bigwig/"{}"\\t0,0,178" \\; > ${prefix}.bigWig.igv.txt
"""

Lines 1319-1321 of main.nf:

"""
bamToBed -i $ibam > ${prefix}.bed
"""

Lines 1353-1371 of main.nf:

"""
danpos.py dpos \\
    $bed \\
    --span 1 \\
    --smooth_width 20 \\
    --width 40 \\
    --count 1000000 \\
    --out ./result/ \\
    $pe_params
mv ./result/*/* .

wigToBigWig -clip ${prefix}.Fnor.smooth.wig $sizes ${prefix}.Fnor.smooth.bigWig

awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$2-1, \$3, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.bed
awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$4-1, \$4, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.summit.bed

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedReplicate/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bigWig.igv.txt
find * -type f -name "*.bed" -exec echo -e "bwa/mergedReplicate/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bed.igv.txt
"""

Lines 1395-1411 of main.nf:

"""
computeMatrix scale-regions \\
    --regionsFileName $bed \\
    --scoreFileName $bigwig \\
    --outFileName ${prefix}.computeMatrix.mat.gz \\
    --outFileNameMatrix ${prefix}.computeMatrix.vals.mat.gz \\
    --regionBodyLength 1000 \\
    --beforeRegionStartLength 3000 \\
    --afterRegionStartLength 3000 \\
    --skipZeros \\
    --samplesLabel $name \\
    --numberOfProcessors $task.cpus

plotProfile --matrixFile ${prefix}.computeMatrix.mat.gz \\
    --outFileName ${prefix}.plotProfile.pdf \\
    --outFileNameData ${prefix}.plotProfile.tab
"""
Lines 1444-1447 of main.nf:

"""
cat *.txt > igv_files.txt
igv_files_to_session.py igv_session.xml igv_files.txt ../reference_genome/${fasta.getName()} --path_prefix '../'
"""
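This step gathers the *.igv.txt track files written by the earlier bigWig and DANPOS2 steps into igv_files.txt and passes them, together with the reference genome, to igv_files_to_session.py to build a single IGV session XML.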
Lines 1473-1490 of main.nf:

"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
trim_galore --version > v_trim_galore.txt
echo \$(bwa 2>&1) > v_bwa.txt
samtools --version > v_samtools.txt
bedtools --version > v_bedtools.txt
echo \$(bamtools --version 2>&1) > v_bamtools.txt
echo \$(plotFingerprint --version 2>&1) > v_deeptools.txt || true
picard MarkDuplicates --version &> v_picard.txt || true
echo \$(R --version 2>&1) > v_R.txt
python -c "import pysam; print(pysam.__version__)" > v_pysam.txt
preseq &> v_preseq.txt
danpos.py --version > v_danpos.txt
multiqc --version > v_multiqc.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
Lines 1555-1558 of main.nf:

"""
multiqc . -f $rtitle $rfilename $custom_config_file \\
    -m custom_content -m fastqc -m cutadapt -m samtools -m picard -m preseq -m deeptools
"""

Lines 1582-1584 of main.nf:

"""
markdown_to_html.py $output_docs -o results_description.html
"""