A fully reproducible pipeline for COPROlite and paleofeces host IDentification
coproID helps you identify the "true maker" of Illumina-sequenced coprolites and paleofeces by examining both the microbiome composition and the endogenous host DNA.
It combines the analysis of putative host ancient DNA with a machine learning prediction of the feces source based on the microbiome taxonomic composition:

- (A) First, coproID performs a comparative mapping of all reads against two (or three) target genomes (genome1, genome2, and optionally genome3) and computes a host-DNA species ratio (NormalizedRatio).
- (B) Then, coproID performs metagenomic taxonomic profiling and compares the resulting profiles to modern reference samples of the target species' metagenomes. Using machine learning, coproID estimates the host source from the metagenomic taxonomic composition (prop_microbiome).
- Finally, coproID combines A and B to predict the likely host of the metagenomic sample.
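The two lines of evidence above can be sketched in a few lines of Python. This is a simplified illustration only: the function names, the genome-size normalization, and the product-style combination are assumptions for clarity, not coproID's published formulas (see the PeerJ article for those).

```python
# Hypothetical sketch of coproID's two lines of evidence (NOT the actual implementation).

def normalized_ratio(reads_sp1, reads_sp2, genome_size_sp1, genome_size_sp2):
    """Host-DNA evidence (step A): compare aligned read counts after
    normalizing by target genome size, so a larger genome does not
    inflate its count. Returns the fraction attributable to species 1."""
    norm1 = reads_sp1 / genome_size_sp1
    norm2 = reads_sp2 / genome_size_sp2
    return norm1 / (norm1 + norm2)

def coproid_score(normalized_ratio_sp1, prop_microbiome_sp1):
    """Combine host DNA (A) with the microbiome-based source
    prediction (B) into a single per-species score."""
    return normalized_ratio_sp1 * prop_microbiome_sp1

# Example: many more human-aligned reads, and a microbiome predicted
# to be 90% human-like by the source-prediction step.
ratio = normalized_ratio(80_000, 2_000, 3.1e9, 2.4e9)
score = coproid_score(ratio, 0.9)
```

A sample with both a high host-DNA ratio and a high microbiome proportion for the same species would receive a high combined score for that species.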
The coproID pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
A detailed description of coproID can be found in the article published in PeerJ .
Quick Start

i. Install nextflow

ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

iii. Download the pipeline and test it on a minimal dataset with a single command:

nextflow run nf-core/coproid -profile test,<docker/singularity/conda/institute>

Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile institute in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
iv. Start running your own analysis!
nextflow run maxibor/coproid --genome1 'GRCh37' --genome2 'CanFam3.1' --name1 'Homo_sapiens' --name2 'Canis_familiaris' --reads '*_R{1,2}.fastq.gz' --krakendb 'path/to/minikraken_db' -profile docker
This command runs coproID to estimate whether the test samples (--reads '*_R{1,2}.fastq.gz') come from a human (--genome1 'GRCh37' --name1 'Homo_sapiens') or a dog (--genome2 'CanFam3.1' --name2 'Canis_familiaris'), and specifies the path to the minikraken database (--krakendb 'path/to/minikraken_db').
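The --reads glob matches paired-end files by their _R1/_R2 suffix and groups them by sample. A rough Python illustration of the pairing that Nextflow performs internally (the file names are hypothetical, and Nextflow's own channel factory does the real work):

```python
import re
from collections import defaultdict

# Hypothetical file listing matched by '*_R{1,2}.fastq.gz'.
files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
         "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"]

pairs = defaultdict(list)
for f in files:
    # Strip the _R1/_R2 read-number tag to recover the sample prefix.
    prefix = re.sub(r"_R[12]\.fastq\.gz$", "", f)
    pairs[prefix].append(f)

for sample, reads in sorted(pairs.items()):
    print(sample, sorted(reads))
```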
NB: The example above assumes access to iGenomes .
See usage docs for all of the available options when running the pipeline.
Documentation

The nf-core/coproid pipeline comes with documentation about the pipeline, found in the docs/ directory and at coproid.readthedocs.io
- Pipeline configuration
Credits
nf-core/coproid was written by Maxime Borry.
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines .
For further information or help, don't hesitate to get in touch on Slack (you can join with this invite ).
Citing
coproID has been published in PeerJ. The BibTeX citation is available below:
@article{borry_coproid_2020,
title = {{CoproID} predicts the source of coprolites and paleofeces using microbiome composition and host {DNA} content},
volume = {8},
issn = {2167-8359},
url = {https://peerj.com/articles/9001},
doi = {10.7717/peerj.9001},
language = {en},
urldate = {2020-04-20},
journal = {PeerJ},
author = {Borry, Maxime and Cordova, Bryan and Perri, Angela and Wibowo, Marsha and Honap, Tanvi Prasad and Ko, Jada and Yu, Jie and Britton, Kate and Girdland-Flink, Linus and Power, Robert C. and Stuijts, Ingelise and Salazar-García, Domingo C. and Hofman, Courtney and Hagan, Richard and Kagoné, Thérèse Samdapawindé and Meda, Nicolas and Carabin, Helene and Jacobson, David and Reinhard, Karl and Lewis, Cecil and Kostic, Aleksandar and Jeong, Choongwon and Herbig, Alexander and Hübner, Alexander and Warinner, Christina},
month = apr,
year = {2020},
note = {Publisher: PeerJ Inc.},
pages = {e9001}
}
Tool references

- AdapterRemoval v2: Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. https://doi.org/10.1186/s13104-016-1900-2
- FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Bowtie2: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357. https://dx.doi.org/10.1038%2Fnmeth.1923
- Samtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
- Kraken2: Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. BioRxiv, 762302. https://doi.org/10.1101/762302
- PMDTools: Skoglund, P., Northoff, B. H., Shunkov, M. V., Derevianko, A. P., Pääbo, S., Krause, J., & Jakobsson, M. (2014). Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 111(6), 2229–2234. https://doi.org/10.1073/pnas.1318934111
- DamageProfiler: Judith Neukamm (unpublished): https://doi.org/10.5281/zenodo.1064062
- Sourcepredict: Borry, M. (2019). Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification. The Journal of Open Source Software. https://doi.org/10.21105/joss.01540
- MultiQC: Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
Code Snippets

Lines 328-330 of master/main.nf:

```bash
tar xvzf $ckdb
```

Lines 414-416 of master/main.nf:

```bash
fastqc -q $reads
```

Lines 429-431 of master/main.nf:

```bash
mv $genome $outname
```

Lines 442-444 of master/main.nf:

```bash
mv $genome $outname
```

Lines 456-458 of master/main.nf:

```bash
mv $genome $outname
```

Lines 482-497 of master/main.nf:

```bash
AdapterRemoval --basename $name \\
    --file1 ${reads[0]} \\
    --file2 ${reads[1]} \\
    --trimns \\
    --trimqualities \\
    --collapse \\
    --minquality 20 \\
    --minlength 30 \\
    --output1 $out1 \\
    --output2 $out2 \\
    --outputcollapsed $col_out \\
    --threads ${task.cpus} \\
    --qualitybase ${params.phred} \\
    --settings $settings
```

Lines 518-531 of master/main.nf:

```bash
AdapterRemoval --basename $name \\
    --file1 ${reads[0]} \\
    --file2 ${reads[1]} \\
    --trimns \\
    --trimqualities \\
    --minquality 20 \\
    --minlength 30 \\
    --output1 $out1 \\
    --output2 $out2 \\
    --threads ${task.cpus} \\
    --qualitybase ${params.phred} \\
    --settings $settings
```

Lines 533-544 of master/main.nf:

```bash
AdapterRemoval --basename $name \\
    --file1 ${reads[0]} \\
    --trimns \\
    --trimqualities \\
    --minquality 20 \\
    --minlength 30 \\
    --output1 $se_out \\
    --threads ${task.cpus} \\
    --qualitybase ${params.phred} \\
    --settings $settings
```

Lines 565-567 of master/main.nf:

```bash
bowtie2-build $fasta ${bt1_index}
```

Lines 592-596 of master/main.nf:

```bash
bowtie2 -x $bt1_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 598-602 of master/main.nf:

```bash
bowtie2 -x $bt1_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 619-621 of master/main.nf:

```bash
samtools fastq -1 $out1 -2 $out2 -0 /dev/null -s /dev/null -n -F 0x900 $bam
```

Lines 624-626 of master/main.nf:

```bash
samtools fastq $bam > $out
```

Lines 643-645 of master/main.nf:

```bash
bowtie2-build $fasta ${bt2_index}
```

Lines 662-664 of master/main.nf:

```bash
bowtie2-build $fasta ${bt3_index}
```

Lines 691-695 of master/main.nf:

```bash
bowtie2 -x $bt2_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 697-701 of master/main.nf:

```bash
bowtie2 -x $bt2_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 728-732 of master/main.nf:

```bash
bowtie2 -x $bt3_index -U ${reads[0]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 734-738 of master/main.nf:

```bash
bowtie2 -x $bt3_index -1 ${reads[0]} -2 ${reads[1]} $bowtie_setting --threads ${task.cpus} > $samfile 2> $fstat
samtools view -S -b -F 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile
samtools view -S -b -f 4 -@ ${task.cpus} $samfile | samtools sort -@ ${task.cpus} -o $outfile_unalign
```

Lines 757-759 of master/main.nf:

```bash
samtools view -h -F 4 $bam1 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
```

Lines 773-775 of master/main.nf:

```bash
samtools view -h -F 4 $bam2 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
```

Lines 790-792 of master/main.nf:

```bash
samtools view -h -F 4 $bam3 | pmdtools -t ${params.pmdscore} --header $library | samtools view -Sb - > $outfile
```

Lines 815-821 of master/main.nf:

```bash
kraken2 --db ${krakendb} \\
    --threads ${task.cpus} \\
    --output $out \\
    --report $kreport \\
    --paired ${reads[0]} ${reads[1]}
```

Lines 823-828 of master/main.nf:

```bash
kraken2 --db ${krakendb} \\
    --threads ${task.cpus} \\
    --output $out \\
    --report $kreport ${reads[0]}
```

Lines 844-846 of master/main.nf:

```bash
kraken_parse.py -c ${params.minKraken} $kraken_r
```

Lines 861-863 of master/main.nf:

```bash
merge_kraken_res.py -o $out
```

Lines 881-891 of master/main.nf:

```bash
sourcepredict -di ${params.sp_dim} \\
    -kne ${params.sp_neighbors} \\
    -me ${params.sp_embed} \\
    -n ${params.sp_norm} \\
    -l ${sp_labels} \\
    -s ${sp_sources} \\
    -t ${task.cpus} \\
    -o $outfile \\
    -e $embed_out $otu_table
```

Lines 920-943 of master/main.nf:

```bash
samtools index $bam1
samtools index $bam2
samtools index $abam1
samtools index $abam2
normalizedReadCount -n $name \\
    -b1 $bam1 \\
    -ab1 $abam1 \\
    -b2 $bam2 \\
    -ab2 $abam2 \\
    -g1 $genome1 \\
    -g2 $genome2 \\
    -r1 $organame1 \\
    -r2 $organame2 \\
    -i ${params.identity} \\
    -o $outfile \\
    -ob1 $obam1 \\
    -aob1 $aobam1 \\
    -ob2 $obam2 \\
    -aob2 $aobam2 \\
    -ed1 ${params.endo1} \\
    -ed2 ${params.endo2} \\
    -p ${task.cpus}
```

Lines 946-963 of master/main.nf:

```bash
samtools index $bam1
samtools index $bam2
normalizedReadCount -n $name \\
    -b1 $bam1 \\
    -b2 $bam2 \\
    -g1 $genome1 \\
    -g2 $genome2 \\
    -r1 $organame1 \\
    -r2 $organame2 \\
    -i ${params.identity} \\
    -o $outfile \\
    -ob1 $obam1 \\
    -ob2 $obam2 \\
    -ed1 ${params.endo1} \\
    -ed2 ${params.endo2} \\
    -p ${task.cpus}
```

Lines 999-1031 of master/main.nf:

```bash
samtools index $bam1
samtools index $bam2
samtools index $bam3
samtools index $abam1
samtools index $abam2
samtools index $abam3
normalizedReadCount -n $name \\
    -b1 $bam1 \\
    -ab1 $abam1 \\
    -b2 $bam2 \\
    -ab2 $abam2 \\
    -b3 $bam3 \\
    -ab3 $abam3 \\
    -g1 $genome1 \\
    -g2 $genome2 \\
    -g3 $genome3 \\
    -r1 $organame1 \\
    -r2 $organame2 \\
    -r3 $organame3 \\
    -i ${params.identity} \\
    -o $outfile \\
    -ob1 $obam1 \\
    -aob1 $aobam1 \\
    -ob2 $obam2 \\
    -aob2 $aobam2 \\
    -ob3 $obam3 \\
    -aob3 $aobam3 \\
    -ed1 ${params.endo1} \\
    -ed2 ${params.endo2} \\
    -ed3 ${params.endo3} \\
    -p ${task.cpus}
```

Lines 1034-1057 of master/main.nf:

```bash
samtools index $bam1
samtools index $bam2
samtools index $bam3
normalizedReadCount -n $name \\
    -b1 $bam1 \\
    -b2 $bam2 \\
    -b3 $bam3 \\
    -g1 $genome1 \\
    -g2 $genome2 \\
    -g3 $genome3 \\
    -r1 $organame1 \\
    -r2 $organame2 \\
    -r3 $organame3 \\
    -i ${params.identity} \\
    -o $outfile \\
    -ob1 $obam1 \\
    -ob2 $obam2 \\
    -ob3 $obam3 \\
    -ed1 ${params.endo1} \\
    -ed2 ${params.endo2} \\
    -ed3 ${params.endo3} \\
    -p ${task.cpus}
```

Lines 1082-1088 of master/main.nf:

```bash
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
```

Lines 1107-1113 of master/main.nf:

```bash
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
```

Lines 1133-1139 of master/main.nf:

```bash
mv $align $bam_name
damageprofiler -i $bam_name -r $fasta -o tmp
mv tmp/${smp_name}/5pCtoT_freq.txt $fwd_name
mv tmp/${smp_name}/3pGtoA_freq.txt $rev_name
mv tmp/${smp_name}/dmgprof.json ${smp_name}.dmgprof.json
```

Lines 1159-1163 of master/main.nf:

```bash
ls -1 *.bpc.csv | head -1 | xargs head -1 > coproID_bp.csv
tail -q -n +2 *.bpc.csv >> coproID_bp.csv
merge_bp_sp.py -c coproID_bp.csv -s $sp -o $outfile
```

Lines 1184-1194 of master/main.nf:

```bash
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
    --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
    --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
    --TemplateExporter.exclude_input_prompt=True \\
    --TemplateExporter.exclude_output_prompt=True \\
    --ExecutePreprocessor.timeout=200 \\
    --execute \\
    --to html_embed $report
```

Lines 1211-1221 of master/main.nf:

```bash
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
    --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
    --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
    --TemplateExporter.exclude_input_prompt=True \\
    --TemplateExporter.exclude_output_prompt=True \\
    --ExecutePreprocessor.timeout=200 \\
    --execute \\
    --to html_embed $report
```

Lines 1236-1246 of master/main.nf:

```bash
echo ${workflow.manifest.version} > version.txt
jupyter nbconvert \\
    --TagRemovePreprocessor.remove_input_tags='{"remove_cell"}' \\
    --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_output"}' \\
    --TemplateExporter.exclude_input_prompt=True \\
    --TemplateExporter.exclude_output_prompt=True \\
    --ExecutePreprocessor.timeout=200 \\
    --execute \\
    --to html_embed $report
```

Lines 1266-1278 of master/main.nf:

```bash
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
multiqc --version > v_multiqc.txt
sourcepredict -h > v_sourcepredict.txt
samtools --version > v_samtools.txt
kraken2 --version > v_kraken2.txt
bowtie2 --version > v_bowtie2.txt
python --version > v_python.txt
AdapterRemoval --version 2> v_adapterremoval.txt
scrape_software_versions.py &> software_versions_mqc.yaml
```

Lines 1299-1301 of master/main.nf:

```bash
multiqc -f -d adapter_removal alignment fastqc DamageProfiler software_versions -c $multiqc_conf
```

Lines 1337-1339 of master/main.nf:

```bash
markdown_to_html.py $output_docs -o results_description.html
```