StaG Metagenomic Workflow Collaboration (mwc)
The StaG Metagenomic Workflow Collaboration (mwc) project focuses on providing a metagenomics analysis workflow suitable for microbiome research and general metagenomics analyses.
Please visit https://stag-mwc.readthedocs.io for the full documentation.
Usage
Step 0: Install conda and Snakemake
Conda and Snakemake are required to use StaG-mwc. Most people will probably want to install Miniconda and install Snakemake into their base environment. When running StaG with the --use-conda or --use-singularity flag, all dependencies are managed automatically: with conda, the required versions of all tools needed to run StaG-mwc are installed automatically. There is no need to combine the conda and singularity flags, since the Singularity images used by the workflow already contain all required dependencies.
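For example, one common route is to install Snakemake into the conda base environment. This is a sketch only; the exact channels and Snakemake version to use may differ from what the StaG-mwc documentation recommends:

    # After installing Miniconda, install Snakemake into the base environment
    conda install -n base -c conda-forge -c bioconda snakemake
    # Verify that Snakemake is available
    snakemake --version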
Step 1: Clone workflow
To use StaG-mwc, you need a local copy of the workflow repository. Start by making a clone of the repository:
git clone git@github.com:ctmrbio/stag-mwc
If you use StaG-mwc in a publication, please credit the authors by citing either the URL of this repository or the project's DOI. Also, don't forget to cite the publications of the other tools used in your workflow.
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file config/config.yaml. The most common changes include setting the paths to the input and output folders, and selecting which steps of the workflow should be included when running it.
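For illustration only, such an edit might look roughly like the sketch below. The key names here are hypothetical placeholders; use the keys actually present in config/config.yaml:

    # Hypothetical example, not the real StaG-mwc keys:
    inputdir: "path/to/fastq_files"    # where the raw reads live
    outdir: "path/to/stag_output"      # where results should be written
    qc_reads: True                     # example toggle to include a workflow step
    host_removal: False                # example toggle to skip a workflow step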
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores N
This will run the workflow locally using N cores. It is also possible to run it in a cluster environment by using one of the available profiles, or by creating your own, e.g. to run on CTMR's Gandalf cluster:
snakemake --profile profiles/ctmr_gandalf
Make sure you specify the Slurm account and partition in profiles/ctmr_gandalf/config.yaml. Refer to the official Snakemake documentation for further details on how to run Snakemake workflows on other types of cluster resources.
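As a rough sketch of what such a profile can contain (the option names below follow generic Snakemake Slurm profiles and are only an assumption about profiles/ctmr_gandalf/config.yaml; check the actual file for the real settings):

    # Hypothetical profile config.yaml for a Slurm cluster
    jobs: 100
    use-conda: true
    cluster: "sbatch --account=YOUR_ACCOUNT --partition=YOUR_PARTITION --cpus-per-task={threads}"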
Note that in all examples above, --use-conda can be replaced with --use-singularity to run in Singularity containers instead of using a locally installed conda. Read more about it under the Running section in the docs.
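For example, the local execution example above then becomes:

    snakemake --use-singularity --cores N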
Testing
A very basic continuous integration test is currently in place. It merely validates the workflow syntax by letting Snakemake build the job dependency graph with all outputs activated. Suggestions for how to improve the automated testing of StaG-mwc are very welcome!
Contributing
Refer to the contributing guidelines in CONTRIBUTING.md for instructions on how to contribute to StaG-mwc.
If you intend to modify or further develop this workflow, you are welcome to fork this repository. Please consider sharing potential improvements via a pull request.
Citing
If you find StaG-mwc useful in your research, please cite the Zenodo DOI: https://zenodo.org/badge/latestdoi/125840716
Logo attribution
Code Snippets
shell:
    """
    TEMPDIR="{resources.tmpdir}/{wildcards.sample}"
    mkdir -pv $TEMPDIR >> {log.stdout}
    mkdir -pv {params.outdir} >> {log.stdout}

    cat {input.read1} {input.read2} > $TEMPDIR/{wildcards.sample}_concat.fq.gz

    humann \
        --input $TEMPDIR/{wildcards.sample}_concat.fq.gz \
        --output $TEMPDIR \
        --nucleotide-database {params.nucleotide_db} \
        --protein-database {params.protein_db} \
        --output-basename {wildcards.sample} \
        --threads {threads} \
        --taxonomic-profile {input.taxonomic_profile} \
        {params.extra} \
        >> {log.stdout} \
        2> {log.stderr}

    cp $TEMPDIR/{wildcards.sample}*.tsv {params.outdir}
    cp $TEMPDIR/{wildcards.sample}_humann_temp/{wildcards.sample}.log {log.log}

    rm -rfv $TEMPDIR >> {log.stdout}
    """

shell:
    """
    humann_renorm_table \
        --input {input.genefamilies} \
        --output {output.genefamilies} \
        --units {params.method} \
        --mode {params.mode} \
        > {log.stdout} \
        2> {log.stderr}

    humann_renorm_table \
        --input {input.pathabundance} \
        --output {output.pathabundance} \
        --units {params.method} \
        --mode {params.mode} \
        >> {log.stdout} \
        2>> {log.stderr}
    """

shell:
    """
    humann_join_tables \
        --input {params.output_dir} \
        --output {output.genefamilies} \
        --file_name {params.genefamilies} \
        > {log.stdout} \
        2> {log.stderr}

    humann_join_tables \
        --input {params.output_dir} \
        --output {output.pathcoverage} \
        --file_name {params.pathcoverage} \
        >> {log.stdout} \
        2>> {log.stderr}

    humann_join_tables \
        --input {params.output_dir} \
        --output {output.pathabundance} \
        --file_name {params.pathabundance} \
        >> {log.stdout} \
        2>> {log.stderr}
    """
shell:
    """
    bbmap.sh \
        threads={threads} \
        minid={params.min_id} \
        path={params.db_path} \
        in1={input.read1} \
        in2={input.read2} \
        out={output.sam} \
        covstats={output.covstats} \
        rpkm={output.rpkm} \
        bamscript={output.bamscript} \
        {params.extra} \
        > {log.stdout} \
        2> {log.stderr}

    sed -i 's/_sorted//g' {output.bamscript}

    ./{output.bamscript} 2>> {log.stderr} >> {log.stdout}
    """

shell:
    """
    workflow/scripts/make_count_table.py \
        --annotation-file {params.annotations} \
        --columns {params.columns} \
        --outdir {params.outdir} \
        {input} \
        2> {log}
    """

shell:
    """
    featureCounts \
        -a {params.annotations} \
        -o {output.counts} \
        -t {params.feature_type} \
        -g {params.attribute_type} \
        -T {threads} \
        {params.extra} \
        {input.bams} \
        > {log} \
        2>> {log}

    cut \
        -f1,7- \
        {output.counts} \
        | sed '1d' \
        | sed 's|\t\w\+/bbmap/{params.dbname}/|\t|g' \
        > {output.counts_table}
    """

wrapper:
    "0.23.1/bio/bowtie2/align"

shell:
    """
    pileup.sh \
        in={input.bam} \
        out={output.covstats} \
        rpkm={output.rpkm} \
        2> {log}
    """

shell:
    """
    workflow/scripts/make_count_table.py \
        --annotation-file {params.annotations} \
        --columns {params.columns} \
        --outdir {params.outdir} \
        {input} \
        2> {log}
    """

shell:
    """
    featureCounts \
        -a {params.annotations} \
        -o {output.counts} \
        -t {params.feature_type} \
        -g {params.attribute_type} \
        -T {threads} \
        {params.extra} \
        {input.bams} \
        > {log} \
        2>> {log} \
        && \
    cut \
        -f1,7- \
        {output.counts} \
        | sed '1d' \
        | sed 's|\t\w\+/bowtie2/{params.dbname}/|\t|g' \
        > {output.counts_table}
    """
shell:
    """
    multiqc {OUTDIR} \
        --filename {output.report} \
        --force \
        2> {log}
    """

shell:
    """
    bbcountunique.sh \
        in={input} \
        out={output.txt} \
        interval={params.interval} \
        > {log.stdout} \
        2> {log.stderr}

    workflow/scripts/plot_bbcountunique.py \
        {output.txt} \
        {output.pdf} \
        >> {log.stdout} \
        2>> {log.stderr}
    """

shell:
    """
    sketch.sh \
        in={input} \
        out={output} \
        name0={wildcards.sample} \
        2> {log}
    """

shell:
    """
    comparesketch.sh \
        format=3 \
        out={output} \
        alltoall \
        {input} \
        2> {log}
    """

shell:
    """
    workflow/scripts/plot_sketch_comparison_heatmap.py \
        --outfile {output.heatmap} \
        --clustered {output.clustered} \
        {input} \
        > {log.stdout} \
        2> {log.stderr}
    """
shell:
    """
    kraken2 \
        --db {params.db} \
        --threads {threads} \
        --output {output.kraken} \
        --classified-out {params.classified} \
        --unclassified-out {params.unclassified} \
        --report {output.kreport} \
        --paired \
        --confidence {params.confidence} \
        {params.extra} \
        {input.read1} {input.read2} \
        2> {log.stderr}

    pigz \
        --processes {threads} \
        --verbose \
        --force \
        {params.fq_to_compress} \
        2>> {log.stderr}
    """

shell:
    """
    workflow/scripts/plot_proportion_kraken2.py \
        {input} \
        --histogram {output.histogram} \
        --barplot {output.barplot} \
        --table {output.txt} \
        2>&1 > {log}
    """

shell:
    """
    bowtie2 \
        --threads {threads} \
        -x {params.db_path} \
        -1 {input.read1} \
        -2 {input.read2} \
        -S {output.sam} \
        2> {log.stderr}
    """

shell:
    """
    samtools view \
        -b \
        --threads {threads} \
        {input.sam} \
        -o {output.bam} \
        2> {log.stderr}
    """

shell:
    """
    samtools view \
        -b \
        -f 13 \
        -F 256 \
        --threads {threads} \
        {input.bam2} \
        -o {output.unmapped} \
        2> {log.stderr}
    """

shell:
    """
    samtools sort \
        -n \
        -m 5G \
        --threads {threads} \
        {input.pairs} \
        -o {output.sorted} \
        2> {log.stderr}
    """

shell:
    """
    samtools fastq \
        --threads {threads} \
        -1 {output.read1} \
        -2 {output.read2} \
        -0 /dev/null \
        -s /dev/null \
        -n \
        {input.sorted_pairs} \
        2> {log.stderr}
    """

shell:
    """
    ln -sv $(readlink -f {input.read1}) {output.read1} >> {log.stderr}
    ln -sv $(readlink -f {input.read2}) {output.read2} >> {log.stderr}
    """

shell:
    """
    workflow/scripts/preprocessing_summary.py \
        {params.fastp_arg} \
        {params.kraken2_arg} \
        {params.bowtie2_arg} \
        --output-table {output.table} \
        > {log.stdout}
    """

shell:
    """
    fastp \
        --in1 {input.read1} \
        --in2 {input.read2} \
        --out1 {output.read1} \
        --out2 {output.read2} \
        --json {output.json} \
        --html {output.html} \
        --thread {threads} \
        {params.extra} \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    ln -sv $(readlink -f {input.read1}) {output.read1} >> {log.stderr}
    ln -sv $(readlink -f {input.read2}) {output.read2} >> {log.stderr}
    """
shell:
    """
    kaiju \
        -z {threads} \
        -t {params.nodes} \
        -f {params.db} \
        -i {input.read1} \
        -j {input.read2} \
        -o {output.kaiju} > {log}
    """

shell:
    """
    kaiju2krona \
        -t {params.nodes} \
        -n {params.names} \
        -i {input.kaiju} \
        -o {output.krona} \
        -u
    """

shell:
    """
    ktImportText \
        -o {output.krona_html} \
        {input}
    """

shell:
    """
    kaiju2table \
        -t {params.nodes} \
        -n {params.names} \
        -r {wildcards.level} \
        -l superkingdom,phylum,class,order,family,genus,species \
        -o {output} \
        {input.kaiju} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --feature-column {params.feature_column} \
        --value-column {params.value_column} \
        --outfile {output} \
        {input} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/area_plot.py \
        --table {input} \
        --output {output} \
        --mode kaiju \
        2>&1 > {log}
    """
shell:
    """
    kraken2 \
        --db {params.db} \
        --confidence {params.confidence} \
        --minimum-hit-groups {params.minimum_hit_groups} \
        --threads {threads} \
        --output {output.kraken} \
        --report {output.kreport} \
        --use-names \
        --paired \
        {input.read1} {input.read2} \
        {params.extra} \
        2> {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/kreport2mpa.py \
        --report-file {input.kreport} \
        --output {output.txt} \
        --display-header \
        2>&1 > {log}

    sed --in-place 's|{input.kreport}|taxon_name\treads|g' {output.txt}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --outfile {output.table} \
        --value-column {params.value_column} \
        --feature-column '{params.feature_column}' \
        {input.txt} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/area_plot.py \
        --table {input} \
        --output {output} \
        --mode kraken2 \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/combine_kreports.py \
        --output {output} \
        --report-files {input.kreports} \
        2>> {log} \
        >> {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/kreport2krona.py \
        --report-file {input.kreport} \
        --output {output} \
        2> {log}
    """

shell:
    """
    ktImportText \
        -o {output.krona_html} \
        {input}
    """
shell:
    """
    est_abundance.py \
        --input {input.kreport} \
        --kmer_distr {params.kmer_distrib} \
        --output {output.bracken} \
        --out-report {output.bracken_kreport} \
        --level S \
        --thresh {params.thresh} \
        2>&1 > {log}
    """

shell:
    """
    est_abundance.py \
        --input {input.kreport} \
        --kmer_distr {params.kmer_distrib} \
        --output {output.bracken} \
        --level {wildcards.level} \
        --thresh {params.thresh} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/kreport2mpa.py \
        --report-file {input.kreport} \
        --output {output.txt} \
        --display-header \
        2>&1 > {log}

    sed --in-place 's|{input.kreport}|taxon_name\treads|g' {output.txt}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --outfile {output.table} \
        --value-column {params.value_column} \
        --feature-column {params.feature_column} \
        {input.txt} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/area_plot.py \
        --table {input} \
        --output {output} \
        --mode kraken2 \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --outfile {output.table} \
        --value-column {params.value_column} \
        --feature-column {params.feature_column} \
        {input.bracken} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/kreport2krona.py \
        --report-file {input.bracken_kreport} \
        --output {output.bracken_krona} \
        2>&1 > {log}
    """

shell:
    """
    ktImportText \
        -o {output.krona_html} \
        {input}
    """

shell:
    """
    {params.filter_bracken} \
        --input-file {input.bracken} \
        --output {output.filtered} \
        {params.include} \
        {params.exclude} \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --outfile {output.table} \
        --value-column {params.value_column} \
        --feature-column {params.feature_column} \
        {input.bracken} \
        2>&1 > {log}
    """
shell:
    """
    fuse.sh \
        in1={input.read1} \
        in2={input.read2} \
        out={output.fasta} \
        pad=1 \
        fusepairs=t \
        2> {log}
    """

shell:
    """
    krakenuniq \
        --db {params.db} \
        --threads {threads} \
        --output {output.kraken} \
        --report-file {output.kreport} \
        --preload-size {params.preload_size} \
        {input.fasta} \
        {params.extra} \
        2> {log}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --feature-column rank,taxName \
        --value-column taxReads \
        --outfile {output.combined} \
        --skiplines 2 \
        {input.kreports} \
        2> {log}
    """

shell:
    """
    workflow/scripts/KrakenTools/kreport2mpa.py \
        --report-file {input.kreport} \
        --output {output.txt} \
        --display-header \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    workflow/scripts/join_tables.py \
        --outfile {output.table} \
        --value-column {params.value_column} \
        --feature-column '{params.feature_column}' \
        {input.txt} \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    awk -v OFS='\\t' '{{ gsub("\\\\|","\\t",$1); print $2,$1; }}' {input.kreport} \
        > {output} \
        2> {log.stderr}
    """

shell:
    """
    ktImportText \
        -o {output.krona_html} \
        {input}
    """
shell:
    """
    metaphlan \
        --input_type fastq \
        --nproc 10 \
        --sample_id {wildcards.sample} \
        --samout {output.sam_out} \
        --bowtie2out {output.bt2_out} \
        --bowtie2db {params.bt2_db_dir} \
        --index {params.bt2_index} \
        {input.read1},{input.read2} \
        {output.mpa_out} \
        {params.extra} \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    set +o pipefail  # Small samples can produce empty output files failing the pipeline
    sed '/#/d' {input.mpa_out} \
        | grep -E "s__|unclassified" \
        | cut -f1,3 \
        | awk '{{print $2,"\t",$1}}' \
        | sed 's/|\w__/\t/g' \
        | sed 's/k__//' \
        > {output.krona} \
        2> {log}
    """

shell:
    """
    merge_metaphlan_tables.py {input} > {output.txt} 2> {log}
    sed --in-place 's/\.metaphlan//g' {output.txt}
    """

shell:
    """
    workflow/scripts/area_plot.py \
        --table {input} \
        --output {output} \
        --mode metaphlan4 \
        2>&1 > {log}
    """

shell:
    """
    workflow/scripts/plot_metaphlan_heatmap.py \
        --outfile-prefix {params.outfile_prefix} \
        --level {wildcards.level} \
        --topN {wildcards.topN} \
        --pseudocount {params.pseudocount} \
        --colormap {params.colormap} \
        --method {params.method} \
        --metric {params.metric} \
        --force \
        {input} \
        2> {log}
    """

shell:
    """
    ktImportText \
        -o {output.html_samples} \
        {input} \
        > {log}

    ktImportText \
        -o {output.html_all} \
        -c \
        {input} \
        >> {log}
    """

shell:
    """
    set +o pipefail
    sed '/#.*/d' {input.mpa_combined} | cut -f 1- | head -n1 | tee {output.species} {output.genus} {output.family} {output.order} > /dev/null
    sed '/#.*/d' {input.mpa_combined} | cut -f 1- | grep s__ | sed 's/^.*s__/s__/g' >> {output.species}
    sed '/#.*/d' {input.mpa_combined} | cut -f 1- | grep g__ | sed 's/^.*s__.*//g' | grep g__ | sed 's/^.*g__/g__/g' >> {output.genus}
    sed '/#.*/d' {input.mpa_combined} | cut -f 1- | grep f__ | sed 's/^.*g__.*//g' | grep f__ | sed 's/^.*f__/f__/g' >> {output.family}
    sed '/#.*/d' {input.mpa_combined} | cut -f 1- | grep o__ | sed 's/^.*f__.*//g' | grep o__ | sed 's/^.*o__/o__/g' >> {output.order}
    """
shell:
    """
    sample2markers.py \
        -i {input.sam} \
        -o {params.output_dir} \
        -n 8 \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    strainphlan \
        -s {input.consensus_markers} \
        --print_clades_only \
        -d {params.database} \
        -o {params.out_dir} \
        -n {threads} \
        > {log.stdout} \
        2> {log.stderr}

    cd {params.out_dir} && ln -s ../logs/strainphlan/available_clades.txt
    """

shell:
    """
    extract_markers.py \
        -c {params.clade} \
        -o {params.out_dir} \
        -d {params.database} \
        > {log.stdout} \
        2> {log.stderr}
    """

shell:
    """
    echo "please compare your clade_of_interest to list of available clades in available_clades.txt" > {log.stderr}

    strainphlan \
        -s {input.consensus_markers} \
        -m {input.reference_markers} \
        {params.extra} \
        -d {params.database} \
        -o {params.out_dir} \
        -n {threads} \
        -c {params.clade} \
        --phylophlan_mode accurate \
        --mutation_rates \
        > {log.stdout} \
        2>> {log.stderr}
    """
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 | from sys import argv, exit import argparse import warnings from matplotlib import rcParams import matplotlib as mpl import matplotlib.pyplot as plt import numpy as np import pandas as pd """ Generates a pretty areaplot from a collapsed feature table. """ __author__ = 'JW Debelius' __date__ = '2020-02' __version__ = "0.2" # Sets up the matplotlib parameters so that we can save to be edited in # illustator if a direct conversion is required. Because it just makes life # better rcParams['pdf.fonttype'] = 42 rcParams['ps.fonttype'] = 42 # Sets up the order of colors to be used in joint plots. 
colors_order = ['Reds', 'Blues', 'Greens', 'Purples', "Oranges", 'Greys'] over9 = {'Paired', 'Paired_r', 'Set3', 'Set3_r'} over8 = over9 | {'Set1', "Pastel1"} mode_dict = { 'kaiju': { 'tax_delim': ';', 'multi_level': True, 'tax_col': 'taxon_name', 'table_drop': [], 'skip_rows': 0, }, 'kraken2': { 'tax_delim': '|', 'multi_level': True, 'tax_col': 'taxon_name', 'table_drop': [], 'skip_rows': 0, }, 'metaphlan': { 'tax_delim': '|', 'multi_level': True, 'tax_col': 'clade_name', 'table_drop': ['NCBI_tax_id'], 'skip_rows': 1, }, 'metaphlan4': { 'tax_delim': '|', 'multi_level': True, 'tax_col': 'clade_name', 'table_drop': [], 'skip_rows': 1, }, 'marker': { 'tax_delim': ';', 'multi_level': False, 'tax_col': 'taxonomy', 'table_drop': ['sequence'], 'skip_rows': 0, }, } def extract_label_array(table, tax_col, tax_delim='|'): """ Converts delimited taxonomy strings into a working table Parameters ---------- table : DataFrame A DataFrame with observation on the rows (biom-style table) with `tax_col` as one of its columns. tax_col : str The column in `table` containing the taxonomy information tax_delim: str, optional The delimiter between taxonomic groups Returns ------- DataFrame The taxonomic strings parsed into n levels """ def f_(x): return pd.Series([y.strip() for y in x.split(tax_delim)]) return table[tax_col].apply(f_) def level_taxonomy(table, taxa, samples, level, consider_nan=True): """ Gets the taxonomy collapsed to the desired level Parameters ---------- table : DataFrame A table with observation on the rows (biom-style table) with `samples` in its columns. taxa: DataFrame The taxonomic strings parsed into n levels level: list The level to which the taxonomy should be summarized samples : list The columns from `table` to be included in the analysis consider_nan: bool, optional Whether the table contains multiple concatenated, in which cases considering `nan` will filter the samples to retain only the levels of interest. 
This is recommended for kraken/bracken tables, but not applicable for some 16s sequences """ level = level.max() if consider_nan: leveler = (taxa[level].notna() & taxa[(level + 1)].isna()) else: leveler = (taxa[level].notna()) cols = list(np.arange(level + 1)) # Combines the filtered tables level_ = pd.concat( axis=1, objs=[taxa.loc[leveler, cols], (table.loc[leveler, samples] / table.loc[leveler, samples].sum(axis=0))], # # sort=False ) level_.reset_index() if taxa.loc[leveler, cols].duplicated().any(): return level_.groupby(cols).sum() else: return level_.set_index(cols) def profile_one_level(collapsed, level, threshold=0.01, count=8): """ Gets upper and lower tables for a single taxonomic level Parameters ---------- Collapsed: DataFrame The counts data with the index as a multi-level index of levels of interest and the columns as samples threshold: float, optional The minimum relative abundance for an organism to be shown count : int, optional The maximum number of levels to show for a single group Returns ------- DataFame A table of the top taxa for the data of interest """ collapsed['mean'] = collapsed.mean(axis=1) collapsed.sort_values(['mean'], ascending=False, inplace=True) collapsed['count'] = 1 collapsed['count'] = collapsed['count'].cumsum() thresh_ = (collapsed['mean'] > threshold) & (collapsed['count'] <= count) top_taxa = collapsed.loc[thresh_].copy() top_taxa.drop(columns=['mean', 'count'], inplace=True) for l_ in np.arange(level): top_taxa.index = top_taxa.index.droplevel(l_) first_ = top_taxa.index[0] top_taxa.sort_values( [first_], ascending=False, axis='columns', inplace=True, ) upper_ = top_taxa.cumsum() lower_ = top_taxa.cumsum() - top_taxa return upper_, lower_ def profile_joint_levels(collapsed, lo_, hi_, samples, lo_thresh=0.01, hi_thresh=0.01, lo_count=4, hi_count=5): """ Generates a table of taxonomy using two levels to define grouping Parameters ---------- collapsed: DataFrame The counts data with the index as a multi-level index of levels of interest and the columns as samples lo_, hi_: int The numeric identifier for lower (`lo_`) and higher (`hi_`) resolution where "low" is defined as having fewer groups. (i.e. for taxonomy Phylum is low, family is high) lo_thresh, hi_thresh: int, optional The minimum relative abundance for an organism to be shown at a given level lo_count, hi_count : int, optional The maximum number of levels to show for a single group. This is to appease the limitations of our eyes and colormaps. Returns ------- DataFame A table of the top taxa for the data of interest """ collapsed['mean_hi'] = collapsed.mean(axis=1) collapsed.reset_index(inplace=True) mean_lo_rep = collapsed.groupby(lo_)['mean_hi'].sum().to_dict() collapsed['mean_lo'] = collapsed[lo_].replace(mean_lo_rep) collapsed.sort_values(['mean_lo', 'mean_hi'], ascending=False, inplace=True) collapsed['count_lo'] = ~collapsed[lo_].duplicated(keep='first') * 1. 
collapsed['count_lo'] = collapsed['count_lo'].cumsum() collapsed['count_hi'] = 1 collapsed['count_hi'] = collapsed.groupby(lo_)['count_hi'].cumsum() collapsed['thresh_lo'] = ((collapsed['mean_lo'] > lo_thresh) & (collapsed['count_lo'] <= lo_count)) collapsed['thresh_hi'] = ((collapsed['mean_hi'] > hi_thresh) & (collapsed['count_hi'] <= hi_count)) top_lo = collapsed.loc[collapsed['thresh_lo']].copy() top_lo['other'] = ~(top_lo['thresh_lo'] & ~top_lo['thresh_hi']) * 1 top_lo['new_name'] = top_lo[lo_].apply(lambda x: 'other %s' % x ) top_lo.loc[top_lo['thresh_hi'], 'new_name'] = \ top_lo.loc[top_lo['thresh_hi'], hi_] drop_levels = np.arange(hi_)[np.arange(hi_) != lo_] top_lo.drop(columns=drop_levels, inplace=True) new_taxa = top_lo.groupby([lo_, 'new_name']).sum(dropna=True) new_taxa.reset_index(inplace=True) new_taxa['mean_lo'] = new_taxa[lo_].replace(mean_lo_rep) new_taxa.set_index([lo_, 'new_name'], inplace=True) new_taxa.sort_values(['mean_lo', 'count_hi', 'mean_hi'], ascending=[False, True, False], inplace=True) upper_ = new_taxa.cumsum()[samples] upper_.sort_values([upper_.index[0], upper_.index[1]], axis='columns', inplace=True, ascending=False) lower_ = upper_ - new_taxa[upper_.columns] upper_.index.set_names(['rough', 'fine'], inplace=True) lower_.index.set_names(['rough', 'fine'], inplace=True) return upper_, lower_ def define_single_cmap(cmap, top_taxa): """ Gets the colormap a single level table """ # Gets the colormap object map_ = mpl.colormaps[cmap] # Gets the taxonomic object return {tax: map_(i) for i, tax in enumerate(top_taxa.index)} def define_join_cmap(table): """ Defines a joint colormap for a taxonomic table. """ table['dummy'] = 1 grouping = table['dummy'].reset_index() rough_order = grouping['rough'].unique() rough_map = {group: mpl.colormaps[cmap] for (group,cmap) in zip(*(rough_order, colors_order))} pooled_map = dict([]) for rough_, fine_ in grouping.groupby('rough')['fine']: cmap_ = rough_map[rough_] colors = {c: cmap_(200-(i + 1) * 20) for i, c in enumerate(fine_)} pooled_map.update(colors) table.drop(columns=['dummy'], inplace=True) return pooled_map def plot_area(upper_, lower_, colors, sample_interval=5): """ An in-elegant function to make an area plot Yes, you'll get far more control if you do it yourself outside this function but it will at least give you a first pass of a stacked area plot Parameters --------- upper_, lower_ : DataFrame The upper (`top_`) and lower (`low_`) limits for the area plot. This should already be sorted in the desired order. colors: dict A dictionary mapping the taxonomic label to the appropriate matplotlib readable colors. For convenience, `define_single_colormap` and `define_joint_colormap` are good functions to use to generate this sample_interval : int, optional The interval for ticks for counting samples. Returns ------- Figure A 8" x 4" matplotlib figure with the area plot and legend. """ # Gets the figure fig_, ax1 = plt.subplots(1,1) fig_.set_size_inches((8, 4)) ax1.set_position((0.15, 0.125, 0.4, 0.75)) # Plots the area plot x = np.arange(0, len(upper_.columns)) for taxa, hi_ in upper_.iloc[::-1].iterrows(): lo_ = lower_.loc[taxa] cl_ = colors[taxa] ax1.fill_between(x=x, y1=1-lo_.values, y2=1-hi_.values, color=cl_, label=taxa) # Adds the legend leg_ = ax1.legend() leg_.set_bbox_to_anchor((2.05, 1)) # Sets up the y-axis so the order matches the colormap # (accomplished by flipping the axis?) 
ax1.set_ylim((1, 0)) ax1.set_yticks(np.arange(0, 1.1, 0.25)) ax1.set_yticklabels(np.arange(1, -0.1, -0.25), size=11) ax1.set_ylabel('Relative Abundance', size=13) # Sets up x-axis without numeric labels ax1.set_xticklabels([]) ax1.set_xticks(np.arange(0, x.max(), sample_interval)) ax1.set_xlim((0, x.max() - 0.99)) # Subtract less than 1 to avoid singularity if xmin=xmax=0 ax1.set_xlabel('Samples', size=13) return fig_ def single_area_plot(table, level=3, samples=None, tax_col='taxon_name', cmap='Set3', tax_delim='|', multilevel_table=True, abund_thresh=0.1, group_thresh=8): """ Generates an area plot for the table at the specified level of resolution Parameters ---------- table : DataFrame A pandas dataframe of the original table of data (either containing counts or relative abundance) level : int The hierarchical level within the table to display as an integer cmap : str The qualitative colormap to use to generate your plot. Refer to colorbrewer for options. If a selected colormap exceeds the number of groups (`--group-thresh`) possible, it will default to Set3. samples : list, optional The columns from `table` to be included in the analysis. If `samples` is None, then all columns in `table` except `tax_col` will be used. tax_col : str, optional The column in `table` which contains the taxonomic information. tax_delim: str, optional The delimiter between taxonomic levels, for example "|" or ";". multilevel_table: bool, optional Whether the table contains multiple concatenated, in which cases considering `nan` will filter the samples to retain only the levels of interest. This is recommended for kraken/bracken tables, but not applicable for some 16s sequences abund_thresh: float [0, 1] The mean abundance threshold for a sample to be plotted. This is in conjunction with the group threshold (`--group-thresh`) will be used to determine the groups that are shown. group_thresh: int, [1, 12] The maximum number of groups (colors) to show in the area plot. This is handled in conjunction with the `--abund-thresh` in that to be displayed, a group must have both a mean relative abundance exceeding the `abund-thresh` and must be in the top `group-thresh` groups. Returns ------- Figure A 8" x 4" matplotlib figure with the area plot and legend. Also See -------- make_joint_area_plot """ if group_thresh > 12: raise ValueError("You may display at most 12 colors on this plot. " "Please re-consider your plotting choices.") elif (group_thresh > 9) & ~(cmap in over9): warnings.warn('There are too many colors for your colormap. ' 'Changing to Set3.') cmap = 'Set3' elif (group_thresh > 8) & ~(cmap in over8): warnings.warn('There are too many colors for your colormap. 
' 'Changing to Set3.') cmap = 'Set3' # Parses the taxonomy and collapses the table taxa = extract_label_array(table, tax_col, tax_delim) if samples is None: samples = list(table.columns.values) samples.remove(tax_col) # Gets the appropriate taxonomic level information to go forward collapsed = level_taxonomy(table, taxa, samples, np.array([level]), consider_nan=multilevel_table) # Gets the top taxonomic levels upper_, lower_, = profile_one_level(collapsed, np.array([level]), threshold=abund_thresh, count=group_thresh) # Gets the colormap cmap = define_single_cmap(cmap, upper_) # Plots the data fig_ = plot_area(upper_, lower_, cmap) return fig_ def joint_area_plot(table, rough_level=2, fine_level=5, samples=None, tax_col='taxon_name', tax_delim='|', multilevel_table=True, abund_thresh_rough=0.1, abund_thresh_fine=0.05, group_thresh_fine=5, group_thresh_rough=5): """ Generates an area plot with nested grouping where the the higher level (`rough_level`) in the table (lower resolution/fewer groups) is used to provide the general grouping structure and then within each `rough_level`, a number of `fine_level` groups are displayed. Parameters ---------- table : DataFrame A dataframe of hte original data, either as counts or relative abundance with the taxonomic information in `tax_col`. The data can have separate count values at multiple levels (i.e. combine) collapsed phylum, class, etc levels. rough_level, fine_level: int The taxonomic levels to be displayed. The `fine_level` will be grouped by `rough_level` to display the data grouped by `rough_level`. The `rough_level` should smaller than the `fine_level`. samples : list, optional The columns from `table` to be included in the analysis. If `samples` is None, then all columns in `table` except `tax_col` will be used. tax_col : str, optional The column in `table` which contains the taxonomic information. tax_delim: str, optional The delimiter between taxonomic levels, for example "|" or ";". multilevel_table: bool, optional Whether the table contains multiple concatenated, in which cases considering `nan` will filter the samples to retain only the levels of interest. This is recommended for kraken/bracken tables, but not applicable for some 16s sequences abund_thresh_rough, abund_thresh_fine : float [0, 1] The mean abundance threshold for a taxonomic group to be plotted for the higher level grouping (`abund_thresh_rough`) and sub grouping level. This will be used in conjunction with the `group_thresh_rough` and `group_thresh_fine` to determine the number of groups to be included. group_thresh_fine, group_thresh_rough: int, [1, 6] The maximum number of taxonmic groups to display for the respective level. If `group_thresh_rough` > 6, then it will be replaced with 6 because this is the maximum number of avaliable color groups. Returns ------- Figure A 8" x 4" matplotlib figure with the area plot and legend. 
Also See -------- single_area_plot """ # Parses the taxonomy and collapses the table taxa = extract_label_array(table, tax_col, tax_delim) if samples is None: samples = list(table.drop(columns=[tax_col]).columns) # Gets the appropriate taxonomic level information to go forward collapsed = level_taxonomy(table, taxa, samples, level=np.array([fine_level]), consider_nan=multilevel_table) samples = collapsed.columns # Gets the top taxonomic levels upper_, lower_, = profile_joint_levels(collapsed, rough_level, fine_level, samples=samples, lo_thresh=abund_thresh_rough, lo_count=min(5, group_thresh_rough), hi_thresh=abund_thresh_fine, hi_count=group_thresh_fine, ) # Gets the colormap cmap = define_join_cmap(upper_) upper_.index = upper_.index.droplevel('rough') lower_.index = lower_.index.droplevel('rough') # Plots the data fig_ = plot_area(upper_.astype(float), lower_.astype(float), cmap) return fig_ # Sets up the main arguments for argparse. def create_argparse(): parser_one = argparse.ArgumentParser( description=('A set of functions to generate diagnostic stacked area ' 'plots from metagenomic outputs.'), prog=('area_plotter'), ) parser_one.add_argument( '-t', '--table', help=('The abundance table as a tsv classic biom (features as rows, ' 'samples as columns) containing absloute or relative abundance ' 'for the samples.'), required=True, ) parser_one.add_argument( '-o', '--output', help=('The location for the final figure'), required=True, ) parser_one.add_argument( '-s', '--samples', help=('A text file with the list of samples to be included (one ' 'per line). If no list is provided, then data from all columns ' 'in the table (except the one specifying taxonomy) will be used.'), ) parser_one.add_argument( '--mode', choices=mode_dict.keys(), help=('The software generating the table to make parsing easier. ' 'Options are kraken, metaphlan, marker (i.e. CTMR amplicon).'), ) parser_one.add_argument( '-l', '--level', help=('The taxonomic level (as an integer) to plot the data.'), default=3, type=int, ) parser_one.add_argument( '--abund-thresh', help=("the minimum abundance required to display a group."), default=0.01, type=float, ) parser_one.add_argument( '--group-thresh', help=("The maximum number of groups to be displayed in the graph."), default=8, type=int, ) parser_one.add_argument( '-c', '--colormap', help=("The qualitative colormap to use to generate your plot. Refer" ' to colorbrewer for options. If a selected colormap exceeds ' 'the number of groups (`--group-thresh`) possible, it will ' 'default to Set3.'), default='Set3', ) parser_one.add_argument( '--sub-level', help=('The second level to use if doing a joint plot'), type=int, ) parser_one.add_argument( '--sub-abund-thresh', help=("the minimum abundance required to display a sub group"), default=0.05, type=float, ) parser_one.add_argument( '--sub-group-thresh', help=("the maximum number of sub groups allowed in a joint level plot."), default=5, type=float, ) parser_one.add_argument( '--tax-delim', help=("String delimiting taxonomic levels."), type=str, ) parser_one.add_argument( '--multi-level', help=("Whether the table contains multiple concatenated, in which " "case considering `nan` will filter the samples to retain only" "the levels of interest. 
This is recommended for most " "metagenomic tables, but not applicable for some 16s sequences."), ) parser_one.add_argument( "--tax-col", help=("The column in `table` containig the taxobnomy information"), ) parser_one.add_argument( '--table-drop', help=('A comma-seperated list describing the columns to drop'), ) parser_one.add_argument( '--skip-rows', help=('The number of rows to skip when reading in the feature table.') ) return parser_one if __name__ == '__main__': parser_one = create_argparse() if len(argv) < 2: parser_one.print_help() exit() args = parser_one.parse_args() if args.table_drop is not None: args.table_drop = [s for s in args.table_drop.split(',')] else: args.table_drop = [] mode_defaults = mode_dict.get(args.mode, mode_dict['kraken2']) mode_defaults.update({k: v for k, v in args.__dict__.items() if (k in mode_defaults) and (v)}) table = pd.read_csv(args.table, sep='\t', skiprows=mode_defaults['skip_rows']) if args.samples is not None: with open(args.samples, 'r') as f_: samples = f_.read().split('\n') else: samples = None if args.sub_level is not None: fig_ = joint_area_plot( table.drop(columns=mode_defaults['table_drop']), rough_level=args.level, fine_level=args.sub_level, samples=args.samples, tax_delim=mode_defaults['tax_delim'], tax_col=mode_defaults['tax_col'], multilevel_table=mode_defaults['multi_level'], abund_thresh_rough=args.abund_thresh, group_thresh_rough=args.group_thresh, abund_thresh_fine=args.sub_abund_thresh, group_thresh_fine=args.sub_group_thresh, ) else: fig_ = single_area_plot( table.drop(columns=mode_defaults['table_drop']), level=args.level, cmap=args.colormap, samples=samples, tax_delim=mode_defaults['tax_delim'], tax_col=mode_defaults['tax_col'], multilevel_table=mode_defaults['multi_level'], abund_thresh=args.abund_thresh, group_thresh=args.group_thresh, ) fig_.savefig(args.output, dpi=300) |
scripts/area_plot.py
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | __author__ = "Fredrik Boulund" __date__ = "2020-2022" __version__ = "1.1" from sys import argv, exit from functools import reduce, partial from pathlib import Path import argparse import pandas as pd def parse_args(): parser = argparse.ArgumentParser(description=__doc__) parser.add_argument("TABLE", nargs="+", help="TSV table with columns headers.") parser.add_argument("-f", "--feature-column", dest="feature_column", default="name", help="Column header of feature column to use, " "typically containing taxa names. " "Select several columns by separating with comma (e.g. name,taxid) " "[%(default)s].") parser.add_argument("-c", "--value-column", dest="value_column", default="fraction_total_reads", help="Column header of value column to use, " "typically containing counts or abundances [%(default)s].") parser.add_argument("-o", "--outfile", dest="outfile", default="joined_table.tsv", help="Outfile name [%(default)s].") parser.add_argument("-n", "--fillna", dest="fillna", metavar="FLOAT", default=0.0, type=float, help="Fill NA values in merged table with FLOAT [%(default)s].") parser.add_argument("-s", "--skiplines", dest="skiplines", metavar="N", default=0, type=int, help="Skip N lines before parsing header (e.g. for files " "containing comments before the real header) [%(default)s].") if len(argv) < 2: parser.print_help() exit() return parser.parse_args() def main(table_files, feature_column, value_column, outfile, fillna, skiplines): feature_columns = feature_column.split(",") tables = [] for table_file in table_files: sample_name = Path(table_file).name.split(".")[0] tables\ .append(pd.read_csv(table_file, sep="\t", skiprows=skiplines)\ .set_index(feature_columns)\ .rename(columns={value_column: sample_name})\ .loc[:, [sample_name]]) # Ugly hack to get a single-column DataFrame df = tables[0] for table in tables[1:]: df = df.join(table, how="outer") df.fillna(fillna, inplace=True) df.to_csv(outfile, sep="\t") if __name__ == "__main__": args = parse_args() if len(args.TABLE) < 2: print("Need at least two tables to merge!") exit(1) main(args.TABLE, args.feature_column, args.value_column, args.outfile, args.fillna, args.skiplines) |
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 | import os, sys, argparse import operator from time import gmtime from time import strftime #Tree Class #usage: tree node used in constructing a taxonomy tree # including only the taxonomy levels and genomes identified in the Kraken report class Tree(object): 'Tree node.' def __init__(self, name, taxid, level_num, level_id, all_reads, lvl_reads, children=None, parent=None): self.name = name self.taxid = taxid self.level_num = level_num self.level_id = level_id self.tot_all = all_reads self.tot_lvl = lvl_reads self.all_reads = {} self.lvl_reads = {} self.children = [] self.parent = parent if children is not None: for child in children: self.add_child(child) def add_child(self,node): assert isinstance(node,Tree) self.children.append(node) def add_reads(self, sample, all_reads, lvl_reads): self.all_reads[sample] = all_reads self.lvl_reads[sample] = lvl_reads self.tot_all += all_reads self.tot_lvl += lvl_reads def __lt__(self,other): return self.tot_all < other.tot_all #################################################################### #process_kraken_report #usage: parses a single line in the kraken report and extracts relevant information #input: kraken report file with the following tab delimited lines # - percent of total reads # - number of reads (including at lower levels) # - number of reads (only at this level) # - taxonomy classification of level # (U, - (root), - (cellular org), D, P, C, O, F, G, S) # - taxonomy ID (0 = unclassified, 1 = root, 2 = Bacteria...etc) # - spaces + name #returns: # - classification/genome name # - taxonomy ID for this classification # - level for this classification (number) # - level name (U, -, D, P, C, O, F, G, S) # - all reads classified at this level and below in the tree # - reads classified only at this level def process_kraken_report(curr_str): split_str = curr_str.strip().split('\t') if len(split_str) < 5: return [] try: int(split_str[1]) except ValueError: return [] #Extract relevant information all_reads = int(split_str[1]) level_reads = int(split_str[2]) level_type = split_str[-3] taxid = split_str[-2] #Get name and spaces spaces = 0 name = split_str[-1] for char in name: if char == ' ': name = name[1:] spaces += 1 else: break #Determine which level based on number of spaces level_num = int(spaces/2) return [name, taxid, level_num, level_type, all_reads, level_reads] #################################################################### #Main method def main(): #Parse arguments parser = argparse.ArgumentParser() parser.add_argument('-r','--report-file','--report-files', 
'--report','--reports', required=True,dest='r_files',nargs='+', help='Input kraken report files to combine (separate by spaces)') parser.add_argument('-o','--output', required=True,dest='output', help='Output kraken report file with combined information') parser.add_argument('--display-headers',required=False,dest='headers', action='store_true', default=True, help='Include header lines') parser.add_argument('--no-headers',required=False,dest='headers', action='store_false',default=True, help='Do not include header lines') parser.add_argument('--sample-names',required=False,nargs='+', dest='s_names',default=[],help='Sample names to use as headers in the new report') parser.add_argument('--only-combined', required=False, dest='c_only', action='store_true', default=False, help='Include only the total combined reads column, not the individual sample cols') args=parser.parse_args() #Initialize combined values main_lvls = ['U','R','D','K','P','C','O','F','G','S'] map_lvls = {'kingdom':'K', 'superkingdom':'D','phylum':'P','class':'C','order':'O','family':'F','genus':'G','species':'S'} count_samples = 0 num_samples = len(args.r_files) sample_names = args.s_names root_node = -1 prev_node = -1 curr_node = -1 u_reads = {0:0} total_reads = {0:0} taxid2node = {} #Check input values if len(sample_names) > 0 and len(sample_names) != num_samples: sys.stderr.write("Number of sample names provided does not match number of reports\n") sys.exit(1) #Map names id2names = {} id2files = {} if len(sample_names) == 0: for i in range(num_samples): id2names[i+1] = "S" + str(i+1) id2files[i+1] = "" else: for i in range(num_samples): id2names[i+1] = sample_names[i] id2files[i+1] = "" ################################################# #STEP 1: READ IN REPORTS #Iterate through reports and make combined tree! 
sys.stdout.write(">>STEP 1: READING REPORTS\n") sys.stdout.write("\t%i/%i samples processed" % (count_samples, num_samples)) sys.stdout.flush() for r_file in args.r_files: count_samples += 1 sys.stdout.write("\r\t%i/%i samples processed" % (count_samples, num_samples)) sys.stdout.flush() id2files[count_samples] = r_file #Open File curr_file = open(r_file,'r') for line in curr_file: report_vals = process_kraken_report(line) if len(report_vals) < 5: continue [name, taxid, level_num, level_id, all_reads, level_reads] = report_vals if level_id in map_lvls: level_id = map_lvls[level_id] #Total reads total_reads[0] += level_reads total_reads[count_samples] = level_reads #Unclassified if level_id == 'U' or taxid == '0': u_reads[0] += level_reads u_reads[count_samples] = level_reads continue #Tree Root if taxid == '1': if count_samples == 1: root_node = Tree(name, taxid, level_num, 'R', 0,0) taxid2node[taxid] = root_node root_node.add_reads(count_samples, all_reads, level_reads) prev_node = root_node continue #Move to correct parent while level_num != (prev_node.level_num + 1): prev_node = prev_node.parent #IF NODE EXISTS if taxid in taxid2node: taxid2node[taxid].add_reads(count_samples, all_reads, level_reads) prev_node = taxid2node[taxid] continue #OTHERWISE #Determine correct level ID if level_id == '-' or len(level_id)> 1: if prev_node.level_id in main_lvls: level_id = prev_node.level_id + '1' else: num = int(prev_node.level_id[-1]) + 1 level_id = prev_node.level_id[:-1] + str(num) #Add node to tree curr_node = Tree(name, taxid, level_num, level_id, 0, 0, None, prev_node) curr_node.add_reads(count_samples, all_reads, level_reads) taxid2node[taxid] = curr_node prev_node.add_child(curr_node) prev_node = curr_node curr_file.close() sys.stdout.write("\r\t%i/%i samples processed\n" % (count_samples, num_samples)) sys.stdout.flush() ################################################# #STEP 2: SETUP OUTPUT FILE sys.stdout.write(">>STEP 2: WRITING NEW REPORT HEADERS\n") o_file = open(args.output,'w') #Lines mapping sample ids to filenames if args.headers: o_file.write("#Number of Samples: %i\n" % num_samples) o_file.write("#Total Number of Reads: %i\n" % total_reads[0]) for i in id2names: o_file.write("#") o_file.write("%s\t" % id2names[i]) o_file.write("%s\n" % id2files[i]) #Report columns o_file.write("#perc\ttot_all\ttot_lvl") if not args.c_only: for i in id2names: o_file.write("\t%s_all" % i) o_file.write("\t%s_lvl" % i) o_file.write("\tlvl_type\ttaxid\tname\n") ################################################# #STEP 3: PRINT TREE sys.stdout.write(">>STEP 3: PRINTING REPORT\n") #Print line for unclassified reads o_file.write("%0.4f\t" % (float(u_reads[0])/float(total_reads[0])*100)) for i in u_reads: if i == 0 or (i > 0 and not args.c_only): o_file.write("%i\t" % u_reads[i]) o_file.write("%i\t" % u_reads[i]) o_file.write("U\t0\tunclassified\n") #Print for all remaining reads all_nodes = [root_node] curr_node = -1 curr_lvl = 0 prev_node = -1 while len(all_nodes) > 0: #Remove node and insert children curr_node = all_nodes.pop() if len(curr_node.children) > 0: curr_node.children.sort() for node in curr_node.children: all_nodes.append(node) #Print information for this node o_file.write("%0.4f\t" % (float(curr_node.tot_all)/float(total_reads[0])*100)) o_file.write("%i\t" % curr_node.tot_all) o_file.write("%i\t" % curr_node.tot_lvl) if not args.c_only: for i in range(num_samples): if (i+1) not in curr_node.all_reads: o_file.write("0\t0\t") else: o_file.write("%i\t" % curr_node.all_reads[i+1]) 
o_file.write("%i\t" % curr_node.lvl_reads[i+1]) o_file.write("%s\t" % curr_node.level_id) o_file.write("%s\t" % curr_node.taxid) o_file.write(" "*curr_node.level_num*2) o_file.write("%s\n" % curr_node.name) o_file.close() #################################################################### if __name__ == "__main__": main() |
import os, sys, argparse

####################################################################
#process_kraken_report
#usage: parses a single line in the kraken report and extracts relevant information
#input: kraken report file with the following tab delimited lines
#   - percent of total reads
#   - number of reads (including at lower levels)
#   - number of reads (only at this level)
#   - taxonomy classification of level
#     (U, D, P, C, O, F, G, S, -)
#   - taxonomy ID (0 = unclassified, 1 = root, 2 = Bacteria,...etc)
#   - spaces + name
#returns:
#   - classification/genome name
#   - level name (U, -, D, P, C, O, F, G, S)
#   - reads classified at this level and below in the tree
def process_kraken_report(curr_str):
    split_str = curr_str.strip().split('\t')
    if len(split_str) < 2:
        return []
    try:
        int(split_str[1])
    except ValueError:
        return []
    all_reads = int(split_str[1])
    lvl_reads = int(split_str[2])
    level_type = split_str[-3]
    type2main = {'superkingdom':'D','phylum':'P',
        'class':'C','order':'O','family':'F',
        'genus':'G','species':'S'}
    if len(level_type) > 1:
        if level_type in type2main:
            level_type = type2main[level_type]
        else:
            level_type = '-'
    #Get name and spaces
    spaces = 0
    name = split_str[-1]
    for char in name:
        if char == ' ':
            name = name[1:]
            spaces += 1
        else:
            break
    name = name.replace(' ','_')
    #Determine level based on number of spaces
    level_num = spaces/2
    return [name, level_num, level_type, lvl_reads]

###################################################################
#kreport2krona_all
#usage: prints all levels for a kraken report
#input: kraken report file and output krona file names
#returns: none
def kreport2krona_all(report_file, out_file):
    #Process report file and output
    curr_path = []
    prev_lvl_num = -1
    r_file = open(report_file, 'r')
    o_file = open(out_file, 'w')
    #Read through report file
    main_lvls = ['D','P','C','O','F','G','S']
    for line in r_file:
        report_vals = process_kraken_report(line)
        #If header line, skip
        if len(report_vals) < 4:
            continue
        #Get relevant information from the line
        [name, level_num, level_type, lvl_reads] = report_vals
        if level_type == 'U':
            o_file.write(str(lvl_reads) + "\tUnclassified\n")
            continue
        #Create level name
        if level_type not in main_lvls:
            level_type = "x"
        elif level_type == "D":
            level_type = "K"
        level_str = level_type.lower() + "__" + name
        #Determine full string to add
        if prev_lvl_num == -1:
            #First level
            prev_lvl_num = level_num
            curr_path.append(level_str)
            o_file.write(str(lvl_reads) + "\t" + level_str + "\n")
        else:
            o_file.write(str(lvl_reads))
            #Move back if needed
            while level_num != (prev_lvl_num + 1):
                prev_lvl_num -= 1
                curr_path.pop()
            #Print all ancestors of current level followed by |
            for string in curr_path:
                if string[0] != "r":
                    o_file.write("\t" + string)
            #Print final level and then number of reads
            o_file.write("\t" + level_str + "\n")
            #Update
            curr_path.append(level_str)
            prev_lvl_num = level_num
    o_file.close()
    r_file.close()

###################################################################
#kreport2krona_main
#usage: prints only main taxonomy levels for a kraken report
#input: kraken report file and output krona file names
#returns: none
def kreport2krona_main(report_file, out_file):
    #Process report file and output
    main_lvls = ['D','P','C','O','F','G','S']
    curr_path = []
    prev_lvl_num = -1
    num2path = {}
    path2reads = {}
    line_num = -1
    #Read through report file
    r_file = open(report_file, 'r')
    for line in r_file:
        line_num += 1
        #########################################
        report_vals = process_kraken_report(line)
        #If header line, skip
        if len(report_vals) < 4:
            continue
        #Get relevant information from the line
        [name, level_num, level_type, lvl_reads] = report_vals
        if level_type == 'U':
            num2path[line_num] = ["Unclassified"]
            path2reads["Unclassified"] = lvl_reads
            continue
        #########################################
        #Create level name
        if level_type not in main_lvls:
            level_type = "x"
        elif level_type == "D":
            level_type = "K"
        level_str = level_type.lower() + "__" + name
        #########################################
        #Determine full string to add
        if prev_lvl_num == -1:
            #First level
            prev_lvl_num = level_num
            curr_path.append(level_str)
            #Save
            if curr_path[-1][0] == "x":
                num2path[line_num] = ""
            else:
                path2reads[curr_path[-1]] = lvl_reads
                num2path[line_num] = []
                for i in curr_path:
                    num2path[line_num].append(i)
            continue
        else:
            #########################################
            #Move back if needed
            while level_num != (prev_lvl_num + 1):
                prev_lvl_num -= 1
                curr_path.pop()
            #Update the list
            curr_path.append(level_str)
            prev_lvl_num = level_num
            #########################################
            #IF AT NON-TRADITIONAL LEVEL, ADD TO PARENT
            if level_type == "x":
                test_num = len(curr_path) - 1
                while(test_num >= 0):
                    if curr_path[test_num][0] != "x":
                        path2reads[curr_path[test_num]] += lvl_reads
                        test_num = -1
                    test_num = test_num - 1
                num2path[line_num] = ""
            #IF AT TRADITIONAL LEVEL, SAVE
            if level_type != "x":
                path2reads[curr_path[-1]] = lvl_reads
                num2path[line_num] = []
                for i in curr_path:
                    num2path[line_num].append(i)
    r_file.close()
    #WRITE OUTPUT FILE
    o_file = open(out_file, 'w')
    for i in range(0,line_num+1):
        #Get values
        if i not in num2path:
            continue
        curr_path = num2path[i]
        if len(curr_path) > 0:
            curr_reads = path2reads[curr_path[-1]]
            if curr_path[-1][0] != "x":
                o_file.write("%i" % curr_reads)
                for name in curr_path:
                    if name[0] != "r" and name[0] != "x":
                        o_file.write("\t%s" % name)
                o_file.write("\n")
    o_file.close()

######################################################################
#Main method
def main():
    #Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--report-file', '--report', required=True,
        dest='r_file', help='Input kraken report file for converting')
    parser.add_argument('-o', '--output', required=True,
        dest='o_file', help='Output krona-report file name')
    parser.add_argument('--intermediate-ranks', action='store_true',
        dest='x_include', default=False, required=False,
        help='Include non-traditional taxonomic ranks in output')
    parser.add_argument('--no-intermediate-ranks', action='store_false',
        dest='x_include', default=False, required=False,
        help='Do not include non-traditional taxonomic ranks in output [default: no intermediate ranks]')
    args = parser.parse_args()

    #Determine which krona report to make
    if args.x_include:
        kreport2krona_all(args.r_file, args.o_file)
    else:
        kreport2krona_main(args.r_file, args.o_file)

#################################################################
if __name__ == "__main__":
    main()
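For orientation, a hypothetical invocation of the script above could look like this (the script and file names are placeholders; the flags come from the argparse definitions above):

python kreport2krona.py -r sample.kreport -o sample.krona
python kreport2krona.py -r sample.kreport -o sample.krona --intermediate-ranks

The first form keeps only the main taxonomic ranks (kreport2krona_main), while --intermediate-ranks also includes non-traditional ranks (kreport2krona_all). The resulting text file is intended to be imported into Krona.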
import os, sys, argparse

#process_kraken_report
#usage: parses a single line in the kraken report and extracts relevant information
#input: kraken report file with the following tab delimited lines
#   - percent of total reads
#   - number of reads (including at lower levels)
#   - number of reads (only at this level)
#   - taxonomy classification of level
#     (U, D, P, C, O, F, G, S, -)
#   - taxonomy ID (0 = unclassified, 1 = root, 2 = Bacteria,...etc)
#   - spaces + name
#returns:
#   - classification/genome name
#   - level name (U, -, D, P, C, O, F, G, S)
#   - reads classified at this level and below in the tree
def process_kraken_report(curr_str):
    split_str = curr_str.strip().split('\t')
    if len(split_str) < 4:
        return []
    try:
        int(split_str[1])
    except ValueError:
        return []
    percents = float(split_str[0])
    all_reads = int(split_str[1])
    #Extract relevant information
    try:
        taxid = int(split_str[-3])
        level_type = split_str[-2]
        map_kuniq = {'species':'S', 'genus':'G','family':'F',
            'order':'O','class':'C','phylum':'P','superkingdom':'D',
            'kingdom':'K'}
        if level_type not in map_kuniq:
            level_type = '-'
        else:
            level_type = map_kuniq[level_type]
    except ValueError:
        taxid = int(split_str[-2])
        level_type = split_str[-3]
    #Get name and spaces
    spaces = 0
    name = split_str[-1]
    for char in name:
        if char == ' ':
            name = name[1:]
            spaces += 1
        else:
            break
    name = name.replace(' ','_')
    #Determine level based on number of spaces
    level_num = spaces/2
    return [name, level_num, level_type, all_reads, percents]

#Main method
def main():
    #Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--report-file', '--report', required=True,
        dest='r_file', help='Input kraken report file for converting')
    parser.add_argument('-o', '--output', required=True,
        dest='o_file', help='Output mpa-report file name')
    parser.add_argument('--display-header', action='store_true',
        dest='add_header', default=False, required=False,
        help='Include header [Kraken report filename] in mpa-report file [default: no header]')
    parser.add_argument('--read_count', action='store_true',
        dest='use_reads', default=True, required=False,
        help='Use read count for output [default]')
    parser.add_argument('--percentages', action='store_false',
        dest='use_reads', default=True, required=False,
        help='Use percentages for output [instead of reads]')
    parser.add_argument('--intermediate-ranks', action='store_true',
        dest='x_include', default=False, required=False,
        help='Include non-traditional taxonomic ranks in output')
    parser.add_argument('--no-intermediate-ranks', action='store_false',
        dest='x_include', default=False, required=False,
        help='Do not include non-traditional taxonomic ranks in output [default]')
    args = parser.parse_args()

    #Process report file and output
    curr_path = []
    prev_lvl_num = -1
    r_file = open(args.r_file, 'r')
    o_file = open(args.o_file, 'w')
    #Print header
    if args.add_header:
        o_file.write("taxon_name\treads\n")

    #Read through report file
    main_lvls = ['R','K','D','P','C','O','F','G','S']
    for line in r_file:
        report_vals = process_kraken_report(line)
        #If header line, skip
        if len(report_vals) < 5:
            continue
        #Get relevant information from the line
        [name, level_num, level_type, all_reads, percents] = report_vals
        if level_type == 'U':
            continue
        #Create level name
        if level_type not in main_lvls:
            level_type = "x"
        elif level_type == "K":
            level_type = "k"
        elif level_type == "D":
            level_type = "k"
        level_str = level_type.lower() + "__" + name
        #Determine full string to add
        if prev_lvl_num == -1:
            #First level
            prev_lvl_num = level_num
            curr_path.append(level_str)
        else:
            #Move back if needed
            while level_num != (prev_lvl_num + 1):
                prev_lvl_num -= 1
                curr_path.pop()
            #Print if at non-traditional level and that is requested
            if (level_type == "x" and args.x_include) or level_type != "x":
                #Print all ancestors of current level followed by |
                for string in curr_path:
                    if (string[0] == "x" and args.x_include) or string[0] != "x":
                        if string[0] != "r":
                            o_file.write(string + "|")
                #Print final level and then number of reads
                if args.use_reads:
                    o_file.write(level_str + "\t" + str(all_reads) + "\n")
                else:
                    o_file.write(level_str + "\t" + str(percents) + "\n")
            #Update
            curr_path.append(level_str)
            prev_lvl_num = level_num
    o_file.close()
    r_file.close()

if __name__ == "__main__":
    main()
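This second script writes an mpa-style (MetaPhlAn-like) table rather than a Krona text file. A hypothetical invocation, again with placeholder script and file names, might be:

python kreport2mpa.py -r sample.kreport -o sample.mpa.tsv --display-header
python kreport2mpa.py -r sample.kreport -o sample.mpa.tsv --percentages

Read counts are reported by default; --percentages switches the output column to the percentages parsed from the report.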
__author__ = "Fredrik Boulund"
__date__ = "2019-03-07"
__version__ = "2.0.0"

from sys import argv, exit, stderr
from collections import defaultdict
from pathlib import Path
import argparse
import logging
import csv

logging.basicConfig(format="%(levelname)s: %(message)s")

def parse_args():
    desc = "{} Version v{}. Copyright (c) {}.".format(__doc__, __version__, __author__, __date__[:4])
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("RPKM", nargs="+",
            help="RPKM file(s) from BBMap pileup.sh.")
    parser.add_argument("-c", "--columns", dest="columns",
            default="",
            help="Comma-separated list of column names to include [all columns].")
    parser.add_argument("-a", "--annotation-file", dest="annotation_file",
            required=True,
            help="Two-column tab-separated annotation file.")
    parser.add_argument("-o", "--outdir", dest="outdir", metavar="DIR",
            default="",
            help="Directory for output files, will create one output file per selected column.")

    if len(argv) < 2:
        parser.print_help()
        exit(1)

    return parser.parse_args()

def parse_rpkm(rpkm_file):
    read_counts = {}
    with open(rpkm_file) as f:
        firstline = f.readline()
        if not firstline.startswith("#File"):
            logging.error("File does not look like a BBMap pileup.sh RPKM: %s", rpkm_file)
        _ = [f.readline() for l in range(4)]  # Skip remaining header lines: #Reads, #Mapped, #RefSequences, Table header
        for line_no, line in enumerate(f, start=1):
            try:
                ref, length, bases, coverage, reads, RPKM, frags, FPKM = line.strip().split("\t")
            except ValueError:
                logging.error("Could not parse RPKM file line %s: %s", line_no, rpkm_file)
                continue
            if int(reads) != 0:
                ref = ref.split()[0]  # Truncate reference header on first space
                read_counts[ref] = int(reads)
    return read_counts

def parse_annotations(annotation_file):
    annotations = defaultdict(dict)
    with open(annotation_file) as f:
        csv_reader = csv.DictReader(f, delimiter="\t")
        for line in csv_reader:
            ref = list(line.values())[0].split()[0]  # Truncate reference header on first space
            for colname, value in list(line.items())[1:]:
                annotations[colname][ref] = value
    return annotations

def merge_counts(annotations, rpkms):
    output_table = {"Unknown": [0 for n in range(len(rpkms))]}
    for annotation in set(annotations.values()):
        output_table[annotation] = [0 for n in range(len(rpkms))]
    for idx, rpkm in enumerate(rpkms):
        for ref, count in rpkm.items():
            try:
                output_table[annotations[ref]][idx] += count
            except KeyError:
                logging.warning("Found no annotation for '%s', assigning to 'Unknown'", ref)
                output_table["Unknown"][idx] += count
    return output_table

def write_table(table_data, sample_names, outfile):
    with open(str(outfile), "w") as outf:
        header = "\t".join(["Annotation"] + [sample_name for sample_name in sample_names]) + "\n"
        outf.write(header)
        for ref, counts in table_data.items():
            outf.write("{}\t{}\n".format(ref, "\t".join(str(count) for count in counts)))

if __name__ == "__main__":
    args = parse_args()

    Path(args.outdir).mkdir(parents=True, exist_ok=True)

    rpkms = []
    for rpkm_file in args.RPKM:
        rpkms.append(parse_rpkm(rpkm_file))
    annotations = parse_annotations(args.annotation_file)

    if args.columns:
        selected_columns = []
        for col in args.columns.split(","):
            if col in annotations:
                selected_columns.append(col)
            else:
                logging.warning("Column %s not found in annotation file!", col)
    else:
        selected_columns = list(annotations.keys())

    for selected_column in selected_columns:
        table_data = merge_counts(annotations[selected_column], rpkms)
        sample_names = [Path(fn).stem.split(".")[0] for fn in args.RPKM]
        table_filename = Path(args.outdir) / "counts.{}.tsv".format(selected_column)
        write_table(table_data, sample_names, table_filename)
        logging.debug("Wrote %s", table_filename)  # Use %-style lazy formatting like the other logging calls
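The script above sums per-reference read counts from one or more BBMap pileup.sh RPKM files into per-annotation count tables, writing one counts.<column>.tsv file per selected annotation column and assigning unannotated references to 'Unknown'. A hypothetical invocation (script, file, and column names are placeholders) might be:

python make_count_table.py -a annotations.tsv -o counts/ -c Product sample1.rpkm.txt sample2.rpkm.txt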
__author__ = "Fredrik Boulund"
__date__ = "2022"
__version__ = "0.4"

from sys import argv, exit
from collections import defaultdict
from pathlib import Path
import argparse
import logging

import numpy as np
import pandas as pd
import seaborn as sns

TAXLEVELS = [
    "Kingdom",
    "Phylum",
    "Class",
    "Order",
    "Family",
    "Genus",
    "Species",
    "Strain",
]

def parse_args():
    desc = f"{__doc__} v{__version__}. {__author__} (c) {__date__}."
    parser = argparse.ArgumentParser(description=desc, epilog="Version "+__version__)
    parser.add_argument("mpa_table",
            help="MetaPhlAn TSV table to plot.")
    parser.add_argument("-o", "--outfile-prefix", dest="outfile_prefix",
            default="mpa_heatmap",
            help="Outfile name [%(default)s]. "
                 "Will be appended with <taxonomic_level>_top<N>.{png,pdf}")
    parser.add_argument("-f", "--force", action="store_true",
            default=False,
            help="Overwrite output file if it already exists [%(default)s].")
    parser.add_argument("-l", "--level", default="Species",
            choices=TAXLEVELS,
            help="Taxonomic level to summarize results for [%(default)s].")
    parser.add_argument("-t", "--topN", metavar="N", default=50, type=int,
            help="Only plot the top N taxa [%(default)s].")
    parser.add_argument("-p", "--pseudocount", metavar="P", default=-1, type=float,
            help="Use custom pseudocount, a negative value means to "
                 "autocompute a pseudocount as the median of the 0.01th "
                 "quantile across all samples [%(default)s].")
    parser.add_argument("-c", "--colormap", default="viridis",
            help="Matplotlib colormap to use [%(default)s].")
    parser.add_argument("-M", "--method", default="average",
            help="Linkage method to use, "
                 "see scipy.cluster.hierarchy.linkage docs [%(default)s].")
    parser.add_argument("-m", "--metric", default="euclidean",
            help="Distance metric to use, "
                 "see scipy.spatial.distance.pdist docs [%(default)s].")
    parser.add_argument("-L", "--loglevel", choices=["INFO", "DEBUG"],
            default="INFO",
            help="Set logging level [%(default)s].")

    if len(argv) < 2:
        parser.print_help()
        exit()

    return parser.parse_args()

def parse_mpa_table(mpa_tsv):
    """Read joined MetaPhlAn tables into a Pandas DataFrame.

    * Convert ranks from first column into hierarchical MultiIndex
    """
    with open(mpa_tsv) as f:
        for lineno, line in enumerate(f):
            if line.startswith("#"):
                continue
            elif line.startswith("clade_name"):
                skiprows = lineno
                dropcols = ["clade_name"]
                break
            elif not line.startswith("#"):
                logger.error(f"Don't know how to process table")
                exit(3)
    df = pd.read_csv(mpa_tsv, sep="\t", skiprows=skiprows)
    logger.debug(df.head())

    lineages = df[dropcols[0]].str.split("|", expand=True)
    levels_present = TAXLEVELS[:len(lineages.columns)]  # Some tables don't have strain or species assignments
    df[levels_present] = lineages\
        .rename(columns={key: level for key, level in zip(range(len(levels_present)), levels_present)})
    mpa_table = df.drop(columns=dropcols).set_index(levels_present)
    logger.debug(f"Parsed data dimensions: {mpa_table.shape}")
    logger.debug(mpa_table.sample(10))

    return mpa_table

def extract_specific_level(mpa_table, level):
    """Extract abundances for a specific taxonomic level."""
    level_pos = mpa_table.index.names.index(level)
    if level_pos+1 == len(mpa_table.index.names):
        level_only = ~mpa_table.index.get_level_values(level).isnull()
        mpa_level = mpa_table.loc[level_only]
    else:
        level_assigned = ~mpa_table.index.get_level_values(level).isnull()
        next_level_assigned = ~mpa_table.index.get_level_values(mpa_table.index.names[level_pos+1]).isnull()
        level_only = level_assigned & ~next_level_assigned  # AND NOT
        mpa_level = mpa_table.loc[level_only]
    ranks = mpa_table.index.names.copy()
    ranks.remove(level)
    mpa_level.index = mpa_level.index.droplevel(ranks)
    logger.debug(f"Table dimensions after extracting {level}-level only: {mpa_level.shape}")

    return mpa_level

def plot_clustermap(mpa_table, topN, pseudocount, colormap, method, metric):
    """Plot Seaborn clustermap."""
    top_taxa = mpa_table.median(axis=1).nlargest(topN)
    mpa_topN = mpa_table.loc[mpa_table.index.isin(top_taxa.index)]
    logger.debug(f"Table dimensions after extracting top {topN} taxa: {mpa_topN.shape}")

    if pseudocount < 0:
        pseudocount = mpa_topN.quantile(0.05).median() / 10
        if pseudocount < 1e-10:
            logger.warning(f"Automatically generated pseudocount is very low: {pseudocount}! "
                           "Setting pseudocount to 1e-10.")
            pseudocount = 1e-10
        logger.debug(f"Automatically generated pseudocount is: {pseudocount}")

    figwidth = mpa_topN.shape[1]
    figheight = 10+topN/5
    sns.set("notebook")
    clustergrid = sns.clustermap(
        mpa_topN.apply(lambda x: np.log10(x+pseudocount)),
        figsize=(figwidth, figheight),
        method=method,
        metric=metric,
        cmap=colormap,
        cbar_kws={"label": "$log_{10}$(abundance)"},
    )

    return clustergrid

def main(mpa_table, outfile_prefix, overwrite, level, topN, pseudocount, colormap, method, metric):
    mpa_table = parse_mpa_table(mpa_table)
    mpa_level = extract_specific_level(mpa_table, level)
    clustermap = plot_clustermap(mpa_level, topN, pseudocount, colormap, method, metric)

    outfile_png = Path(f"{outfile_prefix}.{level}_top{topN}.png")
    outfile_pdf = Path(f"{outfile_prefix}.{level}_top{topN}.pdf")
    if (outfile_png.exists() or outfile_pdf.exists()) and not overwrite:
        logger.error(f"Output file {outfile_png} or {outfile_pdf} already exists and --force is not set.")
        exit(2)
    clustermap.ax_heatmap.set_xticklabels(clustermap.ax_heatmap.get_xticklabels(), rotation=90)
    clustermap.savefig(outfile_png)
    clustermap.savefig(outfile_pdf)

if __name__ == "__main__":
    args = parse_args()

    logger = logging.getLogger(__name__)
    loglevels = {"INFO": logging.INFO, "DEBUG": logging.DEBUG}
    logging.basicConfig(format='%(levelname)s: %(message)s', level=loglevels[args.loglevel])

    main(
        args.mpa_table,
        args.outfile_prefix,
        args.force,
        args.level,
        args.topN,
        args.pseudocount,
        args.colormap,
        args.method,
        args.metric,
    )
__author__ = "CTMR, Fredrik Boulund"
__date__ = "2020"
__version__ = "0.1"

from sys import argv, exit
from pathlib import Path
import argparse

import matplotlib as mpl
mpl.use("agg")
mpl.rcParams.update({'figure.autolayout': True})
import matplotlib.pyplot as plt
import pandas as pd

def parse_args():
    desc = f"{__doc__} Copyright (c) {__author__} {__date__}. Version v{__version__}"
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("log_output", metavar="LOG", nargs="+",
            help="Kraken2 log output (txt).")
    parser.add_argument("-H", "--histogram", dest="histogram", metavar="FILE",
            default="histogram.pdf",
            help="Filename of output histogram plot [%(default)s].")
    parser.add_argument("-b", "--barplot", dest="barplot", metavar="FILE",
            default="barplot.pdf",
            help="Filename of output barplot [%(default)s].")
    parser.add_argument("-t", "--table", dest="table", metavar="FILE",
            default="proportions.tsv",
            help="Filename of histogram data in TSV format [%(default)s].")
    parser.add_argument("-u", "--unclassified", dest="unclassified", action="store_true",
            default=False,
            help="Plot proportion unclassified reads instead of classified reads [%(default)s].")

    if len(argv) < 2:
        parser.print_help()
        exit(1)

    return parser.parse_args()

def parse_kraken2_logs(logfiles, unclassified):
    search_string = "unclassified" if unclassified else " classified"
    for logfile in logfiles:
        with open(logfile) as f:
            sample_name = Path(logfile).stem.split(".")[0]
            for line in f:
                if search_string in line:
                    yield sample_name, float(line.split("(")[1].split(")")[0].strip("%"))

if __name__ == "__main__":
    options = parse_args()

    proportions = list(parse_kraken2_logs(options.log_output, options.unclassified))
    action = "unclassified" if options.unclassified else "classified"

    df = pd.DataFrame(proportions, columns=["Sample", "Proportion"]).set_index("Sample").rename(columns={"Proportion": f"% {action}"})
    print("Loaded {} proportions for {} samples.".format(df.shape[0], len(df.index.unique())))

    fig, ax = plt.subplots(figsize=(7, 5))
    df.plot(kind="hist", ax=ax, legend=None)
    ax.set_title(f"Proportion {action} reads")
    ax.set_xlabel(f"Proportion {action} reads")
    ax.set_ylabel("Frequency")
    fig.savefig(options.histogram, bbox_inches="tight")

    length_longest_sample_name = max([s for s in df.index.str.len()])
    fig2_width = max(5, length_longest_sample_name * 0.4)
    fig2_height = max(3, df.shape[0] * 0.25)
    fig2, ax2 = plt.subplots(figsize=(fig2_width, fig2_height))
    df.plot(kind="barh", ax=ax2, legend=None)
    ax2.set_title(f"Proportion {action} reads")
    ax2.set_xlabel(f"Proportion {action} reads")
    ax2.set_ylabel("Sample")
    fig2.savefig(options.barplot, bbox_inches="tight")

    df.to_csv(options.table, sep="\t")
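A hypothetical invocation of the Kraken2 proportion plotting script above (file names are placeholders) might be:

python plot_proportion_kraken2.py sample1.kraken2.log sample2.kraken2.log

By default this writes histogram.pdf, barplot.pdf, and proportions.tsv; adding -u/--unclassified plots the proportion of unclassified reads instead.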
__author__ = "Fredrik Boulund"
__date__ = "2018"
__version__ = "0.2.0"

from sys import argv, exit
from pathlib import Path
import argparse

import matplotlib as mpl
mpl.use("agg")
mpl.rcParams.update({'figure.autolayout': True})
import pandas as pd
import seaborn as sns

def parse_args():
    desc = f"{__doc__} Version {__version__}. Copyright (c) {__author__} {__date__}."
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("alltoall", metavar="alltoall",
            help="Output table from BBMap's comparesketch.sh in format=3.")
    parser.add_argument("-o", "--outfile", dest="outfile", metavar="FILE",
            default="all_vs_all.pdf",
            help="Filename of heatmap plot [%(default)s].")
    parser.add_argument("-c", "--clustered", dest="clustered", metavar="FILE",
            default="all_vs_all.clustered.pdf",
            help="Filename of clustered heatmap plot [%(default)s].")

    if len(argv) < 2:
        parser.print_help()
        exit(1)

    return parser.parse_args()

if __name__ == "__main__":
    options = parse_args()

    df = pd.read_table(
        options.alltoall,
        index_col=False)
    print("Loaded data for {} sample comparisons.".format(df.shape[0]))

    similarity_matrix = df.pivot(index="#Query", columns="Ref", values="ANI").fillna(100)
    corr = similarity_matrix.corr().fillna(0)

    g = sns.heatmap(corr, annot=True, fmt="2.1f", annot_kws={"fontsize": 2})
    g.set_title("Sample similarity")
    #g.set_xticklabels(g.get_xticklabels(), fontsize=4)  #WIP
    #g.set_yticklabels(g.get_yticklabels(), rotation=0, fontsize=4)  #WIP
    g.set_ylabel("")
    g.set_xlabel("")
    g.figure.savefig(str(Path(options.outfile)))

    g = sns.clustermap(corr, annot=True, fmt="2.1f", annot_kws={"fontsize": 2})
    g.fig.suptitle("Sample similarity (clustered)")
    g.savefig(str(Path(options.clustered)))
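A hypothetical invocation of the sketch comparison heatmap script above, given an all-vs-all table produced by BBMap's comparesketch.sh in format=3 (file names are placeholders), might be:

python plot_sketch_comparison_heatmap.py alltoall.txt -o all_vs_all.pdf -c all_vs_all.clustered.pdf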
__author__ = "CTMR, Fredrik Boulund"
__date__ = "2021-2023"
__version__ = "0.3"

from sys import argv, exit
from pathlib import Path
import json
import argparse

import matplotlib as mpl
mpl.use("agg")
mpl.rcParams.update({'figure.autolayout': True})
import matplotlib.pyplot as plt
import pandas as pd

def parse_args():
    desc = f"{__doc__} Copyright (c) {__author__} {__date__}. Version v{__version__}"
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("--fastp", metavar="sample.json", nargs="+",
            help="fastp JSON output file.")
    parser.add_argument("--kraken2", metavar="sample.kraken2.log", nargs="+",
            help="Kraken2 log output.")
    parser.add_argument("--bowtie2", metavar="sample.samtools.fastq.log", nargs="+",
            help="Bowtie2 samtools fastq log output.")
    parser.add_argument("-o", "--output-table", metavar="TSV",
            default="read_processing_summary.txt",
            help="Filename of output table in tsv format [%(default)s].")
    parser.add_argument("-p", "--output-plot", metavar="PDF",
            default="",
            help="Filename of output table in PDF format [%(default)s].")

    if len(argv) < 2:
        parser.print_help()
        exit(1)

    return parser.parse_args()

def parse_bowtie2_samtools_fastq_logs(logfiles):
    for logfile in logfiles:
        with open(logfile) as f:
            sample_name = Path(logfile).stem.split(".")[0]
            for line in f:
                if not line.startswith("[M::bam2fq_mainloop]"):
                    raise ValueError
                if "bam2fq_mainloop] processed" in line:
                    yield {
                        "Sample": sample_name,
                        "after_bowtie2_host_removal": int(int(line.split()[2])/2),  # /2 because bowtie2 counts both pairs
                    }

def parse_kraken2_logs(logfiles):
    for logfile in logfiles:
        with open(logfile) as f:
            sample_name = Path(logfile).stem.split(".")[0]
            for line in f:
                if " unclassified" in line:
                    yield {
                        "Sample": sample_name,
                        "after_kraken2_host_removal": int(line.strip().split()[0]),
                    }

def parse_fastp_logs(logfiles):
    for logfile in logfiles:
        sample_name = Path(logfile).stem.split(".")[0]
        with open(logfile) as f:
            fastp_data = json.load(f)
            yield {
                "Sample": sample_name,
                "before_fastp": int(fastp_data["summary"]["before_filtering"]["total_reads"]/2),  # /2 because fastp counts both pairs
                "after_fastp": int(fastp_data["summary"]["after_filtering"]["total_reads"]/2),
                "duplication": float(fastp_data["duplication"]["rate"]),
            }

if __name__ == "__main__":
    args = parse_args()

    dfs = {
        "fastp": pd.DataFrame(),
        "kraken2": pd.DataFrame(),
        "bowtie2": pd.DataFrame(),
    }

    if args.fastp:
        data_fastp = list(parse_fastp_logs(args.fastp))
        dfs["fastp"] = pd.DataFrame(data_fastp).set_index("Sample")
    if args.kraken2:
        data_kraken2 = list(parse_kraken2_logs(args.kraken2))
        dfs["kraken2"] = pd.DataFrame(data_kraken2).set_index("Sample")
    if args.bowtie2:
        data_bowtie2 = list(parse_bowtie2_samtools_fastq_logs(args.bowtie2))
        dfs["bowtie2"] = pd.DataFrame(data_bowtie2).set_index("Sample")

    df = pd.concat(dfs.values(), axis="columns")

    column_order = [
        "duplication",
        "before_fastp",
        "after_fastp",
        "after_kraken2_host_removal",
        "after_bowtie2_host_removal",
    ]
    final_columns = [c for c in column_order if c in df.columns]
    df = df[final_columns]

    df.to_csv(args.output_table, sep="\t")

    if args.output_plot:
        fig, ax = plt.subplots(figsize=(6, 5))
        df[final_columns[1:]]\
            .transpose()\
            .plot(kind="line", style=".-", ax=ax)
        ax.set_title("Reads passing through QC and host removal")
        ax.set_xlabel("Stage")
        ax.set_ylabel("Reads")
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, labels, loc="upper left", bbox_to_anchor=(0, -0.1))
        fig.savefig(args.output_plot, bbox_inches="tight")
The snippet above is from scripts/preprocessing_summary.py.
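A hypothetical invocation (file names are placeholders) that combines fastp, Kraken2 host removal, and Bowtie2 host removal logs might be:

python scripts/preprocessing_summary.py --fastp *.fastp.json --kraken2 *.kraken2.log --bowtie2 *.samtools.fastq.log -o read_processing_summary.txt -p read_processing_summary.pdf

All three log types are optional; only the columns corresponding to the logs that are supplied appear in the output table.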
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert n == 1 or n == 2, "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}")