Phanta
Rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes. The foundation of this workflow is a comprehensive, virus-inclusive database of genomes with integrated taxonomic information.
Workflow figure created on BioRender.com.
Citation
Pinto, Y., Chakraborty, M., Jain, N. et al. Phage-inclusive profiling of human gut microbiomes with Phanta. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01799-4 https://www.nature.com/articles/s41587-023-01799-4
Updates
- 18 Aug 2023 - Thanks to Phanta user @telatin, we have now added a runner script (run_phanta.py) to the root of this repository. This runner script may simplify Phanta execution, especially if you use Phanta regularly. For usage instructions, please see RUNNER.md. Please note that we have not tested the runner script extensively, and we recommend reading through the information below before trying to execute Phanta with it.
Table of contents
Quick Start
Installation
Step One - Clone the repository
Clone the repository to the desired location on your system using the following command:
git clone https://github.com/bhattlab/phanta.git
Step Two - Install conda, if not already installed
If you do not already have conda installed, please install it using the instructions provided here.
Step Three - Create and activate a new conda environment
Create a new conda environment via the following command, replacing /full/path/to/cloned/repo with the appropriate path to your cloned repository:
# If you use strict or flexible channel priority, first remove the setting
conda config --set channel_priority false
conda env create -n phanta_env --file /full/path/to/cloned/repo/phanta_env.yaml
Activate the environment by executing:
conda activate phanta_env
If you have trouble creating the environment using the above commands, you can alternatively follow the instructions here.
Step Four - Download Phanta's default database
Create a directory to store Phanta's default database of genomes. For example:
mkdir phanta_dbs
cd phanta_dbs
Then execute the following commands:
wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/database_V1.tar.gz
tar xvzf database_V1.tar.gz
The list of files that should be extracted is detailed here. Note that alternative databases are also available; please see our databases list.
Test Your Installation
To test that you are ready to run Phanta on your data, navigate to your cloned repository using cd and download the four .fq.gz files required for testing via the following commands:
wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/test_dataset.tar.gz
tar xvzf test_dataset.tar.gz
Then edit two files contained in the testing subdirectory of your cloned repository.

- Edit samp_file.txt by replacing /full/path/to/cloned/repo in the four locations indicated with the full path to your cloned repository.
- Edit config_test.yaml by replacing:
  - /full/path/to/cloned/repo in the three locations indicated with the full path to your cloned repository.
  - /full/path/to/downloaded/database in the one location indicated with the full path to the database you downloaded during the install.

Finally, execute the Snakemake command below, after replacing:

- /full/path/to/cloned/repo with the path to your cloned repository.
- The number provided to --max-threads with the maximum number of threads you have available. Note that if this number is greater than 16, you can (but don't need to) also increase the class_threads argument in config_test.yaml.
Note: At least 32GB of memory is required. Also, you may have to replace the --cores and --max-threads arguments with a profile for Snakemake execution, depending on your setup (e.g., replace with --profile slurm).
snakemake -s /full/path/to/cloned/repo/Snakefile \
--configfile /full/path/to/cloned/repo/testing/config_test.yaml \
--jobs 99 --cores 1 --max-threads 16
When execution has completed, please check that your test_dataset directory has an empty file called pipeline_completed.txt. You should also have two new subdirectories in test_dataset - classification and final_merged_outputs - which should have identical contents to the corresponding subdirectories in the testing subdirectory of your cloned repository. You will also have two additional files ending in *krak within test_dataset/classification/intermediate.
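If you would like to check the comparison programmatically rather than by eye, below is a minimal sketch (not part of Phanta) that uses Python's filecmp module to compare the newly generated merged outputs against the reference outputs shipped in the repository. The two paths are assumptions you should adjust to your setup.

import filecmp
import os

# Placeholder paths - adjust to your own setup.
generated = "test_dataset/final_merged_outputs"
reference = "/full/path/to/cloned/repo/testing/final_merged_outputs"

# Compare file contents (shallow=False forces a byte-by-byte comparison).
common = sorted(os.listdir(reference))
match, mismatch, errors = filecmp.cmpfiles(generated, reference, common, shallow=False)
print("Identical files:", match)
print("Differing files:", mismatch)
print("Missing/unreadable files:", errors)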
Basic Usage
For basic usage, replace the four paths at the top of the provided config.yaml file as appropriate. You do not need to make any additional changes except if: 1) you did not conduct 150bp sequencing, or 2) your read files are not gzipped. In those cases, you may also need to change the read_length or gzipped parameters, which are described under Advanced Usage.
After you have finished editing your config file, execute a Snakemake command similar to the one you used to Test Your Installation; example below:
snakemake -s /full/path/to/cloned/repo/Snakefile \
--configfile /full/path/to/cloned/repo/config.yaml \
--jobs 99 --cores 1 --max-threads 16
When the pipeline is done, you will have an empty file in your designated output directory called pipeline_completed.txt.
Main Outputs
The main outputs are merged tables that list the abundance of each taxon in each sample. Rows are taxa and columns are samples.

- final_merged_outputs/counts.txt: provides the number of read (pairs) assigned to each taxon.
- final_merged_outputs/relative_read_abundance.txt: same as counts.txt but normalized out of the total number of reads in each sample that were ultimately assigned to any taxon during abundance estimation.
- final_merged_outputs/relative_taxonomic_abundance.txt: similar to relative_read_abundance.txt but abundances are corrected for genome length. In addition, only species (and not higher taxonomic levels) are included in this report.

For examples of the above outputs, please see the testing/final_merged_outputs subdirectory.
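As an illustration only (not part of the pipeline), the merged tables can be loaded as plain tab-separated files; the column names below follow the merging scripts shown under Code Snippets, and the file path is an assumption.

import pandas as pd

# Placeholder path - point it at your own final_merged_outputs directory.
counts = pd.read_csv("final_merged_outputs/counts.txt", sep="\t")

# Rows are taxa; columns are "Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs",
# then one column of read (pair) counts per sample.
print(counts.shape)
print(counts.columns[:5].tolist())

# Example: total assigned counts per sample (numeric columns only).
print(counts.drop(columns=["Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"]).sum())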
Note: To filter counts.txt or relative_read_abundance.txt to a specific taxonomic level (e.g., species, genus), or to change relative_taxonomic_abundance.txt to a higher taxonomic level than species, please refer to Filtering Merged Tables to a Specific Taxonomic Level under Provided Postprocessing Scripts.
Advanced
Advanced Usage
This section contains a description of the additional parameters in the config file that were not mentioned under Basic Usage but can be altered if desired.
Parameters under Specifications for step one - classification of metagenomic reads
- confidence_threshold (default 0.1). This parameter can range from 0 to 1. Higher values yield more confident classifications but reduce sensitivity. Please see this link from the Kraken2 documentation for more details.
- gzipped (default True). This parameter should be True if your read files are gzipped, otherwise False.
- class_mem_mb (default 32768). This parameter specifies the memory in megabytes used for the classification step.
- class_threads (default 16). This parameter specifies the number of threads used for the classification step.
Parameters under Specifications for step two - filtering false positive species
- cov_thresh_viral (default 0.1). This parameter can range from 0 to 1 and specifies a genome coverage requirement for a viral species to be considered a "true positive" in a sample. For example, if this parameter is 0.1, at least one genome of the species must be at least 10% covered by sample reads for the species to be considered a true positive in that sample (see the sketch after this list).
  - Each genome's coverage is estimated by dividing the number of unique minimizers in the genome that are covered by sample reads by the total number of unique minimizers in the genome.
  - Terminology note - minimizers are very similar to k-mers; for a more detailed description of what they are, please see the Kraken2 paper.
- minimizer_thresh_viral (default 0). This parameter can take any value >= 0 and specifies an additional requirement for a viral species to be considered a true positive in a sample. For example, if this parameter is 10, at least one genome of the species must have 10+ of its unique minimizers covered by sample reads.
- cov_thresh_bacterial and minimizer_thresh_bacterial are the analogous parameters for filtering bacterial species.
- Sample sequencing depth may affect the choice of bacterial/viral coverage thresholds - please see our suggested coverage thresholds by sequencing depth, below. These suggestions were based on evaluating the sensitivity/specificity (TPR/FDR) tradeoff at various coverage thresholds and sequencing depths, as described in greater detail in our manuscript.
- As of release 1.1, upon user request we have also added coverage and minimizer parameters for archaeal and eukaryotic species (cov_thresh_arc, cov_thresh_euk, minimizer_thresh_arc, minimizer_thresh_euk).
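To make the coverage/minimizer logic concrete, here is a minimal, simplified sketch of how a species-level keep/discard decision could be computed from per-genome statistics. It mirrors the idea described above but is not the exact Phanta implementation (see filter_kraken_reports.py under Code Snippets for that).

def genome_coverage(covered_unique_minimizers: int, total_unique_minimizers: int) -> float:
    # Fraction of a genome's unique minimizers that are covered by sample reads.
    if total_unique_minimizers == 0:
        return 0.0
    return covered_unique_minimizers / total_unique_minimizers

def keep_species(genomes, cov_thresh: float, minimizer_thresh: int) -> bool:
    # Keep a species if its best genome coverage passes cov_thresh AND its best
    # covered-minimizer count passes minimizer_thresh.
    # `genomes` is a list of (covered_unique_minimizers, total_unique_minimizers)
    # tuples, one per genome of the species.
    best_cov = max(genome_coverage(c, t) for c, t in genomes)
    best_minimizers = max(c for c, _ in genomes)
    return best_cov >= cov_thresh and best_minimizers >= minimizer_thresh

# Example: a viral species with two genomes, default viral thresholds (0.1 and 0).
print(keep_species([(120, 1000), (5, 900)], cov_thresh=0.1, minimizer_thresh=0))  # True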
Parameters under Specifications for step three - per-species abundance estimation
- read_length (default 150). This parameter specifies the read length. Currently the default database is compatible with 75, 100, 120, or 150, and the alternative databases are compatible with 100 or 150. Additional read lengths can be requested here.
- filter_thresh (default 10). This parameter specifies one last false positive species filter - how many sample reads must have been classified to species X (in step one) for it to be considered truly present in the sample? This parameter is specific to the Bracken tool used for abundance estimation and is equivalent to the threshold parameter described in the original Bracken documentation. Note that this filter is uniform across all types of species (e.g., viral, bacterial).
Additional parameters
- database. Phanta is typically run with the default database linked above under Step Four of Installation. However, as described in our manuscript, an alternative version of Phanta's default database was also created, in which prophage sequences have been "masked" in prokaryotic genomes. The download link for this database is: http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/masked_db_v1.tar.gz. All available databases are described here.
- delete_intermediate (default True). Specify True if you would like intermediate outputs to be deleted, otherwise False. Intermediate outputs are per-sample outputs generated during the execution of Steps 1 and 2. Examples of these intermediate files can be found within the testing/classification/intermediate subdirectory of the cloned repository.
Additional Outputs
In addition to the merged tables provided in the final_merged_outputs subdirectory (see Main Outputs), the pipeline provides per-sample outputs in the classification subdirectory. Specifically:

- The files ending with .krak.report_bracken_species.filtered correspond to the Kraken-style report outputted by Bracken and specify the per-sample abundances that underlie the creation of the final merged tables counts.txt and relative_read_abundance.txt. Unlike in the merged tables, taxa that are not present in the sample are not included.
- The files ending with .krak.report.filtered.bracken.scaled essentially correspond to per-sample versions of final_merged_outputs/relative_taxonomic_abundance.txt. Specifically see the rel_taxon_abundance column. Unlike in the merged table, taxa that are not present in the sample are not included.
  - Note: additional normalizations beyond relative taxonomic abundance are also provided - e.g., reads_per_million_bases, reads_per_million_reads, reads_per_million_bases_per_million_reads (RPMPM), copies_per_million_reads.
There are two final outputs worth noting:
- samples_that_failed_bracken.txt in the classification subdirectory. This file contains the names of samples that did not ultimately have any reads directly assigned to the species level during step three of the workflow. Please note that for these samples, the .krak.report.filtered.bracken.scaled file will be empty and there may be additional associated files in the classification subdirectory that end with .temp.
- total_reads.tsv in the final_merged_outputs subdirectory. This file contains information about the total number of reads assigned to taxa in each sample, at various steps of the pipeline. Specifically:
  - Tot_Samp_Reads: total number of reads in the original sample
  - Unassigned_Step_One: number of reads that were not assigned a classification by Kraken2 during the classification step
  - Assigned_Step_Three: sum total of reads assigned to a taxon after filtering false positive species and estimating species-level abundances
  - Unassigned_Step_Three: difference between Tot_Samp_Reads and Assigned_Step_Three

Note that the normalization used to create final_merged_outputs/relative_read_abundance.txt from final_merged_outputs/counts.txt utilizes the Assigned_Step_Three column of total_reads.tsv.
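For illustration, a minimal sketch of that normalization is shown below (assuming the two files have already been generated; the paths, and whether this exactly reproduces Phanta's rounding, are assumptions).

import pandas as pd

# Placeholder paths - adjust to your output directory.
counts = pd.read_csv("final_merged_outputs/counts.txt", sep="\t")
totals = pd.read_csv("final_merged_outputs/total_reads.tsv", sep="\t")

lineage_cols = ["Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"]
relab = counts.copy()

# Divide each sample's counts by that sample's Assigned_Step_Three value.
assigned = dict(zip(totals["Samp_Name"], totals["Assigned_Step_Three"]))
for sample in relab.columns:
    if sample not in lineage_cols:
        relab[sample] = relab[sample] / assigned[sample]

print(relab.head())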
Provided Postprocessing Scripts
These scripts are provided within the post_pipeline_scripts subdirectory of the cloned repository.
Filtering Merged Tables to a Specific Taxonomic Level
There are two scripts provided for this purpose.
Script one
post_pipeline_scripts/reduce_merged_table_to_specific_rank.py is a Python script that can be utilized to filter final_merged_outputs/counts.txt or final_merged_outputs/relative_read_abundance.txt to a given taxonomic level.
The necessary command-line arguments to the script are, in order:
- Full path to the merged table, and
- The taxonomic level of interest (e.g., species, genus, superkingdom...)
An example command is:
python reduce_merged_table_to_specific_rank.py /full/path/to/counts.txt genus
The output of the above command would be a file called counts_genus.txt in the same directory as the original counts.txt.
Script Two
post_pipeline_scripts/sum_corrected_relab_to_higher_level.py is a Python script that can be used to sum the species-level, genome length-corrected relative abundances provided in final_merged_outputs/relative_taxonomic_abundance.txt up to a higher taxonomic level of interest (e.g., genus, superkingdom).
The necessary command-line arguments to the script are, in order:
- Full path to the downloaded database of genomes
- Full path to final_merged_outputs/relative_taxonomic_abundance.txt
- Full path to the desired output file, including the desired file name
- Taxonomic level of interest
An example command is:
python sum_corrected_relab_to_higher_level.py \
/full/path/to/downloaded/database \
/full/path/to/final_merged_outputs/relative_taxonomic_abundance.txt \
summed.txt genus
Filtering Merged Tables by Prevalence of Taxa
To filter any of Phanta's merged tables by the prevalence of taxa, you may use the post_pipeline_scripts/filter_by_prevalence.py script.
Example command:
python post_pipeline_scripts/filter_by_prevalence.py counts.txt counts_filtered.txt 0.2
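For orientation, here is a minimal sketch of what such a prevalence filter does. It assumes the final argument of the provided script (0.2 above) is the minimum fraction of samples in which a taxon must be detected; it is not the script itself.

import pandas as pd

def filter_by_prevalence(table: pd.DataFrame, min_prevalence: float) -> pd.DataFrame:
    # Keep taxa detected (count > 0) in at least `min_prevalence` of the samples.
    lineage_cols = ["Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"]
    sample_cols = [c for c in table.columns if c not in lineage_cols]
    prevalence = (table[sample_cols] > 0).mean(axis=1)
    return table[prevalence >= min_prevalence]

# Example usage (path is a placeholder):
counts = pd.read_csv("final_merged_outputs/counts.txt", sep="\t")
filtered = filter_by_prevalence(counts, 0.2)
filtered.to_csv("counts_filtered.txt", sep="\t", index=False)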
Modifying Merged Tables to OTU Format
Many tools require a classic "OTU" format for input tables - taxa as rows and samples as columns. To convert any of the merged output tables generated by Phanta to this classic format, you may use the post_pipeline_scripts/make_otu.py script.
Example command:
python post_pipeline_scripts/make_otu.py counts.txt counts.otu
Collapse Viral Abundances by Predicted Host
post_pipeline_scripts/collapse_viral_abundances_by_host.py is a script that collapses viral abundances in each sample by predicted host.
Expected arguments are:
- Merged abundance file generated by Phanta (e.g., final_merged_outputs/counts.txt)
- Host assignment file (provided with the default database at /full/path/to/downloaded/database/host_prediction_to_genus.tsv). Note: the host assignment file has two columns: 1) taxonomy ID of viral species, 2) predicted lineage of host genus, in the following format: d_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Enterobacterales;f_Enterobacteriaceae;g_Escherichia
- Full desired path to output file (including file name)
As of preprint posting, the script collapses the viral abundances in the input merged abundance file by predicted host genus. So, the output file is a table where predicted host genera are rows, columns are samples, and each cell provides the abundance of viruses with a particular predicted host genus in a particular sample.
Usage:
python collapse_viral_abundances_by_host.py <merged abundance table> <host assignment file> <path to outfile>
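For illustration only, a simplified sketch of this collapsing step is shown below (not the provided script itself). It assumes the input abundance table has already been reduced to species level, that the host assignment file has no header row, and that species taxonomy IDs can be pulled from the end of the Taxon_Lineage_with_IDs column.

import pandas as pd

# Placeholder paths - adjust to your own files.
abundance = pd.read_csv("final_merged_outputs/counts_species.txt", sep="\t")
hosts = pd.read_csv("host_prediction_to_genus.tsv", sep="\t", header=None,
                    names=["viral_species_taxid", "host_lineage"], dtype=str)

lineage_cols = ["Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"]
sample_cols = [c for c in abundance.columns if c not in lineage_cols]

# Pull the species taxid from the end of the lineage, e.g. "...|species_2546825" -> "2546825".
abundance["species_taxid"] = (abundance["Taxon_Lineage_with_IDs"]
                              .str.split("|").str[-1]
                              .str.replace("species_", "", regex=False))

# Keep the predicted host genus, e.g. "d_Bacteria;...;g_Escherichia" -> "g_Escherichia".
hosts["host_genus"] = hosts["host_lineage"].str.split(";").str[-1]

# Join viral species to their predicted host genus and sum abundances per genus and sample.
merged = abundance.merge(hosts, left_on="species_taxid", right_on="viral_species_taxid")
collapsed = merged.groupby("host_genus")[sample_cols].sum()
collapsed.to_csv("viral_abundance_by_host_genus.txt", sep="\t")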
Calculate Viral Lifestyle Statistics
post_pipeline_scripts/calculate_lifestyle_stats/lifestyle_stats.R is an R script that can be used to calculate overall statistics about the lifestyles of the viruses present in each sample (e.g., abundance ratio of temperate to virulent phages). Calculations are based on per-species lifestyle predictions that were made using the tool BACPHLIP, for the default database described in the preprint. These per-species predictions are provided with our database at /database/species_name_to_vir_score.txt.
The necessary command-line arguments to the R script are, in order:
- The BACPHLIP-predicted virulence score above which a virus should be considered 'virulent' (suggestion: 0.5)
- The full path to the species_name_to_vir_score.txt file
- The full path to a species-level version of final_merged_outputs/counts.txt (please see Filtering Merged Tables to a Specific Taxonomic Level)
- The full path to final_merged_outputs/relative_taxonomic_abundance.txt (or a species-level version of final_merged_outputs/relative_read_abundance.txt)
- The desired name for the output file (full path)
An example command is:
Rscript lifestyle_stats.R \
0.5 /full/path/to/downloaded/database/species_name_to_vir_score.txt \
final_merged_outputs/counts_species.txt \
final_merged_outputs/relative_taxonomic_abundance.txt \
lifestyle_stats.txt
An example output file, based on the testing dataset, is located in the same directory as the R script (post_pipeline_scripts/calculate_lifestyle_stats/example_lifestyle_stats.txt).
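As a rough illustration of the kind of statistic the R script computes, the sketch below classifies viral species as virulent or temperate using a virulence-score cutoff and computes a per-sample abundance ratio. The two-column layout of species_name_to_vir_score.txt (species name, BACPHLIP virulence score) and the species-name parsing are assumptions here, not the script's exact logic.

import pandas as pd

# Placeholder paths; vir-score file layout assumed to be two tab-separated columns.
vir_scores = pd.read_csv("species_name_to_vir_score.txt", sep="\t", header=None,
                         names=["species_name", "vir_score"])
counts = pd.read_csv("final_merged_outputs/counts_species.txt", sep="\t")

cutoff = 0.5  # suggested threshold from the section above
virulent_species = set(vir_scores.loc[vir_scores["vir_score"] > cutoff, "species_name"])

lineage_cols = ["Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"]
sample_cols = [c for c in counts.columns if c not in lineage_cols]

# Assumed: the species name sits at the end of the lineage string, prefixed with "species_".
species_name = (counts["Taxon_Lineage_with_Names"].str.split("|").str[-1]
                .str.replace("species_", "", regex=False))
has_prediction = species_name.isin(vir_scores["species_name"])
is_virulent = species_name.isin(virulent_species)

# Per-sample temperate:virulent abundance ratio, restricted to species with a prediction.
temperate = counts.loc[has_prediction & ~is_virulent, sample_cols].sum()
virulent = counts.loc[has_prediction & is_virulent, sample_cols].sum()
print(temperate / virulent)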
Calculation of Cross-Domain Correlations
This module calculates cross-domain abundance correlations. A later step of this module requires fastspar (https://github.com/scwatts/fastspar - please see below). Please note that fastspar may not be compatible with your phanta_env environment, so you may need to create a new environment for the step that requires it. Fastspar calculates all-vs-all correlations and can be memory- and disk-intensive.
We first recommend filtering your merged counts table to a specific rank (e.g., genus). Please see Filtering Merged Tables to a Specific Taxonomic Level.
Next, we recommend filtering by prevalence. Please see Filtering Merged Tables by Prevalence of Taxa. Example command for this use case:
python post_pipeline_scripts/filter_by_prevalence.py counts_genus.txt counts_genus_filtered.txt 0.2
To make the output compatible with fastspar, please use the script that modifies our merged output to a more standard OTU table - please see Modifying Merged Tables to OTU Format . Example command for this use case:
python post_pipeline_scripts/make_otu.py counts_genus_filtered.txt counts_genus_filtered.otu
The next step actually correlates abundances. This step requires fastspar (https://github.com/scwatts/fastspar) to be installed in your environment. If fastspar is not compatible with your phanta environment, you can create a new environment for this step.
bash post_pipeline_scripts/correl.sh counts_genus_filtered.otu <outdir>
This script calculates all-vs-all correlations, which can be found in <prefix>_correl.txt, and covariances, which can be found in <prefix>_cov.txt, within the specified output directory. P-values for the correlations can be found in <prefix>_pvalues.tsv within the same directory.
Optional positional arguments to the correlation script above:

- threads - number of threads (default: 10)
- permutations - number of permutations of the data used to estimate p-values (default: 100)
Now we can filter for correlations between viruses and bacteria, below a maximal p-value, using the following command:
python post_pipeline_scripts/filter_cross_domain.py <pref_correl.txt> <pref_pvalues.tsv> <maximal p-value> <cross_domain_correlations.txt>
Determine Which Detected Viruses are Likely Integrated Prophages
post_pipeline_scripts/integrated_prophage_detector.py is a Python script that can be used to identify which viruses detected in each sample are likely integrated prophages, based on detecting "junctions" between viruses and bacteria. Specifically, the script leverages paired-end information (when available) to find cases where one end (e.g., forward read) would be independently classified to a viral species, while the other end (e.g., reverse read) would be independently classified to a bacterial species.
To enable the usage of this script, please specify the single_end_krak parameter in the config file as True before running Phanta. This will create a directory that stores separate Kraken2 classifications for forward and reverse reads. This does not affect other Phanta processes.
If you've already run Phanta for a set of samples, don't worry! You can simply change the single_end_krak parameter in the original config to True and rerun the Snakemake command that you used to run Phanta.
Then you can run the post-processing script as follows:
python integrated_prophages_detector.py \
/full/path/to/sample_file/used/for/Phanta \
/full/path/to/Phanta/database \
/full/path/to/Phanta/classification/outdir \
<desired_output_directory>
The third argument is the full path to the classification subdirectory of the Phanta results.
The output file will be called integrated_prophages_detection_results.txt and will have one line per pair of viral and bacterial species that had "chimeric" reads detected. Example line:
sample2 4034362 1680 48
This line means that in sample2, there were 48 read pairs for which one end (e.g., forward read) would be independently classified to viral species 4034362, while the other end (e.g., reverse read) would be independently classified to bacterial species 1680. Thus, 4034362 is likely one of the viral species in sample2 that can be found as an integrated prophage, at least to some extent.
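For convenience, here is a minimal sketch of reading this output into a table. The column names are mine, inferred from the example line above, and the path is a placeholder.

import pandas as pd

results = pd.read_csv(
    "integrated_prophages_detection_results.txt",
    sep=r"\s+",
    header=None,
    names=["sample", "viral_species_taxid", "bacterial_species_taxid", "n_chimeric_read_pairs"],
)

# E.g., list the top viral-bacterial pairs by chimeric read-pair count, per sample.
print(results.sort_values("n_chimeric_read_pairs", ascending=False).groupby("sample").head(3))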
Troubleshooting
Environment Creation
If you have trouble creating the environment specified by phanta_env.yaml following the instructions above, you can alternatively install the required packages stepwise, into a fresh conda environment, using the instructions below.
Package list:
- bracken v2.7
- kraken2 v2.1.2
- pandas (pipeline developed with v1.4.3 but anything higher should also work)
- numpy (pipeline developed with v1.23.2 but anything higher should also work)
- r-base (pipeline developed with v4.1.3 but anything higher should also work)
- r-stringr (pipeline developed with v1.4.0 but anything higher should also work)
- r-dplyr (pipeline developed with v1.0.9 but anything higher should also work)
- snakemake (pipeline developed with v7.12.1 and we have heard of issues with v7.3+)
Example set of commands that should install all of the above (including the dependencies) into a fresh environment:
conda create -n phanta_env_alternate
conda activate phanta_env_alternate
conda install -c bioconda bracken=2.7
conda install pandas
conda install -c conda-forge r-base r-stringr r-dplyr
conda install -c bioconda snakemake=7.12
Stalled Snakemake Pipeline
Occasionally, Snakemake stalls when submitting SLURM jobs (e.g., see this link). If it appears that the Snakemake command is still running but new jobs have long stopped being submitted, please cancel all current jobs using the scancel command and post a GitHub issue. We can help you determine how to proceed to ensure that the pipeline completes both: 1) without error and 2) accurately. This is an easily fixable problem, but there is no universal solution; the exact steps will depend on when the pipeline stalled. The Snakemake log file (in the hidden .snakemake directory) will be of use.
Code Snippets
if ! command -v bc &> /dev/null; then
  echo "Error: 'bc' calculator is not installed."
  exit 1
fi

# first - calculate total reads in the sample - add total reads classified under root
# by Kraken2 and total reads unclassified by Kraken2
root=$(grep 'root' $2/$1.krak.report | head -n 1 | cut -f 2 | xargs)
unclassified_krak=$(grep 'unclassified' $2/$1.krak.report | head -n 1 | cut -f 2 | xargs)

if [[ -z $root ]] && [[ -z $unclassified_krak ]]
then
  echo "No reads in sample"
  exit 64
elif [[ -z $root ]]
then
  tot_reads=$unclassified_krak
elif [[ -z $unclassified_krak ]]
then
  tot_reads=$root
else
  tot_reads=$(($root+$unclassified_krak))
fi

# next - calculate total BRACKEN-CLASSIFIED reads in the sample
tot_reads_bracken=$(cut -f 6 $2/$1.krak.report.filtered.bracken | tail -n +2 | paste -sd+ | bc)

# how many reads were left out by Bracken?
unclassified_brack="$((tot_reads-tot_reads_bracken))"

echo -e "$1\t${tot_reads}\t${unclassified_krak}\t${tot_reads_bracken}\t${unclassified_brack}" > $3
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 | import sys import pandas as pd import numpy as np kraken_report, kraken_db, max_cov_bacteria, max_cov_virus, \ max_minimizers_bacteria, max_minimizers_virus, max_cov_arc, \ max_cov_euk, max_minimizers_arc, max_minimizers_euk = \ sys.argv[1], sys.argv[2], float(sys.argv[3]), float(sys.argv[4]), \ int(sys.argv[5]), int(sys.argv[6]), float(sys.argv[7]), float(sys.argv[8]), \ int(sys.argv[9]), int(sys.argv[10]) """ STEP TWO Define some functions that will be required for STEP THREE """ def make_dicts(nodes_file): with open(nodes_file, 'r') as infile: # make a child to parent dictionary # and a taxid to rank dictionary child_to_parent = {} taxid_to_rank = {} for line in infile: line=line.rstrip('\n').split('\t') child, parent, rank = line[0], line[2], line[4] child_to_parent[child] = parent taxid_to_rank[child] = rank return child_to_parent, taxid_to_rank def taxid_to_desired_rank(taxid, desired_rank, child_parent, taxid_rank): # look up the specific taxid, # build the lineage using the dictionaries # stop at the desired rank and return the taxid lineage = [[taxid, taxid_rank[taxid]]] if taxid_rank[taxid] == desired_rank: return taxid child, parent = taxid, None if child == '0': return 'unclassified' while not parent == '1': # print(child, parent) # look up child, add to lineage parent = child_parent[child] rank = taxid_rank[parent] lineage.append([parent, rank]) if rank == desired_rank: return parent child = parent # needed for recursion return 'error - taxid above desired rank, or not annotated at desired rank' """ STEP THREE Adapted from /labs/asbhatt/yishay/scripts/kraken_species_filter_low_rank.py Make a modified version of the Kraken report that: 1) Only includes lines representing species and strains 2) Has two extra columns: 1) coverage, 2) species level taxid """ # read in Kraken report, merge with inspect file to get info to calculate coverage kraken_file = pd.read_csv(kraken_report, sep="\t", names=['fraction','fragments', 'assigned','minimizers','uniqminimizers', 'classification_rank','ncbi_taxa','scientific name']) inspect_fname = kraken_db + '/inspect.out' inspect_file = pd.read_csv(inspect_fname, sep="\t", names=['frac','minimizers_clade', 'minimizers_taxa', 'rank','ncbi_taxonomy','sci_name']) kraken_file = kraken_file.merge(inspect_file,left_on='ncbi_taxa', right_on='ncbi_taxonomy') # calculate coverage kraken_file['cov'] = kraken_file['uniqminimizers']/kraken_file['minimizers_taxa'] # assign coverage as NA if there are fewer than 5 minimizers assigned to this taxon kraken_file.loc[kraken_file.minimizers_taxa < 5,'cov']=np.nan # filter kraken_file to 
species only species_kraken = kraken_file.copy()[kraken_file['classification_rank'].str.startswith('S', na=False)] # add a column for species level taxid - use functions defined above child_parent, taxid_rank = make_dicts(kraken_db + '/taxonomy/nodes.dmp') species_kraken['species_level_taxa'] = species_kraken.apply(lambda x: taxid_to_desired_rank(str(x['ncbi_taxonomy']), 'species', child_parent, taxid_rank), axis=1) species_kraken.to_csv(kraken_report + ".species", sep="\t", index=False) """ STEP FOUR Read in the modified Kraken report and make two dictionaries. 1) species_to_superkingdom: species taxid to superkingdom 2) species_genome_info: species taxid to list of: [taxid, cov, rank, unique_minimizers] for every taxon found by Kraken and falling under that species taxid """ with open(kraken_report + '.species', 'r') as infile: # figure out the column number for all the features of interest header=infile.readline().rstrip('\n').split('\t') unique_minimizers_col, cov_col, species_col, rank_col, taxid_col = \ None, None, None, None, None for i in range(len(header)): if header[i] == 'cov': cov_col = i elif header[i] == 'species_level_taxa': species_col = i elif header[i] == 'rank': rank_col = i elif header[i] == 'uniqminimizers': unique_minimizers_col = i elif header[i] == 'ncbi_taxa': taxid_col = i assert unique_minimizers_col != None assert cov_col != None assert species_col != None assert rank_col != None assert taxid_col != None # now go through report! # first initialize the dictionaries species_to_superkingdom = {} species_genome_info = {} i = 0 for line in infile: # skipped header already i += 1 line=line.rstrip('\n').split('\t') unique_minimizers, cov, species_taxid, rank, taxid = \ int(line[unique_minimizers_col]), line[cov_col], line[species_col], line[rank_col], \ line[taxid_col] if cov == '': cov = 0 else: cov = float(cov) # add to dictionaries if rank == 'S': # can add to the species_to_superkingdom dictionary # look up superkingdom superkingdom = taxid_to_desired_rank(species_taxid, 'superkingdom', child_parent, taxid_rank) species_to_superkingdom[species_taxid] = superkingdom # if i > 100 and i < 200: # print('superkingdom', i) # print(species_to_superkingdom[species_taxid]) # regardless, add to the species_genome_info dictionary if not species_taxid in species_genome_info: # initialize species_genome_info[species_taxid] = [[taxid, cov, rank, unique_minimizers]] else: species_genome_info[species_taxid].append([taxid, cov, rank, unique_minimizers]) # if i > 100 and i < 200: # print('other', i) # print(species_genome_info[species_taxid]) """ STEP FIVE Go through species_genome_info and figure out - using species_to_superkingdom as well - which species should be "kept" in the Kraken report. Output a file containing info about these decisions. Here we will use the parameters that were passed in: max_cov_bacteria, max_cov_virus, max_minimizers_bacteria, max_minimizers_virus Recall: if max_cov is passed, definitely keep. If not, keep if max_minimizers threshold is passed. Recall - we only care about the coverages/minimizer values for the GENOMES in each species - i.e., the entries in species_genome_info that don't have a 'lower rank' right afterwards. Note - will also make a dictionary that stores this info - taxid_to_lowest_rank. 
""" species_to_keep = set() taxid_to_lowest_rank = {} out_fname = kraken_report + '.filtering_decisions.txt' with open(out_fname, 'w') as outfile: # write out a header first header = ['species_taxid', 'superkingdom', 'max_cov', 'max_uniq_minimizers', 'kept'] outfile.write('\t'.join(header) + '\n') for species in species_genome_info: lines_to_consider = species_genome_info[species] relevant_coverages, relevant_unique_minimizer_vals = [], [] for i in range(len(lines_to_consider)): taxid_to_lowest_rank[lines_to_consider[i][0]] = 'False' if i == (len(lines_to_consider) - 1): # special case - last line in the Kraken report for this species taxid relevant_coverages.append(lines_to_consider[i][1]) relevant_unique_minimizer_vals.append(lines_to_consider[i][3]) # update taxid_to_lowest_rank taxid_to_lowest_rank[lines_to_consider[i][0]] = 'True' else: # not the last line in the Kraken report for this species taxid this_rank = lines_to_consider[i][2] next_rank = lines_to_consider[i+1][2] if this_rank == 'S': continue # since it's not the last line in the Kraken report, next_rank must be "lower" # sanity check assert len(next_rank) > len(this_rank) else: # we're dealing with S1, S2, etc. if int(this_rank[1:]) < int(next_rank[1:]): # e.g., S1 < S2 continue else: # e.g., S3 > S2 relevant_coverages.append(lines_to_consider[i][1]) relevant_unique_minimizer_vals.append(lines_to_consider[i][3]) # update taxid_to_lowest_rank taxid_to_lowest_rank[lines_to_consider[i][0]] = 'True' # now we can get the maximum coverage from the relevant coverages max_cov = max(relevant_coverages) # also get the maximum "# unique minimizers" max_minimizers = max(relevant_unique_minimizer_vals) # should we keep this species? # figure out based on superkingdom superkingdom = species_to_superkingdom[species] if superkingdom == '2': # bacteria if (max_cov >= max_cov_bacteria) and (max_minimizers >= max_minimizers_bacteria): species_to_keep.add(species) elif superkingdom == '2157': # archaea if (max_cov >= max_cov_arc) and (max_minimizers >= max_minimizers_arc): species_to_keep.add(species) elif superkingdom == '2759': # eukaryotes if (max_cov >= max_cov_euk) and (max_minimizers >= max_minimizers_euk): species_to_keep.add(species) elif superkingdom == '10239': # viruses if (max_cov >= max_cov_virus) and (max_minimizers >= max_minimizers_virus): species_to_keep.add(species) else: species_to_keep.add(species) if species in species_to_keep: keep = 'True' else: keep = 'False' outfile.write('\t'.join([species, superkingdom, str(max_cov), str(max_minimizers), keep]) + '\n') """ STEP SIX Go through the original Kraken report and make a new filtered version, where lines from species to filter out are removed. """ with open(kraken_report, 'r') as infile: with open(kraken_report + '.filtered', 'w') as outfile: for line in infile: line=line.rstrip('\n').split('\t') if line[5].startswith('S'): # look up the taxonomy id - what is the species level taxid? species_taxid = taxid_to_desired_rank(line[6], 'species', child_parent, taxid_rank) # do we care about this species? if species_taxid in species_to_keep: outfile.write('\t'.join(line) + '\n') else: outfile.write('\t'.join(line) + '\n') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | library(stringr) # need to read in arguments from command line args <- commandArgs(trailingOnly=TRUE) indir <- args[1] outdir <- args[2] failed <- args[3] # First merge counts tables counts_tables_list <- args[4] count_tables <- as.character(read.csv(counts_tables_list, header=FALSE)[,1]) failed_samples <- tryCatch( expr = { as.character(read.csv(failed, header=FALSE)[,1]) }, error = function(e){ return(c()) }) #print(count_tables) # figure out what column names we'll want for the merged data frame # find locations of '.' in the file names locations <- str_locate(count_tables, '\\.krak.report')[,1] sampnames <- substr(count_tables, 1, locations-1) desired_colnames <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sampnames) #print(desired_colnames) # initialize data frame sample <- desired_colnames[3] if (sample %in% failed_samples) { merged_table <- data.frame(matrix(ncol = 2, nrow = 0)) colnames(merged_table) <- desired_colnames[1:2] } else { merged_table <- read.csv(paste0(indir, '/', count_tables[1]), sep='\t', header=FALSE) colnames(merged_table) <- desired_colnames[1:3] } # loop through files and keep merging if (length(count_tables) > 1) { for (i in seq(2,length(count_tables))) { sample <- desired_colnames[i+2] if (sample %in% failed_samples) { # do nothing } else { f <- paste0(indir, '/', count_tables[i]) table_to_merge <- read.csv(f, sep='\t', header=FALSE) colnames(table_to_merge) <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sample) merged_table <- merge(merged_table, table_to_merge, by = c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"), all=TRUE) # replace NA with 0 merged_table[is.na(merged_table)] <- 0 }}} for (sample in failed_samples) { merged_table[[sample]] <- rep(NA, nrow(merged_table)) } # make the first two columns character vectors, not factors merged_table[,1] <- as.character(merged_table[,1]) merged_table[,2] <- as.character(merged_table[,2]) # now output outpath <- paste0(outdir, '/', 'counts.txt') write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t') # Next merge tables normalized out of total Bracken-classified reads norm_brack_tables_list <- args[5] norm_brack_tables <- as.character(read.csv(norm_brack_tables_list, header=FALSE)[,1]) # the samples should be in the same order - do a sanity check locations <- str_locate(norm_brack_tables, '\\.krak.report')[,1] sampnames <- substr(norm_brack_tables, 1, locations-1) desired_colnames_test <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sampnames) stopifnot(desired_colnames == desired_colnames_test) # initialize data frame sample <- desired_colnames[3] if (sample %in% failed_samples) { merged_table <- data.frame(matrix(ncol = 2, nrow = 0)) colnames(merged_table) <- desired_colnames[1:2] } else { merged_table <- read.csv(paste0(indir, '/', norm_brack_tables[1]), sep='\t', header=FALSE) colnames(merged_table) <- desired_colnames[1:3] } # loop through files and keep merging if (length(norm_brack_tables) > 1) { for (i in seq(2,length(norm_brack_tables))) { sample <- desired_colnames[i+2] if (sample %in% failed_samples) { } else { f <- paste0(indir, '/', norm_brack_tables[i]) table_to_merge <- read.csv(f, sep='\t', 
header=FALSE) colnames(table_to_merge) <- c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs", sample) merged_table <- merge(merged_table, table_to_merge, by = c("Taxon_Lineage_with_Names", "Taxon_Lineage_with_IDs"), all=TRUE) # replace NA with 0 merged_table[is.na(merged_table)] <- 0 }}} for (sample in failed_samples) { merged_table[[sample]] <- rep(NA, nrow(merged_table)) } # make the first two columns character vectors, not factors merged_table[,1] <- as.character(merged_table[,1]) merged_table[,2] <- as.character(merged_table[,2]) # now output outpath <- paste0(outdir, '/', 'relative_read_abundance.txt') write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | import sys db, temp_file, outdir = sys.argv[1], sys.argv[2], sys.argv[3] # make a few helpful dictionaries with open(db + '/taxonomy/nodes.dmp', 'r') as infile: # make a child to parent dictionary # and a taxid to rank dictionary child_to_parent = {} taxid_to_rank = {} for line in infile: line=line.rstrip('\n').split('\t') child, parent, rank = line[0], line[2], line[4] child_to_parent[child] = parent taxid_to_rank[child] = rank # also want a taxid_to_name dictionary # use the inspect.out file with open(db + '/inspect.out', 'r') as infile: taxid_to_name = {} for line in infile: line=line.rstrip('\n').split('\t') taxid, name = line[4], line[5].strip() taxid_to_name[taxid] = name def taxid_to_lineages(taxid): # look up the specific taxid, # build the lineages using the dictionaries # first deal with the special case of unclassiifed if taxid == '0': return 'unclassified' # not unclassified - proceed rank = taxid_to_rank[taxid] lineage_taxids = rank + '_' + taxid lineage_names = rank + '_' + taxid_to_name[taxid] child, parent = taxid, None while not parent == '1': # look up child, add to lineages parent = child_to_parent[child] parent_rank = taxid_to_rank[parent] # look up the name of the parent too parent_name = taxid_to_name[parent] lineage_taxids = parent_rank + '_' + parent + '|' + lineage_taxids lineage_names = parent_rank + '_' + parent_name + '|' + lineage_names child = parent # needed for recursion return lineage_names, lineage_taxids with open(temp_file, 'r') as infile: with open(outdir + '/relative_taxonomic_abundance.txt', 'w') as outfile: # first fix the header orig_header = infile.readline().rstrip('\n').split('\t') new_header = '\t'.join(['Taxon_Lineage_with_Names', 'Taxon_Lineage_with_IDs'] + orig_header[1:]) + '\n' outfile.write(new_header) for line in infile: line=line.rstrip('\n').split('\t') # use the function to figure out what we want to call this line taxid = line[0] lin_names, lin_ids = taxid_to_lineages(taxid) outlist = [lin_names, lin_ids] + line[1:] outfile.write('\t'.join(outlist) + '\n') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | library(stringr) # need to read in arguments from command line args <- commandArgs(trailingOnly=TRUE) outdir <- args[1] scaled_reports_list <- args[2] scaled_reports <- as.character(read.csv(scaled_reports_list, header=FALSE)[,1]) # figure out what column names we'll want for the merged data frame # find locations of '.' in the file names locations <- str_locate(scaled_reports, '\\.krak.report')[,1] sampnames <- substr(scaled_reports, 1, locations-1) desired_colnames <- c("TaxID", sampnames) # also figure out which samples failed failed <- args[3] failed_samples <- tryCatch( expr = { as.character(read.csv(failed, header=FALSE)[,1]) }, error = function(e){ return(c()) }) # initialize data frame sample <- desired_colnames[2] if (sample %in% failed_samples) { merged_table <- data.frame(matrix(ncol = 1, nrow = 0)) colnames(merged_table) <- desired_colnames[1] } else { merged_table <- read.csv(paste0(outdir, '/', scaled_reports[1]), sep='\t', header=TRUE) taxid_col <- which(colnames(merged_table) == 'taxonomy_id') rel_taxon_col <- which(colnames(merged_table) == 'rel_taxon_abundance') merged_table <- merged_table[,c(taxid_col, rel_taxon_col)] colnames(merged_table) <- desired_colnames[1:2] } # loop through files and keep merging if (length(scaled_reports) > 1) { for (i in seq(2,length(scaled_reports))) { sample <- desired_colnames[i+1] if (sample %in% failed_samples) { } else { f <- paste0(outdir, '/', scaled_reports[i]) table_to_merge <- read.csv(f, sep='\t', header=TRUE) taxid_col <- which(colnames(table_to_merge) == 'taxonomy_id') rel_taxon_col <- which(colnames(table_to_merge) == 'rel_taxon_abundance') table_to_merge <- table_to_merge[,c(taxid_col, rel_taxon_col)] colnames(table_to_merge) <- c("TaxID", sample) merged_table <- merge(merged_table, table_to_merge, by = c("TaxID"), all=TRUE) # replace NA with 0 merged_table[is.na(merged_table)] <- 0 }}} for (sample in failed_samples) { merged_table[[sample]] <- rep(NA, nrow(merged_table)) } # make the first column a character vector, not a factor merged_table[,1] <- as.character(merged_table[,1]) # now output outpath <- paste0(outdir, '/', 'relative_taxonomic_abundance_temp.txt') write.table(merged_table, outpath, quote=FALSE, row.names = FALSE, sep = '\t') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | import sys report, kraken_db, failed_file, sample = \ sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4] # make a few helpful dictionaries with open(kraken_db + '/taxonomy/nodes.dmp', 'r') as infile: # make a child to parent dictionary # and a taxid to rank dictionary child_to_parent = {} taxid_to_rank = {} for line in infile: line=line.rstrip('\n').split('\t') child, parent, rank = line[0], line[2], line[4] child_to_parent[child] = parent taxid_to_rank[child] = rank # also want a taxid_to_name dictionary # use the inspect.out file with open(kraken_db + '/inspect.out', 'r') as infile: taxid_to_name = {} for line in infile: line=line.rstrip('\n').split('\t') taxid, name = line[4], line[5].strip() taxid_to_name[taxid] = name # also make a set of the samples that failed Bracken with open(failed_file, 'r') as infile: failed_samples = set() for line in infile: failed_samples.add(line.rstrip('\n')) def taxid_to_lineages(taxid): # look up the specific taxid, # build the lineages using the dictionaries # first deal with the special case of unclassiifed if taxid == '0': return 'unclassified' # not unclassified - proceed rank = taxid_to_rank[taxid] lineage_taxids = rank + '_' + taxid lineage_names = rank + '_' + taxid_to_name[taxid] child, parent = taxid, None while not parent == '1': # look up child, add to lineages parent = child_to_parent[child] parent_rank = taxid_to_rank[parent] # look up the name of the parent too parent_name = taxid_to_name[parent] lineage_taxids = parent_rank + '_' + parent + '|' + lineage_taxids lineage_names = parent_rank + '_' + parent_name + '|' + lineage_names child = parent # needed for recursion return lineage_taxids, lineage_names with open(report, 'r') as infile: with open(report + '.to_merge', 'w') as outfile: if sample in failed_samples: pass else: for line in infile: line=line.rstrip('\n').split('\t') # use the function to figure out what we want to call this line taxid = line[4] lin_taxids, lin_names = taxid_to_lineages(taxid) num_reads = line[1] outfile.write('\t'.join([lin_names, lin_taxids, num_reads]) + '\n') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | import sys tot_reads_file, indir, failed = \ sys.argv[1], sys.argv[2], sys.argv[3] # make a dictionary from sample name to number of Bracken-classified reads with open(tot_reads_file, 'r') as infile: samp_name_to_reads = {} header=infile.readline() # skip for line in infile: line=line.rstrip('\n').split('\t') samp, brack_reads = line[0], line[3] samp_name_to_reads[samp] = brack_reads sample_names = samp_name_to_reads.keys() # make a set of failed samples with open(failed, 'r') as infile: failed_samples = set() for line in infile: failed_samples.add(line.rstrip('\n')) for samp in sample_names: # file to normalize counts_table = indir + '/' + samp + '.krak.report_bracken_species.filtered.to_merge' # file to produce outf = counts_table + '.norm_brack' with open(counts_table, 'r') as infile: with open(outf, 'w') as outfile: if samp in failed_samples: pass else: # look up bracken_classified_reads brack_classified_reads = int(samp_name_to_reads[samp]) for line in infile: line=line.rstrip('\n').split('\t') lin_names, lin_taxids, num_reads = line[0], line[1], float(line[2]) frac_reads_brack = str(round(num_reads/brack_classified_reads, 10)) outfile.write('\t'.join([lin_names, lin_taxids, frac_reads_brack]) + '\n') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | import sys kraken_filtered, threshold, kraken_db = \ sys.argv[1], int(sys.argv[2]), sys.argv[3] # make a dictionary that will be later useful nodes_file = kraken_db + '/taxonomy/nodes.dmp' def child_to_parent(nodes): with open(nodes, 'r') as infile: child_to_parent = {} for line in infile: line=line.rstrip('\n').split('\t') child, parent = line[0], line[2] child_to_parent[child] = parent return child_to_parent child_parent = child_to_parent(nodes_file) # parse filtered Kraken report, figure out whether Bracken will error with open(kraken_filtered, 'r') as infile: bracken_will_fail = True for line in infile: line = line.rstrip('\n').split('\t') rank, assigned = line[5], int(line[1]) if rank == 'S' and assigned >= threshold: bracken_will_fail = False break if bracken_will_fail: # output an empty Bracken species report with open(kraken_filtered + '.bracken.temp', 'w') as outfile: pass # output an empty Kraken-style Bracken report out_fname = kraken_filtered[:kraken_filtered.rfind('.')] + '_bracken_species.filtered.temp' with open(out_fname, 'w') as outfile: pass # output an empty scaled Bracken species report with open(kraken_filtered + '.bracken.scaled.temp', 'w') as outfile: pass |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import pandas as pd import sys def main(): ###step 1- input kraken_db = sys.argv[1] bracken_report = sys.argv[2] kraken_filtering_decisions = sys.argv[3] genome_length_path = kraken_db + "/library/species_genome_size.txt" read_length = int(sys.argv[4]) paired = sys.argv[5] if str(paired) == 'True': read_length = read_length*2 ##step two- reading files to DFs bracken_df = pd.read_csv(bracken_report, sep="\t") bracken_df['taxonomy_id'] = bracken_df['taxonomy_id'].astype('str') kraken_df = pd.read_csv(kraken_filtering_decisions, sep="\t") kraken_df['species_taxid'] = kraken_df['species_taxid'].astype('str') length_df = pd.read_csv(genome_length_path, sep="\t") ##step 3- calculating stats tmp_merged_df = pd.merge(bracken_df,length_df[['species_level_taxa','max','mean','median']], left_on='taxonomy_id', right_on='species_level_taxa') per_million_scaling = (tmp_merged_df['new_est_reads'].sum() / 1000000) # calculate how many millions of reads there are tmp_merged_df['fraction_total_reads'] = tmp_merged_df['new_est_reads']/tmp_merged_df['new_est_reads'].sum() tmp_merged_df['tmp_scaling'] = tmp_merged_df['fraction_total_reads']*(tmp_merged_df['median'].sum()/tmp_merged_df['median']) # scale fraction_total_reads to reflect difference in genome length - smaller genome should be given greater weight tmp_merged_df['rel_taxon_abundance'] = tmp_merged_df['tmp_scaling']/tmp_merged_df['tmp_scaling'].sum() # make sure that the scaling adds up to 1 tmp_merged_df['genetic_abundance'] = tmp_merged_df['fraction_total_reads'] tmp_merged_df['depth'] = (tmp_merged_df['new_est_reads']*read_length)/tmp_merged_df['median'] #which is estimation of copies tmp_merged_df['reads_per_million_bases'] = (tmp_merged_df['new_est_reads'] / (tmp_merged_df['median'] / 1000000)).round(4) # equivalent to RPK but with megabases instead of kilobases tmp_merged_df['reads_per_million_reads'] = (tmp_merged_df['new_est_reads'] / per_million_scaling).round(4) # reads per million total reads tmp_merged_df['RPMPM'] = tmp_merged_df['new_est_reads'] / ((tmp_merged_df['median'] / 1000000)* (per_million_scaling)) #reads per million base pairs per million reads tmp_merged_df['copies_per_million_reads'] = tmp_merged_df['depth'] / (per_million_scaling).round(4) tmp_merged_df = tmp_merged_df.merge(kraken_df, how='left', left_on='taxonomy_id', right_on='species_taxid') ## step 4 - output tmp_merged_df.rename(columns={'median': 'median_genome_length', 'max_cov':'breadth'}, inplace=True) processed_report = tmp_merged_df.drop(['tmp_scaling', 'species_level_taxa', 'max','mean','species_taxid','superkingdom','max_uniq_minimizers','kept'], axis=1) processed_report.to_csv(bracken_report+'.scaled', sep="\t", index=False) if __name__ == "__main__": main() |
shell: """
    kraken2 --db {params.db} --threads {threads} --output {output.krak} \
    --report {output.krak_report} --report-minimizer-data {params.paired_string} \
    {params.gzipped_string} {input.reads} --confidence {params.confidence_threshold}
    """
shell: """
    python {params.repo_dir}/pipeline_scripts/filter_kraken_reports.py {input.krak_report} {params.db} \
    {params.cov_thresh_bacterial} {params.cov_thresh_viral} {params.minimizer_thresh_bacterial} \
    {params.minimizer_thresh_viral} \
    {params.cov_thresh_arc} {params.cov_thresh_euk} {params.minimizer_thresh_arc} \
    {params.minimizer_thresh_euk}
    """
shell: """
    python {params.repo_dir}/pipeline_scripts/process_filtered_kraken.py {input.krak_report_filtered} \
    {params.threshold} {params.db}
    touch {output.completed}
    """
shell: """
    kraken2 --db {params.db} --threads {threads} --output {output.fwd_krak} \
    {params.gzipped_string} {input.fwd_reads} --confidence {params.confidence_threshold}
    kraken2 --db {params.db} --threads {threads} --output {output.rev_krak} \
    {params.gzipped_string} {input.rev_reads} --confidence {params.confidence_threshold}
    """
shell: """
    # protection against Bracken error
    [ -f {params.possible_1} ] && \
    ( cp {params.possible_1} {output.brack_report_1}; cp {params.possible_2} {output.brack_report_2} ) \
    || bracken -d {params.db} -i {input.krak_report} \
    -o {output.brack_report_1} -r {params.readlen} \
    -l 'S' -t {params.threshold}
    """
shell: """
    count=`find {params.classdir} -maxdepth 1 -name "*.krak.report.filtered.bracken.scaled.temp" | wc -l`
    if [[ $count != 0 ]]
    then
    ls {params.classdir}/*.krak.report.filtered.bracken.scaled.temp | rev | \
    cut -d'/' -f 1 | rev | cut -d'.' -f 1 > {output.failed}
    else
    touch {output.failed}
    fi
    """
shell: """
    python {params.repo_dir}/pipeline_scripts/prep_to_merge_counts.py \
    {input.brack_report} {params.db} {input.failed_file} {wildcards.samp}
    """
shell: """
    bash {params.repo_dir}/pipeline_scripts/calc_total_reads.sh {wildcards.samp} \
    {params.krak_brack_dir} {output.tot_reads_file}
    """
shell: """
    echo -e 'Samp_Name\tTot_Samp_Reads\tUnassigned_Step_One\tAssigned_Step_Three\tUnassigned_Step_Three' > {output.outf}
    cat {params.classdir}/*total_reads.txt >> {output.outf}
    """
shell: """
    python {params.repo_dir}/pipeline_scripts/prep_to_merge_normed.py \
    {input.tot_reads_file} {params.classdir} {input.failed_file}
    """
shell: """
    ls {params.classdir}/*to_merge | rev | cut -d'/' -f 1 | rev > {params.classdir}/counts_tables.txt
    ls {params.classdir}/*norm_brack | rev | cut -d'/' -f 1 | rev > {params.classdir}/norm_brack_tables.txt
    Rscript {params.repo_dir}/pipeline_scripts/merging_bracken_tables.R {params.classdir} {params.finaldir} \
    {input.failed_file} {output.list1} {output.list2}
    """
shell: """
    [ -f {params.possible} ] && \
    cp {params.possible} {output.scaled_bracken} || \
    python {params.repo_dir}/pipeline_scripts/scale_bracken.py {params.db} {input.bracken_report} \
    {input.filtering_decisions} {params.readlen} {params.paired}
    """
shell: """
    ls {params.classdir}/*scaled | rev | cut -d'/' -f 1 | rev > {params.classdir}/scaled_reports.txt
    Rscript {params.repo_dir}/pipeline_scripts/merging_scaled_bracken.R {params.classdir} {output.list} {input.failed_file}
    python {params.repo_dir}/pipeline_scripts/merging_scaled_bracken.py {params.db} {output.merged_temp} {params.finaldir}
    """
shell: """
    mkdir {params.intdir}
    mv {params.classdir}/*krak {params.intdir}
    mv {params.classdir}/*krak.report {params.intdir}
    mv {params.classdir}/*krak.report.filtered {params.intdir}
    mv {params.classdir}/*krak.report.filtered.bracken {params.intdir}
    mv {params.classdir}/*krak.report.filtering_decisions.txt {params.intdir}
    if [[ {params.delete} == 'True' ]]; then
    rm -r {params.intdir}
    fi
    touch {output.completed}
    """
Support
- Future updates