Snakemake pipeline calculating KEGG orthologue abundance in metagenomic sequence data.
Documentation
KOunt is a Snakemake pipeline that calculates the abundance of KEGG orthologues (KOs) in metagenomic sequence data. KOunt takes raw paired-end reads, quality trims them, assembles them, predicts proteins and annotates the proteins with KofamScan. The reads are mapped back to the assembly and protein coverage is calculated. Users have the option of calculating the coverage evenness of the proteins and filtering the KofamScan results to remove unevenly covered proteins. The proteins annotated by KofamScan are clustered at 100%, 90% and 50% identity within each KO to quantify their diversity; because evenness filtering reduces the number of these proteins, we don't recommend the evenness option if you are interested in the clustering results. All predicted proteins that don't have a KO hit, or that are excluded by evenness filtering, are called 'NoHit'. The NoHit proteins are blasted against a custom KO-annotated UniProt database and their nucleotide sequences against a custom RNA database. Reads mapping to NoHit proteins that remain unannotated, along with unmapped reads, are blasted against the KOunt databases, and RNA is quantified in the remaining reads.
Workflow

Installation
Dependencies
Source
Download the latest version of the Snakefile, scripts and conda env files.
git clone https://github.com/WatsonLab/KOunt
cd KOunt/
Check that the scripts are executable; if not, run:
chmod +x scripts/*sh
Prepare the reference databases
Download the KOunt UniProt and RNA databases.
wget https://figshare.com/ndownloader/files/37711530
mv 37711530 KOunt_databases.tar
tar -xzvf KOunt_databases.tar
gunzip KOunt_databases_v1/*
rm KOunt_databases.tar
If you wish to update these databases, further information on how they were created is available here.
Install Snakemake
conda create -n snakemake_mamba -c conda-forge -c bioconda mamba
conda activate snakemake_mamba
mamba install -c bioconda snakemake
Install the conda environments
conda activate snakemake_mamba
snakemake --use-conda --conda-create-envs-only --cores 1
Test installation
Download the test fastqs, which are subsampled from ERR2027889. Leave the raw reads location in the config at its default, perform a dry-run and then run the pipeline. With 8 cores it should take approximately 20 minutes.
wget https://figshare.com/ndownloader/files/39545968
mv 39545968 test_fastqs.tar
tar -xvf test_fastqs.tar
rm test_fastqs.tar
snakemake -k --ri --use-conda -n
snakemake -k --ri --use-conda --cores 8
Running KOunt
Amend the options config file, config.yaml, with your fastq file locations and extensions. KOunt expects the raw reads of each sample to be in a directory with the same sample name, e.g. ERR2027889/ERR2027889_R1.fastq.gz. It runs the pipeline on all the samples in the directory you specify in the config file.
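As a minimal sketch of the relevant config.yaml entries (the option names are those listed under Options below; the values shown are just the defaults):

raw_reads: "reads/"        # directory containing one sub-directory per sample
r1_ext: "_R1.fastq.gz"     # extension of read one
r2_ext: "_R2.fastq.gz"     # extension of read two
outdir: "out/"             # output directory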
To use the default rule all in the Snakefile, specify the number of cores you have available and run the entire pipeline with:
snakemake -k --ri --use-conda --cores 8
If you wish to only run part of the pipeline you can specify another rule all.
To perform all steps but the protein clustering use:
snakemake -k --ri --use-conda all_without_clustering --cores 8
To perform all steps but protein clustering and read/protein annotation with the KOunt reference databases:
snakemake -k --ri --use-conda all_without_reference --cores 8
To perform all steps but protein clustering and RNA abundance quantification:
snakemake -k --ri --use-conda all_without_RNA --cores 8
Estimated run times and memory usage
The average run time and maximum memory used by each of the rules on the 10 samples from the KOunt manuscript are available here.
Options
The following options can be amended in the config.yaml file:
- raw_reads: the path to the directory containing all the raw reads (default: "reads/")
- r1_ext: the extension of read one (default: "_R1.fastq.gz")
- r2_ext: the extension of read two (default: "_R2.fastq.gz")
- diamond_db: the path to the KOunt DIAMOND database (default: "KOunt_databases/KO_DI_1.0.dmnd")
- mmseq_db: the path to the KOunt MMseqs2 database (default: "KOunt_databases/KO_RNA_1.0.mmseq")
- combined_bdg: the path to the KOunt database bedGraph file (default: "KOunt_databases/KO_RNA_DI_1.0.bedgraph")
- kallisto: the path to the KOunt kallisto reference (default: "KOunt_databases/KO_RNA_kallisto_1.0")
- outdir: the path to the output directory (default: "out/")
The read ids in the trimmed reads are shortened to the first space, and /1 or /2 is added to the end if not already present (an example of this rewriting is shown after the options below). By default the read ids are compared to ensure all ids are unique, but this check can be disabled if you're sure they will be unique:
- checking_fqs: run fastq_utils to check that the read ids are unique and the fastqs are valid (default: ""). Change to "#" if not checking
- not_checking_fqs: don't run fastq_utils on the reads (default: "#"). Change to "" if not checking
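As an illustration, the id rewriting uses the perl/sed commands from rule trim (see Code Snippets below); the read header here is made up:

echo '@ERR2027889.1 1:N:0:ACGT' | perl -pe 's/ .*// if /^@/' | sed '/\/1/! s/^@.*/&\/1/'
# prints @ERR2027889.1/1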
The abundance of proteins that KofamScan annotates with multiple KOs can either be split between the KOs or summed, together with all other proteins with multiple hits, into 'Multiples'. For example, a protein with abundance 10 and hits to two KOs would contribute 5 to each KO when split, or its full 10 to 'Multiples' when grouped.
- splitting_multiples: split proteins between their KO hits (default: ""). Change to "#" if grouping multiples
- grouping_multiples: group proteins that have multiple KO hits (default: "#"). Change to "" if grouping multiples
Raw read trimming (rule trim)
- r1ad: the adapter sequence for read one (default: "AGATCGGAAGAGC")
- r2ad: the adapter sequence for read two (default: "AGATCGGAAGAGC")
- polyg: use -g to enable trimming of polyG tails (default: "")
- qual: use -Q to disable quality filtering (default: "")
- minlen: the minimum required read length (default: "50")
- overlap: the minimum required length of an overlap of PE reads (default: "5")
- trim_threads: the number of threads (default: "4")
Assembly (rule megahit)
- mega_mem: the maximum memory megahit can use (default: "4.8e+10")
- mega_threads: the number of threads (default: "8")
- mega_kmers: the kmer sizes (default: "27,37,47,57,67,77,87")
- mega_len: the minimum required contig length (default: "300")
Mapping (rule bwa)
- bwa_threads: the number of threads (default: "4")
Coverage (rule coverage)
- cov_split: the number of chunks to split the BAM file into; increasing the number of chunks decreases the memory required but increases the run time (default: "10")
- evenness_yes: if you want to filter the KofamScan results by coverage evenness, leave this option empty (default: "")
- evenness_no: if you don't want to filter the KofamScan results by coverage evenness, leave this option empty; must be the opposite of evenness_yes (default: "#")
KEGG database download (rule kegg_db)
- db: the path to the kofamscan database; amend if you have an alternate version you wish to use, otherwise the default will download the current database version (default: "out/Kofamscan/kofam/")
Kofamscan (rule kofamscan)
- kofamscan_threads: the number of threads (default: "4")
Kofamscan results (rule kofamscan_results)
- evenness_pid: the evenness threshold used to filter the proteins; proteins with evenness at or below this value are removed (default: "0.95")
CD-HIT (rule cdhit)
- cdhit_mem: the maximum memory CD-HIT can use (default: "32000")
- cdhit_threads: the number of threads (default: "8")
MMseqs2 KOs (rule mmseq_keggs)
- mmseq_keggs_threads: the number of threads (default: "8")
MMseqs2 NoHit (rule mmseq_nohit)
- mmseq_nohit_threads: the number of threads (default: "8")
Diamond (rule diamond_search)
- dia_threads: the number of threads (default: "8")
- min_qc: the minimum percentage of the length of the reference hit that the protein has to be (default: "90")
- max_qc: the maximum percentage of the length of the reference hit that the protein has to be (default: "110")
- min_pid: the minimum percentage identity required to be classed as a hit (default: "80")
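These thresholds are applied with the awk/gawk filter shown in the diamond_search snippet under Code Snippets, where column 3 is percentage identity and columns 4 and 5 are the query and reference lengths. For example, a 460 aa protein hitting a 500 aa reference at 85% identity is 92% of the reference length and passes all three filters. A simplified sketch with the default values filled in (hits.tsv is a placeholder file name):

# keep hits where 100*qlen/slen lies between min_qc and max_qc and identity exceeds min_pid
awk -F '\t' '{ print $0 "\t" ($4/$5) * 100 }' hits.tsv | gawk -F '\t' '$NF>90' | gawk -F '\t' '$NF<110' | gawk -F '\t' '$3>80' > hits_filtered.tsv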
Barrnap (rule barrnap)
- barrnap: the number of threads (default: "4")
Annotate NoHit reads (rule nohit_annotate_reads)
- nohit: the number of threads (default: "8")
Kallisto (rule kallisto)
- kallisto_threads: the number of threads (default: "8")
Unmapped read annotation (rule unmapped_reads)
- unmapped_threads: the number of threads (default: "8")
Output
Default
- Results/KOunts_Kofamscan.csv: KO abundance in each sample, calculated by Kofamscan, without read mapping
- Results/All_KOunts_nohit_unmapped_default.csv: final KO abundance in each sample
- Results/Number_of_clusters.csv: the number of clusters of proteins at 90% and 50% sequence identity in each KO, the number of clusters that contain multiple proteins and the number of singleton clusters
Without clustering
- Results/KOunts_Kofamscan.csv: KO abundance in each sample, calculated by Kofamscan, without read mapping
- Results/All_KOunts_nohit_unmapped_no_clustering.csv: final KO abundance in each sample
Without reference databases
- Results/All_KOunts_without_reference.csv: final KO abundance in each sample, calculated by Kofamscan, without read mapping
Without RNA
- Results/KOunts_Kofamscan_without_clustering.csv: KO abundance in each sample, calculated by Kofamscan, without read mapping
- Results/All_KOunts_without_RNA.csv: final KO abundance in each sample
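The result files are comma-separated matrices; assuming KOs as rows and samples as columns, with a header row, a quick way to list the most abundant KOs in the first sample is:

tail -n +2 Results/All_KOunts_nohit_unmapped_default.csv | sort -t, -k2,2gr | head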
Code Snippets
shell:
    '''
    # quality trim with fastp, then shorten read ids to the first space and append /1 or /2
    mkdir -p {params.trim}
    mkdir -p {params.trim_id}
    fastp -i {input.r1} -o {params.r1} -I {input.r2} -O {params.r2} --adapter_sequence {params.r1_adapter} --adapter_sequence_r2 {params.r2_adapter} {params.quality} -l {params.min} -w {params.threads} --overlap_len_require {params.over} {params.polyG} --json /dev/null --html /dev/null
    zcat {params.r1} | perl -pe 's/ .*// if /^@/' | sed '/\/1/! s/^@.*/&\/1/' | gzip > {params.tmp_r1}
    zcat {params.r2} | perl -pe 's/ .*// if /^@/' | sed '/\/2/! s/^@.*/&\/2/' | gzip > {params.tmp_r2}
    mv {params.tmp_r1} {params.r1}
    mv {params.tmp_r2} {params.r2}
    {params.checking_fqs} if fastq_info {params.r1} {params.r2}; then touch {output}; else echo "read names not unique or fastqs invalid"; fi #include if checking read ids
    {params.not_checking_fqs} touch {output} #include if not checking read ids
    '''
shell:
    '''
    mkdir -p {params.mega}
    megahit -m {params.memory} --kmin-1pass --k-list {params.kmers} --min-contig-len {params.len} -t {params.threads} -1 {params.r1} -2 {params.r2} -o {params.outdir}
    mv {params.outdir}final.contigs.fa {output.final}
    ./scripts/add_sample_name.sh {output.final} {wildcards.id}
    rm -r {params.outdir}
    '''
shell:
    '''
    # predict proteins with prodigal and keep only complete (partial=00) genes
    mkdir -p {params.output}
    prodigal -i {input} -p meta -o {params.output}{wildcards.id}.gff -d {params.output}{wildcards.id}.fa -a {params.output}{wildcards.id}.faa -f gff
    awk '/^>/ {{P=index($0,"partial=00")}} {{if(P) print}} ' {params.output}{wildcards.id}.fa > {output.filtered_nuc}
    awk '/^>/ {{P=index($0,"partial=00")}} {{if(P) print}} ' {params.output}{wildcards.id}.faa > {output.filtered_aa}
    sed -i 's/ .*$//' {output.filtered_nuc}
    sed -i 's/ .*$//' {output.filtered_aa}
    awk '!/^>/ {{ printf "%s", $0; n="\\n" }} /^>/ {{ print n $0; n="" }} END {{ printf "%s", n }}' {output.filtered_nuc} > {output.filtered_nuc}_tmp && mv {output.filtered_nuc}_tmp {output.filtered_nuc}
    awk '!/^>/ {{ printf "%s", $0; n="\\n" }} /^>/ {{ print n $0; n="" }} END {{ printf "%s", n }}' {output.filtered_aa} > {output.filtered_aa}_tmp && mv {output.filtered_aa}_tmp {output.filtered_aa}
    awk '{{P=index($0,"partial=00")}} {{if(P) print}}' {params.output}{wildcards.id}.gff | sed 's/ID=[^_]*_/ID=/g' > {output.filtered_gff}
    rm -r {params.output}
    '''
shell:
    '''
    bwa index {input.fasta}
    bwa mem -t {params.threads} {input.fasta} {params.r1} {params.r2} | samtools view -bS - > {params.bam}
    rm {input.fasta}.*
    touch {output.touchfile}
    '''
shell:
    '''
    wget -P {params.bamdeal_location} {params.download}
    tar -C {params.bamdeal_location} -zxvf {params.bamdeal_location}v0.24.tar.gz
    chmod 755 {output.bamdeal}
    rm {params.bamdeal_location}v0.24.tar.gz
    '''
shell:
    '''
    mkdir -p {params.wd}
    mkdir -p {params.out}
    awk '{{print $1}}' {input.filtered_gff} | sort | uniq > {params.wd}_contigs
    split -l {params.split} {params.wd}_contigs {params.wd}
    for file in {params.wd}??; do awk -v file=$file '{{print $1 "\t" file}}' $file | awk -F'/|\t' '{{print $1 "\t" $NF}}'; done > {params.wd}_bedtools_list
    {input.bamdeal} modify bamAssign -i {params.bam} -a {params.wd}_bedtools_list -q 0 -o {params.wd}
    for file in {params.wd}??; do ./scripts/bedtools.sh $file {params.gff} {params.wd}; done > {output.coverage}
    {params.evenness_yes} for file in {params.wd}??; do ./scripts/evenness.sh $file {params.wd}; done
    {params.evenness_yes} cat {params.wd}??_evenness > {params.evenness}
    {params.evenness_yes} rm {params.wd}??_evenness
    bedtools bamtobed -i {params.bam} > {params.bam}.bed
    rm {params.bam}
    LC_ALL=C sort -k1,1 -k2,2n {params.bam}.bed > {params.bam}.bed_sorted
    LC_ALL=C sort -k1,1 -k4,4n {input.filtered_gff} > {input.filtered_gff}_sorted
    bedtools intersect -a {input.filtered_gff}_sorted -b {params.bam}.bed_sorted -wb -sorted | awk '{{print $13}}' | sort -S 50% | uniq > {params.mapped}
    rm {input.filtered_gff}_sorted
    rm {params.bam}.bed_sorted
    rm -r {params.wd}
    rm {params.wd}_contigs
    rm {params.wd}_bedtools_list
    rm {params.wd}*coverage
    rm {params.wd}*gff
    rm {params.wd}??
    '''
shell:
    '''
    mkdir -p {params.wd}
    mkdir -p {params.out}
    awk '{{print $1}}' {input.filtered_gff} | sort | uniq > {params.wd}_contigs
    split -l {params.split} {params.wd}_contigs {params.wd}
    for file in {params.wd}??; do awk -v file=$file '{{print $1 "\t" file}}' $file | awk -F'/|\t' '{{print $1 "\t" $NF}}'; done > {params.wd}_bedtools_list
    {input.bamdeal} modify bamAssign -i {params.bam} -a {params.wd}_bedtools_list -q 0 -o {params.wd}
    for file in {params.wd}??; do ./scripts/bedtools.sh $file {params.gff} {params.wd}; done > {output.coverage}
    {params.evenness_yes} for file in {params.wd}??; do ./scripts/evenness.sh $file {params.wd}; done
    {params.evenness_yes} cat {params.wd}??_evenness > {params.evenness}
    {params.evenness_yes} rm {params.wd}??_evenness
    rm {params.bam}
    rm -r {params.wd}
    rm {params.wd}_contigs
    rm {params.wd}_bedtools_list
    rm {params.wd}*coverage
    rm {params.wd}*gff
    rm {params.wd}??
    '''
shell:
    '''
    touch {output.touch}
    wget -P {params.location} {params.profiles}
    wget -P {params.location} {params.ko_list}
    tar -C {params.location} -xf {params.location}/profiles.tar.gz
    gunzip {params.location}/ko_list.gz
    rm {params.location}/profiles.tar.gz
    '''
shell:
    '''
    exec_annotation --cpu {params.threads} -p {params.profiles} -k {params.list} -o {output.ko} --tmp-dir {params.tmp} {input.filtered_aa}
    rm -r {params.tmp}
    '''
shell:
    '''
    grep '^\*' {input.ko} | awk '{{print $2 "\t" $3}}' > {output.out}
    {params.splitting_multiples}{params.evenness_no}./scripts/splitting_multiples.sh {input.coverage} {output.out} {output.out} #include if splitting multiples and not using evenness
    {params.splitting_multiples}{params.evenness_yes}./scripts/splitting_multiples_evenness.sh {input.coverage} {output.out} {output.out} {params.evenness} {params.pid} #include if splitting multiples and using evenness
    {params.grouping_multiples}awk '{{print $1}}' {output.out} | sort | uniq -d > {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}cat {output.out} | parallel --colsep '\t' ./scripts/multiples.sh {{1}} {output.out} {output.out}_duplicates | sort | uniq > {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.out}_multiples {input.coverage} | awk 'BEGIN{{OFS=FS}} $3 == "" {{$3 = "NoHit"}} 1' > {output.out}_coverage #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}{params.evenness_no}awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and not using evenness
    {params.grouping_multiples}{params.evenness_yes}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {params.evenness} {output.out}_coverage | gawk -v pid="{params.pid}" '$4>pid' > {output.out}_coverage && awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and using evenness
    mkdir -p {params.split_keggs}
    awk '{{print $3}}' {output.out}_coverage > {output.out}_coverage_tmp && while read l; do mkdir -p {params.split_keggs}/"$l"; done < {output.out}_coverage_tmp && rm {output.out}_coverage_tmp
    awk '{{print $1>f"{params.split_keggs}" $3 "/{wildcards.id}_" $3}}' {output.out}_coverage
    for file in {params.split_keggs}*/{wildcards.id}_*; do ./scripts/extracting_genes.sh $file {input.filtered_aa}; done
    grep NoHit {output.out}_coverage | awk '{{print $1}}' > {output.nohit}
    rm {output.out}_coverage
    '''
shell:
    '''
    grep '^\*' {input.ko} | awk '{{print $2 "\t" $3}}' > {output.out}
    {params.splitting_multiples}{params.evenness_no}./scripts/splitting_multiples.sh {input.coverage} {output.out} {output.out} #include if splitting multiples and not using evenness
    {params.splitting_multiples}{params.evenness_yes}./scripts/splitting_multiples_evenness.sh {input.coverage} {output.out} {output.out} {params.evenness} {params.pid} #include if splitting multiples and using evenness
    {params.grouping_multiples}awk '{{print $1}}' {output.out} | sort | uniq -d > {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}cat {output.out} | parallel --colsep '\t' ./scripts/multiples.sh {{1}} {output.out} {output.out}_duplicates | sort | uniq > {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.out}_multiples {input.coverage} | awk 'BEGIN{{OFS=FS}} $3 == "" {{$3 = "NoHit"}} 1' > {output.out}_coverage #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}{params.evenness_no}awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and not using evenness
    {params.grouping_multiples}{params.evenness_yes}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {params.evenness} {output.out}_coverage | gawk -v pid="{params.pid}" '$4>pid' > {output.out}_coverage && awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and using evenness
    grep NoHit {output.out}_coverage | awk '{{print $1}}' > {output.nohit}
    rm {output.out}_coverage
    '''
shell:
    '''
    grep '^\*' {input.ko} | awk '{{print $2 "\t" $3}}' > {output.out}
    {params.splitting_multiples}{params.evenness_no}./scripts/splitting_multiples.sh {input.coverage} {output.out} {output.out} #include if splitting multiples and not using evenness
    {params.splitting_multiples}{params.evenness_yes}./scripts/splitting_multiples_evenness.sh {input.coverage} {output.out} {output.out} {params.evenness} {params.pid} #include if splitting multiples and using evenness
    {params.grouping_multiples}awk '{{print $1}}' {output.out} | sort | uniq -d > {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}cat {output.out} | parallel --colsep '\t' ./scripts/multiples.sh {{1}} {output.out} {output.out}_duplicates | sort | uniq > {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.out}_multiples {input.coverage} | awk 'BEGIN{{OFS=FS}} $3 == "" {{$3 = "NoHit"}} 1' > {output.out}_coverage #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_duplicates #include if grouping multiples
    {params.grouping_multiples}rm {output.out}_multiples #include if grouping multiples
    {params.grouping_multiples}{params.evenness_no}awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and not using evenness
    {params.grouping_multiples}{params.evenness_yes}awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {params.evenness} {output.out}_coverage | gawk -v pid="{params.pid}" '$4>pid' > {output.out}_coverage && awk '{{print $2 "\t" $3}}' {output.out}_coverage | awk '{{a[$2]+=$1}}END{{for(i in a) print i,a[i]}}' > {output.KOunt} #include if grouping multiples and using evenness
    rm {output.out}_coverage
    '''
shell:
    '''
    mkdir -p {params.tmp}
    if [ ! -e {params.kegglist} ]; then cat {input} | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {input}; do cp $i {params.tmp}; done
    ./scripts/abundance_matrix.sh {params.tmp} {params.list} {params.kegglist} > {output.results}
    rm -r {params.tmp}
    if [ -e {outdir}Touch/kegg_fasta ]; then rm {params.kegglist}; fi
    '''
shell:
    '''
    mkdir -p {params.tmp}
    if [ ! -e {params.kegglist} ]; then cat {input} | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {input}; do cp $i {params.tmp}; done
    ./scripts/abundance_matrix_noc.sh {params.tmp} {params.list} {params.kegglist} > {output.results}
    rm -r {params.tmp}
    if [ -e {outdir}Touch/kegg_fasta ]; then rm {params.kegglist}; fi
    '''
shell:
    '''
    mkdir -p {params.tmp}
    if [ ! -e {params.kegglist} ]; then cat {input} | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {input}; do cp $i {params.tmp}; done
    ./scripts/abundance_matrix_noref.sh {params.tmp} {params.list} {params.kegglist} > {output.results}
    rm -r {params.tmp}
    if [ -e {outdir}Touch/kegg_fasta ]; then rm {params.kegglist}; fi
    '''
shell:
    '''
    mkdir -p {params.cd}
    mkdir -p {params.cdhit}
    if [ ! -e {params.kegglist} ]; then cat {input} | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    cat {params.kegglist} | parallel "cat {params.split_dir}{{1}}/*faa > {params.cdhit}{{1}}.faa"
    rm -r {params.split_dir}
    if [ -e {params.mega} ]; then rm -r {params.mega}; fi
    touch {output}
    if [ -e {params.results} ]; then rm {params.kegglist}; fi
    '''
shell:
    '''
    for file in {params.cdhit}*faa; do cd-hit -i $file -o "$file"_1.0 -c 1.0 -d 0 -M {params.memory} -T {params.threads}; done
    for file in {params.cdhit}*clstr; do clstr2txt.pl $file | awk '{{print $2 "\t" $1}}' > "$file".txt; done
    for file in {params.cdhit}*clstr; do clstr_rep.pl $file > "$file"_reps; done
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.lists}
    (cd {params.sample} && ls K*_1.0) > keggslist
    {params.grouping_multiples}(cd {params.sample} && ls Multiple*1.0) >> keggslist #include if grouping multiples
    split -d --number=l/{params.num} keggslist {params.lists}/keggslist
    rm keggslist
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.mm}
    mkdir -p {params.mmseq}
    while read file; do ./scripts/mmseq_keggs.sh $file {params.cdhit} {params.mmseq}; done < {params.list}
    while read file; do ./scripts/count_mmseq_clusters.sh {params.mmseq}"$file"_0.9_all; done < {params.list}
    while read file; do ./scripts/count_mmseq_clusters.sh {params.mmseq}"$file"_0.5_all; done < {params.list}
    rm {params.list}
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.mm}
    mkdir -p {params.mmseq}
    ./scripts/mmseq_keggs.sh NoHit.faa_1.0 {params.cdhit} {params.mmseq}
    ./scripts/count_mmseq_clusters.sh {params.mmseq}NoHit.faa_1.0_0.9_all
    ./scripts/count_mmseq_clusters.sh {params.mmseq}NoHit.faa_1.0_0.5_all
    touch {output}
    '''
shell:
    '''
    echo -e "KEGG,Number of clusters,Number of singletons,Number with multiple genes,Number of proteins\n$(cat {params.mmseq}*txt)" > {output}
    rm {params.mmseq}*txt
    rm -r {params.list}
    '''
shell:
    '''
    mkdir -p {params.wd}
    ./scripts/extracting_fastas.sh {input.nohit} {input.faa} > {params.faa}
    ./scripts/extracting_fastas.sh {input.nohit} {input.fa} > {params.fa}
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.wd}
    ./scripts/extracting_fastas.sh {input.nohit} {input.faa} > {params.faa}
    ./scripts/extracting_fastas.sh {input.nohit} {input.fa} > {params.fa}
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.wd}
    ./scripts/extracting_fastas.sh {input.nohit} {input.faa} > {params.faa}
    touch {output}
    '''
shell:
    '''
    diamond blastp --query {params.faa} --threads {params.threads} --outfmt {params.of} -d {params.db} > {output.raw}
    mmseqs easy-search {params.fa} {params.rna} {output.mmseq} {params.tmp} --threads {params.threads} --search-type 3 --format-output {params.ofm}
    cat {output.raw} {output.mmseq} > {params.cat}
    rm -r {params.tmp}
    # keep hits where the query is between min_qc% and max_qc% of the reference length and identity exceeds min_pid
    awk -F '\t' '{{ print $0 "\t" ($4/$5) * 100 }}' {params.cat} | gawk -F '\t' '$NF>{params.min_qc} {{print $0}}' | gawk -F '\t' '$NF<{params.max_qc} {{print $0}}' | gawk -F '\t' '$3>{params.min_pid} {{print $0}}' > {output.fil}
    awk '! a[$1]++' {output.fil} > {output.fil}_tmp
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil}_tmp > {output.fil}_tmp.1
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil} > {output.fil}.1
    awk 'NR==FNR{{a[$1];next}}$1 in a' {output.fil}_tmp.1 {output.fil}.1 | tr ',' '\t' > {output.best}
    rm {output.fil}_tmp.1
    rm {output.fil}.1
    awk -F '_' '{{print $0 "\t" $NF}}' {output.best} | awk '{{print $1 "\t" $NF}}' | sort | uniq > {output.ko}
    awk '{{print $1}}' {output.ko} | uniq -c | awk '{{print $2 "\t" $1}}' > {output.ko}_count
    awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {input.cov} {output.ko} | awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.ko}_count - | awk '{{print $1 "\t" $2 "\t" ($3/$4)}}' > {output.ko}_coverage
    awk '{{a[$2]+=$3}}END{{for(i in a) print i,a[i]}}' {output.ko}_coverage | awk -F'-' '{{print $0 "\t" NF-1}}' | awk -v OFMT='%.5f' '{{print $1"\t",($2/$3)}}' | cut -d- -f2- | awk '{{gsub(/-/,"\t"$2",")}}1' | tr ',' '\n' | awk -v OFMT='%.5f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.KOunt}
    awk '{{print $1}}' {output.ko}_coverage > {output.ko}_coverage_tmp
    cat {input.nohit} {output.ko}_coverage_tmp | sort | uniq -u > {output.no}
    rm {output.ko}_coverage_tmp
    rm {output.ko}_coverage
    rm {output.ko}_count
    '''
shell:
    '''
    diamond blastp --query {params.faa} --threads {params.threads} --outfmt {params.of} -d {params.db} > {output.raw}
    mmseqs easy-search {params.fa} {params.rna} {output.mmseq} {params.tmp} --threads {params.threads} --search-type 3 --format-output {params.ofm}
    cat {output.raw} {output.mmseq} > {params.cat}
    rm -r {params.tmp}
    # keep hits where the query is between min_qc% and max_qc% of the reference length and identity exceeds min_pid
    awk -F '\t' '{{ print $0 "\t" ($4/$5) * 100 }}' {params.cat} | gawk -F '\t' '$NF>{params.min_qc} {{print $0}}' | gawk -F '\t' '$NF<{params.max_qc} {{print $0}}' | gawk -F '\t' '$3>{params.min_pid} {{print $0}}' > {output.fil}
    awk '! a[$1]++' {output.fil} > {output.fil}_tmp
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil}_tmp > {output.fil}_tmp.1
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil} > {output.fil}.1
    awk 'NR==FNR{{a[$1];next}}$1 in a' {output.fil}_tmp.1 {output.fil}.1 | tr ',' '\t' > {output.best}
    rm {output.fil}_tmp.1
    rm {output.fil}.1
    awk -F '_' '{{print $0 "\t" $NF}}' {output.best} | awk '{{print $1 "\t" $NF}}' | sort | uniq > {output.ko}
    awk '{{print $1}}' {output.ko} | uniq -c | awk '{{print $2 "\t" $1}}' > {output.ko}_count
    awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {input.cov} {output.ko} | awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.ko}_count - | awk '{{print $1 "\t" $2 "\t" ($3/$4)}}' > {output.ko}_coverage
    awk '{{a[$2]+=$3}}END{{for(i in a) print i,a[i]}}' {output.ko}_coverage | awk -F'-' '{{print $0 "\t" NF-1}}' | awk -v OFMT='%.5f' '{{print $1"\t",($2/$3)}}' | cut -d- -f2- | awk '{{gsub(/-/,"\t"$2",")}}1' | tr ',' '\n' | awk -v OFMT='%.5f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.KOunt}
    awk '{{print $1}}' {output.ko}_coverage > {output.ko}_coverage_tmp
    cat {input.nohit} {output.ko}_coverage_tmp | sort | uniq -u > {output.no}
    rm {output.ko}_coverage_tmp
    rm {output.ko}_coverage
    rm {output.ko}_count
    '''
shell:
    '''
    diamond blastp --query {params.faa} --threads {params.threads} --outfmt {params.of} -d {params.db} > {output.raw}
    # keep hits where the query is between min_qc% and max_qc% of the reference length and identity exceeds min_pid
    awk -F '\t' '{{ print $0 "\t" ($4/$5) * 100 }}' {output.raw} | gawk -F '\t' '$NF>{params.min_qc} {{print $0}}' | gawk -F '\t' '$NF<{params.max_qc} {{print $0}}' | gawk -F '\t' '$3>{params.min_pid} {{print $0}}' > {output.fil}
    awk '! a[$1]++' {output.fil} > {output.fil}_tmp
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil}_tmp > {output.fil}_tmp.1
    awk '{{print $1 "," $3 "," $10 "\t" $2}}' {output.fil} > {output.fil}.1
    awk 'NR==FNR{{a[$1];next}}$1 in a' {output.fil}_tmp.1 {output.fil}.1 | tr ',' '\t' > {output.best}
    rm {output.fil}_tmp.1
    rm {output.fil}.1
    awk -F '_' '{{print $0 "\t" $NF}}' {output.best} | awk '{{print $1 "\t" $NF}}' | sort | uniq > {output.ko}
    awk '{{print $1}}' {output.ko} | uniq -c | awk '{{print $2 "\t" $1}}' > {output.ko}_count
    awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {input.cov} {output.ko} | awk 'NR==FNR{{a[$1]=$2;next}}{{print $0,a[$1]}}' {output.ko}_count - | awk '{{print $1 "\t" $2 "\t" ($3/$4)}}' > {output.ko}_coverage
    awk '{{a[$2]+=$3}}END{{for(i in a) print i,a[i]}}' {output.ko}_coverage | awk -F'-' '{{print $0 "\t" NF-1}}' | awk -v OFMT='%.5f' '{{print $1"\t",($2/$3)}}' | cut -d- -f2- | awk '{{gsub(/-/,"\t"$2",")}}1' | tr ',' '\n' | awk -v OFMT='%.5f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.KOunt}
    awk '{{print $1}}' {output.ko}_coverage > {output.ko}_coverage_tmp
    cat {input.nohit} {output.ko}_coverage_tmp | sort | uniq -u > {output.no}
    rm {output.ko}_coverage_tmp
    rm {output.ko}_coverage
    rm {output.ko}_count
    '''
shell:
    '''
    mkdir -p {params.nohit}
    awk -F '=|\t|;' '{{print $1 "_" $10 "\t" ($4-1) "\t" $5}}' {input.gff} > {params.bed}
    awk 'NR==FNR{{a[$1]=$2;b[$1]=$3;next}}{{print $0 "\t" a[$1] "\t" b[$1]}}' {params.bed} {input.nohit} > {params.nobed}_tmp
    rm {params.bed}
    awk '{{print $1}}' {params.nobed}_tmp | rev | cut -d "_" -f 2- | rev > {params.nobed}_tmp.1
    awk '{{print $2 "\t" $3}}' {params.nobed}_tmp > {params.nobed}_tmp.2
    paste {params.nobed}_tmp.1 {params.nobed}_tmp.2 > {params.nobed}
    rm {params.nobed}_tmp
    rm {params.nobed}_tmp.1
    rm {params.nobed}_tmp.2
    bedtools intersect -a {params.nobed} -b {params.bam} -bed -wb | awk '{{print $7}}' | sort | uniq > {params.reads}
    rm {params.nobed}
    rm {params.bam}
    gawk -F '/' '$NF==1' {params.reads} > {params.reads}_R1
    gawk -F '/' '$NF==2' {params.reads} > {params.reads}_R2
    zcat {params.r1} | seqtk subseq - {params.reads}_R1 > {params.fastq}_R1
    zcat {params.r2} | seqtk subseq - {params.reads}_R2 > {params.fastq}_R2
    gzip {params.fastq}_R1
    gzip {params.fastq}_R2
    rm {params.reads}_R1
    rm {params.reads}_R2
    cat {params.fastq}_R1.gz {params.fastq}_R2.gz > {params.fastq}
    rm {params.fastq}_R1.gz
    rm {params.fastq}_R2.gz
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.nohit}
    awk -F '=|\t|;' '{{print $1 "_" $10 "\t" ($4-1) "\t" $5}}' {input.gff} > {params.bed}
    awk 'NR==FNR{{a[$1]=$2;b[$1]=$3;next}}{{print $0 "\t" a[$1] "\t" b[$1]}}' {params.bed} {input.nohit} > {params.nobed}_tmp
    rm {params.bed}
    awk '{{print $1}}' {params.nobed}_tmp | rev | cut -d "_" -f 2- | rev > {params.nobed}_tmp.1
    awk '{{print $2 "\t" $3}}' {params.nobed}_tmp > {params.nobed}_tmp.2
    paste {params.nobed}_tmp.1 {params.nobed}_tmp.2 > {params.nobed}
    rm {params.nobed}_tmp
    rm {params.nobed}_tmp.1
    rm {params.nobed}_tmp.2
    bedtools intersect -a {params.nobed} -b {params.bam} -bed -wb | awk '{{print $7}}' | sort | uniq > {params.reads}
    rm {params.nobed}
    rm {params.bam}
    gawk -F '/' '$NF==1' {params.reads} > {params.reads}_R1
    gawk -F '/' '$NF==2' {params.reads} > {params.reads}_R2
    zcat {params.r1} | seqtk subseq - {params.reads}_R1 > {params.fastq}_R1
    zcat {params.r2} | seqtk subseq - {params.reads}_R2 > {params.fastq}_R2
    gzip {params.fastq}_R1
    gzip {params.fastq}_R2
    rm {params.reads}_R1
    rm {params.reads}_R2
    cat {params.fastq}_R1.gz {params.fastq}_R2.gz > {params.fastq}
    rm {params.fastq}_R1.gz
    rm {params.fastq}_R2.gz
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.nohit}
    awk -F '=|\t|;' '{{print $1 "_" $10 "\t" ($4-1) "\t" $5}}' {input.gff} > {params.bed}
    awk 'NR==FNR{{a[$1]=$2;b[$1]=$3;next}}{{print $0 "\t" a[$1] "\t" b[$1]}}' {params.bed} {input.nohit} > {params.nobed}_tmp
    rm {params.bed}
    awk '{{print $1}}' {params.nobed}_tmp | rev | cut -d "_" -f 2- | rev > {params.nobed}_tmp.1
    awk '{{print $2 "\t" $3}}' {params.nobed}_tmp > {params.nobed}_tmp.2
    paste {params.nobed}_tmp.1 {params.nobed}_tmp.2 > {params.nobed}
    rm {params.nobed}_tmp
    rm {params.nobed}_tmp.1
    rm {params.nobed}_tmp.2
    bedtools intersect -a {params.nobed} -b {params.bam} -bed -wb | awk '{{print $7}}' | sort | uniq > {params.reads}
    rm {params.nobed}
    rm {params.bam}
    gawk -F '/' '$NF==1' {params.reads} > {params.reads}_R1
    gawk -F '/' '$NF==2' {params.reads} > {params.reads}_R2
    zcat {params.r1} | seqtk subseq - {params.reads}_R1 > {params.fastq}_R1
    zcat {params.r2} | seqtk subseq - {params.reads}_R2 > {params.fastq}_R2
    gzip {params.fastq}_R1
    gzip {params.fastq}_R2
    rm {params.reads}_R1
    rm {params.reads}_R2
    cat {params.fastq}_R1.gz {params.fastq}_R2.gz > {params.fastq}
    rm {params.fastq}_R1.gz
    rm {params.fastq}_R2.gz
    touch {output}
    '''
shell:
    '''
    mkdir -p {params.wd}
    diamond blastx --query {params.fq} --threads {params.threads} --outfmt {params.of} -d {input.db} > {params.nohit}
    mmseqs easy-search {params.fq} {input.rna} {params.mmseq} {params.tmp} --threads {params.threads} --search-type 3 --format-output {params.ofm}
    cat {params.nohit} {params.mmseq} > {params.cat}
    rm {params.nohit}
    rm {params.mmseq}
    rm -r {params.tmp}
    ./scripts/unmapped_reads_ko_100.sh {params.cat} {params.wd} {output.besthits} {input.bg} {output.ko}
    awk '{{print $1}}' {output.besthits} | sort | uniq | cat - {params.reads} | sort | uniq -u > {params.missing}
    rm {params.cat}
    rm {params.reads}
    '''
shell:
    '''
    mkdir -p {params.wd}
    diamond blastx --query {params.fq} --threads {params.threads} --outfmt {params.of} -d {input.db} > {params.nohit}
    mmseqs easy-search {params.fq} {input.rna} {params.mmseq} {params.tmp} --threads {params.threads} --search-type 3 --format-output {params.ofm}
    cat {params.nohit} {params.mmseq} > {params.cat}
    rm {params.nohit}
    rm {params.mmseq}
    rm -r {params.tmp}
    ./scripts/unmapped_reads_ko_100.sh {params.cat} {params.wd} {output.besthits} {input.bg} {output.ko}
    awk '{{print $1}}' {output.besthits} | sort | uniq | cat - {params.reads} | sort | uniq -u > {params.missing}
    rm {params.cat}
    rm {params.reads}
    '''
shell:
    '''
    mkdir -p {params.wd}
    diamond blastx --query {params.fq} --threads {params.threads} --outfmt {params.of} -d {input.db} > {params.nohit}
    ./scripts/unmapped_reads_ko_100.sh {params.nohit} {params.wd} {output.besthits} {input.bg} {output.ko}
    rm {params.nohit}
    '''
shell:
    '''
    gawk -F '/' '$NF==1' {params.missing} > {params.missing}_R1
    gawk -F '/' '$NF==2' {params.missing} > {params.missing}_R2
    zcat {params.fq} | seqtk subseq - {params.missing}_R1 | gzip > {params.r1}
    zcat {params.fq} | seqtk subseq - {params.missing}_R2 | gzip > {params.r2}
    rm {params.fq}
    rm {params.missing}
    rm {params.missing}_R1
    rm {params.missing}_R2
    kallisto quant -i {input.ref} -o {params.kal} -t {params.threads} {params.r1} {params.r2}
    # convert kallisto estimated counts to coverage: (mean read length * est_counts) / transcript length
    awk -v len=$(zcat {params.r1} {params.r2} | awk 'NR%4==2{{sum+=length($0)}}END{{print sum/(NR/4)}}') 'NR>1{{print $1"\t"(len*$4)/$2}}' {params.kal}/abundance.tsv | gawk '$2!=0' > {params.kal}/tmp
    awk '{{print $1}}' {params.kal}/tmp | awk -F'-' '{{print NF-1}}' | paste {params.kal}/tmp - | awk -v OFMT='%.10f' -v OFS='\t' '{{print $1,($2/$3)}}' | awk '{{gsub(/-/,"\t"$2",")}}1' | cut -d, -f2- | sed 's/,$//g' | tr ',' '\n' | awk -v OFMT='%.10f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.ko}
    rm {params.kal}/tmp
    rm {params.r1}
    rm {params.r2}
    rm -r {params.kal}
    gzip {input.besthits}
    '''
shell:
    '''
    gawk -F '/' '$NF==1' {params.missing} > {params.missing}_R1
    gawk -F '/' '$NF==2' {params.missing} > {params.missing}_R2
    zcat {params.fq} | seqtk subseq - {params.missing}_R1 | gzip > {params.r1}
    zcat {params.fq} | seqtk subseq - {params.missing}_R2 | gzip > {params.r2}
    rm {params.fq}
    rm {params.missing}
    rm {params.missing}_R1
    rm {params.missing}_R2
    kallisto quant -i {input.ref} -o {params.kal} -t {params.threads} {params.r1} {params.r2}
    # convert kallisto estimated counts to coverage: (mean read length * est_counts) / transcript length
    awk -v len=$(zcat {params.r1} {params.r2} | awk 'NR%4==2{{sum+=length($0)}}END{{print sum/(NR/4)}}') 'NR>1{{print $1"\t"(len*$4)/$2}}' {params.kal}/abundance.tsv | gawk '$2!=0' > {params.kal}/tmp
    awk '{{print $1}}' {params.kal}/tmp | awk -F'-' '{{print NF-1}}' | paste {params.kal}/tmp - | awk -v OFMT='%.10f' -v OFS='\t' '{{print $1,($2/$3)}}' | awk '{{gsub(/-/,"\t"$2",")}}1' | cut -d, -f2- | sed 's/,$//g' | tr ',' '\n' | awk -v OFMT='%.10f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.ko}
    rm {params.kal}/tmp
    rm {params.r1}
    rm {params.r2}
    rm -r {params.kal}
    gzip {input.besthits}
    '''
shell:
    '''
    mkdir -p {params.wd}
    mkdir -p {params.mm}
    mkdir -p {params.mm2}
    mkdir -p {params.tmp}
    zcat {params.r1} {params.r2} | awk 'NR%4==1 {{print substr($1,2)}}' | cat - {params.mapped} | sort --parallel=8 | uniq -u > {params.unmapped}
    rm {params.mapped}
    zcat {params.r1} {params.r2} | seqtk subseq - {params.unmapped} | gzip > {params.fastq}
    diamond blastx --query {params.fastq} --threads {params.threads} --outfmt 6 qseqid sseqid pident qlen slen qstart qend sstart send evalue -d {input.db} > {params.dia}
    mmseqs easy-search {params.fastq} {input.rna} {params.mmseq} {params.tmp} --threads {params.threads} --search-type 3 --format-output {params.ofm}
    rm -r {params.tmp}
    cat {params.dia} {params.mmseq} > {params.cat}
    rm {params.dia}
    rm {params.mmseq}
    ./scripts/unmapped_reads_ko_100.sh {params.cat} {params.wd} {output.besthits} {input.bg} {output.ko}
    rm {params.cat}
    '''
shell:
    '''
    mkdir -p {params.wd}
    zcat {params.r1} {params.r2} | awk 'NR%4==1 {{print substr($1,2)}}' | cat - {params.mapped} | sort | uniq -u > {params.unmapped}
    rm {params.mapped}
    zcat {params.r1} {params.r2} | seqtk subseq - {params.unmapped} | gzip > {params.fastq}
    rm {params.unmapped}
    diamond blastx --query {params.fastq} --threads {params.threads} --outfmt 6 qseqid sseqid pident qlen slen qstart qend sstart send evalue -d {input.db} > {params.dia}
    ./scripts/unmapped_reads_ko_100.sh {params.dia} {params.wd} {output.besthits} {input.bg} {output.ko}
    rm {params.dia}
    '''
shell:
    '''
    mkdir -p {params.nohit}
    awk '{{print $1}}' {input.besthits} | sort | uniq | cat - {params.unmapped} | sort | uniq -u > {params.still_missing}
    rm {params.unmapped}
    gawk -F '/' '$NF==1' {params.still_missing} > {params.still_missing}_R1
    gawk -F '/' '$NF==2' {params.still_missing} > {params.still_missing}_R2
    zcat {params.fq} | seqtk subseq - {params.still_missing}_R1 | gzip > {params.r1}
    zcat {params.fq} | seqtk subseq - {params.still_missing}_R2 | gzip > {params.r2}
    rm {params.fq}
    rm {params.still_missing}_R1
    rm {params.still_missing}_R2
    kallisto quant -i {input.ref} -o {params.kal} -t {params.threads} {params.r1} {params.r2}
    awk -v len=$(zcat {params.r1} {params.r2} | awk 'NR%4==2{{sum+=length($0)}}END{{print sum/(NR/4)}}') 'NR>1{{print $1"\t"(len*$4)/$2}}' {params.kal}/abundance.tsv | gawk '$2!=0' > {params.kal}/tmp
    awk '{{print $1}}' {params.kal}/tmp | awk -F'-' '{{print NF-1}}' | paste {params.kal}/tmp - | awk -v OFMT='%.10f' -v OFS='\t' '{{print $1,($2/$3)}}' | awk '{{gsub(/-/,"\t"$2",")}}1' | cut -d, -f2- | sed 's/,$//g' | tr ',' '\n' | awk -v OFMT='%.10f' '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {output.ko}
    rm {params.kal}/tmp
    rm {params.r1}
    rm {params.r2}
    gzip {input.besthits}
    '''
shell:
    '''
    mkdir {params.tmp}
    for i in {input.kofam}; do cp $i {params.tmp}; done
    for i in {input.nohit_p}; do cp $i {params.tmp}; done
    for i in {input.nohit_r}; do cp $i {params.tmp}; done
    for i in {input.kal}; do cp $i {params.tmp}; done
    for i in {input.unmap}; do cp $i {params.tmp}; done
    for i in {input.kal_unmap}; do cp $i {params.tmp}; done
    sed -i '/NoHit/d' {params.tmp}/*
    if [ ! -e {params.kegglist} ]; then cat {params.tmp}* | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {params.id}; do echo $i; done > {params.tmp}samples
    while read file; do cat {params.tmp}"$file"_* | awk '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {params.tmp}"$file"_catted; done < {params.tmp}samples
    ./scripts/abundance_matrix_catted.sh {params.tmp} {params.list} {params.kegglist} > {output.ann_results}
    rm {params.kegglist}
    rm -r {params.tmp}
    '''
shell:
    '''
    mkdir {params.tmp}
    for i in {input.kofam}; do cp $i {params.tmp}; done
    for i in {input.nohit_p}; do cp $i {params.tmp}; done
    for i in {input.nohit_r}; do cp $i {params.tmp}; done
    for i in {input.kal}; do cp $i {params.tmp}; done
    for i in {input.unmap}; do cp $i {params.tmp}; done
    for i in {input.kal_unmap}; do cp $i {params.tmp}; done
    sed -i '/NoHit/d' {params.tmp}/*
    if [ ! -e {params.kegglist} ]; then cat {params.tmp}* | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {params.id}; do echo $i; done > {params.tmp}samples
    while read file; do cat {params.tmp}"$file"_* | awk '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {params.tmp}"$file"_catted; done < {params.tmp}samples
    ./scripts/abundance_matrix_catted.sh {params.tmp} {params.list} {params.kegglist} > {output.ann_results}
    rm {params.kegglist}
    rm -r {params.tmp}
    '''
shell:
    '''
    mkdir {params.tmp}
    for i in {input.kofam}; do cp $i {params.tmp}; done
    for i in {input.nohit_p}; do cp $i {params.tmp}; done
    for i in {input.nohit_reads}; do cp $i {params.tmp}; done
    for i in {input.unmap}; do cp $i {params.tmp}; done
    sed -i '/NoHit/d' {params.tmp}/*
    if [ ! -e {params.kegglist} ]; then cat {params.tmp}* | awk '{{print $1}}' | sort | uniq > {params.kegglist}; fi
    for i in {params.id}; do echo $i; done > {params.tmp}samples
    while read file; do cat {params.tmp}"$file"_* | awk '{{a[$1]+=$2}}END{{for(i in a) print i,a[i]}}' > {params.tmp}"$file"_catted; done < {params.tmp}samples
    ./scripts/abundance_matrix_catted.sh {params.tmp} {params.list} {params.kegglist} > {output.ann_results}
    rm -r {params.tmp}
    rm {params.kegglist}
    '''
Support
- Future updates