AGNOSTOS - workflow

Disclaimer: This is a work in progress!
The workflow is still under development and is subject to change. No guarantee is made regarding the functioning of the workflow or the accuracy of the results. Please contact us if you are interested in using it.
Snakemake workflow usage:
The "agnostos-wf" Snakemake workflow was developed on, and runs in, the de.NBI Cloud. We used a cluster setup with 10 nodes of 28 cores and 252 GB of memory each. The cluster was built using BiBiGrid and uses SLURM as the batch scheduler.
Check our GitHub wiki or the usage file to set AGNOSTOS up on your computer and to run the different modules!
Run the workflow:
WARNING: UPDATES!
- The general structure of the workflow has changed slightly since the first release: the DB-creation and DB-update modules are now gathered in a single workflow folder, and to switch between them you only need to specify the module in the config.yaml file.
- It is now possible to classify the singletons into the 4 categories (K, KWP, GU and EU) by setting singl: "true" in the config.yaml file.
- It is also possible to re-validate and classify the existing GCs that got integrated with new genes by setting eval_shared: "true" in the config.yaml file. At the moment the default is to reprocess all shared GCs (shared_type: "all"); the other options are "discarded", to reprocess all previously discarded GCs that gained new sequences, and "increase", to reprocess all shared GCs with 30% new sequences.
- The databases can be downloaded and then removed at each step, to reduce the amount of space needed by the workflow (db_mode: "memory"); see the configuration example below.
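For example, these options can be switched on by editing the workflow configuration before launching a run. A minimal sketch, assuming the keys above sit at the top level of workflow/config/config.yaml (the exact location may differ in your checkout):
cd workflow/
# Sketch only: enable singleton classification, re-evaluation of shared GCs
# and the low-disk database mode; the config path is an assumption.
sed -i 's/^singl:.*/singl: "true"/'              config/config.yaml
sed -i 's/^eval_shared:.*/eval_shared: "true"/'  config/config.yaml
sed -i 's/^shared_type:.*/shared_type: "all"/'   config/config.yaml
sed -i 's/^db_mode:.*/db_mode: "memory"/'        config/config.yaml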
1. The DB-creation module starts from a set of genomic/metagenomic contigs in FASTA format and produces a database of categorised gene clusters and cluster communities.
cd workflow/
snakemake --use-conda -j 100 --config module="creation" --cluster-config config/cluster.yaml --cluster "sbatch --export=ALL -t {cluster.time} -c {threads} --ntasks-per-node {cluster.ntasks_per_node} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus_per_task} --job-name {rulename}.{jobid} --partition {cluster.partition}" -R --until creation_workflow_report
2. The DB-update module is used to add your genomic/metagenomic contigs or genes to an existing gene cluster database such as the agnostosDB dataset, which is stored in Figshare (https://doi.org/10.6084/m9.figshare.12459056) and publicly available for download. In case you cannot download the whole dataset, given the large size of many of the files, the workflow will download the files needed for each step and then remove them. A description of the agnostosDB files can be found in the AgnostosDB_README.md.
- To run the DB-update module of the workflow, you just need to enter the folder, modify the config.yaml and config_communities.yml files to specify your input data and the output paths (see the usage file), and then run the same command shown above, this time setting the configuration parameter 'module' to "update":
cd workflow/
snakemake -s Snakefile --use-conda -j 100 --config module="update" --cluster-config config/cluster.yaml --cluster "sbatch --export=ALL -t {cluster.time} -c {threads} --ntasks-per-node {cluster.ntasks_per_node} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus_per_task} --job-name {rulename}.{jobid} --partition {cluster.partition}" -R --until update_workflow_report
NB: If you want to run the DB-update module on the results of the DB-creation module, first copy or move the cluster database into the final DB-creation result folder as follows:
mv db_creation/mmseqs_clustering db_creation/clusterDB_results/
**Output**
The output of these 2 modules is described in the Output_README.md.
Test dataset
A set of 10K contigs to test the DB-creation module and 5K contigs to test the DB-update module of the workflow. The test dataset can be downloaded from Figshare as follows:
mkdir -p agnostos_test
cd agnostos_test
wget https://figshare.com/ndownloader/files/31247614 -O db_creation_data.tar.gz
tar -xzvf db_creation_data.tar.gz
wget https://ndownloader.figshare.com/files/25473335 -O db_update_data.tar.gz
tar -xzvf db_update_data.tar.gz
A brief description of the dataset is available on Figshare.
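Before a full run, you can point the DB-creation module at the test contigs and ask Snakemake for a dry run, which only lists the jobs that would be executed. A minimal sketch, assuming the agnostos_test folder sits next to the workflow/ folder and that config.yaml has already been edited to point to the extracted test data:
cd ../workflow/
# -n = dry run: print the planned jobs without executing them
snakemake --use-conda -j 4 --config module="creation" -n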
**Profile-search**: the profile search against the AgnostosDB cluster HMM profile database is not part of the Snakemake workflow. However, if you want to search your set of genes against our profiles, you just need to download the AGNOSTOS [gene cluster profiles](https://figshare.com/ndownloader/files/30998305) and the [gene cluster categories](https://ndownloader.figshare.com/files/23067140), and make sure you have [MMseqs2](https://github.com/soedinglab/MMseqs2) installed. The scripts can be found in the [Profile_search/](Profile_search) folder. To run the search you just need the following command:
# download the AGNOSTOS seed database gene cluster profiles
wget https://figshare.com/ndownloader/files/30998305 -O mmseqs_profiles.tar.gz
tar -xzvf mmseqs_profiles.tar.gz
# download the AGNOSTOS seed database gene cluster categories
wget https://ndownloader.figshare.com/files/23067140 -O cluster_ids_categ.tsv.gz
gunzip cluster_ids_categ.tsv.gz
Profile_search/profile_search.sh --query your-genes.fasta --clu_hmm mmseqs_profiles/clu_hmm_db --clu_cat cluster_ids_categ.tsv --mmseqs /path/to/mmseqs --mpi FALSE --threads 8
As an additional option, you can provide a file via "--info". This file should be a table with the correspondence of the genes to the contigs and genomes/MAGs or samples. The format should be gene - contig - genome (or sample_ID) - optional-info.
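For example, a minimal tab-separated mapping could look like the sketch below (the file name genes_info.tsv and the identifiers are only illustrative):
# Illustrative only: one gene per line, tab-separated
# gene - contig - genome (or sample_ID) - optional info
printf 'gene_1\tcontig_1\tMAG_001\tsurface_sample\n'  > genes_info.tsv
printf 'gene_2\tcontig_1\tMAG_001\tsurface_sample\n' >> genes_info.tsv
Profile_search/profile_search.sh --query your-genes.fasta --clu_hmm mmseqs_profiles/clu_hmm_db --clu_cat cluster_ids_categ.tsv --info genes_info.tsv --mmseqs /path/to/mmseqs --mpi FALSE --threads 8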
* * *
THE FUNCTIONAL DARK SIDE OF GENOMES AND METAGENOMES
To learn more about what we are doing, check out our website dark.metagenomics.eu.
Citation:
Vanni, C., Schechter, M., Acinas, S., Barberán, A., Buttigieg, P. L., Casamayor, E. O., Delmont, T. O., Duarte, C. M., Murat Eren, A., Finn, R., Kottmann, R., Mitchell, A., Sanchez, P., Siren, K., Steinegger, M., Glöckner, F. O., & Fernandez-Guerra, A. Unifying the known and unknown microbial coding sequence space. eLife 2022. https://doi.org/10.7554/eLife.67667
Code Snippets
shell:
    """
    set -e
    set -x
    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    {params.categ_db} --cluseq_db {params.cluseqdb} \
        --step "{params.step}" \
        --mpi_runner "{params.mpi_runner}" \
        --mmseqs {params.mmseqs_bin} \
        --aln {params.famsa_bin} \
        --hhsuite {params.hhsuite} \
        --reformat {params.reformat} \
        --consensus {params.consensus} \
        --hhmake {params.hhmake} \
        --outdir {params.outdir} \
        --idir {params.idir} \
        --clu_hhm {output.clu_hhm} \
        --threads {threads} 2>{log.err} 1>{log.out}
    """
run:
    shell("echo 'CATEGORY DB DONE'")
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 | shell: """ set -e set -x export OMPI_MCA_btl=^openib export OMP_NUM_THREADS={threads} export OMP_PROC_BIND=FALSE ### Environmental unknown refinement searching for remote homologies (in the Uniclust DB) if [[ ! -f {params.hmm_eu} ]]; then {params.categ_db} --cluseq_db {params.cluseqdb} \ --step "{params.step_unk}" \ --mpi_runner "{params.mpi_runner}" \ --mmseqs {params.mmseqs_bin} \ --aln {params.famsa_bin} \ --hhsuite {params.hhsuite} \ --reformat {params.reformat} \ --consensus {params.consensus} \ --hhmake {params.hhmake} \ --outdir {params.outdir} \ --idir {params.idir} \ --clu_hhm "none" \ --threads {threads} 2>{log.err} 1>{log.out} fi # Pfam HHBLITS DB if [ ! -s {params.pfam_db}_hhm.ffdata ]; then echo "Dowloading Pfam hh-suite database" DB=$(dirname {params.pfam_db}) wget http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pfamA_34.0.tar.gz -O ${{DB}}/pfam.tar.gz tar xvfz ${{DB}}/pfam.tar.gz --directory ${{DB}}/ rm ${{DB}}/pfam.tar.gz fi # Uniclust HHBLITS DB if [ ! -s {params.uniclust_db}_a3m.ffdata ]; then DB=$(dirname {params.uniclust_db}) echo "Dowloading Uniclust hh-suite database" aria2c --file-allocation=none -d ${{DB}} -c -x 10 -s 10 http://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz tar xvfz {params.uniclust_db}.tar.gz --directory ${{DB}}/ rm {params.uniclust_db}.tar.gz fi # vmtouch the DBs {params.vmtouch} -f {params.uniclust_db}* {params.vmtouch} -f {params.pfam_db}* {params.vmtouch} -f {params.hmm_eu} # If EU GCs <= 10,000 search them vs Uniclust and parse results EU=$(wc -l {input.eu} | cut -d' ' -f1) if [[ ! -f {params.tmp_eu}_parsed.tsv ]]; then if [[ $EU -le 10000 ]]; then # Run directly EU vs Uniclust if [[ ! 
-f {params.tmp_eu}.index ]]; then {params.mpi_runner} {params.hhblits_bin_mpi} -i {params.hmm_eu} \ -o {params.tmp_eu} \ -n 2 -cpu 1 -v 0 \ -d {params.uniclust_db} 2>>{log.err} 1>>{log.out} mv {params.tmp_eu}.ffdata {params.tmp_eu} mv {params.tmp_eu}.ffindex {params.tmp_eu}.index fi {params.mpi_runner} {params.mmseqs_bin} apply {params.tmp_eu} {params.tmp_eu}.parsed \ -- {params.hhblits_search} {params.outdir}/templ {threads} {params.hhblits_prob} --threads 1 2>>{log.err} 1>>{log.out} sed -e 's/\\x0//g' {params.tmp_eu}.parsed > {params.tmp_eu}_parsed.tsv awk '{{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15"-"$16"-"$17"-"$18"-"$19"-"$20}}' {params.tmp_eu}_parsed.tsv |\ sed 's/ /\t/g' > {params.outdir}/tmpl && mv {params.outdir}/tmpl {params.tmp_eu}_parsed.tsv rm -rf {params.outdir}/templ else # Run EU vs PFAM and then EU vs Uniclust ## EU vs Pfam if [[ ! -f {params.tmp_eu}.index ]]; then {params.mpi_runner} {params.hhblits_bin_mpi} -i {params.hmm_eu} \ -o {params.tmp_eu} \ -n 2 -cpu 1 -v 0 \ -d {params.pfam_db} 2>>{log.err} 1>>{log.out} mv {params.tmp_eu}.ffdata {params.tmp_eu} mv {params.tmp_eu}.ffindex {params.tmp_eu}.index fi {params.mpi_runner} {params.mmseqs_bin} apply {params.tmp_eu} {params.tmp_eu}.parsed_pfam \ -- {params.hhblits_search} {params.outdir}/templ {threads} {params.hhblits_prob} --threads 1 2>>{log.err} 1>>{log.out} rm -rf {params.outdir}/templ sed -e 's/\\x0//g' {params.tmp_eu}.parsed_pfam > {params.tmp_eu}.tsv #Parsing hhblits result files and filtering for hits with coverage > 0.4, and removing overlapping pfam domains/matches ./{params.hhpfam_parser} {params.tmp_eu}.tsv {threads} > {params.tmp_eu}_filt.tsv if [ -s {params.tmp_eu}_filt.tsv ]; then # join with the pfam names and clans join -11 -21 <(awk '{{split($1,a,"."); print a[1],$3,$8,$9}}' {params.tmp_eu}_filt.tsv | sort -k1,1) \ <(gzip -dc {params.pfam_clan} |\ awk -vFS='\\t' -vOFS='\\t' '{{print $1,$2,$4}}' |\ awk -vFS='\\t' -vOFS='\\t' '{{for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "no_clan"}};1' \ | sort -k1,1) > {params.tmp_eu}_name_acc_clan.tsv # Multi domain format awk '{{print $2,$3,$4,$5,$1,$6}}' {params.tmp_eu}_name_acc_clan.tsv |\ sort -k1,1 -k2,3g | \ awk -vOFS='\\t' '{{print $1,$4,$5,$6}}' | \ awk -f {params.concat} > {params.tmp_eu}_name_acc_clan_multi.tsv rm {params.tmp_eu}_name_acc_clan.tsv if [ -s {params.tmp_eu}_name_acc_clan_multi.tsv ]; then # Divide the new hits with pfam into DUFs and not DUFs ./{params.new_da} {params.tmp_eu}_name_acc_clan_multi.tsv {params.tmp_eu} # New K clusters awk '{{print $1}}' {params.tmp_eu}_new_k_ids_annot.tsv >> {output.k} cat {input.k} >> {output.k} # New GU clusters awk '{{print $1}}' {params.tmp_eu}_new_gu_ids_annot.tsv >> {output.gu} # Remaning EU clusters join -11 -21 -v1 <(sort -k1,1 {input.eu}) \ <(awk '{{print $1}}' {params.tmp_eu}_name_acc_clan_multi.tsv | sort -k1,1) > {output.eu}.1 # Subset the EU hmm DB {params.mmseqs_bin} createsubdb {output.eu}.1 {params.hmm_eu} {params.hmm_eu}.left mv {params.hmm_eu}.left {params.hmm_eu} mv {params.hmm_eu}.left.index {params.hmm_eu}.index fi fi # Run remaining EU GCs vs Uniclust {params.vmtouch} -f {params.hmm_eu} if [[ ! 
-f {params.tmp_eu}_unicl.index ]]; then {params.mpi_runner} {params.hhblits_bin_mpi} -i {params.hmm_eu} \ -o {params.tmp_eu}_unicl \ -n 2 -cpu 1 -v 0 \ -d {params.uniclust_db} 2>>{log.err} 1>>{log.out} mv {params.tmp_eu}_unicl.ffdata {params.tmp_eu}_unicl mv {params.tmp_eu}_unicl.ffindex {params.tmp_eu}_unicl.index fi {params.mpi_runner} {params.mmseqs_bin} apply {params.tmp_eu}_unicl {params.tmp_eu}.parsed_unicl \ -- {params.hhblits_search} {params.outdir}/templ {threads} {params.hhblits_prob} --threads 1 2>>{log.err} 1>>{log.out} sed -e 's/\\x0//g' {params.tmp_eu}.parsed_unicl > {params.tmp_eu}_parsed.tsv awk '{{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15"-"$16"-"$17"-"$18"-"$19"-"$20}}' {params.tmp_eu}_parsed.tsv |\ sed 's/ /\t/g' > {params.outdir}/tmpl && mv {params.outdir}/tmpl {params.tmp_eu}_parsed.tsv rm -rf {params.outdir}/templ fi fi # Parse EU refinement results LC_ALL=C rg -j 4 -i -f {params.patterns} {params.tmp_eu}_parsed.tsv | \ awk '{{print $0"\thypo"}}' > {params.tmp_eu}_hypo_char LC_ALL=C rg -j 4 -v -i -f {params.patterns} {params.tmp_eu}_parsed.tsv | \ awk '{{print $0"\tchar"}}' >> {params.tmp_eu}_hypo_char sed -i 's/\\t\\t/\\t/g' {params.tmp_eu}_hypo_char if [ -s {params.tmp_eu}_hypo_char ]; then awk -vP={params.hypo_filt} -vFS='\\t' \ '{{a[$2][$16]++}}END{{for (i in a) {{N=a[i]["hypo"]/(a[i]["hypo"]+a[i]["char"]); if (N >= P){{print i}}}}}}' {params.tmp_eu}_hypo_char \ > {params.tmp_eu}_new_gu_ids.txt join -11 -21 -v1 <(awk '!seen[$2]++{{print $2}}' {params.tmp_eu}_parsed.tsv | sort -k1,1) \ <(sort -k1,1 {params.tmp_eu}_new_gu_ids.txt) > {params.tmp_eu}_new_kwp_ids.txt if [[ -f {output.eu}.1 ]]; then join -11 -21 -v1 <(sort -k1,1 {output.eu}.1) \ <(awk '!seen[$2]++{{print $2}}' {params.tmp_eu}_parsed.tsv | sort -k1,1) > {output.eu} rm {output.eu}.1 else join -11 -21 -v1 <(sort -k1,1 {input.eu}) \ <(awk '!seen[$2]++{{print $2}}' {params.tmp_eu}_parsed.tsv | sort -k1,1) > {output.eu} fi cat {input.kwp} {params.tmp_eu}_new_kwp_ids.txt >> {output.kwp} cat {input.gu} {params.tmp_eu}_new_gu_ids.txt >> {output.gu} if [[ ! -s {output.k} ]]; then cp {input.k} {output.k} fi else echo "No EU was found to be a remote homolog of an already observed protein" cp {input.k} {output.k} cp {input.kwp} {output.kwp} cp {input.gu} {output.gu} cp {input.eu} {output.eu} fi # clean the EU results ... rm -rf {params.hmm_eu}* rm -rf {params.tmp_eu} {params.tmp_eu}.index {params.tmp_eu}.dbtype rm -rf {params.tmp_eu}_unicl {params.tmp_eu}_unicl.index {params.tmp_eu}_unicl.dbtype rm -rf {params.tmp_eu}.parsed_* {params.tmp_eu}.tsv {params.tmp_eu}_filt.tsv rm -rf {params.tmp_eu}_parsed.tsv {params.tmp_eu}_hypo_char ####### ## Known without Pfam refinement, searching for remote homologies (in the Pfam DB) # Pfam clans if [ ! -s {params.pfam_clan} ]; then echo "Dowloading Pfam-A clan information" wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.clans.tsv.gz -O {params.pfam_clan} fi if [[ ! 
-f {params.hmm_kwp} ]]; then if [[ -s {output.kwp} ]]; then IDIR={params.outdir} else IDIR={params.idir} fi {params.categ_db} --cluseq_db {params.cluseqdb} \ --step "{params.step_k}" \ --mpi_runner "{params.mpi_runner}" \ --mmseqs {params.mmseqs_bin} \ --aln {params.famsa_bin} \ --hhsuite {params.hhsuite} \ --reformat {params.reformat} \ --consensus {params.consensus} \ --hhmake {params.hhmake} \ --outdir {params.outdir} \ --idir "${{IDIR}}" \ --clu_hhm "none" \ --threads {threads} 2>>{log.err} 1>>{log.out} fi rm -f {params.hmm_kwp}.ff* ln -s {params.hmm_kwp} {params.hmm_kwp}.ffdata ln -s {params.hmm_kwp}.index {params.hmm_kwp}.ffindex if [[ ! -f {params.tmp_kwp}.index || ! -f {params.tmp_kwp}.tsv ]]; then {params.vmtouch} -f {params.hmm_kwp} if [[ ! -f {params.tmp_kwp}.index ]]; then {params.mpi_runner} {params.hhblits_bin_mpi} -i {params.hmm_kwp} \ -o {params.tmp_kwp} \ -n 2 -cpu 1 -v 0 \ -d {params.pfam_db} mv {params.tmp_kwp}.ffdata {params.tmp_kwp} mv {params.tmp_kwp}.ffindex {params.tmp_kwp}.index fi {params.mpi_runner} {params.mmseqs_bin} apply {params.tmp_kwp} {params.tmp_kwp}.parsed \ -- {params.hhblits_search} {params.outdir}/templ {threads} {params.hhblits_prob} --threads 1 rm -rf {params.outdir}/templ # Parsing hhr result files and filtering for hits with probability ≥ 90% sed -e 's/\\x0//g' {params.tmp_kwp}.parsed > {params.tmp_kwp}.tsv fi #Parsing hhblits result files and filtering for hits with coverage > 0.4, and removing overlapping pfam domains/matches ./{params.hhpfam_parser} {params.tmp_kwp}.tsv {threads} > {params.tmp_kwp}_filt.tsv # join with the pfam names and clans join -11 -21 <(awk '{{split($1,a,"."); print a[1],$3,$8,$9}}' {params.tmp_kwp}_filt.tsv | sort -k1,1) \ <(gzip -dc {params.pfam_clan} |\ awk -vFS='\\t' -vOFS='\\t' '{{print $1,$2,$4}}' |\ awk -vFS='\\t' -vOFS='\\t' '{{for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "no_clan"}};1' | sort -k1,1) > {params.tmp_kwp}_name_acc_clan.tsv # Multi domain format awk '{{print $2,$3,$4,$5,$1,$6}}' {params.tmp_kwp}_name_acc_clan.tsv |\ sort -k1,1 -k2,3g | \ awk -vOFS='\\t' '{{print $1,$4,$5,$6}}' | \ awk -f {params.concat} > {params.tmp_kwp}_name_acc_clan_multi.tsv rm {params.tmp_kwp}_name_acc_clan.tsv if [ -s {params.tmp_kwp}_name_acc_clan_multi.tsv ]; then # Divide the new hits with pfam into DUFs and not DUFs ./{params.new_da} {params.tmp_kwp}_name_acc_clan_multi.tsv {params.tmp_kwp} # New K clusters awk '{{print $1}}' {params.tmp_kwp}_new_k_ids_annot.tsv >> {output.k} # New GU clusters awk '{{print $1}}' {params.tmp_kwp}_new_gu_ids_annot.tsv >> {output.gu} # New KWP clusters join -11 -21 -v1 <(sort -k1,1 {output.kwp}) \ <(awk '{{print $1}}' {params.tmp_kwp}_name_acc_clan_multi.tsv | sort -k1,1) > {params.tmp_kwp}_kwp mv {params.tmp_kwp}_kwp {output.kwp} # EU remain the same else # The categories mantain the same clusters echo "No KWP was found to be remote homolog of a Pfam domain" fi # clean the results ... 
rm -rf {params.hmm_kwp} {params.hmm_kwp}.index {params.hmm_kwp}.dbtype {params.hmm_kwp}.ff* rm -rf {params.tmp_kwp} {params.tmp_kwp}.index {params.tmp_kwp}.dbtype rm -rf {params.tmp_kwp}.parsed* {params.tmp_kwp}_filt.tsv {params.tmp_kwp}.tsv #Clean eventual duplicates awk '!seen[$0]++' {output.k} > {output.k}.tmp && mv {output.k}.tmp {output.k} awk '!seen[$0]++' {output.kwp} > {output.kwp}.tmp && mv {output.kwp}.tmp {output.kwp} awk '!seen[$0]++' {output.gu} > {output.gu}.tmp && mv {output.gu}.tmp {output.gu} awk '!seen[$0]++' {output.eu} > {output.eu}.tmp && mv {output.eu}.tmp {output.eu} # Create a file with cl_name and category cat <(awk -vOFS='\\t' '{{print $1,"K"}}' {output.k}) \ <(awk -vOFS='\\t' '{{print $1,"KWP"}}' {output.kwp}) \ <(awk -vOFS='\\t' '{{print $1,"GU"}}' {output.gu}) \ <(awk -vOFS='\\t' '{{print $1,"EU"}}' {output.eu}) > {output.categ} # Add ORFs join -11 -21 <(sort -k1,1 {output.categ}) \ <(awk '!seen[$1,$3]++{{print $1,$3}}' {params.ref_clu} | sort -k1,1) \ > {params.categ_orfs} gzip {params.categ_orfs} # Gather cluster annotations obtained from the classification and the two refinement steps class=$(dirname {input.k}) categ=$(dirname {output.k}) # GU annotations: join -11 -21 <(sort -k1,1 {output.gu}) \ <(awk '{{print $1,"Uniclust",$3,$4}}' {params.tmp_eu}_parsed.tsv | sort -k1,1) \ > {params.outdir}/GU_annotations.tsv if [ -s {params.tmp_eu}_new_gu_ids_annot.tsv ]; then awk '{{print $1,"Pfam","0.0",$2}}' {params.tmp_eu}_new_gu_ids_annot.tsv >> {params.outdir}/GU_annotations.tsv fi awk '{{print $1,"Pfam","0.0",$2}}' {params.tmp_kwp}_new_gu_ids_annot.tsv >> {params.outdir}/GU_annotations.tsv cat {params.idir}/gu_annotations.tsv >> {params.outdir}/GU_annotations.tsv sed -i 's/ /\\t/g' {params.outdir}/GU_annotations.tsv # KWP annotations join -11 -21 <(sort -k1,1 {output.kwp}) \ <(awk '{{print $1,"Uniclust",$3,$4}}' {params.tmp_eu}_parsed.tsv | sort -k1,1) \ > {params.outdir}/KWP_annotations.tsv join -11 -21 <(sort -k1,1 {output.kwp}) \ <(sort -k1,1 {params.idir}/kwp_annotations.tsv) >> {params.outdir}/KWP_annotations.tsv sed -i 's/ /\\t/g' {params.outdir}/KWP_annotations.tsv # K annotations if [ -s {params.tmp_eu}_new_k_ids_annot.tsv ]; then awk '{{print $1,"Pfam","0.0",$2}}' {params.tmp_eu}_new_k_ids_annot.tsv >> {params.outdir}/K_annotations.tsv fi cat {params.idir}/k_annotations.tsv \ <(awk -vOFS='\\t' '{{print $1,"Pfam","0.0",$2}}' {params.tmp_kwp}_new_k_ids_annot.tsv) \ >> {params.outdir}/K_annotations.tsv sed -i 's/ /\\t/g' {params.outdir}/K_annotations.tsv if [[ {params.db_mode} == "memory" ]]; then rm {params.uniclust_db}* {params.pfam_db}* fi """ |
run:
    shell("echo 'CATEGORY REFINEMENT DONE'")
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 | shell: """ if [ {params.conf_tax_db} == "gtdb" ]; then TAXDB={params.gtdb} if [ ! -s {params.gtdb} ]; then echo "Dowloading GTDB-r89 kaiju DB" DB=$(dirname {params.mutantDB}) wget https://ndownloader.figshare.com/files/24745184 -O ${{DB}}/gtdb.tar.gz tar xzvf ${{DB}}/gtdb.tar.gz --directory ${{DB}} rm ${{DB}}/gtdb.tar.gz fi else TAXDB={params.nr} if [[ ! -s {params.nr} ]]; then DIR=$(basedir ${{TAXDB}}) cd ${{DIR}} {params.kaiju_bin}-makedb -s nr_euk cd {params.workdir} fi fi ## Cluster Kaiju taxonomy with GTDB r89 if [[ ! -s {params.kaiju_res} ]]; then {params.vmtouch} -f ${{TAXDB}} ./{params.kaiju_tax} --search {params.kaiju_bin} \ --input {params.refdb} \ --taxdb ${{TAXDB}} \ --parsing {params.kaiju_parse} \ --output {params.kaiju_res} \ --tmpl {params.tmpl} \ --threads {threads} 2>{log.err} 1>{log.out} fi if [[ {params.db_mode} == "memory" ]]; then rm ${{TAXDB}} fi # Extract all sequences from the refined database set: sed -e 's/\\x0//g' {params.refdb} > {params.outdir}/refined_cl_genes.fasta # Create MMseqs2 databases for the refined genes {params.mmseqs_bin} createdb {params.outdir}/refined_cl_genes.fasta {params.outdir}/refined_cl_genes_db ## Cluster level of darkness mkdir -p {params.dark_dir} if [ ! -s {params.DPD} ]; then echo "Dowloading DPD" wget https://ndownloader.figshare.com/files/23756312 -O {params.DPD} wget https://ndownloader.figshare.com/files/23756306 -O {params.dpd_info} fi if [[ ! -s {params.dark} ]]; then {params.mmseqs_bin} createdb {params.DPD} {params.dark_dir}/dpd_db # Search {params.mmseqs_bin} search {params.outdir}/refined_cl_genes_db {params.dark_dir}/dpd_db \ {params.dark_dir}/refined_cl_genes_dpd_db {params.dark_dir}/tmp \ --threads {threads} --max-seqs 300 \ -e 1e-20 --cov-mode 0 -c 0.6 --mpi-runner "{params.mpi_runner}" {params.mmseqs_bin} convertalis {params.outdir}/refined_cl_genes_db {params.dark_dir}/dpd_db \ {params.dark_dir}/refined_cl_genes_dpd_db {params.dark_dir}/refined_cl_genes_dpd.tsv \ --format-mode 2 --threads {threads} \ --format-output 'query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qcov,tcov' rm -rf {params.dark_dir}/dpd_db* {params.dark_dir}/refined_cl_genes_dpd_db* {params.dark_dir}/tmp # Extract best-hits export LANG=C; export LC_ALL=C; \ sort -k1,1 -k11,11g -k13,13gr -k14,14gr --parallel {threads} -T {params.tmpl} \ {params.dark_dir}/refined_cl_genes_dpd.tsv | \ sort -u -k1,1 --parallel {threads} -T {params.tmpl} --merge > {params.dark_dir}/refined_cl_genes_dpd_bh.tsv # Join with cluster categories join -11 -23 <(awk '{{print $1,$2}}' {params.dark_dir}/refined_cl_genes_dpd_bh.tsv | \ sort -k1,1 --parallel {threads} -T {params.tmpl}) \ <(sort -k3,3 --parallel {threads} -T {params.tmpl} <(zcat {params.cl_cat_genes})) > {params.dark} sed -i 's/ /\\t/g' {params.dark} fi ## Cluster vs Price et al. 
mutants mkdir -p {params.mutant_dir} # Mutant phenotypes (Price et al. 2018) if [ ! -s {params.aa_mutants} ]; then ## Amino acid sequences wget https://fit.genomics.lbl.gov/cgi_data/aaseqs -O {params.aa_mutants} fi if [ ! -s {params.mutantDB} ]; then ## Contextual data wget https://fit.genomics.lbl.gov/cgi_data/feba.db -O {params.mutantDB} fi if [[ ! -s {params.mutants} ]]; then {params.mmseqs_bin} createdb {params.aa_mutants} {params.mutant_dir}/mutant_db # Search {params.mmseqs_bin} search {params.outdir}/refined_cl_genes_db {params.mutant_dir}/mutant_db \ {params.mutant_dir}/refined_cl_genes_mutant_db {params.mutant_dir}/tmp \ --threads {threads} --max-seqs 300 \ -e 1e-20 --cov-mode 0 -c 0.6 --mpi-runner "{params.mpi_runner}" {params.mmseqs_bin} convertalis {params.outdir}/refined_cl_genes_db {params.mutant_dir}/mutant_db \ {params.mutant_dir}/refined_cl_genes_mutant_db {params.mutant_dir}/refined_cl_genes_mutants.tsv \ --format-mode 2 --threads {threads} \ --format-output 'query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qcov,tcov' rm -rf {params.outdir}/refined_cl_genes_db* {params.mutant_dir}/tmp rm -rf {params.mutant_dir}/mutant_db* {params.mutant_dir}/refined_cl_genes_mutant_db* # Extract best-hits export LANG=C; export LC_ALL=C;\ sort -k1,1 -k11,11g -k13,13gr -k14,14gr --parallel {threads} -T {params.tmpl}\ {params.mutant_dir}/refined_cl_genes_mutants.tsv | \ sort -u -k1,1 --merge --parallel {threads} -T {params.tmpl} > {params.mutant_dir}/refined_cl_genes_mutants_bh.tsv # Join with cluster categories join -11 -23 <(awk '{{print $1,$2}}' {params.mutant_dir}/refined_cl_genes_mutants_bh.tsv |\ sort -k1,1 --parallel {threads} -T {params.tmpl}) \ <(sort -k3,3 --parallel {threads} -T {params.tmpl}\ <(zcat {params.cl_cat_genes})) > {params.mutants} sed -i 's/ /\\t/g' {params.mutants} fi ## EggNOG annotations mkdir -p {params.eggnog_dir} if [[ ! 
-s {params.eggnog_dir}/cluster_eggnogs.tsv ]]; then pip install biopython {params.eggnog_bin} -m diamond --no_annot --no_file_comment --cpu {threads} \ -i {params.outdir}/refined_cl_genes.fasta \ --output {params.eggnog_dir}/refined_nogs --override NOG=$(dirname {params.eggnog_bin}) scp ${{NOG}}/data/* /dev/shm/ {params.eggnog_bin} --annotate_hits_table {params.eggnog_dir}/refined_nogs.emapper.seed_orthologs \ --no_file_comments -o {params.eggnog_dir}/refined \ --cpu {threads} --override --data_dir /dev/shm rm /dev/shm/* awk -vFS='\\t' -vOFS='\\t' '{{print $1,$7,$8}}' {params.eggnog_dir}/refined.emapper.annotations |\ sed 's/ /_/g' > {params.eggnog_dir}/refined_eggnogs.tsv # Join with cluster categories join -11 -23 <(awk '{{print $1,$2,$3}}' {params.eggnog_dir}/refined_eggnogs.tsv |\ sort -k1,1 --parallel {threads} -T {params.tmpl}) \ <(sort -k3,3 --parallel {threads} -T {params.tmpl}\ <(zcat {params.cl_cat_genes})) > {params.eggnog_dir}/cluster_eggnogs.tsv sed -i 's/ /\\t/g' {params.eggnog_dir}/cluster_eggnogs.tsv rm {params.eggnog_dir}/refined_nogs* {params.eggnog_dir}/refined_eggnogs.tsv fi rm {params.outdir}/refined_cl_genes.fasta ## Cluster general stats ./{params.stats} --ref_clu {params.ref} \ --clu_categ {input.cl_cat} \ --kaiju_tax {params.kaiju_res} \ --clu_dark {params.dark} \ --dpd_info {params.dpd_info} \ --compl {params.compl} \ --hq_clu {output.HQ_clusters} \ --mutantDB {params.mutantDB} \ --mutants {params.mutants} \ --eggnog {params.eggnog_dir}/cluster_eggnogs.tsv \ --summ_stats {output.cat_stats} \ --output {params.outdir} 2>>{log.err} 1>>{log.out} if [[ {params.db_mode} == "memory" ]]; then rm {params.dpd_info} {params.DPD} rm {params.aa_mutants} {params.mutantDB} fi """ |
run:
    shell("echo 'COMMUNITIES INFERENCE DONE'")
66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 | shell: """ set -e set -x export OMPI_MCA_btl=^openib export OMP_NUM_THREADS={threads} export OMP_PROC_BIND=FALSE # Extract consensus sequnces if [ ! -f {params.ref_cons} ]; then # Create alignments and retrieve the consensus sequences if [ ! -f {params.db_aln} ]; then {params.mpi_runner} {params.mmseqs_bin} apply {params.refdb_noannot} {params.db_aln} --threads {threads} \ -- {params.famsa_bin} STDIN STDOUT 2> /dev/null fi if [ ! -f {params.db_cons} ]; then {params.mpi_runner} {params.mmseqs_bin} apply {params.db_aln} {params.db_cons} --threads {threads} \ -- {params.consensus} {params.hhcons_bin} fi # Extract the consensus sequences as fasta sed -e 's/\\x0//g' {params.db_cons} > {params.ref_cons} rm {params.db_aln}* {params.db_cons}* # Add singletons not annotated here to ref_cons if config["singl"]=="true" # keep the name ref_cons for the two searches (easier) if [ {params.singl} == "true" ]; then join -11 -21 -v1 <(awk '{{print $2}}' {params.s_all} | sort -k1,1) \ <(awk '{{print $1}}' {params.s_annot} | sort -k1,1) > {params.outdir}/singletons_not_annot.txt sed -e 's/\\x0//g' {params.genes} > {params.outdir}/genes.fasta {params.filterbyname} in={params.outdir}/genes.fasta \ out={params.outdir}/singletons_not_annot.fasta \ names={params.outdir}/singletons_not_annot.txt \ include=t ignorejunk overwrite="true" 2>>{log.err} 1>>{log.out} cat {params.outdir}/singletons_not_annot.fasta >> {params.ref_cons} rm {params.outdir}/genes.fasta fi fi # Search the not annotated cluster consensus sequences against the UniRef90 database if [ ! -s {params.uniref_prot} ]; then echo "Dowloading UniRef90 DB using mmseqs" {params.mmseqs_bin} databases UniRef90 {params.uniref_db} {params.local_tmp} --threads {threads} --remove-tmp-files # Create the protein description file: echo "extracting protein information" sed 's/\\x0//g' {params.uniref_db}_h | sed 's/ /__/g' | sed 's/__/\\t/' | sed 's/__/_/g' | gzip > {params.uniref_prot} fi if [ ! 
-f {params.outdir}/noannot_vs_uniref90_hits.txt ]; then {params.vmtouch} -f {params.uniref_db}* {params.mmseqs_search} --search {params.mmseqs_bin} \ --mpi_runner "{params.mpi_runner}" \ --ltmp {params.local_tmp} \ --cons {params.ref_cons} \ --db_target {params.uniref_db} \ --db_info {params.uniref_prot} \ --evalue_filter {params.evalue_filt} \ --evalue_threshold {params.evalue_thr} \ --hypo_threshold {params.hypo_thr} \ --hypo_patterns {params.patterns} \ --grep {params.ripgrep} \ --output {params.uniref_res} \ --outdir {params.outdir} \ --threads {threads} 2>{log.err} 1>{log.out} # The output of the search are 2 files: the MMseqs2 output in tabular format: "noannot_vs_uniref90.tsv" # the list of hypothetical proteins: "noannot_vs_uniref90_hypo.txt" # Parse results and search nohits against NCBI nr awk '!seen[$1]++{{print $1}}' {params.uniref_res} > {params.outdir}/noannot_vs_uniref90_hits.txt {params.filterbyname} in={params.ref_cons} \ out={params.outdir}/noannot_vs_uniref90_nohits.fasta \ names={params.outdir}/noannot_vs_uniref90_hits.txt \ include=f ignorejunk overwrite="true" 2>>{log.err} 1>>{log.out} if [[ {params.db_mode} == "memory" ]]; then rm {params.uniref_db}* {params.uniref_prot} fi fi #### # Search the non-matching consensus sequences against the NCBI nr database if [ ! -s {params.nr_prot} ]; then echo "Dowloading NR DB using mmseqs" {params.mmseqs_bin} databases NR {params.nr_db} {params.local_tmp} --threads {threads} --remove-tmp-files # Create the protein description file: echo "extracting protein information" sed 's/\\x0//g' {params.nr_db}_h | sed 's/ /__/g' | sed 's/__/\\t/' | sed 's/__/_/g' | gzip > {params.nr_prot} fi if [ ! -f {params.outdir}/uniref-nohits_vs_NR_hits.txt ]; then {params.vmtouch} -f {params.nr_db}* {params.mmseqs_search} --search {params.mmseqs_bin} \ --mpi_runner "{params.mpi_runner}" \ --ltmp {params.local_tmp} \ --cons {params.outdir}/noannot_vs_uniref90_nohits.fasta \ --db_target {params.nr_db} \ --db_info {params.nr_prot} \ --evalue_filter {params.evalue_filt} \ --evalue_threshold {params.evalue_thr} \ --hypo_threshold {params.hypo_thr} \ --hypo_patterns {params.patterns} \ --grep {params.ripgrep} \ --output {params.nr_res} \ --outdir {params.outdir} \ --threads {threads} 2>{log.err} 1>{log.out} # The output of the search are 2 files: the MMseqs2 output in tabular format: "uniref-nohits_vs_NR.tsv" # the list of hypothetical proteins: "uniref-nohits_vs_NR_hypo.txt" # Parse results and define the first categories awk '!seen[$1]++{{print $1}}' {params.nr_res} \ > {params.outdir}/uniref-nohits_vs_NR_hits.txt {params.filterbyname} in={params.outdir}/noannot_vs_uniref90_nohits.fasta \ out={params.outdir}/uniref-nohits_vs_NR_nohits.fasta \ names={params.outdir}/uniref-nohits_vs_NR_hits.txt \ include=f ignorejunk overwrite="true" 2>>{log.err} 1>>{log.out} if [[ {params.db_mode} == "memory" ]]; then rm {params.nr_db}* {params.nr_prot} fi fi # Environmental unknowns (EUs) if [ {params.singl} == "true" ]; then grep '^>' {params.outdir}/uniref-nohits_vs_NR_nohits.fasta | sed 's/^>//' > {params.outdir}/all_eu # cluster EU join -11 -21 -v1 <(sort -k1,1 {params.outdir}/all_eu) \ <(sort -k1,1 {params.outdir}/singletons_not_annot.txt) > {output.eu} # singleton EU join -11 -21 <(sort -k1,1 {params.outdir}/all_eu) \ <(sort -k1,1 {params.outdir}/singletons_not_annot.txt) > {params.outdir}/singl_eu else grep '^>' {params.outdir}/uniref-nohits_vs_NR_nohits.fasta | sed 's/^>//' > {output.eu} fi # Knowns without Pfam (KWPs) # Not-hypothetical hits from the search 
against UniRef90 join -11 -21 -v1 <(sort {params.outdir}/noannot_vs_uniref90_hits.txt) \ <(sort {params.outdir}/noannot_vs_uniref90_hypo.txt) > {params.outdir}/noannot_vs_uniref90_char.txt # Not-hypothetical hits from the search against NCBI nr join -11 -21 -v1 <(sort {params.outdir}/uniref-nohits_vs_NR_hits.txt) \ <(sort {params.outdir}/uniref-nohits_vs_NR_hypo.txt) > {params.outdir}/uniref-nohits_vs_NR_char.txt if [ {params.singl} == "true" ]; then cat {params.outdir}/noannot_vs_uniref90_char.txt \ {params.outdir}/uniref-nohits_vs_NR_char.txt > {params.outdir}/all_kwp # cluster kWP join -11 -21 -v1 <(sort -k1,1 {params.outdir}/all_kwp) \ <(sort -k1,1 {params.outdir}/singletons_not_annot.txt) > {output.kwp} # singleton KWP join -11 -21 <(sort -k1,1 {params.outdir}/all_kwp) \ <(sort -k1,1 {params.outdir}/singletons_not_annot.txt) > {params.outdir}/singl_kwp else cat {params.outdir}/noannot_vs_uniref90_char.txt \ {params.outdir}/uniref-nohits_vs_NR_char.txt > {output.kwp} fi # Add annotation info: cat {params.outdir}/noannot_vs_uniref90_E{params.evalue_thr}_prot.tsv \ {params.outdir}/uniref-nohits_vs_NR_E{params.evalue_thr}_prot.tsv > {params.outdir}/noannot_uniref_nr_annotations.tsv ## KWP annotations join -11 -21 <(sort -k1,1 {output.kwp} ) \ <(sort -k1,1 {params.outdir}/noannot_uniref_nr_annotations.tsv) > {params.outdir}/kwp_annotations.tsv # Knowns (Ks) and Genomic unknowns (GUs) if [ ! -s {params.pfam} ]; then echo "Dowloading Pfam list of shared domain names" wget https://figshare.com/ndownloader/files/31127782 -O {params.pfam} fi # Load the annotated cluster file in R to retrieve a consensus pfam domain architecture {params.da_r} --ref_annot {input.ref_annot} \ --clu_annot {params.clu_annot} \ --good_cl {params.good_cl} \ --pfam_terms {params.pfam} \ --dom_arch {params.dom_arch} \ --threads {threads} 2>>{log.err} 1>>{log.out} if [[ {params.db_mode} == "memory" ]]; then rm {params.pfam} fi # Extract Pfam domain functional information (PF or DUF) from the DA file: {params.dom_arch} # Add the clusters annotated to pfam domains of unknown function (DUFs) to the GUs set if [ {params.singl} == "true" ]; then cat <(awk 'NR>1 && $9=="DUF"{{print $1}}' {params.dom_arch}) \ <(awk '$3~/^DUF/{{print $1}}' {params.s_annot}) \ {params.outdir}/noannot_vs_uniref90_hypo.txt \ {params.outdir}/uniref-nohits_vs_NR_hypo.txt > {params.outdir}/all_gu # cluster GU join -11 -21 -v1 <(sort -k1,1 {params.outdir}/all_gu) \ <(awk '{{print $2}}' {params.s_all} | sort -k1,1 ) > {output.gu} # singleton GU join -11 -21 <(sort -k1,1 {params.outdir}/all_gu) \ <(awk '{{print $2}}' {params.s_all} | sort -k1,1) > {params.outdir}/singl_gu else cat <(awk 'NR>1 && $9=="DUF"{{print $1}}' {params.dom_arch}) \ {params.outdir}/noannot_vs_uniref90_hypo.txt \ {params.outdir}/uniref-nohits_vs_NR_hypo.txt > {output.gu} fi ## GU annotations join -11 -21 <(sort -k1,1 {output.gu} ) \ <(sort -k1,1 {params.outdir}/noannot_uniref_nr_annotations.tsv) > {params.outdir}/gu_annotations.tsv awk 'NR>1 && $9=="DUF"{{print $1,"PFAM","0.0",$5}}' {params.dom_arch} >> {params.outdir}/gu_annotations.tsv #rm {params.outdir}/noannot_uniref_nr_annotations.tsv # Retrieve the Knowns cluster set if [ {params.singl} == "true" ]; then cat <(awk 'NR>1 && $9!="DUF"{{print $1}}' {params.dom_arch}) \ <(awk '$3!~/^DUF/{{print $1}}' {params.s_annot}) > {params.outdir}/all_k # cluster K join -11 -21 -v1 <(sort -k1,1 {params.outdir}/all_k) \ <(awk '{{print $1}}' {params.s_annot} | sort -k1,1 ) > {output.k} # singleton K join -11 -21 <(sort -k1,1 
{params.outdir}/all_k) \ <(awk '{{print $1}}' {params.s_annot} | sort -k1,1) > {params.outdir}/singl_k # singleton categories cat <(awk '{{print $1,"EU"}}' {params.outdir}/singl_eu) \ <(awk '{{print $1,"GU"}}' {params.outdir}/singl_gu) \ <(awk '{{print $1,"KWP"}}' {params.outdir}/singl_kwp) \ <(awk '{{print $1,"K"}}' {params.outdir}/singl_k) > {params.outdir}/singl_gene_categ # singletons gene cluster categories join -12 -21 <(sort -k2,2 {params.s_all}) \ <(sort -k1,1 {params.outdir}/singl_gene_categ) |\ sed 's/ /\\t/g' > {params.s_categ} rm {params.outdir}/singl_* {params.outdir}/singletons_not_annot* {params.outdir}/all_* else awk 'NR>1 && $9!="DUF"{{print $1}}' {params.dom_arch} > {output.k} fi ## K annotations awk -vOFS='\\t' 'NR>1 && $9!="DUF"{{print $1,"PFAM","0.0",$5}}' {params.dom_arch} >> {params.outdir}/k_annotations.tsv # Clear results #rm -rf {params.outdir}/noannot_vs_uniref90_* {params.outdir}/uniref-nohits_vs_NR_* #rm -rf {params.refdb_noannot}* """ |
run:
    shell("echo 'CLUSTER CLASSIFICATION DONE'")
shell:
    """
    set -e
    set -x
    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    # HHblits all-clusters vs all-clusters for each category
    CATEG=$(echo -e "eu\ngu\nkwp\nk")
    IN=$(dirname {input.k})
    OUT=$(dirname {output.comm})
    for categ in $CATEG; do
        RES=${{OUT}}/${{categ}}_hhblits.tsv
        if [[ ! -s ${{RES}} ]]; then
            if [[ ! -s {params.hhb_tmp_db}.index ]]; then
                {params.vmtouch} -f ${{IN}}/${{categ}}*
                {params.mpi_runner} {params.hhblits} -i ${{IN}}/${{categ}}_hhm_db \
                    -o {params.hhb_tmp_db} \
                    -n 2 -cpu 1 -v 0 \
                    -d ${{IN}}/${{categ}}
                mv {params.hhb_tmp_db}.ffdata {params.hhb_tmp_db}
                mv {params.hhb_tmp_db}.ffindex {params.hhb_tmp_db}.index
            fi
            {params.mpi_runner} {params.mmseqs_bin} apply {params.hhb_tmp_db} {params.hhb_tmp_db}.parsed \
                --threads 1 \
                -- {params.hhparse} 2>{log.err}
            sed -e 's/\\x0//g' {params.hhb_tmp_db}.parsed > ${{RES}} 2>{log.err}
            rm -rf {params.hhb_tmp_db} {params.hhb_tmp_db}.index {params.hhb_tmp_db}.dbtype {params.hhb_tmp_db}.parsed* {params.hhb_tmp_db}.ff*
        fi
    done
    """
shell:
    """
    OUT=$(dirname {output.comm})

    if [ ! -s {params.pfam_clan} ]; then
        echo "Dowloading Pfam-A clan information"
        wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.clans.tsv.gz -O {params.pfam_clan}
    fi

    # Start cluster community inference
    ./{params.get_comm} -c {params.comm_config} 1>{log.out} 2>{log.err}

    rm -rf ${{OUT}}/tmp

    if [[ {params.db_mode} == "memory" ]]; then
        rm {params.pfam_clan}
    fi
    """
run:
    shell("echo 'COMMUNITIES INFERENCE DONE'")
shell:
    """
    set -e
    set -x
    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE
    {params.igraph_lib}
    {params.parasail_lib}

    # Run compositional validation in mpi-mode
    if [[ {params.module} == "creation" ]]; then
        DB={params.clseqdb}
    else
        DB={params.new_clseqdb}
    fi

    {params.mpi_runner} {params.mmseqs_bin} apply ${{DB}} {params.outdb} \
        -- {params.cvals} --derep {params.mmseqs_bin} \
        --msa {params.famsa_bin} \
        --msaeval {params.odseq_bin} \
        --ssn {params.parasail_bin} \
        --gfilter {params.filterg} \
        --gconnect {params.isconnect} \
        --seqp {params.seqtk_bin} \
        --datap {params.datamash} \
        --stats {params.stats} \
        --outdir {params.outdir} \
        --slots {threads} \
        --threads {threads} 2>{log.err} 1>{log.out}

    # Collect results:
    # collect cluster main compositional validation stats and cluster rejected (bad-aligned) ORFs
    {params.collect} {params.stat_dir} {output.cl_cval} {output.cval_rej} {params.parallel_bin} {params.threads_collect} 2>>{log.err} 1>>{log.out}

    rm -rf {params.outdb} {params.outdb}.index {params.outdb}.dbtype
    rm -rf {params.outdir}/stats {params.outdir}/log_vals {params.outdir}/rejected
    """
run:
    shell("echo 'COMPOSITIONAL VALIDATION DONE'")
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | shell: """ set -x set -e # Summary table with cluster db origin (original/shared/new) awk -v N={params.data_name} '{{print $1,N,$3}}' {params.or_clu_origin} > {params.clu_origin} sed -i 's/ /\\t/g' {params.clu_origin} # All gene headers and partiality information gzip -c {params.or_partial} > {params.partial} # Spurious and shadow genes information: gzip -c {params.or_sp_sh} > {params.sp_sh} # All gene Pfam annotations: if [ ! -f {params.multi_annot} ]; then mv {params.or_multi_annot} {params.multi_annot} fi # All cluster category annotation files ODIR=$(dirname {input.iclu_cat}) # Combine with new ones gzip -c ${{ODIR}}/K_annotations.tsv > {params.rdir}/K_annotations.tsv.gz gzip -c ${{ODIR}}/KWP_annotations.tsv > {params.rdir}/KWP_annotations.tsv.gz gzip -c ${{ODIR}}/GU_annotations.tsv > {params.rdir}/GU_annotations.tsv.gz # Integrated set of cluster categories gzip -c {input.iclu_cat} > {output.clu_cat} # and the cluster genes if [ {params.singl} == true ]; then awk -vOFS="\\t" '{{print $2,$3,$1}}' {params.s_categ} | gzip -c > {params.singl_cl_gene_categ} fi cp {params.or_clu_gene} {params.clu_gene} # Integrated set of cluster communities gzip -c {input.iclu_com} > {output.clu_com} # New integarted cluster HHMs DB and mmseqs profiles mkdir -p {params.clu_hhm} cp -r {input.iclu_hhm}* {params.clu_hhm}/ {params.mmseqs_bin} convertprofiledb {input.iclu_hhm} {params.clu_hhm}/clu_hmm_db \ --threads {threads} 2>{log.err} # Integrated set of high quality (HQ) clusters gzip -c {input.ihq_clu} > {output.hq_clu} {params.parser} --clu_or {params.clu_origin} \ --cat {output.clu_cat} \ --clu_info {params.clu_info} \ --comm {output.clu_com} \ --hq_clu {output.hq_clu} \ --k_annot {params.multi_annot} \ --is_singl {params.singl} \ --s_cat {params.s_categ} \ --anvio {params.data_stage} \ --res {output.clu_out_tbl} \ --threads {threads} 2>{log.err} 1>{log.out} gzip {params.clu_origin} # Integrated cluster summary information gzip -c {input.iclu_stats} > {output.clu_stats} """ |
run:
    shell("echo 'INTEGRATED DB DONE'")
shell:
    """
    set -e
    set -x

    # Pfam list common domain terms
    if [ ! -s {params.pfam_shared_terms} ]; then
        echo "Dowloading Pfam list of shared domain names"
        wget https://figshare.com/ndownloader/files/31127782 -O {params.pfam_shared_terms}
    fi

    ./{params.funct_valr} --input {input.cl_annot} \
        --pfam_terms {params.pfam_shared_terms} \
        --output {output.fval_res} \
        --functions {params.funct_val_fun} \
        --threads {threads} 2>{log.err} 1>{log.out}

    if [[ {params.db_mode} == "memory" ]]; then
        rm {params.pfam_shared_terms}
    fi
    """
run:
    shell("echo 'FUNCTIONAL VALIDATION DONE'")
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | shell: """ set -e set -x ## 1. Pfam-clan info and multi-domain format for the ORF Pfam annotations if [ ! -s {params.pfam_clan} ]; then echo "Dowloading Pfam-A clan information" wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.clans.tsv.gz -O {params.pfam_clan} fi # Add Pfam clan info to the Pfam annotation results join -11 -21 <( awk '{{print $1,$3,$8,$9}}' {input.annot} | sort -k1,1) \ <( zcat {params.pfam_clan} | awk -vFS='\\t' -vOFS='\\t' '{{print $4,$1,$2}}' \ | awk -vFS='\\t' -vOFS='\\t' '{{for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "no_clan"}};1' \ | sort -k1,1) > {params.multi_annot} # Reorder columns # the previous output is: pfam_name - orf - start - end - pfam_acc - pfam_clan # we want to get: orf - pfam_name - pfam_acc - pfam_clan - start - end awk -vOFS='\\t' '{{print $2,$1,$5,$6,$3,$4}}' {params.multi_annot} > {params.tmp} && mv {params.tmp} {params.multi_annot} # Multiple annotations on the same line, separated by “|” (the annotations were ordered first by alignment position) sort -k1,1 -k5,6g {params.multi_annot} | \ awk -vFS='\\t' -vOFS='\\t' '{{print $1,$2,$3,$4}}' | sort -k1,1 | \ awk -f {params.concat} | sed 's/ /\t/g' > {params.tmp} mv {params.tmp} {params.multi_annot} gzip -f {params.multi_annot} ## 2. Cluster annotations # The r script "clu_annot.r" distribute the Pfam annotation in the clusters, # creating two sets: "annotated_clusters" and "not_annotated_clusters" ./{params.cl_annotr} --pfam_annot {params.multi_annot}.gz \ --clusters {input.clu} \ --partial {input.partial} \ --output_annot {output.cl_annot} \ --output_noannot {output.cl_noannot} ## 3. Singleton annotations join -12 -21 <(sort -k2,2 {params.singl} ) \ <(sort -k1,1 {params.multi_annot}) > {params.s_annot} """ |
run:
    shell("echo 'CLUSTER PFAM ANNOTATION DONE'")
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 | shell: """ set -x set -e if [ {params.module} == "creation" ]; then DB={params.clseqdb} PARTIAL={params.partial} else DB={params.new_clseqdb} PARTIAL={params.new_partial} fi # Step performed after the cluster validation, to remove: # 1. bad clusters (≥ 10% bad-aligned ORFs) # 2. shadow clusters (≥ 30% shadow ORFs) # 3. single rejected ORFs (shadow, spurious and bad-aligned) if [[ ! -s {params.toremove} ]]; then # 1st step: from the validation results we already have a table with the good clusters: # good_clusters.tsv (in the directory with the validation results)= {input.good} # 2nd step: we remove from the good clusters those with ≥ 30% shadows join -11 -21 -v2 <(awk '$5>=0.3 && !seen[$2]++{{print $2}}' {input.sp_sh} | sort -k1,1) \ <( grep -v 'cl_name' {input.good} | awk '{{print $1}}' | sort -k1,1) > {params.good_noshadow} ## Retrieve the subdb with the ORFs of these clusters {params.mmseqs_bin} createsubdb <(awk '{{print $1}}' {params.good_noshadow}) ${{DB}} {params.tmpdb} # 3rd step: retrieve the single rejected/spurious/shadow ORFs and remove them from the remaining clusters ## retrieve the bad-aligned sequences in the set of good & no-shadow clusters join -11 -21 <(sort -k1,1 {input.cval_rej} ) <(sort -k1,1 {params.good_noshadow}) > {params.rej} ## Add the bad-aligned sequences to the spurious and shadows awk '$5<0.3 && $6!="FALSE"{{print $1}}' {input.sp_sh} > {params.toremove} awk '$5<0.3 && $7!="FALSE"{{print $1}}' {input.sp_sh} >> {params.toremove} awk '{{print $2}}' {params.rej} >> {params.toremove} fi if [[ -s {params.toremove} ]]; then # add cluster membership join -11 -21 <(sort {params.toremove}) \ <(awk '{{print $3,$1}}' {params.clu_info} | sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ > {params.toremove_cl} # remove the single orfs from the clusters with less than 10% bad-aligned ORFs and less than 30% shadows. if [[ ! 
-s {params.refdb} ]]; then {params.mpi_runner} {params.mmseqs_bin} apply {params.tmpdb} {params.refdb} --threads {threads} \ -- {params.toremove_sh} {params.toremove_cl} 2>>{log.err} 1>>{log.out} fi # Create tables with new seqs and new clusters for some stats and checking the numbers join -11 -21 <(awk '{{print $1}}' {params.refdb}.index | sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ <(awk '{{print $1,$3}}' {params.clu_info} \ | sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {params.tmp1} # Remove "bad" ORFs from the refined table join -12 -21 -v1 <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.tmp1}) \ <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.toremove}) > {params.tmp} && mv {params.tmp} {params.tmp1} else mv {params.good_noshadow} {params.tmp1} mv {params.tmpdb} {params.refdb} mv {params.tmpdb}.index {params.refdb}.index mv {params.tmpdb}.dbtype {params.refdb}.dbtype fi # From the refined clusters select the annotated and the not annotated for the following classification steps # annotated (check those left with no-annotated sequences): join with file with all annotated clusters (using the orfs) join -11 -21 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.tmp1}) \ <(awk '{{print $3,$4}}' {params.annot} \ | sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {output.ref_annot} # Reorder columns to have cl_name - orf - pfam_name awk -vOFS='\\t' '{{print $2,$1,$3}}' {output.ref_annot} > {params.tmp} && mv {params.tmp} {output.ref_annot} # Find in the refined annotated clusters, those left with no annotated ORFs sort -k1,1 --parallel={threads} -T {params.local_tmp} {output.ref_annot} \ | awk '!seen[$1,$3]++{{print $1,$3}}' \ | awk 'BEGIN{{getline;id=$1;l1=$1;l2=$2;}}{{if($1 != id){{print l1,l2;l1=$1;l2=$2;}}else{{l2=l2"|"$2;}}id=$1;}}END{{print l1,l2;}}' \ | grep -v "|" | awk '$2=="NA"{{print $1}}' > {params.new_noannot} if [[ -s {params.new_noannot} ]]; then # move the clusters left with no annotated member to the not annotated join -11 -21 -v1 <(awk '!seen[$1,$2,$3]++' {output.ref_annot} | sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ <(sort {params.new_noannot}) > {params.tmp} join -11 -21 <(awk '!seen[$1,$2,$3]++{{print $1,$2}}' {output.ref_annot} |\ sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ <(sort {params.new_noannot}) > {output.ref_noannot} # not annotated join -12 -21 <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.tmp1}) \ <(awk -vFS='\\t' '$4=="noannot"{{print $1,$2}}' {input.good} |\ sort -k1,1 --parallel={threads} -T {params.local_tmp}) >> {output.ref_noannot} mv {params.tmp} {output.ref_annot} sed -i 's/ /\t/g' {output.ref_annot} else # not annotated join -12 -21 <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.tmp1}) \ <(awk '$4=="noannot"{{print $1,$2}}' {input.good} |\ sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {output.ref_noannot} fi awk -vOFS='\\t' '{{print $1,$2}}' {output.ref_noannot} > {params.tmp} && mv {params.tmp} {output.ref_noannot} # Using the cluster ids retrieve the sub database for not annotated clusters {params.mmseqs_bin} createsubdb <(awk '!seen[$1]++{{print $1}}' {output.ref_noannot}) \ {params.refdb} \ {params.refdb_noannot} 2>>{log.err} 1>>{log.out} # Add ORF partial information to the refined cluster table # Colums partial: 1:orf|2:partial_info # Columns tmp3: 1:orf|2:partial_info|3:cl_name join -11 -21 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} ${{PARTIAL}}) \ <(sort -k1,1 
--parallel={threads} -T {params.local_tmp} {params.tmp1}) > {params.tmp2} # Add cluster size and ORF length # Columns clu_info: 1:cl_name|2:old_repr|3:orf|4:orf_len|5:cl_size # Columns intermediate: 1:orf|2:partial_info|3:cl_name|4:old_repr|5:orf_len|6:cl_size # Columns tmp4: 1:cl_name|2:orf|3:partial|4:cl_size|5:orf_len join -11 -21 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.tmp2}) \ <(awk '{{print $3,$2,$4,$5}}' {params.clu_info} \ | sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ | sort -k3,3 --parallel={threads} -T {params.local_tmp} \ | awk -vOFS='\\t' '{{print $3,$1,$2,$6,$5}}' > {params.tmp3} # Add new representatives from validation # Columns tmp5: 1:cl_name|2:orf|3:partial|4:cl_size|5:orf_len|6:new_repr join -11 -21 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.tmp3}) \ <(awk '{{print $1,$2}}' {input.val_res} | sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {params.tmp4} # Reorder columns: 1:cl_name|2:new_repres|3:orf|4:cl_size|5:orf_length|6:partial awk -vOFS='\\t' '{{print $1,$6,$2,$4,$5,$3}}' {params.tmp4} > {output.ref_clu} # cleaning the results... rm -rf {params.tmp1} {params.tmp2} {params.tmp3} {params.tmp4} rm -rf {params.toremove} {params.rej} rm -rf {params.new_noannot} {params.good_noshadow} rm -rf {params.tmpdb} {params.tmpdb}.index {params.tmpdb}.dbtype if [ {params.module} == "update" ]; then rm {params.new_partial} fi """ |
run:
    shell("echo 'CLUSTER REFINEMENT DONE'")
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | shell: """ set -e set -x ## 1. Pfam-clan info and multi-domain format for the ORF Pfam annotations # Download original dataset Pfam annotations if [[ ! -s {params.or_multi_annot} ]]; then wget https://ndownloader.figshare.com/files/23067146 -O {params.or_multi_annot} fi if [ ! -s {params.pfam_clan} ]; then echo "Dowloading Pfam-A clan information" wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.clans.tsv.gz -O {params.pfam_clan} fi # Add Pfam clan info to the Pfam annotation results join -11 -21 <( awk '{{print $1,$3,$8,$9}}' {input.annot} | sort -k1,1) \ <( zcat {params.pfam_clan} | awk -vFS='\\t' -vOFS='\\t' '{{print $4,$1,$2}}' \ | awk -vFS='\\t' -vOFS='\\t' '{{for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "no_clan"}};1' \ | sort -k1,1) > {params.multi_annot} if [[ {params.db_mode} == "memory" ]]; then rm {params.pfam_clan} fi # Reorder columns # the previous output is: pfam_name - orf - start - end - pfam_acc - pfam_clan # we want to get: orf - pfam_name - pfam_acc - pfam_clan - start - end awk -vOFS='\\t' '{{print $2,$1,$5,$6,$3,$4}}' {params.multi_annot} > {params.tmp} && mv {params.tmp} {params.multi_annot} # Multiple annotations on the same line, separated by “|” (the annotations were ordered first by alignment position) sort -k1,1 -k5,6g {params.multi_annot} | \ awk -vFS='\\t' -vOFS='\\t' '{{print $1,$2,$3,$4}}' | sort -k1,1 | \ awk -f {params.concat} | sed 's/ /\t/g' > {params.tmp} mv {params.tmp} {params.multi_annot} zcat {params.or_multi_annot} | sed 's/ /\\t/g' >> {params.multi_annot} # Gene completeness information combined # Download original dataset gene completion information if [[ ! -s {params.or_partial} ]]; then wget https://ndownloader.figshare.com/files/23067005 -O {params.or_partial} fi # Combine with new completion info join -11 -21 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {input.partial}) \ <( awk '{{print $3}}' {input.clu} | sort -k1,1 --parallel={threads}) > {params.partial} join -11 -21 <(zcat {params.or_partial} | \ sort -k1,1 --parallel={threads} -T {params.local_tmp}) \ <( awk '{{print $3}}' {input.clu} | sort -k1,1 --parallel={threads}) >> {params.partial} 2>>{log.err} sed -i 's/ /\\t/g' {params.partial} ## 2. Cluster annotations # The r script "clu_annot.r" distribute the Pfam annotation in the clusters, # creating two sets: "annotated_clusters" and "not_annotated_clusters" ./{params.cl_annotr} --pfam_annot {params.multi_annot} \ --clusters {input.clu} \ --partial {params.partial} \ --output_annot {output.cl_annot} \ --output_noannot {output.cl_noannot} 2>>{log.err} ## 3. Singleton annotations join -12 -21 <(sort -k2,2 {params.singl} ) \ <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.multi_annot}) > {params.s_annot} gzip -f {params.multi_annot} """ |
run:
    shell("echo 'CLUSTER PFAM ANNOTATION DONE'")
shell:
    """
    set -e
    set -x

    # Retrieve old cluster representatives information
    # Not annotated clusters
    join -11 -21 <(awk '{{print $1,$2}}' {input.cval} | sort -k1,1 --parallel={threads}) \
        <(awk '!seen[$1]++{{print $1,$2,"noannot",$4}}' {params.cl_noannot} | sort -k1,1 --parallel={threads}) > {params.val_annot}

    # Annotated clusters
    join -11 -21 <(awk '{{print $1,$2}}' {input.cval} | sort -k1,1 --parallel={threads}) \
        <(awk '!seen[$1]++{{print $1,$2,"annot",$4}}' {params.cl_annot} | sort -k1,1 --parallel={threads} ) >> {params.val_annot}

    awk -vOFS='\\t' '{{print $1,$2,$3,$5,$4}}' {params.val_annot} \
        > {params.tmp} && mv {params.tmp} {params.val_annot}

    # Combine with functional validation results
    ./{params.val_res} --fval_res {input.fval} \
        --cval_res {input.cval} \
        --val_annot {params.val_annot} \
        --val_res {output.val_res} \
        --val_stats {params.val_stats} \
        --good {output.good} \
        --plots {params.val_plots} 2>{log.err} 1>{log.out}

    rm -rf {params.val_annot} {params.tmp}
    """
run:
    shell("echo 'CLUSTER VALIDATION DONE'")
shell:
    """
    Rscript --vanilla {params.report_maker} --basedir {params.basedir} \
        --outdir {params.outdir} \
        --stage {params.stage} \
        --name {params.name_data} \
        --input {params.input_data} \
        --wf_report {params.wf_report} \
        --output {output.report} 2>{log.err} 1>{log.out}
    """
shell:
    """
    set -x
    set -e

    if [[ {params.stage} = "contigs" ]]; then
        {params.prodigal_bin} -i {input.contigs} -a {output.fa} -m -p {params.prodigal_mode} -f gff -o {params.gff_output} -q 2>{log.err} 1>{log.out}
        awk -f {params.rename_orfs} {output.fa} > {params.tmp} && mv {params.tmp} {output.fa}
        awk -f {params.partial_info} {params.gff_output} > {output.partial}
    elif [[ {params.stage} = "genes" ]]; then
        ln -sf {input.contigs} {output.fa}
        ln -sf {params.data_partial} {output.partial}
    elif [[ {params.stage} = "anvio_genes" ]]; then
        ln -sf {input.contigs} {output.fa}
        awk -vOFS="\\t" 'NR>1{{if($6==0) print $1,"00"; else print $1,"11";}}' {params.data_partial} > {output.partial}
    fi
    """
run:
    shell("echo 'GP DONE'")
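For reference, the "contigs" branch of the gene-prediction rule boils down to a Prodigal call plus a small parsing step that records, for each predicted gene, whether it is complete. The snippet below is a minimal standalone sketch of that idea, not the workflow's own scripts: file names are placeholders and the gene naming differs from what the rename_orfs script produces.

# Minimal standalone sketch (hypothetical file names), assuming Prodigal >= 2.6:
# predict genes in metagenomic mode and derive a two-digit completeness code per
# gene from the "partial=" GFF attribute ("00" = complete at both ends,
# "10"/"01" = truncated at one end, "11" = truncated at both ends).
prodigal -i contigs.fasta -a genes.faa -p meta -f gff -o genes.gff -q

awk -vFS='\t' -vOFS='\t' '!/^#/{
    match($9, /ID=[^;]+/);      id   = substr($9, RSTART+3, RLENGTH-3)
    match($9, /partial=[01]+/); code = substr($9, RSTART+8, RLENGTH-8)
    print $1"_"id, code         # contig_geneID and completeness code
}' genes.gff > genes_partial_info.tsv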
shell:
    """
    set -x
    set -e

    DIR=$(dirname {output.iclu_hmm})
    mkdir -p ${{DIR}}

    if [[ ! -s {params.or_clu_orig} ]]; then
        wget https://ndownloader.figshare.com/files/23066966 -O {params.or_clu_orig}
    fi

    # Summary table with cluster db origin (original/shared/new)
    join -11 -21 <(zcat {params.or_clu_orig} | awk '{{print $1,$2}}' | sort -k1,1 --parallel={threads} ) \
        <(awk '{{print $1,$3}}' {params.original} | sort -k1,1) > {params.clu_origin}

    join -11 -21 <(zcat {params.or_clu_orig} | awk '{{print $1,$2}}' | sort -k1,1 --parallel={threads} ) \
        <(awk '{{print $1,$3}}' {params.shared} | sort -k1,1) > {params.clu_origin}.temp

    awk -vN={params.data_name} '{{print $1,$2"_"N,$3}}' {params.clu_origin}.temp >> {params.clu_origin}
    awk -vN={params.data_name} '{{print $1,N,$3}}' {params.new} >> {params.clu_origin}
    rm {params.clu_origin}.temp

    sed -i 's/ /\t/g' {params.clu_origin}
    gzip {params.clu_origin}

    # Clean the mmseqs_clustering/ folder
    mv {params.all} {params.final}
    rm {params.original} {params.shared} {params.new} {params.new_cluseqdb}*

    # All gene headers and partiality information
    cat {params.partial} <(zcat {params.or_partial}) | gzip > {params.ipartial}

    # Spurious and shadow genes information:
    gzip -c {params.sp_sh} > {params.isp_sh}

    # All gene Pfam annotations:
    cp {params.multi_annot} {params.imulti_annot}

    # All cluster category annotation files
    ODIR=$(dirname {params.or_clu_cat})
    NDIR=$(dirname {input.clu_cat})

    # Download original dataset category annotations
    if [[ ! -s ${{ODIR}}/K_annotations.tsv.gz ]]; then
        wget https://ndownloader.figshare.com/files/23063648 -O ${{ODIR}}/K_annotations.tsv.gz
        wget https://ndownloader.figshare.com/files/23067074 -O ${{ODIR}}/KWP_annotations.tsv.gz
        wget https://ndownloader.figshare.com/files/23067080 -O ${{ODIR}}/GU_annotations.tsv.gz
    fi

    # Combine with new ones
    cat <(zcat ${{ODIR}}/K_annotations.tsv.gz) ${{NDIR}}/K_annotations.tsv | gzip > {params.idir}/K_annotations.tsv.gz
    cat <(zcat ${{ODIR}}/KWP_annotations.tsv.gz) ${{NDIR}}/KWP_annotations.tsv | gzip > {params.idir}/KWP_annotations.tsv.gz
    cat <(zcat ${{ODIR}}/GU_annotations.tsv.gz) ${{NDIR}}/GU_annotations.tsv | gzip > {params.idir}/GU_annotations.tsv.gz
    # rm ${{ODIR}}/*_annotations.tsv.gz

    # Integrated set of cluster categories
    # Download original gene cluster category info
    if [[ ! -s {params.or_clu_cat} ]]; then
        wget https://ndownloader.figshare.com/files/23067140 -O {params.or_clu_cat}
    fi
    # and the cluster genes
    # Download original gene cluster gene info
    if [[ ! -s {params.or_clu_gene} ]]; then
        wget https://ndownloader.figshare.com/files/24121865 -O {params.or_clu_gene}
    fi

    if [[ {params.eval_shared} == "true" ]]; then
        # GC-categories
        join -11 -21 -v1 <(zcat {params.or_clu_cat} |\
            sort -k1,1 --parallel={threads} -T {params.local_tmp}) \
            <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {input.clu_cat}) > {params.tmpl2}
        sed -i 's/ /\\t/g' {params.tmpl2}
        cat {params.tmpl2} {input.clu_cat} | gzip > {output.iclu_cat}

        # GC-genes-categories: get the new genes in the good clusters (shared)
        # remove shared GCs from the old table
        join -11 -21 -v1 <(zcat {params.or_clu_gene} |\
            sort -k1,1 --parallel={threads} -T {params.local_tmp}) \
            <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {input.clu_cat}) > {params.tmpl1}
        sed -i 's/ /\\t/g' {params.tmpl1}
        cat {params.tmpl1} <(zcat {params.clu_gene}) | gzip > {params.iclu_gene}
        rm {params.tmpl1}
    else
        # GC-categories
        cat <(zcat {params.or_clu_cat}) {input.clu_cat} | gzip > {output.iclu_cat}

        # GC-genes-categories: get the new genes in the good clusters (shared)
        ### double join, first cluDB_info, then orf_seqs.txt to get the new ones only
        join -11 -21 <(zcat {params.or_clu_cat} | sort -k1,1) \
            <(awk '{{print $1,$3}}' {params.clu_info} \
            | sort -k1,1 --parallel={threads} -T {params.local_tmp} ) > {params.tmpl1}
        join -13 -21 <(sort -k3,3 --parallel={threads} -T {params.local_tmp} {params.tmpl1}) \
            <(sort -k1,1 <(awk '{{print $1}}' {params.partial})) > {params.tmpl2}

        # add to original clu genes and new clu genes
        cat <(awk -vOFS='\\t' '{{print $2,$3,$1}}' {params.tmpl2}) \
            <(zcat {params.or_clu_gene}) > {params.tmpl1}
        cat {params.tmpl1} <(zcat {params.clu_gene}) | gzip > {params.iclu_gene}
        rm {params.tmpl1} {params.tmpl2}
    fi

    if [ {params.singl} == "true" ]; then
        if [[ ! -s {params.or_singl} ]]; then
            wget https://figshare.com/ndownloader/files/31012435 -O {params.or_singl}
        fi
        join -11 -21 <(zcat {params.clu_origin}.gz | awk '$3==1{{print $1}}' |\
            sort -k1,1 --parallel={threads} -T {params.local_tmp} ) \
            <(zcat {params.or_singl} | sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {params.tmpl1}
        cat <(sed 's/ /\t/g' {params.tmpl1}) \
            <( awk -vOFS="\\t" '{{print $2,$3,$1}}' {params.s_categ}) > {params.singl_cl_gene_categ}
        gzip {params.singl_cl_gene_categ}
    fi

    # Integrated set of cluster communities
    # to avoid having overlapping community names, the dataset origin is appended to the name
    if [[ ! -s {params.or_clu_com} ]]; then
        wget https://ndownloader.figshare.com/files/23067134 -O {params.or_clu_com}
    fi

    ./{params.integr_comm} --name {params.data_name} \
        --comm {input.clu_com} \
        --ocomm {params.or_clu_com} \
        --icomm {output.iclu_com} \
        --shared {params.eval_shared} 2>>{log.err} 1>>{log.out}

    gzip {output.iclu_com}

    # Integrated cluster summary information
    if [[ ! -s {params.or_clu_stats} ]]; then
        wget https://figshare.com/ndownloader/files/31003162 -O {params.or_clu_stats}
    fi

    if [[ {params.eval_shared} == "true" ]]; then
        join -11 -21 -v1 <(zcat {params.or_clu_stats} | sort -k1,1 --parallel={threads} -T {params.local_tmp}) \
            <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {input.clu_stats}) > {params.tmpl1}
        sed -i 's/ /\\t/g' {params.tmpl1}
        cat {input.clu_stats} {params.tmpl1} | gzip > {output.iclu_stats}
    else
        cat {input.clu_stats} <(zcat {params.or_clu_stats} | awk -vOFS='\\t' 'NR>1{{print $0}}') | gzip > {output.iclu_stats}
    fi

    # Integrated set of high quality (HQ) clusters
    if [[ ! -s {params.or_hq_clu} ]]; then
        wget https://ndownloader.figshare.com/files/23067137 -O {params.or_hq_clu}
    fi

    if [[ {params.eval_shared} == "true" ]]; then
        join -11 -21 -v1 <(zcat {params.or_hq_clu} | sort -k1,1 ) \
            <(sort -k1,1 {input.hq_clu}) > {params.tmpl1}
        sed -i 's/ /\\t/g' {params.tmpl1}
        cat {input.hq_clu} {params.tmpl1} | gzip > {output.ihq_clu}
    else
        cat {input.hq_clu} <(zcat {params.or_hq_clu} | awk -vOFS='\\t' 'NR>1{{print $0}}' ) | gzip > {output.ihq_clu}
    fi

    # New integrated cluster HMMs DB (for MMseqs profile searches)
    # Download and uncompress the existing GC profiles
    if [[ ! -s {params.or_clu_hhm} ]]; then
        wget https://figshare.com/ndownloader/files/30998305 -O {params.or_profiles}.tar.gz
        tar -C {params.or_dir} -xzvf {params.or_profiles}.tar.gz
        rm {params.or_profiles}.tar.gz
    fi

    if [[ {params.eval_shared} == "true" ]]; then
        {params.mmseqs_bin} createsubdb <(awk '{{print $1}}' {params.tmpl2}) \
            {params.or_clu_hhm} {params.or_clu_hhm}.left

        {params.mmseqs_bin} concatdbs {input.clu_hhm} {params.or_clu_hhm}.left {params.iclu_hhm} \
            --threads 1 --preserve-keys 2>{log.err}

        # Create a comprehensive profile cluster DB in MMseqs format to perform profile searches
        {params.mmseqs_bin} convertprofiledb {params.iclu_hhm} {output.iclu_hmm} \
            --threads {threads} 2>{log.err}

        rm {params.or_clu_hhm}.left*
    else
        {params.mmseqs_bin} concatdbs {input.clu_hhm} {params.or_clu_hhm} {params.iclu_hhm} \
            --threads 1 --preserve-keys 2>{log.err}

        # Create a comprehensive profile cluster DB in MMseqs format to perform profile searches
        {params.mmseqs_bin} convertprofiledb {params.iclu_hhm} {output.iclu_hmm} \
            --threads {threads} 2>{log.err}
    fi
    """
run:
    shell("echo 'INTEGRATED DB DONE'")
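Once the integrated profile DB (the convertprofiledb output above) exists, it can serve as the target of an MMseqs2 profile search. A hedged usage sketch with made-up paths and example thresholds; the workflow's own profile-search step may use different parameters:

# Hypothetical usage of the integrated GC profile DB (paths and thresholds are examples)
mmseqs createdb new_genes.fasta new_genes_db
mmseqs search new_genes_db clu_hmm_db search_res mmseqs_tmp \
    -e 1e-20 --cov-mode 2 -c 0.6 --threads 16
mmseqs convertalis new_genes_db clu_hmm_db search_res search_res.tsv --threads 16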
shell:
    """
    set -e
    set -x

    # Create sequences database for the clustering
    {params.mmseqs_bin} createseqfiledb {params.seqdb} {params.cludb} {params.cluseqdb} --threads {threads} 2>{log.err}

    # To convert this cluster results tab separated file in wide format (repres member member member ..)
    awk -f {params.awk_wide} {input.clu} > {params.wide}

    # Cluster name, rep and number of ORFs (size)
    awk -vFS='\\t' -vOFS='\\t' '{{print NR,$1,NF-1}}' {params.wide} > {params.info1}

    if [[ -s {params.cludb}.0 ]]; then
        {params.concat_clu} {params.cludb} {threads}
    fi
    if [[ -s {params.cluseqdb}.0 ]]; then
        {params.concat_clu} {params.cluseqdb} {threads}
    fi

    # Cluster naming
    # The official cluster names are going to be based on the line number of the wide formatted file
    # We are also going to produce a correspondence file to access the clusters in the MMseqs2 indices
    # Retrieving MMseqs-entry-name and sequence header from the clu_seqDB
    {params.mpi_runner} {params.mmseqs_bin} apply {params.cluseqdb} {params.namedb} -- {params.naming} --threads {threads}

    # Join the sequences with the representatives: this way we combine mmseqs-name with cluster-name (row number)
    join -12 -22 <(sed -e 's/\\x0//g' {params.namedb} | sort -k2,2 --parallel={threads} -T {params.local_tmp}) \
        <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.info1} ) > {output.index}.tmp

    awk -vOFS='\\t' '{{print $2,$3}}' {output.index}.tmp > {output.index}

    # Rename the cluster sequence DB with the cluster names that are going to be used in the next steps
    {params.mmseqs_bin} renamedbkeys {output.index} {params.cluseqdb} {params.cluseqdb}.new
    mv {params.cluseqdb}.new {params.cluseqdb}
    mv {params.cluseqdb}.new.index {params.cluseqdb}.index
    mv {params.cluseqdb}.new.dbtype {params.cluseqdb}.dbtype

    # Join with the long format cluster file
    join -11 -22 <(sort --parallel={threads} -T {params.local_tmp} -k1,1 {input.clu} ) \
        <(sort --parallel={threads} -T {params.local_tmp} -k2,2 {params.info1} ) > {params.tmp}

    # Add length info
    sed -e 's/\\x0//g' {params.cluseqdb} | {params.seqtk_bin} comp | cut -f1,2 > {params.length}

    join -11 -22 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.length}) \
        <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.tmp}) > {output.clu_info}

    # Reorder fields (cl_name rep orf length size)
    sort -k4,4n --parallel={threads} -T {params.local_tmp} {output.clu_info} | \
        awk -vOFS='\\t' '{{print $4,$3,$1,$2,$5}}' > {params.tmp}
    cp {params.tmp} {output.clu_info}

    # Subset the clusters based on their size
    # Clusters with more than 1 member
    awk -vOFS='\\t' '$3>=2{{print $1,$2}}' {params.info1} > {params.tmp1}
    join -12 -21 <(sort -k2,2 --parallel={threads} -T {params.local_tmp} {params.tmp1}) \
        <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {input.clu} ) > {params.tmp}
    awk -vOFS='\\t' '{{print $2,$1,$3}}' {params.tmp} > {output.clusters}

    # Singletons
    awk -vOFS='\\t' '$3=="1"{{print $1,$2}}' {params.info1} > {output.singl}

    rm {params.tmp} {params.tmp1} {params.namedb} {params.namedb}.index {params.namedb}.dbtype {output.index}.tmp
    rm {params.wide} {params.length}
    """
run:
    shell("echo 'CLUSTERING RESULTS DONE'")
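The long-to-wide conversion performed by the awk_wide script is what allows cluster names to be assigned by line number. Below is a minimal stand-in (not the workflow's own script), assuming the usual "representative <TAB> member" layout of the mmseqs createtsv output:

# Collapse "representative <TAB> member" pairs into one line per cluster:
# the representative followed by all of its members.
awk -vFS='\t' -vOFS='\t' '{
    members[$1] = ($1 in members) ? members[$1] OFS $2 : $2
} END {
    for (rep in members) print rep, members[rep]
}' cluster_pairs.tsv > clusters_wide.tsv

Numbering the lines of that wide table (as done above with awk '{{print NR,$1,NF-1}}') then yields the cluster name, its representative and its size.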
shell:
    """
    set -x
    set -e

    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    {params.mmseqs_bin} createdb {input.orfs} {params.seqdb} 2>{log.err} 1>{log.out}

    {params.mmseqs_bin} cluster \
        {params.seqdb} \
        {params.cludb} \
        {params.mmseqs_tmp} \
        --local-tmp {params.mmseqs_local_tmp} \
        --threads {threads} \
        -c {params.mmseqs_cov} \
        --cov-mode {params.mmseqs_cov_mode} \
        --min-seq-id {params.mmseqs_id} \
        -s {params.mmseqs_ens} \
        --mpi-runner "{params.mmseqs_mpi_runner}" 2>>{log.err} 1>>{log.out}

    {params.mmseqs_bin} createtsv {params.seqdb} {params.seqdb} {params.cludb} {output.clu}
    """
run:
    shell("echo 'MMSEQS2 CLUSTERING DONE'")
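Outside Snakemake, the same cascaded clustering can be reproduced with three MMseqs2 calls. The thresholds below are only examples; the actual values come from config.yaml (mmseqs_id, mmseqs_cov, mmseqs_cov_mode, mmseqs_ens):

# Hypothetical standalone run of the clustering step (example thresholds)
mmseqs createdb orfs.fasta seq_db
mmseqs cluster seq_db clu_db mmseqs_tmp \
    --min-seq-id 0.3 -c 0.8 --cov-mode 0 -s 5.0 --threads 16
# Long-format table: representative <TAB> member
mmseqs createtsv seq_db seq_db clu_db clusters.tsv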
shell:
    """
    set -x
    set -e

    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    # Create directory for original data to be integrated
    mkdir -p {params.or_dir}

    # Download the original mmseqs clustering that you want to use for the integration
    if [[ ! -s {params.or_cludb}.index ]]; then
        wget https://ndownloader.figshare.com/files/23066651 -O {params.or_dir}/mmseqs_clustering.tar.gz
        tar -C {params.or_dir} -zxvf {params.or_dir}/mmseqs_clustering.tar.gz
        rm {params.or_dir}/mmseqs_clustering.tar.gz
    fi

    {params.mmseqs_bin} createdb {input.new_seqs} {params.new_seqdb} 2>{log.err} 1>{log.out}

    # Create symbolic links between original cluDB and seqDB (required by mmseqs to run the update)
    ln -sf {params.or_seqdb}_h {params.or_cludb}_h
    ln -sf {params.or_seqdb}_h.index {params.or_cludb}_h.index
    ln -sf {params.or_seqdb}_h.dbtype {params.or_cludb}_h.dbtype
    ln -sf {params.or_seqdb}.lookup {params.or_cludb}.lookup

    # Concat new and old seq DBs
    {params.mmseqs_bin} concatdbs {params.or_seqdb} {params.new_seqdb} \
        {params.conc_seqdb} \
        --threads 1 2>>{log.err} 1>>{log.out}

    {params.mmseqs_bin} concatdbs {params.or_seqdb}_h {params.new_seqdb}_h \
        {params.conc_seqdb}_h \
        --threads 1 2>>{log.err} 1>>{log.out}

    # Update the clusterDB:
    {params.mmseqs_bin} clusterupdate \
        {params.or_seqdb} \
        {params.conc_seqdb} \
        {params.or_cludb} \
        {params.updt_seqdb} \
        {params.updt_cludb} \
        {params.mmseqs_tmp} \
        --local-tmp {params.mmseqs_local_tmp} \
        --mpi-runner "{params.mmseqs_mpi_runner}" \
        --threads {threads} \
        -c {params.mmseqs_cov} \
        --cov-mode {params.mmseqs_cov_mode} \
        --min-seq-id {params.mmseqs_id} \
        -s {params.mmseqs_ens} 2>>{log.err} 1>>{log.out}

    {params.mmseqs_bin} createtsv \
        {params.updt_seqdb} \
        {params.updt_seqdb} \
        {params.updt_cludb} \
        {output.updt_clu} \
        --threads {threads} 2>>{log.err} 1>>{log.out}

    #mv {params.conc_seqdb} {params.updt_seqdb}
    #mv {params.conc_seqdb}.dbtype {params.updt_seqdb}.dbtype
    #mv {params.conc_seqdb}_h {params.updt_seqdb}_h
    #mv {params.conc_seqdb}_h.dbtype {params.updt_seqdb}_h.dbtype
    #rm {params.conc_seqdb}*
    """
run:
    shell("echo 'MMSEQS2 CLUSTERING DONE'")
shell:
    """
    set -e
    set -x

    mkdir -p {params.mmseqs_tmp}

    # Create sequences database for the clustering
    # This DB can be accessed with ffindex, to extract separate fasta files for each cluster and perform operations on them
    # Ex: ffindex_apply_mpi cluseqdb cluseqdb.index -- your_program/script
    if [[ ! -s {params.cluseqdb}.index ]]; then
        {params.mmseqs_bin} createseqfiledb {params.seqdb} {params.cludb} {params.cluseqdb} --threads {threads} 2>{log.err}

        #Check if sequences contain "*", and if yes, clean them using the following command
        #{params.mpi_runner} {params.mmseqs_bin} apply \
        #    {params.cluseqdb} \
        #    {params.cluseqdb}_clean \
        #    --threads {threads} \
        #    -- {params.clean_seqs} 2>>{log.err}
    fi

    if [[ -s {params.cludb}.0 ]]; then
        {params.concat_clu} {params.cludb} {threads}
    fi
    if [[ -s {params.cluseqdb}.0 ]]; then
        {params.concat_clu} {params.cluseqdb} {threads}
    fi

    # To convert this cluster results tab separated file in wide format (repres member member member ..)
    awk -f {params.awk_wide} <(awk '!seen[$2]++' {input.clu}) > {params.wide} 2>>{log.err}

    # Cluster name, rep and number of ORFs (size)
    awk -vFS='\\t' -vOFS='\\t' '{{print NR,$1,NF-1}}' {params.wide} > {params.info1}

    # Join with the original clustering using the representative sequences
    join -12 -22 <(sort -k2,2 {params.orig_info1}) \
        <(sort -k2,2 {params.info1}) > {params.rename_or} 2>>{log.err}
    join -12 -22 -v2 <(sort -k2,2 {params.orig_info1}) \
        <(sort -k2,2 {params.info1}) > {params.rename_new} 2>>{log.err}

    # Store the information about changes in the original Clusters
    sort -k2,2n {params.rename_or} | awk -vOFS='\\t' '{{print $2,$1,$3,$5}}' > {params.or_new_info}

    # Order and re-join old and new cluster ids/NAMES (the new ids will start from the last of the original clustering)
    OR=$(wc -l {params.or_new_info} | cut -d ' ' -f1)
    awk -vOR=$OR -vOFS='\\t' '!seen[$0]++{{print NR+OR,$1,$3}}' {params.rename_new} > {params.new_info1} 2>>{log.err}

    cat <(awk -vOFS='\\t' '!seen[$0]++{{print $1,$2,$4}}' {params.or_new_info} ) \
        {params.new_info1} > {params.info1}

    # Cluster naming
    # The official cluster names are going to be based on the line number of the wide formatted file
    # We are also going to produce a correspondence file to access the clusters in the MMseqs2 indices
    # Retrieving MMseqs-entry-name and sequence header from the clu_seqDB
    if [[ ! -s {params.index} ]]; then
        {params.mpi_runner} {params.mmseqs_bin} apply {params.cluseqdb} {params.namedb} --threads {threads} -- {params.naming}

        # Join the sequences with the representatives: this way we combine mmseqs-name with cluster-name (row number)
        join -12 -22 <(sed -e 's/\\x0//g' {params.namedb} | sort -k2,2 --parallel={threads} -T {params.local_tmp}) \
            <(sort -k2,2 --parallel={threads} -T {params.mmseqs_tmp} {params.info1} ) > {params.index}

        awk -vOFS='\\t' '!seen[$0]++{{print $2,$3}}' {params.index} > {params.tmp} && mv {params.tmp} {params.index}
    fi

    # Rename the cluster sequence DB with the cluster names that are going to be used in the next steps
    {params.mmseqs_bin} renamedbkeys {params.index} {params.cluseqdb} {params.cluseqdb}.new
    mv {params.cluseqdb}.new {params.cluseqdb}
    mv {params.cluseqdb}.new.index {params.cluseqdb}.index
    mv {params.cluseqdb}.new.dbtype {params.cluseqdb}.dbtype

    # Join with the long format cluster file
    join -11 -22 <(sort --parallel={threads} -T {params.local_tmp} -k1,1 {input.clu}) \
        <(sort --parallel={threads} -T {params.local_tmp} -k2,2 {params.info1} ) > {params.tmp1}

    # Add length info
    sed -e 's/\\x0//g' {params.cluseqdb} | {params.seqtk_bin} comp | cut -f1,2 > {params.length}

    join -11 -22 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.length}) \
        <(sort -k2,2 --parallel={threads} -T {params.mmseqs_tmp} {params.tmp1}) > {params.tmp}

    # Reorder fields (cl_name rep orf length size)
    sort -k4,4n --parallel={threads} -T {params.local_tmp} {params.tmp} | \
        awk -vOFS='\\t' '{{print $4,$3,$1,$2,$5}}' > {params.clu_info}

    ## Retrieve the different cluster sets:
    # 1. Only-original clusters: cluDB_original_name_rep_size.tsv
    awk -vOFS='\\t' '$3==$4{{print $1,$2,$3}}' {params.or_new_info} | awk '!seen[$0]++' > {params.original}

    # 2. Shared original-new clusters: cluDB_shared_name_rep_size.tsv
    awk -vOFS='\\t' '$3!=$4{{print $1,$2,$4}}' {params.or_new_info} | awk '!seen[$0]++' > {params.shared}

    # 3. Only-new clusters: cluDB_new_name_rep_size.tsv
    # we already have this set: {params.new_info1}

    ## We are going to re-validate and classify the clusters in the new_cluDB and, if requested, the shared GCs
    if [[ {params.eval_shared} == "true" ]]; then
        if [[ {params.shared_type} == "all" ]]; then
            # Evaluate all the shared GCs (GCs that got integrated with new genes)
            cat {params.new_info1} \
                <(awk -vFS='\\t' -vOFS='\\t' '!seen[$0]++{{print $1,$2,$3}}' {params.shared} ) > {params.info2}
        fi
        if [[ {params.shared_type} == "discarded" ]]; then
            # Add the previously discarded shared GCs
            # download if not there the original cluster - category table
            if [[ ! -s {params.or_clu_cat} ]]; then
                wget https://ndownloader.figshare.com/files/23067140 -O {params.or_clu_cat}
            fi
            # anti join shared GCs with the original cluster-category to get the discarded GCs
            join -11 -21 -v1 <(sort -k1,1 --parallel={threads} -T {params.local_tmp} {params.shared}) \
                <(zcat {params.or_clu_cat} | awk '{{print $1}}' |\
                sort -k1,1 --parallel={threads} -T {params.local_tmp}) > {params.shared}.temp
            cat {params.new_info1} \
                <(awk -vFS='\\t' -vOFS='\\t' '!seen[$0]++{{print $1,$2,$3}}' {params.shared}.temp ) > {params.info2}
            rm {params.shared}.temp
        fi
        if [[ {params.shared_type} == "increase" ]]; then
            # Or the shared GCs with more than a given % of new genes
            # Ex: increase of 30% in number of sequences
            awk '{{print $1,$2,$4,($4-$3)/$3}}' {params.or_new_info} > {params.or_new_info}.temp
            cat {params.new_info1} \
                <(awk -vFS='\\t' -vOFS='\\t' '!seen[$0]++ && $4>=0.3{{print $1,$2,$3}}' {params.or_new_info}.temp ) > {params.info2}
            rm {params.or_new_info}.temp
        fi
    else
        # Evaluate also the previous singletons now integrated in GCs with new genes
        cat {params.new_info1} \
            <(awk -vFS='\\t' -vOFS='\\t' '$3==1 && $4>1 && !seen[$0]++{{print $1,$2,$4}}' {params.or_new_info} ) > {params.info2}
    fi

    # Subset the cluseqDB:
    if [[ ! -s {params.new_cluseqdb} ]]; then
        {params.mmseqs_bin} createsubdb <(awk '{{print $1}}' {params.info2}) \
            {params.cluseqdb} \
            {params.new_cluseqdb} 2>>{log.err}
    fi

    # Subset the clusters based on their size
    # Clusters with more than 1 member
    awk -vOFS='\\t' '$3>=2{{print $1,$2}}' {params.info2} > {params.tmp1}
    join -12 -21 <(sort -k2,2 {params.tmp1}) \
        <(sort -k1,1 --parallel={threads} -T {params.local_tmp} <(awk '!seen[$2]++' {input.clu} )) > {params.tmp}
    awk -vOFS='\\t' '!seen[$0]++{{print $2,$1,$3}}' {params.tmp} > {output.clusters}

    # Singletons
    awk -vOFS='\\t' '$3=="1"{{print $1,$2}}' {params.info2} | awk '!seen[$0]++' > {output.singl}

    rm {params.tmp} {params.tmp1} {params.namedb} {params.namedb}.index {params.namedb}.dbtype
    rm {params.wide} {params.length} {params.rename_or} {params.rename_new}
    """
run:
    shell("echo 'CLUSTERING RESULTS DONE'")
shell:
    """
    set -x
    set -e

    DB=$(basename {params.orig_db})

    if [[ ${{DB}} == "agnostosDB" ]]; then
        # Download agnostosDB ecological analysis data
        wget https://ndownloader.figshare.com/files/23066879 -O ecological.tar.gz
        tar -xzvf ecological.tar.gz

        # Download agnostosDB phylogenetic analysis data
        wget https://ndownloader.figshare.com/files/23066864 -O phylogenetic.tar.gz
        tar -xzvf phylogenetic.tar.gz

        # Download agnostosDB experimental analysis data
        # NOTE: the URL below duplicates the phylogenetic archive and likely needs to be replaced with the experimental data file
        wget https://ndownloader.figshare.com/files/23066864 -O experimental.tar.gz
        tar -xzvf experimental.tar.gz
    fi

    # Run R script to retrieve a general table containing a summary of the integration
    # add contextual data if the original DB is the agnostosDB
    awk -vOFS='\\t' '{{split($1,a,"_\\\+|_-"); print a[1],$1}}' {input.genes} > {params.contig}

    ./{params.parser} --clu_or {params.clu_origin} \
        --contig {params.contig} \
        --cat {input.cat} \
        --clu_info {params.clu_info} \
        --name {params.new_data_name} \
        --comm {params.comm} \
        --hq_clu {params.hq_clu} \
        --k_annot {params.k_annot} \
        --kwp_annot {params.kwp_annot} \
        --gu_annot {params.gu_annot} \
        --orig_db {params.orig_db} \
        --is_singl {params.singl} \
        --s_categ {params.singl_cat} \
        --anvio {params.data_stage} \
        --threads {threads} 2>{log.err} 1>{log.out}
    """
run:
    shell("echo 'Parsing of results DONE'")
shell:
    """
    set -x
    set -e

    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    # Pfam database
    if [ ! -s {params.pfamdb} ]; then
        DB=$(dirname {params.pfamdb})
        mkdir -p ${{DB}}
        echo "Downloading Pfam-A database"
        wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.hmm.gz -O {params.pfamdb}.gz
        gunzip {params.pfamdb}.gz
    fi

    NPFAM=$(grep -c '^NAME' {params.pfamdb})
    NSEQS=$(grep -c '^>' {input.gp})
    N=$(($NSEQS * $NPFAM))

    if [ ! -s {params.annot} ]; then
        # Run hmmsearch (MPI-mode)
        {params.mpi_runner} {params.hmmer_bin} --mpi --cut_ga -Z "${{N}}" --domtblout {params.hmmout} -o {params.hmmlog} {params.pfamdb} {input.gp} 2>{log.err} 1>{log.out}

        # Collect the results
        grep -v '^#' {params.hmmout} > {params.annot}
    fi

    # Parse the results
    {params.parser} {params.annot} {params.evalue} {params.coverage} > {output.pf_annot} 2>>{log.err}

    # remove the search log files
    rm {params.hmmout} {params.hmmlog}
    # remove the unparsed pfam_annotations
    rm {params.annot}

    if [[ {params.db_mode} == "memory" ]]; then
        rm {params.pfamdb}
    fi
    """
run:
    shell("echo 'ANNOTATION DONE'")
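The parser applied to the hmmsearch output filters domain hits by e-value and profile coverage. Below is a rough, hypothetical equivalent of that filtering (the workflow's own parser script may apply different rules), based on the standard --domtblout column layout:

# Keep hits with independent e-value (column 13) <= 1e-5 and >= 40% coverage of
# the Pfam profile (columns 16-17 = hmm_from/hmm_to, column 6 = profile length).
# Thresholds are examples standing in for the evalue/coverage workflow parameters.
awk -v emax=1e-5 -v cmin=0.4 '!/^#/ {
    pcov = ($17 - $16 + 1) / $6
    if (($13 + 0) <= (emax + 0) && pcov >= (cmin + 0))
        print $1, $4, $13, pcov     # gene, Pfam domain, i-Evalue, profile coverage
}' pfam_annotations_raw.tsv > pfam_annotations_parsed.tsv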
shell:
    """
    set -x
    set -e

    export OMPI_MCA_btl=^openib
    export OMP_NUM_THREADS={threads}
    export OMP_PROC_BIND=FALSE

    # 1. Detection of spurious ORFs
    if [ ! -s {params.antifamdb} ]; then
        echo "Downloading AntiFam database"
        DB=$(dirname {params.antifamdb})
        wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/AntiFam/current/Antifam.tar.gz -P ${{DB}}
        tar xvfz ${{DB}}/Antifam.tar.gz --directory ${{DB}}
        rm ${{DB}}/Antifam.tar.gz
        {params.hmmpress} {params.antifamdb}
    fi

    NANTI=$(grep -c '^NAME' {params.antifamdb})
    NSEQS=$(grep -c '^>' {input.fasta})
    N=$(($NSEQS * $NANTI))

    # Run hmmsearch (MPI mode)
    {params.mpi_runner} {params.hmmer_bin} --mpi --cut_ga -Z "${{N}}" --domtblout {params.hmmout} -o {params.hmmlog} {params.antifamdb} {input.fasta} 2>{log.err} 1>{log.out}

    if [[ {params.db_mode} == "memory" ]]; then
        DB=$(dirname {params.antifamdb})
        rm -rf ${{DB}}/AntiFam*
    fi

    # Parse the results
    grep -v '^#' {params.hmmout} > {params.spur}.tmp || true > {params.spur}.tmp 2>>{log.err}

    rm {params.hmmout} {params.hmmlog}

    if [[ -s {params.spur}.tmp ]]; then
        awk '$13<=1e-05 && !seen[$1]++{{print $1}}' {params.spur}.tmp > {params.spur} 2>>{log.err}
        rm {params.spur}.tmp
    else
        mv {params.spur}.tmp {params.spur}
    fi

    # 2. Detection of shadow ORFs
    ./{params.shadowr} --orfs {params.partial} \
        --shadows {params.all_shad} \
        --threads {threads} 2>>{log.err}

    ## 2.1 Parsing of results adding cluster information
    ## Add cluster info on the first column of ORFs
    join -12 -23 <(sort --parallel={threads} -k2,2 {params.all_shad}) \
        <(sort --parallel={threads} -T {params.local_tmp} -k3,3 {params.clu_info} ) > {params.tmp1}
    ## reorder columns...
    awk -vOFS='\\t' '{{print $6,$7,$8,$9,$1,$11,$12,$13,$3,$4,$5,$2,$10}}' {params.tmp1} > {params.tmp2} && mv {params.tmp2} {params.tmp1}

    ## Add cluster info on the second column of ORFs
    join -11 -23 <(sort --parallel={threads} -k1,1 {params.tmp1}) \
        <(sort --parallel={threads} -T {params.local_tmp} -k3,3 {params.clu_info} ) > {params.tmp2}
    ## reorder columns...
    awk -vOFS='\\t' '{{print $1,$14,$15,$16,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}}' {params.tmp2} > {params.tmp1}

    ### 2.2 Filter to get only the real shadow ORFs
    ### print ORFs and "TRUE" or "BOTH" if they fall in the definition of shadows
    awk -vFS='\\t' -vOFS='\\t' '{{if($4>$11) print $8,"TRUE"; else if($11>$4) print $1,"TRUE"; else if($4==$11 && $3<$10) print $8,"TRUE"; else if($4==$11 && $3>$10) print $1,"TRUE";}}' {params.tmp1} > {params.tmp2}
    awk -vFS='\\t' -vOFS='\\t' '{{if($4==$11 && $3==$10) print $1,"BOTH";}}' {params.tmp1} >> {params.tmp2}
    awk -vFS='\\t' -vOFS='\\t' '{{if($4==$11 && $3==$10) print $8,"BOTH";}}' {params.tmp1} >> {params.tmp2}

    ### re-join with the cluster information table
    join -13 -21 -a1 -a2 <(sort --parallel={threads} -T {params.local_tmp} -k3,3 {params.clu_info} ) \
        <(sort --parallel={threads} -k1,1 <(awk '!seen[$0]++' {params.tmp2}) ) > {params.only_shad}

    ### Add a "FALSE" to the non-shadow ORFs
    awk -vOFS='\\t' '!seen[$0]++{{if($6=="TRUE" || $6=="BOTH") print $2,$1,$3,$4,$5,$6; else print $2,$1,$3,$4,$5,"FALSE";}}' {params.only_shad} > {params.tmp1}
    mv {params.tmp1} {params.only_shad}

    # 3. Combine spurious and shadow information
    join -12 -21 -a1 <(sort --parallel={threads} -k2,2 {params.only_shad}) \
        <(awk '{{print $1,"TRUE"}}' {params.spur} | sort --parallel={threads} -k1,1) > {params.tmp1}

    awk '{{if($7=="TRUE") print $1,$4,$2,$3,$5,$6,$7; else print $1,$4,$2,$3,$5,$6,"FALSE";}}' {params.tmp1} | \
        awk '{{print $1,$2,$3,$5,$6,$7}}' > {output.sp_sh}.tmp

    # calculate proportion of shadows per cluster
    grep 'TRUE\|BOTH' {params.only_shad} | \
        awk '($6 == "TRUE" || $6 == "BOTH") && (!seen[$0]++){{a[$1"\\t"$5]+=1}}END{{for(i in a) print i,a[i]}}' | \
        awk '{{print $1,$2,$3/$2}}' > {params.tmp1}

    # join with the whole table on cluster ID
    join -13 -21 -a1 <(sort --parallel={threads} -T {params.local_tmp} -k3,3 {output.sp_sh}.tmp) \
        <(sort --parallel={threads} -k1,1 {params.tmp1}) > {params.tmp2}

    # Add 0 for the clusters with no shadows and
    # reorder column output table: orf - length - cl_name - cl_size - prop_shadow - is.shadow - is.spurious
    awk -vOFS='\\t' '{{ for(i=1; i<=7; i++) if($i ~ /^ *$/) $i = 0 }};1' {params.tmp2} | \
        awk -vOFS='\\t' '{{print $2,$3,$1,$4,$7,$5,$6}}' > {output.sp_sh}.tmp

    if [[ {params.module} == "update" ]]; then
        # Download original dataset spurious and shadow info and combine with the new dataset
        if [[ ! -s {params.or_sp_sh} ]]; then
            wget https://ndownloader.figshare.com/files/23067131 -O {params.or_sp_sh}
        fi
        if [[ {params.eval_shared} == "true" ]]; then
            join -11 -21 -v1 <(zcat {params.or_sp_sh} |\
                sort -k1,1 --parallel {threads} -T {params.local_tmp}) \
                <(awk '{{print $1}}' {output.sp_sh}.tmp |\
                sort -k1,1 --parallel {threads} -T {params.local_tmp} ) > {params.tmp1}
            sed -i 's/ /\\t/g' {params.tmp1}
            cat {output.sp_sh}.tmp {params.tmp1} > {output.sp_sh}
        else
            cat {output.sp_sh}.tmp <(zcat {params.or_sp_sh}) > {output.sp_sh}
        fi
        rm {output.sp_sh}.tmp
    else
        mv {output.sp_sh}.tmp {output.sp_sh}
    fi

    rm {params.tmp1} {params.tmp2} {params.all_shad} {params.spur} {params.only_shad}
    """
run:
    shell("echo 'Spurious and Shadows detection DONE'")
shell:
    """
    Rscript --vanilla {params.report_maker} --basedir {params.basedir} \
        --outdir {params.outdir} \
        --name {params.name_data} \
        --input {params.input_data} \
        --stage {params.stage} \
        --wf_report {params.wf_report} \
        --output {output.report} 2>{log.err} 1>{log.out}
    """
Support
- Future updates