Gezelvirus Workflow: Studying a Polinton-like Virus in Phaeocystis globosa
Gezelvirus workflow
The repository contains the bioinformatic workflow used in the paper Roitman et al. (2023), "Infection cycle and phylogeny of a Polinton-like virus with a virophage lifestyle infecting Phaeocystis globosa".
Code Snippets
```
shell: "efetch -db nuccore -id {params.id} -format fasta > {output}"
```

```
shell: "efetch -db nuccore -id {params.id} -format gb > {output}"
```
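Each of these `shell:` fragments sits inside a Snakemake rule that supplies the `{params}`, `{input}` and `{output}` placeholders; the snippets on this page omit the rule bodies. As a minimal sketch of how the FASTA download might be wrapped (the rule name, output path and accession table below are hypothetical, not taken from the repository):

```
# Hypothetical wrapper rule -- names, paths and the accession mapping
# are illustrative only.
accessions = {"some_genome": "NC_XXXXXX"}  # placeholder lookup table

rule fetch_genome_fasta:
    output:
        "analysis/genomes/{genome}.fna"
    params:
        id = lambda w: accessions[w.genome]
    shell:
        "efetch -db nuccore -id {params.id} -format fasta > {output}"
```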
```
shell: "cp {input} {output}"
```

```
shell: "mv {input} {output}"
```
```
shell:
    """
    prokka --force --cpus {threads} --prefix {wildcards.genome} --locustag {params.locustag} \
        --kingdom Viruses --outdir {params.outdir}/{wildcards.genome} {input}
    cp {params.outdir}/{wildcards.genome}/{wildcards.genome}.gbk {output}
    """
```
```
shell: "seqkit grep -rp {params.search} {input} | seqkit replace -p '$' -r ' {wildcards.segment}' -o {output}"
```

```
shell: "awk -vl={wildcards.segment} '/^LOCUS/{{p=$2==l}}p' {input} > {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -T id -t locus_tag,seq | awk '{{printf\"%s %s\\t%s\\n\",$2,$1,$3}}' | seqkit tab2fx | seqkit translate --trim -o {output}"
```

```
shell: "awk -vl={wildcards.segment} '/^LOCUS/{{p=$2==l}}p' {input} | sed -E 's/\\x93|\\x94/\"/g' > {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -T id -t locus_tag,translation | awk '{{printf\"%s %s\\t%s\\n\",$2,$1,$3}}' | seqkit tab2fx -o {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -T description -t locus_tag,translation | awk -F\\\\t '$2{{printf\"%s %s\\t%s\\n\",$2,$1,$3}}' | seqkit tab2fx -o {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -T description -t protein_id,translation | awk -F\\\\t '$2{{printf\"%s %s\\t%s\\n\",$2,$1,$3}}' | seqkit tab2fx -o {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -T id,seq | seqkit tab2fx -o {output}"
```

```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -t locus_tag,translation | awk '$1' | seqkit tab2fx -o {output}"
```
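The recurring pattern in these pipelines: `biotags.pl` (a helper bundled in `workflow/helpers/`) prints selected GenBank feature tags as tab-separated columns, `awk` rearranges them, and `seqkit tab2fx` turns the two-column table back into FASTA. A rough Biopython equivalent of the `locus_tag`/`translation` extraction (a sketch of the behaviour only, not the helper itself; file names are assumed):

```python
# Rough equivalent of:
#   biotags.pl -i in.gbk -p CDS -t locus_tag,translation | seqkit tab2fx
# File names are placeholders.
from Bio import SeqIO

with open("proteins.faa", "w") as out:
    for record in SeqIO.parse("in.gbk", "genbank"):
        for feature in record.features:
            if feature.type != "CDS":
                continue
            tags = feature.qualifiers.get("locus_tag", [])
            transl = feature.qualifiers.get("translation", [])
            if tags and transl:
                out.write(f">{tags[0]}\n{transl[0]}\n")
```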
```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(stringr))

stdin_fd <- file("stdin")
all.lines <- readLines(stdin_fd)
close(stdin_fd)

param.start <- 1
data.start <- which(grepl("^ *No Hit", all.lines)) %>% first %>% `+`(1)
align.start <- which(grepl("^No 1", all.lines)) %>% first
param.end <- data.start - 2
data.end <- align.start - 1
align.end <- length(all.lines)

if (is.na(align.start)) {
    data <- tibble(
        Query = character(), No = integer(), Hit.ID = character(), Hit.Description = character(),
        Q.ss_pred = character(), Q.query = character(), Q.consensus = character(),
        Q.Start = integer(), Q.End = integer(), Q.Length = integer(),
        T.consensus = character(), T.Start = integer(), T.End = integer(), T.Length = integer(),
        T.hit = character(), T.ss_dssp = character(), T.ss_pred = character(),
        Aligned_cols = integer(), E.value = numeric(), Identities = numeric(),
        Probab = numeric(), Score = numeric(), Similarity = numeric(),
        Sum_probs = numeric(), Template_Neff = numeric()
    )
} else {
    metadata <- data.frame(key = all.lines[param.start:param.end]) %>%
        mutate(value = substr(key, 14, 10000) %>% trimws, key = substr(key, 1, 14) %>% trimws) %>%
        filter(key != "") %>%
        {setNames(.$value, .$key)} %>%
        as.list
    data <- data.frame(Query = sub(" .*", "", metadata$Query), line = all.lines[align.start:align.end], stringsAsFactors = F) %>%
        filter(line != "") %>%
        extract(line, into = c("name", "value"), regex = "([^ ]+) ?(.+)?", remove = F) %>%
        mutate(No = ifelse(name == "No", value, NA) %>% as.integer) %>%
        mutate(Hit.ID = ifelse(substr(name, 1, 1) == ">", substr(name, 2, nchar(.)), NA)) %>%
        mutate(Hit.Description = ifelse(substr(name, 1, 1) == ">", value, NA)) %>%
        mutate(Match = ifelse(grepl("=", name), line, NA)) %>%
        mutate(name = ifelse(grepl("Q Consensus", lag(line)) & grepl("T Consensus", lead(line)), "M", name)) %>%
        mutate(value = ifelse(name == "M", line, value)) %>%
        fill(No) %>%
        group_by(Query, No) %>%
        summarize(
            Hit.ID = na.omit(Hit.ID) %>% first,
            Hit.Description = na.omit(Hit.Description) %>% first,
            Match = na.omit(Match) %>% first,
            Q.ss_pred = value[name == "Q" & grepl("^ss_pred ", value)] %>% substr(., 16, nchar(.)) %>% paste(collapse = "") %>% gsub(" +", "", .),
            Q.query = value[name == "Q" & grepl("^Consensus ", lead(value))] %>% substr(., 16, nchar(.)) %>% paste(collapse = " "),
            Q.consensus = value[name == "Q" & grepl("^Consensus ", value)] %>% substr(., 16, nchar(.)) %>% paste(collapse = " "),
            T.consensus = value[name == "T" & grepl("^Consensus ", value)] %>% substr(., 16, nchar(.)) %>% paste(collapse = " "),
            T.hit = value[name == "T" & grepl("^Consensus ", lag(value))] %>% substr(., 16, nchar(.)) %>% paste(collapse = " "),
            T.ss_dssp = value[name == "T" & grepl("^ss_dssp ", value)] %>% substr(., 16, nchar(.)) %>% paste(collapse = " ") %>% gsub(" +", "", .),
            T.ss_pred = value[name == "T" & grepl("^ss_pred ", value)] %>% substr(., 16, nchar(.)) %>% paste(collapse = "") %>% gsub(" ", "", .),
            .groups = "drop"
        ) %>%
        extract(Q.consensus, into = c("Q.Start", "Q.End", "Q.Length"), regex = "^ *(\\d+) .+ (\\d+) +[(](\\d+)[)]$", remove = F, convert = T) %>%
        extract(T.consensus, into = c("T.Start", "T.End", "T.Length"), regex = "^ *(\\d+) .+ (\\d+) +[(](\\d+)[)]$", remove = F, convert = T) %>%
        mutate(
            Q.consensus = gsub("[0-9() ]+", "", Q.consensus),
            Q.query = gsub("[0-9() ]+", "", Q.query),
            T.consensus = gsub("[0-9() ]+", "", T.consensus),
            T.hit = gsub("[0-9() ]+", "", T.hit)
        ) %>%
        #extract(Hit.Description, into = "Hit.Organism", regex = "[{]([^}]+)[}]", remove = F) %>%
        #extract(Hit.Description, into = "Hit.Description", regex = "([^;]+)", remove = F) %>%
        #extract(Hit.Description, into = "Hit.Keywords", regex = "[^;]+; ([^;]+)", remove = F) %>%
        mutate(Match = str_split(Match, " +")) %>%
        unnest(cols = Match) %>%
        separate(Match, into = c("key", "value"), "=") %>%
        mutate(value = sub("%", "", value) %>% as.numeric) %>%
        spread(key, value) %>%
        rename(E.value = `E-value`) %>%
        mutate(Aligned_cols = as.integer(Aligned_cols))
}
write.table(data, quote = F, sep = "\t", row.names = F)
```
```
shell: "csvgrep -t -c Gene -r '^({params.regex})$' {input} | csvgrep -c clade -r '^({params.groups})$' | csvcut -c ID,Genome | csvformat -T -K1 > {output}"
```

```
shell: "cut -f1 {input.tsv} | xargs seqkit faidx {input.fas} | seqkit seq -o {output}"
```

```
shell: "mafft --localpair --maxiterate 1000 --thread {threads} {input} > {output}"
```

```
shell: "trimal -in {input} -out {output} -gt {params.gt}"
```

```
shell: "iqtree2 -s {input} --prefix {params.prefix} -redo --alrt 1000 -B 1000 --seed {params.seed} -T {threads}"
```

```
script: "scripts/ggtree-viruses.R"
```
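These steps form the protein phylogeny chain: select genes, extract sequences, align, trim, infer the tree, plot. A condensed, hypothetical Snakemake sketch of the align/trim/infer part (file names and parameter values are assumptions, not the repository's own):

```
# Hypothetical chaining of the three tools above; paths and values assumed.
rule mafft:
    input: "analysis/phylogeny/{gene}.faa"
    output: "analysis/phylogeny/{gene}.mafft.faa"
    threads: 4
    shell: "mafft --localpair --maxiterate 1000 --thread {threads} {input} > {output}"

rule trimal:
    input: rules.mafft.output
    output: "analysis/phylogeny/{gene}.trimal.faa"
    params: gt = 0.2  # assumed gap threshold
    shell: "trimal -in {input} -out {output} -gt {params.gt}"

rule iqtree:
    input: rules.trimal.output
    output: "analysis/phylogeny/{gene}.treefile"
    params: prefix = "analysis/phylogeny/{gene}", seed = 123
    threads: 4
    shell: "iqtree2 -s {input} --prefix {params.prefix} -redo --alrt 1000 -B 1000 --seed {params.seed} -T {threads}"
```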
```
shell: "getorf -minsize {params.minsize} -maxsize {params.maxsize} -filter {input} > {output}"
```

```
shell: "cdhit -i {input} -o {output} -c {params.c} -d 0"
```

```
shell: "mafft {input} > {output}"
```

```
shell: "hmmbuild {output} {input}"
```

```
shell: "hmmsearch -o /dev/null --tblout {output} {input.hmm} {input.faa}"
```

```
shell: "hmmsearch -o /dev/null --tblout {output} {input.hmm} {input.faa}"
```

```
shell: "blastp -query {input.query} -db {input.faa} -outfmt '6 {params.headers}' -out {output}"
```

```
shell: "blastp -query {input.query} -db {input.faa} -outfmt '6 {params.headers}' -out {output}"
```

```
shell: "awk -ve={params.evalue} '!/^#/&&$5<e{{print$1}}' {input.tblout} | sort -u | xargs -r seqkit faidx -f {input.fasta} | seqkit replace -p '^([^ ]+)' -r '$1 {wildcards.genome}' -o {output}"
```

```
shell: "awk -ve={params.evalue} '!/^#/&&$5<e{{print$1}}' {input.tblout} | sort -u | xargs -r seqkit faidx -f {input.fasta} | seqkit replace -p '^([^ ]+)' -r '$1 {wildcards.genome}' -o {output}"
```
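In hmmsearch `--tblout` output, column 1 is the target name and column 5 the full-sequence E-value, which is what the `awk` filter `'!/^#/&&$5<e{{print$1}}'` selects on (braces are doubled for Snakemake escaping). The same filter as a hedged Python sketch (function name and default threshold are illustrative):

```python
# Python equivalent of the awk filter over hmmsearch --tblout output:
# skip comment lines, keep target names with full-sequence E-value < cutoff.
def significant_targets(tblout_path, evalue=1e-5):
    hits = set()
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            if float(fields[4]) < evalue:  # column 5: full-sequence E-value
                hits.add(fields[0])        # column 1: target name
    return sorted(hits)
```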
```
shell: "awk -ve={params.evalue} '$11<e' {input.blast} | cut -f2 | sort -u | xargs -r seqkit faidx -f {input.fasta} > {output}"
```

```
shell: "awk -ve={params.evalue} '$11<e' {input.blast} | cut -f2 | sort -u | xargs -r seqkit faidx -f {input.fasta} > {output}"
```

```
shell: "hmmsearch -o /dev/null --tblout {output} {input.hmm} {input.fasta}"
```

```
shell: "seqkit seq -gm{params.m} {input} | seqkit rmdup -o {output}"
```

```
shell: "seqkit seq -gm{params.m} {input} | seqkit rmdup -o {output}"
```

```
shell: "hmmalign --trim --outformat A2M {input.hmm} {input.fasta} > {output}"
```

```
shell: "seqkit replace -sp [a-z] {input.a2m} | seqkit seq -nigm{params.m} | xargs seqkit faidx {input.a2m} | seqkit replace -sp [a-z] | seqkit rmdup -so {output}"
```

```
shell: "seqkit replace -sp [a-z] {input.a2m} | seqkit seq -nigM{params.M} | xargs seqkit faidx {input.a2m} | seqkit replace -sp [a-z] | seqkit rmdup -so {output}"
```
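In the A2M alignments written by `hmmalign`, lowercase letters mark insert states relative to the profile, so the two snippets above strip them (`seqkit replace -sp [a-z]`) before applying the minimum (`-m`) or maximum (`-M`) length filter and refetching the matching records. A rough Python equivalent of the length test (a sketch only; file name and threshold assumed):

```python
# Sketch of the A2M length filter: ignore insert states (lowercase)
# and gaps, keep sequences whose match-state length passes a cutoff.
import re
from Bio import SeqIO

def match_state_length(seq):
    return len(re.sub(r"[a-z.\-]", "", str(seq)))

keep = [rec for rec in SeqIO.parse("aln.a2m", "fasta")
        if match_state_length(rec.seq) >= 100]  # threshold assumed
```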
```
shell: "raxml-ng --redo --evaluate --msa {input.fasta} --tree {input.iqtree} --model {params.model} --seed {params.seed} --prefix {params.prefix} &> {log}"
```

```
shell: "epa-ng --redo -s {input.fasta} -t {input.tree} --model {input.model} -q {input.short} -w {params.dir} &> {log}"
```

```
shell: "gappa examine graft --jplace-path {input} --fully-resolve --out-dir {params.out_dir}"
```

```
shell: "iqtree2 -s {input} --prefix {params.prefix} -redo --alrt 1000 -B 1000 --seed {params.seed} -T {threads}"
```

```
script: "scripts/ggtree.R"
```

```
script: "scripts/ggtree.R"
```

```
script: "scripts/ggtree.R"
```
```
shell: "perl workflow/helpers/biotags.pl -i {input} -p CDS -t seq-{params.max_len} | awk '$1' | nl -w1 | seqkit tab2fx | seqkit seq -gm{params.min_len} -o {output}"
```

```
shell: "meme -minw {params.minw} -maxw {params.maxw} -nmotifs {params.nmotifs} -minsites {params.minsites} -maxsites Inf -dna -oc {params.outdir} {input}"
```

```
shell:
    """
    awk -vm='{params.motif_re}' '$1=="MOTIF"{{s=$2!~m}}!s' {input} > {output}
    """
```

```
shell: "seqret -supper1 -filter -outseq {output} {input}"
```

```
shell: "fimo --oc {params.out_dir} {input.meme} {input.fasta}"
```

```
shell: "fimo --oc {params.out_dir} {input.meme} {input.fasta}"
```

```
script: "scripts/fimo.R"
```

```
script: "scripts/meme.R"
```

```
shell: "python workflow/scripts/svg_stack.py --direction=v {input} > {output}"
```
```
shell: "seqkit seq -o {output} {input}"
```

```
shell: "ffdb fasta -d {output.data} -i {output.index} {input}"
```

```
shell: "ffdb fasta -d {output.data} -i {output.index} {input}"
```

```
shell: "cut -f2,3 {input} | nl -w1 > {output}"
```

```
shell: "ffindex_apply -q -d {output.data} -i {output.index} {input.data} {input.index} -- hhblits -cpu 1 -v 0 -b 1 -z 1 -d {params.db} -i stdin -oa3m stdout -e {params.e} -n {params.n}"
```

```
shell: "ffindex_apply -q -d {output.data} -i {output.index} {input.data} {input.index} -- $CONDA_PREFIX/scripts/addss.pl -v 0 /dev/stdin /dev/stdout"
```

```
shell: "ffindex_apply -q -d {output.data} -i {output.index} {input.data} {input.index} -- hhsearch -v 1 -cpu 1 -b 1 -z 1 -i stdin -d {params.db} -o stdout -p {params.p} -Z {params.BZ} -B {params.BZ} -ssm {params.ssm} -contxt {input.contxt}"
```

```
shell: "ffindex_apply -q -d {output.data} -i {output.index} {input.data} {input.index} -- hhsearch -v 1 -cpu 1 -b 1 -z 1 -i stdin -d {params.db} -o stdout -p {params.p} -Z {params.BZ} -B {params.BZ} -ssm {params.ssm} -contxt {input.contxt}"
```

```
shell: "ffindex_apply -q {input.data} {input.index} -- Rscript workflow/helpers/parse_hhsuite.R | sed '2,${{/^Query/d}}' > {output}"
```
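The `ffindex_apply` calls above run one tool invocation per database entry. An FFindex database is a packed data file plus a plain-text index whose lines are `name`, byte `offset`, and entry `length`, tab-separated, with NUL-terminated entries in the data file. A minimal reader, as a hedged sketch (not part of the workflow; file names are placeholders):

```python
# Minimal FFindex entry lookup: index lines are "name\toffset\tlength",
# pointing into the packed, NUL-terminated data file.
def read_ffindex_entry(index_path, data_path, name):
    with open(index_path) as idx:
        for line in idx:
            entry, offset, length = line.rstrip("\n").split("\t")
            if entry == name:
                with open(data_path, "rb") as data:
                    data.seek(int(offset))
                    return data.read(int(length)).rstrip(b"\0")
    raise KeyError(name)
```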
```
shell: "mmseqs createdb {input} {output}"
```

```
shell: "mmseqs createdb {input} {output}"
```

```
shell: "mmseqs cluster --min-seq-id {params.min_seq_id} -c {params.coverage} --threads {threads} {input} {params.output_prefix} {params.tmp}"
```

```
shell: "mmseqs result2msa {input.db} {input.db} {params.input_prefix} {params.output_prefix} --msa-format-mode 1"
```

```
shell: "mpirun -np {threads} cstranslate_mpi -i {params.input_prefix} -o {params.output_prefix} -A {input.cs219_lib} -D {input.context} -x 0.3 -c 4 -I ca3m -b"
```

```
shell: "mmseqs createtsv {input.db} {input.db} {params.clu_prefix} {output}"
```

```
script: "scripts/abc_graph.R"
```

```
shell: "mcl {input} -o {output} --abc -scheme {params.scheme} -te {threads} -I {params.I}"
```
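`mcl --abc` consumes a plain edge list, one `node_a<TAB>node_b<TAB>weight` line per edge, which `scripts/abc_graph.R` produces here from the HHsearch hits. A hypothetical sketch of emitting such a graph in Python (the input column layout and the probability-to-weight scaling are assumptions):

```python
# Sketch: write an MCL --abc edge list from a query/target/probability
# table. Column layout and weighting are assumed, not the workflow's own.
import csv

with open("hits.tsv") as fh, open("graph.abc", "w") as out:
    for query, target, probab in csv.reader(fh, delimiter="\t"):
        out.write(f"{query}\t{target}\t{float(probab) / 100:.3f}\n")
```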
```
script: "scripts/mcl_graph.R"
```
```r
library(dplyr)
library(tidyr)

with(snakemake@input, {
    clu_tsv_file <<- clu_tsv
    segment_tsv_files <<- segment_tsv
    virus_tsv_files <<- virus_tsv
})
with(snakemake@params, {
    coverage <<- coverage
    probab <<- probab
    coverage_q_frag <<- coverage_q_frag
    coverage_t_frag <<- coverage_t_frag
    identities_frag <<- identities_frag
    probab_frag <<- probab_frag
})
output_file <- unlist(snakemake@output)

clusters <- read.table(clu_tsv_file, sep = "\t", col.names = c("Cluster", "ID"))
segment_tsv <- lapply(segment_tsv_files, read.table, header = T, sep = "\t", quote = "")
virus_tsv <- lapply(virus_tsv_files, read.table, header = T, sep = "\t", quote = "")

data <- c(segment_tsv, virus_tsv) %>%
    bind_rows %>%
    mutate(Q.Coverage = (Q.End - Q.Start + 1) / Q.Length * 100,
           T.Coverage = (T.End - T.Start + 1) / T.Length * 100) %>%
    # filter(Probab >= probab, Q.Coverage >= coverage, T.Coverage >= coverage) %>%
    filter(Probab >= probab,
           Q.Coverage >= coverage & T.Coverage >= coverage |
           Q.Coverage >= coverage_q_frag & T.Coverage >= coverage_t_frag &
           Identities >= identities_frag & Probab >= probab_frag) %>%
    left_join(clusters, by = c(Hit.ID = "Cluster")) %>%
    select(Query, ID, Probab)

write.table(data, output_file, sep = "\t", row.names = F, col.names = F, quote = F)
```
```python
import re
from Bio import SeqIO

input_gbk = str(snakemake.input)
output_gbk = str(snakemake.output)
flank = snakemake.params['flank']
segment = snakemake.params['segment']

coords_re = re.compile(r'(.+)_(\d+)-(\d+)[. ]*$')

def get_coords(description):
    search = coords_re.search(description)
    assert search, "Unexpected sequence name: %s" % description
    scaffold, start, end = search.groups()
    start = int(start)
    end = int(end) + flank
    if start > flank:
        start = start - flank
    else:
        start = 1
    return scaffold, start, end

found = False
with open(input_gbk) as fd:
    records = SeqIO.parse(fd, 'genbank')
    for record in records:
        scaffold, start, end = get_coords(record.description)
        if scaffold == segment['scaffold'] and start <= segment['scaffold_start'] and end >= segment['scaffold_end']:
            found = True
            break
assert found, "Segment %s not found" % segment['segment_id']

segment_start = segment['scaffold_start'] - start
segment_end = segment['scaffold_end'] - start + 1
sub_record = record[segment_start:segment_end]
if segment['strand'] < 0:
    sub_record = sub_record.reverse_complement()
sub_record.name = segment['name']
sub_record.id = "%s_%d-%d" % (segment['scaffold'], segment['scaffold_start'], segment['scaffold_end'])
sub_record.description = sub_record.id
sub_record.annotations["molecule_type"] = "DNA"
SeqIO.write(sub_record, output_gbk, "genbank")
```
```r
library(dplyr)

input_file <- unlist(snakemake@input)
output_file <- unlist(snakemake@output)
with(snakemake@params, {
    motif_re <<- motif_re
    q_thresh <<- q_thresh
})

gff <- read.table(input_file, header = T) %>%
    filter(grepl(motif_re, matched_sequence), q.value < q_thresh) %>%
    mutate(description = sprintf("Name=%s;Alias=%s;pvalue=%f;qvalue=%f;sequence=%s;",
        motif_id, motif_alt_id, p.value, q.value, matched_sequence)) %>%
    mutate(source = "fimo", type = "nucleotide_motif", frame = ".") %>%
    select(sequence_name, source, type, start, stop, score, strand, frame, description)

write.table(gff, file = output_file, sep = "\t", row.names = F, col.names = F, quote = F)
```
```python
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation
from collections import defaultdict

ref_annots = []

gff_file = str(snakemake.input["gff"])
fna_file = str(snakemake.input["fna"])
blastn_file = str(snakemake.input["blastn"])
pfam_file = str(snakemake.input["pfam"])
hmm_dbs = snakemake.input["hmm_dbs"]
hmmscan_files = snakemake.input["hmmscan"]
markers_files = snakemake.input["markers"]
output_file = str(snakemake.output)

evalue_threshold = snakemake.params["evalue"]
coding_feature = snakemake.params["coding_feature"]
min_repeat_id = snakemake.params["min_repeat_id"]
min_repeat_gap = snakemake.params["min_repeat_gap"]
min_repeat_len = snakemake.params["min_repeat_len"]

aliases = {
    "MCP": "Major capsid protein (MCP)",
    "mCP": "Minor capsid protein (mCP)",
    "uncharacterized protein": "-",
    "DNA polymerase - helicase": "DNA polymerase-helicase"
}

features = defaultdict(list)
with open(blastn_file) as fh:
    for line in fh:
        qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore = line.rstrip().split()
        qstart = int(qstart)
        qend = int(qend)
        sstart = int(sstart)
        send = int(send)
        if float(pident) > min_repeat_id and qseqid == sseqid and int(length) >= min_repeat_len and qstart < qend and sstart > send and sstart - qend >= min_repeat_gap:
            loc1 = FeatureLocation(qstart - 1, qend, strand = 1)
            loc2 = FeatureLocation(send - 1, sstart, strand = -1)
            feature = SeqFeature(CompoundLocation([loc1, loc2]), type = 'repeat_region')
            feature.qualifiers['rpt_type'] = 'inverted'
            feature.qualifiers['inference'] = 'ab initio prediction:blastn'
            feature.qualifiers['note'] = 'pident: %s%%' % pident
            features[qseqid].append(feature)

descs = {}
for fname in hmm_dbs:
    with open(fname) as fh:
        NAME = ACC = None
        for line in fh:
            if line.startswith('ACC'):
                key, ACC = line.rstrip().split()
            if line.startswith('NAME'):
                key, NAME = line.rstrip().split()
            if line.startswith('DESC'):
                key, DESC = line.rstrip().split(maxsplit = 1)
                if ACC:
                    descs[ACC] = DESC
                if NAME:
                    descs[NAME] = DESC
                NAME = ACC = None

hits = defaultdict(list)
with open(pfam_file) as fh:
    for line in fh:
        if not line.startswith('#') and line != '\n':
            seq_id, aln_start, aln_end, env_start, env_end, hmm_acc, hmm_name, type, hmm_start, hmm_end, hmm_len, bit_score, E_value, significance, clan = line.rstrip().split()
            hit = "%s: %s (%s)" % (hmm_acc, descs[hmm_acc], E_value)
            hits[seq_id].append(hit)

for fname in markers_files:
    with open(fname) as fh:
        for line in fh:
            if not line.startswith('#'):
                t_name, t_acc, q_name, q_acc, E_value, score, bias, best_E_value, best_score, best_bias, exp, reg, clu, ov, env, dom, rep, inc, desc = line.rstrip().split(maxsplit = 18)
                if float(E_value) <= evalue_threshold:
                    hit = "Marker %s: %s (%s)" % (q_name, descs[q_name], E_value)
                    hits[t_name].append(hit)

for fname in hmmscan_files:
    with open(fname) as fh:
        for line in fh:
            if not line.startswith('#'):
                t_name, t_acc, q_name, q_acc, E_value, score, bias, best_E_value, best_score, best_bias, exp, reg, clu, ov, env, dom, rep, inc, desc = line.rstrip().split(maxsplit = 18)
                if desc in aliases:
                    desc = aliases[desc]
                definition = ''
                if t_name != desc and desc != '-':
                    definition = ' ' + desc if t_name != desc and desc != '-' else ''
                if float(E_value) <= evalue_threshold:
                    hit = "%s:%s (%s)" % (t_name, definition, E_value)
                    hits[q_name].append(hit)

with open(gff_file) as fh:
    for line in fh:
        if not line.startswith('#') and line != '\n':
            seqname, source, feature_type, start, end, score, strand, frame, attributes = line.rstrip().split('\t')
            if feature_type == coding_feature:
                atts = {}
                for att in attributes.split(';'):
                    key, value = att.split('=')
                    atts[key] = value
                if 'Parent' in atts:
                    parent = atts['Parent']
                else:
                    parent = atts['ID']
                loc = FeatureLocation(int(start) - 1, int(end))
                feature = SeqFeature(loc, type = 'CDS', strand = int(strand + '1'))
                feature.qualifiers['locus_tag'] = parent
                feature.qualifiers['inference'] = "ab initio prediction:" + source.replace("_v", ":")
                if parent in hits:
                    feature.qualifiers['note'] = hits[parent]
                features[seqname].append(feature)

with open(output_file, 'w') as out_fh:
    with open(fna_file) as in_fh:
        rec_num = 1
        for record in SeqIO.parse(in_fh, 'fasta'):
            record.annotations['molecule_type'] = 'DNA'
            record.name = "locus_%d" % rec_num
            for feature in features[record.id]:
                if feature.type == 'CDS':
                    transl = feature.translate(record, cds = False)
                    feature.qualifiers['translation'] = transl.seq.rstrip('*')
                record.features.append(feature)
            SeqIO.write(record, out_fh, 'genbank')
            rec_num += 1
```
```r
library(dplyr)
library(tidyr)
library(ape)
library(ggtree)
library(treeio)
library(phangorn)
library(stringr)
library(ggplot2)
library(phytools)

if (interactive()) {
    setClass("snake", slots = list(input = "list", output = "list"))
    snakemake <- new("snake", input = list(
        tree = "analysis/phylogeny/MCP_NCLDV_epa/epa_result.newick",
        fasta = "analysis/phylogeny/MCP_NCLDV.fasta",
        outgroups = "metadata/queries/MCP_NCLDV_outgroups.faa",
        synonyms = "metadata/organisms.txt",
        hmm = Sys.glob("hmm_algae/*.hmm")
    ), output = list(
        image = "test.svg",
        jtree = "output/MCP_NCLDV.jtree"
    ))
}

with(snakemake@input, {
    tree_file <<- tree
    fasta_file <<- fasta
    synonyms_file <<- synonyms
    outgroup_file <<- outgroups
})
with(snakemake@output, {
    out_image_file <<- image
    out_jtree_file <<- jtree
})
with(snakemake@params, {
    outgroup_rooting <<- outgroup_rooting
})

read.fasta.headers <- function(fnames) {
    file.info(fnames) %>%
        filter(size > 0) %>%
        rownames %>%
        lapply(treeio::read.fasta) %>%
        lapply(names) %>%
        unlist %>%
        data.frame(title = .)
}

synonyms <- read.table(synonyms_file, header = T, sep = "\t", fill = T, na.strings = "") %>%
    mutate(Collapse = ifelse(is.na(Collapse), Name, Collapse))

headers <- read.fasta.headers(fasta_file) %>%
    extract(title, into = c("label", "ID"), regex = "^([^ ]+) ([^ ]+)", remove = F) %>%
    left_join(synonyms, by = "ID")

no_name <- filter(headers, is.na(Name)) %>%
    pull(label) %>%
    paste(collapse = ", ")
if (no_name != "") {
    print(paste("No aliases found for: ", no_name))
    quit(status = 1)
}

tree <- read.tree(tree_file)
tree <- phangorn::midpoint(tree, node.labels = "support")

if (outgroup_rooting) {
    outgroup_df <- read.fasta.headers(outgroup_file)
    outgroups <- with(outgroup_df, sub(" .*", "", title))
    tree <- ape::root(tree, node = MRCA(tree, outgroups), edgelabel = T, resolve.root = T)
}

tree <- as_tibble(tree) %>%
    mutate(support = ifelse(node %in% parent & label != "", label, NA)) %>%
    separate(support, into = c("SH_aLRT", "UFboot"), sep = "/", convert = T) %>%
    left_join(headers, by = "label") %>%
    mutate(label.show = Name) %>%
    mutate(isInternal = node %in% parent) %>%
    `class<-`(c("tbl_tree", "tbl_df", "tbl", "data.frame"))

tree_data <- as.treedata(tree)
write.jtree(tree_data, file = out_jtree_file)

ntaxa <- filter(tree, ! node %in% parent) %>% nrow

colors <- list(
    Haptophyta = "orange",
    Chlorophyta = "green",
    Streptophyta = "darkgreen",
    MAG = "purple",
    Stramenopiles = "brown",
    Cryptophyta = "red",
    Amoebozoa = "gold4",
    Euglenozoa = "yellow",
    Choanoflagellata = "darkslateblue",
    Glaucophyta = "cyan",
    Animals = "blue",
    Dinoflagellata = "gray50",
    Rhizaria = "gray30"
)

scaleClades <- function(p, df) {
    with(df, Reduce(function(.p, .node) {
        offs <- offspring(.p$data, .node)
        scale <- 0.5 / (nrow(offs) - 1)
        scaleClade(.p, .node, scale)
    }, node, p))
}
collapseClades <- function(p, df) {
    with(df, Reduce(function(.p, .node) {
        fill <- unlist(colors[Host[node == .node]])
        .p$data[.p$data$node == .node, "label.show"] <- label.show[node == .node]
        collapse(.p, .node, "mixed", fill = fill)
    }, node, p))
}
#labelClades <- function(p) {
#    with(df, Reduce(function(.p, .node) {
#        .p + geom_cladelab(node = .node, label = label[node == .node], align = T, offset = .2, textcolor = 'blue')
#    }, node, p))
#}

multi_species <- allDescendants(tree_data@phylo) %>%
    lapply(function(x) filter(tree, node %in% x)) %>%
    bind_rows(.id = "ancestor") %>%
    group_by(ancestor) %>%
    filter(n_distinct(Collapse, na.rm = T) == 1, sum(!isInternal) > 1) %>% # , !any(Group == "Haptophyta")) %>%
    ungroup %>%
    mutate(ancestor = as.numeric(ancestor)) %>%
    filter(! ancestor %in% node) %>%
    filter(!is.na(Collapse)) %>%
    group_by(ancestor, Collapse) %>%
    summarize(num_tips = sum(!isInternal), Host = first(na.omit(Host))) %>%
    mutate(label.show = sprintf("%s (%d)", Collapse, num_tips)) %>%
    rename(node = ancestor)

p <- ggtree(tree_data) +
    geom_nodepoint(aes(x = branch, subset = !is.na(UFboot) & UFboot >= 90, size = UFboot)) +
    geom_tiplab(aes(label = label.show), size = 4, align = T, linesize = 0) +
    geom_text2(aes(subset = node %in% multi_species$node, x = max(x, na.rm = T), label = label.show), nudge_x = 0.01, size = 4, hjust = 0) +
    geom_tippoint(aes(color = Host), size = 3) +
    geom_treescale(width = 0.5) +
    scale_size_continuous(limits = c(90, 100), range = c(1, 3)) +
    scale_shape_manual(values = seq(0, 15)) +
    scale_color_manual(values = colors)
p <- scaleClades(p, multi_species)
p <- collapseClades(p, multi_species)
# p <- facet_plot(p, mapping = aes(x = as.numeric(as.factor(query.name)), shape = DESC), data = genes, geom = geom_point, panel = 'Genes')
ggsave(out_image_file, p, height = ntaxa * 0.1, width = 7, limitsize = F)
```
```r
library(dplyr)
library(tidyr)
library(ape)
library(ggtree)
library(treeio)
library(phangorn)
library(stringr)
library(ggplot2)
library(phytools)

if (interactive()) {
    setClass("snake", slots = list(input = "list", output = "list"))
    snakemake <- new("snake", input = list(
        tree = "analysis/phylogeny_viruses/MCP-small.treefile",
        tsv = "analysis/phylogeny_viruses/MCP-small.tsv",
        virus_metadata = "metadata/viruses.tsv",
        segment_metadata = "metadata/viral_segments.tsv",
        colors = "metadata/subclade_colors.txt"
    ), output = list(
        image = "test.svg",
        jtree = "test.jtree"
    ))
}

with(snakemake@input, {
    tree_file <<- tree
    tsv_file <<- tsv
    virus_metadata_file <<- virus_metadata
    segment_metadata_file <<- segment_metadata
    colors_file <<- colors
})
with(snakemake@output, {
    out_image_file <<- image
    out_jtree_file <<- jtree
})

colors <- read.table(colors_file, col.names = c("subclade", "color"), comment.char = "") %>%
    with(setNames(color, subclade))
viruses <- read.table(virus_metadata_file, header = T, sep = "\t", na.strings = "")
segments <- read.table(segment_metadata_file, header = T, sep = "\t", na.strings = "")
metadata <- bind_rows(viruses, segments) %>%
    mutate(subclade = ifelse(is.na(subclade), clade, subclade))
orgs <- read.table(tsv_file, col.names = c("label", "Organism"), sep = "\t")

tree <- read.tree(tree_file) %>%
    phangorn::midpoint(node.labels = "support") %>%
    as_tibble %>%
    mutate(support = ifelse(node %in% parent & label != "", label, NA)) %>%
    separate(support, into = c("SH_aLRT", "UFboot"), sep = "/", convert = T) %>%
    left_join(orgs, by = "label") %>%
    left_join(metadata, by = c(Organism = "short")) %>%
    mutate(isInternal = node %in% parent) %>%
    `class<-`(c("tbl_tree", "tbl_df", "tbl", "data.frame"))

ntaxa <- filter(tree, ! node %in% parent) %>% nrow
tree_data <- as.treedata(tree)
write.jtree(tree_data, file = out_jtree_file)

p <- ggtree(tree_data) +
    geom_nodepoint(aes(x = branch, subset = !is.na(UFboot) & UFboot >= 90, size = UFboot)) +
    geom_tiplab(aes(label = Organism, color = subclade), size = 2, align = F, linesize = 0) +
    scale_color_manual(values = colors) +
    geom_treescale(width = 0.5) +
    scale_size_continuous(limits = c(90, 100), range = c(1, 2))
ggsave(out_image_file, p, height = ntaxa * 0.1, width = 4, limitsize = F)
```
```python
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation
from pandas import read_csv

ref_annots = []

input_files = snakemake.input["gbk"]
data_file = str(snakemake.input["data"])
output_file = str(snakemake.output)

data = read_csv(data_file, sep = "\t")

with open(output_file, 'w') as fd_out:
    for input_file in input_files:
        with open(input_file) as fd_in:
            for record in SeqIO.parse(fd_in, 'genbank'):
                for feature in record.features:
                    quals = feature.qualifiers
                    if feature.type == 'CDS' and 'locus_tag' in feature.qualifiers:
                        ID = feature.qualifiers['locus_tag'][0]
                        if ID:
                            row = data[data.ID == ID]
                            if 'note' not in feature.qualifiers:
                                feature.qualifiers['note'] = []
                            if not row.empty:
                                if row.Gene_Pfam.notnull().item():
                                    feature.qualifiers['note'].append('Gene by hhsearch Pfam match: %s' % row.Gene_Pfam.item())
                                    feature.qualifiers['product'] = row.Gene_Pfam.item()
                                if row.Gene_Cluster.notnull().item():
                                    feature.qualifiers['note'].append('Gene by cluster membership: %s' % row.Gene_Cluster.item())
                                    feature.qualifiers['product'] = row.Gene_Cluster.item()
                                # NB: the original read `row.Probab.item` without call
                                # parentheses, which is always truthy; a null check is
                                # the likely intent.
                                if row['Hit.ID'].notnull().item() and row.Probab.notnull().item():
                                    feature.qualifiers['note'].append('hhsearch best hit against Pfam: %s %s (%s)' % (row['Hit.ID'].item(), row['Hit.Description'].item(), row.Probab.item()))
                SeqIO.write(record, fd_out, 'genbank')
```
```r
library(dplyr)
library(tidyr)
library(scatterpie)
library(tibble)
library(treeio)
library(igraph)
library(ggraph)
library(ape)

if (interactive()) {
    mcl_file <- "analysis/hh_mcl/mcl.txt"
    virus_metadata_file <- "metadata/viruses.tsv"
    segment_metadata_file <- "metadata/viral_segments.tsv"
    segment_ffindex_files <- Sys.glob("analysis/PLV_segments/*.ffindex")
    virus_ffindex_files <- Sys.glob("analysis/PLVs/*.ffindex")
    segment_pfam_files <- Sys.glob("analysis/PLV_segments_hh/*-hhsearch-pfam.tsv")
    virus_pfam_files <- Sys.glob("analysis/PLVs_hh/*-hhsearch-pfam.tsv")
    genes_cluster_file <- "metadata/genes_cluster.tsv"
    genes_pfam_file <- "metadata/genes_pfam.tsv"
    probab_threshold <- 80
    core_genes_file <- "pie.pdf"
    jtree_file <- "output/MCP_PLV.jtree"
    reduced_tree_file <- "output/MCP_PLV_reduced.svg"
    chosen_clades <- c("Gezel", "Dwarf", "Mesomimi")
    clade_levels <- c("Mesomimi", "Dwarf", "Virophage", "TVS", "Gezel")
    ref_genomes <- c("Sputnik", "Mavirus_Spezl", "TVV_S1")
    colors_file <- "metadata/subclade_colors.txt"
    plv_order_file <- "metadata/plv_order.txt"
    family_colors_file <- "metadata/family_colors.txt"
} else {
    with(snakemake@input, {
        mcl_file <<- mcl
        virus_metadata_file <<- virus_metadata
        segment_metadata_file <<- segment_metadata
        segment_pfam_files <<- segment_pfam
        virus_pfam_files <<- virus_pfam
        segment_ffindex_files <<- segment_ffindex
        virus_ffindex_files <<- virus_ffindex
        genes_cluster_file <<- genes_cluster
        genes_pfam_file <<- genes_pfam
        jtree_file <<- jtree
        colors_file <<- colors
        plv_order_file <<- plv_order
        family_colors_file <<- family_colors
    })
    with(snakemake@params, {
        probab_threshold <<- probab
        chosen_clades <<- chosen_clades
        clade_levels <<- clade_levels
        ref_genomes <<- ref_genomes
    })
    with(snakemake@output, {
        data_file <<- data
        core_genes_file <<- core_genes
        bipart_file <<- bipartite
        reduced_tree_file <<- reduced_tree
    })
}

segments <- basename(segment_ffindex_files) %>% sub(".ffindex", "", .)
viruses <- basename(virus_ffindex_files) %>% sub(".ffindex", "", .)

genes_pfam <- read.table(genes_pfam_file, header = T, sep = "\t", fill = T, na.strings = "") %>%
    mutate(Comb = 1:n()) %>%
    separate_rows(Pfam, sep = ",") %>%
    group_by(Comb) %>%
    mutate(Comb.Num = n())
genes_cluster <- read.table(genes_cluster_file, header = T, sep = "\t")

metadata <- bind_rows(
    read.table(virus_metadata_file, sep = "\t", header = T),
    read.table(segment_metadata_file, sep = "\t", header = T)
) %>% `rownames<-`(.$short)

clusters <- readLines(mcl_file) %>%
    data.frame(ID = ., Cluster = 1:length(.)) %>%
    separate_rows(ID, sep = "\t") %>%
    left_join(genes_cluster, by = "ID") %>%
    group_by(Cluster) %>%
    mutate(Gene_Cluster = first(na.omit(Gene_Cluster)))

segment_pfam <- lapply(segment_pfam_files, read.table, header = T, sep = "\t", quote = "")
virus_pfam <- lapply(virus_pfam_files, read.table, header = T, sep = "\t", quote = "")
pfam_all <- c(segment_pfam, virus_pfam) %>%
    bind_rows %>%
    select(-c(Q.query, T.ss_pred, T.hit, T.consensus, Q.consensus, T.ss_dssp, Q.ss_pred)) %>%
    extract(Hit.Description, into = "Hit.Name", regex = "; (.+?) ;", remove = F)
pfam <- filter(pfam_all, Probab >= probab_threshold)

to_gff <- function(.data) {
    mutate(.data, Source = "hhsearch", Feature = "domain", Strand = ".", Fname = ".",
           Attributes = paste(Hit.ID, Hit.Description)) %>%
        select(ID, Source, Feature, Q.Start, Q.End, Probab, Strand, Fname, Attributes)
}
most_freq <- function(.data) {
    na.omit(.data) %>% table(useNA = "always") %>% which.max %>% names
}
first_nona <- function(.data) {
    first(na.omit(.data))
}

segment_genes <- lapply(segment_ffindex_files, read.table, sep = "\t") %>% setNames(segments)
virus_genes <- lapply(virus_ffindex_files, read.table, sep = "\t") %>% setNames(viruses)

data <- c(segment_genes, virus_genes) %>%
    bind_rows(.id = "Genome") %>%
    select(Genome, ID = 2) %>%
    left_join(pfam, by = c(ID = "Query")) %>%
    left_join(metadata, by = c(Genome = "short")) %>%
    left_join(genes_pfam, by = c(Hit.Name = "Pfam")) %>%
    group_by(Genome, ID, Comb) %>%
    mutate(Gene_Pfam = ifelse(n_distinct(Hit.ID) >= Comb.Num, Gene_Pfam, NA),
           Gene_Pfam = ifelse(is.na(No_threshold) | No <= No_threshold, Gene_Pfam, NA)) %>%
    left_join(clusters, by = "ID") %>%
    mutate(Cluster = as.character(ifelse(is.na(Cluster), ID, Cluster))) %>%
    ungroup %>%
    arrange(No, -Comb.Num) %>%
    group_by(Genome, ID, Cluster, clade, subclade) %>%
    summarize(Gene_Pfam = first_nona(Gene_Pfam), Gene_Cluster = first_nona(Gene_Cluster),
              Hit.ID = first(Hit.ID), Hit.Description = first(Hit.Description), Probab = first(Probab)) %>%
    group_by(Cluster) %>%
    mutate(Gene_Pfam = most_freq(Gene_Pfam), Gene_Cluster = most_freq(Gene_Cluster)) %>%
    mutate(Gene = ifelse(is.na(Gene_Cluster), Gene_Pfam, Gene_Cluster),
           Gene = ifelse(is.na(Gene), paste0("Cluster_", Cluster), Gene),
           Gene = ifelse(is.na(Gene), paste0("ID_", ID), Gene)) %>%
    ungroup
write.table(data, data_file, sep = "\t", quote = F, row.names = F)

virus_segments <- read.table(segment_metadata_file, sep = "\t", header = T) %>%
    mutate(scaffold = sub("Isogal_", "", scaffold)) # NB: workaround for Isogal scaffolds. TODO: do a proper fix

tree <- read.jtree(jtree_file)
tip.data <- filter(tree@data, !isInternal) %>%
    mutate(label = sub(" .*", "", title)) %>%
    extract(title, into = c("scaffold", "gene_num", "genome", "start", "end"),
            regex = "^(.+?)_(\\d+) (.+?) \\[(\\d+) - (\\d+)\\]", remove = F, convert = T) %>%
    left_join(virus_segments, by = c("scaffold")) %>%
    filter(is.na(scaffold_start) | is.na(start) | start >= scaffold_start & end <= scaffold_end) %>%
    mutate(Name = ifelse(is.na(short), Name, short)) %>%
    {setNames(.$Name, .$label)}
tree@phylo$tip.label <- tip.data[tree@phylo$tip.label]

tips.to.keep <- filter(data, clade == "Gezel") %>% pull(Genome) %>% `[`(. %in% tree@phylo$tip.label)
tips.to.drop <- tree@phylo$tip.label[! tree@phylo$tip.label %in% tips.to.keep]
my.tree <- tree@phylo %>%
    keep.tip(tips.to.keep) %>%
    drop.tip(tips.to.drop) %>%
    ladderize %>%
    chronos(lambda = 0.1)
svg(reduced_tree_file)
plot(my.tree)
dev.off()

#order.tips <- data.frame(tip.num = my.tree$edge[,2]) %>%
#    filter(tip.num <= length(my.tree$tip.label)) %>%
#    mutate(Genome = my.tree$tip.label[tip.num], num = n():1)
order.tips <- read.table(plv_order_file, col.names = "Genome") %>%
    mutate(num = n():1)

clades.data <- filter(data, clade %in% chosen_clades | Genome %in% ref_genomes) %>%
    mutate(Count = 1) %>%
    complete(Gene, nesting(Genome, clade, subclade), fill = list(Count = 0)) %>%
    group_by(Gene) %>%
    mutate(N_Subclades = n_distinct(subclade[clade == "Gezel" & Count > 0]),
           N_Genomes = n_distinct(Genome[clade == "Gezel" & Count > 0])) %>%
    filter(N_Subclades > 2) %>%
    mutate(clade = factor(clade, levels = clade_levels)) %>%
    left_join(order.tips, by = "Genome") %>%
    replace_na(list(num = 0)) %>%
    arrange(-N_Genomes, num, Gene, clade, desc(subclade), Genome) %>%
    ungroup %>%
    mutate(Gene = factor(Gene, levels = unique(Gene)), Genome = factor(Genome, levels = unique(Genome))) %>%
    filter(Count > 0)

family_colors <- read.table(family_colors_file, sep = "\t", col.names = c("Gene", "color"), comment.char = "") %>%
    right_join(clades.data, by = "Gene") %>%
    group_by(Gene, Cluster, color) %>%
    filter(!is.na(color)) %>%
    summarize(n = n(), .groups = "drop") %>%
    group_by(Gene) %>%
    arrange(n) %>%
    mutate(a = seq(0.6, 0, len = n())) %>%
    mutate(r = col2rgb(color)[1], g = col2rgb(color)[2], b = col2rgb(color)[3]) %>%
    mutate(r = (255 - r) * a + r, g = (255 - g) * a + g, b = (255 - b) * a + b) %>%
    mutate(color = rgb(r / 255, g / 255, b / 255)) %>%
    with(setNames(color, Cluster))

ggplot_scatterpie <- function(.data, x.col, y.col, z.col, val.col, group.col) {
    x_uniq <- levels(.data[[x.col]])
    y_uniq <- levels(.data[[y.col]])
    .data <- ungroup(.data) %>%
        mutate(x = as.numeric(.[[x.col]]), y = as.numeric(.[[y.col]])) %>%
        rename(value = !!val.col, z = !!z.col)
    ggplot() +
        geom_scatterpie(aes(x = x, y = y), data = .data, long_format = T, cols = "z", color = NA) +
        scale_x_continuous(breaks = 1:length(x_uniq), labels = x_uniq) +
        scale_y_continuous(breaks = 1:length(y_uniq), labels = y_uniq) +
        xlab(x.col) + ylab(y.col) #+
        #facet_grid(rows = group.col, scales = "free_y", space = "free_y")
}

p <- ggplot_scatterpie(clades.data, "Gene", "Genome", "Cluster", "Count", "subclade") +
    scale_fill_manual(values = family_colors, na.value = "black") +
    coord_equal() +
    theme_void() +
    theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.position = "none",
          axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1), axis.text.y = element_text())
ggsave(core_genes_file, p, height = 12)

vertex_metadata <- list(
    Cluster = distinct(data, Cluster, Gene) %>%
        mutate(label = ifelse(grepl("Cluster", Gene), NA, as.character(Gene)), subclade = "Cluster") %>%
        column_to_rownames("Cluster"),
    Genome = mutate(metadata, label = short)
) %>% bind_rows(.id = "Type")

g <- filter(data, clade == "Gezel") %>%
    distinct(Genome, Cluster) %>%
    group_by(Cluster) %>%
    filter(n() > 2) %>%
    mutate(present = T) %>%
    spread(Genome, present, F) %>%
    column_to_rownames("Cluster") %>%
    as.matrix %>%
    graph.incidence %>%
    set_vertex_attr("subclade", value = vertex_metadata[match(V(.)$name, rownames(vertex_metadata)), "subclade"]) %>%
    set_vertex_attr("label", value = vertex_metadata[match(V(.)$name, rownames(vertex_metadata)), "label"]) %>%
    set_vertex_attr("gene", value = vertex_metadata[match(V(.)$name, rownames(vertex_metadata)), "Gene"]) %>%
    set_vertex_attr("node_type", value = vertex_metadata[match(V(.)$name, rownames(vertex_metadata)), "Type"])

colors <- read.table(colors_file, comment.char = "") %>%
    with(setNames(V2, V1)) %>%
    c(Cluster = "black", "darkgray")

set.seed(1234)
p <- ggraph(g, "fr") +
    geom_edge_link(alpha = .1) +
    geom_node_point(aes(shape = node_type, colour = subclade, size = type)) +
    geom_node_text(aes(filter = !type, label = label), size = 1.5, color = "red", repel = T, nudge_y = -0.01, hjust = "right") +
    geom_node_text(aes(filter = type, label = label), size = 2, color = "black", repel = T, nudge_y = -0.01, hjust = "right") +
    scale_size_manual(values = c(1, 2)) +
    scale_color_manual(values = colors) +
    theme_graph(base_family = "sans")
ggsave(bipart_file, p, width = 10, height = 8)
```
```r
suppressPackageStartupMessages(library(dplyr))
library(XML)
library(ggplot2)
library(ggseqlogo)
library(stringr)
library(tidyr)
suppressPackageStartupMessages(library(cowplot))

input_file <- unlist(snakemake@input)
output_file <- unlist(snakemake@output)
with(snakemake@params, {
    motif_res <<- data.frame(motif_name = motif_res, regex = motif_res)
    meme_name <<- meme_name
    max_len <<- max_len
})

data <- xmlToList(xmlParse(input_file))

sequences <- data$training_set %>%
    `[`(names(.) == "sequence") %>%
    bind_rows %>%
    mutate(length = as.numeric(length))

motif.data <- data$motifs %>%
    `[`(names(.) == "motif") %>%
    setNames(sapply(., function(x) x$.attrs["id"]))
motifs <- motif.data %>%
    lapply(function(x) as.data.frame(t(x$.attrs))) %>%
    bind_rows %>%
    mutate_at(vars(-id, -name, -alt), as.numeric) %>%
    mutate(sites_r = sites / nrow(sequences))
motifs.filtered <- left_join(motifs, motif_res, by = character()) %>%
    filter(str_match(name, regex) > 0) %>%
    distinct(motif_name, .keep_all = T)

site.attrs <- function(site) {
    attrs <- as.data.frame(t(site$.attrs))
    attrs$left_flank <- site$left_flank
    attrs$site <- paste(site$site[names(site$site) == "letter_ref"], collapse = "")
    attrs$right_flank <- site$right_flank
    return(attrs)
}
fetch.sites <- function(sites) {
    sites <- sites$contributing_sites
    sites[names(sites) == "contributing_site"] %>%
        lapply(site.attrs) %>%
        bind_rows
}
sites <- lapply(motif.data, fetch.sites) %>%
    bind_rows(.id = "id") %>%
    filter(id %in% motifs.filtered$id) %>%
    left_join(sequences, by = c(sequence_id = "id")) %>%
    mutate(position = as.numeric(position) - length, pvalue = as.numeric(pvalue))

plot.motif <- function(motif, .y) {
    motif.sites <- sites %>% filter(id == .y$id)
    label <- with(motif, paste(
        gsub("_", " ", meme_name),
        name,
        paste("Sites =", sites, paste0("(", round(sites_r * 100, 1), "%)")),
        # paste("IC =", round(ic, 3)),
        paste("E-value =", e_value),
        sep = "\n"
    ))
    g1 <- ggplot() +
        annotate("text", label = label, x = 1, y = 1) +
        theme_bw() +
        theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(),
              panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())
    g2 <- motif.sites %>%
        pull(site) %>%
        ggseqlogo(seq_type = "dna") +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
    g3 <- ggplot(motif.sites, aes(x = position)) +
        geom_histogram(binwidth = 10) +
        xlim(-150, 0) +
        theme_bw() +
        theme(axis.text.y = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())
    g <- plot_grid(g1, g2, g3, nrow = 1, align = 'h')
    return(g)
}

if (nrow(motifs.filtered) == 0) {
    g <- ggplot() + theme_void()
    ggsave(output_file, g, width = 1, height = 1)
} else {
    g.list <- motifs.filtered %>%
        group_by(id) %>%
        group_map(plot.motif)
    ggsave(output_file, plot_grid(plotlist = g.list, ncol = 1), width = 7, height = length(g.list) * 1.4)
}
```
```r
library(dplyr)
library(tidyr)
library(igraph)
library(ggplot2)
# library(gggenes)

input <- snakemake@input
output <- snakemake@output
params <- snakemake@params

hmmsearch_files <- unlist(input)
# figure_file <- unlist(output["figure"])
bed_file <- unlist(output["bed"])

E.value.threshold <- unlist(params["e_value_threshold"]) # 0.001
distance.threshold <- unlist(params["distance_threshold"]) # 200000
genes.threshold <- unlist(params["genes_threshold"]) # 2

read.hmmer.tblout <- function(fname) {
    col.names <- c(
        "target.name", "target.accession", "query.name", "query.accession",
        "full.sequence.E.value", "full.sequence.score", "full.sequence.bias",
        "best.1.domain.E.value", "best.1.domain.score", "best.1.domain.bias",
        "domain.number.estimation.exp", "domain.number.estimation.reg",
        "domain.number.estimation.clu", "domain.number.estimation.ov",
        "domain.number.estimation.env", "domain.number.estimation.dom",
        "domain.number.estimation.rep", "domain.number.estimation.inc",
        "description.of.target"
    )
    numeric.cols <- which(col.names == "full.sequence.E.value") : which(col.names == "domain.number.estimation.exp")
    integer.cols <- which(col.names == "domain.number.estimation.reg") : which(col.names == "domain.number.estimation.inc")
    readLines(fname) %>%
        data.frame(line = .) %>%
        filter(!grepl("^#", line)) %>%
        separate(line, into = col.names, sep = " +", extra = "merge", convert = F) %>%
        mutate_at(numeric.cols, as.numeric) %>%
        mutate_at(integer.cols, as.integer)
}

data <- lapply(hmmsearch_files, read.hmmer.tblout) %>%
    bind_rows %>%
    extract(description.of.target, into = c("start", "end"), regex = "(\\d+) - (\\d+)", convert = T) %>%
    extract(target.name, into = "scaffold", regex = "^(.+)_\\d+$", remove = F) %>%
    extract(query.name, into = "gene", regex = "^([^_]+)", remove = F) %>%
    arrange(best.1.domain.E.value) %>%
    select(target.name, scaffold, query.name, gene, best.1.domain.E.value, best.1.domain.score, start, end) %>%
    filter(best.1.domain.E.value < E.value.threshold) %>%
    distinct(target.name, .keep_all = T)

relations <- left_join(data, data, by = "scaffold") %>%
    filter(start.x < start.y) %>%
    mutate(distance = pmin(abs(start.x - start.y), abs(start.x - end.y), abs(end.x - start.y), abs(end.x - end.y))) %>%
    filter(distance < distance.threshold) %>%
    arrange(start.x, start.y) %>%
    distinct(target.name.x, .keep_all = T) %>%
    select(target.name.x, target.name.y, scaffold, start.x, start.y, distance)

g <- graph_from_data_frame(relations, vertices = data)
V(g)$membership <- components(g)$membership

genes <- as_data_frame(g, "vertices") %>%
    mutate(molecule = paste(scaffold, membership)) %>%
    group_by(molecule) %>%
    mutate(n_genes = n_distinct(gene)) %>%
    filter(n_genes > 1)

group_by(genes, scaffold, membership) %>%
    summarize(start = min(start, end) - 1, end = max(start, end), .groups = "drop") %>%
    select(scaffold, start, end) %>%
    write.table(bed_file, row.names = F, col.names = F, quote = F, sep = "\t")
```
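The `min(start, end) - 1` in the final `summarize()` reflects BED's 0-based, half-open coordinate convention, versus the 1-based inclusive coordinates parsed out of the hmmsearch target descriptions. As a sketch of that conversion (a hypothetical helper, not from the repository):

```python
# BED intervals are 0-based, half-open: a 1-based inclusive range
# [start, end] becomes (start - 1, end).
def to_bed(start_1based: int, end_1based: int) -> tuple:
    return start_1based - 1, end_1based
```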
```python
from lxml import etree # Ubuntu Karmic package: python-lxml
import sys, re, os
import base64
from optparse import OptionParser
from io import IOBase
from six import string_types
import logging

log = logging.getLogger(__name__)

VERSION = '0.1.0' # keep in sync with setup.py

UNITS = ['pt','px','in','mm','cm']
PX_PER_INCH = 96 # http://wiki.inkscape.org/wiki/index.php/Units_In_Inkscape
                 # old inkscape versions are 90 or something else
MM_PER_INCH = 25.4
PT2IN = 1.0/72.0
IN2PT = 72.0
CM2PT = 72.0/2.54
PT2PX = 1.25
PX2PT = 1.0/1.25

relIRI_re = re.compile(r'url\(#(.*)\)')

def get_unit_attr(value):
    """coordinate handling from http://www.w3.org/TR/SVG11/coords.html#Units"""
    units = None # default (user)
    for unit_name in UNITS:
        if value.endswith(unit_name):
            log.debug("Units for {} are {}".format(value, unit_name))
            units = unit_name
            value = value[:-len(unit_name)]
            break
    val_float = float(value) # this will fail if units str not parsed
    return val_float, units

def convert_to_pixels(val, units):
    if units == 'px' or units is None:
        val_px = val
    elif units == 'pt':
        val_px = val*PT2PX
    elif units == 'in':
        val_px = val*IN2PT*PT2PX
    elif units == 'mm':
        val_px = val / MM_PER_INCH * PX_PER_INCH
    elif units == 'cm':
        val_px = val*CM2PT*PT2PX
    else:
        raise ValueError('unsupport unit conversion to pixels: %s'%units)
    log.debug("{} {} = {} px".format(val, units, val_px))
    return val_px

def fix_ids( elem, prefix, level=0 ):
    ns = '{http://www.w3.org/2000/svg}'
    if isinstance(elem.tag, string_types) and elem.tag.startswith(ns):
        tag = elem.tag[len(ns):]
        if 'id' in elem.attrib:
            elem.attrib['id'] = prefix + elem.attrib['id']
        # fix references (See http://www.w3.org/TR/SVGTiny12/linking.html#IRIReference )
        for attrib in elem.attrib.keys():
            value = elem.attrib.get(attrib,None)
            if value is not None:
                if attrib.startswith('{http://www.w3.org/1999/xlink}'):
                    relIRI = False
                else:
                    relIRI = True
                if (not relIRI) and value.startswith('#'):
                    # local IRI, change
                    iri = value[1:]
                    value = '#' + prefix + iri
                    elem.attrib[attrib] = value
                elif relIRI:
                    newvalue = re.sub( relIRI_re, r'url(#'+prefix+r'\1)', value)
                    if newvalue != value:
                        elem.attrib[attrib] = newvalue
    # Do same for children
    for child in elem:
        fix_ids(child,prefix,level=level+1)

def export_images( elem, filename_fmt='image%03d', start_idx=1 ):
    """replace inline images with files"""
    ns = '{http://www.w3.org/2000/svg}'
    href = '{http://www.w3.org/1999/xlink}href'
    count = 0
    if isinstance(elem.tag, string_types) and elem.tag.startswith(ns):
        tag = elem.tag[len(ns):]
        if tag=='image':
            buf = etree.tostring(elem, pretty_print=True)
            im_data = elem.attrib[href]
            exts = ['png','jpeg']
            found = False
            for ext in exts:
                prefix = 'data:image/'+ext+';base64,'
                if im_data.startswith(prefix):
                    data_base64 = im_data[len(prefix):]
                    found = True
                    break
            if not found:
                raise NotImplementedError('image found but not supported')
            # decode data
            data = base64.b64decode(data_base64)
            # save data
            idx = start_idx + count
            fname = filename_fmt%idx + '.' + ext
            if os.path.exists(fname):
                raise RuntimeError('File exists: %r'%fname)
            with open(fname,mode='w') as fd:
                fd.write( data )
            # replace element with link
            elem.attrib[href] = fname
            count += 1
    # Do same for children
    for child in elem:
        count += export_images(child, filename_fmt=filename_fmt,
                               start_idx=(start_idx+count) )
    return count

header_str = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Created with svg_stack (http://github.com/astraw/svg_stack) -->
"""

# ------------------------------------------------------------------
class Document(object):
    def __init__(self):
        self._layout = None

    def setLayout(self,layout):
        self._layout = layout

    def save(self, fileobj, debug_boxes=False, **kwargs):
        if self._layout is None:
            raise ValueError('No layout, cannot save.')
        accum = LayoutAccumulator(**kwargs)
        self._layout.render(accum, debug_boxes=debug_boxes)
        try:
            isfile = isinstance(fileobj, file)
        except NameError:
            isfile = isinstance(fileobj, IOBase)
        if isfile:
            fd = fileobj
            close = False
        else:
            fd = open(fileobj, mode='w')
            close = True
        buf = accum.tostring(pretty_print=True)
        fd.write(header_str)
        fd.write( buf.decode() )
        if close:
            fd.close()

class SVGFileBase(object):
    def __init__(self, fname):
        self._fname = fname
        self._root = etree.parse(fname).getroot()
        if self._root.tag != '{http://www.w3.org/2000/svg}svg':
            raise ValueError('expected file to have root element <svg:svg>')
        if 'height' in self._root.keys():
            height, height_units = get_unit_attr(self._root.get('height'))
            width, width_units = get_unit_attr(self._root.get('width'))
        else:
            # R svglite does not set the height and width attributes.
            # Get them from the viewBox attribute.
            vbox = self._root.get('viewBox')
            _, _, width, height = vbox.split(' ')
            width = float(width)
            height = float(height)
            width_units = height_units = 'px' # The default
        self._width_px = convert_to_pixels( width, width_units)
        self._height_px = convert_to_pixels( height, height_units)
        log.debug("Size of {} is {:.2f} x {:.2f} px".format(fname, self._width_px, self._height_px))
        self._orig_width_px = self._width_px
        self._orig_height_px = self._height_px
        self._coord = None # unassigned

    def get_root(self):
        return self._root

    def get_size(self,min_size=None,box_align=None,level=None):
        return Size(self._width_px, self._height_px)

    def _set_size(self, size):
        if self._width_px != size.width:
            log.warning("Changing width of {} from {:.2f} to {:.2f}".format(
                self._fname, self._width_px, size.width))
            self._width_px = size.width
        if self._height_px != size.height:
            log.warning("Changing height of {} from {:.2f} to {:.2f}".format(
                self._fname, self._height_px, size.height))
            self._height_px = size.height

    def _set_coord(self,coord):
        self._coord = coord

    def export_images(self, *args, **kwargs):
        export_images(self._root, *args, **kwargs)

class SVGFile(SVGFileBase):
    def __str__(self):
        return 'SVGFile(%s)'%repr(self._fname)

class SVGFileNoLayout(SVGFileBase):
    def __init__(self,fname,x=0,y=0):
        self._x_offset = x
        self._y_offset = y
        super(SVGFileNoLayout, self).__init__(fname)

    def _set_coord(self,coord):
        self._coord = (coord[0] + self._x_offset,
                       coord[1] + self._y_offset )

    def __str__(self):
        return 'SVGFileNoLayout(%s)'%repr(self._fname)

class LayoutAccumulator(object):
    def __init__(self):
        self._svgfiles = []
        self._svgfiles_no_layout = []
        self._raw_elements = []

    def add_svg_file(self, svgfile):
        assert isinstance(svgfile,SVGFile)
        if svgfile in self._svgfiles:
            raise ValueError('cannot accumulate SVGFile instance twice')
        self._svgfiles.append( svgfile )

    def add_svg_file_no_layout(self,svgfile):
        assert isinstance(svgfile,SVGFileNoLayout)
        if svgfile in self._svgfiles_no_layout:
            raise ValueError('cannot accumulate SVGFileNoLayout instance twice')
        self._svgfiles_no_layout.append( svgfile )

    def add_raw_element(self,elem):
        self._raw_elements.append( elem )

    def tostring(self, **kwargs):
        root = self._make_finalized_root()
        return etree.tostring(root, **kwargs)

    def _set_size(self, size):
        self._size = size

    def _make_finalized_root(self):
        # get all required namespaces and prefixes
        NSMAP = {None : 'http://www.w3.org/2000/svg',
                 'sodipodi':'http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd',
                 }
        for svgfile in self._svgfiles:
            origelem = svgfile.get_root()
            for key, value in origelem.nsmap.items():
                if key in NSMAP:
                    assert value == NSMAP[key]
                    # Already in namespace dictionary
                    continue
                elif key == 'svg':
                    assert value == NSMAP[None]
                    # svg is the default namespace - don't insert again.
                    continue
                log.debug("adding {} to NSMAP at {}".format(value, key))
                NSMAP[key] = value
        root = etree.Element('{http://www.w3.org/2000/svg}svg', nsmap=NSMAP)
        if 1:
            # inkscape hack
            root_defs = etree.SubElement(root,'{http://www.w3.org/2000/svg}defs')
        root.attrib['version']='1.1'
        fname_num = 0
        do_layout = True
        work_list=[]
        for svgfile in (self._svgfiles):
            work_list.append( (fname_num, do_layout, svgfile) )
            fname_num += 1
        do_layout = False
        for svgfile in (self._svgfiles_no_layout):
            work_list.append( (fname_num, do_layout, svgfile) )
            fname_num += 1
        for (fname_num, do_layout, svgfile) in work_list:
            origelem = svgfile.get_root()
            fix_id_prefix = 'id%d:' % fname_num
            elem = etree.SubElement(root,'{http://www.w3.org/2000/svg}g')
            elem.attrib['id'] = 'id{}'.format(fname_num)
            elem_sz = svgfile.get_size()
            width_px = elem_sz.width
            height_px = elem_sz.height
            # copy svg contents into new group
            for child in origelem:
                if 1:
                    # inkscape hacks
                    if child.tag == '{http://www.w3.org/2000/svg}defs':
                        # copy into root_defs, not into sub-group
                        log.debug("Copying element from {}".format(svgfile))
                        for subchild in child:
                            fix_ids( subchild, fix_id_prefix )
                            root_defs.append( subchild )
                        continue
                    elif child.tag == '{http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd}:namedview':
                        # don't copy
                        continue
                    elif child.tag == '{http://www.w3.org/2000/svg}metadata':
                        # don't copy
                        continue
                elem.append(child)
            fix_ids( elem, fix_id_prefix )
            translate_x = svgfile._coord[0]
            translate_y = svgfile._coord[1]
            if do_layout:
                if svgfile._orig_width_px != width_px:
                    log.info("svgfile: {} id {}".format(svgfile, fix_id_prefix))
                    log.error("orig width {:.2f} != width_px {:.2f}".format(
                        svgfile._orig_width_px, width_px))
                    raise NotImplementedError('rescaling width not implemented '
                                              '(hint: set alignment on file %s)'%(
                        svgfile,))
                if svgfile._orig_height_px != height_px:
                    log.info("svgfile: {} id {}".format(svgfile, fix_id_prefix))
                    log.error("orig height {:.2f} != height_px {:.2f}".format(
                        svgfile._orig_height_px, height_px))
                    raise NotImplementedError('rescaling height not implemented '
                                              '(hint: set alignment on file %s)'%(
                        svgfile,))
            orig_viewBox = origelem.get('viewBox')
            if orig_viewBox is not None:
                # split by comma or whitespace
                vb_tup = orig_viewBox.split(',')
                vb_tup = [c.strip() for c in vb_tup]
                if len(vb_tup)==1:
                    # not separated by commas
                    vb_tup = orig_viewBox.split()
                assert len(vb_tup)==4
                vb_tup = [float(v) for v in vb_tup]
                vbminx, vbminy, vbwidth, vbheight = vb_tup
                sx = width_px / vbwidth
                sy = height_px / vbheight
                tx = translate_x - vbminx
                ty = translate_y - vbminy
                elem.attrib['transform'] = 'matrix(%s,0,0,%s,%s,%s)'%(sx, sy, tx, ty)
                log.debug("matrix xform ({}, 0, 0, {}, {}, {})".format(sx, sy, tx, ty))
            else:
                elem.attrib['transform'] = 'translate(%s,%s)'%(translate_x, translate_y)
                log.debug("Translating ({}, {})".format(translate_x, translate_y))
            root.append( elem )
        for elem in self._raw_elements:
            root.append(elem)
        root.attrib["width"] = repr(self._size.width)
        root.attrib["height"] = repr(self._size.height)
        return root

# ------------------------------------------------------------------
class Size(object):
    def __init__(self, width=0, height=0):
        self.width=width
        self.height=height

# directions for BoxLayout
LeftToRight = 'LeftToRight'
RightToLeft = 'RightToLeft'
TopToBottom = 'TopToBottom'
BottomToTop = 'BottomToTop'

# alignment values
AlignLeft = 0x01
AlignRight = 0x02
AlignHCenter = 0x04
AlignTop = 0x020
AlignBottom = 0x040
AlignVCenter = 0x080
AlignCenter = AlignHCenter | AlignVCenter

class Layout(object):
    def __init__(self, parent=None):
        if parent is not None:
            raise NotImplementedError('')

class BoxLayout(Layout):
    def __init__(self, direction, parent=None):
        super(BoxLayout, self).__init__(parent=parent)
        self._direction = direction
        self._items = []
        self._contents_margins = 0 # around edge of box
        self._spacing = 0 # between items in box
        self._coord = (0, 0) # default
        self._size = None # uncalculated

    def _set_coord(self,coord):
        self._coord = coord

    def render(self, accum, min_size=None, level=0, debug_boxes=0):
        size = self.get_size(min_size=min_size)
        if level==0:
            # set document size if top level
            accum._set_size(size)
        if debug_boxes>0:
            # draw black line around BoxLayout element
            debug_box = etree.Element('{http://www.w3.org/2000/svg}rect')
            debug_box.attrib['style']=(
                'fill: none; stroke: black; stroke-width: 2.000000;')
            sz=size
            debug_box.attrib['x']=repr(self._coord[0])
            debug_box.attrib['y']=repr(self._coord[1])
            debug_box.attrib['width']=repr(sz.width)
            debug_box.attrib['height']=repr(sz.height)
            accum.add_raw_element(debug_box)
        for (item, stretch, alignment, xml) in self._items:
            if isinstance( item, SVGFile ):
                accum.add_svg_file(item)
                if debug_boxes>0:
                    # draw red line around SVG file
                    debug_box= etree.Element('{http://www.w3.org/2000/svg}rect')
                    debug_box.attrib['style']=(
                        'fill: none; stroke: red; stroke-width: 1.000000;')
                    sz=item.get_size()
                    debug_box.attrib['x']=repr(item._coord[0])
                    debug_box.attrib['y']=repr(item._coord[1])
                    debug_box.attrib['width']=repr(sz.width)
                    debug_box.attrib['height']=repr(sz.height)
                    accum.add_raw_element(debug_box)
            elif isinstance( item, SVGFileNoLayout ):
                accum.add_svg_file_no_layout(item)
                if debug_boxes>0:
                    # draw green line around SVG file
                    debug_box= etree.Element('{http://www.w3.org/2000/svg}rect')
                    debug_box.attrib['style']=(
                        'fill: none; stroke: green; stroke-width: 1.000000;')
                    sz=item.get_size()
                    debug_box.attrib['x']=repr(item._coord[0])
                    debug_box.attrib['y']=repr(item._coord[1])
                    debug_box.attrib['width']=repr(sz.width)
                    debug_box.attrib['height']=repr(sz.height)
                    accum.add_raw_element(debug_box)
            elif isinstance( item, BoxLayout ):
                item.render( accum, min_size=item._size, level=level+1,
                             debug_boxes=debug_boxes)
            else:
                raise NotImplementedError(
                    "don't know how to accumulate item %s"%item)
            if xml is not None:
                extra = etree.Element('{http://www.w3.org/2000/svg}g')
                extra.attrib['transform'] = 'translate(%s,%s)'%(
                    repr(item._coord[0]),repr(item._coord[1]))
                extra.append(xml)
                accum.add_raw_element(extra)

    def get_size(self, min_size=None, box_align=0, level=0):
        cum_dim = 0 # size along layout direction
        max_orth_dim = 0 # size along other direction
        if min_size
```
is None: min_size = Size(0, 0) # Step 1: calculate required size along self._direction if self._direction in [LeftToRight, RightToLeft]: max_orth_dim = min_size.height dim_min_size = Size(width=0, height=max_orth_dim) else: max_orth_dim = min_size.width dim_min_size = Size(width=max_orth_dim, height=0) cum_dim += self._contents_margins # first margin item_sizes = [] for item_number,(item,stretch,alignment,xml) in enumerate(self._items): if isinstance(item, SVGFileNoLayout): item_size = Size(0, 0) else: item_size = item.get_size(min_size=dim_min_size, box_align=alignment,level=level+1) item_sizes.append( item_size ) if isinstance(item, SVGFileNoLayout): # no layout for this file continue if self._direction in [LeftToRight, RightToLeft]: cum_dim += item_size.width max_orth_dim = max(max_orth_dim,item_size.height) else: cum_dim += item_size.height max_orth_dim = max(max_orth_dim,item_size.width) if (item_number+1) < len(self._items): cum_dim += self._spacing # space between elements cum_dim += self._contents_margins # last margin orth_dim = max_orth_dim # value without adding margins max_orth_dim += 2*self._contents_margins # margins # --------------------------------- # Step 2: another pass in which expansion takes place total_stretch = 0 for item,stretch,alignment,xml in self._items: total_stretch += stretch if (self._direction in [LeftToRight, RightToLeft]): dim_unfilled_length = max(0,min_size.width - cum_dim) else: dim_unfilled_length = max(0,min_size.height - cum_dim) stretch_hack = False if dim_unfilled_length > 0: if total_stretch == 0: # BoxLayout in which stretch is 0, but unfilled space # exists. # XXX TODO: what is Qt policy in this case? stretch_hack = True stretch_inc = 0 else: stretch_inc = dim_unfilled_length / float(total_stretch) else: stretch_inc = 0 cum_dim = 0 # size along layout direction cum_dim += self._contents_margins # first margin is_last_item = False for i,(_item,old_item_size) in enumerate(zip(self._items,item_sizes)): if (i+1) >= len(self._items): is_last_item=True (item,stretch,alignment,xml) = _item if (self._direction in [LeftToRight, RightToLeft]): new_dim_length = old_item_size.width + stretch*stretch_inc if stretch_hack and is_last_item: new_dim_length = old_item_size.width + dim_unfilled_length new_item_size = Size( new_dim_length, orth_dim ) else: new_dim_length = old_item_size.height + stretch*stretch_inc if stretch_hack and is_last_item: new_dim_length = old_item_size.width + dim_unfilled_length new_item_size = Size( orth_dim, new_dim_length ) if isinstance(item, SVGFileNoLayout): item_size = Size(0,0) else: item_size = item.get_size(min_size=new_item_size, box_align=alignment,level=level+1) if self._direction == LeftToRight: child_box_coord = (cum_dim, self._contents_margins) elif self._direction == TopToBottom: child_box_coord = (self._contents_margins, cum_dim) else: raise NotImplementedError( 'direction %s not implemented'%self._direction) child_box_coord = (child_box_coord[0] + self._coord[0], child_box_coord[1] + self._coord[1]) child_box_size = new_item_size item_pos, final_item_size = self._calc_box( child_box_coord, child_box_size, item_size, alignment ) # FIXME : don't call internal funtion on another class # FIXME : get_size should not set size item._set_coord( item_pos ) item._set_size( final_item_size ) if self._direction in [LeftToRight, RightToLeft]: # Use requested item size so ill behaved item doesn't # screw up layout. cum_dim += new_item_size.width else: # Use requested item size so ill behaved item doesn't # screw up layout. 
cum_dim += new_item_size.height if not is_last_item: cum_dim += self._spacing # space between elements cum_dim += self._contents_margins # last margin # --------------------------------- # Step 3: calculate coordinates of each item if self._direction in [LeftToRight, RightToLeft]: size = Size(cum_dim, max_orth_dim) else: size = Size(max_orth_dim, cum_dim) self._size = size return size def _calc_box(self, in_pos, in_sz, item_sz, alignment): if (AlignLeft & alignment): left = in_pos[0] width = item_sz.width elif (AlignRight & alignment): left = in_pos[0]+in_sz.width-item_sz.width width = item_sz.width elif (AlignHCenter & alignment): left = in_pos[0]+0.5*(in_sz.width-item_sz.width) width = item_sz.width else: # expand left = in_pos[0] width = in_sz.width if (AlignTop & alignment): top = in_pos[1] height = item_sz.height elif (AlignBottom & alignment): top = in_pos[1]+in_sz.height-item_sz.height height = item_sz.height elif (AlignVCenter & alignment): top = in_pos[1]+0.5*(in_sz.height-item_sz.height) height = item_sz.height else: # expand top = in_pos[1] height = in_sz.height pos = (left,top) size = Size(width,height) return pos,size def _set_size(self, size): self._size = size def setSpacing(self,spacing): self._spacing = spacing def addSVG(self, svg_file, stretch=0, alignment=0, xml=None): if not isinstance(svg_file, SVGFile): svg_file = SVGFile(svg_file) if xml is not None: xml = etree.XML(xml) self._items.append((svg_file, stretch, alignment, xml)) def addSVGNoLayout(self, svg_file, x=0, y=0, xml=None): if not isinstance(svg_file,SVGFileNoLayout): svg_file = SVGFileNoLayout(svg_file,x=x,y=y) stretch=0 alignment=0 if xml is not None: xml = etree.XML(xml) self._items.append((svg_file,stretch,alignment,xml)) def addLayout(self, layout, stretch=0): assert isinstance(layout, Layout) alignment=0 # always expand a layout xml=None self._items.append((layout, stretch, alignment, xml)) class HBoxLayout(BoxLayout): def __init__(self, parent=None): super(HBoxLayout, self).__init__(LeftToRight, parent=parent) class VBoxLayout(BoxLayout): def __init__(self, parent=None): super(VBoxLayout, self).__init__(TopToBottom, parent=parent) # ------------------------------------------------------------------ def main(): usage = '''%prog FILE1 [FILE2] [...] [options] concatenate SVG files This will concatenate FILE1, FILE2, ... to a new svg file printed to stdout. ''' parser = OptionParser(usage, version=VERSION) parser.add_option("--margin",type='str', help='size of margin (in any units, px default)', default=None) parser.add_option("--direction",type='str', default='vertical', help='horizontal or vertical (or h or v)') (options, args) = parser.parse_args() fnames = args if options.direction.lower().startswith('v'): direction = 'vertical' elif options.direction.lower().startswith('h'): direction = 'horizontal' else: raise ValueError('unknown direction %s'%options.direction) if options.margin is not None: margin_px = convert_to_pixels(*get_unit_attr(options.margin)) else: margin_px = 0 if 0: fd = open('tmp.svg',mode='w') else: fd = sys.stdout doc = Document() if direction == 'vertical': layout = VBoxLayout() elif direction == 'horizontal': layout = HBoxLayout() for fname in fnames: layout.addSVG(fname, alignment=AlignCenter) layout.setSpacing(margin_px) doc.setLayout(layout) doc.save( fd ) if __name__=='__main__': main() |
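The script doubles as an importable module: the `Document`/`BoxLayout` API that `main()` drives can also be called directly, e.g. to stack figure panels from other rules. A minimal sketch, assuming the file is importable as `svg_stack` and mirroring what `main()` does (the panel file names are placeholders):

    # Minimal usage sketch of the layout API defined above (not part of the workflow).
    from svg_stack import Document, VBoxLayout, AlignCenter

    doc = Document()
    layout = VBoxLayout()                                # stack panels top-to-bottom
    layout.addSVG('panel_a.svg', alignment=AlignCenter)  # placeholder file names
    layout.addSVG('panel_b.svg', alignment=AlignCenter)
    layout.setSpacing(10)                                # 10 px between panels
    doc.setLayout(layout)
    doc.save('figure.svg')                               # computes layout, writes merged SVG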
library(dplyr)
library(tidyr)
library(tools)
library(ggraph)
library(ggforce)
library(igraph)
library(tidyverse)

if (interactive()) {
    setClass("snakemake", slots = list(input = "list", output = "list", params = "list", wildcards = "list"))
    snakemake <- new("snakemake",
        input = list(
            network  = "bak2/analysis/vcontact2/results/c1.ntw",
            genomes  = "bak2/analysis/vcontact2/results/genome_by_genome_overview.csv",
            profiles = "bak2/analysis/vcontact2/results/vConTACT_profiles.csv",
            segments = "metadata/viral_segments.tsv",
            viruses  = "metadata/viruses.tsv",
            colors   = "metadata/subclade_colors.txt"
        ),
        output = list(
            clustering = "bak2/output/vcontact2_clustering.svg"
        )
    )
}

with(snakemake@input, {
    network_file  <<- network
    genomes_file  <<- genomes
    profiles_file <<- profiles
    segments_file <<- segments
    viruses_file  <<- viruses
    colors_file   <<- colors
})
with(snakemake@output, {
    clustering_file <<- clustering
})

segments <- read.table(segments_file, sep = "\t", header = T)
viruses <- bind_rows(
    read.table(segments_file, sep = "\t", header = T),
    read.table(viruses_file,  sep = "\t", header = T)
)
subclades <- with(viruses, setNames(subclade, short))

sort_components <- function(.graph) {
    decompose(.graph) %>%
        `[`(order(sapply(., length), decreasing = T))
}

genomes <- read.csv(genomes_file, na.strings = "") %>%
    group_by(VC) %>%
    mutate(Cluster = ifelse(is.na(VC), Genome, paste(VC, paste(Genome, collapse = ", ")))) %>%
    left_join(viruses, by = c(Genome = "short"))
subclusters <- with(genomes, setNames(VC.Subcluster, Genome))

self_edges <- with(genomes, data.frame(A = Genome, B = Genome, weight = 1e-10))
network <- read.table(network_file, col.names = c("A", "B", "weight")) %>%
    bind_rows(self_edges) %>%
    mutate(weight = weight / 1000) %>%
    graph_from_data_frame(directed = F) %>%
    set_vertex_attr("subclade",   value = subclades[V(.)$name]) %>%
    set_vertex_attr("subcluster", value = subclusters[V(.)$name])

colors <- read.table(colors_file, comment.char = "") %>%
    with(setNames(V2, V1)) %>%
    c("darkgray")

p <- ggraph(network, "fr") +
    geom_edge_link(aes(color = -weight, width = weight), alpha = .1) +
    geom_node_point(aes(color = subclade), size = 2) +
    geom_node_text(aes(label = ifelse(is.na(subcluster), name, sprintf("%s\n(%s)", name, subcluster))), repel = T, nudge_y = -0.01) +
    scale_color_manual(values = colors) # +
    # theme_graph()
ggsave(clustering_file, p, width = 10, height = 10)
16 17 | script: "scripts/extract_element.py" |
27 28 | script: "scripts/mcl_add_to_gbk.py" |
13 14 | shell: "getorf -minsize 100 -maxsize 100000 -filter {input} > {output}" |
23 24 | shell: "hmmsearch -o /dev/null --tblout {output} {input.hmm} {input.fasta}" |
37 38 39 40 41 | shell: """ awk '!/^#/&&$5<{params.evalue}{{print $1,$3}}' OFS=\\\\t {input.txt} | \ parallel --colsep \\\\t seqkit faidx -f {input.fasta} {{1}} \| seqkit replace -p "'$'" -r '" {{2}}"' | seqkit seq -gm 100 """ |
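The awk step above filters an hmmsearch --tblout table: it skips comment lines and keeps target/query pairs whose full-sequence E-value (column 5) is below the cutoff; the hits are then extracted and cleaned with seqkit. The same selection logic written out in Python, for clarity only (a sketch, not part of the workflow; the file name and cutoff are placeholders):

    # Keep hmmsearch --tblout rows with full-sequence E-value below a cutoff
    # and report (target, query) pairs. Columns follow the tblout spec:
    # 1 = target name, 3 = query name, 5 = full-sequence E-value.
    evalue_cutoff = 1e-5                    # placeholder for params.evalue

    with open("hits.tblout") as fh:         # placeholder for input.txt
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            target, query, evalue = fields[0], fields[2], float(fields[4])
            if evalue < evalue_cutoff:
                print(f"{target}\t{query}")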
55 56 | script: "scripts/parse_hmmsearch.R" |
67 68 | shell: "seqkit subseq --up-stream {params.flanks} --down-stream {params.flanks} --bed {input.bed} {input.genome} | cut -f1 -d: > {output}" |
77 78 79 80 81 | shell: """ seqkit split -iO {output} {input} rename s/segments.id_// {output}/*.fna """ |
90 91 | shell: "gmsn.pl --format GFF3 --virus --output {output} {input}" |
102 103 | shell: "prodigal -i {input} -m -g 1 -p meta -f gff > {output}" |
117 118 | shell: "gffcompare -p {params.cprefix} -CTo {params.outprefix} {input}" |
130 131 | shell: "gffread -g {input.fna} -w - -o {output.gff} {input.gtf} | seqkit translate --trim -o {output.faa}" |
145 146 | shell: "cat {input} > {output}" |
153 154 | shell: "cat {input} > {output}" |
167 168 | shell: "hmmscan --cpu {threads} -o /dev/null --tblout {output.tblout} --domtblout {output.domtblout} {input.hmm} {input.fasta}" |
181 182 | shell: "hmmscan --cpu {threads} -o /dev/null --tblout {output.tblout} --domtblout {output.domtblout} {input.hmm} {input.fasta}" |
195 196 | shell: "hmmscan --cpu {threads} -o /dev/null --tblout {output.tblout} --domtblout {output.domtblout} {input.hmm} {input.fasta}" |
205 206 | shell: "hmmsearch -o /dev/null --tblout {output} {input.hmm} {input.fasta}" |
219 220 | shell: "pfam_scan.pl -fasta {input} -dir {params.pfam_dir} -outfile {output} -cpu {threads}" |
230 231 | shell: "blastn -query {input.fna} -db {input.fna} -outfmt 6 -out {output} -evalue 1e-20" |
256 257 | script: "scripts/gff2gbk.py" |
82 83 | run: print(f"{bcolors.FAIL}No task selected - choose one of %s\nKeep in mind that all rules except search_segments are dependent on extract_segments{bcolors.ENDC}\n" % ', '.join(exposed_rules)) |
92 93 | shell: "makeblastdb -in {input} -dbtype prot" |
102 103 | shell: "makeblastdb -in {input} -dbtype nucl" |
112 113 | shell: "seqkit faidx {input}" |
122 123 | shell: "seqkit faidx -f {input}" |
130 131 | shell: "mad {input}" |
13 14 | shell: "cat {input} > {output}" |
24 25 26 27 | shell: """ parallel --tagstring {{/.}} seqkit seq -ni {{}} ::: {input} | awk -vOFS=, 'BEGIN{{print"protein_id","contig_id","keywords"}}{{print$2,$1,""}}' > {output} """ |
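This one-liner builds the gene-to-genome mapping CSV: parallel's {/.} tag is each input file's base name (the genome), seqkit seq -ni lists the protein IDs, and awk emits one protein_id,contig_id,keywords row per protein. A Python sketch of the same mapping, assuming Biopython and hypothetical file names, with the input FASTA files passed as command-line arguments:

    # Sketch only (not part of the workflow): map each FASTA record ID to the
    # genome it came from, taken from the file's base name as {/.} does above.
    import csv, sys
    from os.path import basename, splitext
    from Bio import SeqIO

    with open("gene_to_genome.csv", "w", newline="") as out:  # placeholder output
        writer = csv.writer(out)
        writer.writerow(["protein_id", "contig_id", "keywords"])
        for faa in sys.argv[1:]:                              # input FASTA files
            genome = splitext(basename(faa))[0]
            for record in SeqIO.parse(faa, "fasta"):
                writer.writerow([record.id, genome, ""])

The resulting CSV is what the vcontact2 call below consumes via --proteins-fp.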
47 48 | shell: "vcontact2 --threads {threads} --db {params.db} --raw-proteins {input.fasta} --rel-mode {params.rel_mode} --proteins-fp {input.mapping} --pcs-mode {params.pcs_mode} --vcs-mode {params.vcs_mode} --c1-bin $CONDA_PREFIX/lib/cluster_one-v1.0.jar --output-dir {output.outdir}" |
62 63 | script: "scripts/vcontact2.R" |
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/BejaLab/Gezelvirus
Name: gezelvirus
Version: 1
Downloaded: 0
Copyright: Public Domain
License: None
Keywords:
prodigal
raxml-ng
vcontact2
APE
Biopython
BLAST
FiMO
GAPPA
gffcompare
gffread
ggseqlogo
ggtree
HHblits
hmmalign (genouest)
hmmbuild (genouest)
hmmscan (EBI)
hmmsearch (genouest)
MAFFT API (EBI)
MCL
MEME
metaprokka
MMseqs
Pandas
seqkit
seqret
Snakemake
subSeq
treeio
trimAl
cowplot
dplyr
ggforce
ggplot2
ggraph
igraph
phangorn
phytools
scatterpie
stringr
tibble
tidyr
tidyverse
XML
epang
lxml
six
viral-genomics