Carotenoid Antenna Workflow for Energy Transfer in Rhodopsin Pumps
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Carotenoid antenna workflow
This is the snakemake workflow behind the paper Chazan et al (2022) "Energy transfer in ubiquitous rhodopsin pumps with xanthophyll antennas". For supplementary data, see the Figshare repository .
Run as:
snakemake -c{number of cores} --use-conda
(the dependencies are under the conda control). Check the file
workflow/Snakefile
for specific rules.
To run the pipeline you have to download the heavy input data:
# Create direcotries
mkdir -p data/gem/split data/jgi data/om-rgc
#
# Download GEM data
wget https://portal.nersc.gov/GEM/genomes/faa.tar
tar xf faa.tar
find faa -name '*.faa.gz' | xargs zcat | seqkit split -s 1000000 -O data/gem/split/
#
# Download OM-RGC data
wget -P data/om-rgc/ https://www.ebi.ac.uk/biostudies/files/S-BSST297/Files/OM-RGC_v2.tsv.gz
wget -P data/om-rgc/ https://www.ebi.ac.uk/biostudies/files/S-BSST297/Files/OM-RGC_v2_gene_profile_metaG.tsv.gz
gunzip data/om-rgc/*.tsv.gz
#
# Download JGI rhodopsins
wget -O Data_jgi.tar.gz https://figshare.com/ndownloader/files/38419106
tar xfz Data_jgi.tar.gz
Code Snippets
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | from Bio import SeqIO from sys import stderr import csv import re from crc64iso.crc64iso import crc64 info_file = str(snakemake.input['info']) a2m_file = str(snakemake.input['a2m']) pos_file = str(snakemake.input['pos']) usearch_file = str(snakemake.input['usearch']) blasso_file = str(snakemake.input['blasso']) out_file = str(snakemake.output) infos = dict() data_type = 'JGI' header = [] id_cols = { 'OM-RGC_ID', 'genome_id', 'gene_oid', 'ID' } with open(info_file) as fh: tsv = csv.reader(fh, delimiter = '\t') header = next(tsv) header = [v + str(header[:i].count(v) + 1) if header.count(v) > 1 else v for i, v in enumerate(header)] id_cols = set(header) & id_cols assert len(id_cols) > 0, "Id column not found" id_col = id_cols.pop() has_sequence = 'sequence' in header for line in tsv: row = dict(zip(header, line)) record_id = row.pop(id_col).replace(' ', '_') if has_sequence: del row['sequence'] infos[record_id] = row header.remove(id_col) if has_sequence: header.remove('sequence') positions = [] groups = [] with open(pos_file) as fh: for line in fh: pos, group = line.rstrip().split() pos0 = int(pos) - 1 positions.append((pos0, group)) if len(group) > 1 and group not in groups: groups.append(group) blassos = dict() with open(blasso_file) as fh: for line in fh: record_id, wl_mean, wl_sd = line.split() blassos[record_id] = { 'wl_mean': wl_mean, 'wl_sd': wl_sd } matches = dict() with open(usearch_file) as fh: for line in fh: query, target, ident = line.split('\t') if query not in matches: target_id, *rest = target.split() search = re.search('{(.+?)}', target) assert search, "Unexpected reference header format: %s" % target family = search.group(1) matches[query] = { 'target': target, 'family': family, 'ident': float(ident) } aminoacids = [ 'A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y' ] with open(out_file, 'w') as out_fh: tsv = csv.writer(out_fh, delimiter = "\t") tsv.writerow([ 'record_id', 'checksum', 'target', 'family', 'ident', 'wl_mean', 'wl_sd' ] + groups + header) with open(a2m_file) as a2m_fh: a2m = SeqIO.parse(a2m_fh, 'fasta') ref = next(a2m) ref_pos = 0 aln_pos = [] for pos in range(len(ref.seq)): if ref.seq[pos] not in [ '-', '.' ]: for pos0, group in positions: if ref_pos == pos0: aln_pos.append((pos, group)) ref_pos += 1 for rec in a2m: if rec.id in matches: seq = re.sub('[^A-Z]', '', str(rec.seq)) cks = crc64(seq) match = matches[rec.id] blasso = blassos[rec.id] info_id, *rest = rec.id.split('/') info = infos[info_id] res = { group: [] for group in groups } for pos, group in aln_pos: if group in groups: res[group].append(rec.seq[pos]) elif rec.seq[pos] != group: break else: row = [ rec.id, cks, match['target'], match['family'], match['ident'], blasso['wl_mean'], blasso['wl_sd'] ] row += [ ''.join(res[x]) for x in groups ] row += info.values() tsv.writerow(row) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | snakemake@source("plot_functions.R") with(snakemake@input, { metadata_file <<- metadata tsv_file <<- tsv crt_files <<- crt colors_file <<- colors }) with(snakemake@output, { tax_file <<- tax sum_file <<- sum car_file <<- car }) aliases <- list( BCD = "BLH", Lycopene_cycl = "CrtY", crtI_fam = "CrtI" ) metadata <- read.table(metadata_file, header = T, sep = "\t", comment.char = "") %>% filter(ecosystem_category == "Aquatic", habitat != "Groundwater") %>% mutate(Ecosystem.Type = ecosystem_type, Specific.Ecosystem = habitat, Ecosystem.Subtype = habitat, Isolation = habitat, x = longitude, y = latitude) %>% my_ecosystem taxa <- read.table(colors_file, sep = "\t", comment.char = "") %>% arrange(V1) %>% with(setNames(V2,V1)) genes <- lapply(crt_files, read.table) %>% bind_rows %>% select(1, 3, 5) %>% setNames(c("record_id", "gene", "E_value")) %>% filter(E_value < 1e-15) %>% separate(record_id, into = c("genome_id", "Gene.ID"), sep = "/") %>% mutate(gene = recode(gene, !!!aliases)) %>% distinct(genome_id, gene) all_data <- read.table(tsv_file, header = T, sep = "\t", quote = "", comment.char = "") %>% rename(Taxonomy = "ecosystem") %>% # fix bug filter(ecosystem_category == "Aquatic", !grepl("-", motif)) %>% mutate(Ecosystem.Type = ecosystem_type, Specific.Ecosystem = habitat, Ecosystem.Subtype = habitat, Isolation = habitat, x = longitude, y = latitude, Abundance = 1) %>% my_ecosystem %>% my_rhodopsin_class %>% separate(record_id, into = c("genome_id", "Gene.ID"), sep = "/") %>% mutate(window_type = case_when( window %in% c("W", "F") ~ "FW", window %in% c("G") ~ "G", T ~ "other" )) %>% distinct(genome_id, class, window, .keep_all = T) data <- complete(all_data, class, window, nesting(x, y, Ecosystem), fill = list(Abundance = 0)) %>% group_by(x, y, Ecosystem, class, window) %>% summarize(Abundance = sum(Abundance), .groups = "drop") by_ecosystem <- group_by(data, class, window, Ecosystem) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% group_by(Ecosystem) %>% group_split %>% lapply(my_logo_matrix) aliases <- list(Cyanobacteriota = "Cyanobacteria") by_taxa <- distinct(all_data, otu_id, .keep_all = T) %>% separate_rows(Taxonomy, sep = ";") %>% filter(class %in% c("x", "p"), window_type %in% c("FW", "G")) %>% separate(Taxonomy, into = c("Rank", "Name"), sep = "__") %>% filter(Rank == "p") %>% group_by(class, window_type, Ecosystem, Name) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% group_by(Name) %>% mutate(Name = ifelse(sum(Abundance) > 10, Name, NA)) %>% mutate(Name = recode(Name, !!!aliases)) tmp <- left_join(metadata, all_data, by = "genome_id", suffix = c("", ".all_data")) %>% distinct(genome_id, .keep_all = T) %>% group_by(Ecosystem) %>% summarize(num_px = sum(class %in% c("x", "p") & window_type %in% c("FW", "G")), num_all = n()) all_taxa <- unique(na.omit(by_taxa$Name)) missing_taxa <- all_taxa[! all_taxa %in% names(taxa)] if (length(missing_taxa) > 0) { write("Taxa not in the color file:", stderr()) write(missing_taxa, stderr()) q() } by_gene <- all_data %>% filter(class %in% c("x", "p"), window_type %in% c("WF", "G")) %>% left_join(genes, by = "genome_id") %>% mutate(present = T) %>% complete(nesting(genome_id, class, window_type), gene, fill = list(present = F)) %>% filter(!is.na(gene)) %>% group_by(class, window_type, gene, present) %>% summarize(n = n()) p <- ggplot(by_gene, aes(x = gene, fill = present, y = n)) + geom_bar(position = "stack", stat = "identity") + facet_grid(class ~ window_type) + theme_bw() ggsave(car_file, p) p <- ggplot(by_taxa, aes(fill = Name, x = Ecosystem, y = Abundance)) + geom_bar(position = "stack", stat = "identity") + scale_fill_manual(values = taxa) + facet_grid(class ~ window_type) + theme_bw() ggsave(tax_file, p, width = 7, height = 5) p <- my_ggplot_ecosystems() p <- Reduce(my_add_grob_to_ecosystem, by_ecosystem, p) ggsave(sum_file, p, height = 8, width = 16) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | snakemake@source("plot_functions.R") with(snakemake@input, { tsv_files <<- tsv metadata_file <<- metadata }) with(snakemake@output, { map_file <<- map sum_file <<- sum motif_counts_file <<- motif_counts motif_abundance_file <<- motif_abundance }) metadata <- read.table(metadata_file, header = T, sep = "\t", quote = "", comment.char = "", na.strings = c("", "n/a")) %>% filter(JGI.Data.Utilization.Status == "Unrestricted", GOLD.Analysis.Project.Type == "Metagenome Analysis") %>% my_ecosystem %>% mutate(x = round(Longitude), y = round(Latitude)) all.data <- lapply(tsv_files, read.table, header = T, sep = "\t", quote = "", comment.char = "") %>% lapply(select, "record_id", "checksum", "motif", "window", "blasso", "family", "wl_mean", "wl_sd", "Locus.Tag", "Genome.ID", "NCBI.Biosample.Accession", "Scaffold.Read.Depth") %>% bind_rows %>% filter(!grepl("-", motif)) %>% distinct(record_id, .keep_all = T) %>% group_by(Genome.ID) %>% arrange(-n()) %>% group_by(NCBI.Biosample.Accession) %>% filter(Genome.ID == first(Genome.ID)) %>% my_rhodopsin_class %>% left_join(metadata, by = c(Genome.ID = "taxon_oid")) %>% mutate(Abundance = Scaffold.Read.Depth) %>% filter(!is.na(x)) %>% group_by(x, y) %>% filter(n_distinct(Ecosystem) == 1) %>% filter(sum(class %in% c("x", "p")) > 9) data <- filter(all.data, class %in% c("x", "p")) %>% ungroup %>% complete(class, window, nesting(x, y, Ecosystem), fill = list(Abundance = 0)) %>% group_by(x, y, Ecosystem) %>% group_by(x, y, Ecosystem, class, window) %>% summarize(Abundance = sum(Abundance), .groups = "drop") motifs_counts <- all.data %>% distinct(checksum, .keep_all = T) %>% group_by(family, motif, Ecosystem) %>% summarize(Abundance = n()) motifs_high <- all.data %>% group_by(motif) %>% summarize(Abundance = sum(Abundance)) %>% arrange(-Abundance) %>% head(n = 12) %>% pull(motif) motifs_weighted <- all.data %>% group_by(family, motif, Ecosystem) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% mutate(motif_top = factor(motif, levels = motifs_high)) %>% group_by(Ecosystem) %>% mutate(pct = Abundance / sum(Abundance) * 100) %>% group_by(family) %>% mutate(Family_abundance = sum(Abundance)) %>% ungroup %>% filter(Family_abundance / sum(Abundance) > 0.001) %>% arrange(-Family_abundance) %>% mutate(family = factor(family, levels = unique(family))) p <- ggplot(motifs_counts, aes(fill = motif, x = family, y = Abundance)) + geom_bar(position = "stack", stat = "identity") + facet_wrap(. ~ Ecosystem, scales = "free_y") ggsave(motif_counts_file, p, width = 14) p <- ggplot(motifs_weighted, aes(fill = motif_top, x = family, y = pct)) + geom_bar(position = "stack", stat = "identity") + facet_wrap(. ~ Ecosystem) + scale_fill_brewer(palette = "Set3", na.value = "grey50") + theme_bw() ggsave(motif_abundance_file, p, width = 5.6595, height = 3.1395) #by_location <- group_by(data, x, y, Ecosystem) %>% # group_split %>% # lapply(my_logo_matrix) by_location <- group_by(data, x, y, Ecosystem) %>% filter(class %in% c("p", "x")) %>% summarize(Total = log10(sum(Abundance)), G = sum(Abundance[window %in% "G"], na.rm = T), WF = sum(Abundance[window %in% c("W", "F")], na.rm = T)) by_ecosystem <- group_by(data, class, window, Ecosystem) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% group_by(Ecosystem) %>% group_split %>% lapply(my_logo_matrix) #p <- my_ggplot_world() #p <- Reduce(my_add_grob_to_map, by_location, p) width <- 16 height <- 8 p <- my_pies(by_location, width, height) ggsave(map_file, p, height = height, width = width) p <- my_ggplot_ecosystems() p <- Reduce(my_add_grob_to_ecosystem, by_ecosystem, p) ggsave(sum_file, p, height = 3, width = 4) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 | snakemake@source("plot_functions.R") library(readxl) library(car) library(lme4) with(snakemake@input, { tsv_file <<- tsv metadata_file <<- metadata abundance_file <<- abundance crtZ_file <<- crtZ crtZ_abundance_file <<- crtZ_abundance single_copy_files <<- single_copy }) with(snakemake@params, { sheet_name <<- sheet }) with(snakemake@output, { map_file <<- map map_percell_file <<- map_percell sum_file <<- sum dep1_file <<- dep1 lat1_file <<- lat1 test1_file <<- test1 dep2_file <<- dep2 lat2_file <<- lat2 test2_file <<- test2 }) metadata <- read_excel(metadata_file, sheet_name, .name_repair = "universal") %>% mutate(x = round(Longitude), y = round(Latitude)) %>% mutate(Ecosystem = factor("Marine")) abundance <- read.table(abundance_file, header = T) %>% gather(Sample, Abundance, -OMRGC_ID) %>% left_join(metadata, by = c(Sample = "PANGAEA.sample.id")) all.data <- read.table(tsv_file, header = T, sep = "\t", na.strings = c("NA", "")) %>% filter(!is.na(wl_mean)) %>% my_rhodopsin_class %>% left_join(abundance, by = c(record_id = "OMRGC_ID")) %>% mutate(Depth = suppressWarnings(as.numeric(Depth..nominal)), fenestrated = case_when(window == "G" ~ T, window %in% c("W","F") ~ F, T ~ NA)) single_copy <- lapply(single_copy_files, read.table, header = T) %>% lapply(select, -OMRGC_ID) %>% lapply(colSums) %>% bind_rows %>% colMeans %>% data.frame(Abundance = ., Sample = names(.)) %>% filter(Sample %in% all.data$Sample) %>% left_join(metadata, by = c(Sample = "PANGAEA.sample.id")) motifs_counts <- all.data %>% distinct(checksum, .keep_all = T) %>% group_by(family, motif) %>% summarize(Abundance = n()) motifs_weighted <- all.data %>% group_by(family, motif) %>% summarize(Abundance = sum(Abundance)) # NB: these plots are not saved #p <- ggplot(motifs_counts, aes(fill = motif, x = family, y = Abundance)) + # geom_bar(position = "stack", stat = "identity") #p <- ggplot(motifs_weighted, aes(fill = motif, x = family, y = Abundance)) + # geom_bar(position = "stack", stat = "identity") ratio.data <- filter(all.data, class %in% c("x", "p"), !is.na(fenestrated), !is.na(Depth)) %>% # filter(Polar == "Non polar") %>% group_by(Sample, x, y, Ecosystem, Depth, Station, Polar, fenestrated) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% spread(fenestrated, Abundance, fill = 0) %>% mutate(Abundance = `TRUE` + `FALSE`, ratio = `TRUE` / Abundance) model <- lmer(ratio ~ log10(Depth + 1) + abs(y) + (1 | Station), data = ratio.data) capture.output(Anova(model, type = "II"), file = test1_file) p <- ggplot(ratio.data, aes(x = Depth, y = ratio, size = Abundance)) + geom_point() + geom_smooth(method = 'lm', formula = y ~ x) + scale_x_continuous(trans = 'log10') ggsave(dep1_file, p) p <- ggplot(ratio.data, aes(x = abs(y), y = ratio, size = Abundance)) + geom_point() + geom_smooth(method = 'lm', formula = y ~ x) ggsave(lat1_file, p) weighted_mean <- function(x, w) { sum(w * x) / sum(w) } weighted_sd <- function(x, w) { x_mean <- weighted_mean(x, w) M <- sum(w > 0) sqrt( sum(w * (x - x_mean)^2) * M / (M-1) / sum(w) ) } wl.data <- filter(all.data, class %in% c("x", "p"), !is.na(fenestrated), !is.na(Depth)) %>% group_by(x, y, Ecosystem, Depth, Station, Polar) %>% summarize(wl_avg = weighted_mean(wl_mean, Abundance), wl_sd = weighted_sd(wl_mean, Abundance), Abundance = sum(Abundance), .groups = "drop") model <- lmer(wl_avg ~ log10(Depth + 1) + abs(y) + (1 | Station), data = wl.data) capture.output(Anova(model, type = "II"), file = test2_file) p <- ggplot(wl.data, aes(x = Depth, y = wl_avg, size = Abundance)) + geom_pointrange(aes(ymin = wl_avg - wl_sd, ymax = wl_avg + wl_sd)) + geom_smooth(method = 'lm', formula = y ~ x) + scale_size_continuous(range = c(0.1,0.5)) + scale_x_continuous(trans = 'log10') ggsave(dep2_file, p) p <- ggplot(wl.data, aes(x = abs(y), y = wl_avg, size = Abundance)) + geom_pointrange(aes(ymin = wl_avg - wl_sd, ymax = wl_avg + wl_sd)) + scale_size_continuous(range = c(0.1,0.5)) + geom_smooth(method = 'lm', formula = y ~ x) ggsave(lat2_file, p) crtZ.abundance <- read.table(crtZ_abundance_file, header = T) %>% gather(Sample, Abundance, -OMRGC_ID) %>% left_join(metadata, by = c(Sample = "PANGAEA.sample.id")) crtZ.data <- read.table(crtZ_file) %>% select(OMRGC_ID = "V1", evalue = "V5") %>% filter(evalue < 1e-15) %>% left_join(crtZ.abundance, by = "OMRGC_ID") %>% group_by(Sample, x, y, Station) %>% summarize(CrtZ.abundance = sum(Abundance), .groups = "drop") %>% left_join(select(ratio.data, Sample, Depth, Abundance, `TRUE`, `FALSE`), by = "Sample") %>% mutate(ratio = CrtZ.abundance / `TRUE`) model <- lmer(ratio ~ log10(Depth + 1) + abs(y) + (1 | Station), data = crtZ.data) capture.output(Anova(model, type = "II"), file = test2_file) p <- ggplot(crtZ.data, aes(x = CrtZ.abundance, y = `TRUE`)) + geom_point() + geom_smooth(method = 'lm', formula = y ~ x) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') ggsave("TRUE.pdf", p) p <- ggplot(crtZ.data, aes(x = CrtZ.abundance, y = `FALSE`)) + geom_point() + geom_smooth(method = 'lm', formula = y ~ x) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') ggsave("FALSE.pdf", p) p <- ggplot(crtZ.data, aes(x = Depth, y = ratio, size = Abundance)) + geom_point() + geom_smooth(method = 'lm', formula = y ~ x) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') ggsave("tmp.pdf", p) # prots <- distinct(all.data, record_id, .keep_all=T) # p <- ggplot(prots, aes(x = window, y = wl_mean)) + geom_boxplot() # ggsave(win_wl_file, p) data <- group_by(all.data, x, y, Ecosystem, class, window) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% complete(class, window, nesting(x, y, Ecosystem), fill = list(Abundance = 0)) single_copy_data <- group_by(single_copy, x, y, Ecosystem) %>% summarize(Abundance = sum(Abundance), .groups = "drop") by_location <- left_join(data, single_copy_data, by = c("x", "y", "Ecosystem"), suffix = c("", ".SC")) %>% group_by(x, y, Ecosystem) %>% filter(class %in% c("p", "x")) %>% summarize( Total.abundance = sum(Abundance), Total.percell = Total.abundance / first(Abundance.SC), G = sum(Abundance[window %in% "G"], na.rm = T), WF = sum(Abundance[window %in% c("W", "F")], na.rm = T), .groups = "drop" ) by_ecosystem <- group_by(data, class, window, Ecosystem) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% my_logo_matrix width <- 16 height <- 8 p <- mutate(by_location, Total = log10(Total.abundance)) %>% my_pies(width, height, labeller = function(r) sprintf('1e%d', r)) ggsave(map_file, p, height = height, width = width) scale <- 10 p <- mutate(by_location, Total = Total.percell * scale) %>% my_pies(width, height, labeller = function(r) sprintf('%d%%', r / scale * 100)) ggsave(map_percell_file, p, height = height, width = width) p <- my_ggplot_ecosystems() p <- my_add_grob_to_ecosystem(p, by_ecosystem) ggsave(sum_file, p, height = 3, width = 4) #### Calculation of % of fenestrated rhods per cell in a sample #### #### not yet integrated #### data <- group_by(all.data, Sample, Layer, class, window) %>% summarize(Abundance = sum(Abundance), .groups = "drop") %>% complete(class, window, Sample, fill = list(Abundance = 0)) single_copy_data <- group_by(single_copy, Sample) %>% summarize(Abundance = sum(Abundance), .groups = "drop") by_location <- left_join(data, single_copy_data, by = "Sample", suffix = c("", ".SC")) %>% filter(class %in% c("p", "x"), window %in% c("G","W","F")) %>% group_by(Sample, Layer) %>% summarize( Total.abundance = sum(Abundance), Total.percell = Total.abundance / first(Abundance.SC), G = sum(Abundance[window %in% "G"], na.rm = T), WF = sum(Abundance[window %in% c("W", "F")], na.rm = T), .groups = "drop" ) %>% mutate(G.percell = Total.percell * G / Total.abundance) %>% group_by(Layer) %>% summarize(G.percell.mean = mean(G.percell), G.percell.sd = sd(G.percell)) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | library(treeio) library(ggtree) library(ape) library(phytools) library(dplyr) library(tidyr) library(ggplot2) library(phangorn) library(ggnewscale) library(castor) library(seqinr) if (interactive()) { Snakemake <- setClass("Snakemake", slots = list(input = "list", output = "list")) snakemake <- Snakemake( input = list(outgroup = "input/outgroups.fasta", tree = "analysis/phylogeny/rhodopsins.treefile", metadata = "analysis/parse/phylogeny.tsv"), output = list("tmp.pdf") ) } with(snakemake@input, { outgroup_file <<- outgroup tree_file <<- tree tsv_file <<- tsv colors_file <<- colors a2m_file <<- a2m }) with(snakemake@output, { output_file_small <<- small output_file_big <<- big output_jtree <<- jtree }) taxa <- read.table("metadata/taxa.txt", sep = "\t", comment.char = "") %>% arrange(1) %>% with(setNames(V2, V1)) outgroups <- names(read.fasta(outgroup_file)) tsv <- read.table(tsv_file, header = T, sep = "\t", na.strings = "", fill = T) %>% select(-target) metadata <- read.fasta(a2m_file, seqtype = "AA", as.string = T) %>% {data.frame(label = names(.), sequence = as.character(.))} %>% left_join(tsv, by = c(label = "record_id")) %>% mutate(is_outgroup = label %in% outgroups) %>% mutate(Alias = gsub(",.+", "", Alias)) %>% mutate(Activity = ifelse(grepl("\\]$", Activity), NA, gsub("[][]", "", Activity))) %>% mutate(Highlight = !is.na(Highlight)) %>% mutate(D85 = substr(motif, 1, 1), T89 = substr(motif, 2, 2), D96 = substr(motif, 3, 3), G156 = window) tree <- read.tree(tree_file) tree.unrooted <- as_tibble(tree) %>% left_join(metadata, by = "label") %>% `class<-`(c("tbl_tree", "data.frame")) %>% as.treedata write.jtree(tree.unrooted, file = output_jtree) tree.tib <- ape::root(tree, outgroups, edgelabel = T, resolve.root = T) %>% drop.tip(outgroups) %>% as_tibble %>% mutate(support = suppressWarnings(as.numeric(label))) %>% left_join(metadata, by = "label") %>% mutate(is_outgroup = ifelse(label %in% outgroups, T, NA)) %>% `class<-`(c("tbl_tree", "data.frame")) tree.phylo <- as.treedata(tree.tib)@phylo clustalx <- c( A = "BLUE", I = "BLUE", L = "BLUE", M = "BLUE", F = "BLUE", W = "BLUE", V = "BLUE", C = "BLUE", K = "RED", R = "RED", E = "MAGENTA", D = "MAGENTA", N = "GREEN", Q = "GREEN", S = "GREEN", T = "GREEN", G = "ORANGE", P = "YELLOW", H = "CYAN", Y = "CYAN" ) clades <- filter(tree.tib, ! node %in% parent) %>% pull(Clade) %>% as.factor hsp <- hsp_max_parsimony(tree.phylo, as.numeric(clades)) %>% `$`("likelihoods") %>% data.frame(check.names = F) %>% mutate(node = row_number()) %>% gather(hsp, prob, -node) %>% filter(prob > 0.9) %>% mutate(hsp = levels(clades)[as.numeric(hsp)]) tree <- left_join(tree.tib, hsp, by = "node") %>% mutate(hsp.parent = .[match(parent, node),"hsp"]) %>% mutate(hsp = ifelse(!is.na(hsp.parent) & hsp == hsp.parent, NA, hsp)) %>% as.treedata p_small <- ggtree(tree, layout = "circular") + geom_highlight(mapping = aes(subset = !is.na(hsp), fill = hsp), alpha = 0.1) + geom_treescale() + geom_tiplab(mapping = aes(subset = !is.na(Alias), label = Alias, size = Highlight), offset = 0.1) + # geom_tippoint(mapping = aes(subset = !is.na(Activity), color = Activity), size = 1) + scale_size_manual(values = c(2.5, 5)) + new_scale("size") + # labels geom_point2(aes(subset = !is.na(support) & support >= 90, size = support >= 95), color = "darkgray") + scale_size_manual(values = c(0.5, 1)) + # support values new_scale("color") + geom_text2(aes(label = G156, color = G156, angle = angle - 90, x = 4.3), size = 3) + scale_color_manual(values = clustalx) # residues p_big <- ggtree(tree, aes(color = Taxon), layout = "rectangular") + geom_highlight(mapping = aes(subset = !is.na(hsp), fill = hsp), alpha = 0.1) + scale_color_manual(values = taxa) + new_scale("color") + geom_treescale() + geom_tiplab(mapping = aes(label = sprintf("%s [%s]", ifelse(is.na(Alias), label, Alias), Organism)), size = 2, offset = 0.1) + geom_tippoint(mapping = aes(subset = !is.na(Activity), color = Activity), size = 1) + geom_point2(aes(subset = !is.na(support) & support >= 90, size = support >= 95, x = branch), shape = 15, color = "darkgray") + scale_size_manual(values = c(0.5, 1)) + # support values new_scale("color") + geom_text2(aes(label = D85, color = D85, x = 4.8), size = 1.5) + geom_text2(aes(label = T89, color = T89, x = 5.0), size = 1.5) + geom_text2(aes(label = D96, color = D96, x = 5.2), size = 1.5) + geom_text2(aes(label = G156, color = G156, x = 5.4), size = 2) + scale_color_manual(values = clustalx) # residues ggsave(output_file_small, p_small, height = 7, width = 8) ggsave(output_file_big, p_big, height = 6.5, width = 5) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | snakemake@source("plot_functions.R") library(car) library(lme4) library(ggpubr) library(emmeans) library(dunn.test) library(photobiology) with(snakemake@input, { om_rgc_file <<- om_rgc gem_file <<- gem jgi_files <<- jgi }) output_file <- unlist(snakemake@output) read.data <- function(fnames) { lapply(fnames, read.table, header = T, sep = "\t", fill = T, quote = "") %>% lapply(select, "motif", "blasso", "window", "checksum", "family", "wl_mean", "wl_sd") %>% bind_rows } residues <- c("G", "F", "W") data <- bind_rows( read.data(om_rgc_file), read.data(gem_file), read.data(jgi_files) ) %>% my_rhodopsin_class %>% filter(class %in% c("x", "p"), window %in% residues, !is.na(wl_mean), !grepl("X", blasso)) %>% distinct(checksum, .keep_all = T) %>% group_by(blasso) %>% filter(n() > 1) %>% group_by(blasso, window, family, .keep_all = T) %>% summarize(wl_mean = mean(wl_mean), n = n(), .groups = "drop") %>% mutate(window = factor(window, levels = residues)) %>% mutate(family = factor(family)) %>% mutate(color = w_length2rgb(wl_mean)) kruskal <- split(data, f = data$family) %>% lapply(function(x) kruskal.test(wl_mean ~ window, data = x)) %>% lapply(`[`, c("statistic","p.value")) %>% bind_rows(.id = "family") dunn <- split(data, f = data$family) %>% lapply(function(x) with(x, dunn.test(wl_mean, window, method = "bh"))) %>% lapply(`[`, c("Z","P.adjusted","comparisons")) %>% bind_rows(.id = "family") %>% separate(comparisons, into = c("group1", "group2"), sep = " - ", remove = F) %>% mutate(Q.lab = case_when(P.adjusted < 0.001 ~ "***", P.adjusted < 0.01 ~ "**", P.adjusted < 0.05 ~ "*", T ~ "ns")) %>% mutate(y.position = 580 - (group1 == "F" | group2 == "F") * 7) n_fun <- function(x) { data.frame(y = 450, label = sprintf("n = %d", length(x))) } p <- ggplot(data, aes(x = window, y = wl_mean)) + geom_jitter(aes(color = color, size = n)) + geom_boxplot(color = "red", alpha = 0.1) + scale_size_continuous(range = c(0.1, 5), breaks = c(8, 40, 200, 1000, 5000), limits = c(1, 5000)) + scale_colour_identity() + stat_summary(fun.data = n_fun, geom = "text", size = 3) + facet_grid(. ~ family) + ylim(c(450, 580)) + xlab("Fenestration residue") + ylab("Predicted absorption maximum") + theme_bw() + stat_pvalue_manual(data = dunn, label = "Q.lab", xmin = "group1", xmax = "group2", y.position = "y.position", size = 2) ggsave(output_file, p, width = 5.5, height = 2.5) |
38 39 | shell: "cat {input} > {output}" |
46 47 | shell: "cp {input} {output}" |
57 58 | shell: "cat {input} | tr ' ' _ > {output}" |
67 68 | shell: "cp {input} {output}" |
75 76 | shell: "cp {input} {output}" |
85 86 | shell: "seqkit rmdup -s -o {output} {input}" |
97 98 | shell: "seqkit rmdup -s {input} | mafft --thread {threads} --auto - > {output}" |
107 108 | shell: "reformat.pl fas sto {input} {output} -M 50" |
117 118 | shell: "hmmbuild {output} {input}" |
128 129 | shell: "hmmalign --outformat A2M -o {output} {input.hmm} {input.faa}" |
138 139 | shell: "reformat.pl {input} {output}" |
152 153 | shell: "usearch -threads {threads} -usearch_global {input.query} -db {input.target} -id {params.id} -userout {output} -userfields query+target+id -query_cov {params.query_cov} -target_cov 0 -maxaccepts 1000000" |
166 167 | script: "scripts/parse.py" |
181 182 | script: "scripts/plot_gem.R" |
195 196 | script: "scripts/plot_jgi.R" |
220 221 | script: "scripts/plot_om_rgc.R" |
232 233 | script: "scripts/plot_wavelengths.R" |
242 243 | shell: "csvcut -tc OM-RGC_ID,sequence {input} | csvformat -TK1 | seqkit tab2fx | seqkit translate -o {output}" |
252 253 | shell: "csvgrep -tc OG -m {params.cog} {input} | csvformat -T > {output}" |
263 264 | shell: "csvgrep -tc OG -m {wildcards.cog} {input.tsv} | csvcut -c OM-RGC_ID | csvgrep -tc OMRGC_ID -f- {input.abundance} | csvformat -T > {output}" |
272 273 | shell: "cut -f2 {input.tsv} | sed s/OM-RGC_ID/OMRGC_ID/ | grep -f- {input.abundance} > {output}" |
281 282 | shell: "(head -n1 {input.abundance}; cut -f1 -d' ' {input.txt} | grep -v '^#' | grep -f- {input.abundance}) > {output}" |
292 293 | shell: "csvcut -tc OM-RGC_ID,sequence {input.txt} | csvformat -TK1 | seqkit tab2fx | seqkit translate | cat {input.ref} - > {output}" |
300 301 | shell: "cp {input} {output}" |
313 314 | shell: "hmmsearch -E {params.E} -o /dev/null --tblout {output} {input.hmm} {input.fasta}" |
326 327 | shell: "hmmsearch -E {params.E} -o /dev/null --tblout {output} {input.hmm} {input.fasta}" |
336 337 | shell: "seqkit faidx {input}" |
347 348 | shell: "cat {input} > {output}" |
358 359 | shell: "grep -v '^#' {input.txt} | cut -f1 -d' ' | sort -u | xargs seqkit faidx {input.fasta} > {output}" |
370 371 | shell: "cat {input.ingroup} {input.outgroup} | seqkit grep -vf <(cut -f1 {input.ignore}) | mafft --auto - > {output}" |
383 384 | shell: "cat {input.ingroup} {input.outgroup} | seqkit grep -vf <(cut -f1 {input.ignore}) | hmmalign --outformat A2M {input.hmm} - | reformat.pl a3m a2m - {output}" |
395 396 | shell: "trimal -in {input} -out {output} -gt {params.gt}" |
405 406 | shell: "seqkit replace -sp [a-z.] -o {output} {input}" |
422 423 | shell: "iqtree2 -s {input} --prefix {params.prefix} -redo -seed {params.seed} -B 1000 -nt {threads} -pers {params.pers} -nstop {params.nstop}" |
438 439 | script: "scripts/plot_rhodopsins.R" |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/BejaLab/antenna
Name:
antenna
Version:
1
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...