Comparative analysis of the genomes from two Acanthamoeba castellanii strains
Description
This repository contains scripts and documentation for the analysis and comparison of the Acanthamoeba castellanii genomes from strains C3 and Neff. Analyses include annotation statistics, BUSCO and QUAST quality assessment, orthologous gene comparison with related species, Circos plots, sequence divergence between strains, and Hi-C contact profiles at the rDNA sequences.
A frozen copy of this repository, as well as the output data, is available for download in the associated Zenodo record.
Installation
The pipeline is written with Snakemake and manages dependencies using conda. Most pipeline steps run inside self-contained conda environments, which are built automatically upon execution. Two dependencies (MCScanX and dnaglider) are not available through conda and must be installed separately (see the sketch after the dependency list).
Dependencies:
- python3.7+
- snakemake
- pandas
- numpy
- conda
- dnaglider
- MCScanX
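Both MCScanX and dnaglider must be on your `PATH` before running the pipeline. A minimal installation sketch is shown below; the exact module path, repository URLs and build steps are assumptions based on each project's public repository, so check their own documentation (pre-built binaries may also be available on their releases pages):

```bash
# dnaglider is a Go tool; one option is to build it with `go install`
# (assumes Go is installed and that this module path is correct):
go install github.com/cmdoret/dnaglider@latest

# MCScanX is compiled from source with make:
git clone https://github.com/wyp1125/MCScanX.git
cd MCScanX && make

# Make both binaries visible to the pipeline:
export PATH="$PWD:$HOME/go/bin:$PATH"
```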
The input data (genomes, annotations, ...) are downloaded automatically from Zenodo when executing the pipeline.
Usage
The analyses are separated into distinct workflows in the `rules` directory.
The whole analysis pipeline can be run with Snakemake as follows:

```bash
snakemake --use-conda -j4
```
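Standard Snakemake options can be useful before committing to a full run; for example, a dry run lists the jobs without executing them, and the rule graph can be rendered with graphviz:

```bash
# Preview the jobs that would run, without executing anything
snakemake --use-conda -j4 -n

# Render the rule dependency graph (requires graphviz's `dot`)
snakemake --rulegraph | dot -Tsvg > rulegraph.svg
```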
Structure
The master script `Snakefile` calls each workflow one after the other. Each workflow contains rules with input and output files, which execute code or external scripts. Each rule runs in its own conda environment, whose dependencies are downloaded on first execution. The overall workflow can be represented as a graph of rules connected by their input/output dependencies.
The `envs` directory contains conda environment build specifications for the different rules. General parameters for the pipeline are stored in the `config.yaml` file and can be modified. The strains to analyze, as well as the paths to their sequence files, are defined in `samples.tsv`. All external scripts executed by rules are stored in the `scripts` folder. Custom Python utility libraries imported in the pipeline are stored in `src`.
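For orientation only, `samples.tsv` maps each strain to its sequence files. The layout below is a hypothetical sketch (the column names and paths are assumptions), so refer to the `samples.tsv` shipped with the repository for the actual format:

```bash
# Hypothetical layout -- check the repository's samples.tsv for the real column names
cat samples.tsv
# strain    genome
# Neff      data/genomes/Neff_genome.fa
# C3        data/genomes/C3_genome.fa
```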
The `doc` directory contains Jupyter notebooks with general analyses of the pipeline results.
Code Snippets
```
script: '../scripts/00_fetch_proteomes.py'
```
```
shell: 'cd-hit -i {input} -o {output} -c {params.sim} -M 0'
```
```
shell:
    """
    zenodo_get -d https://doi.org/10.5281/zenodo.5507417 -w {output.url_tbl}
    wget $(grep "shared_assets" {output.url_tbl}) -O - \
        | tar xzvf - --directory={params.in_dir} >/dev/null
    """
```
```
shell:
    """
    st={wildcards.strain}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.genome} > {output.genome}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.cds} > {output.cds}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.proteome} > {output.proteome}
    sed 's/^\([^#]\)/'"$st"'_\\1/' {input.annotations} \
        | sed 's/ID=/ID='"$st"'_/' \
        | sed 's/Parent=/Parent='"$st"'_/' \
        > {output.annotations}
    """
```
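To make the header renaming explicit, here is a toy run of the same `sed` expression outside Snakemake (the input record is made up; note the single backslashes, since no Snakemake string escaping applies here):

```bash
st=C3
printf '>scaffold_1 some description\nACGT\n' \
    | sed 's/^>\(.*\)/>'"$st"'_\1/'
# Output:
# >C3_scaffold_1 some description
# ACGT
```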
```
shell:
    """
    cp {input.genome} {output.genome}
    cp {input.annot} {output.annot}
    """
```
```
shell: "Rscript scripts/01_annot_stats.R {input.v1} {input.neff} {input.c3} {output.tbl} {output.plt}"
```
```
shell:
    """
    assembly-stats -t {input.neff_v2} {input.neff_v1} > {params.assembly_tbl}
    Rscript scripts/01_assembly_stats.R {params.assembly_tbl} {output}
    """
```
```
shell: 'quast -t {threads} -e -o {output} {input.neff_v2} {input.ref_fa}'
```
```
shell:
    """
    busco -f --long --mode genome -c {threads} --lineage eukaryota -i {input} -o $(basename {output})
    mv "$(basename {output})" "$(dirname {output})/"
    """
```
```
shell:
    """
    sed 's/^[\t ]*//' {input}/run_eukaryota_odb10/short_summary.txt \
        | grep -v '^[#\*]' \
        | sed '/^$/d' \
        | sed 's/[\t ]*$//' \
        | grep -v '\%' \
        > {output}
    """
```
```
script: '../scripts/01_plot_busco.py'
```
```
shell:
    """
    awk -vOFS='\t' '
        /^>/ {{if (seqlen){{print seqlen}}; printf "%s\t",substr($0, 2); seqlen=0; next;}}
        {{seqlen += length($0)}}
        END{{print seqlen}}
    ' {params.genome} > {output}
    """
```
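The same `awk` length-counting idiom can be checked on a toy FASTA (input invented for illustration); each header is printed alongside the total length of its sequence, and outside a Snakemake string the braces are single rather than doubled:

```bash
printf '>chr1\nACGT\nAC\n>chr2\nAAAA\n' \
    | awk -vOFS='\t' '
        /^>/ {if (seqlen){print seqlen}; printf "%s\t",substr($0, 2); seqlen=0; next}
        {seqlen += length($0)}
        END{print seqlen}'
# chr1    6
# chr2    4
```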
```
shell:
    """
    grep -v "^#" {params.rdna} | awk -vOFS='\t' '{{print $1,$4,$5,$9}}' > {output}
    awk -vOFS='\t' '$5 == 1 {{print $1,$2,$3,"HGT"}}' {input.hgt} >> {output}
    awk -vOFS='\t' '{{print $1,$2,$3,"virus"}}' {input.vir} >> {output}
    """
```
```
script: "../scripts/01_plot_karyo.py"
```
```
shell: "seqkit grep -r -f {params.hgt_ids} {input} | seqkit translate > {output}"
```
```
script: '../scripts/02_get_v1_hgt_gff.py'
```
```
shell:
    """
    dnaglider -threads {threads} \
        -fields "GC,GCSKEW,ATSKEW,ENTRO,KMER" \
        -kmers 2,3,4 \
        -window {params.win} \
        -stride {params.step} \
        -fasta {input} \
        -out {output}
    """
```
```
shell:
    """
    makeblastdb -dbtype prot -in {params.v2_prots} -title neff -out {params.v2_db}
    blastp -evalue 1e-30 \
        -outfmt "6 qseqid sseqid pident evalue" \
        -db {params.v2_db} \
        -num_threads {threads} \
        -query {input.v1_hgt} > {output.blast}
    """
```
```
run:
    df = pd.read_csv(
        input[0],
        sep='\t',
        header=None,
        names=['query', 'target', 'pident', 'evalue']
    )
    # For each query HGT, retain the match with the highest sequence identity
    keep_idx = df.groupby('query')['pident'].transform(max) == df['pident']
    df = df[keep_idx]
    # Keep only matches above the identity threshold defined in params
    df.loc[df.pident > params['pident']].to_csv(output[0], sep='\t', index=False)
```
```
run:
    blast = pd.read_csv(input['blast'], sep='\t')
    annot = pd.read_csv(input['annot'], sep='\t')
    strain = wildcards['strain']
    annot = annot.loc[annot.ID.str.startswith(strain), :]
    annot.ID = annot.ID.str.replace(f"{strain}" + r"_([a-zA-Z0-9_]*).*", r"\1")
    annot.chrom = annot.chrom.str.replace(f"{strain}_", "")
    blast['hgt'] = 1
    blast.target = blast.target.str.replace(r'([a-zA-Z0-9_]*).*', r'\1')
    annot = blast.merge(annot, how='right', left_on='target', right_on='ID')
    annot.hgt = annot.hgt.fillna(0).astype(int)
    annot = annot.loc[:, ['chrom', 'start', 'end', 'ID', 'hgt', 'n_exon']]
    annot.to_csv(output[0], sep='\t', header=None, index=False)
```
```
shell:
    """
    echo -e "chrom\tstart\tend\tgeneID\tHGT\tNEXON\tGC\tGCSKEW\tATSKEW\tENTRO\t2MER\t3MER\t4MER" > {output}
    bedtools intersect -a {input.win} -b {input.hgt} -wb \
        | sort --parallel={threads} -k11,11 -k12,12n \
        | bedtools groupby -g 11,12,13,14,15,16 -c 4,5,6,7,8,9,10 -o mean \
        | sort --parallel={threads} -k1,1 -k2,2n >> {output}
    """
```
```
shell:
    """
    python scripts/03_rdna_hic.py \
        {input.cool}::/resolutions/{params.res} \
        {input.rdna} \
        {output}
    """
```
```
shell:
    """
    echo -n "" > {output.prot}
    for i in {input.prot}; do
        cat $i >> {output.prot}
    done

    echo -n "" > {output.genome}
    for i in {input.genome}; do
        cat $i >> {output.genome}
    done

    echo -n "" > {output.annot}
    for i in {input.annot}; do
        cat $i >> {output.annot}
    done
    """
```
```
shell:
    """
    bash scripts/04_gen_mcscanx_input.sh -g {input.annot} \
        -o {params.out_dir} \
        -f {input.prot} \
        -c {threads}
    MCScanX -s 3 {params.out_dir}/MCScanX_in
    """
```
```
shell:
    """
    cp {params.cfg} {output.cfg}
    bash scripts/04_gen_circos_files.sh {input.ref} {params.mcsx_prefix} {params.dir}
    """
```
```
run:
    karyo = pd.read_csv(
        join(params['circos_dir'], 'karyotype.txt'), sep=' ', header=None
    )
    chroms = list(karyo.iloc[:, 2])
    links = pd.read_csv(
        input[0], sep=' ', header=None,
        names=['c1', 's1', 'e1', 'c2', 's2', 'e2', 'col']
    )
    filtered = links.loc[
        (np.isin(links.c1, chroms)) & (np.isin(links.c2, chroms)), :
    ]
    filtered.to_csv(output[0], sep=' ', header=None, index=False)
```
```
shell:
    """
    bundlelinks -strict -min_bundle_size 1000 -max_gap 50000 \
        -links {input.mcsx} | sed 's/lgrey=$//' > {params.circos_dir}/bundles.txt
    circos -conf {input.cfg} -outputfile {output}
    """
```
```
shell:
    """
    minimap2 -c {input.neff} {input.c3} -x map-ont \
        > {output.paf}
    python ./scripts/04_compute_seq_divergence.py \
        {output.paf} \
        {output.div}
    """
```
```
script: "../scripts/04_plot_seq_divergence.py"
```
```
shell: "genomepy install -g '{params.genomedir}/{wildcards.virus}' {params.assembly}"
```
```
shell: "minimap2 -xasm20 -t {threads} {input.amoeba} {input.virus}/*/*fa > {output}"
```
```
shell:
    """
    # {{}} is fd's placeholder for the matched file path, and {{/.}} expands
    # to its basename without extension (here, the virus name)
    fd ".*paf" {params.pafdir} -x awk -vOFS='\t' -vvir={{/.}} '{{print $6,$8,$9,vir,$10/$11}}' {{}} > {output}
    """
```
```
run:
    import matplotlib.pyplot as plt
    import seaborn as sns
    import matplotlib as mpl
    mpl.use("Agg")

    vir = pd.read_csv(
        input[0], sep="\t", names=["chrom", "start", "end", "virus", "ident"]
    )
    # Amount of each virus inserted
    vir["seqlen"] = vir.end - vir.start
    sns.scatterplot(data=vir, x="seqlen", y="ident", hue="virus")
    plt.savefig(output[0])
```
```
shell:
    """
    sort -k1,1 -k2,2n {input} \
        | bedtools merge -o mean -c 5 -i - -d {params.neigh_dist} \
        > {output}
    """
```
```
shell:
    """
    # Discard coordinates with chromosomes absent from the cool file
    # and get the bed file into 2d coordinates
    cut -f1-3 {input} \
        | grep -f <(cooler dump -t chroms {params.cool} | awk '{{print $1"\t"}}') \
        | awk -vOFS='\t' '{{print $0,$1,$2+10000,$3+10000}}' \
        | coolpup.py {params.cool} \
            - \
            --nshifts 10 \
            --mindist 0 \
            --minshift 100000 \
            --outname {output.tbl} \
            --log WARNING
    plotpup.py {output.tbl} \
        --col_names {wildcards.strain} \
        --enrichment 0 \
        --vmin 1 \
        --output {output.fig}
    """
```
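The `awk` step above turns each 1D BED interval into the paired 2D coordinates that `coolpup.py` expects, with the second anchor shifted by one 10 kb bin. A toy invocation (made-up interval) shows the transformation:

```bash
printf 'chr1\t20000\t30000\n' \
    | awk -vOFS='\t' '{print $0,$1,$2+10000,$3+10000}'
# chr1  20000  30000  chr1  30000  40000
```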
```
shell:
    """
    # Convert the bed file into 2d coordinates.
    cut -f1-3 {input} | awk -vOFS='\t' '{{print $0,$0}}' > {params.tpos}
    chromosight quantify --pattern=borders \
        --threads {threads} \
        --perc-zero=60 \
        --perc-undetected=60 \
        --win-size=31 \
        {params.tpos} \
        {params.cool} \
        {params.base}
    """
```
```
shell:
    """
    # Convert fasta to tabular, sort by seqname, keep the longest seq
    # and convert back to fasta. The length sort is in reverse numeric
    # order so that uniq keeps the longest isoform of each gene.
    sed 's/>[^ ]* \(.*\)$/>\\1\t/' {input} \
        | tr -d '\\n' \
        | sed 's/>/\\n>/g' \
        | sed '/^$/d' \
        | awk -vFS='\t' -vOFS='\t' '{{print $2,length($2),$1}}' \
        | sort -k3,3 -k2,2nr \
        | uniq -f2 \
        | awk '{{print $3,$1}}' \
        | sed 's/ /\\n/' \
        > {output}
    """
```
```
shell:
    """
    ulimit -n 10000
    # Move all organisms' proteomes to the orthofinder workdir
    mkdir -p {params.orthofinder_dir}
    rm -f {params.orthofinder_dir}/*
    cp {input.org_proteomes} {params.orthofinder_dir}

    # Move Ac proteomes as well, but trim "proteome" from the filename
    for strain in {input.ac_proteomes}; do
        fname=$(basename $strain)
        new_fname="${{fname/_proteome_filtered/}}"
        cp $strain "{params.orthofinder_dir}/$new_fname"
    done
    #cp "{{input.hgt_neff_v1}}" "{params.orthofinder_dir}/"
    orthofinder -f {params.orthofinder_dir} \
        -o {output} \
        -S diamond \
        -t {threads} \
        -n "amoeba"
    """
```
```
shell:
    """
    cat {input}/Results_amoeba/Species_Tree/SpeciesTree_rooted.txt > {output.tree}
    python scripts/06_orthocount_to_fasta.py \
        {input}/Results_amoeba/Orthogroups/Orthogroups.GeneCount.tsv \
        {output.matrix}
    """
```
```
shell:
    """
    echo "_seqFile {input.matrix}" > {params.conf_file}
    # Tree seems to have an incompatible format with CLI GLOOME but works in the web version...
    #echo "_treeFile {input.tree}" >> {params.conf_file}
    echo "_outDir {output}" >> {params.conf_file}
    ./bin/gainLoss.VR01.266 {params.conf_file}
    """
```
```
run:
    ortho_dir = join(input[0], "Results_amoeba", "Orthogroups")
    ortho = pd.read_csv(join(ortho_dir, 'Orthogroups.tsv'), sep='\t')
    unassigned = pd.read_csv(
        join(ortho_dir, 'Orthogroups_UnassignedGenes.tsv'), sep='\t'
    )
    # Add unassigned genes as self-contained orthogroups
    ortho = pd.concat([ortho, unassigned], axis=0)
    # Get gene families absent from all A. castellanii strains
    get_absent = lambda b: b.isnull().values
    st_abs = {strain: get_absent(ortho[strain]) for strain in params['focus_st']}
    ac_abs = np.bitwise_and.reduce([*st_abs.values()])
    # Convert count to presence/absence for all species
    pres_mat = ortho.drop(columns='Orthogroup').apply(lambda c: ~c.isnull(), axis=0)
    pres_mat.index = ortho.Orthogroup
    # Save presence matrix for further analysis (encode true/false as 0/1)
    pres_mat.astype(int).to_csv(output['pres_mat'], index=True, header=True, sep='\t')
    # Exclude strains of interest when counting
    amoeba_df = ortho.loc[:, list(params['amoeba'])]
    #bact_df = ortho.loc[:, list(params['bact'])]
    # Binarize variables: is there any gene for species c in each orthogroup
    any_pres = lambda df: df.apply(lambda c: ~c.isnull(), axis=0).apply(np.any, axis=1)
    #bact_pres = any_pres(bact_df)
    #bact_abs = ~bact_pres
    amoeba_pres = any_pres(amoeba_df)
    amoeba_abs = ~amoeba_pres
    # Record genes found only in amoeba but not bacteria
    # (Note: hgt events inferred in Clarke et al. also included for comparison)
    vdf = pd.DataFrame({
        'neff': ~st_abs['Neff'],
        #'NEFF_v1_hgt_cds': ~get_absent(ortho.NEFF_v1_hgt_cds),
        'c3': ~st_abs['C3'],
        'ac': ~ac_abs,
        #'bact': bact_pres,
        'amoeba': amoeba_pres,
    })
    vdf.index = ortho.Orthogroup
    vdf.astype(int).to_csv(output['pres_compact'], sep='\t', index=True, header=True)
```
```
script: '../scripts/06_plot_orthogroups_venn.py'
```
```python
import time
import urllib
from Bio import Entrez


def safe_request(fun):
    """
    Wraps a function requesting data to allow safe errors and retry.

    Parameters
    ----------
    fun : python function
        The python function that queries a server

    Returns
    -------
    wrapped_f : python function
        A wrapped version of the input function which will call itself
        recursively every 5 seconds if the server is overloaded and
        will return None if the query record does not exist
    """

    def wrapped_f(*args, **kwargs):
        try:
            a = fun(*args, **kwargs)
            return a
        except urllib.error.HTTPError as e:
            if e.code in (429, 502):
                time.sleep(5)
                print("Sending too many requests, sleeping 5sec and retrying...")
                return wrapped_f(*args, **kwargs)
            else:
                print("No sequence found for %s" % args[0])
                return None

    return wrapped_f


@safe_request
def name_to_proteins(name, db="protein", email="someone@email.com", filters=""):
    """
    Given an organism name, fetch all available protein sequences.

    Parameters
    ----------
    name : str
        Name of an organism or group of organisms (e.g. Legionella).
    db : str
        Entrez db name to use for the search.
    email : str
        User email for identification.
    filters : str
        Additional filters to use for the query. E.g. " AND refseq[filter]"
        to filter results from refseq.

    Returns
    -------
    seq_record : TextIOWrapper
        Iterable object containing the query Fasta records.
    """
    Entrez.email = email
    # Append the optional filters to the query
    query_string = name + "[Orgn]" + filters
    time.sleep(0.1)
    # First search to see the number of hits (returns values for 10 by default)
    query = Entrez.read(
        Entrez.esearch(term=query_string, db=db, email=email, retmax=10 ** 9)
    )
    seqs = ""
    s = 0
    chunk_size = 1000
    # Fetch proteins in batches of 1000 to avoid the maximum number of queries
    for e in range(chunk_size, len(query["IdList"]), chunk_size):
        print(f"fetching entries {s} to {e}")
        fetched = False
        while not fetched:
            try:
                seqs += Entrez.efetch(
                    id=query["IdList"][s:e], rettype="fasta", db=db
                ).read()
                fetched = True
            except:
                print("Fetching failed, trying again.")
                time.sleep(5)
        s = e
        time.sleep(0.1)
    # Fetch the last (incomplete) batch
    e = len(query["IdList"])
    print(f"fetching entries {s} to {e}")
    seqs += Entrez.efetch(id=query["IdList"][s:e], rettype="fasta", db=db).read()
    return seqs


org_df = snakemake.params["org"]
organism = org_df.loc[
    org_df["clean_name"] == f"{snakemake.wildcards.organism}", "name"
]
time.sleep(0.1)  # Do not spam NCBI :)
proteome = name_to_proteins(
    organism, email="demo@email.com", filters=" AND refseq[filter]"
)
try:
    print(f"Writing {proteome.count('>')} proteins for {organism}.")
    with open(snakemake.output[0], "w") as outf:
        outf.write(proteome)
except TypeError:
    print(f"No proteome found for {organism[0]}")
```
```r
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
library(gridExtra)

args <- commandArgs(trailingOnly=TRUE)

### GET ARGS ###
v1_gff <- args[1]
neff_gff <- args[2]
c3_gff <- args[3]
out_tbl <- args[4]
out_plot <- args[5]

### LOAD DATA ###
load_gff <- function(in_gff, name){
    gff <- read_tsv(in_gff, comment="#", col_names=F)
    colnames(gff) <- c("chrom", "source", "type", "start", "end",
                       "score", "strand", "phase", "attributes")
    gff <- gff %>%
        filter(type %in% c("mRNA", "CDS", "gene", "exon"))
    ### Clean data
    gff <- gff %>%
        mutate(attributes=gsub("Parent=", "ID=", attributes)) %>%
        mutate(ID=str_extract(attributes, 'ID=[^;\\.]*')) %>%
        mutate(ID=gsub("=[a-zA-Z]{1,}:", "=", ID)) %>%
        mutate(ID=sapply(ID, function(x){str_split(x, regex('[=-]'))[[1]][2]})) %>%
        mutate(name=name)
    return(gff)
}

gff <- load_gff(v1_gff, "NEFF_v1") %>%
    bind_rows(load_gff(neff_gff, "Neff")) %>%
    bind_rows(load_gff(c3_gff, "C3"))

### COMPUTE STATS ###
gff_gene_len <- gff %>%
    filter(type == 'gene') %>%
    mutate(gene_len = end - start)

# N exon / gene (drop duplicated exons)
gff_exons <- gff %>%
    filter(type != 'gene') %>%
    group_by(ID) %>%
    distinct(chrom, start, end, type, .keep_all=T) %>%
    mutate(n_exon=sum(type == 'exon')) %>%
    filter(type == 'mRNA') %>%
    mutate(mrna_len = end - start)

# Combine informations and reorder columns
gff_stats <- gff_gene_len %>%
    select(ID, gene_len) %>%
    inner_join(gff_exons, by='ID') %>%
    select(ID, chrom, start, end, type, strand, n_exon,
           gene_len, mrna_len, attributes)

### VISUALISE ###
# Label generator to get number of samples
n_obs <- function(x){return(data.frame(y=0, label=paste0("n=", length(x))))}

feature_len <- ggplot(
        data=gff %>% filter(type %in% c("CDS", "gene")),
        aes(x=name, y=log10(end - start), fill=name)
    ) +
    geom_violin() +
    stat_summary(fun.data = n_obs, geom = "text") +
    theme_minimal() +
    xlab("") +
    ylab("log10 length") +
    ggtitle("Gene length") +
    facet_wrap(.~type) +
    scale_y_continuous(limits = c(0, 6)) +
    geom_boxplot(width=0.1)

exon_quantiles <- gff_exons %>%
    group_by(name) %>%
    summarise(x=list(enframe(quantile(n_exon, probs=c(0.25, 0.5, 0.75)),
                             "quantiles", "exons"))) %>%
    unnest(x) %>%
    spread(key=quantiles, value=exons)

# Assign a quantile label to each exon number based on its strain's
# exon number distribution
exons_tbl <- gff_exons %>%
    inner_join(exon_quantiles, by="name") %>%
    mutate(
        n_exon_quant = case_when(
            n_exon <= `25%` ~ "25% quantile",
            n_exon >= `75%` ~ "75% quantile",
            TRUE ~ "Median"
        )
    ) %>%
    select(name, n_exon, n_exon_quant)

exons_tbl <- exons_tbl %>%
    mutate(
        n_exon_quant = factor(
            n_exon_quant,
            ordered=T,
            levels=c("25% quantile", "Median", "75% quantile")
        )
    )

exon_per_gene <- ggplot(data=exons_tbl, aes(x=n_exon, fill=n_exon_quant)) +
    scale_fill_manual(values=c("#6666ee", "#999999", "#ee6666")) +
    geom_histogram(binwidth=1) +
    theme_minimal() +
    stat_summary(aes(x = 0.1, y = n_exon, xintercept = stat(y), group = name),
                 fun.y = median, geom = "vline") +
    ylab("Genes") +
    xlab("Exons per genes") +
    facet_grid(name~.) +
    scale_x_continuous(limits = c(-0, 100))

### SAVE OUTPUT ###
svg(out_plot, height=10, width=12)
grid.arrange(feature_len, exon_per_gene)
dev.off()

write_tsv(gff_stats, out_tbl)
```
```r
library(ggplot2)
library(patchwork)
library(readr)
library(tidyr)
library(dplyr)
library(tools)
library(stringr)

args <- commandArgs(trailingOnly=T)

# No need to scale, now
#library(scales)
#minscale <- function(x){s = scale(x); return(s + abs(min(s)))}
#mutate_at(vars(-assembly), minscale) %>%

asb <- read_tsv(args[1]) %>%
    rename(n_contigs=number, assembly=filename) %>%
    mutate(assembly=file_path_sans_ext(basename(assembly))) %>%
    select(assembly, total_length, n_contigs, N50, N70, N90, N50n, N70n, N90n) %>%
    pivot_longer(-assembly, 'metric') %>%
    mutate(assembly = gsub("_genome", "", assembly))

svg(args[2])

len_asb <- ggplot(
        data=asb %>% filter(metric == 'total_length'),
        aes(x=assembly, y=value, fill=assembly)
    ) +
    geom_bar(stat='identity') +
    theme_minimal() +
    ylab("Length") +
    xlab("") +
    ggtitle("Assembly length") +
    theme(legend.position="none")

num_tigs <- ggplot(
        data=asb %>% filter(metric == 'n_contigs'),
        aes(x=assembly, y=value, fill=assembly)
    ) +
    geom_bar(stat='identity') +
    theme_minimal() +
    ylab("# contigs") +
    xlab("") +
    ggtitle("Total number of contigs") +
    theme(legend.position="none")

Nxx <- ggplot(
        data=asb %>% filter(str_detect(metric, '0$')),
        aes(x=assembly, y=value, fill=assembly)
    ) +
    geom_bar(stat='identity') +
    facet_wrap(~metric) +
    theme_minimal() +
    xlab("") +
    ylab("Length") +
    theme(legend.position="none")

Nxxn <- ggplot(
        data=asb %>% filter(str_detect(metric, 'n$')),
        aes(x=assembly, y=value, fill=assembly)
    ) +
    geom_bar(stat='identity') +
    xlab("") +
    ylab("# contigs") +
    theme_minimal() +
    facet_wrap(~metric)

(len_asb + num_tigs + plot_spacer()) / (Nxx) / (Nxxn)

dev.off()
```
```python
from collections import OrderedDict
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt


def read_busco(tbl: str, name: str) -> pd.DataFrame:
    """
    Read a busco summary table into a dataframe and create relevant columns
    """
    df = pd.read_csv(tbl, names=["number", "type"], sep="\t", header=None)
    df["strain"] = name
    # Single letter busco class as a column
    df["busco"] = df["type"].str.extract(r".*\(([A-Z])\).*")
    df.loc[pd.isnull(df.busco), "busco"] = "T"
    # Transform number of buscos to percentages
    df["perc"] = 100 * np.round(
        df["number"] / df.number[df.busco == "T"].values[0], decimals=2
    )
    return df


# Concatenate busco tables from all strains and set relevant
# variables as indices
busco = (
    pd.concat([read_busco(tbl, name) for name, tbl in snakemake.input.items()])
    .reset_index(drop=True)
    .sort_values("strain", ascending=False)
    .set_index(["busco", "strain"])
)
mpl.use("Agg")
# Keep track of the order in which strains will be plotted
str_to_num = OrderedDict({"v1": 0, "neff": 1, "c3": 2})
# Map each BUSCO class to a color
col_dict = OrderedDict(
    [("M", "#F04441"), ("F", "#F0E441"), ("D", "#3592C6"), ("S", "#58B4E8")]
)
# We plot one busco class per iteration
for i, t in enumerate(col_dict.keys()):
    # Subset busco table for the current class
    busco_type = busco.loc[t]
    # Retrieve x index from strains (deterministic order)
    r = [str_to_num[s] for s in busco_type.index.values]
    # Stacked barplot, uses cumulative height of previous bars
    if i:
        plt.barh(r, busco_type.perc.values, color=col_dict[t], left=cum_height)
        cum_height += busco_type.perc.values
    else:
        cum_height = busco_type.perc.values
        plt.barh(r, busco_type.perc.values, color=col_dict[t])


def busco_string(strain):
    """Use the BUSCO summary table to construct the standard busco string"""
    mis = busco.loc[("M", strain), "number"]
    fra = busco.loc[("F", strain), "number"]
    com = busco.loc[("C", strain), "number"]
    sin = busco.loc[("S", strain), "number"]
    dup = busco.loc[("D", strain), "number"]
    tot = busco.loc[("T", strain), "number"]
    return f"{strain} - C:{com}[S:{sin}, D:{dup}], F:{fra}, M:{mis}, n={tot}"


# Write the strain name and busco string beside each bar
plt.yticks(r, [busco_string(s) for s in str_to_num.keys()])
plt.xlabel("% BUSCOs")
plt.ylabel("Strain")
plt.savefig(str(snakemake.output))
```
```python
import re
from collections import defaultdict
import csv
from Bio import SeqIO
from Bio.Graphics import BasicChromosome
from Bio.SeqFeature import SeqFeature, FeatureLocation
from reportlab.lib.units import cm
from matplotlib import cm as mcm
from matplotlib.colors import to_hex

logh = open(snakemake.log[0], "w")

# {chrom: length}, keeping only scaffolds above 100kb
entries = {}
for rec in SeqIO.parse(snakemake.params["genome"], "fasta"):
    if len(rec.seq) > 100000:
        entries[rec.id] = len(rec.seq)

# One color per feature type, drawn from the tab10 palette
cols = [to_hex(mcm.tab10(c)) for c in range(len(entries))]

# {chrom: [SeqFeature]}, each feature carrying a color qualifier
features = defaultdict(list)
type_id = 0
ftype_to_col = {}
with open(snakemake.input["features"]) as bedh:
    bed = csv.DictReader(
        bedh, delimiter="\t", fieldnames=["chrom", "start", "end", "type"]
    )
    for feat in bed:
        start, end = int(feat["start"]), int(feat["end"])
        if start > end:
            start, end = end, start
        # Skip features beyond chromosome ends or on excluded chromosomes
        try:
            if end > entries[feat["chrom"]]:
                continue
        except KeyError:
            continue
        if feat["type"] not in ftype_to_col.keys():
            ftype_to_col[feat["type"]] = cols[type_id]
            logh.write(f"{feat['type']}: {cols[type_id]}\n")
            type_id += 1
        features[feat["chrom"]].append(
            SeqFeature(
                location=FeatureLocation(int(feat["start"]), int(feat["end"])),
                type=feat["type"],
                qualifiers={"color": [ftype_to_col[feat["type"]]]},
            )
        )

max_len = max(entries.values())
telomere_length = 10000  # For illustration

chr_diagram = BasicChromosome.Organism(output_format="svg")
chr_diagram.page_size = (29.7 * cm, 21 * cm)  # A4 landscape

for name, length in entries.items():
    chrom_feat = features[name]
    chrom_num = re.sub(r"^.*_([0-9]+)$", r"\1", name)
    cur_chromosome = BasicChromosome.Chromosome(chrom_num)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp,
    # so the same scale is used on all chromosomes and they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length
    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)
    # Add a body - again using bp as the scale length here
    body = BasicChromosome.AnnotatedChromosomeSegment(length, chrom_feat)
    body.scale = length
    cur_chromosome.add(body)
    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)
    # This chromosome is done
    chr_diagram.add(cur_chromosome)

chr_diagram.draw(snakemake.output[0], snakemake.params["title"])
logh.close()
```
```python
import re

hgt_ids = {i.strip() for i in open(snakemake.input[0])}
outf = open(snakemake.output[0], "w")
with open(snakemake.params["gff"], "r") as gff:
    hgt = False
    n_exons = 0
    for line in gff:
        fields = line.split("\t")
        if re.match("^#", line):
            continue
        elif fields[2] == "gene":
            # Output columns:
            # ID chrom start end type strand n_exon gene_len mrna_len attributes
            # Write the previous gene once all its exons have been counted
            if hgt:
                outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
            gene_id = re.search("gene:([^;]+);", fields[8]).group(1)
            if gene_id in hgt_ids:
                hgt = True
                n_exons = 0
                curr_hgt = line
            else:
                hgt = False
        elif fields[2] == "exon":
            n_exons += 1
    # Flush the last gene if it was an HGT candidate
    if hgt:
        outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
outf.close()
```
```python
import sys
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import patches as mpatches
import matplotlib.gridspec as gridspec
import cooler

## LOAD CLI ARGS
cool_path = sys.argv[1]
rdna_path = sys.argv[2]
out_path = sys.argv[3]

# Discard contigs smaller than this size
# Note: assumes chromosomes are sorted by size in the cool file
MIN_SIZE = 100000

## LOAD INPUT FILES
c = cooler.Cooler(cool_path)
gff = pd.read_csv(rdna_path, sep="\t", comment="#", usecols=range(9))
gff.columns = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]

## EXTRACT COOL DATA
chroms = c.chroms()[:]
# Discard all chromosomes after the first one shorter than MIN_SIZE
chroms["include"] = chroms.length > MIN_SIZE
stop_bin = c.extent(chroms.loc[np.where(~chroms.include)[0][0], "name"])[1]
bins = c.bins()
# Remove chromosomes that are absent from the cool file or too small
gff = gff.loc[np.isin(gff.seqname, chroms.loc[chroms.include, "name"]), :]

## TRANSFORM
# Transform rDNA coordinates to bin indices
gff["minbin"] = gff.apply(
    lambda r: np.min(bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values),
    axis=1,
)
gff["maxbin"] = gff.apply(
    lambda r: np.max(bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values),
    axis=1,
)
gff = gff.drop_duplicates(subset=["attribute", "minbin", "maxbin"])

## VISUALIZE
mat = c.matrix(balance=False)[:stop_bin, :stop_bin]

# Plot heatmap with rDNA overlay
plt.style.use("seaborn-white")
gs = gridspec.GridSpec(4, 1, height_ratios=[5, 1, 1, 1])
mpl.rcParams["figure.figsize"] = (5, 10)
fig = plt.figure()
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])
ax3 = fig.add_subplot(gs[2])
ax4 = fig.add_subplot(gs[3])
fig.align_xlabels()
ax2.margins(x=0)
ax3.margins(x=0)
ax4.margins(x=0)
# Map each rRNA subunit to a color, linestyle, label and axis
coldict = {
    "18s_rRNA": ("b", "-.", "18S", ax3),
    "28s_rRNA": ("g", ":", "28S", ax2),
    "8s_rRNA": ("r", "--", "5S", ax4),
}
legend_entries = [
    mpatches.Patch(color=v[0], label=v[2], ls=v[1]) for k, v in coldict.items()
]
plt.legend(handles=legend_entries, loc="upper right")
ax1.imshow(np.log(mat), cmap="Reds", interpolation="none", rasterized=True)
# Add vertical lines to line plots at rDNA positions
for r in gff.iterrows():
    coldict[r[1]["attribute"]][3].axvline(
        x=r[1]["maxbin"], alpha=0.4, lw=0.3, c=coldict[r[1]["attribute"]][0],
    )
# Add chromosome grid
for r in chroms.iterrows():
    if r[1].include:
        ax2.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax3.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax4.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
# Subset bins containing rRNA
rdna_bins = {
    sub: np.unique(gff.maxbin[gff.attribute == sub])
    for sub in np.unique(gff.attribute)
}
# Compute the mean contact signal per bin for each subunit
rdna_sig = {}
rdna_sig["28S"] = mat[rdna_bins["28s_rRNA"], :].mean(axis=0)
rdna_sig["18S"] = mat[rdna_bins["18s_rRNA"], :].mean(axis=0)
rdna_sig["5S"] = mat[rdna_bins["8s_rRNA"], :].mean(axis=0)
ax2.plot(range(mat.shape[0]), np.log10(rdna_sig["28S"]), label="28S", lw=0.5, c="g")
ax3.plot(range(mat.shape[0]), np.log10(rdna_sig["18S"]), label="18S", lw=0.5, c="b")
ax4.plot(range(mat.shape[0]), np.log10(rdna_sig["5S"]), label="5S", lw=0.5, c="r")
fig.savefig(out_path, dpi=2500)
```
```python
import sys
import re


def get_cigar_gaps(cigar):
    """Return the total number of gapped bases (indels) in a CIGAR string."""
    # Match all indels and capture the indel size
    gap_regex = re.compile(r"([0-9]+)[ID]")
    total_gap_size = 0
    for gap_size in re.findall(gap_regex, cigar):
        total_gap_size += int(gap_size)
    return total_gap_size


with open(sys.argv[1]) as paf, open(sys.argv[2], "w") as outf:
    for line in paf:
        align = line.split("\t")
        if align[16] != "tp:A:P":
            # Only consider primary alignments
            continue
        cigar = align[-1]
        n_gaps = get_cigar_gaps(cigar)
        n_ambiguous = int(align[15].lstrip("nn:i:"))
        n_bad = int(align[12].lstrip("NM:i:"))
        n_matches = int(align[9])
        # Mismatches = edit distance minus ambiguous bases and gaps
        n_mismatches = n_bad - (n_ambiguous + n_gaps)
        div = n_mismatches / (n_matches + n_mismatches)
        outf.write(f"{align[5]}\t{align[7]}\t{align[8]}\t{div}\n")
```
```bash
REF="$1"
# MCScanX output prefix
MCSX="$2"
CIR_DIR=$3
CAND="$4"

# Only include scaffolds larger than 100kb
len_cutoff=100000

# Karyotype
awk '/^>/ {if (seqlen){print seqlen}; printf "%s ",$0;seqlen=0;next;}
     {seqlen += length($0)}
     END{print seqlen}' $REF |
    awk -v minlen=$len_cutoff '$2 >= minlen {print "chr","-",$1,$1,"0",$2,"Neff"}' |
    tr -d '>' \
    > ${CIR_DIR}/karyotype.txt

# Aesthetics, can comment out: make Neff scaffolds reverse ordered. Looks better
# for 2-strain comparison
grep "Neff_scaffold" ${CIR_DIR}/karyotype.txt > ${CIR_DIR}/neff_karyo.txt
grep -v "Neff_scaffold" ${CIR_DIR}/karyotype.txt | sort -rV > ${CIR_DIR}/other_karyo.txt
cat ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt > ${CIR_DIR}/karyotype.txt
rm ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt

# Candidate genes
#tail -n +2 $CAND |
#    awk '{print $1,$2,$3}' \
#    >${CIR_DIR}/hgt_candidates.txt

# Parsing MCScanX collinearity output into circos links
n_genes=$(grep -v "^#" "$MCSX.collinearity" | wc -l)
n_done=0
while read line
do
    g1="${line%% *}"
    g2="${line##* }"
    # Make links involving HGT candidates red
    # attr=$( \
    #     awk -v g1=${g1%%-*} -v g2=${g2%%-*} \
    #     'BEGIN{attr="color=lgrey_a4,z=0"} \
    #     {if($0 ~ g1 || $0 ~ g2){attr="color=dred,z=30";exit}} \
    #     END{print attr}' \
    #     ${CAND} \
    # )
    attr='lgrey'
    # Get coordinates of gene IDs from the GFF file
    awk -v g1=$g1 -v g2=$g2 -v attr="$attr" \
        'BEGIN{N1=0;N2=0}
         $2 ~ g1 { N1+=1;c1=$1;s1=$3;e1=$4}
         $2 ~ g2 { N2+=1;c2=$1;s2=$3;e2=$4}
         END{
             if (N1 < 1 || N2 < 1) {exit 1}
             else {print c1,s1,e1,c2,s2,e2,attr}}' "$MCSX.gff"
    echo -ne " $n_done / $n_genes \r" 1>&2
    (( n_done++ ))
# Convert gene IDs to coordinates
done < <(grep -v "^#" "$MCSX.collinearity" | tr -d ' ' | tr '\t' ' ' | cut -d$' ' -f2,3) \
    > "${CIR_DIR}/mcsx.txt"
```
```bash
function usage () {
    cat <<EOF
Usage: `basename $0` -g genes -o output_folder -r ref [-h]
    -g  gtf file with genes coordinates
    -o  output folder for MCScanX files
    -f  reference genome
    -c  number of cpus to use for BLAST
    -h  displays this help
EOF
    exit 0
}

N_CPU=1

# Parsing CL arguments
while getopts ":g:o:f:c:h" opt; do
    case $opt in
        g ) GFF=${OPTARG} ;;
        o ) OUT_F=${OPTARG} ;;
        f ) FASTA=${OPTARG} ;;
        c ) N_CPU=${OPTARG} ;;
        h ) usage ;;
        \?) usage ;;
    esac
done

# Testing if mandatory arguments have been provided
if [ "x" == "x$FASTA" ] || [ "x" == "x$GFF" ] || [ "x" == "x$OUT_F" ]; then
    echo "Error: A reference genome, input GFF file and output folder are required."
    usage
    exit 0
fi

#1: format GFF file for MCScanX, only keeping CDS
echo "Formatting GFF..."
OUT_GFF="$OUT_F/MCScanX_genes.gff"
awk 'BEGIN{OFS="\t"}
     $3 ~ "CDS" {id_match=match($9,/ID=[^;]*/)
                 id=substr($9,RSTART+3,RLENGTH-3)
                 print $1,id,$4,$5}' $GFF > $OUT_GFF
echo "GFF formatted !"

#2: build blast database from sequences and all vs all blast
BASE_OUT=$(basename $FASTA)
BASE_OUT=${BASE_OUT%%.*}
makeblastdb -in $FASTA -dbtype prot -out "${OUT_F}/${BASE_OUT}"
echo "Blasting transcriptome against itself."
blastp -query $FASTA \
    -db "${OUT_F}/${BASE_OUT}" \
    -outfmt 6 \
    -max_target_seqs 5 \
    -num_threads $N_CPU \
    -out "${OUT_F}/${BASE_OUT}.blast"

#3: renaming chromosomes according to MCScanX conventions
MC_IN="$OUT_F/MCScanX_in"
# Use the two first characters of the fasta file as organism abbreviation
#SP_CODE=${BASE_OUT:0:2}
#sed "s/^scaffold_\([0-9]*\)/$SP_CODE\1/g" | sed 's/\.cds//g' "$OUT_GFF" > "$MC_IN.gff"
#sed "s/scaffold_\([0-9]*\)/$SP_CODE\1/g" | sed 's/::[^ ]*//g' "${OUT_F}/${BASE_OUT}.blast" > "$MC_IN.blast"

echo "All input files are ready:"
echo " - BLAST output: $MC_IN.blast"
echo " - GFF file: $MC_IN.gff"
```
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
div = pd.read_csv(
    snakemake.input[0], sep="\t", names=["chrom", "start", "end", "div"]
)
div = div.sort_values(["chrom", "start"])
div["align_len"] = abs(div.end - div.start)
div["log_len"] = np.log10(div.align_len)
jp = sns.jointplot(
    data=div,
    x="log_len",
    y="div",
    kind="kde",
    shade=True,
    cmap="plasma",
    levels=200,
)
jp.set_axis_labels(
    "Log10 length of aligned block [bp]",
    "Per base divergence in aligned blocks",
)
# Average divergence, weighted by alignment block length
avg_div = np.average(div["div"], weights=div["align_len"])
avg_log_len = np.mean(div.log_len)
jp.ax_joint.axvline(x=avg_log_len, linestyle="--")
jp.ax_joint.axhline(y=avg_div, linestyle="--")
jp.ax_marg_x.axvline(x=avg_log_len)
jp.ax_marg_y.axhline(y=avg_div)
jp.fig.suptitle(
    "Neff - C3 gap excluded sequence divergence\n"
    f"mean divergence: {100*avg_div:.2f}%"
)
plt.savefig(snakemake.output[0])
```
```python
import sys
import pandas as pd

gene_count = sys.argv[1]
out_fasta = sys.argv[2]
gc = pd.read_csv(gene_count, sep="\t")


def get_presence(freqs):
    """
    Given an array of frequencies, return the gene presence code.

    Parameters
    ----------
    freqs : numpy array of ints
        1D array of gene frequencies.

    Returns
    -------
    str :
        Gene presence code, in the form of a string of 1 (present)
        and 0 (absent).

    Examples
    --------
    >>> get_presence(np.array([32, 1, 0, 0, 2]))
    '11001'
    """
    freqs[freqs > 1] = 1
    freqs = freqs.astype(str)
    return "".join(freqs)


# Store species names and associated ortholog presence code in a dict
sp_presence = {
    gc.columns[i]: get_presence(gc.iloc[:, i]) for i in range(1, gc.shape[1] - 1)
}

with open(out_fasta, "w") as fa:
    for sp, presence in sp_presence.items():
        fa.write(">" + str(sp) + "\n")
        fa.write(presence + "\n")
```
```python
import pandas as pd
from matplotlib_venn import venn2, venn3
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.use("Agg")

vdf = pd.read_csv(snakemake.input["pres_compact"], sep="\t").drop(
    "Orthogroup", axis=1
)
vdf = vdf.astype(bool)
# vdf = vdf.rename(columns={'NEFF_v1_hgt_cds': 'v1'})

# Make Venn diagram. Sets are A, B, AB, C, AC, BC, ABC
fig, ax = plt.subplots(1, 2, figsize=(15, 12))
venn3(
    subsets=(
        len(vdf.query("    c3 and not neff and not amoeba")),  # A
        len(vdf.query("not c3 and     neff and not amoeba")),  # B
        len(vdf.query("    c3 and     neff and not amoeba")),  # AB
        len(vdf.query("not c3 and not neff and     amoeba")),  # C
        len(vdf.query("    c3 and not neff and     amoeba")),  # AC
        len(vdf.query("not c3 and     neff and     amoeba")),  # BC
        len(vdf.query("    c3 and     neff and     amoeba")),  # ABC
    ),
    set_labels=("C3", "Neff", "Amoeba"),
    ax=ax[0],
)
venn2(
    subsets=(
        len(vdf.query("    c3 and not neff")),
        len(vdf.query("not c3 and     neff")),
        len(vdf.query("    c3 and     neff")),
    ),
    set_labels=("C3", "Neff"),
    ax=ax[1],
)
fig.savefig(snakemake.output["venn"])
```
```r
library(tidyverse)
library(topGO)
library(viridis)

# Parse CL args
args <- commandArgs(trailingOnly=T)

# Extract gene IDs and GO terms.
annot_tbl <- read_tsv(args[1], col_types='cciicciiicciiiiiii') %>%
    dplyr::select(ID, attributes, hgt) %>%
    mutate(GO=str_replace(attributes, '.*Ontology_term=([^;]*);', "\\1"))
hgt_candidates <- annot_tbl %>%
    filter(hgt == 1) %>%
    pull(ID)
out_file <- args[2]
tmp_mapfile <- paste(dirname(out_file), 'id2go.tsv', sep='/')
out_fig <- paste0(tools::file_path_sans_ext(out_file), ".svg")

# Write geneID - GO terms mapping to file
annot_tbl %>%
    dplyr::select(ID, GO) %>%
    mutate(GO = str_replace_all(GO, ";", ", ")) %>%
    write_tsv(tmp_mapfile)

geneID2GO <- readMappings(tmp_mapfile)
geneNames <- names(geneID2GO)
geneList <- factor(as.integer(geneNames %in% hgt_candidates))
names(geneList) <- geneNames

GOdata <- new(
    "topGOdata",
    description = "HGT analysis in A. castellanii",
    ontology = "BP",
    allGenes = geneList,
    nodeSize = 10,
    gene2GO = geneID2GO,
    annot = annFUN.gene2GO
)

fisher.stat <- new(
    "classicCount",
    testStatistic = GOFisherTest,
    name = "Fisher test"
)
resultFisher <- getSigGroups(GOdata, fisher.stat)
weight.stat <- new(
    "weightCount",
    testStatistic = GOFisherTest,
    name = "Fisher test with weight algorithm",
    sigRatio = 'ratio'
)
resultWeight <- getSigGroups(GOdata, weight.stat)
#showSigOfNodes(GOdata, score(resultWeight), firstSigNodes = 9, useInfo = 'all')

# Get 'significant' terms
pvals <- score(resultWeight)
sig <- pvals[pvals < 0.01]
termStat(GOdata, names(sig))

# Report top enriched GO terms
allRes <- GenTable(
    GOdata,
    classic = resultFisher,
    weight = resultWeight,
    orderBy = "weight",
    ranksOf = "classic",
    topNodes = 200
)
write_tsv(allRes, out_file)
allRes <- allRes %>% mutate(GO.ID=paste(GO.ID, Term, sep=', '))
tidy_res <- as_tibble(allRes) %>%
    mutate(
        weight = as.numeric(weight),
        classic = as.numeric(classic),
        GO.ID = factor(
            GO.ID,
            ordered=T,
            levels=unique(GO.ID[ordered(weight)])
        )
    )

# Visualise top enriched GO terms
svg(out_fig)
ggplot(
        data=tidy_res %>% top_n(30, -weight),
        aes(x=GO.ID, y=-log10(weight))
    ) +
    geom_segment(aes(xend=GO.ID, yend=min(-log10(weight))), size=1.1) +
    geom_point(aes(size=Annotated, color=Significant / Expected)) +
    geom_hline(aes(yintercept=2), lty=2, col='red') +
    theme_minimal() +
    xlab("") +
    ylab("-log10 pvalue") +
    ggtitle("GO enrichment test in HGT candidates\n(Fisher exact test, weight algorithm)") +
    scale_color_viridis() +
    coord_flip()
dev.off()
```
```
shell: "Rscript scripts/go_enrich.R {input.annot} {input.candidates} {output}"
```