Ancestral Fasta Generation Workflow for Population Genomics
make-ancestral-fasta-snakeflow
Introduction
For a lot of interesting population genomics work today, it is necessary or useful to know which allele at a SNP is derived and which is ancestral in the species under study. In many cases, that information can be obtained (or at least approximated) by finding the base carried at each such site in a closely related species. An argument could be made for getting as many closely related species as possible, putting them all on a phylogenetic tree along with the focal species, and then doing a gigantic, genome-wide multiple alignment and inferring the ancestral state at each position from all that information. But that sounds like a lot of work when you are trying to get a whole genome's worth of ancestral states, and in many cases you might not have a whole lot of genomes to work with anyway. So, here, I take a cue from an example on the ANGSD website, where chimpanzee is used to designate ancestral states so as to polarize the site frequency spectrum.
For both the program ANGSD and the more recently developed software RELATE, this information about ancestral alleles/bases can be supplied as a fasta file of the same length as the focal-species fasta file, but containing the sequence of the closely related species.
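For example, here is roughly how such an ancestral fasta gets used with ANGSD to produce a derived (polarized) site frequency spectrum. This is only a sketch: the bam list, reference fasta, and output prefix are placeholders.

```sh
# Sketch only: polarize the SFS with ANGSD using the ancestral fasta.
# bam_list.txt, target_genome.fna, and the output prefix "mypop" are placeholders.
angsd -bam bam_list.txt -ref target_genome.fna -anc ancestral.fna \
      -GL 1 -doSaf 1 -out mypop
realSFS mypop.saf.idx > mypop.sfs   # SFS polarized by the bases in ancestral.fna
```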
It seems to me that there are not a lot of really great workflows for creating such fasta files. That is what I am trying to provide here. My main goal for it is aligning large chunks of genomes between closely related salmonid species, and then creating fasta files giving the ancestral alleles.
For example, if I have a lot of high-quality, whole-genome sequencing from O. mykiss that I want to phase and throw into RELATE, I can align the Chinook salmon genome to the O. mykiss genome and use that to make the ancestral fasta. In this case, we say that the O. mykiss genome is our target genome, and the Chinook salmon genome is our query genome. I am mostly interested in segments that have been assembled into chromosomes in both species.
Especially because of the duplicated genome history of salmonids, it is worth taking a look at the results of an initial round of alignment in order to carefully select which chromosomes appear to be homologous. Therefore, this workflow must be done in two steps:
I. Do an initial round of alignment and plot some results from it. From these results, create a CSV file like the one found in `config/homolog_sets.csv`, which tells us which query-species chromosomes appear to be homologous to each target-species chromosome. The first 10 lines of that file appear here:
```
target,homologs
CM007935.1,CM009207.1
CM007936.1,CM009220.1 CM009224.1
CM007937.1,CM009204.1
CM007938.1,CM009202.1 CM009219.1
CM007939.1,CM009206.1 CM009221.1
CM007940.1,CM009205.1
CM007941.1,CM009208.1
CM007942.1,CM009206.1 CM009211.1
CM007943.1,CM009211.1 CM009217.1
```
Note that if multiple query chromosomes align to a target chromosome, they are separated by spaces.
II. Using the file `config/homolog_sets.csv`, map only the appropriate homologs to each target chromosome, and then turn the result into the ancestral fasta file.
Because there are two distinct steps to this workflow, and there has to be some user interaction, there is not a simple `all` rule that runs everything. Rather, explicit output files must be requested from snakemake on the command line, as described in the sections of the overview below.
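When requesting outputs this way, a snakemake dry run (plain snakemake usage, nothing specific to this workflow) is a handy way to preview which jobs a requested output file will trigger:

```sh
# -n does a dry run (nothing is executed); -p prints the shell commands that would run.
# The requested file is one of the part-one outputs described below.
snakemake -np \
    results/report/step20_notransition_inner1000_ident92/aligned_lengths.tsv
```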
Overview of the two parts of the workflow
Finding query chromosomes homologous to each target chromosome
In brief, this part of the snakemake workflow does this:
- Choose the target genome and declare which of its sequences you want to align to. For my purposes this is typically the assembled chromosomes.
- Choose the query genome and declare which sequences from the query you wish to map to the target. Once again, this is typically the assembled chromosomes, though one might wish to use all the sequences, even those that are not fully assembled into chromosomes.
- Break the target genome into a bunch of smaller fastas, each containing exactly one of the sequences/chromosomes. We will call these "single-chromosome fastas."
- Map all the query sequences against each of the single-chromosome fastas (see the sketch after this list).
- Run single_cov2 on each of the resulting MAFs, so that we retain only the very best alignment among any overlapping alignments.
- Summarize how many base pairs are aligned onto each single-chromosome fasta from each of the different query sequences (i.e., chromosomes), and also prepare a file of alignments that is suitable for plotting with ggplot.
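To make those alignment steps concrete, here is roughly what they amount to for one target chromosome, run by hand. This is a simplified sketch with placeholder file names; the actual rules also handle logging, conda environments, and the lastz parameter wildcards.

```sh
# Simplified, by-hand version of the per-chromosome alignment steps (placeholder file names).
samtools faidx target.fna CM007935.1 > tchrom_CM007935.1.fna    # one single-chromosome fasta

# align all the query sequences to that single target chromosome
lastz tchrom_CM007935.1.fna query_chroms.fna \
      --step=20 --notransition --inner=1000 --identity=92 \
      --format=maf > CM007935.1.maf

# keep only the best alignment wherever alignments overlap
single_cov2 CM007935.1.maf > CM007935.1_single_cov.maf
```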
These steps are all accomplished by running the workflow to this point with a command like:
```sh
snakemake --cores 20 --use-conda \
    results/report/step20_notransition_inner1000_ident92/aligned_lengths.tsv \
    results/report/step20_notransition_inner1000_ident92/plotable_segments.tsv
```
Big Note: The above command requests, by way of the directory name used (i.e., `step20_notransition_inner1000_ident92`), that the following options be passed to `lastz`: `--step=20 --notransition --inner=1000 --identity=92`. Users wanting different values of those parameters can simply request the output files in a different directory, like `step15_notransition_inner950_ident97`, and snakemake will take care of it. This makes it pretty easy to do multiple runs at different parameter settings and collect all the results.
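As an illustration of that naming convention (this is a hypothetical sketch, not the workflow's actual parsing code), a directory name can be decomposed into lastz options like this:

```sh
# Hypothetical sketch: turn a directory name into the corresponding lastz options.
dir="step20_notransition_inner1000_ident92"
opts=""
IFS='_' read -ra parts <<< "$dir"
for p in "${parts[@]}"; do
  case $p in
    step*)        opts+=" --step=${p#step}" ;;
    notransition) opts+=" --notransition" ;;
    inner*)       opts+=" --inner=${p#inner}" ;;
    ident*)       opts+=" --identity=${p#ident}" ;;
  esac
done
echo "lastz options:$opts"
# lastz options: --step=20 --notransition --inner=1000 --identity=92
```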
We can make a rulegraph for those outputs like this:
```sh
snakemake --rulegraph \
    results/report/step20_notransition_inner1000_ident92/aligned_lengths.tsv \
    results/report/step20_notransition_inner1000_ident92/plotable_segments.tsv | \
  dot -Tsvg > figs/rulegraph1.svg
```
The resulting rulegraph is in `figs/rulegraph1.svg`.
Once those two output files are available, they can be used to create some plots using the not-so-well-documented code in the `misc` directory.
The plot made by `misc/plot-aligned-length.R` shows that, of the material aligning to each target chromosome, most comes from one, two, or three different query chromosomes.
When you go assigning homologs to chromosomes, however, it is also important to see whether a section of a query chromosome appears to align to two places (which should not usually happen), or whether there are paralogous segments mapping anywhere. To answer those questions, it is helpful to make a huge cross-alignment facet grid like the one made by the script `misc/homology-grid.R`.
It is pretty easy to go through this grid by eyeball and choose the query chromosomes that belong in each row of the `config/homolog_sets.csv` file. Basically, dark lines within a column represent parts of different query chromosomes that map to the same target chromosome. Dark lines within a row represent parts of a single query chromosome that align to different target chromosomes. You want to choose things so that no part of a query chromosome gets re-used; that is, the same part of a query chromosome should not appear in multiple columns. Also, mapping of paralogs is somewhat evident here as (somewhat lighter) lines; these should not be paired with the target chromosomes.
Once the file `config/homolog_sets.csv` is made, you can proceed to the next section.
Mapping homologous chromosomes and creating a fasta
The contents of `config/homolog_sets.csv` are used to guide the mapping in the second half of the workflow. In brief, the steps in this part of the snakemake workflow are as follows:
- Create a query fasta file for each target chromosome that contains the query chromosomes with some homologous material to that target.
- Run `lastz` to map each query fasta against each target fasta. (You might want to use more stringent alignment criteria for this.)
- Run `single_cov2` on each of the outputs.
- Run `maf2fasta` (from the multiz package) on each of the resulting single_cov2 outputs.
- Use an R script to condense the resulting fastas into a fasta that is congruent with the target genome, and, at the same time, prepare a summary of the number of bases in the alignment for that chromosome that are of different types (e.g., target has a `C` and query has an `A`, etc.).
- Catenate all of those condensed fastas into a file with a path like `results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna`, and catenate all the summaries into a file with a path like `results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv`.
This is achieved by, for example, doing:
```sh
snakemake --cores 20 --use-conda --use-envmodules \
    results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna \
    results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv
```
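Once that ancestral fasta exists, a quick way to confirm it is congruent with the target genome is to compare sequence lengths between the two fasta indexes. This is a sketch; `target.fna.fai` is a placeholder for the target genome's fasta index:

```sh
# Sketch: every chromosome in the ancestral fasta should have exactly the same
# length as the corresponding target-genome chromosome.
ANC=results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna
samtools faidx $ANC
join <(cut -f1,2 target.fna.fai | sort) <(cut -f1,2 $ANC.fai | sort) |
  awk '$2 != $3'   # prints nothing if all shared chromosomes match in length
```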
The rulegraph for this second section of the workflow can be obtained with this:
```sh
snakemake --rulegraph \
    results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna \
    results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv | \
  dot -Tsvg > figs/rulegraph2.svg
```
The resulting rulegraph is in `figs/rulegraph2.svg`.
Configuration
All the configuration can be done by modifying the contents of the `config` directory.
This workflow is set up showing the values used for mapping the Chinook genome (the query) to the O. mykiss genome (our target). You can change values in the `config/config.yaml` file to work for your own pair of closely related species. More on this later, but it should be pretty self-explanatory for anyone familiar with snakemake.
Code Snippets
These are the shell and script directives from the workflow's rules:

```python
# download a genome fasta and gunzip it
shell:
    "wget -O {output.fna}.gz {params.url} 2> {log.wget}; "
    " gunzip {output.fna}.gz 2> {log.gunzip}"

# index the genome fasta
shell:
    "samtools faidx {input.fna} 2> {log}"

# extract a single target chromosome into its own fasta
shell:
    "samtools faidx {input.tfasta} {params.tc} > {output} 2> {log}"

# extract the requested query chromosomes
shell:
    "samtools faidx {input.qfasta} {params.qc} > {output} 2> {log}"

# align the query sequences to a single-chromosome target fasta
shell:
    "lastz {input} {params} > {output} 2> {log}"

# keep only the best alignment wherever alignments overlap
shell:
    "single_cov2 {input} > {output} 2> {log}"

# pull out the homologous query chromosomes for a target chromosome
shell:
    "samtools faidx {input.qfasta} {params.homos} > {output} 2> {log}"

# align the homologous query fasta to the target chromosome
shell:
    "lastz {input} {params} > {output} 2> {log}"

# keep only the best alignment wherever alignments overlap
shell:
    "single_cov2 {input} > {output} 2> {log}"

# convert each single_cov2 MAF to fasta
shell:
    "maf2fasta {input.target_fna} {input.maf} fasta > {output} 2> {log}"

# condense the maf2fasta output and summarise the base pairings
script:
    "../scripts/condense-and-summarise-fastas.R"

# catenate per-chromosome outputs into a single file
shell:
    "cat {input} > {output} 2> {log}"

# catenate per-chromosome outputs into a single file
shell:
    "cat {input} > {output} 2> {log}"

# catenate files while keeping only a single copy of the header line
shell:
    "awk 'NR==1 {{header=$0; print; next}} $0==header {{next}} {{print}}' {input} > {output} 2> {log} "

# count the bases aligned to a target chromosome from each query sequence
shell:
    "workflow/scripts/count_aligned_bases.sh {wildcards.tchrom} {input} > {output}"

# add a header to the catenated counts
shell:
    "(echo -e 'target\tquery\tnum_bases\tfraction'; cat {input}) > {output}"

# extract plottable alignment segments from each MAF
shell:
    "workflow/scripts/plotable-segs-from-maf.sh {input} > {output}"

# add a header to the catenated segments
shell:
    "(echo -e 'target\ttstart\ttend\ttstrand\tquery\tqstart\tqend\tqstrand'; cat {input}) > {output}"
```
The R script `condense-and-summarise-fastas.R`, which condenses each maf2fasta output into an ancestral fasta (plus a mask fasta and a base-pairing summary):

```r
log <- file(snakemake@log[[1]], open="wt")
sink(log, type = "output")
sink(log, type = "message")

#### get all the snakemake variables ####
maf2fasta_output <- snakemake@input$maf2fasta
final_fasta <- snakemake@output$anc_fasta
pair_summary_file <- snakemake@output$bp_pairs
mask_fasta <- snakemake@output$mask_fasta

# we need the target sequence name for a few things.
# because lastz munges names with dots in them, we just use the output
# file name, which has the {tchrom} on it.
tchrom = snakemake@wildcards$tchrom

#### Do the rest ####
library(tidyverse)

# read the sequences in. They are on every other line. The first
# is the target, the following are separate ones for each query chromosome
seqs <- read_lines(maf2fasta_output)

# break those sequences into a list of vectors
seq_vec_list <- str_split(seqs[seq(2, length(seqs), by = 2)], pattern = "")
snames <- seqs[seq(1, length(seqs), by = 2)]
rm(seqs)

# if there were multiple query sequences, condense them into a single one.
# the "-" has the lowest value of any of the possible letters in the aligned
# sequences, so this step takes the base at the aligned query sequence, if there
# is an aligned query sequence. (Remember we did single_cov2 so there will be only
# one aligned query at any point.)
if(length(seq_vec_list) > 2) {
  anc <- do.call(pmax, seq_vec_list[-1])
} else {
  anc <- seq_vec_list[[2]]
}

# now, we count up the number of different types of sites
pair_counts <- table(paste(seq_vec_list[[1]], anc))

# and make a tibble of those numbers
count_summary <- tibble(
  name = names(pair_counts),
  n = as.integer(pair_counts)
) %>%
  separate(name, into = c("target", "ancestral"), sep = " ") %>%
  mutate(chrom = tchrom, .before = target)

# and write that file out
write_csv(count_summary, file = pair_summary_file)

# and now, from anc, we drop the sites that are "-"s in the target.
# That gives us an ancestral sequence that is congruent with the
# original target sequence. And then we replace the "-"'s in the ancestral
# seq with Ns
anc_fasta_seq <- anc[seq_vec_list[[1]] != "-"]
anc_fasta_seq[anc_fasta_seq == "-"] <- "N"

# We also create a fasta that masks the sites that are considered
# to be in repeat regions in either the target or the ancestral
# genome. This is done by marking any position that is not a
# capital letter A, C, G, or T in both sequences an N, and anything
# that is a capital letter A, C, G, or T in both a P.
# Note that this also chucks sites that are designated by IUPAC codes
# in the query. Hmmm....
targ_seq <- seq_vec_list[[1]]
targ_seq <- targ_seq[targ_seq != "-"]
mask_seq <- ifelse(
  (targ_seq %in% c("A", "C", "G", "T")) &
    (anc_fasta_seq %in% c("A", "C", "G", "T")),
  "P",
  "N"
)

# now make a matrix out of those (anc_fasta_seq and mask_seq) to print them in lines.
# We have to extend each with NAs to a length that is a multiple of 70 so that
# it doesn't get chopped off when we make a matrix of it for fast printing.
final_line_bits <- length(anc_fasta_seq) %% 70
if(final_line_bits != 0) {
  length(anc_fasta_seq) <- length(anc_fasta_seq) + (70 - final_line_bits)
  length(mask_seq) <- length(mask_seq) + (70 - final_line_bits)
}

# now, make a matrix of those and write them out. We ensure that
# the NA's in the last row of the matrix are just empty (i.e. ""'s).

# writing out the ancestral fasta
cat(">", tchrom, "\n", sep = "", file = final_fasta)
matrix(anc_fasta_seq, ncol = 70, byrow = TRUE) %>%
  write.table(
    file = final_fasta,
    sep = "",
    na = "",
    quote = FALSE,
    append = TRUE,
    row.names = FALSE,
    col.names = FALSE
  )

# writing out the mask fasta
cat(">", tchrom, "\n", sep = "", file = mask_fasta)
matrix(mask_seq, ncol = 70, byrow = TRUE) %>%
  write.table(
    file = mask_fasta,
    sep = "",
    na = "",
    quote = FALSE,
    append = TRUE,
    row.names = FALSE,
    col.names = FALSE
  )
```
The script `workflow/scripts/count_aligned_bases.sh`, which tallies, for one target chromosome, how many bases of each query sequence align to it:

```sh
CHROM=$1
INFILE=$2

awk -v chrom=$CHROM '
  /^a/ {seq=0}
  /^s/ {seq++}
  /^s/ && seq==2 {
    n[$2] += length($7);
    tot += length($7);
    seq++
  }
  END {
    for(i in n) printf("%s\t%s\t%d\t%.4f\n", chrom, i, n[i], n[i]/tot)
  }
' $INFILE | sort -n -b -r -k 3
```
The script `workflow/scripts/plotable-segs-from-maf.sh`, which extracts target and query segment coordinates from a MAF for plotting:

```sh
awk '
  /^a/ {seq=0}
  /^s/ {
    seq++;
    if($5 == "-") {
      start = $6 - $3;
      end = start - $4;
    } else {
      start = $3;
      end = $3 + $4
    }
    printf("%s\t%d\t%d\t%s", $2, start, end, $5);
    if(seq == 1) printf("\t");
    else printf("\n");
  }
' $1
```