Genomic G+C Content and Optimal Living Temperatures in Prokaryotes

We know that a higher G+C content increases the thermostability of nucleic acids. So we would expect that prokaryotes that live at higher temperatures (i.e., those with a higher optimal living temperature or Topt) would have a higher G+C content in their genomes. The evidence (or not) for this has been the subject of some quite heated exchanges in the literature. There are two broad things at issue here. First, it is obvious that genomic G+C content will be associated with gene/protein content, which is associated with the functions that an organism needs to perform. Second, it should also be obvious that closely related organisms will share patterns in genomic G+C content and Topt simply because of their phylogenetic relationships. So it is quite likely that all these factors obscure the underlying relationship between genomic G+C content and Topt. In fact, when you plot a graph of genomic G+C content against Topt in bacteria, there is no apparent relationship. Of course, I am not saying anything new, and others have tried to uncover this relationship, including by restricting comparisons to certain closely related groups or by applying certain multivariate methods.

A few years ago, I tried to tackle this with a talented intern, Norbert Kopocz; you can see the presentation of his results here. We essentially performed a simple phylogenetic comparative analysis by computing the ancestral G+C content and the ancestral Topt at each node on a tree using squared-changes parsimony. Then we correlated the change in G+C content (ΔGC) against the change in Topt (ΔTopt) along each branch. By doing this, we essentially analyzed whether an increase (or decrease) in G+C content is associated with an increase (or decrease) in Topt, regardless of the starting values of G+C and Topt. Our analyses indicated that this relationship is statistically significant, although the proportion of variation explained is low. (By the way, at the time I was reasonably sure that this ancestral reconstruction approach was related to the independent contrasts method, and I have just found a paper that seems to demonstrate this; however, the paper doesn’t seem to resolve the apparent inflation in the amount of data: ancestral state reconstruction obtains correlations based on data for the 2(n-1) branches on a tree, whereas independent contrasts has only n-1 comparisons.)
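To make the branch-change idea concrete, here is a minimal sketch in Python. It is not the code we used at the time. It assumes an unweighted form of squared-change parsimony (branch lengths ignored), a made-up child-to-parent dictionary as the tree format, and invented trait values. The function names are hypothetical.

```python
from math import sqrt


def squared_change_parsimony(parent, tip_values, sweeps=2000):
    """Estimate node values that minimise the summed squared change over branches.

    parent     -- dict mapping each non-root node to its parent (assumed format)
    tip_values -- dict mapping each tip to its observed trait value
    Branch lengths are ignored, so every branch is weighted equally.
    """
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, []).append(par)
        neighbours.setdefault(par, []).append(child)

    values = dict(tip_values)
    internal = [node for node in neighbours if node not in tip_values]
    start = sum(tip_values.values()) / len(tip_values)
    for node in internal:
        values[node] = start

    # At the optimum each internal node equals the mean of its neighbours,
    # so repeated sweeps (Gauss-Seidel relaxation) converge on that solution.
    for _ in range(sweeps):
        for node in internal:
            adjacent = neighbours[node]
            values[node] = sum(values[m] for m in adjacent) / len(adjacent)
    return values


def branch_changes(parent, values):
    """Change in the trait along every branch, in a fixed (sorted) branch order."""
    return [values[child] - values[parent[child]] for child in sorted(parent)]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Toy tree ((A,B)n1,(C,D)n2)root with invented G+C (%) and Topt (deg C) values.
toy_tree = {"n1": "root", "n2": "root", "A": "n1", "B": "n1", "C": "n2", "D": "n2"}
toy_gc = {"A": 40.0, "B": 45.0, "C": 60.0, "D": 66.0}
toy_topt = {"A": 30.0, "B": 37.0, "C": 70.0, "D": 65.0}

delta_gc = branch_changes(toy_tree, squared_change_parsimony(toy_tree, toy_gc))
delta_topt = branch_changes(toy_tree, squared_change_parsimony(toy_tree, toy_topt))
print("correlation of branch changes:", pearson(delta_gc, delta_topt))
```

Note that the correlation here is computed over all 2(n-1) branches, which is exactly the apparent inflation in the amount of data mentioned above, so any significance test on it would need care.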

Nowadays, there are, of course, much better ways of doing this analysis, especially given the number of full-length genomes available.  We would not need to reconstruct genomic G+C content using the values of this trait at the tips of the tree.  Instead, we could reconstruct ancestral genes (maybe, large parts of the genomes?) to obtain the ancestral genomic G+C content at the nodes.   Perhaps there are even better ways of doing the analysis: ways that build in the correlation between ΔGC and ΔTopt as a model parameter in a phylogenomic analysis. We could then test whether this correlation is significantly different from zero.

Alternative uses of phylogenetic methods

Phylogenetic methods have been used to infer the relationships of things other than genes, organisms or species, or indeed anything biological: languages, cultures, political systems, religions, stories (including, most recently, Red Riding Hood), and so on.

I train in martial arts, and I think it would be really interesting to do a phylogeny of the different martial arts. There are many folk histories about how the different fighting traditions have evolved, and some of these have been passed along as fact, although it’s pretty obvious in many instances that they cannot be true. There has also been quite a bit of politics involved in taking ownership of different forms because of the ebb and flow of national and cultural borders.

I’m pretty certain that there has been no real scientific study of the relationships between these arts. It would not be too hard to do, I think.  Of course, one would have to figure out ways to account for “borrowing”, but the language phylogeneticists have been doing this.

Inferring Dispersal and Vicariance when Area Relationships are Known

It is a long story why, but very early in my graduate career I was interested in salticids. The distributional information on these species is quite well known. We are really interested in inferring rates of dispersal/vicariance based on the distributions of these species. We also have fossil records from this family (and other salticids) as well as quite extensive sequence data from extant members of the group.

The challenge has been this: none of the software that I know about for inferring dispersal/vicariance events (e.g., Lagrange, S-DIVA, RASP) takes account of known geographic history. This is because such software assumes that we don’t know when these areas split or converged. However, for groups with global distributions, and quite deep histories, we do know about the large-scale movements of continents, and we should be able to incorporate this information into our methods. At this point there is no way that I know of to do this.

Now, this also relates to the issue of the gene-tree/species-tree problem, but in a not-so-obvious way.  The multi-species coalescent embeds the gene genealogies within the species phylogeny. This allows us to take account of multiple gene histories and reconstruct the species history.   StarBEAST does this, as does BEST.  There is a parallel here with the biogeography situation — we are embedding species phylogenies within area histories. The difference is this — area histories can be networks rather than trees.

So, at present there is no way that I know of, either with biogeographies or the multi-species coalescent, to specify the known tree/network of the larger process (i.e., either the larger species tree which contains the gene trees, or the larger area tree that contains the species phylogenies). I don’t think that this is theoretically difficult; it’s just that no one seems to do it this way.

Distance-based estimation of evolutionary parameters

These days, people routinely use Bayesian inference to estimate evolutionary parameters. MrBayes and BEAST are extremely popular packages, and deservedly so.  But there is no getting round the fact that these analyses take time.  So, what if we used distance-based methods to perform the same kinds of analyses?  Distance-based methods tend to be a lot faster, although the variances of the estimates are usually larger.  But perhaps for a given dataset — especially a large one, with long sequences — the sampling variance may be negligible.

So — here’s an example of what we might do.  Imagine that we are trying to work out parameters associated with a relaxed clock model.  Here is a plausible algorithm:

1.  Build a neighbor-joining tree.

2.  Root the tree by finding the point such that the variance of distances between the root and all tips is minimized.

3. For a tree with n tips, there will be at most n-1 branches that will need to be lengthened or shortened to ensure that all tips terminate at exactly the same distance from the root. Find these n-1 branches, and calculate the multipliers that modify the lengths of these branches.

There are many ways to do (3), of course; perhaps the easiest may be some kind of stepwise approach.
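Step 2 of the algorithm above can be sketched as follows. This is only an illustration under stated assumptions: the unrooted tree is assumed to have been built already (step 1) and is given as a hypothetical adjacency dictionary of branch lengths, and the function names are my own. For each branch, the variance of the root-to-tip distances is quadratic in the root position, so the best position on that branch is where the mean distances on the two sides balance (clamped to the ends of the branch).

```python
def tip_distances(adj, start, blocked):
    """Path lengths from `start` to each tip reachable without stepping onto `blocked`."""
    found = {}
    stack = [(start, blocked, 0.0)]
    while stack:
        node, came_from, dist = stack.pop()
        onward = [(m, l) for m, l in adj[node].items() if m != came_from]
        if not onward:  # nothing further out, so this node is a tip
            found[node] = dist
        for nxt, length in onward:
            stack.append((nxt, node, dist + length))
    return found


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def min_variance_root(adj):
    """Return (u, v, x, var): the root sits at distance x from u along branch u-v."""
    best = None
    done = set()
    for u in adj:
        for v, length in adj[u].items():
            if (v, u) in done:  # each branch is examined once
                continue
            done.add((u, v))
            side_u = list(tip_distances(adj, u, v).values())
            side_v = list(tip_distances(adj, v, u).values())
            # Minimum of the quadratic: balance the mean root-to-tip
            # distance on the two sides, then clamp onto the branch.
            x = (sum(side_v) / len(side_v) + length - sum(side_u) / len(side_u)) / 2
            x = min(max(x, 0.0), length)
            dists = [d + x for d in side_u] + [d + length - x for d in side_v]
            var = variance(dists)
            if best is None or var < best[3]:
                best = (u, v, x, var)
    return best


# Toy unrooted tree: internal node X joined to tips A, B and C (invented lengths).
toy_adj = {
    "X": {"A": 1.0, "B": 1.0, "C": 3.0},
    "A": {"X": 1.0},
    "B": {"X": 1.0},
    "C": {"X": 3.0},
}
print(min_variance_root(toy_adj))
```

A stepwise version of step 3 could then repeatedly rescale the branch whose adjustment most reduces the remaining variance, but that part is not attempted here.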

By the way, this is not the way that “standard” relaxed clock models work — with your typical relaxed clock model, you have a distribution of rates and/or an inheritance model of rates.  The model above tries to identify certain branches where there is a speed-up/slow-down of rates.  From an evolutionary perspective, this is equivalent to saying that there are some lineages where species may have encountered environmental situations that lead to rapid acquisition of substitutions.

There are other things we can do with distance-based methods. The original skyline plots, for instance, did not use Bayesian methods. The beauty of the Bayesian skyline plot is that it gives a smooth representation of population trajectory. But can we get the same smoothness by bootstrapping our distance-based trees?
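For reference, the classic (non-Bayesian) skyline estimate is simple enough to sketch in a few lines. The sketch below assumes contemporaneous tips and a clock-like tree, takes the coalescent times as a plain list (a hypothetical input format), and uses the standard estimate for an interval with k lineages and duration w, namely w·k(k-1)/2. If times are in substitutions per site, the result is in units of effective population size times the substitution rate.

```python
def classic_skyline(coalescent_times):
    """Piecewise-constant population size estimates from coalescent times.

    coalescent_times -- times of the n-1 coalescent events, measured backwards
                        from the present (all n tips sampled at time 0)
    Returns a list of (interval_start, interval_end, estimate) tuples.
    """
    lineages = len(coalescent_times) + 1
    previous = 0.0
    estimates = []
    for time in sorted(coalescent_times):
        width = time - previous
        # Interval with k lineages and duration w gives the estimate w * k(k-1)/2.
        estimates.append((previous, time, width * lineages * (lineages - 1) / 2))
        previous = time
        lineages -= 1
    return estimates


# Invented coalescent times for a four-tip genealogy.
for start, end, size in classic_skyline([0.1, 0.4, 1.0]):
    print(f"{start:.2f}-{end:.2f}: {size:.3f}")
```

Bootstrapping would then amount to repeating this over trees built from resampled alignments and summarizing the estimates at each time point.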

Modelling changes between cellular compartments on a genealogy

We know that viruses like HIV can move in and out of different cellular and/or systemic compartments. For instance, a few years ago, Perelson et al. developed a three-compartment model to account for the decay in viral loads during antiretroviral therapy. Each compartment corresponds to a different type of cell, each with a different generation time. The proportion of viruses produced by each compartment is also different.

It seems that we should be able to use a phylogenetic tree to derive equivalent estimates of the relative generation times of these cellular compartments and the proportions of each compartment. How do we do this? Well, we can assume that for all viruses, the intrinsic mutation rate remains the same; however, a virus that is produced by a cell with a longer generation time will have a lower observed rate of mutation. So, imagine a simple case in which there are only two cellular compartments, and viruses switch stochastically between one and the other. If close to 100% of the viruses come from one compartment, then almost all lineages will have the same observed rate. Thus, we would expect that a phylogeny of the viruses will look as though it agreed with a molecular clock. As we increase the proportion of viruses produced by the other compartment, we can expect to see greater deviation from the molecular clock. As viruses move between compartments more rapidly, the overall rate along all lineages can be approximated by a weighted average of the two rates, and we return to a tree that looks clock-like. So there is a sweet spot where the rate of movement between compartments is sufficiently low, so that we are able to estimate the relative proportions and generation times of the different compartments based on the degree of deviation from the molecular clock.
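The argument about switching rates can be checked with a toy simulation. The sketch below makes strong simplifying assumptions that are mine, not part of any published model: lineages are simulated independently (real lineages in a genealogy share history), there are exactly two compartments, switching is a symmetric continuous-time Markov process, and the rates and time span are invented. It reports the spread of observed rates among lineages, which is a rough proxy for deviation from a molecular clock.

```python
import random
from statistics import pstdev


def lineage_rate(total_time, rates, switch, rng):
    """Average observed substitution rate along one lineage.

    rates  -- (rate in compartment 0, rate in compartment 1)
    switch -- (rate of leaving compartment 0, rate of leaving compartment 1)
    """
    total_switch = switch[0] + switch[1]
    # Start in the stationary distribution (50/50 if there is no switching).
    p0 = switch[1] / total_switch if total_switch > 0 else 0.5
    state = 0 if rng.random() < p0 else 1
    remaining, substitutions = total_time, 0.0
    while remaining > 0:
        leave = switch[state]
        dwell = rng.expovariate(leave) if leave > 0 else remaining
        dwell = min(dwell, remaining)
        substitutions += rates[state] * dwell
        remaining -= dwell
        state = 1 - state
    return substitutions / total_time


def rate_spread(switch_rate, n_lineages=200, seed=1):
    """Standard deviation of observed rates among independently simulated lineages."""
    rng = random.Random(seed)
    observed = [
        lineage_rate(10.0, (1.0, 0.2), (switch_rate, switch_rate), rng)
        for _ in range(n_lineages)
    ]
    return pstdev(observed)


for s in (0.0, 0.05, 1.0, 200.0):
    print(f"switch rate {s:>6}: spread of lineage rates = {rate_spread(s):.3f}")
```

With slow switching the lineages fall into two rate classes and the spread is large; with fast switching every lineage approaches the weighted average rate and the spread shrinks, which is the return to a clock-like tree described above.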

One can imagine that a simple solution may be to treat this as an example of the structured coalescent and incorporate migration.  But it turns out to be not so simple for two reasons.  First, when we sample the viral sequences to build our phylogeny (or more precisely our genealogy), we don’t know which compartment/cell these viruses were last in. In other words, we don’t know the “demes” or “areas” from which these viruses were sampled. Second, the observed substitution rates change depending on which “deme” or “area” the virus finds itself in.

So — this model of cellular compartments and differing generation times/mutation rates provides a mechanistic explanation for relaxed clock models when these are applied to viruses.  It also suggests that we should be able to develop a covarion-type analysis to figure out the relative proportions of the different compartments and their generation times. But how?

Phylogenetic Inference using Incompletely Specified Conditional Priors

Many years ago, I was a systematist, working on the taxonomy of an obscure group of flatworms (long story, details at some later date). In any case, I was using morphological characters to construct the phylogeny, and one of the challenges was to assign reasonable weights to the characters to reflect their propensity to change over evolutionary time. More complex characters would have a greater inertia to change, compared to simpler characters. People had devised various schemes, both subjective and objective, but these methods required weight assignments across all characters. But my problem was more general: suppose, for three characters A, B and C, you knew that both B and C could change more frequently than A, but you had no idea how B and C related to each other. How would you assign weights in this case? I developed a method that I called “information-rich character weighting”, and I applied it to parsimony analysis. The procedure was an iterative method that used weights derived from the character consistency indices.

Of course, I didn’t realize it at the time, but what I was doing was clumsily trying to couch the problem in a Bayesian framework: in the paper, I argue that the relative weights are a reflection of prior belief.  Which leads me to this project. In Bayesian phylogenetic inference, is there a way of specifying incompletely-specified conditional priors on the relative rates of change of different characters?  By this I mean that, as with the example of A, B and C above, I have some prior belief that B and C evolve more rapidly than A, but I have no prior belief about the relative rates of B and C.  How can I implement this in a Bayesian phylogenetic analysis?
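One naive possibility, offered only as a sketch and not as an answer to the question, is to draw rates from an ordinary prior and reject any draw that violates the known orderings. This leaves the unconstrained pairs (here B and C) exchangeable. The exponential base prior, the function name and the input format are all assumptions of mine, and this is not something that existing Bayesian phylogenetics packages are known to offer in this form.

```python
import random


def sample_partially_ordered_rates(names, faster_than, rng, max_tries=10000):
    """Draw one set of relative rates that respects a partial ordering.

    names       -- character names, e.g. ["A", "B", "C"]
    faster_than -- list of (fast, slow) pairs encoding the prior belief
    Each rate is drawn from an Exponential(1) base prior (an assumption);
    draws that break any stated ordering are rejected.
    """
    for _ in range(max_tries):
        rates = {name: rng.expovariate(1.0) for name in names}
        if all(rates[fast] > rates[slow] for fast, slow in faster_than):
            return rates
    raise RuntimeError("no draw satisfied the ordering constraints")


rng = random.Random(42)
constraints = [("B", "A"), ("C", "A")]  # B and C both faster than A
draws = [sample_partially_ordered_rates(["A", "B", "C"], constraints, rng)
         for _ in range(2000)]
share = sum(d["B"] > d["C"] for d in draws) / len(draws)
print("fraction of draws with B faster than C:", share)
```

In an MCMC setting the same effect could be obtained by assigning zero prior density to any proposed state that violates the ordering, although rejection becomes inefficient as the number of constraints grows.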


On this page, you will find some preliminary, often vague, ideas about potential research topics that we are interested in.  You can find these projects under “Recent Posts” at the bottom of this page.   Many of these are bite-sized, and would probably suit a semester or summer project for a person with the right skills.   Also, in some cases, there may be solutions out there that we have not encountered. If that is the case, please email us — we would love to hear from you!