2 June 2015 by Remco Bouckaert
A few weeks back I attended the conference on Species delimitation in the age of genomics at ANU, which made me realise there is quite an interest in the topic. So, here a quick review of methods for species delimitation available in BEAST. The main methods are
- Bayes factor delimitation (BFD)
- Threshold based methods DISSECT/STACEY
Bayes factor delimitation
BFD is based on the multi-species coalescent where there two or more scenarios for species assignments. These scenarios can differ in that taxa for different species can be merged, split or you can even test whether a lineage fits better with one species than another.
The multi-species coalescent can be based on *BEAST or SNAPP, depending on the kind of data that you have: gene sequences for *BEAST and SNP or AFLP data for SNAPP. The idea of BFD is based on a fundamental method of Bayesian methods, which is comparing models based on their marginal likelihoods. Note that this is different from the likelihood typically shown in Tracer; the marginal likelihood is the likelihood marginalised over all parameters in the model.
prior x likelihood posterior = -------------------- marginal likelihood
The marginal likelihood is most reliably calculated using a stepping stone analysis, though this can be quite tedious to estimate since it is rather computational intensive. There are other methods for model fit, like AICM, that are less computational intensive, but these tend to be less reliable (as outlined by Ayden et al, PloS one, 2014).
For each of the scenarios, you set up a *BEAST or SNAPP analysis with a different species assignment, and estimate the marginal likelihood for each of these scenarios. To do this in BEAST, you need the MODEL_SELECTION package. There are more details here on how to set up a stepping stone/path sampling analysis.
Once you have the marginal likelihoods for each of your scenarios, the Bayes factor comparing say scenario A and B is just the difference between the marginal likelihood estimates.
Threshold based methods DISSECT/STACEY
Threshold based methods are based on setting a level epsilon and declare any split in the species tree that is below that value to be within a species, while any split above the threshold is deemed to represent the birth of a new species. The benefit of this method is that it can be performed during an MCMC run, so it does not require a stepping stone analysis. Also, it does not require setting up different scenarios. However, it does require setting a rather arbitrary threshold, though at the meeting at the ANU conference it was argued by various speakers that defining species is to some extent a social construct, which involves some subjective criteria. So, perhaps this is not such a big deal. update: Graham Jones clarified that the threshold is an approximation to zero without any biological meaning, so should not be interpreted as a social construct but its choice is based on a balance of accuracy and speed.
The DISSECT and STACEY packages allows you to do threshold based species delimitation. STACEY is *BEAST on steroids, with DISSECT as species delimitation method. It integrates out population sizes for the branches, which helps convergence though it means that population size information is lost. Furthermore, it has a number of MCMC proposals that help in mixing; a *BEAST analysis can easily get stuck in an area of tree space that forms a local optimum where the species tree gets locked into a topology because one or more of the gene trees have a good fit to the data preventing the species tree to move to alternative topologies. STACEY offers three extra proposals on top of the standard *BEAST proposals that help to get out of these local optima.
A STACEY analysis can be set up in BEAUti, but DISSECT requires a bit more handwork in editing the XML. For STACEY, you need to select the STACEY template under the File/Templates menu to get started. Both need a bit of fiddling with the XML to get going — see documentation for these packages to work out the details.
After running a DISSECT or STACEY XML file through BEAST, you can use the SpeciesDelimitationAnalyser to process the log files and find out the distribution over species assignments.
Threshold methods have been tested for *BEAST, though should work without too much hassle with SNAPP. The prior for SNAPP needs to be replaced with the
BFD paper for *BEAST.
BFD* paper for SNAPP.
Tutorial for BFD*, with example data.
DISSECT paper, also at bioRxiv.