Tag Archives: SNAPP

How much data do I need for SNAPP?

30 June 2015 by Remco Bouckaert

The unwelcome answer to that question is; it depends.

Number of lineages per species

First of all, there should be more than one lineage (= one haploid sequence) for every species. If there is only a single lineage, there are no coalescent events possible in the branches ending in the tips, and the branch above it will have on average only a single coalescent event. This means that the populations sizes for each of the branches will be informed by only a single coalescent event (on average) and there will be very little signal to inform population sizes. The result is that almost certainly, the population size will be sampled from the prior. And since population size and branch lengths are confounded (large population size means larger branch lengths) and the prior on population sizes is quite broad by default, it may take a lot of time to converge.

So, multiple lineages per species is recommended. Of course, this has to be balanced with the penalty in computational that is incurred. So, you have to experiment a bit to find out what is computationally feasible, and how much signal can be obtained from the data.

Sequence length

In SNAPP, every site in a sequence has its own gene tree that is assumed to be independent of all other gene trees. So, adding sites also means adding gene trees.

When samples are very closely related, all coalescent events happen very closely to the present time (the sampling time). If so and you look at a branch ending in a species, there is only a single lineage left at the top of the branch. This means we are running in the problem described above; there is no signal in the data left to determine population sizes, and convergence will be difficult. There is no point in adding more sites that have this property, since it would just slow down the calculation without adding more information.

When samples are very distantly related, all coalescent events happen in the branch stemming out of the root. This means, there is no topological information in such samples, and every species tree will fit equally well. On top of this, there is no information to inform population sizes, so SNAPP will not give a lot of information, and will have a terrible time to reach convergence.

In between these extremes, there is the goldilocks zone, where samples coalesc not too early, and not too late, but just in at the right time. In this goldilocks zone, there will be some lineage sorting, so branches above those ending in tips will contain some population size information. This is the kind of data you would like to add.

Of course, it is hard to tell beforehand what kind of data you have, so it is hard to tell beforehand what is the ideal sequence length.

Thanks to David Bryant for pointing out most of the above.

SNAPP handling missing data and path sampling made easier

21 July 2014 by Remco Bouckaert

SNAPP treats each SNP as having its own gene tree, but what happens when there is data missing for some sites for a SNP? SNAPP simply assumes these taxa do not exist in the gene tree. If you have a species containing 3 lineages for which one has missing data, SNAPP assumes there is a gene tree with only 2 lineages for that species. So, what happens when all 3 lineages is missing? Previous versions of SNAPP (v1.1.5 and before) could not handle this situation and just removed these sites from the data. This can be a problem when you do species delimitation using BFD*; when you split a species and data is missing, some sites may be deleted that are not deleted when doing the un-split analysis. As a results, marginal likelihoods are calculated for different data sets, so these are not comparable. In v1.1.6, no sites with missing data are deleted any more, even when this leaves some species having zero lineages. SNAPP assumes the gene tree simply has no taxa in that species any more.

Another change in SNAPP v1.1.6 is that it uses the MCMC class from BEAST instead of SNAPP, which means there is no timeout and sampling from prior options any more, but this makes doing a path sampling analysis a lot easier.

First, make sure you have SNAPP at least v1.1.6 installed and Model Selection v1.0.2. Open the package manager (in BEAUti under the File/Manage packages menu) and select SNAPP, then click the Install button. After a little while a warning pops up that SNAPP is installed, and you might need to restart BEAUti to be able to set up a SNAPP analysis.

Then, select the SNAPP template, to set up a SNAPP analysis.

Click import data, and select a nexus file with your alignment. The taxa will automatically be assigned to a species by guessing based on the lineage names in the nexus files. You can change this of course.

In the Model Parameters panel, make sure you either calculate the values for u and v (see the Rough guide to SNAPP how to do this) or just click the estimate flag next to Mutation Rate U. When there is a sufficient amount of SNP data the estimates for U and V converge quite rapidly most of the time. Note: Leaving the default values for u and v and not estimating these mutation rates is almost surely going to lead to bad fits!

You can save this analysis as XML file, say runA.xml and run it with BEAST.

Now, if you want to calculate the marginal likelihood, you can start the AppStore application that is part of BEAST, and select the Path Sampler icon — either double click or click launch to start the Path sampler.

A window pops up where you can select the XML file you save from BEAUti and specify the parameters for the path sampling analysis.

Once you click OK, the path sampling analysis is set up and run in a separate window, where after a little (or long — depending on the data) while the marginal likelihood will be printed.