Tag Archives: performance

Load Balancing — this site is deprecated: goto https://www.beast2.org/ for up-to-date information

18 August 2014 by Remco Bouckaert

If you have a good graphics cards, you can use it with BEAGLE to increase the speed of BEAST runs. If you have multiple graphics cards, when starting BEAST from the command line, the -beagle_order flag can be used to tell which thread goes on which GPU. Start BEAST with the -beagle_info flag to find out what kind of hardware you have and which numbers they are.

For example, I get on my aging univesity computer, the following output:

                  BEAST v2.2.0 Prerelease, 2002-2014
       Bayesian Evolutionary Analysis Sampling Trees
                 Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut and Marc A. Suchard
                              
               Department of Computer Science
                   University of Auckland
                  remco@cs.auckland.ac.nz
                  alexei@cs.auckland.ac.nz
                              
             Institute of Evolutionary Biology
                  University of Edinburgh
                     a.rambaut@ed.ac.uk
                              
              David Geffen School of Medicine
           University of California, Los Angeles
                     msuchard@ucla.edu
                              
                Downloads, Help & Resources:
              	http://beast2.cs.auckland.ac.nz
                              
Source code distributed under the GNU Lesser General Public License:
              	http://code.google.com/p/beast2
                              
                     BEAST developers:
	Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled, 
	Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li, 
	Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel, 
          	Oliver Pybus, Chieh-Hsi Wu, Walter Xie
                              
                         Thanks to:
    	Roald Forsberg, Beth Shapiro and Korbinian Strimmer

BEAGLE resources available:
0 : CPU
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU


1 : GeForce GTX 295
    Global memory (MB): 896
    Clock speed (Ghz): 1.24
    Number of cores: 240
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA


2 : GeForce GTX 295
    Global memory (MB): 895
    Clock speed (Ghz): 1.24
    Number of cores: 240
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA


3 : GeForce GTX 295
    Global memory (MB): 896
    Clock speed (Ghz): 1.24
    Number of cores: 240
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA


4 : GeForce GTX 295
    Global memory (MB): 896
    Clock speed (Ghz): 1.24
    Number of cores: 240
    Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA

It shows there are 4 GPUs numbered 1 to 4 as well as a CPU numbered 0. It also reports some statistics like the number of cores and memory on each GPU. I understand that on OSX due to a bug in the OpenCL library the reported statistics are somewhat exaggerated (e.g. it reports 280 cores on my Macbook air while the HD Graphics 5000 only has 40).

If I have an analysis with two partitions, or one partition, but using the ThreadedTreeLikelihood instead of the TreeLikelihood, and I would start a job with

beast -beagle_order 2,4 -threads 2 beast.xml

it will use two threads, and the first will use GPU number 2 and the second will use GPU number 4. So, you can divide your data among these cards such that all of your GPUs are utilised.

But what if your dataset is too large to fit on one or more of your GPUs? Then, using the -beagle_order flag can be used with some of the partitions using GPUs and others using CPUs. Say, you have a single GPU numbered 1, and a CPU numbered 0, then using

beast -beagle_order 0,0,0,1 -threads 4 beast.xml

will place the first three threads on the CPU and the last on the GPU. Typically, the CPU and GPU do not run at the same speed and I’ll assume the GPU is much faster. What you will see when running the above is then that the three CPU threads are running behind the GPU. So, the GPU thread will be waiting on the CPU threads to finish, and this shows up in CPU load being well below 400% On Mac or Linux I use ‘top’ to see how much CPU load there is. If you could put less data on the CPUs and more on the GPU, you could utilise your computer more efficient.

Now you can — this requires BEAST v2.2.0-pre release, and BEASTlabs for v2.2.0. The ThreadedTreeLikelihood has a proportions attribute where you can specify how much of the data should go into a thread. By default, the ThreadedTreeLikelihood splits up the data in equal parts, which is fine if you only use CPU or only GPU. But when you mix, the proportions specifies proportions of patterns used per thread as space delimited string. This is useful when using a mixture of BEAGLE devices that run at different speeds, e.g GPU and CPU. The string is duplicated if there are more threads than proportions specified. For example, ‘1 2′ as well as ’33 66’ with 2 threads specifies that the first thread gets one third of the patterns and the second two thirds. With 3 threads, it is interpreted as ‘1 2 1’ = 25%, 50%, 25% and with 7 threads it is ‘1 2 1 2 1 2 1’ = 10% 20% 10% 20% 10% 20% 10%. If not specified, all threads get the same proportion of patterns.

By keeping an eye on CPU utilisation, you can see how changing proportions have an impact on CPU load.

Note that mixing the -beagle_GPU and -beagle_SSE flag causes all threads to use CPUs, so the CPU necessarily cannot use the SSE instructions, which means in most practical cases some speed is lost.

Checkpointing Tricks — this site is deprecated: goto https://www.beast2.org/ for up-to-date information

7 April 2014 by Remco Bouckaert

BEAST 2 stores its internal state occasionally when running an MCMC chain. This means that you can run a chain for a number of samples then resume the chain where it was stopped. This is especially handy when your computer power supply is not very stable or you run BEAST on a cluster that limits the length of jobs.

To resume a run, just select the resume option from the BEAST dialog, or if you run from the command line, use the -resume option

Fast model selection/Skipping Burn-in

Disclaimer: this is for exploratory analysis only. For the full analysis running different chains (say 4) till convergence from different random starting points is the recommended way.

Having said this, exploring different models can take a lot of time, choosing different clock models and their configurations, choosing different tree priors, and substitution models with and without gamma heterogeneity and what have you. Exploring all these settings takes time and often it turns out that many models can be dismissed due to their bad performance. However, every time you change settings you have to go through burn-in. If you set of taxa is small, this is no issue, but if you have many taxa and use models with slow convergence, it may pay to use the partial state of a previous run

For example, we set up a Standard analysis in BEAUti, importing the alignment from examples/nexus/anolis.nex, and change the substitution model to HKY, and leave everything at its default settings, then save the file as anolis.xml. So, we start with a simple strict clock analysis with Yule prior, and HKY substitution model without gamma categories or invariant sites. Run BEAST on anolis.xml for a bit, and it produces the file anolis.xml.state that may look something like this:


((((0:0.15218807324699915,(((12:0.07204086917146929,13:0.07204086917146929)45:0.032320490875063,27:0.10436136004653229)35:0.014086879510625344,16:0.11844823955715764)51:0.033739833689841514)48:0.016898052792112206,(22:0.07111803269260457,24:0.07111803269260457)43:0.09796809334650679)55:0.030058453325126883,(((2:0.11202391387224231,28:0.11202391387224231)39:0.053360607716790937,((3:0.05388400742110195,23:0.05388400742110195)30:0.07583538052837979,17:0.12971938794948173)50:0.03566513363955151)41:0.02332777785655904,(19:0.1339032955067472,25:0.1339032955067472)42:0.05480900393884508)53:0.010432279918645954)40:1.2871176295578546E-4,(((((4:0.12883443292345567,20:0.12883443292345567)33:0.018679591712616628,((1:0.09919779884460916,7:0.09919779884460916)44:0.030740336184632705,(11:0.022188644677647536,18:0.022188644677647536)47:0.10774949035159434)46:0.01757588960683043)37:0.01507953828728531,(14:0.1447999403050793,21:0.1447999403050793)49:0.0177936226182783)32:0.013615826070641102,(5:0.10720805090460693,9:0.10720805090460693)29:0.06900133808939178)34:0.01433517128441375,((6:0.09951749852384634,10:0.09951749852384634)54:0.0557955632110991,((8:0.1076629407668736,26:0.1076629407668736)31:0.013630712415594895,15:0.1212936531824685)38:0.03401940855247694)52:0.03523149854346702)36:0.008728730848781563)56:0.0
birthRate.t:anolis[1 1] (-Infinity,Infinity): 7.55654639037806 
kappa.s:anolis[1 1] (0.0,Infinity): 3.174302336238077 
freqParameter.s:anolis[4 1] (0.0,1.0): 0.27551971155001337 0.26213332703028214 0.11680817105655772 0.34553879036314655 

Now, in BEAUti, you can change the site model to use 4 gamma categories and estimate the shape parameter (click check box with ‘estimate’ next to it), — also, change the log file names — and save as anolis2.xml. If you copy anolis.xml.state to anolis2.xml.state and run BEAST on anolis2.xml it will use the end state for the tree, birth rate, kappa and frequency parameters of the anolis.xml run as the starting state of the anolis2.xml run. For the gamma rate, it will use the setting provided in BEAUti.

You will notice a rise in posterior from about -22428 to -20328 and a rise in log likelihood from about -22453 to -20345 a whopping 2000 log points improvement. The 95% HDP intervals for the likelihood do not even overlap, so adding gamma
heterogeneity helps in fitting the model considerably in this case and the without heterogeneity model does not be considered any further.

The trees are shaped considerably different as well:

Of course, the anolis dataset with only 29 taxa and 1456 characters in the alignment is just a toy example and there is not such a large gain in speed, but for larger analyses re-using a state can help considerably.

Speeding up MCMC (in some cases)

When the tree is sampled from a bad area — which is almost always is initially, if you start with a random tree — the calculation of the treelikelihood can run into numerical trouble. It will switch to ‘scaling partials’, which can result in a severe performance hit. After a while, the tree can move to an area with higher likelihood and scaling is not required any more. One would think that scaling could be switched off after a while, but that is not always desirable. There are a number of schemes, like NONE, ALWAYS, and DYNAMIC, but we are waiting for the scheme that always works. The code promises this scheme:

KICK_ASS("kickAss"),// should be good, probably still to be discovered

but unfortunately, it is commented out.

One thing you can do to speed up BEAST is switch off scaling, but that may result in zero likelihoods initially. Another thing you can do is run the chain initially with scaling, then stop the chain after a little while and resume the chain. The resumed chain will start without scaling and if it is in a good state, it will not need scaling any more and run a lot faster.