Let's talk science
Doctor At Last II: the Thesis and the Hat
The Thesis
In the previous post, we covered all the vain things surrounding the thesis . Let’s talk about the thesis itself, in my case it is entitled Algorithmic Advancements and Massive Parallelism for LargeScale Datasets in Phylogenetic Bayesian Markov Chain Monte Carlo. Well, that essentially says it all, doesn’t it?
Essentially, it deals with the challenge of inferring evolutionary trees (as in tree of life) on supercomputers. The underlying genetic input datasets have gotten bigger and bigger over the years. While the Exelixis Lab already had highly efficient and scalable software for inferring evolutionary trees using the maximum likelihood criterion (based on a framework that models evolution as a continuous Markov chain), no comparable tool was available that does the same thing in a Bayesian framework. This was the motivation to develop ExaBayes. In the Bayesian framework, we are attempting to sample the parameter space of trees (and other model parameters such as mutation rates) proportionally to the probability of the parameter combination. To that end, we use discretetime Markov chains. We simulate Markov chains moving in the parameter space and have them accept or reject parameter states proportionally to the probability of the respective states, thus the procedure is called Markov chain Monte Carlo (MCMC).
Sometimes it helps to couple several heated chains to the (cold) chain
of interest and have them attempt to swap their location in the
parameter space (whether this succeeds depends on the probabilities of
the states of either chain). This variation of the method is called
Metropoliscoupled Markov chain Monte Carlo (which leads to the
beautiful acronyms MCMCMC or MC). Luckily, the
phylogenetic likelihood library was developed in parallel to
ExaBayes
and takes care of all the numerical challenges of computing
the phylogenetic likelihood function in a highly efficient
manner. Thus, I could focus on developing a threetier parallelization
using MPI and/or threads. One novelty, I am proud of is an
asynchronous implementation of the chain swapping algorithm. This is
actually the first time, I came into contact with asynchronous
programming (well, it is polling there), I would not have expected
that it becomes so frequent in my next occupation . Sometimes I wonder
whether something more involved using remote procedure calls
could outperform the current approach. The threaded version uses
messages queues.
At the end of my PhD, I was attracted by the statistical aspects of phylogenetic Bayesian MCMC. Some consider it more of an art rather than a science to design proposal functions that allow your chain to move through the parameter space without getting stuck. I designed an independence sampler that uses NewtonRaphson optimization for fitting a distribution that is very similar to the target branch length distributions (I forgot to mention, the number of expected substitutions on a branch is also a model parameter). The proposal itself worked out spectacularly well (uncorrelated proposals with an acceptance rate of up to 90%), so we tried to combine it with tree topology proposals which pose the major challenge in phylogenetic Bayesian MCMC. And, alas, these hybrid proposals did not consistently outperform plain topology proposals.
What else? Actually, the entire thesis started with an algorithm for
rogue taxon analyses (simply put, a taxon means species here). A rogue
taxon analysis is used to optimize the information content in a
consensus tree. On the Bayesian side, we obtain a huge number of trees
sampled via and for maximum likelihood analysis, we also end up with a
set of tree when trying to calculate the support for the most likely
tree in a bootstrap analysis. The information of the tree set is
summarized by computing a consensus tree. Now, few unstable taxa
(rogue taxa) can have a devastating effect on the information content
of the consensus tree. The algorithmic challenge is to determine a set
of taxa that yields the best improvement for the resulting consensus
tree in terms of information content, if we remove the set. The greedy
algorithm I developed for that task employs some sorting tricks to
reduce the number of comparisons between taxon sets. The C
implementation uses bitsets to execute set comparisons
efficiently. This was before my C++
days, it would be fun to
implement the same thing as elegantly as possible in C++
(hopefully
without any overhead).
Writing the thesis was more fun than I expected it to be. Once you
have got the knack of it, it is not all that bad: introducing your
nomenclature (I used glossaries
for the acronyms), polishing your
figures (mostly created via the R
package lattice
and inkscape
vector graphics when illustration was needed) and types setting
formulas. really excels here (contrary to Excel’s sister
program if you allow me that obvious pun).
The Hat
If you have some exellent lab mates, they make you a proper (cardboard) doctor hat such that your doctor father has something to award to you right after the defense (so at least you have got a hat, since the degree certificate may take a while depending on how obsessivecompulsive your spell checking is for the final thesis submit). I was pretty amazed by the hat and frankly could not believe that somebody might have gotten a greater hat. That was until I saw my colleages’ (Kassian Kobert) hat which was equally amazing (he was moving to Lyon, France, and had an Eiffel tower on its top). So here is why the hat is great:

There are some table tennis balls (reminding me of some tough doubles with Kassian against theoretical physicists – which team Exelixis won in the end)

Dr. exa. Aberer as well as a picture of me as Reverend Thomas Bayes (or what is assumed to be his picture). Two allusions to
ExaBayes
, well done! But maybe the joke is on me: I must admit I may have been a bit preachy about Bayesian statistics versus maximum likelihood at times… 
The top of the hat features a probability landscape (with a Bavarian flag in the global optimum). Yellow dots indicate samples taken by the chain (the string) and my colleague Thomas Flouris did not grow tired of pointing out that the chain totally misses the global optimum. What a subversive hat!
What’s up next? Let’s talk about science and industry.
SCIENCE
Post a Comment
All comments are held for moderation; Markdown formatting accepted;
By posting you agree to the following privacy policy.