Let's talk science

Doctor At Last II: the Thesis and the Hat

In the second part in a series on the doctor title , I am reflecting on the thesis itself and its most valuable outcome: a paper hat

The Thesis

In the previous post, we covered all the vain things surrounding the thesis . Let’s talk about the thesis itself, in my case it is entitled Algorithmic Advancements and Massive Parallelism for Large-Scale Datasets in Phylogenetic Bayesian Markov Chain Monte Carlo. Well, that essentially says it all, doesn’t it?

Essentially, it deals with the challenge of inferring evolutionary trees (as in tree of life) on supercomputers. The underlying genetic input datasets have gotten bigger and bigger over the years. While the Exelixis Lab already had highly efficient and scalable software for inferring evolutionary trees using the maximum likelihood criterion (based on a framework that models evolution as a continuous Markov chain), no comparable tool was available that does the same thing in a Bayesian framework. This was the motivation to develop ExaBayes. In the Bayesian framework, we are attempting to sample the parameter space of trees (and other model parameters such as mutation rates) proportionally to the probability of the parameter combination. To that end, we use discrete-time Markov chains. We simulate Markov chains moving in the parameter space and have them accept or reject parameter states proportionally to the probability of the respective states, thus the procedure is called Markov chain Monte Carlo (MCMC).

Sometimes it helps to couple several heated chains to the (cold) chain of interest and have them attempt to swap their location in the parameter space (whether this succeeds depends on the probabilities of the states of either chain). This variation of the method is called Metropolis-coupled Markov chain Monte Carlo (which leads to the beautiful acronyms MCMCMC or MC). Luckily, the phylogenetic likelihood library was developed in parallel to ExaBayes and takes care of all the numerical challenges of computing the phylogenetic likelihood function in a highly efficient manner. Thus, I could focus on developing a three-tier parallelization using MPI and/or threads. One novelty, I am proud of is an asynchronous implementation of the chain swapping algorithm. This is actually the first time, I came into contact with asynchronous programming (well, it is polling there), I would not have expected that it becomes so frequent in my next occupation . Sometimes I wonder whether something more involved using remote procedure calls could outperform the current approach. The threaded version uses messages queues.

At the end of my PhD, I was attracted by the statistical aspects of phylogenetic Bayesian MCMC. Some consider it more of an art rather than a science to design proposal functions that allow your chain to move through the parameter space without getting stuck. I designed an independence sampler that uses Newton-Raphson optimization for fitting a -distribution that is very similar to the target branch length distributions (I forgot to mention, the number of expected substitutions on a branch is also a model parameter). The proposal itself worked out spectacularly well (uncorrelated proposals with an acceptance rate of up to 90%), so we tried to combine it with tree topology proposals which pose the major challenge in phylogenetic Bayesian MCMC. And, alas, these hybrid proposals did not consistently outperform plain topology proposals.

What else? Actually, the entire thesis started with an algorithm for rogue taxon analyses (simply put, a taxon means species here). A rogue taxon analysis is used to optimize the information content in a consensus tree. On the Bayesian side, we obtain a huge number of trees sampled via and for maximum likelihood analysis, we also end up with a set of tree when trying to calculate the support for the most likely tree in a bootstrap analysis. The information of the tree set is summarized by computing a consensus tree. Now, few unstable taxa (rogue taxa) can have a devastating effect on the information content of the consensus tree. The algorithmic challenge is to determine a set of taxa that yields the best improvement for the resulting consensus tree in terms of information content, if we remove the set. The greedy algorithm I developed for that task employs some sorting tricks to reduce the number of comparisons between taxon sets. The C implementation uses bitsets to execute set comparisons efficiently. This was before my C++ days, it would be fun to implement the same thing as elegantly as possible in C++ (hopefully without any overhead).

Writing the thesis was more fun than I expected it to be. Once you have got the knack of it, it is not all that bad: introducing your nomenclature (I used glossaries for the acronyms), polishing your figures (mostly created via the R package lattice and inkscape vector graphics when illustration was needed) and types setting formulas. really excels here (contrary to Excel’s sister program if you allow me that obvious pun).

The Hat

If you have some exellent lab mates, they make you a proper (cardboard) doctor hat such that your doctor father has something to award to you right after the defense (so at least you have got a hat, since the degree certificate may take a while depending on how obsessive-compulsive your spell checking is for the final thesis submit). I was pretty amazed by the hat and frankly could not believe that somebody might have gotten a greater hat. That was until I saw my colleages’ (Kassian Kobert) hat which was equally amazing (he was moving to Lyon, France, and had an Eiffel tower on its top). So here is why the hat is great:

  • There are some table tennis balls (reminding me of some tough doubles with Kassian against theoretical physicists – which team Exelixis won in the end)

  • Dr. exa. Aberer as well as a picture of me as Reverend Thomas Bayes (or what is assumed to be his picture). Two allusions to ExaBayes, well done! But maybe the joke is on me: I must admit I may have been a bit preachy about Bayesian statistics versus maximum likelihood at times…

  • The top of the hat features a probability landscape (with a Bavarian flag in the global optimum). Yellow dots indicate samples taken by the chain (the string) and my colleague Thomas Flouris did not grow tired of pointing out that the chain totally misses the global optimum. What a subversive hat!


What’s up next? Let’s talk about science and industry.

Post a Comment

All comments are held for moderation; Markdown formatting accepted;
By posting you agree to the following privacy policy.

This is a honeypot form. Do not use this form unless you want to get your IP address blacklisted. Use the second form below for comments.
Name: (required)
E-mail: (required, not published)
Website: (optional)
Name: (required)
E-mail: (required, not published)
Website: (optional)