Stevan White

Achievements at the AEI

The Albert-Einstein-Institut, also known as the Max-Planck-Institut für Gravitationsphysik, is at the forefront of studies in the theory of Relativity. In the Numerical Relativity group, physicists make computer simulations of the dynamics of extremely dense objects, such as black holes and neutron stars.

One goal of the Numerical Relativity group has been to model the gravitational radiation (waves in the structure of space-time itself) resulting from a pair of black holes spiralling into one another. It sounds outlandish, but they have calculated that this should occur often enough that detection should be practical. This is the basis for such huge gravitational wave detector projects as LIGO and LISA.

Once the nature of the gravitational signal coming from such events is well understood, the data from the detectors can be searched for that signal.

Cactus thorns

In my position as a scientific programmer for the Cactus framework, I wrote several Cactus modules (called “thorns”). These were enhancements meant to give more information to the physicists, and to improve performance of the system.

Job Chaining

The JobChaining thorn has proven particularly useful.

The purpose of job chaining is to improve job flow in a batch system, while supporting long-running simulations. It addresses the perceived problem that, when several very long-running jobs are running, users of short-running jobs must either wait inordinately or take special measures to get their results.

The idea is for a job to checkpoint and exit after some prescribed amount of time, and then to be automatically re-queued in the batch system. In the meantime, other jobs that have been waiting in the queue can run. When the job runs again, it reads in its previous state from the last checkpoint.
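The cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the JobChaining thorn itself: the names run_segment and write_checkpoint, the JSON checkpoint format, and the toy state are all assumptions made for the example. Re-queueing is left to the caller, since the command for that depends on the batch system in use.

```python
import json
import os
import tempfile
import time


def write_checkpoint(path, state):
    # Write to a temporary file and rename it, so an interrupted write
    # never leaves a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_segment(checkpoint_path, total_steps, time_budget_s, step_fn,
                clock=time.monotonic):
    """Run one link of a job chain.

    Returns True if the simulation finished, or False if it
    checkpointed early and should be re-queued.
    """
    # Resume from the previous link's checkpoint, if there is one.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "value": 0.0}

    start = clock()
    while state["step"] < total_steps:
        # Out of time: save state and hand the nodes back to the queue.
        if clock() - start >= time_budget_s:
            write_checkpoint(checkpoint_path, state)
            return False
        state["value"] = step_fn(state["value"])
        state["step"] += 1

    write_checkpoint(checkpoint_path, state)
    return True
```

A job script would call run_segment once per run and, when it returns False, resubmit itself to the batch system before exiting.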

In principle this isn’t hard. The difficulties lay mostly in the idiosyncrasies of existing code in Cactus, and in the way the batch system was being managed.

Job chaining also facilitated the use of the cluster nodes’ local disks for intermediate checkpointing. The set of nodes used by one run in the job chain is typically different from the set used by the next, so the checkpoint files have to be moved between nodes. Over the fast node interconnect, they can be transferred very rapidly in parallel.

Cluster purchases

I worked on two computer cluster purchases at the AEI, one that succeeded, and another that didn’t. The clusters were custom designed and built for the special needs of the Numerical Relativity group, under a fixed budget. The work involved researching new technologies, then specifying, contracting for, and testing the clusters.

My special contribution to the second cluster was a push for two new technologies: multi-core processor chips and InfiniBand networking. I had to prove to my very conservative colleagues that with this combination, we could get substantially greater computing power for the money. I had to show that these technologies were reliable, and that our group’s applications would benefit specifically.

The successful outcome was Belladonna, which is now the physicists’ favorite machine, allowing them to do simulations that could not be done otherwise.