Design and optimisation of parallel applications such as this highly scalable lattice-Boltzmann simulation of immiscible fluid flow under oscillatory shear running on the UK’s flagship supercomputer
The ubiquity of parallel software
Boosting the performance (that is, the speed) of any software application available today, whether on a smartphone, laptop, desktop, cluster or supercomputer, turns on the ability to parallelise it as efficiently as possible. This is because individual chip speeds are no longer increasing, so all modern computers have a multicore or many-core architecture. Thus, what was once the preserve of a small, elite group of experts has become of central importance to essentially all scientific applications. In addition to parallelising codes for multicore machines, substantial performance gains can be obtained by porting codes to so-called novel architectures, such as general-purpose graphics processing units and field-programmable gate arrays. Parallelisation also extends to the graphical rendering of images for the rapid visualisation of complex systems, where again speed is often of paramount importance.
At the high end, where very large and/or very fast simulations are required, one needs to be able to scale one’s codes to extreme core counts in order to extract the maximum performance from them. Today, the fastest computers available globally operate at around 10-20 petaflops and typically comprise hundreds of thousands of cores. At such performance levels, and for the anticipated exascale machines (which may appear in the next five years), new approaches to the efficient design of scalable software are required to ensure that scalability persists to millions of cores and to increasingly heterogeneous architectures.
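One way to see why scalability to millions of cores is demanding is Amdahl's law, which bounds the speedup of a fixed-size problem by the fraction of work that cannot be parallelised. The sketch below is illustrative only: the function names and the serial fraction used in the example are assumptions, not measurements from any code discussed here.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def parallel_efficiency(serial_fraction: float, cores: int) -> float:
    """Speedup divided by core count; 1.0 means perfectly linear scaling."""
    return amdahl_speedup(serial_fraction, cores) / cores


if __name__ == "__main__":
    # Hypothetical serial fraction of one part in a million (an assumption).
    for cores in (1_000, 100_000, 1_000_000):
        eff = parallel_efficiency(1e-6, cores)
        print(f"{cores:>9} cores: efficiency bound {eff:.2f}")
```

Under this simple model, even a serial fraction of one part in a million caps the efficiency at about one half on a million cores, which is why serial bottlenecks that are negligible at smaller scale have to be removed.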
Scaling scientific applications
We have extensive experience of upscaling computational fluid dynamics codes, particularly ones based on the lattice-Boltzmann method. We have developed a range of such packages, including LB3D for complex, multicomponent fluid flow, HYPO4D for homogeneous fluid dynamics, and HemeLB for haemodynamics. LB3D and HYPO4D have been scaled to 294,000 and 262,000 cores respectively, exhibiting linear speed-up on the IBM Blue Gene/P and Blue Gene/Q architectures. HemeLB, a more specialised code optimised to simulate blood flow in sparse geometries, currently scales to around 35,000 cores on similar computers for the most complicated patient-specific neurovasculatures studied so far. For this application, each spatial domain decomposition is chosen individually, based on the patient-specific intracranial vasculature, and may be further optimised by paying careful attention to the nature of the fluid grid points (using, for example, ParMETIS). We use an extensive array of performance metrics and parallel debugging tools to optimise our code deployments on different architectures, and are frequently able to achieve optimal performance within a few weeks of effort.
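The reason sparse geometries need a tailored decomposition can be shown with a deliberately simple sketch. It splits a domain into contiguous slabs so that each rank holds a similar number of fluid sites, rather than a similar volume. This is not HemeLB's decomposition and not what ParMETIS does (ParMETIS partitions a graph of sites); the function name and the per-slice counts are hypothetical.

```python
from typing import List, Tuple


def split_slabs_by_fluid_count(fluid_per_slice: List[int],
                               ranks: int) -> List[Tuple[int, int]]:
    """Assign contiguous slices to ranks, balancing the fluid-site count.

    Returns one half-open (start, stop) slice range per rank.
    """
    n = len(fluid_per_slice)
    if ranks < 1 or ranks > n:
        raise ValueError("need 1 <= ranks <= number of slices")
    total = sum(fluid_per_slice)
    bounds = []
    start = 0
    seen = 0  # fluid sites already assigned to earlier ranks
    for r in range(ranks):
        remaining = ranks - r
        if remaining == 1:
            stop = n  # the last rank takes whatever is left
        else:
            target = (total - seen) / remaining
            load = fluid_per_slice[start]  # every rank gets at least one slice
            stop = start + 1
            max_stop = n - (remaining - 1)  # leave a slice for each later rank
            # Take the next slice only if that moves the load closer to target.
            while stop < max_stop and load + fluid_per_slice[stop] / 2 <= target:
                load += fluid_per_slice[stop]
                stop += 1
        seen += sum(fluid_per_slice[start:stop])
        bounds.append((start, stop))
        start = stop
    return bounds


if __name__ == "__main__":
    # Hypothetical vessel: most slices of the bounding box contain no fluid.
    counts = [0, 0, 0, 4, 8, 8, 4, 0]
    bounds = split_slabs_by_fluid_count(counts, 2)
    print(bounds, [sum(counts[a:b]) for a, b in bounds])
```

For the hypothetical counts above, cutting the box into two equal halves would give the ranks 4 and 20 fluid sites, whereas the balanced split gives each rank 12.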
Successful scaling of such codes enables one to perform new, faster, bigger and better science and engineering. For LB3D, we have been able to perform some of the world’s largest simulations of complex colloidal fluids, including binary and ternary amphiphilic liquid crystals. For HYPO4D, we have been able to identify a large number of unstable periodic orbits through a novel spacetime relaxation procedure, which allows both space and time to be parallelised in the routines that locate these orbits.
In the case of HemeLB, the code optimisation allows patient-specific blood flow simulations to be executed in real time, thanks to advance reservation and urgent computing methods. The aim is to provide clinical decision support ahead of proposed interventions by neuroradiologists dealing with patients presenting with a range of neuropathologies. (See also Patient specific simulation for surgical planning.)
The figure shows how optimisations in the treatment of MPI collective operations and halo_exchange led to linear scaling up to more than 262,000 cores within the HYPO4D turbulence code, running on the ca. 300,000-core, 1.2-petaflops IBM Blue Gene/P JUGENE machine located in Jülich, Germany.
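For readers unfamiliar with the term, a halo exchange can be sketched without MPI by emulating the ranks in a single process. Each subdomain carries a one-cell ghost layer that is refreshed from its neighbours before every update. This is not HYPO4D's routine: the scalar three-point smoothing stencil stands in for a lattice-Boltzmann update, and all function names are assumptions made for illustration.

```python
from typing import List


def exchange_halos(subdomains: List[List[float]]) -> None:
    """Refresh the ghost cell at each end of every subdomain (periodic domain).

    Layout of each subdomain: [left ghost, interior cells..., right ghost].
    In an MPI code these copies would be messages between neighbouring ranks.
    """
    n = len(subdomains)
    for rank, local in enumerate(subdomains):
        left = subdomains[(rank - 1) % n]
        right = subdomains[(rank + 1) % n]
        local[0] = left[-2]    # left neighbour's last interior cell
        local[-1] = right[1]   # right neighbour's first interior cell


def smooth_step(local: List[float]) -> List[float]:
    """Three-point average on interior cells; ghosts are copied unchanged."""
    interior = [(local[i - 1] + local[i] + local[i + 1]) / 3.0
                for i in range(1, len(local) - 1)]
    return [local[0]] + interior + [local[-1]]


def run_decomposed(field: List[float], ranks: int, steps: int) -> List[float]:
    """Run the stencil on `ranks` equal subdomains, exchanging halos each step."""
    if ranks < 1 or len(field) % ranks != 0:
        raise ValueError("ranks must be positive and divide the field length")
    size = len(field) // ranks
    subdomains = [[0.0] + list(field[r * size:(r + 1) * size]) + [0.0]
                  for r in range(ranks)]
    for _ in range(steps):
        exchange_halos(subdomains)
        subdomains = [smooth_step(local) for local in subdomains]
    return [v for local in subdomains for v in local[1:-1]]


def run_global(field: List[float], steps: int) -> List[float]:
    """Reference: the same stencil on the undivided periodic field."""
    n = len(field)
    for _ in range(steps):
        field = [(field[i - 1] + field[i] + field[(i + 1) % n]) / 3.0
                 for i in range(n)]
    return field
```

Comparing `run_decomposed(field, ranks, steps)` with `run_global(field, steps)` on the same input is a quick check that the exchange is correct. In a real code the cost of these neighbour messages, together with any global collective operations, is what limits scaling at large core counts.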
D. Groen, J. Hetherington, H. B. Carver, R. W. Nash, M. O. Bernabeu, P. V. Coveney, “Analyzing and Modeling the Performance of the HemeLB Lattice-Boltzmann Simulation Environment”, Journal of Computational Science, 4, (5), 412-422, (2013). DOI: 10.1016/j.jocs.2013.03.002
M. D. Mazzeo, S. Manos, P. V. Coveney, “In situ ray tracing and computational steering for interactive blood flow simulation”, Computer Physics Communications, 181, (2), 355-370, (2010). DOI: 10.1016/j.cpc.2009.10.013
R. S. Saksena, B. Boghosian, L. Fazendeiro, O. A. Kenway, S. Manos, M. D. Mazzeo, S. K. Sadiq, J. L. Suter, D. Wright, and P. V. Coveney, “Real Science at the Petascale”, Philosophical Transactions of the Royal Society A, 367, (1897), 2557-2571, (2009). DOI: 10.1098/rsta.2009.0049