To tackle large scenarios, simulations often employ strategies like adaptivity and tree-based algorithms and approximations. However, while necessary, these strategies make efficient parallelization a difficult challenge, with existing solutions often being application-specific and hard to generalize. Instead of distributing the work across CPU cores (or compute nodes) with a simple "parallel-for" pattern, we have to take a multitude of data and execution dependencies into account, tracking how each computation (or communication) task depends on the results of other tasks. This problem is especially relevant on accelerators (GPUs), which rely on a large number of parallel work items to hide latencies and overheads. Worse, newer supercomputers (Summit, El Capitan, Aurora, Perlmutter, ...) follow the trend of offering more and more of their computational power in the form of just such accelerators, and these are exactly the kind of machines we need for large-scale simulations in the first place.
In this project, we explore how task-based programming can be used to tackle the dependencies in such simulations, enabling fine-grained parallelism while still achieving efficient GPU utilization. In other words, we aim to combine the advantages of task-based programming, such as the intuitive and generic expression of dependencies, with the sheer computational resources offered by the accelerators in the current and next generations of supercomputers.
The codebase we use for our experiments and implementation is Octo-Tiger, which is developed and used at Louisiana State University (LSU). Octo-Tiger is an astrophysics application simulating the stellar mergers of binary star systems. It is an example of a simulation code where adaptivity, efficient compute kernels, and distribution across many compute nodes are all paramount for simulating real-world scenarios. To this end, Octo-Tiger is built upon HPX, a distributed asynchronous many-task runtime system also developed at LSU, to enable task-based programming. Octo-Tiger makes extensive use of adaptivity and multiple solvers, using HPX to create a fine-grained task graph. It also supports different sub-grid sizes, meaning we can increase the amount of computation per task at the expense of adaptivity. All of this makes Octo-Tiger an ideal candidate for investigating efficient, task-based GPU programming: it has the means to use tasks and adjust task sizes for comparisons, and it stands to benefit massively from a GPU implementation in real-world simulations where both performance and adaptivity are key.
Over the course of this project, we have already implemented a multitude of GPU kernels (and SIMD-vectorized CPU kernels) for the solvers within Octo-Tiger. We used those kernels to investigate how to efficiently handle the challenges that arise when using fine-grained GPU kernels with tasks, utilizing buffer and stream reuse, as well as implicit work aggregation using executors integrated with HPX tasks. We further investigated the integration of Kokkos with HPX in Octo-Tiger, yielding portable compute kernels whose compute backend we can simply switch between machines. In addition, HPX's ability to switch communication backends makes Octo-Tiger extremely customizable for a variety of machines and supercomputers. We are currently working on improving the techniques developed within Octo-Tiger and making them more generic, so that they can easily be used in other task-based applications as well.