13 Mar 2026
14:00, Master's Defense, IC Room 85
Topic
OpenMP Beyond the Node: Remote Offloading and Scalable Communication in GPU Clusters
Student
Jhonatan Cléto
Advisor
Hervé Cédric Yviquel
Brief summary
Heterogeneous HPC clusters, composed of multi-node systems with GPU accelerators, have become the dominant platform for large-scale, data-intensive scientific applications. While the MPI+X programming model is widely adopted in this context, it imposes significant complexity on developers, especially regarding communication management, synchronization, data movement, and load balancing among distributed accelerators. These challenges reduce productivity and limit application portability. This master's thesis investigates extensions to the OpenMP offloading model that aim to unify heterogeneous shared-memory and distributed programming while preserving the simplicity of OpenMP.

The first contribution is the MPI Proxy Plugin (MPP), an extension of the OpenMP offloading runtime in LLVM that transparently offloads target regions to remote accelerators using MPI. By exploiting asynchronous MPI operations and C++20 coroutines, MPP hides explicit message passing from the application, allowing communication to overlap with computation and enabling unmodified OpenMP programs to run on multi-node GPU clusters. Experimental results on NVIDIA H100 and AMD MI300A systems show that MPP achieves near-linear scalability for compute-intensive workloads, reaching a speedup of up to 63x on 64 GPUs, while also highlighting the impact of runtime overheads and task granularity on performance.

The second contribution addresses a fundamental limitation of the OpenMP specification: the lack of native support for collective communication across multiple devices. This work presents the OpenMP Collective Communication Library (OMPCCL), which extends the OpenMP Target model with portable collective primitives and device-group semantics. Broadcast, All-Reduce, Reduce-Scatter, and related collective implementations are evaluated on a cluster with 64 GPUs, achieving speedups of up to 5x over naive approaches.
The thesis also analyzes the influence of network topology on collective performance and provides practical guidelines for implementing collective operations efficiently in distributed GPU environments. Overall, this work demonstrates that extending OpenMP with runtime-level remote offloading and multi-device collective communication is a viable path to simplifying the development of scalable heterogeneous applications. The proposed approaches reduce programming complexity while offering performance competitive with traditional MPI+OpenMP solutions, contributing to a unified and portable programming model for next-generation HPC systems.
Examination Board
Members:
Hervé Cédric Yviquel IC / UNICAMP
Rodolfo Jardim de Azevedo IC / UNICAMP
Lucas Mello Schnorr INF / UFRGS
Substitutes:
Edson Borin IC / UNICAMP
Arthur Francisco Lorenzon INF / UFRGS