06 August 2025
09:00, Master's Defense, Room 53 of IC2
Topic
Checkpointing Optimization in Adjoint Mode Applications: A Prefetching and Compression-Based Approach
Student
Thiago José Mazarão Maltempi
Advisor
Sandro Rigo
Brief summary
Heterogeneous computing systems with GPUs are essential for handling computationally intensive tasks in physics, machine learning, and seismic imaging. These applications often use reverse-mode (adjoint) differentiation, which requires recomputation of intermediate states. To reduce algorithmic complexity, the checkpointing technique is applied, trading recomputation against memory usage. However, when checkpoint data must be stored in host memory due to GPU memory limitations, communication latency becomes a significant bottleneck, consuming up to 75% of the total execution time. This dissertation proposes a new approach to mitigate this problem: combining a checkpoint prefetching mechanism with data compression inside the GPU. For this purpose, the modular GPUZIP library was developed and tested with checkpointing algorithms that access their data in a deterministic temporal order, such as Revolve, Z-Cut, and Uniform, enabling caching and prefetching of checkpoint data. The prefetching mechanism proactively schedules asynchronous transfers from host memory to GPU memory that run concurrently with the computation. However, prefetching alone may not have enough time to transfer all data before it is needed. To speed up the transfers, GPUZIP integrates lossy compression (cuZFP and NVIDIA Bitcomp) into the prefetching mechanism. The prefetching and compression mechanisms were evaluated separately, across different cache sizes and several compression parameter configurations, in order to identify the settings that yield the greatest speedup without compromising the quality of the results, and to quantify the isolated and combined impact of each technique on the same application. The results demonstrate significant performance gains across multiple datasets and checkpointing strategies.
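The deterministic access order of schedules such as Revolve is what makes prefetching viable: the next checkpoint to be read is known in advance, so its transfer can be issued while the current step computes. A minimal, hypothetical sketch of that overlap follows, with plain Python threads standing in for CUDA streams; `CheckpointPrefetcher`, `load_fn`, and `compute_fn` are illustrative names, not GPUZIP's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

class CheckpointPrefetcher:
    """Sketch: overlap checkpoint loads with computation.

    Because the checkpoint access order is deterministic, the transfer of
    checkpoint i+1 can be issued before the computation on checkpoint i
    starts, so the copy and the compute run concurrently.
    """

    def __init__(self, load_fn, schedule):
        self.load_fn = load_fn          # stands in for a host-to-GPU transfer
        self.schedule = list(schedule)  # access order, known in advance
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.pending = {}

    def run(self, compute_fn):
        results = []
        if self.schedule:
            # Issue the first transfer before any computation starts.
            first = self.schedule[0]
            self.pending[first] = self.executor.submit(self.load_fn, first)
        for i, ckpt in enumerate(self.schedule):
            # Schedule the next transfer so it overlaps with compute_fn.
            if i + 1 < len(self.schedule):
                nxt = self.schedule[i + 1]
                self.pending[nxt] = self.executor.submit(self.load_fn, nxt)
            data = self.pending.pop(ckpt).result()  # blocks only if the copy lags
            results.append(compute_fn(data))
        self.executor.shutdown()
        return results

# Deterministic schedule [3, 1, 2]; loads double the id, compute adds one.
order = CheckpointPrefetcher(lambda k: k * 2, [3, 1, 2]).run(lambda d: d + 1)
# → [7, 3, 5]
```

The blocking `.result()` call mirrors the situation described above: when the transfer has not finished in time, the prefetcher stalls, which is exactly the gap that compression is meant to close.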
The combination of prefetching and compression achieved the best performance, with gains of up to 5.1× with Revolve, 8.9× with Z-Cut, and 5.8× with Uniform. GPUZIP significantly reduced blocking times in host-GPU transfers and eliminated redundant data movements. This research shows that combining prefetching with compression is an effective strategy for overcoming the communication bottleneck in GPU-based applications that use reverse-mode differentiation. GPUZIP is designed to be extensible and compatible with different checkpointing algorithms and compression libraries.
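The benefit of compressing before the transfer can be sketched with a lossless stdlib codec standing in for the lossy GPU codecs (cuZFP, NVIDIA Bitcomp) that the abstract names: fewer bytes cross the host-GPU link, so the blocking time shrinks. `compressed_transfer` is an illustrative name, not part of GPUZIP.

```python
import zlib

def compressed_transfer(payload):
    """Compress a checkpoint before moving it, then restore it on arrival.

    zlib is a lossless CPU stand-in for the lossy on-GPU codecs GPUZIP
    uses; the point is only that the transferred size drops.
    """
    packed = zlib.compress(payload, 1)  # fast setting, as a hot path would need
    # ... here the asynchronous host-to-GPU copy would move `packed`
    # instead of `payload`, and decompression would run on the device ...
    return len(packed), zlib.decompress(packed)

# Smooth simulation fields compress well; all-zero data is the extreme case.
checkpoint = bytes(1 << 16)  # 64 KiB
moved, restored = compressed_transfer(checkpoint)
```

Here `moved` is far smaller than the original 64 KiB and `restored` matches the input byte for byte; with the lossy codecs the reconstruction is approximate instead, which is why the dissertation tunes the compression parameters against result quality.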
Examination Board
Members:
Sandro Rigo IC / UNICAMP
Arthur Francisco Lorenzon INF / UFRGS
Rodolfo Jardim de Azevedo IC / UNICAMP
Substitutes:
Lucas Francisco Wanner IC / UNICAMP
Alexandro José Baldassin IGCE / UNESP