31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2026), January 31 – February 4, 2026,
Sydney, NSW, Australia
Frontmatter
Welcome from the Chairs
Welcome to the 31st Annual ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP): the premier forum for leading work on all aspects of parallel and performance programming, including theoretical foundations, techniques, languages, compilers, runtime systems, tools, applications, and practical experience. This symposium focuses on improving the programming productivity and performance engineering of all concurrent and parallel systems—multicore, multi-threaded, heterogeneous, clustered, and distributed systems, grids, accelerators such as ASICs, GPUs, and FPGAs, data centers, clouds, large-scale machines, and quantum computers. PPoPP is also interested in new and emerging parallel workloads and applications, such as artificial intelligence and large-scale scientific/enterprise workloads.
Concurrency Control
Binary Compatible Critical Section Delegation
Junyao Zhang,
Zhuo Wang, and
Zhe Zhou
(Fudan University, China; Alibaba Cloud Computing, China)
In high-performance applications, critical sections often become performance bottlenecks due to contention among multiple cores. Critical section delegation mitigates this overhead by consistently executing critical sections on the same core, thereby reducing contention. Traditional delegation techniques, however, typically require manual code refactoring, which limits their practicality and adoption. More recent schemes that integrate with existing lock APIs have lowered this barrier, but delegating unsupported critical sections can still cause undefined behavior, including crashes.
To make delegation broadly accessible, even to end users without access to or knowledge of the source code, we introduce BCD, a binary-compatible delegation mechanism. BCD leverages kernel support to delegate userspace critical sections at the point of futex_wake. It employs an in-kernel virtual machine to supervise the execution of delegated userspace instructions, eliminating the need for binary modification. Crucially, BCD can detect and safely exclude unsupported critical sections, preserving correctness. This enables existing applications to transparently benefit from critical section delegation, without requiring developer intervention.
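BCD's kernel-level machinery cannot be reproduced in a few lines, but the core idea of delegation (one designated executor runs every critical section, so the protected data stays hot in a single core's cache) can be sketched in user space. The `DelegationServer` class below is a hypothetical illustration of that idea, not BCD's implementation:

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical sketch: a dedicated server thread runs every delegated
// critical section, so the protected data never migrates between cores.
class DelegationServer {
    std::deque<std::function<void()>> queue_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread server_;

public:
    DelegationServer() : server_([this] {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                job = std::move(queue_.front());
                queue_.pop_front();
            }
            job();  // the critical section runs in the server thread
        }
    }) {}

    ~DelegationServer() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        server_.join();
    }

    // Clients block until their critical section has run to completion,
    // so taking `task` by reference below is safe.
    template <class F>
    auto delegate(F f) -> decltype(f()) {
        std::packaged_task<decltype(f())()> task(std::move(f));
        auto fut = task.get_future();
        {
            std::lock_guard<std::mutex> lk(m_);
            queue_.emplace_back([&task] { task(); });
        }
        cv_.notify_one();
        return fut.get();
    }
};
```

Clients submit closures and wait for completion; under contention this replaces cache-line ping-ponging on the shared data with message passing to the server's core.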
@InProceedings{PPoPP26p1,
author = {Junyao Zhang and Zhuo Wang and Zhe Zhou},
title = {Binary Compatible Critical Section Delegation},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {1--12},
doi = {10.1145/3774934.3786439},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Results Reproduced
Hapax Locks: Scalable Value-Based Mutual Exclusion
Dave Dice and
Alex Kogan
(Independent, USA; Oracle Labs, USA)
We present Hapax Locks, a novel locking algorithm that is simple, has constant-time arrival and unlock paths, provides FIFO admission order, is space efficient, and generates relatively little coherence traffic under contention in the common case. Hapax Locks offer performance (both latency and scalability) comparable with the best state-of-the-art locks, while imposing fewer constraints and dependencies on the ambient runtime environment, making them particularly easy to integrate or retrofit into existing systems or under existing lock application programming interfaces.
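The Hapax algorithm itself is not reproduced in the abstract; as a point of reference, the classic ticket lock below shares the properties the abstract highlights (constant-time arrival and unlock paths, FIFO admission), making it a useful mental baseline. This is illustrative code, not Hapax:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative baseline, not the Hapax algorithm: a ticket lock also has
// O(1) arrival/unlock paths and FIFO admission order.
class TicketLock {
    std::atomic<unsigned> next_{0};     // ticket dispenser
    std::atomic<unsigned> serving_{0};  // now-serving counter

public:
    void lock() {
        unsigned my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my) { /* spin */ }
    }
    void unlock() {
        serving_.fetch_add(1, std::memory_order_release);
    }
};
```

Note that under contention every waiter spins on the shared `serving_` counter, generating exactly the kind of coherence traffic the abstract says Hapax Locks reduce.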
@InProceedings{PPoPP26p13,
author = {Dave Dice and Alex Kogan},
title = {Hapax Locks: Scalable Value-Based Mutual Exclusion},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {13--25},
doi = {10.1145/3774934.3786443},
year = {2026},
}
Publisher's Version
Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes
Md Amit Hasan Arovi and
Ruslan Nikolaev
(Pennsylvania State University, USA)
We present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with classical and more recent safe memory reclamation (SMR) schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these (robust) schemes protect against stalled threads but lack support for well-known data structures with optimistic traversals, e.g., Harris' list and the Natarajan-Mittal tree. Such schemes are either incompatible with them or need changes with performance trade-offs (e.g., the Harris-Michael list).
SCOT keeps existing SMR schemes intact and retains performance benefits of original data structures. We implement and evaluate SCOT with Harris' list and the Natarajan-Mittal tree, but it is also applicable to other data structures. Furthermore, we provide a simple modification for wait-free traversals. We observe similar performance speedups (e.g., Harris vs. Harris-Michael lists) that were previously available only to EBR users. Our version of the tree also achieves very high throughput, comparable to that of EBR, which is often treated as a practical upper bound.
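A minimal, single-slot sketch of the hazard-pointer mechanism these SMR schemes build on (hypothetical code, not SCOT itself): a reader publishes the pointer it is traversing, and reclamation skips any retired node that is still published. The incompatibility SCOT addresses arises because optimistic traversals visit nodes they have not yet protected this way.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Node { int value; Node* next; };

// One hazard slot; a real scheme has one (or more) per thread.
std::atomic<Node*> hazard{nullptr};
std::vector<Node*> retired;  // nodes unlinked but not yet freed

void retire(Node* n) { retired.push_back(n); }

// Free every retired node that no reader has published as hazardous;
// return how many nodes were reclaimed.
int scan_and_reclaim() {
    int freed = 0;
    std::vector<Node*> keep;
    for (Node* n : retired) {
        if (n == hazard.load()) keep.push_back(n);  // still protected
        else { delete n; ++freed; }
    }
    retired = std::move(keep);
    return freed;
}
```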
@InProceedings{PPoPP26p26,
author = {Md Amit Hasan Arovi and Ruslan Nikolaev},
title = {Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {26--39},
doi = {10.1145/3774934.3786455},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Multiverse: Transactional Memory with Dynamic Multiversioning
Gaetano Coccimiglio,
Trevor Brown, and
Srivatsan Ravi
(University of Waterloo, Canada; University of Southern California, USA)
Software transactional memory (STM) allows programmers to easily implement concurrent data structures by simplifying reasoning about atomicity. Recent STMs can achieve good performance for some workloads, but they have limitations. In particular, STMs typically cannot support long-running reads that access a large number of frequently updated addresses. Multiversioning is a common approach to supporting this type of workload; however, it is often expensive and can reduce the performance of transactions for which versioning is unnecessary.
In this work we present Multiverse, a new STM that combines the best of unversioned TM and multiversioning. Multiverse features versioned and unversioned transactions that can execute concurrently. A main goal of Multiverse is to ensure that unversioned transactions achieve performance comparable to state-of-the-art unversioned STMs while still supporting the fast versioned transactions needed to enable long-running reads.
We implement Multiverse and compare it against several STMs. Our experiments demonstrate that Multiverse achieves comparable or better performance for common-case workloads without long-running reads. For workloads with long-running reads and frequent updates, Multiverse significantly outperforms existing STMs; in several cases, its throughput is several orders of magnitude higher than that of other STMs.
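The multiversioning idea at the heart of such designs can be sketched with a single register (sequential, hypothetical API, not Multiverse's implementation): each write appends a timestamped version, so a long-running reader can keep consulting the snapshot that was current when it started instead of aborting.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of a multiversioned memory cell: writes append
// (timestamp, value) pairs; reads at timestamp ts see the snapshot as of ts.
class VersionedRegister {
    std::vector<std::pair<long, int>> versions_;  // appended in timestamp order
public:
    void write(long ts, int v) { versions_.emplace_back(ts, v); }

    // Return the latest value written at or before `ts` (0 if none).
    int read_at(long ts) const {
        int result = 0;
        for (const auto& [wts, v] : versions_)
            if (wts <= ts) result = v;
        return result;
    }
};
```

The cost this sketch makes visible is the per-write version list, which is pure overhead for transactions that never need old snapshots; Multiverse's dynamic mix of versioned and unversioned transactions targets exactly that trade-off.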
@InProceedings{PPoPP26p40,
author = {Gaetano Coccimiglio and Trevor Brown and Srivatsan Ravi},
title = {Multiverse: Transactional Memory with Dynamic Multiversioning},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {40--52},
doi = {10.1145/3774934.3786436},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Scheduling and Load Balancing
Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads
Aleix Roca and
Vicenç Beltran
(Barcelona Supercomputing Center, Spain)
The convergence of high-performance computing (HPC) and artificial intelligence (AI) is driving the emergence of increasingly complex parallel applications and workloads. These workloads often combine multiple parallel runtimes within the same application or across co-located jobs, creating scheduling demands that place significant stress on traditional OS schedulers. When oversubscribed (more ready threads than cores), OS schedulers rely on periodic preemption to multiplex cores, often introducing interference that degrades performance. In this paper, we present (1) the User-space Scheduling Framework (USF), a novel, seamless process-scheduling framework implemented entirely in user space, which enables users to implement their own process-scheduling algorithms without requiring special permissions; and (2) SCHED_COOP, USF's default cooperative policy, designed to reduce interference by switching threads only when they block. This approach mitigates well-known issues such as Lock-Holder Preemption (LHP), Lock-Waiter Preemption (LWP), and scalability collapse. We implement USF and SCHED_COOP by extending the GNU C library with the nOS-V runtime, enabling seamless coordination across multiple runtimes (e.g., OpenMP) without invasive application changes. Evaluations show gains of up to 2.4x in oversubscribed multi-process scenarios, including nested BLAS workloads, multi-process PyTorch inference with LLaMA-3, and Molecular Dynamics (MD) simulations.
@InProceedings{PPoPP26p53,
author = {Aleix Roca and Vicenç Beltran},
title = {Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {53--67},
doi = {10.1145/3774934.3786451},
year = {2026},
}
Publisher's Version
Waste-Efficient Work Stealing
Kyle Singer,
Kunal Agrawal, and
Tao B. Schardl
(Massachusetts Institute of Technology, USA; Washington University in St. Louis, USA)
Although randomized work stealing is effective at automatically load-balancing task-parallel programs, it can waste computational resources when scheduling programs that lack sufficient parallelism to use all available threads. For such programs, threads waste cycles attempting to steal parallel tasks when none are available. This waste reduces the machine's efficiency, squanders energy, and needlessly burdens the operating system.
This paper introduces WEWS, a simple, practical, and provably efficient extension to randomized work stealing that mitigates waste. WEWS dynamically adjusts the number of active threads to reduce the waste of randomized work stealing. WEWS executes a parallel computation with the same asymptotic running time as traditional randomized work stealing while bounding the waste to O(min{P·T∞, T1 + P²}) instructions. WEWS also follows the work-first principle to perform well in practice.
WEWS requires no special support from the operating system or hardware, which simplifies its implementation. We implemented WEWS within the OpenCilk runtime system and compared it to other common waste-mitigation strategies. Across 10 parallel benchmarks, we find that WEWS has minimal impact on parallel running times while, on programs with limited parallelism, substantially reducing waste.
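For readers unfamiliar with the underlying scheduler, here is a lock-based sketch of a work-stealing deque (illustrative only, not OpenCilk's lock-free implementation): the owner pushes and pops at the bottom, thieves steal from the top, and the cycles a thief burns when every steal attempt comes back empty are exactly the waste WEWS bounds.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Sketch of a per-worker task deque. A real runtime uses a lock-free
// Chase-Lev deque; a mutex keeps the sketch short and obviously correct.
class WorkDeque {
    std::deque<int> tasks_;  // tasks stand in as plain ints
    std::mutex m_;
public:
    // Owner operations: LIFO at the bottom (work-first principle).
    void push_bottom(int t) {
        std::lock_guard<std::mutex> lk(m_);
        tasks_.push_back(t);
    }
    bool pop_bottom(int* t) {
        std::lock_guard<std::mutex> lk(m_);
        if (tasks_.empty()) return false;
        *t = tasks_.back(); tasks_.pop_back();
        return true;
    }
    // Thief operation: FIFO steal from the top. A failed steal is the
    // wasted work that motivates deactivating idle threads.
    bool steal_top(int* t) {
        std::lock_guard<std::mutex> lk(m_);
        if (tasks_.empty()) return false;
        *t = tasks_.front(); tasks_.pop_front();
        return true;
    }
};
```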
@InProceedings{PPoPP26p68,
author = {Kyle Singer and Kunal Agrawal and Tao B. Schardl},
title = {Waste-Efficient Work Stealing},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {68--80},
doi = {10.1145/3774934.3786452},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
DiggerBees: Depth First Search Leveraging Hierarchical Block-Level Stealing on GPUs
Yuyao Niu,
Yuechen Lu,
Weifeng Liu, and
Marc Casas
(Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain; China University of Petroleum-Beijing, China)
Depth First Search (DFS) is a fundamental graph traversal algorithm with broad applications. While existing work-stealing DFS approaches achieve strong performance on CPUs, mapping them to modern GPUs faces three major challenges: (1) limited shared memory cannot accommodate deep stacks, (2) frequent stack operations hinder efficient intra-block execution, and (3) irregular workloads complicate scalable inter-block execution.
In this paper, we propose DiggerBees, a GPU-optimized parallel DFS algorithm with hierarchical block-level stealing, consisting of three components. First, we introduce a two-level stack structure to mitigate shared memory limitations. Second, we employ warp-level DFS with intra-block work stealing to enable efficient execution within a block. Third, we implement inter-block work stealing to achieve scalable execution across blocks and sustain high parallelism. Experimental results on the latest NVIDIA GPUs show that DiggerBees outperforms existing DFS approaches, CKL-PDFS, ACR-PDFS, and NVG-DFS, achieving average speedups of 1.37×, 1.83×, and 30.18×, respectively. Moreover, DiggerBees even surpasses high-performance GPU BFS implementations on graphs with deep and narrow traversal paths, and scales efficiently across GPU generations.
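Setting the GPU mapping aside, the per-worker routine being parallelized is an explicit-stack DFS; the sequential sketch below (assumed adjacency-list input, not the paper's code) shows the stack operations whose depth and frequency drive challenges (1) and (2) above.

```cpp
#include <cassert>
#include <vector>

// Sequential explicit-stack DFS over adjacency lists; returns the visit
// order. The stack here is what DiggerBees splits across shared memory and
// global memory in its two-level design.
std::vector<int> dfs_order(const std::vector<std::vector<int>>& adj, int root) {
    std::vector<int> order, stack{root};
    std::vector<bool> visited(adj.size(), false);
    visited[root] = true;
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        order.push_back(u);
        // Push neighbors in reverse so the lowest-numbered is visited first.
        for (auto it = adj[u].rbegin(); it != adj[u].rend(); ++it)
            if (!visited[*it]) { visited[*it] = true; stack.push_back(*it); }
    }
    return order;
}
```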
@InProceedings{PPoPP26p81,
author = {Yuyao Niu and Yuechen Lu and Weifeng Liu and Marc Casas},
title = {DiggerBees: Depth First Search Leveraging Hierarchical Block-Level Stealing on GPUs},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {81--94},
doi = {10.1145/3774934.3786457},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
PANA: A Fine-Grained Runtime-Adaptive Load Balancing for Parallel SpMV on Multicore CPUs
Haodong Bian,
Youhui Zhang,
Xiang Fei,
Jianqiang Huang, and
Xiaoying Wang
(Tsinghua University, China; Qinghai University, China; Zhongguancun Laboratory, China)
SpMV has been widely utilized and is regarded as a significant kernel in various scientific and engineering computing applications, where its parallel performance is heavily influenced by matrix sparsity and hardware architecture. Despite extensive prior research, static partitioning strategies that narrowly target computation or memory access remain a key performance bottleneck, severely stifling performance advancement of SpMV on modern multicore CPUs.
This paper presents PANA, a fine-grained runtime-adaptive load-balancing scheme for parallel SpMV that operates at the micro-operation level. It employs fine-grained partitioning and a dynamic runtime adjustment mechanism to balance per-core computation and memory access. Experimental results demonstrate that across 2,898 SuiteSparse matrices, PANA consistently outperforms all baseline methods (CAMLB, CSR5, Merge, CVR, SpV8, MKL, and AOCL) in overall performance, achieving geometric-mean speedups ranging from 1.36× to 7.72× on Intel and AMD CPUs.
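As background, the static nonzero-balanced partitioning that runtime-adaptive schemes like PANA refine can be sketched as follows (illustrative CSR code, not PANA itself): each thread receives a contiguous row range holding roughly the same number of nonzeros, rather than the same number of rows.

```cpp
#include <cassert>
#include <vector>

// Compressed sparse row (CSR) matrix.
struct Csr {
    std::vector<int> rowptr, colidx;
    std::vector<double> vals;
};

// Split rows into `parts` contiguous ranges of near-equal nonzero count;
// returns parts+1 row boundaries. This is the static baseline that a
// runtime-adaptive scheme would then adjust.
std::vector<int> balance_by_nnz(const Csr& a, int parts) {
    int rows = (int)a.rowptr.size() - 1, nnz = a.rowptr[rows];
    std::vector<int> bounds{0};
    for (int p = 1; p < parts; ++p) {
        long target = (long)nnz * p / parts;
        int r = bounds.back();
        while (r < rows && a.rowptr[r] < target) ++r;
        bounds.push_back(r);
    }
    bounds.push_back(rows);
    return bounds;
}

// y[r0..r1) = A[r0..r1) * x -- the work one thread would do.
void spmv_range(const Csr& a, const std::vector<double>& x,
                std::vector<double>& y, int r0, int r1) {
    for (int r = r0; r < r1; ++r) {
        double s = 0;
        for (int k = a.rowptr[r]; k < a.rowptr[r + 1]; ++k)
            s += a.vals[k] * x[a.colidx[k]];
        y[r] = s;
    }
}
```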
@InProceedings{PPoPP26p95,
author = {Haodong Bian and Youhui Zhang and Xiang Fei and Jianqiang Huang and Xiaoying Wang},
title = {PANA: A Fine-Grained Runtime-Adaptive Load Balancing for Parallel SpMV on Multicore CPUs},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {95--108},
doi = {10.1145/3774934.3786459},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Concurrent Data Structures
UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees
Quinten De Man,
Atharva Sharma,
Kishen N Gowda, and
Laxman Dhulipala
(University of Maryland, USA)
The dynamic trees problem is to maintain a tree under edge updates while supporting queries such as connectivity queries or path queries. Although the first data structure for this fundamental problem---the link-cut tree---was invented 40 years ago, our experiments reveal that it is still the fastest sequential data structure for the problem. However, link-cut trees cannot support parallel batch-dynamic updates and are limited in the kinds of queries they support.
In this paper, we design a new parallel batch-dynamic trees data structure called UFO trees that simultaneously supports a wide range of query functionality, supports work-efficient parallel batch-dynamic updates, and is competitive with link-cut trees when run sequentially. We prove that a key reason for the strong practical performance of both link-cut trees and UFO trees is that they can perform updates and queries in sub-logarithmic time for low-diameter trees. We perform an experimental study of our optimized C++ implementations of UFO trees with ten other dynamic tree implementations, several of which are new, in a broad benchmark of both synthetic and real-world trees of varying diameter and size. Our results show that, in both sequential and parallel settings, UFO trees are the fastest dynamic tree data structure that supports a wide range of queries. Our new implementation of UFO trees has low space usage and easily scales to billion-size inputs, making it a promising building block for implementing more complex dynamic graph algorithms in practice.
@InProceedings{PPoPP26p109,
author = {Quinten De Man and Atharva Sharma and Kishen N Gowda and Laxman Dhulipala},
title = {UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {109--122},
doi = {10.1145/3774934.3786431},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks
Ajay Singh,
Nikos Metaxakis, and
Panagiota Fatourou
(ICS-FORTH, Greece; University of Crete, Greece)
We present a new blocking linearizable stack implementation that utilizes sharding and fetch&increment to achieve significantly better performance than all existing concurrent stacks. The proposed implementation is based on a novel elimination mechanism and a new combining approach that are efficiently blended to gain high performance. Our implementation results in enhanced parallelism and low contention when accessing the shared stack. Experiments show that the proposed stack implementation outperforms all existing concurrent stacks by up to 2× in most workloads. It is particularly efficient in systems supporting a large number of threads and in high-contention scenarios.
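For context, the classic CAS-based Treiber stack below (baseline illustration, not the paper's design) exposes the single top pointer that every thread contends on; sharding, elimination, and combining are all techniques for relieving exactly this hotspot.

```cpp
#include <atomic>
#include <cassert>

struct TNode { int value; TNode* next; };

// Treiber stack: the canonical lock-free stack. Every push and pop CASes
// the one `top_` pointer, which becomes the contention bottleneck under
// load -- the problem sharding/elimination/combining attack.
class TreiberStack {
    std::atomic<TNode*> top_{nullptr};
public:
    void push(int v) {
        TNode* n = new TNode{v, nullptr};
        n->next = top_.load();
        while (!top_.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(int* out) {
        TNode* t = top_.load();
        while (t && !top_.compare_exchange_weak(t, t->next)) {}
        if (!t) return false;
        *out = t->value;
        delete t;  // safe only without concurrent readers (no SMR here)
        return true;
    }
};
```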
@InProceedings{PPoPP26p123,
author = {Ajay Singh and Nikos Metaxakis and Panagiota Fatourou},
title = {Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {123--135},
doi = {10.1145/3774934.3786458},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Concurrent Balanced Augmented Trees
Evan Wrench,
Ajay Singh,
Younghun Roh,
Panagiota Fatourou,
Siddhartha Jayanti,
Eric Ruppert, and
Yuanhao Wei
(University of British Columbia, Canada; ICS-FORTH, Greece; Massachusetts Institute of Technology, USA; University of Crete, Greece; Dartmouth College, USA; York University, Canada)
Augmentation makes search trees tremendously more versatile, allowing them to support efficient aggregation queries, order-statistic queries, and range queries in addition to insertion, deletion, and lookup. In this paper, we present the first lock-free augmented balanced search tree supporting generic augmentation functions. Our algorithmic ideas build upon a recent augmented unbalanced search tree presented by Fatourou and Ruppert [DISC, 2024]. We implement both data structures, solving some memory reclamation challenges in the process, and provide an experimental performance analysis of them. We also present optimized versions of our balanced tree that use delegation to achieve better scalability and performance (by more than 2x in most workloads). Our experiments show that our augmented balanced tree completes updates 2.2 to 30 times faster than the unbalanced augmented tree, and outperforms unaugmented trees by up to several orders of magnitude on 120 threads.
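The augmentation itself is easy to see in a sequential sketch (hypothetical code, not the paper's concurrent algorithm): each node caches its subtree size, which enables order-statistic queries; the hard part the paper solves is maintaining such aggregates under concurrent lock-free updates with rebalancing.

```cpp
#include <cassert>

// Unbalanced BST where each node is augmented with its subtree size.
struct ANode {
    int key, size = 1;
    ANode *left = nullptr, *right = nullptr;
};

int size_of(ANode* n) { return n ? n->size : 0; }

ANode* insert(ANode* n, int key) {
    if (!n) return new ANode{key};
    if (key < n->key) n->left = insert(n->left, key);
    else n->right = insert(n->right, key);
    n->size = 1 + size_of(n->left) + size_of(n->right);  // maintain aggregate
    return n;
}

// Order-statistic query: the k-th smallest key (0-indexed), answered in
// O(depth) by steering with the cached subtree sizes.
int kth(ANode* n, int k) {
    int ls = size_of(n->left);
    if (k < ls) return kth(n->left, k);
    if (k == ls) return n->key;
    return kth(n->right, k - ls - 1);
}
```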
@InProceedings{PPoPP26p136,
author = {Evan Wrench and Ajay Singh and Younghun Roh and Panagiota Fatourou and Siddhartha Jayanti and Eric Ruppert and Yuanhao Wei},
title = {Concurrent Balanced Augmented Trees},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {136--149},
doi = {10.1145/3774934.3786437},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Parallel Dynamic Spatial Indexes
Ziyang Men,
Bo Huang,
Yan Gu, and
Yihan Sun
(University of California at Riverside, USA)
Maintaining spatial data (points in two or three dimensions) is crucial and has a wide range of applications, such as graphics, GIS, and robotics. To handle spatial data, many data structures, called spatial indexes, have been proposed, e.g. kd-trees, oct/quadtrees (also called Orth-trees), R-trees, and bounding volume hierarchies (BVHs). In real-world applications, spatial datasets tend to be highly dynamic, requiring batch updates of points with low latency. This calls for efficient parallel batch updates on spatial indexes. Unfortunately, there is very little work that achieves this.
In this paper, we systematically study parallel spatial indexes, with a special focus on achieving high update performance for highly dynamic workloads. We select two types of spatial indexes that are considered optimized for low-latency updates: the Orth-tree and the R-tree/BVH. We propose two data structures: the P-Orth tree, a parallel Orth-tree, and the SPaC-tree family, a parallel R-tree/BVH. Both deliver superior batch-update performance compared to existing parallel kd-trees and Orth-trees, while preserving better or competitive query performance relative to their corresponding Orth-tree and R-tree counterparts. We also present comprehensive experiments comparing the performance of various parallel spatial indexes and share our findings at the end of the paper.
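As a reference point, here is a sequential sketch of the Orth-tree (quadtree) structure being parallelized (illustrative code with a hypothetical `Quad` API, not the P-Orth tree): each internal node splits its square into four quadrants, and a leaf splits when it exceeds a small capacity.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// 2-D point quadtree; leaves hold up to 2 points before splitting.
struct Quad {
    double cx, cy, half;  // this node's square: center and half-width
    std::vector<std::pair<double, double>> pts;  // points, if a leaf
    std::unique_ptr<Quad> child[4];
    bool leaf = true;

    Quad(double x, double y, double h) : cx(x), cy(y), half(h) {}

    int quadrant(double x, double y) const { return (x >= cx) + 2 * (y >= cy); }

    void insert(double x, double y) {
        if (leaf && pts.size() < 2) { pts.push_back({x, y}); return; }
        if (leaf) {  // split: create four children and push points down
            leaf = false;
            for (int q = 0; q < 4; ++q)
                child[q] = std::make_unique<Quad>(
                    cx + (q & 1 ? half / 2 : -half / 2),
                    cy + (q & 2 ? half / 2 : -half / 2), half / 2);
            for (auto [px, py] : pts) child[quadrant(px, py)]->insert(px, py);
            pts.clear();
        }
        child[quadrant(x, y)]->insert(x, y);
    }

    size_t count() const {
        if (leaf) return pts.size();
        size_t s = 0;
        for (int q = 0; q < 4; ++q) s += child[q]->count();
        return s;
    }
};
```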
@InProceedings{PPoPP26p150,
author = {Ziyang Men and Bo Huang and Yan Gu and Yihan Sun},
title = {Parallel Dynamic Spatial Indexes},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {150--163},
doi = {10.1145/3774934.3786412},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
GPU and Heterogeneous Computing
PRISM: An Efficient GPU-Based Lossy Compression Framework for Progressive Data Retrieval with Multi-Level Interpolation
Bing Lu,
Zedong Liu,
Hairui Zhao,
Dejun Luo,
Wenjing Huang,
Yida Gu,
Jinyang Liu,
Guangming Tan, and
Dingwen Tao
(Institute of Computing Technology at Chinese Academy of Sciences, China; Jilin University, China; University of Chinese Academy of Sciences, China; University of California at Riverside, USA)
With the exponential growth of computing power, large-scale scientific simulations are producing massive volumes of data, leading to critical storage and I/O challenges. Error-bounded lossy compression has become one of the most effective solutions for reducing data size while preserving accuracy. Meanwhile, to achieve high-performance compression on such large datasets, leveraging GPUs has become increasingly essential. GPU-based lossy compressors deliver strong performance but typically support only single-precision decompression, limiting their ability to meet the diverse accuracy requirements of scientific workflows. Progressive compressors can address this limitation by enabling on-demand precision retrieval; however, existing progressive lossy compressors on GPUs still suffer from low throughput. To overcome these challenges, we present PRISM, a GPU-based progressive lossy compressor that achieves both high throughput and multi-precision retrieval. PRISM introduces a high-performance progressive framework that integrates multiple interpolation predictors, efficient bitplane extraction, and an enhanced lossless compression stage that combines sign-absolute coding with zero-aware parallel algorithms. Evaluations on representative real-world datasets from five scientific domains show that PRISM significantly outperforms state-of-the-art progressive compressors on GPUs, reducing retrieved data volume by over 15.6× and achieving up to 20.1× higher throughput on the NVIDIA H100 GPU under the same error bounds.
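The predict-then-quantize core common to interpolation-based error-bounded compressors can be sketched in one dimension (simplified to a previous-value predictor, not PRISM's multi-level interpolation): residuals are quantized to integers so that every reconstructed value stays within the user's error bound `eb`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Error-bounded quantization sketch: predict each value from the previous
// *reconstructed* value (so encoder and decoder agree), then quantize the
// residual to a multiple of 2*eb. Returns the quantized codes and fills
// *recon with the decoder's reconstruction.
std::vector<long> compress(const std::vector<double>& data, double eb,
                           std::vector<double>* recon) {
    std::vector<long> quants;
    double prev = 0;  // predictor state: last reconstructed value
    for (double v : data) {
        long q = std::lround((v - prev) / (2 * eb));  // quantized residual
        double r = prev + q * 2 * eb;                 // decoded value
        quants.push_back(q);
        recon->push_back(r);
        prev = r;
    }
    return quants;
}
```

Real compressors then entropy-code the (mostly small) quantization codes; progressive designs like PRISM additionally split them into bitplanes so precision can be retrieved on demand.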
@InProceedings{PPoPP26p164,
author = {Bing Lu and Zedong Liu and Hairui Zhao and Dejun Luo and Wenjing Huang and Yida Gu and Jinyang Liu and Guangming Tan and Dingwen Tao},
title = {PRISM: An Efficient GPU-Based Lossy Compression Framework for Progressive Data Retrieval with Multi-Level Interpolation},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {164--176},
doi = {10.1145/3774934.3786438},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Dynamic Detection of Inefficient Data Mapping Patterns in Heterogeneous OpenMP Applications
Luke Marzen,
Junhyung Shim, and
Ali Jannesari
(Iowa State University, USA)
With the growing prevalence of heterogeneous computing, CPUs are increasingly being paired with accelerators to achieve new levels of performance and energy efficiency. However, data movement between devices remains a significant bottleneck, complicating application development. Existing performance tools require considerable programmer intervention to diagnose and locate data transfer inefficiencies. To address this, we propose dynamic analysis techniques to detect and profile inefficient data transfer and allocation patterns in heterogeneous applications. We implemented these techniques in OMPDataPerf, which provides detailed traces of problematic data mappings, source code attribution, and assessments of optimization potential in heterogeneous OpenMP applications. OMPDataPerf uses the OpenMP Tools Interface (OMPT) and incurs only a 5% geometric-mean runtime overhead.
@InProceedings{PPoPP26p177,
author = {Luke Marzen and Junhyung Shim and Ali Jannesari},
title = {Dynamic Detection of Inefficient Data Mapping Patterns in Heterogeneous OpenMP Applications},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {177--189},
doi = {10.1145/3774934.3786454},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Root-Down Exposure for Maximal Clique Enumeration on GPUs
Zhe Pan,
Peng Qu, and
Youhui Zhang
(Tsinghua University, China; Zhongguancun Laboratory, China)
Maximal clique enumeration (MCE) in large-scale graphs is critical across various application domains, including social network analysis, bioinformatics, and computer vision. However, existing GPU-based MCE solutions suffer from inefficient load-balancing mechanisms. These mechanisms force busy workers to pause and hand over workloads to idle workers, which introduces significant synchronization overhead and increases memory usage. To address these limitations, we introduce a root-down exposure mechanism in which busy workers dynamically expose their current root, enabling idle workers to pull workloads from the exposed node directly, without synchronization. We then propose a bitmap-centric MCE scheme and an aggressive node generation rule to further simplify the memory layout and accelerate enumeration. We combine them into RDMCE, a Root-Down MCE solution on GPUs. Across large real-world graphs with up to 146 billion maximal cliques, RDMCE is 1.25-5.38× faster than the next-best state-of-the-art GPU-based solution and is the only one that completes enumeration on every test dataset, offering a more efficient and scalable MCE solution.
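The sequential algorithm underneath MCE solvers is Bron-Kerbosch; the unpivoted sketch below (not RDMCE's GPU code) shows the recursive search tree whose highly irregular branches are what make GPU load balancing hard.

```cpp
#include <cassert>
#include <vector>

// Bron-Kerbosch without pivoting, counting maximal cliques. R is the
// current clique, P the candidates that extend it, X the already-explored
// vertices (used to detect non-maximality). adj is an adjacency matrix.
int bron_kerbosch(const std::vector<std::vector<bool>>& adj,
                  std::vector<int> R, std::vector<int> P, std::vector<int> X) {
    if (P.empty() && X.empty()) return 1;  // R is a maximal clique
    int count = 0;
    while (!P.empty()) {
        int v = P.back(); P.pop_back();
        std::vector<int> P2, X2, R2 = R;
        R2.push_back(v);
        for (int u : P) if (adj[v][u]) P2.push_back(u);  // P ∩ N(v)
        for (int u : X) if (adj[v][u]) X2.push_back(u);  // X ∩ N(v)
        count += bron_kerbosch(adj, R2, P2, X2);
        X.push_back(v);
    }
    return count;
}
```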
@InProceedings{PPoPP26p190,
author = {Zhe Pan and Peng Qu and Youhui Zhang},
title = {Root-Down Exposure for Maximal Clique Enumeration on GPUs},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {190--203},
doi = {10.1145/3774934.3786449},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
ROME: Maximizing GPU Efficiency for All-Pairs Shortest Path via Taming Fine-Grained Irregularities
Weile Luo,
Yuhan Chen,
Xiangrui Yu,
Qiang Wang,
Ruibo Fan,
Hongyuan Liu, and
Xiaowen Chu
(Hong Kong University of Science and Technology (Guangzhou), China; Harbin Institute of Technology, Shenzhen, China; Stevens Institute of Technology, USA; Hong Kong University of Science and Technology, Hong Kong)
All-Pairs Shortest Path (APSP), a fundamental problem in graph analytics, can be solved efficiently by reducing the computational workload through vertex reordering. However, this approach falls short on GPUs due to fine-grained irregularities in granularity, shape, and dependencies, which cause severe hardware underutilization. We introduce ROME, a system that tames these irregularities by spatially restructuring computation into regularized workloads and temporally overlapping them with an asynchronous pipeline. ROME achieves 14.7-244.5× speedup over the state-of-the-art multicore CPU solution and 11.2-338.0× speedup over the state-of-the-art GPU solution. Notably, ROME sustains above 20% of peak min-plus operation throughput on most tested graphs, reaching up to 34.7%.
@InProceedings{PPoPP26p204,
author = {Weile Luo and Yuhan Chen and Xiangrui Yu and Qiang Wang and Ruibo Fan and Hongyuan Liu and Xiaowen Chu},
title = {ROME: Maximizing GPU Efficiency for All-Pairs Shortest Path via Taming Fine-Grained Irregularities},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {204--217},
doi = {10.1145/3774934.3786461},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Stencil and Sparse Matrix Computation
SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping
Qiqi Gu,
Chenpeng Wu,
Heng Shi, and
Jianguo Yao
(Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China)
Recent research has focused on accelerating stencil computations by exploiting emerging hardware such as Tensor Cores. To leverage these accelerators, the stencil operation must be transformed into matrix multiplications. However, this transformation introduces undesired sparsity into the kernel matrix, leading to significant redundant computation.
In this paper, we present SPIDER, the first system to turn this unresolved sparsity into an optimization opportunity by exploring the potential of Sparse Tensor Cores (SpTCs) for stencil acceleration. Specifically, SPIDER introduces an efficient and elegant transformation method that integrates two cooperative techniques: an ahead-of-time strided swapping transformation for kernel matrices and an on-the-fly row-swapping mechanism for inputs. This rule-based approach effectively transforms stencil computation into operations compatible with SpTCs, introducing only slight compile-time overhead and zero runtime overhead. Additionally, SPIDER incorporates multiple optimizations to maximize computational efficiency. Experimental evaluations demonstrate that SPIDER outperforms vendor library cuDNN by 6.23× and state-of-the-art (SOTA) Tensor Core-based approaches (ConvStencil, FlashFFTStencil, etc.) by 1.98× on average.
@InProceedings{PPoPP26p218,
author = {Qiqi Gu and Chenpeng Wu and Heng Shi and Jianguo Yao},
title = {SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {218--231},
doi = {10.1145/3774934.3786414},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration
Jiazhi Jiang,
Xijia Yao,
Jiayu Chen,
Jinhui Wei,
Dan Huang, and
Yutong Lu
(Sun Yat-sen University, China)
Sparse Matrix–Matrix Multiplication (SpMM) is a core kernel in scientific computing, data analytics, and artificial intelligence, supporting applications such as linear solvers and Graph Neural Networks (GNNs). The Scalable Matrix Extension (SME) in Armv9 introduces dedicated matrix acceleration for ARM CPUs, but exploiting its full potential for SpMM requires architecture-aware optimizations to address irregular sparsity and hardware constraints.
We present ASM-SpMM, a high-performance SpMM library co-designed with ARM SME. ASM-SpMM combines a memory-efficient compression format, an SME-aware prefetching kernel optimized for outer-product execution, a hybrid matrix–vector execution strategy, and work-stealing-based dynamic load balancing across heterogeneous cores. Experiments on emerging Armv9 platforms demonstrate up to 7.9× speedup over state-of-the-art SpMM libraries across diverse matrices. A GNN inference case study further shows that ASM-SpMM significantly improves end-to-end performance over widely used GNN frameworks, highlighting the effectiveness of SME-aware SpMM optimization on ARM CPUs.
@InProceedings{PPoPP26p232,
author = {Jiazhi Jiang and Xijia Yao and Jiayu Chen and Jinhui Wei and Dan Huang and Yutong Lu},
title = {ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {232--244},
doi = {10.1145/3774934.3786422},
year = {2026},
}
Publisher's Version
Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores
Kaige Zhang,
Hailong Yang,
Xin You,
Tianyu Feng,
Yufan Xu,
Zhongzhi Luan,
Yi Liu, and
Depei Qian
(Beihang University, China; Independent Researcher, USA)
Sparse matrix-vector multiplication (SpMV) is a fundamental operation in scientific computing, machine learning, and graph analytics, demanding efficient execution on modern hardware. Recent advances in hardware accelerators, such as Tensor Cores, have significantly improved the performance of many compute-intensive workloads. However, effectively utilizing Tensor Cores for SpMV remains challenging due to irregular sparsity patterns and the mismatch between SpMV’s computational characteristics and the constrained architecture design, leading to suboptimal performance and underutilization of Tensor Cores. In this paper, we systematically analyze the state-of-the-art SpMV optimizations on Tensor Cores, identify key performance bottlenecks, and propose Drawloom, a Tensor-Core-aware framework for SpMV with efficient Tensor Core mapping and optimized pipeline execution. Drawloom leverages a redesigned Tensor Core mapping strategy with a zig-zag chained sparse storage format, as well as a multi-stage register pipeline to better exploit hardware parallelism. Our evaluation on the SuiteSparse dataset demonstrates that Drawloom outperforms cuSPARSE by 2.71×/1.90× (in FP16), 2.95×/2.39× (in FP32), and 2.47×/1.54× (in FP64) on A100 and H100 GPUs, respectively. Compared to the state-of-the-art SpMV implementations, Drawloom achieves a performance speedup of 1.26×/1.18× (in FP16) and 1.49×/1.56× (in FP64) on A100 and H100 GPUs, respectively.
@InProceedings{PPoPP26p245,
author = {Kaige Zhang and Hailong Yang and Xin You and Tianyu Feng and Yufan Xu and Zhongzhi Luan and Yi Liu and Depei Qian},
title = {Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {245--258},
doi = {10.1145/3774934.3786441},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs
Yuchen Li,
Zhe Pan,
Peng Qu, and
Youhui Zhang
(Tsinghua University, China; Zhongguancun Laboratory, China)
Sparse matrix-sparse vector multiplication (SpMSpV) is a core primitive in graph analytics and scientific computing, also arising in spiking neural networks for event-driven spike propagation. On GPUs, the performance of the prevalent and efficient SpMSpV paradigm is often bottlenecked by the write-back phase of accumulating non-zero multiply–accumulate results; its many-to-one index scatter pattern causes severe conflicts and poor bandwidth utilization on GPUs. We present VDHA, a GPU-based weighted SpMSpV kernel that leverages block-private hash tables for local aggregation, substantially reducing write conflicts and improving memory coalescing. To further amplify this benefit, we incorporate column splitting with lightweight reordering to expose more locality, and employ a fetch–compute–writeback pipeline to overlap hash computation with memory accesses. Extensive evaluation on over 300 matrices with more than 5 million nonzeros, including web-scale graphs (Konect/LAW) and scientific workloads (SuiteSparse), shows that VDHA consistently outperforms state-of-the-art baselines. On web graphs, it achieves a 1.41× geometric-mean speedup (up to 3.42×), while on SuiteSparse it delivers 1.13× (up to 2.55×). We also provide a lightweight predictive model that identifies matrices favorable to VDHA with 91.3% accuracy.
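The local-aggregation idea can be sketched in plain Python: partial products are accumulated in a hash table keyed by row index before write-back, standing in for VDHA's block-private tables (names and layout are ours):

```python
def spmspv_hash(csc_A, x_sparse):
    """y = A @ x for sparse x, aggregating partial products in a hash
    table keyed by row index (a stand-in for a block-private table).
    csc_A: dict col -> list of (row, value); x_sparse: dict col -> value."""
    table = {}
    for k, xk in x_sparse.items():
        for i, a_ik in csc_A.get(k, []):
            # local aggregation absorbs the many-to-one index scatter
            table[i] = table.get(i, 0.0) + a_ik * xk
    # emit the (index, value) nonzeros of y once, in sorted order
    return sorted(table.items())
```

Each row index is written to the output exactly once, which is precisely the conflict reduction the block-private tables buy on a GPU.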
@InProceedings{PPoPP26p259,
author = {Yuchen Li and Zhe Pan and Peng Qu and Youhui Zhang},
title = {VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {259--272},
doi = {10.1145/3774934.3786447},
year = {2026},
}
Publisher's Version
Artifacts Reusable
Results Reproduced
Mixed Precision and Quantization
RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization
Qihao Zhang,
MingLiang Tang,
Mingshu Zhai,
Kinman Lei, and
Jidong Zhai
(Tsinghua University, China)
Mixed precision quantization has been adopted to accelerate large language models (LLMs) serving by leveraging high-throughput low-precision compute units in GPUs while preserving outliers in higher precision to maintain model accuracy. However, existing methods focus on mitigating single-dimensional channel-wise outliers, leading to model accuracy degradation when scaled to 4-bit precision.
In this paper, we present an algorithm-system co-design to effectively handle dual-dimensional outliers across both channel and token dimensions in LLMs. We introduce a novel rotation-based mixed precision quantization algorithm that suppresses and migrates channel-wise outliers to the token dimension. Based on this algorithm, we propose RoMeo, an efficient LLM serving system designed to overcome the unique system challenges posed by the sparse computation patterns and dynamic outlier detection inherent in token-wise outlier handling. Extensive evaluations across various LLMs demonstrate that RoMeo improves quantized model accuracy by up to 5.17% compared to state-of-the-art methods QuaRot and MixQ, while maintaining efficiency comparable to uniform-precision quantization, achieving up to 2.10× end-to-end speedup over the half-precision baseline. RoMeo is available at https://github.com/thu-pacman/RoMeo.
@InProceedings{PPoPP26p273,
author = {Qihao Zhang and MingLiang Tang and Mingshu Zhai and Kinman Lei and Jidong Zhai},
title = {RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {273--287},
doi = {10.1145/3774934.3786419},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
High-Throughput Non-uniformly Quantized 3-bit LLM Inference
YuAng Chen,
Wenqi Zeng, and
Jeffrey Xu Yu
(Chinese University of Hong Kong, China; Hong Kong University of Science and Technology, China; Hong Kong University of Science and Technology (Guangzhou), China)
While Large Language Models (LLMs) are widely adopted, their massive parameter size constrains practical deployment. A common solution is clustering-based non-uniform quantization, which effectively compresses models to as low as 3 bits per weight while preserving high accuracy. However, instead of accelerating memory-bound LLM inference, the memory reduction paradoxically often causes a significant slowdown due to dequantization overhead and GPU underutilization. To address the issue, we propose Quantix, a framework designed to convert memory savings into inference speedups. Quantix applies two key optimizations: (1) a hardware-aligned bit shuffling scheme for efficient data access, and (2) a fused dequantization-multiplication pipeline that effectively maps workloads on both CUDA and Tensor Cores. Quantix enables high-throughput batched inference, delivering average kernel-level speedups of 4.82× over FP16 cuBLAS and end-to-end speedups of up to 11.46× over state-of-the-art quantization methods on NVIDIA L40 GPUs.
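Clustering-based non-uniform quantization maps each weight to the nearest entry of a small learned codebook; 3 bits give 8 entries, and dequantization is a table lookup. A minimal sketch (the codebook below is illustrative, not Quantix's learned centroids):

```python
def quantize_codebook(weights, codebook):
    """Non-uniform quantization: map each weight to the index of its
    nearest codebook centroid (3 bits -> 8 entries)."""
    assert len(codebook) == 8
    return [min(range(8), key=lambda j: abs(w - codebook[j])) for w in weights]

def dequantize_codebook(codes, codebook):
    """Dequantization is a pure table lookup; this per-element gather is
    the memory-access pattern the bit-shuffling scheme optimizes."""
    return [codebook[c] for c in codes]
```

The accuracy comes from the centroids adapting to the weight distribution; the speed problem the paper tackles is doing this lookup and the subsequent multiplication fast enough on Tensor Cores.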
@InProceedings{PPoPP26p288,
author = {YuAng Chen and Wenqi Zeng and Jeffrey Xu Yu},
title = {High-Throughput Non-uniformly Quantized 3-bit LLM Inference},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {288--300},
doi = {10.1145/3774934.3786423},
year = {2026},
}
Publisher's Version
JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference
Chengyu Sun,
Yaqi Xia,
Hulin Wang,
Donglin Yang,
Xiaobo Zhou, and
Dazhao Cheng
(Wuhan University, China; Nvidia Corporation, USA; University of Macau, China)
Long-context large language models (LLMs) have seen widespread adoption in recent years.
However, during inference, the key-value (KV) cache—which stores intermediate activations—consumes significant memory, particularly as sequence lengths grow.
Quantization offers a promising path to compress the KV cache, but existing 2-bit approaches fall short of optimal inference efficiency due to hardware-unfriendly algorithms and system implementations.
We present JanusQuant, a 2-bit KV cache quantization system that achieves both high accuracy and end-to-end efficiency through algorithm–system co-design for long-context generation tasks.
At its core is RtSmooth, a novel runtime smoothing quantization algorithm that mitigates outlier-induced accuracy loss via adaptive transformation.
Building on RtSmooth, JanusQuant further enhances quantized inference with a series of optimizations: a fast absmax positioning technique for lightweight quantization, a memory-efficient data structure for managing recent tokens, and a custom mixed-precision attention kernel to accelerate computation.
Across representative LLMs, JanusQuant preserves 99% of FP16 accuracy, reduces KV cache memory usage by up to 5.3×, and delivers up to 4.45× faster decoding throughput compared to state-of-the-art methods, while scaling efficiently to long-context inference.
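Absmax quantization, which the fast positioning technique above accelerates, scales each group by its largest magnitude before rounding to the low-bit grid. A minimal 2-bit sketch (one common symmetric layout; the paper's exact scheme may differ):

```python
def quantize_2bit(group):
    """Symmetric 2-bit absmax quantization of one KV-cache group:
    integer levels {-2, -1, 0, 1} scaled by absmax/2."""
    scale = max(abs(v) for v in group) / 2 or 1.0   # guard all-zero groups
    q = [max(-2, min(1, round(v / scale))) for v in group]
    return q, scale

def dequantize_2bit(q, scale):
    """Recover approximate values; each group carries only its scale."""
    return [qi * scale for qi in q]
```

Finding the absmax position cheaply matters because every newly appended token triggers a fresh group quantization on the decoding critical path.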
@InProceedings{PPoPP26p301,
author = {Chengyu Sun and Yaqi Xia and Hulin Wang and Donglin Yang and Xiaobo Zhou and Dazhao Cheng},
title = {JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {301--314},
doi = {10.1145/3774934.3786428},
year = {2026},
}
Publisher's Version
HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff
Zeyu Song,
Lin Gan,
Xiaohui Duan,
Zhengrui Li,
Jiayu Fu,
Yinuo Wang,
Guangzhao Li, and
Guangwen Yang
(Tsinghua University, China; Shandong University, China; Institute of Software at Chinese Academy of Sciences, China)
Mixed-precision methods offer the potential to achieve better performance while maintaining accuracy comparable to that of high-precision formats. However, the adoption of mixed precision—particularly with 16-bit formats—in scientific computing remains limited due to precision truncation.
To address this challenge, we propose HierCut, a mixed-precision strategy that leverages cutoff schemes in molecular dynamics (the KOKKOS package of LAMMPS) by assigning different precision levels to particles across distinct cutoff layers. We further introduce techniques to improve the performance and accuracy of simulations using 16-bit numerical formats, and develop error analysis methods to guide the configuration of mixed-precision ratios. Our scheme achieves accuracy comparable to high-precision results, delivering a speedup of up to 3.75x over the original FP64 implementation of LAMMPS and up to 1.40x over the optimized FP32 kernels on NVIDIA A100 GPUs. Our implementation scales effectively to boost large-scale multi-GPU simulations.
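The hierarchical-cutoff idea can be sketched as a per-pair precision assignment: pairs inside the inner cutoff dominate the force and keep full precision, while outer shells tolerate lower precision. The tier boundaries below are illustrative, not HierCut's calibrated ratios:

```python
def precision_layer(r, r_inner, r_cut):
    """Assign a precision tier to a pair distance r under a hierarchical
    cutoff. Tier names and the even split of the outer shell are
    illustrative choices, not the paper's tuned configuration."""
    if r >= r_cut:
        return None                  # beyond the cutoff: no interaction
    if r <= r_inner:
        return "fp64"                # near pairs: full precision
    mid = (r_inner + r_cut) / 2      # split the remaining shell in two
    return "fp32" if r <= mid else "fp16"
```

The error-analysis methods in the paper then decide where `r_inner` and the intermediate boundary should sit so that the 16-bit contribution stays below the tolerated truncation error.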
@InProceedings{PPoPP26p315,
author = {Zeyu Song and Lin Gan and Xiaohui Duan and Zhengrui Li and Jiayu Fu and Yinuo Wang and Guangzhao Li and Guangwen Yang},
title = {HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {315--328},
doi = {10.1145/3774934.3786433},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Cluster and Cloud Computing
Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds
Xiaokang Hu,
Yuchao Cao,
Naixuan Guan,
Yifan Wu,
Xishi Qiu,
Shengdong Dai,
Ben Luo,
Sanchuan Cheng,
Fudong Qiu,
Yibin Shen, and
Jiesheng Wu
(Alibaba Cloud Computing, China)
Competition for the last-level cache (LLC) is a long-standing issue in multi-tenant cloud environments, often leading to severe performance interference among co-located virtual machines. LLC management in the cloud faces unique challenges, including unpredictable tenant workloads, misaligned performance metrics, and the need to ensure fairness under service level agreements (SLAs). Existing LLC allocation methods fall short in addressing these challenges. We present Cacheman, a comprehensive LLC management system designed from real-world cloud deployment experience. Cacheman introduces a novel gradient-based sharing mechanism for LLC ways, enabling smooth LLC allocation adjustments that simultaneously improve fairness and utilization efficiency. Its real-time allocation algorithm promptly detects and mitigates unfair LLC allocation, adapting to dynamic workloads with second-scale responsiveness. Additionally, Cacheman supports performance consistency for tenants running distributed applications by enforcing negotiated upper bounds on cache usage. Extensive experiments demonstrate that Cacheman effectively achieves its multi-dimensional goals, and long-term production deployment further shows that it significantly reduces SLA violations caused by LLC contention.
@InProceedings{PPoPP26p329,
author = {Xiaokang Hu and Yuchao Cao and Naixuan Guan and Yifan Wu and Xishi Qiu and Shengdong Dai and Ben Luo and Sanchuan Cheng and Fudong Qiu and Yibin Shen and Jiesheng Wu},
title = {Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {329--341},
doi = {10.1145/3774934.3786415},
year = {2026},
}
Publisher's Version
zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection
Xiangyu Liu,
Huiba Li,
Shun Gai,
Youmin Chen, and
Yiming Zhang
(Xiamen University, China; Alibaba Cloud, China; Shanghai Jiao Tong University, China)
This paper presents zBuffer, a zero-copy and metadata-free serialization library for high-performance and low-cost RPCs. At the core of zBuffer is scatter-gather reflection, a novel technique that collaboratively (i) leverages the NIC scatter-gather hardware feature to offload the costly data coalescing, and (ii) utilizes the static reflection mechanism of modern programming languages to enable type queries on complex data objects without requiring explicit metadata construction. We leverage C++ language features, mainly including template meta-programming and macros, to realize static reflection at compile time. Based on zBuffer, we design a fast RPC system (called zRPC) which eliminates all RPC memory copy overheads not only in (de)serialization but also in network transmission. Extensive evaluation shows that zBuffer/zRPC significantly outperforms state-of-the-art serialization/RPC mechanisms: zBuffer is approximately 7× faster than Cornflakes in serialization for complex data objects; and zRPC reduces 99th percentile latency by 21% and achieves 62% higher throughput than eRPC on the Masstree key-value (KV) store with the YCSB benchmark.
@InProceedings{PPoPP26p342,
author = {Xiangyu Liu and Huiba Li and Shun Gai and Youmin Chen and Yiming Zhang},
title = {zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {342--354},
doi = {10.1145/3774934.3786426},
year = {2026},
}
Publisher's Version
Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
Ruobing Han and
Hyesoon Kim
(Georgia Tech, USA)
The growing demand for GPU resources has led to widespread shortages in data centers, prompting the exploration of CPUs as an alternative for executing GPU programs. While prior research supports executing GPU programs on single CPUs, these approaches struggle to achieve competitive performance due to the computational capacity gap between GPUs and CPUs.
To further improve performance, we introduce CuCC, a framework that scales GPU-to-CPU migration to CPU clusters and utilizes distributed CPU nodes to execute GPU programs. Compared to single-CPU execution, CPU cluster execution requires cross-node communication to maintain data consistency. We present the CuCC execution workflow and communication optimizations, which aim to reduce network overhead. Evaluations demonstrate that CuCC achieves high scalability on large-scale CPU clusters and delivers runtimes approaching those of GPUs. In terms of cluster-wide throughput, CuCC enables CPUs to achieve an average of 2.59x higher throughput than GPUs.
@InProceedings{PPoPP26p355,
author = {Ruobing Han and Hyesoon Kim},
title = {Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {355--368},
doi = {10.1145/3774934.3786435},
year = {2026},
}
Publisher's Version
Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters
Yida Li,
Siwei Zhang,
Yiduo Niu,
Yang Du,
Qingxiao Sun,
Zhou Jin, and
Weifeng Liu
(China University of Petroleum-Beijing, China)
Sparse direct solvers are critical building blocks in a range of scientific applications on heterogeneous supercomputers. However, existing sparse direct solvers have not been able to fully leverage the high bandwidth and floating-point performance of modern GPUs. The primary challenges are twofold: (1) the absence of a mechanism for aggregating small tasks to saturate the GPU, and (2) the lack of a mechanism for executing a diverse set of small tasks in batch mode on a single GPU.
In this paper, we propose a strategy called Trojan Horse, which significantly enhances the execution efficiency of sparse direct solvers on GPU clusters. This mechanism divides each process's work into two stages: Aggregate (with two modules, Prioritizer and Container) and Batch (with two modules, Collector and Executor). In the Aggregate stage, a process first assesses the urgency of the input tasks through the Prioritizer module and, based on their priority, sends them to the Collector module or the Container module. In the Batch stage, the Collector module receives high-priority heterogeneous tasks from the Prioritizer module and retrieves enough tasks from the Container module to send them to the Executor module for batch execution on the GPU. In addition, our strategy is independent of solver libraries and is integrated into SuperLU_DIST and PanguLU.
In the scale-up evaluation on a single NVIDIA A100 GPU, the Trojan Horse strategy delivers speedups of up to 418.79x (5.47x on average) for SuperLU_DIST and up to 5.59x (2.84x on average) for PanguLU. In the scale-out evaluation on two 16-GPU clusters from NVIDIA and AMD, respectively, Trojan Horse continues to deliver strong performance gains for both SuperLU_DIST and PanguLU across different GPU counts.
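The Aggregate-and-Batch flow can be sketched with a priority queue: urgent tasks bypass the waiting pool, and the batch is topped up from pooled tasks until it is large enough to saturate the GPU (a simplified stand-in for the Prioritizer/Container/Collector modules; the threshold scheme and names are ours):

```python
import heapq

def form_batch(tasks, urgency_threshold, batch_size):
    """Aggregate-and-batch sketch: urgent tasks go straight to the batch
    (Collector); the rest wait in a pool (Container) that tops the batch
    up to batch_size. Lower priority value = more urgent."""
    collector, container = [], []
    for prio, task in tasks:                       # Prioritizer: route by urgency
        (collector if prio <= urgency_threshold else container).append((prio, task))
    heapq.heapify(container)                       # Container: min-heap on priority
    batch = sorted(collector)                      # urgent tasks always included
    while len(batch) < batch_size and container:   # top up to a GPU-saturating size
        batch.append(heapq.heappop(container))
    return [t for _, t in batch]
```

The point of the aggregation is that one large, mixed batch keeps the GPU busy where many tiny factorization tasks launched individually would not.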
@InProceedings{PPoPP26p369,
author = {Yida Li and Siwei Zhang and Yiduo Niu and Yang Du and Qingxiao Sun and Zhou Jin and Weifeng Liu},
title = {Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {369--383},
doi = {10.1145/3774934.3786442},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Distributed Training
COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training
Xingchen Liu,
Haoran Kong,
Hairui Zhao,
Shengkai Lyu,
Zheng Wei,
Man Liu,
Xingjian Tian,
Liyang Zhao,
Zhuohan Chen,
Fakang Wang,
Zizhong Chen,
Zhan Wang,
Guangming Tan, and
Dingwen Tao
(University of Chinese Academy of Sciences, China; Shenzhen Loop Area Institute, China; Chinese University of Hong Kong, Shenzhen, China; Jilin University, China; Ant Group, China)
Collective communication is critical to scaling large language model (LLM) training across various parallelism strategies, including data, tensor, and pipeline parallelism on GPU clusters. However, as model sizes and training scales increase, communication overhead is emerging as a major performance bottleneck. While compression is a promising mitigation strategy, existing solutions often lack user-transparency, hinder deployment and extensibility, and are not co-designed with communication algorithms. To address these limitations, we present COCCL, a high-performance collective communication library built on top of NCCL. COCCL introduces a novel programming model that can easily integrate compression into communication workflows with flexible configurability. It features a suite of compression-aware collective algorithms and runtime overlap mechanisms that mitigate error propagation and reduce computational overhead. We integrate well-established compression techniques into COCCL and tune the compression configurations during 3D-parallel training on GPT and Qwen models with up to 7 billion parameters. Using the optimal configuration (COCCL-3D), we achieve 1.24× throughput improvement while maintaining training accuracy.
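Plugging a compressor into a collective can be sketched as quantize-communicate-dequantize, here simulated over plain lists with a simple uniform quantizer standing in for the library's configurable compressors (the reduction over integer codes mimics summing compressed payloads in an allreduce):

```python
def compressed_allreduce(grads_per_rank, scale=64):
    """Allreduce with a pluggable lossy compressor, sketched on lists.
    Each rank encodes its gradient to integers before 'communication';
    the sum is decoded once at the end. The uniform quantizer here is
    an illustrative stand-in, not COCCL's actual compressors."""
    compress = lambda g: [round(v * scale) for v in g]    # lossy encode
    decompress = lambda q: [v / scale for v in q]
    n = len(grads_per_rank[0])
    total = [0] * n
    for g in grads_per_rank:          # "communication": accumulate codes
        for i, q in enumerate(compress(g)):
            total[i] += q
    return decompress(total)          # decode the reduced result once
```

Decoding once after the reduction, rather than at every hop, is the kind of algorithm/compression co-design choice that limits error propagation.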
@InProceedings{PPoPP26p384,
author = {Xingchen Liu and Haoran Kong and Hairui Zhao and Shengkai Lyu and Zheng Wei and Man Liu and Xingjian Tian and Liyang Zhao and Zhuohan Chen and Fakang Wang and Zizhong Chen and Zhan Wang and Guangming Tan and Dingwen Tao},
title = {COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {384--397},
doi = {10.1145/3774934.3786432},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training
Xuanyu Wang,
Fangcheng Fu,
Haoyang Li,
Hao Ge,
Sheng Lin,
Jiawen Niu, and
Bin Cui
(Peking University, China; Shanghai Jiao Tong University, China)
Distributed deep learning (DL) training faces instability from GPU/node failures in multi-GPU clusters, necessitating robust fault recovery from model checkpoints. However, we find that existing works only consider node failures and fail to handle partial GPU unavailability, and suffer from inefficient model checkpoint saving and loading, particularly when GPU availability changes.
This work presents Elastor, a fault-tolerant distributed DL training system featuring elastic and efficient model checkpointing. First, to accommodate partial GPU unavailability, we support heterogeneous model-parallel partitioning to elastically resume training with any number of GPUs. Second, we devise a partition-agnostic and efficient model checkpointing method via fine-grained tensor splits to achieve seamless transitions across arbitrary partitionings. In addition, Elastor is equipped with a strategy-search algorithm that automatically discovers the optimal model partitioning upon recovery, as well as a meticulous overlapping design that minimizes the overhead caused by periodic model checkpointing and data preprocessing. Experimental results show that Elastor facilitates quick model checkpointing and failure recovery, while maintaining consistent training efficiency across varying GPU availability. Source code is available at https://github.com/PKU-DAIR/Hetu.
@InProceedings{PPoPP26p398,
author = {Xuanyu Wang and Fangcheng Fu and Haoyang Li and Hao Ge and Sheng Lin and Jiawen Niu and Bin Cui},
title = {Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {398--412},
doi = {10.1145/3774934.3786445},
year = {2026},
}
Publisher's Version
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Geng Zhang,
Shenggan Cheng,
Xuanlei Zhao,
Ziming Liu, and
Yang You
(National University of Singapore, Singapore)
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at https://github.com/zxgx/Megatron-LM.
@InProceedings{PPoPP26p413,
author = {Geng Zhang and Shenggan Cheng and Xuanlei Zhao and Ziming Liu and Yang You},
title = {HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {413--424},
doi = {10.1145/3774934.3786417},
year = {2026},
}
Publisher's Version
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
Yida Gu,
Fakang Wang,
Jianhao Fu,
Zhenhang Sun,
Qianyu Zhang,
Hairui Zhao,
Xingchen Liu,
Yang Tian,
Wenjing Huang,
Zedong Liu,
Yifan Chen,
Jinwu Yang,
Yueyuan Zhou,
Qian Zhao,
Haoxu Li,
Tao Wang,
Feng Yu,
Zhan Wang,
Guangming Tan, and
Dingwen Tao
(University of Chinese Academy of Sciences, China; Ant Group, China; Jilin University, China)
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes—substantially outperforming existing solutions.
@InProceedings{PPoPP26p425,
author = {Yida Gu and Fakang Wang and Jianhao Fu and Zhenhang Sun and Qianyu Zhang and Hairui Zhao and Xingchen Liu and Yang Tian and Wenjing Huang and Zedong Liu and Yifan Chen and Jinwu Yang and Yueyuan Zhou and Qian Zhao and Haoxu Li and Tao Wang and Feng Yu and Zhan Wang and Guangming Tan and Dingwen Tao},
title = {CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {425--438},
doi = {10.1145/3774934.3786429},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Parallel Algorithms
Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols
Zhiyuan Zhang,
Yanxin Cai,
Wenhao Yin,
Xueyu Wu,
Yi Wang,
Lei Ju, and
Zhuoran Ji
(Shandong University, China; Quan Cheng Laboratory, China; University of Hong Kong, China; Shenzhen University, China; State Key Laboratory of Cryptography and Digital Economy Security, China)
Zero-knowledge proofs (ZKPs) are cryptographic protocols that allow verification of statements without disclosing the underlying information. Among them, PLONK-based ZKPs are particularly notable for offering succinct, non-interactive proofs of knowledge with a universal trusted setup, leading to widespread adoption in blockchain and cryptocurrency applications. Nonetheless, their broader deployment is hindered by long proof-generation times and substantial memory demands. While GPUs can accelerate these computations, their limited memory capacity introduces significant challenges for efficient end-to-end proof generation.
This paper presents Pipelonk, a GPU-accelerated framework for end-to-end PLONK proof generation with two key contributions. First, Pipelonk introduces a segmentable operator library that offloads all operations, including those not trivially parallelized, to GPUs through new designs. Each operator supports segmented execution, allowing inputs to be divided into smaller segments processed independently, thus enabling large-scale computations on memory-constrained devices. Second, Pipelonk provides a pipeline executor that overlaps computation and data transfer. It globally schedules compute- and memory-intensive tasks while preserving data and security dependencies, balances transfer-latency hiding against peak memory, and adaptively selects per-operator segment sizes by modeling memory capacity and computational characteristics to maximize compute-transfer overlap. Evaluation shows that Pipelonk runs efficiently on devices with 8GB to 80GB memory, achieving an average speedup of 10.7× and up to 19.4× over the state-of-the-art baseline.
@InProceedings{PPoPP26p439,
author = {Zhiyuan Zhang and Yanxin Cai and Wenhao Yin and Xueyu Wu and Yi Wang and Lei Ju and Zhuoran Ji},
title = {Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {439--451},
doi = {10.1145/3774934.3786448},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing
Shuhong Huang,
Shizhi Tang,
Yuan Wen,
Huanqi Cao,
Ruibai Tang,
Yidong Chen,
Jiping Yu,
Yang Li,
Chao Jiang,
Limin Xiao, and
Jidong Zhai
(Tsinghua University, China; Qingcheng.AI, China; University of Aberdeen, UK; Lenovo Research, China)
Automatic Differentiation (AD) is a technique that computes the derivatives of numerical programs by systematically applying the chain rule, playing a critical role in domains such as machine learning, simulation, and control systems. However, parallelizing differentiated programs remains a significant challenge due to the conflict between tapes (a data structure for intermediate variable storage) and summations: the differentiation process inherently introduces inter-thread summation patterns, which require prohibitively expensive atomic operations; and traditional tape designs tightly couple data retrieval with the program’s control flow, preventing code restructuring needed to eliminate these costly dependencies.
To address these challenges, we present ParDiff, a novel AD system with a direct-indexed tape design, which enables summation-aware loop transformations and various parallel schemes for differentiated programs. This results in a higher degree of parallelization, less synchronization, and reduced inter-thread data movement. We conduct comprehensive experiments on both multi-core CPUs and GPUs. Results show that ParDiff delivers up to 483.21× (geometric mean: 30.88×) speedup over the state-of-the-art fully-AD system, Enzyme. It also achieves a speedup of 2.05× and 2.06× over PyTorch on CPU and GPU, respectively. The source code is publicly available at https://github.com/roastduck/FreeTensor.
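The tape-and-summation conflict the paper addresses shows up even in a toy reverse-mode AD: every operation is recorded on a tape, and the backward sweep accumulates adjoints with `+=`, which becomes an atomic-update hotspot once the sweep is parallelized. A minimal sketch (a textbook tape design, not ParDiff's direct-indexed one):

```python
class Var:
    """Minimal reverse-mode AD: each op appends (output, local pullbacks)
    to a shared tape; backward() replays the tape in reverse. The `+=`
    adjoint accumulation is the inter-thread summation pattern that
    makes naive parallelization expensive."""
    tape = []  # shared tape; clear between independent computations

    def __init__(self, value):
        self.value, self.adj = value, 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        Var.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        Var.tape.append((out, [(self, other.value), (other, self.value)]))
        return out

def backward(output):
    """Reverse sweep: propagate adjoints from output back along the tape."""
    output.adj = 1.0
    for out, pulls in reversed(Var.tape):
        for var, local in pulls:
            var.adj += local * out.adj   # summation: a race when parallel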
@InProceedings{PPoPP26p452,
author = {Shuhong Huang and Shizhi Tang and Yuan Wen and Huanqi Cao and Ruibai Tang and Yidong Chen and Jiping Yu and Yang Li and Chao Jiang and Limin Xiao and Jidong Zhai},
title = {ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {452-465},
doi = {10.1145/3774934.3786418},
year = {2026},
}
Publisher's Version
Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs
Zhonghai Zhang,
Yewen Li,
Ke Meng,
Chunming Zhang, and
Guangming Tan
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Hong Kong University of Science and Technology, China; Phil Rivers Technology, China)
This paper proposes FastAlign, a faster, cheaper, and practical end-to-end solution for sequence alignment using commercial CPUs. It introduces two key innovations: a multi-stage seeding algorithm that improves search performance while maintaining low memory consumption, and an intra-query parallel seed-extension algorithm that eliminates redundancy and increases SIMD utilization. Evaluation results show that FastAlign achieves 2.27×–3.28× throughput speedups and 2.54×–5.65× cost reductions compared to state-of-the-art CPU and GPU baselines, while guaranteeing output 100% identical to the de facto standard software BWA-MEM. FastAlign is open-sourced at https://github.com/zzhofict/BWA-FastAlign.git.
@InProceedings{PPoPP26p466,
author = {Zhonghai Zhang and Yewen Li and Ke Meng and Chunming Zhang and Guangming Tan},
title = {Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {466-479},
doi = {10.1145/3774934.3786421},
year = {2026},
}
Publisher's Version
PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory
Yiwei Zhao,
Hongbo Kang,
Ziyang Men,
Yan Gu,
Guy E. Blelloch,
Laxman Dhulipala,
Charles McGuffey, and
Phillip B. Gibbons
(Carnegie Mellon University, USA; Tsinghua University, China; University of California at Riverside, USA; University of Maryland, USA; Reed College, USA)
Space-partitioning indexes are widely used for managing multi-dimensional data, but their throughput is often memory-bottlenecked. Processing-in-memory (PIM), an emerging architectural paradigm, mitigates memory bottlenecks by embedding processing cores directly within memory modules, allowing computation to be offloaded to these PIM cores.
In this paper, we present PIM-zd-tree, the first space-partitioning index specifically designed for real-world PIM systems. PIM-zd-tree employs a tunable multi-layer structure, with each layer adopting distinct data layouts, partitioning schemes, and caching strategies. Its design is theoretically grounded to achieve load balance, minimal memory-channel communication, and low space overhead. To bridge theory and practice, we incorporate implementation techniques such as practical chunking and lazy counters. Evaluation on a real-world PIM system shows that PIM-zd-tree’s throughput is up to 4.25× and 99× higher than two state-of-the-art shared-memory baselines.
@InProceedings{PPoPP26p480,
author = {Yiwei Zhao and Hongbo Kang and Ziyang Men and Yan Gu and Guy E. Blelloch and Laxman Dhulipala and Charles McGuffey and Phillip B. Gibbons},
title = {PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {480-495},
doi = {10.1145/3774934.3786411},
year = {2026},
}
Publisher's Version
ML Inference
BEEMS: Boosting Machine Vision Efficiency via Computation Graph-Based Memory Smoothing
Hanjing Shen,
Fangxin Liu,
Jian Liu,
Li Jiang, and
Haibing Guan
(Shanghai Jiao Tong University, China; Beihang University, China)
With the rapid advances of deep learning-based computer vision (CV) technology, digital images are increasingly processed not by humans, but by downstream CV algorithms. In particular, the growing popularity of vision foundation models has heightened interest in deploying these models on edge devices. However, limited memory remains a key bottleneck, making memory footprint reduction essential. Mainstream model customization methods often require intensive deployment effort and can severely degrade accuracy. Moreover, existing deep learning frameworks generally do not prioritize memory optimization, and current memory management schemes face practical limitations, including layer-wise memory imbalance, high management overhead, and volatile memory budgets.
To tackle these issues, this work focuses on compilation-level optimizations that are explicitly designed to be memory-aware. We observe that memory usage during vision foundation model inference varies significantly over time (up to a 10× difference), with extended periods of low memory demand. Based on this, we propose BEEMS, a dual-objective compiler that optimizes both memory and latency by smoothing memory usage across the computational graph. BEEMS analyzes the vision foundation model computational graph to identify peak and trough operators in terms of memory demand. It then builds an efficient optimization search space, offering a flexible interface that applies different strategies based on operator characteristics. Specifically, peak operators are optimized using techniques such as operator partitioning, kernel substitution, swapping, and rematerialization to reduce memory pressure, while trough operators apply subgraph substitutions to improve latency. Experiments on six diverse models show that BEEMS reduces peak memory by up to 90% and improves latency by 10%, demonstrating its effectiveness in jointly optimizing memory and performance.
@InProceedings{PPoPP26p496,
author = {Hanjing Shen and Fangxin Liu and Jian Liu and Li Jiang and Haibing Guan},
title = {BEEMS: Boosting Machine Vision Efficiency via Computation Graph-Based Memory Smoothing},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {496-508},
doi = {10.1145/3774934.3786430},
year = {2026},
}
Publisher's Version
Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving
Jianxiong Liao,
Quanxing Dong,
Yunkai Liang,
Zhi Zhou, and
Xu Chen
(Sun Yat-sen University, China)
Serving applications with diverse SLO requirements has become indispensable for production-scale LLM serving systems. However, existing systems rely on iteration-level scheduling, which enforces inflexible, unified execution across multi-SLO workloads, significantly constraining serving efficiency.
In this paper, we introduce layer-level scheduling, a novel mechanism that advances beyond conventional iteration-level granularity. This mechanism decomposes per-iteration computation into fine-grained layer operations, enabling tailored execution of requests with differing requirements. However, this increased granularity introduces new challenges in both intra-instance request execution and cross-instance coordination, posing significant barriers to practical deployment. To address these challenges, we introduce Laser, a system designed for efficient multi-SLO LLM serving. The key lies in the seamless integration of inter-instance request dispatching with layer-level scheduling within instances, delivering high serving throughput with SLO guarantees. Evaluations with real-world applications reveal that Laser improves throughput by over 1.67× while maintaining the same SLO attainment rate as state-of-the-art systems.
@InProceedings{PPoPP26p509,
author = {Jianxiong Liao and Quanxing Dong and Yunkai Liang and Zhi Zhou and Xu Chen},
title = {Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {509-521},
doi = {10.1145/3774934.3786413},
year = {2026},
}
Publisher's Version
MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models
Desen Sun,
Zepeng Zhao, and
Yuke Wang
(University of Waterloo, Canada; Carnegie Mellon University, USA; Rice University, USA)
Text-to-Image (T2I) diffusion models have recently attracted significant attention due to their ability to synthesize high-fidelity photorealistic images. However, serving diffusion models suffers from hardware underutilization in real-world settings due to highly variable request resolutions. To this end, we present MixFusion, a parallel serving system that exploits fine-grained patch-level parallelism to enable efficient batching of mixed-resolution requests. Specifically, MixFusion introduces a novel patch-based processing workflow, enabling concurrent processing across heterogeneous requests. Furthermore, MixFusion incorporates a patch-tailored cache management policy to exploit patch-level locality benefits. In addition, MixFusion features an SLO-aware scheduling strategy with lightweight online latency prediction. Extensive evaluation demonstrates that MixFusion achieves 30.1% higher SLO satisfaction on average compared to state-of-the-art solutions. Our code is available at https://github.com/desenSunUBW/mixfusion.
@InProceedings{PPoPP26p522,
author = {Desen Sun and Zepeng Zhao and Yuke Wang},
title = {MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {522-536},
doi = {10.1145/3774934.3786420},
year = {2026},
}
Publisher's Version
Artifacts Reusable
Results Reproduced
ChituDiffusion: A Data-Characteristic-Aware Serving System for Diffusion Models
Chengzhang Wu,
Liyan Zheng,
Haojie Wang,
Kezhao Huang,
Zixuan Ma,
Dong Dong, and
Jidong Zhai
(Tsinghua University, China)
Diffusion models have become the dominant approach for generative tasks in images, videos, and other domains. However, diverse data properties in generation requests, which are critical for efficient serving, remain underexploited. To address this issue, we propose ChituDiffusion, a diffusion model serving system. ChituDiffusion leverages the locality of data properties to recompose a diffusion pipeline into dGraphs with shared optimization opportunities, enabling thorough compile-time and runtime co-optimization. During compilation, ChituDiffusion compiles each dGraph into multiple execution engines optimized for specific data properties. At runtime, heterogeneous requests are carefully reorganized into fine-grained batching tasks with similar properties and then efficiently executed by matched engines. Evaluation on five diffusion applications shows that ChituDiffusion improves throughput by up to 2.13× (1.58× on average) on A100 and 2.19× (1.51× on average) on H100 compared with existing frameworks. The code for ChituDiffusion and the production traces have been made open source at https://github.com/thu-pacman/chitu/tree/Diffusion.
@InProceedings{PPoPP26p537,
author = {Chengzhang Wu and Liyan Zheng and Haojie Wang and Kezhao Huang and Zixuan Ma and Dong Dong and Jidong Zhai},
title = {ChituDiffusion: A Data-Characteristic-Aware Serving System for Diffusion Models},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {537-550},
doi = {10.1145/3774934.3786424},
year = {2026},
}
Publisher's Version
Graphs and Graph Neural Networks
ElasGNN: An Elastic Training Framework for Distributed GNN Training
Siqi Wang,
Hailong Yang,
Pengbo Wang,
Hongliang Cao,
Yufan Xu,
Xuezhu Wang,
Zhongzhi Luan,
Yi Liu, and
Depei Qian
(Beihang University, China; Independent Researcher, USA)
Graph Neural Networks (GNNs) have emerged as powerful machine learning models for numerous graph-based applications. However, existing GNN training frameworks cannot scale the training process elastically, resulting in poor training throughput and low cluster utilization. Although elastic training has been proposed for Deep Neural Networks (DNNs), it cannot be directly adopted for GNNs due to the prohibitive scaling cost and inefficient scheduling. In this paper, we present ElasGNN, an elastic GNN training framework that achieves efficient dynamic resource allocation for GNN jobs. ElasGNN proposes an efficient elastic training engine to achieve high-performance GNN job scaling and introduces novel graph repartitioning algorithms for both scale-in and scale-out processes to further minimize the scaling cost. Moreover, ElasGNN designs an efficient elastic scheduler, utilizing a scaling-cost-aware scheduling policy to improve GPU utilization and system throughput. The experimental results show that ElasGNN achieves shorter job completion time and makespan for training jobs across diverse GNN models.
@InProceedings{PPoPP26p551,
author = {Siqi Wang and Hailong Yang and Pengbo Wang and Hongliang Cao and Yufan Xu and Xuezhu Wang and Zhongzhi Luan and Yi Liu and Depei Qian},
title = {ElasGNN: An Elastic Training Framework for Distributed GNN Training},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {551-563},
doi = {10.1145/3774934.3786440},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference
Yiqing Wang,
Hailong Yang,
Enze Yu,
Qingxiao Sun,
Kejie Ma,
Kaige Zhang,
Chenhao Xie, and
Depei Qian
(Beihang University, China)
Temporal Graph Networks (TGNs) are widely used to model evolving relationships in dynamic graphs. However, existing inference systems enforce a step-wise paradigm: processing each temporal graph sequentially with a memory update followed by aggregation. We break this dependency by decoupling memory updates from aggregation while preserving prediction accuracy, thereby enabling a global view for fine-grained parallelism control. This design unlocks new optimization opportunities but introduces three system-level challenges: managing intermediate multi-state representations, curbing memory-bound update overheads, and selecting a safe yet efficient aggregation granularity. We present APERTURE, a TGN inference framework that bridges algorithmic semantics and system design. To address the above challenges, APERTURE (1) jointly aggregates temporal states via computation graph transformation, (2) minimizes redundant memory traffic through dependency-aware update reconstruction, and (3) selects the optimal aggregation granularity through analytical modeling. The experimental results show that APERTURE achieves up to 59.3× speedup over state-of-the-art baselines without compromising accuracy.
@InProceedings{PPoPP26p564,
author = {Yiqing Wang and Hailong Yang and Enze Yu and Qingxiao Sun and Kejie Ma and Kaige Zhang and Chenhao Xie and Depei Qian},
title = {APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {564-576},
doi = {10.1145/3774934.3786450},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform
Zhiqiang Liang,
Hongyu Gao,
Jue Wang,
Fang Liu,
Xingguo Shi,
Junyu Gu,
Peng Di,
Sian Li,
Lei Tang,
Chunbao Zhou,
Lian Zhao,
Yangang Wang, and
Xuebin Chi
(Computer Network Information Center at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Ant Group, China; UNSW, Australia)
Graph neural networks (GNNs) have increasingly widespread applications in the real world. In the mainstream mini-batch training mode, multiple cache-based GNN training acceleration systems have been proposed, exploiting the fact that the same vertex may be selected multiple times during sampling. However, on ultra-large-scale graphs, especially those exhibiting power-law characteristics, these systems struggle to fully exploit the distribution characteristics of cached data, which limits training performance. To this end, we propose TAC, a GNN training acceleration system that fully exploits these distribution characteristics to optimize both data transmission and computational efficiency. Specifically, we design a data affinity optimization algorithm that significantly enhances the locality of cache access. Second, we propose a sparsity-aware adaptive sparse matrix operator that dynamically selects the optimal computing mode based on the location of data. Finally, we construct a fine-grained training pipeline that maximizes system parallelism by overlapping sampling and computation. The experimental results show that TAC significantly outperforms existing state-of-the-art cache-based acceleration systems on multiple benchmark datasets, demonstrating higher training efficiency.
@InProceedings{PPoPP26p577,
author = {Zhiqiang Liang and Hongyu Gao and Jue Wang and Fang Liu and Xingguo Shi and Junyu Gu and Peng Di and Sian Li and Lei Tang and Chunbao Zhou and Lian Zhao and Yangang Wang and Xuebin Chi},
title = {TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {577-590},
doi = {10.1145/3774934.3786460},
year = {2026},
}
Publisher's Version
DTMiner: A Data-Centric System for Efficient Temporal Motif Mining
Yinbo Hou,
Hao Qi,
Ligang He,
Jin Zhao,
Yu Zhang,
Hui Yu,
Longlong Lin,
Lin Gu,
Wenbin Jiang,
Xiaofei Liao, and
Hai Jin
(Huazhong University of Science and Technology, China; University of Warwick, UK; Hong Kong University of Science and Technology, China; Southwest University, China)
Mining temporal motifs in temporal graphs is essential for many critical applications. Although several solutions have been proposed to handle temporal motif mining, they still suffer from substantial inefficiencies due to significant redundant graph traversals and fragmented memory access, both caused by irregular search tree expansions across different motif matching tasks. In this work, we observe that data accesses issued by these tasks exhibit strong spatial similarity and temporal monotonicity. Based on these observations, this paper proposes DTMiner, an efficient data-centric temporal motif mining system that introduces a novel Load-Explore-Synchronize (LES) execution model to regularize data accesses to the temporal graph data shared among different tasks. Specifically, DTMiner loads temporal graph chunks sequentially into the cache in temporal order and then triggers all relevant tasks to explore only these loaded data for search tree expansions via a fine-grained synchronization mechanism. In this way, different tasks can share the graph traversal corresponding to the same chunks, while fragmented memory accesses are restricted to the graph data residing in the cache, significantly reducing data access overhead. Experimental results demonstrate that DTMiner achieves 1.14×–11.98× performance improvement in comparison with the state-of-the-art temporal motif mining solutions.
@InProceedings{PPoPP26p591,
author = {Yinbo Hou and Hao Qi and Ligang He and Jin Zhao and Yu Zhang and Hui Yu and Longlong Lin and Lin Gu and Wenbin Jiang and Xiaofei Liao and Hai Jin},
title = {DTMiner: A Data-Centric System for Efficient Temporal Motif Mining},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {591-604},
doi = {10.1145/3774934.3786416},
year = {2026},
}
Publisher's Version
Optimizing Transformers
FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism
Jianxing Xu,
Yuanbo Wen,
Jun Bi,
Ruibai Xu,
Guanglin Xu,
Rui Zhang,
Wei Li,
Ling Li,
Tianshi Chen,
Qi Guo, and
Yunji Chen
(University of Science and Technology of China, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Cambricon Technologies, China)
The attention mechanism is central to modern deep learning, particularly in large language models (LLMs), but suffers from quadratic computational complexity. To accelerate attention computation on GPUs, fused attention techniques (e.g., FlashAttention) consolidate the matrix multiplication (GEMM) and softmax computations into a single kernel. However, these operations remain computationally decoupled: the GEMM leverages high-performance tensor units (Tensor Cores), while the softmax executes on slower vector units (CUDA cores). This imbalance induces severe vector intervals—periods where tensor units sit idle awaiting vector unit completion—significantly underutilizing tensor units. Furthermore, ongoing hardware advancements delivering faster tensor units exacerbate this bottleneck.
To resolve the vector interval bottleneck, in this paper, we propose FlashAttention-T, a fused attention implementation that advances toward fully tensorized fused attention. Our key insight is to offload critical softmax primitives to idle tensor units, maximizing hardware utilization and throughput. Concretely, we first introduce a series of operand value assignment methods to repurpose tensor matrix multiply-add (MMA) instructions for executing softmax primitives (e.g., element-wise scaling). Building on this, we design a tensorized online softmax algorithm with numerical stability guarantees, adhering to the constraints of repurposed tensor MMA instructions. To maximize performance, we parallelize the online softmax computation across tensor and vector units via architecture-aware scheduling techniques, fully leveraging heterogeneous parallelism.
Extensive evaluation of various attention configurations across multiple platforms including NVIDIA Ampere (A100 and AGX Orin) and Hopper (H100) GPUs demonstrates that FlashAttention-T achieves vector interval ratios 1.17–2.18× lower than the baseline on Ampere GPUs and reduces them down to 2.7% on the Hopper H100, with average speedups of up to 1.17× against FlashAttention-2 and FlashAttention-3 while preserving numerical stability and accuracy.
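For reference, the online softmax recurrence that FlashAttention-style kernels fuse with GEMM keeps a running maximum, normalizer, and accumulator, rescaling the partial results whenever a new maximum appears. The scalar sketch below shows only this generic recurrence; the paper's contribution is executing the rescaling primitives on tensor MMA instructions, which is not modeled here.

```python
import math

def online_softmax_weighted_sum(scores, values):
    # Streaming softmax-weighted sum: one pass over (score, value) pairs.
    # m  = running maximum (for numerical stability)
    # l  = running normalizer  sum(exp(s - m))
    # acc = running weighted sum  sum(exp(s - m) * v)
    # When a larger maximum appears, old partials are rescaled by exp(m - m_new).
    m = float("-inf")
    l = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l
```

The result matches the two-pass softmax exactly (up to roundoff), which is what lets attention kernels tile over the key dimension without materializing all scores.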
@InProceedings{PPoPP26p605,
author = {Jianxing Xu and Yuanbo Wen and Jun Bi and Ruibai Xu and Guanglin Xu and Rui Zhang and Wei Li and Ling Li and Tianshi Chen and Qi Guo and Yunji Chen},
title = {FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {605-619},
doi = {10.1145/3774934.3786425},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Accelerating Sparse Transformer Inference on GPU
Wenhao Dai,
Haodong Deng,
Mengfei Rong,
Xinyu Yang,
Hongyu Liu,
Fangxin Liu,
Hailong Yang,
Qianwen Cao, and
Qingxiao Sun
(China University of Petroleum-Beijing, China; Beihang University, China; Baidu, China; Shanghai Jiao Tong University, China)
Large language models (LLMs) are popular worldwide due to their powerful understanding capabilities. Since the Transformer is the core component of LLMs, accelerating it through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into the Transformer to reduce computation. However, previous work has rarely focused on the performance optimization of sparse Transformers. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework for Sparse Transformer inference that enables flexible masking and Operator Fusion on GPU. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.6× in MHA computation and 1.4× in end-to-end inference.
@InProceedings{PPoPP26p620,
author = {Wenhao Dai and Haodong Deng and Mengfei Rong and Xinyu Yang and Hongyu Liu and Fangxin Liu and Hailong Yang and Qianwen Cao and Qingxiao Sun},
title = {Accelerating Sparse Transformer Inference on GPU},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {620-634},
doi = {10.1145/3774934.3786434},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Results Reproduced
MetaAttention: A Unified and Performant Attention Framework across Hardware Backends
Feiyang Chen,
Yu Cheng,
Lei Wang,
Yuqing Xia,
Ziming Miao,
Lingxiao Ma,
Fan Yang,
Jilong Xue,
Zhi Yang,
Mao Yang,
Xingda Wei, and
Haibo Chen
(Shanghai Jiao Tong University, China; Peking University, China; Microsoft Research, China)
Attention computation is the backbone of transformer-based models such as large language models. However, the increasing diversity of attention algorithms presents significant challenges for unleashing hardware performance. State-of-the-art implementations like FlashAttention target a specific attention algorithm or hardware platform and fail to generalize to other algorithms and platforms.
We present MetaAttention, a framework that automatically derives the optimal implementation of an attention algorithm for a given hardware platform. Our key insight is that variants of attention can be abstracted into two operations: relevance scoring and aggregation, complemented by customizable functions and configurations such as the input shape. Based on this insight, we systematically design a cross-backend attention runtime around these operations that generalizes to variants of attention with customizable operators. To unleash hardware performance, we further propose an IntermediateTensor-based search method that finds the optimal tiling strategy and parallelism scheme according to the attention customization and hardware features. MetaAttention delivers up to a 10.4× speedup for configurations previously unsupported by state-of-the-art systems. Additionally, MetaAttention achieves performance comparable to manually optimized libraries such as FlashMLA while significantly reducing the amount of code required.
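The two-operation abstraction named in the abstract (relevance scoring plus aggregation) can be sketched as a higher-order function. The function names below are illustrative, not MetaAttention's API; swapping in a different `score` or `aggregate` yields a different attention variant.

```python
import math

def attention(q, keys, values, score, aggregate):
    # Attention abstracted into two customizable operations:
    # 1) relevance scoring between the query and each key,
    # 2) aggregation of values under the resulting weights.
    scores = [score(q, k) for k in keys]
    return aggregate(scores, values)

def dot_score(q, k):
    # Standard dot-product relevance scoring.
    return sum(a * b for a, b in zip(q, k))

def softmax_aggregate(scores, values):
    # Numerically stable softmax weighting followed by a weighted sum.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

out = attention([1.0, 0.0],
                [[1.0, 0.0], [1.0, 0.0]],
                [[2.0, 2.0], [4.0, 4.0]],
                dot_score, softmax_aggregate)
print(out)  # equal scores -> equal weights -> [3.0, 3.0]
```

Replacing `dot_score` with, say, a scaled or gated variant changes the algorithm without touching the runtime skeleton, which is the generalization the abstraction buys.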
@InProceedings{PPoPP26p635,
author = {Feiyang Chen and Yu Cheng and Lei Wang and Yuqing Xia and Ziming Miao and Lingxiao Ma and Fan Yang and Jilong Xue and Zhi Yang and Mao Yang and Xingda Wei and Haibo Chen},
title = {MetaAttention: A Unified and Performant Attention Framework across Hardware Backends},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {635-647},
doi = {10.1145/3774934.3786444},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
Matrix and Linear Algebra Algorithms
Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures
Lu Shi,
WeiWei Xu, and
Shaoshuai Zhang
(University of Electronic Science and Technology of China, China; Nanjing University of Information Science and Technology, China)
Singular Value Decomposition (SVD) is a fundamental tool in numerous scientific and engineering domains. Many high-performance libraries, such as LAPACK, MAGMA, and cuSOLVER, provide general, truncated, and randomized SVD routines. However, when the input is a low-rank matrix whose rank is not explicitly known, existing routines usually treat it as full-rank, which leads to suboptimal performance. In this paper, we propose an efficient SVD algorithm specifically for rank-deficient matrices based on a recently proposed rank-revealing QR factorization, termed QB factorization. To further enhance numerical stability and efficiency, we introduce a Householder QB factorization and a mixed-precision SVD algorithm, accompanied by a rigorous error analysis demonstrating correctness and stability. Experimental results show that our method achieves up to 6978.71× speedup over the general (full) SVD routine in cuSOLVER and is 9.99× faster than randomized SVD in FP32 precision. Moreover, our method exhibits higher numerical accuracy than cuSOLVER full SVD, achieving substantially smaller backward errors while maintaining stable and reliable singular values. Beyond synthetic benchmarks, we also demonstrate its effectiveness in an image compression application with higher efficiency.
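For context, the QB idea can be sketched with the textbook randomized range finder: sample the range of A, orthonormalize the sample to get Q, and form the small factor B = Q^T A, so that A ≈ QB and an SVD of the small B (times Q) recovers the SVD of A cheaply. The pure-Python sketch below illustrates only this generic scheme; the paper's Householder QB, mixed precision, and GPU kernels are not modeled.

```python
import random

def qb_factorize(A, k, seed=0):
    # Randomized QB factorization sketch (illustrative, not the paper's method):
    # Y = A @ Omega samples the range of A with an n x k Gaussian sketch;
    # Gram-Schmidt on Y's columns yields an orthonormal basis Q (k columns
    # of length m); B = Q^T A is the small k x n factor with A ~= Q @ B
    # whenever rank(A) <= k.
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    Omega = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    Y = [[sum(A[i][t] * Omega[t][j] for t in range(n)) for j in range(k)]
         for i in range(m)]
    Q = []  # list of k orthonormal columns, each of length m
    for j in range(k):
        v = [Y[i][j] for i in range(m)]
        for q in Q:                              # subtract projections
            c = sum(q[i] * v[i] for i in range(m))
            v = [v[i] - c * q[i] for i in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        Q.append([x / norm for x in v])
    B = [[sum(q[i] * A[i][t] for i in range(m)) for t in range(n)] for q in Q]
    return Q, B
```

For a rank-deficient A this replaces a full m x n SVD with a dense SVD of the small k x n factor B, which is the cost saving the paper builds on.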
@InProceedings{PPoPP26p648,
author = {Lu Shi and WeiWei Xu and Shaoshuai Zhang},
title = {Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {648-659},
doi = {10.1145/3774934.3786427},
year = {2026},
}
Publisher's Version
A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers
Xiaojian Yang,
Yuhui Ni,
Fan Yuan,
Shengguo Li,
Dezun Dong,
Chuanfu Xu,
Haipeng Jia, and
Jie Liu
(National University of Defense Technology, China; Xiangtan University, China; University of Chinese Academy of Sciences, China)
Krylov subspace methods are widely used in scientific computing to solve large sparse linear systems and eigenvalue problems. Their performance is often bottlenecked by high-order matrix-power kernels (MPK), especially in polynomial preconditioners that must scale to millions or billions of variables. We present Diagonal Block MPK (DBMPK), a lightweight and parallel-friendly optimization that partitions the input matrix into diagonal blocks and off-diagonal regions. This design enables efficient intra-block data reuse and eliminates inter-block dependencies, improving cache locality and parallelism and reducing preprocessing overhead compared to existing techniques. Our evaluation on x86 and Arm HPC platforms shows that DBMPK improves MPK performance by 26.6%–38.4%. When applied to polynomial preconditioners for linear systems and eigenvalue problems, it achieves consistent end-to-end speedups of 18.6%–34.0%, including in weak-scaling tests on 128 nodes, demonstrating strong scalability and practical impact.
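The kernel being optimized can be stated simply: a matrix-power kernel materializes the basis [x, Ax, A²x, ..., Aᵏx] by repeated matrix-vector products, each of which streams the whole matrix. The sketch below shows only this baseline kernel (with dense rows for clarity; real MPKs use sparse formats), not the DBMPK blocking scheme itself.

```python
def matrix_power_kernel(A, x, k):
    # Baseline matrix-power kernel: the Krylov basis [x, Ax, ..., A^k x]
    # computed as k successive matrix-vector products.  Each product reads
    # the entire matrix from memory, so for large A the matrix is streamed
    # k times -- the locality cost that blocked MPK schemes reduce by
    # reusing cache-resident blocks across several powers.
    basis = [list(x)]
    for _ in range(k):
        x = [sum(a * v for a, v in zip(row, x)) for row in A]  # one MatVec
        basis.append(x)
    return basis

print(matrix_power_kernel([[2, 0], [1, 3]], [1, 1], 2))
# [[1, 1], [2, 4], [4, 14]]
```

Polynomial preconditioners evaluate p(A)·x from exactly such a basis, which is why MPK throughput dominates their end-to-end cost.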
@InProceedings{PPoPP26p660,
author = {Xiaojian Yang and Yuhui Ni and Fan Yuan and Shengguo Li and Dezun Dong and Chuanfu Xu and Haipeng Jia and Jie Liu},
title = {A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {660-673},
doi = {10.1145/3774934.3786446},
year = {2026},
}
Publisher's Version
A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability
Yuchen Ma,
Bin Ren, and
Andreas Stathopoulos
(College of William & Mary, USA)
The distributed matrix-block-vector multiplication (Matvec) algorithm is a critical component of many applications, but can be computationally challenging for dense matrices of dimension O(10^6–10^7) and blocks of O(10–100) vectors. We present the performance analysis, implementation, and optimization of our SMatVec library for Matvec under the effect of system variability. Our modeling shows that a 1D pipelined Matvec is as efficient as 2D algorithms on small to medium clusters, which are sufficient for these problem sizes. We develop a performance tracing framework and a simulator that reveal pipeline bubbles caused by modest ~5% system variability. To tolerate such variability, our SMatVec library, which combines on-the-fly kernel matrix generation with Matvec, integrates four optimizations: inter-process data preloading, unconventional static thread scheduling, cache-aware tiling, and multi-version unrolling. In our benchmarks on O(10^5) Matvec problems, SMatVec achieves up to 1.85× speedup over COSMA and 17× over ScaLAPACK. For O(10^6) problems, where COSMA and ScaLAPACK exceed memory capacity, SMatVec maintains linear strong scaling and achieves 75% of peak FMA Flop/s. Its static scheduling policy yields a 2.27× speedup over the conventional work-stealing dynamic scheduler, and is predicted to withstand up to 108% performance variability under simulation with exponentially distributed variability.
@InProceedings{PPoPP26p674,
author = {Yuchen Ma and Bin Ren and Andreas Stathopoulos},
title = {A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {674--686},
doi = {10.1145/3774934.3786453},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
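For readers unfamiliar with the kernel under study, a minimal sketch of a 1D row-block-distributed matrix-block-vector product follows: each "process" owns a contiguous block of rows of A and computes its slice of Y = A·X. The names are hypothetical, and this omits the pipelining, data preloading, and static scheduling that SMatVec layers on top.

```python
# Illustrative 1D row-block Matvec sketch (not the SMatVec implementation).
# A is distributed by row blocks; X is a block of O(10-100) vectors,
# stored row-major (X[i] holds row i across all right-hand sides).

def local_matvec(block_rows, X):
    # Each process computes its rows of Y = A * X for all vectors at once;
    # zip(*X) iterates over the columns (individual vectors) of the block.
    return [[sum(a * x for a, x in zip(row, col)) for col in zip(*X)]
            for row in block_rows]

A = [[1.0, 2.0], [3.0, 4.0]]
X = [[1.0, 0.0], [0.0, 1.0]]   # two right-hand-side vectors (identity block)
# Split A into two row blocks, one per "process", then stack local results.
Y = local_matvec(A[:1], X) + local_matvec(A[1:], X)
# Since X is the identity, Y reproduces A.
```

In the 1D pipelined scheme the paper models, these local products are overlapped with communication of vector pieces between neighboring row blocks, which is where the ~5% system variability opens pipeline bubbles.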
Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing
Yuechen Lu,
Hongwei Zeng,
Marc Casas, and
Weifeng Liu
(China University of Petroleum-Beijing, China; Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain)
Matrix multiplication units (MMUs) in modern parallel processors enable efficient execution of tiled matrix multiplications at varying precisions. While their effectiveness in AI workloads has been well demonstrated, their utility in scientific computing lacks systematic analysis. In this work, we characterize MMUs across a broad range of scientific computing patterns by evaluating performance, power consumption, numerical precision, and memory access behavior. To support this analysis, we develop Cubie, a comprehensive benchmark suite comprising ten MMU-optimized kernels covering key parallel patterns. We also categorize MMU utilization patterns into four quadrants and identify the MMU limitations that arise in scientific computing. Through detailed comparisons with vector units, we provide nine key observations on the behavior and implications of MMUs in general scientific workloads, offering valuable insights for architecture, algorithm, and application researchers.
@InProceedings{PPoPP26p687,
author = {Yuechen Lu and Hongwei Zeng and Marc Casas and Weifeng Liu},
title = {Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing},
booktitle = {Proc.\ PPoPP},
publisher = {ACM},
pages = {687--701},
doi = {10.1145/3774934.3786456},
year = {2026},
}
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Results Reproduced
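The tiled matrix multiplication that MMUs execute natively can be sketched as follows; the tile size T and loop structure are illustrative assumptions for exposition, not the Cubie benchmark code.

```python
# Hedged sketch of tiled matrix multiply C = A * B: the computation is
# decomposed into T x T tile updates, the unit of work an MMU performs
# in hardware (at whatever precision the unit supports).

def tiled_matmul(A, B, T=2):
    n = len(A)                      # assume square n x n matrices
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                # One tile update C[i0:i0+T, j0:j0+T] += A_tile * B_tile.
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, n)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + T, n)))
    return C

I2 = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 2.0], [3.0, 4.0]]
# Multiplying by the identity returns M unchanged: tiled_matmul(M, I2) == M
```

Kernels whose inner loops map cleanly onto such tile updates are the ones that benefit from MMUs; the paper's four-quadrant categorization distinguishes how well each scientific pattern fits this shape.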