PPoPP 2026
31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2026)

31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2026), January 31 – February 4, 2026, Sydney, NSW, Australia

PPoPP 2026 – Proceedings


Frontmatter

Title Page
Welcome from the Chairs
PPoPP 2026 Organization
PPoPP 2026 Sponsors and Supporters

Concurrency Control

Binary Compatible Critical Section Delegation
Junyao Zhang, Zhuo Wang, and Zhe Zhou
(Fudan University, China; Alibaba Cloud Computing, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Hapax Locks: Scalable Value-Based Mutual Exclusion
Dave Dice and Alex Kogan
(Independent, USA; Oracle Labs, USA)
Publisher's Version
Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes
Md Amit Hasan Arovi and Ruslan Nikolaev
(Pennsylvania State University, USA)
Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced
Multiverse: Transactional Memory with Dynamic Multiversioning
Gaetano Coccimiglio, Trevor Brown, and Srivatsan Ravi
(University of Waterloo, Canada; University of Southern California, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Scheduling and Load Balancing

Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads
Aleix Roca and Vicenç Beltran
(Barcelona Supercomputing Center, Spain)
Publisher's Version
Waste-Efficient Work Stealing
Kyle Singer, Kunal Agrawal, and Tao B. Schardl
(Massachusetts Institute of Technology, USA; Washington University in St. Louis, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
DiggerBees: Depth First Search Leveraging Hierarchical Block-Level Stealing on GPUs
Yuyao Niu, Yuechen Lu, Weifeng Liu, and Marc Casas
(Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain; China University of Petroleum-Beijing, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
PANA: A Fine-Grained Runtime-Adaptive Load Balancing for Parallel SpMV on Multicore CPUs
Haodong Bian, Youhui Zhang, Xiang Fei, Jianqiang Huang, and Xiaoying Wang
(Tsinghua University, China; Qinghai University, China; Zhongguancun Laboratory, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Concurrent Data Structures

UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees
Quinten De Man, Atharva Sharma, Kishen N Gowda, and Laxman Dhulipala
(University of Maryland, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks
Ajay Singh, Nikos Metaxakis, and Panagiota Fatourou
(ICS-FORTH, Greece; University of Crete, Greece)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Concurrent Balanced Augmented Trees
Evan Wrench, Ajay Singh, Younghun Roh, Panagiota Fatourou, Siddhartha Jayanti, Eric Ruppert, and Yuanhao Wei
(University of British Columbia, Canada; ICS-FORTH, Greece; Massachusetts Institute of Technology, USA; University of Crete, Greece; Dartmouth College, USA; York University, Canada)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Parallel Dynamic Spatial Indexes
Ziyang Men, Bo Huang, Yan Gu, and Yihan Sun
(University of California at Riverside, USA)
Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced

GPU and Heterogeneous Computing

PRISM: An Efficient GPU-Based Lossy Compression Framework for Progressive Data Retrieval with Multi-Level Interpolation
Bing Lu, Zedong Liu, Hairui Zhao, Dejun Luo, Wenjing Huang, Yida Gu, Jinyang Liu, Guangming Tan, and Dingwen Tao
(Institute of Computing Technology at Chinese Academy of Sciences, China; Jilin University, China; University of Chinese Academy of Sciences, China; University of California at Riverside, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Dynamic Detection of Inefficient Data Mapping Patterns in Heterogeneous OpenMP Applications
Luke Marzen, Junhyung Shim, and Ali Jannesari
(Iowa State University, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Root-Down Exposure for Maximal Clique Enumeration on GPUs
Zhe Pan, Peng Qu, and Youhui Zhang
(Tsinghua University, China; Zhongguancun Laboratory, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ROME: Maximizing GPU Efficiency for All-Pairs Shortest Path via Taming Fine-Grained Irregularities
Weile Luo, Yuhan Chen, Xiangrui Yu, Qiang Wang, Ruibo Fan, Hongyuan Liu, and Xiaowen Chu
(Hong Kong University of Science and Technology (Guangzhou), China; Harbin Institute of Technology, Shenzhen, China; Stevens Institute of Technology, USA; Hong Kong University of Science and Technology, Hong Kong)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Stencil and Sparse Matrix Computation

SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping
Qiqi Gu, Chenpeng Wu, Heng Shi, and Jianguo Yao
(Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration
Jiazhi Jiang, Xijia Yao, Jiayu Chen, Jinhui Wei, Dan Huang, and Yutong Lu
(Sun Yat-sen University, China)
Publisher's Version
Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores
Kaige Zhang, Hailong Yang, Xin You, Tianyu Feng, Yufan Xu, Zhongzhi Luan, Yi Liu, and Depei Qian
(Beihang University, China; Independent Researcher, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs
Yuchen Li, Zhe Pan, Peng Qu, and Youhui Zhang
(Tsinghua University, China; Zhongguancun Laboratory, China)
Publisher's Version Artifacts Reusable Results Reproduced

Mixed Precision and Quantization

RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization
Qihao Zhang, MingLiang Tang, Mingshu Zhai, Kinman Lei, and Jidong Zhai
(Tsinghua University, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
High-Throughput Non-uniformly Quantized 3-bit LLM Inference
YuAng Chen, Wenqi Zeng, and Jeffrey Xu Yu
(Chinese University of Hong Kong, China; Hong Kong University of Science and Technology, China; Hong Kong University of Science and Technology (Guangzhou), China)
Publisher's Version
JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference
Chengyu Sun, Yaqi Xia, Hulin Wang, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng
(Wuhan University, China; Nvidia Corporation, USA; University of Macau, China)
Publisher's Version
HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff
Zeyu Song, Lin Gan, Xiaohui Duan, Zhengrui Li, Jiayu Fu, Yinuo Wang, Guangzhao Li, and Guangwen Yang
(Tsinghua University, China; Shandong University, China; Institute of Software at Chinese Academy of Sciences, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Cluster and Cloud Computing

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds
Xiaokang Hu, Yuchao Cao, Naixuan Guan, Yifan Wu, Xishi Qiu, Shengdong Dai, Ben Luo, Sanchuan Cheng, Fudong Qiu, Yibin Shen, and Jiesheng Wu
(Alibaba Cloud Computing, China)
Publisher's Version
zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection
Xiangyu Liu, Huiba Li, Shun Gai, Youmin Chen, and Yiming Zhang
(Xiamen University, China; Alibaba Cloud, China; Shanghai Jiao Tong University, China)
Publisher's Version
Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
Ruobing Han and Hyesoon Kim
(Georgia Tech, USA)
Publisher's Version
Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters
Yida Li, Siwei Zhang, Yiduo Niu, Yang Du, Qingxiao Sun, Zhou Jin, and Weifeng Liu
(China University of Petroleum-Beijing, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Distributed Training

COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training
Xingchen Liu, Haoran Kong, Hairui Zhao, Shengkai Lyu, Zheng Wei, Man Liu, Xingjian Tian, Liyang Zhao, Zhuohan Chen, Fakang Wang, Zizhong Chen, Zhan Wang, Guangming Tan, and Dingwen Tao
(University of Chinese Academy of Sciences, China; Shenzhen Loop Area Institute, China; Chinese University of Hong Kong, Shenzhen, China; Jilin University, China; Ant Group, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training
Xuanyu Wang, Fangcheng Fu, Haoyang Li, Hao Ge, Sheng Lin, Jiawen Niu, and Bin Cui
(Peking University, China; Shanghai Jiao Tong University, China)
Publisher's Version
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, and Yang You
(National University of Singapore, Singapore)
Publisher's Version
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang, Guangming Tan, and Dingwen Tao
(University of Chinese Academy of Sciences, China; Ant Group, China; Jilin University, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable

Parallel Algorithms

Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols
Zhiyuan Zhang, Yanxin Cai, Wenhao Yin, Xueyu Wu, Yi Wang, Lei Ju, and Zhuoran Ji
(Shandong University, China; Quan Cheng Laboratory, China; University of Hong Kong, China; Shenzhen University, China; State Key Laboratory of Cryptography and Digital Economy Security, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing
Shuhong Huang, Shizhi Tang, Yuan Wen, Huanqi Cao, Ruibai Tang, Yidong Chen, Jiping Yu, Yang Li, Chao Jiang, Limin Xiao, and Jidong Zhai
(Tsinghua University, China; Qingcheng.AI, China; University of Aberdeen, UK; Lenovo Research, China)
Publisher's Version
Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs
Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, and Guangming Tan
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Hong Kong University of Science and Technology, China; Phil Rivers Technology, China)
Publisher's Version
PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory
Yiwei Zhao, Hongbo Kang, Ziyang Men, Yan Gu, Guy E. Blelloch, Laxman Dhulipala, Charles McGuffey, and Phillip B. Gibbons
(Carnegie Mellon University, USA; Tsinghua University, China; University of California at Riverside, USA; University of Maryland, USA; Reed College, USA)
Publisher's Version Info

ML Inference

BEEMS: Boosting Machine Vision Efficiency via Computation Graph-Based Memory Smoothing
Hanjing Shen, Fangxin Liu, Jian Liu, Li Jiang, and Haibing Guan
(Shanghai Jiao Tong University, China; Beihang University, China)
Publisher's Version
Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving
Jianxiong Liao, Quanxing Dong, Yunkai Liang, Zhi Zhou, and Xu Chen
(Sun Yat-sen University, China)
Publisher's Version
MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models
Desen Sun, Zepeng Zhao, and Yuke Wang
(University of Waterloo, Canada; Carnegie Mellon University, USA; Rice University, USA)
Publisher's Version Artifacts Reusable Results Reproduced
ChituDiffusion: A Data-Characteristic-Aware Serving System for Diffusion Models
Chengzhang Wu, Liyan Zheng, Haojie Wang, Kezhao Huang, Zixuan Ma, Dong Dong, and Jidong Zhai
(Tsinghua University, China)
Publisher's Version

Graphs and Graph Neural Networks

ElasGNN: An Elastic Training Framework for Distributed GNN Training
Siqi Wang, Hailong Yang, Pengbo Wang, Hongliang Cao, Yufan Xu, Xuezhu Wang, Zhongzhi Luan, Yi Liu, and Depei Qian
(Beihang University, China; Independent Researcher, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference
Yiqing Wang, Hailong Yang, Enze Yu, Qingxiao Sun, Kejie Ma, Kaige Zhang, Chenhao Xie, and Depei Qian
(Beihang University, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform
Zhiqiang Liang, Hongyu Gao, Jue Wang, Fang Liu, Xingguo Shi, Junyu Gu, Peng Di, Sian Li, Lei Tang, Chunbao Zhou, Lian Zhao, Yangang Wang, and Xuebin Chi
(Computer Network Information Center at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Ant Group, China; UNSW, Australia)
Publisher's Version
DTMiner: A Data-Centric System for Efficient Temporal Motif Mining
Yinbo Hou, Hao Qi, Ligang He, Jin Zhao, Yu Zhang, Hui Yu, Longlong Lin, Lin Gu, Wenbin Jiang, Xiaofei Liao, and Hai Jin
(Huazhong University of Science and Technology, China; University of Warwick, UK; Hong Kong University of Science and Technology, China; Southwest University, China)
Publisher's Version

Optimizing Transformers

FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism
Jianxing Xu, Yuanbo Wen, Jun Bi, Ruibai Xu, Guanglin Xu, Rui Zhang, Wei Li, Ling Li, Tianshi Chen, Qi Guo, and Yunji Chen
(University of Science and Technology of China, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Cambricon Technologies, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Accelerating Sparse Transformer Inference on GPU
Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, and Qingxiao Sun
(China University of Petroleum-Beijing, China; Beihang University, China; Baidu, China; Shanghai Jiao Tong University, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
MetaAttention: A Unified and Performant Attention Framework across Hardware Backends
Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Xingda Wei, and Haibo Chen
(Shanghai Jiao Tong University, China; Peking University, China; Microsoft Research, China)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced

Matrix and Linear Algebra Algorithms

Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures
Lu Shi, WeiWei Xu, and Shaoshuai Zhang
(University of Electronic Science and Technology of China, China; Nanjing University of Information Science and Technology, China)
Publisher's Version
A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers
Xiaojian Yang, Yuhui Ni, Fan Yuan, Shengguo Li, Dezun Dong, Chuanfu Xu, Haipeng Jia, and Jie Liu
(National University of Defense Technology, China; Xiangtan University, China; University of Chinese Academy of Sciences, China)
Publisher's Version
A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability
Yuchen Ma, Bin Ren, and Andreas Stathopoulos
(College of William & Mary, USA)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing
Yuechen Lu, Hongwei Zeng, Marc Casas, and Weifeng Liu
(China University of Petroleum-Beijing, China; Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain)
Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
