2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), January 31 – February 4, 2026,
Sydney, Australia
Papers
TRACE4J: A Lightweight, Flexible, and Insightful Performance Tracing Tool for Java
Haide He and
Pengfei Su
(University of California at Merced, USA)
Java is often considered a superior programming language choice owing to its high portability, strong memory safety, and rapid development cycle. However, this superiority comes with increased complexity within Java software stacks, driven by the extensive use of layered libraries, rising levels of abstraction, and the combination of interpretation and just-in-time compilation. This complexity disconnects source code from its execution details on the underlying hardware, making it challenging to write efficient Java code. Performance tracing is key to bridging this gap by providing detailed, temporally ordered insights into a program's runtime behavior. Existing tracing approaches generally fall into two categories: (1) instrumentation, which enjoys high accuracy but incurs significant overhead, and (2) sampling, which incurs low overhead but sacrifices accuracy.
We introduce TRACE4J, a novel performance tracing tool for Java that overcomes the limitations of existing approaches. TRACE4J intelligently integrates CPU hardware facilities (performance monitoring units and breakpoints), the JVM tool interface, the Linux perf_event interface, and instruction decoding to deliver lightweight, flexible, and insightful performance tracing. It applies to unmodified Java programs, runs on standard JVMs and commodity CPUs, and provides both end-to-end and on-demand tracing, making it suitable for production environments. Through evaluation, we demonstrate TRACE4J’s ability to deliver actionable performance insights with low overhead (less than 5% time and memory impact). Using these insights, we were able to optimize several Java benchmarks and real-world applications, achieving substantial performance gains.
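The instrumentation-versus-sampling tradeoff framed above can be sketched with a toy Python tracer. This is only an illustration of the general tradeoff, not TRACE4J's mechanism, which builds on hardware PMUs, breakpoints, JVMTI, and perf_event; the function names are invented.

```python
import functools

def instrument(fn, log):
    """Instrumentation-style tracing: wrap the function so *every* call is
    recorded exactly -- accurate, but the bookkeeping runs on each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.append(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

def sample(fn, log, period):
    """Sampling-style tracing: record only every `period`-th call --
    cheap, but the events between samples are lost."""
    count = [0]
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        count[0] += 1
        if count[0] % period == 0:
            log.append(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

def leaf():
    return 1

exact, sampled = [], []
f_exact = instrument(leaf, exact)
f_sampled = sample(leaf, sampled, period=4)
for _ in range(10):
    f_exact()
    f_sampled()
assert len(exact) == 10    # every call captured, at per-call cost
assert len(sampled) == 2   # only calls 4 and 8 seen: approximate but cheap
```

The sampled trace loses 80% of the events here; tools like TRACE4J aim for instrumentation-like insight at sampling-like cost.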
Published Artifact
Enabling Spill-Free Compilation via Affine-Based Live Range Reduction Optimization
Prasanth Chatarasi,
Alex Gatea,
Wei Wang,
Chris Bowler,
Shubham Jain,
Masoud Ataei Jaliseh,
Nicole Khoun,
Alberto Mannari,
Bardia Mahjour,
Viji Srinivasan, and
Swagath Venkataramani
(IBM Research, USA; IBM, USA; IBM, Canada)
AI accelerators employ dataflow architectures to achieve impressive peak compute performance (TOPS) and processing efficiency (TOPS/W). Typically, dataflow architectures use wide data paths to connect off-chip memory to dense compute arrays (via a hierarchy of on-chip memories and vector register files), enabling efficient compute and efficient data movement with reuse. Such architectures often possess an independent, lightweight control path for loading programs and initializing registers, and lack traditional architectural features like instruction caches and execution stacks. This poses a unique challenge to the compiler, which must generate programs for complex compute kernels that fit within an instruction buffer and allocate a limited set of scalar registers without any support for spilling to memory.
This paper contributes a significant step towards spill-free compilation, proposing a live-range reduction optimization based on affine expression propagation analysis. Our solution performs a global, compiler-directed analysis that models variable values as affine expressions of in-scope variables, enabling safe symbolic re-materialization of values at their use sites from nearby variables without introducing new operations. This shortens variable lifetimes and significantly reduces register pressure without incurring binary-size or execution overhead. The static nature and regular memory access patterns of AI applications make them well suited to the proposed optimization. We demonstrate the effectiveness of the technique in the context of the IBM Spyre accelerator and its compiler. Our results over a range of AI workloads spanning transformer and CNN models demonstrate spill-free code generation, with most workloads requiring less than 50% of the available registers.
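The core idea, recomputing a long-lived value as an affine function of a variable that is live anyway, can be illustrated with a toy register-pressure model. This is a sketch of the general principle on an invented straight-line program, not the paper's analysis.

```python
def peak_pressure(program):
    """Max number of simultaneously live values in a straight-line program.
    `program` is a list of (dest, uses) pairs; a value is live from its
    defining instruction to its last use. (Toy model: variables defined
    elsewhere, such as the loop index i, are not counted.)"""
    defs, last = {}, {}
    for i, (dest, uses) in enumerate(program):
        defs[dest] = i
        for u in uses:
            last[u] = i
    return max(
        sum(1 for v in defs if defs[v] <= i <= last.get(v, defs[v]))
        for i in range(len(program))
    )

# Before: t = 4*i + 8 is computed early and stays live across unrelated code.
before = [('t', ['i']), ('a', []), ('b', []), ('c', ['a', 'b']), ('u', ['t'])]
# After: t is a known affine function of i, so it is re-materialized
# (recomputed as 4*i + 8) right before its use, collapsing its live range.
after = [('a', []), ('b', []), ('c', ['a', 'b']), ('t', ['i']), ('u', ['t'])]
assert peak_pressure(before) == 4
assert peak_pressure(after) == 3   # one fewer register needed
```

Moving the definition costs nothing extra here because `i` remains in scope at the use site, which is exactly the safety condition affine propagation establishes.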
Binary Diffing via Library Signatures
Andrei Rimsa,
Anderson Faustino da Silva,
Camilo Santana Melgaço, and
Fernando Magno Quintão Pereira
(CEFET-MG, Brazil; State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil)
Binary diffing is the problem of determining whether two binary programs originate from the same source code. Binary diffing tools are used to identify malware, plagiarism, or code theft. Many instances of binary diffing assume an adversarial setting, where a malicious actor transforms binary code by changing the compiler. Traditional diffing techniques rely on statistical similarity analysis, often leveraging stochastic models. However, recent studies have shown that these classifiers perform poorly in the face of adversarial code transformations. To address this weakness, this paper introduces a new diffing technique that is resilient against current obfuscation approaches. We propose comparing executables by matching their library signatures (libsig). A program's library signature is the sequence of calls it makes to functions outside its .text section. The proposed classifier, LibSig, is faster than off-the-shelf alternatives, such as ltrace, and, like it, works on stripped binaries or binaries running with address space layout randomization (ASLR) enabled. Furthermore, in contrast to ltrace, LibSig can be engineered to detect even library calls that bypass conventional application binary interface patterns. Our experiments on the GNU Core Utilities demonstrate that LibSig remains robust against obfuscators like Khaos and ollvm, as well as typical optimization patterns, outperforming binary diffing approaches such as SAFE, BinDiff, or Asm2Vec.
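The signature idea can be sketched in a few lines of Python. The call traces and function names below are invented for illustration; LibSig itself extracts these sequences from real binaries.

```python
def library_signature(call_trace, internal_funcs):
    """A toy 'library signature': the ordered sequence of calls a program
    makes to functions it does not define itself (cf. calls that leave the
    binary's own .text section)."""
    return [f for f in call_trace if f not in internal_funcs]

# Two builds of the same program: the obfuscator renames and rewrites the
# internal helpers, but the sequence of external library calls survives.
original   = ['parse_args', 'malloc', 'validate', 'printf', 'free']
obfuscated = ['x9q', 'malloc', 'z_17', 'printf', 'free']
assert library_signature(original, {'parse_args', 'validate'}) == \
       library_signature(obfuscated, {'x9q', 'z_17'})
```

Because obfuscators can rewrite a program's own code but must still call into external libraries to do real work, the external-call sequence makes a comparatively stable fingerprint.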
Published Artifact
PIP: Making Andersen’s Points-to Analysis Sound and Practical for Incomplete C Programs
Håvard Rognebakke Krogstie,
Helge Bahmann,
Magnus Själander, and
Nico Reissmann
(NTNU, Norway; Independent Researcher, Norway)
Compiling files individually lends itself well to parallelization, but forces the compiler to operate on incomplete programs. State-of-the-art points-to analyses guarantee sound solutions only for complete programs, requiring summary functions to describe any missing program parts. Summary functions are rarely available in production compilers, however, where soundness and efficiency are non-negotiable. This paper presents an Andersen-style points-to analysis that efficiently produces sound solutions for incomplete C programs. The analysis accomplishes soundness by tracking memory locations and pointers that are accessible from external modules, and efficiency by performing this tracking implicitly in the constraint graph. We show that implicit pointee tracking makes the constraint solver 15× faster than any combination of five different state-of-the-art techniques using explicit pointee tracking. We also present the Prefer Implicit Pointees (PIP) technique that further reduces the use of explicit pointees. PIP gives an additional speedup of 1.9×, compared to the fastest solver configuration not benefiting from PIP. The precision of the analysis is evaluated in terms of an alias-analysis client, where it reduces the number of MayAlias-responses by 40% compared to LLVM’s BasicAA pass alone. Finally, we show that the analysis is scalable in terms of memory, making it suitable for optimizing compilers in practice.
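A minimal Andersen-style worklist solver illustrates the constraint-solving core the paper builds on. This sketch omits the tracking of externally accessible pointers and locations that makes the paper's analysis sound for incomplete programs, and the constraint encoding here is a common textbook formulation, not the paper's implementation.

```python
from collections import defaultdict

def andersen(constraints):
    """Minimal Andersen-style inclusion-based points-to solver.
    Constraint kinds: ('addr', p, x)  p = &x      ('copy', p, q)  p = q
                      ('load', p, q)  p = *q      ('store', p, q) *p = q
    Iterates to a fixed point and returns the points-to sets."""
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        def add(p, objs):
            nonlocal changed
            new = objs - pts[p]
            if new:
                pts[p] |= new
                changed = True
        for kind, a, b in constraints:
            if kind == 'addr':
                add(a, {b})
            elif kind == 'copy':
                add(a, set(pts[b]))
            elif kind == 'load':
                for o in list(pts[b]):
                    add(a, set(pts[o]))
            elif kind == 'store':
                for o in list(pts[a]):
                    add(o, set(pts[b]))
    return {v: s for v, s in pts.items() if s}

# p = &x; q = p; r = &y; *q = r; s = *p   =>   x -> {y}, s -> {y}
pts = andersen([('addr', 'p', 'x'), ('copy', 'q', 'p'),
                ('addr', 'r', 'y'), ('store', 'q', 'r'),
                ('load', 's', 'p')])
assert pts['q'] == {'x'} and pts['x'] == {'y'} and pts['s'] == {'y'}
```

For an incomplete program, any pointer passed to an unknown external function would additionally have to be treated as pointing to (and reachable from) an "external" location; doing that implicitly in the constraint graph is the paper's contribution.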
Published Artifact
GRANII: Selection and Ordering of Primitives in GRAph Neural Networks using Input Inspection
Damitha Lenadora,
Vimarsh Sathia,
Gerasimos Gerogiannis,
Serif Yesil,
Josep Torrellas, and
Charith Mendis
(University of Illinois at Urbana-Champaign, USA; NVIDIA, USA)
Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). In contrast to the optimizations explored in these systems, we observe that different matrix re-associations of GNN computations lead to novel input-sensitive performance behavior. We leverage this observation to propose GRANII, a system that exposes different compositions of sparse and dense matrix primitives based on different matrix re-associations of GNN computations and selects the best among them based on input attributes. GRANII executes in two stages: (1) an offline compilation stage that enumerates all valid re-associations leading to different sparse-dense matrix compositions and uses input-oblivious pruning techniques to prune away clearly unprofitable candidates, and (2) an online runtime system that explores the remaining candidates and uses lightweight cost models to select the best re-association based on the input graph and the embedding sizes. On a wide range of configurations, GRANII achieves a geo-mean speedup of 1.56× for inference and 1.4× for training across multiple GNN models and systems. We also show that GRANII’s approach generalizes across diverse implementations and composes with techniques such as sampling.
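The input sensitivity of re-association can be sketched with a back-of-the-envelope cost model for the canonical GNN layer A·X·W. The numbers and the cost formulas are illustrative assumptions (SpMM cost proportional to nnz times dense width); GRANII's actual cost models and primitive compositions are richer.

```python
def gnn_cost(order, n, nnz, d_in, d_out):
    """Multiply-add count for A @ X @ W, where A is an n x n sparse adjacency
    matrix with nnz nonzeros, X is an n x d_in dense feature matrix, and W is
    a d_in x d_out dense weight matrix."""
    if order == '(AX)W':
        return nnz * d_in + n * d_in * d_out   # aggregate, then transform
    return n * d_in * d_out + nnz * d_out      # transform, then aggregate

# A shrinking layer (d_out < d_in) prefers transforming features first;
# an expanding layer prefers aggregating first. The winner depends on the
# input graph (nnz) and the embedding sizes -- exactly the attributes
# GRANII inspects at runtime.
args = dict(n=10_000, nnz=200_000)
assert gnn_cost('A(XW)', d_in=256, d_out=32, **args) < \
       gnn_cost('(AX)W', d_in=256, d_out=32, **args)
assert gnn_cost('(AX)W', d_in=32, d_out=256, **args) < \
       gnn_cost('A(XW)', d_in=32, d_out=256, **args)
```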
Dependence-Driven, Scalable Quantum Circuit Mapping with Affine Abstractions
Marouane Benbetka,
Merwan Bekkar,
Riyadh Baghdadi, and
Martin Kong
(École Nationale Supérieure d’Informatique, Algeria; NYU Abu Dhabi, United Arab Emirates; Ohio State University, USA)
Qubit Mapping is a critical task in Quantum Compilation, as modern Quantum Processing Units (QPUs) are constrained to nearest-neighbor interactions defined by a qubit coupling graph. This compiler pass repairs the connectivity of two-qubit gates whose operands are not adjacent by inserting SWAP gates that move the state of qubits between directly connected qubits. Deciding when to introduce SWAPs while minimizing their count is critical because the error in quantum programs increases exponentially with the circuit latency, measured in number of gates along the critical path of the circuit. Prior work for this problem relied on heuristics and exact methods that partition the circuit into two or more layers, but failed to exploit valuable dependence information in any form.
This paper introduces a novel qubit mapping algorithm based on the weight of transitive dependences. The introduced mapper models quantum circuits with affine abstractions, thereby providing the ability to compute transitive dependences. In turn, the newfound information is used to partition circuits by dependence distances and compute, efficiently, distinct weights for each layer. We evaluate the efficiency of our mapper on IBM and Rigetti QPUs, using the large datasets from the QUEKO and QASMBench benchmark suites, and against four baseline tools (QMAP, Sabre, Cirq and TKET), demonstrating notable improvements in circuit depth and swap count while delivering competitive scalability.
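The basic repair a qubit mapper performs can be sketched as SWAP insertion along a shortest path in the coupling graph. This toy illustrates the problem the paper attacks, not its dependence-driven algorithm, and it greedily routes one gate at a time without minimizing total SWAP count.

```python
from collections import deque

def shortest_path(adj, s, t):
    """BFS shortest path between physical qubits s and t (assumes t reachable)."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path = [t]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def route_gate(adj, mapping, q0, q1):
    """Make logical qubits q0 and q1 adjacent on coupling graph `adj` by
    SWAPping q0's state along a shortest path. Mutates `mapping`
    (logical -> physical) and returns the physical SWAPs inserted."""
    inv = {p: l for l, p in mapping.items()}
    path = shortest_path(adj, mapping[q0], mapping[q1])
    swaps = []
    for a, b in zip(path[:-2], path[1:-1]):
        swaps.append((a, b))
        la, lb = inv.get(a), inv.get(b)
        if la is not None:
            mapping[la] = b
        if lb is not None:
            mapping[lb] = a
        inv[a], inv[b] = lb, la
    return swaps

line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # linear coupling 0-1-2-3
mapping = {'a': 0, 'b': 3}                       # CNOT(a, b) not executable yet
swaps = route_gate(line, mapping, 'a', 'b')
assert swaps == [(0, 1), (1, 2)]                 # two SWAPs move a next to b
assert mapping['b'] in line[mapping['a']]        # now adjacent
```

Each SWAP adds gates (and thus error) to the circuit, which is why deciding where and when to insert them, as the paper does using transitive-dependence weights, matters so much.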
Published Artifact
QIGen: A Kernel Generator for Inference on Nonuniformly Quantized Large Language Models
Tommaso Pegolotti,
Dan Alistarh, and
Markus Püschel
(ETH Zurich, Switzerland; IST Austria, Austria)
Efficient inference on large language models (LLMs) has become a popular topic in both academia and industry. Roughly speaking, LLMs consist of a collection of weight matrices, and generative inference on these models essentially computes a sequence of matrix-vector products and thus can be heavily memory-bound. Consequently, much work has been devoted to reducing the size of the weights to lower bit-widths through various forms of quantization. In turn, these diverse precision formats complicate both the arithmetic and optimized kernel implementation. So far, the vast majority of implementation work for mixed-precision LLM computation has been done manually. Currently, one of the most powerful and complex scalar LLM compression techniques is nonuniform quantization, in which a matrix is divided unevenly into parts that are quantized with different bit-widths, minimizing the output compression error. In this paper, we present QIGen, the first kernel generator for LLM inference on CPUs to support nonuniform quantization in full generality. Given a nonuniformly quantized LLM and target CPU characteristics, QIGen first generates the diverse set of needed custom matrix-vector product kernels and combines them with a suitable storage format. We benchmark and analyze QIGen-generated code in various experiments. In particular, we show that our code achieves Pareto optimality in terms of performance and accuracy with respect to the most used open-source tool. We show a speedup of up to 1.3× for the matrix-vector and 3.4× for the matrix-matrix computations even when using uniform quantization.
DyPARS: Dynamic-Shape DNN Optimization via Pareto-Aware MCTS for Graph Variants
Hao Qian,
Guangli Li,
Qiuchu Yu,
Xueying Wang, and
Jingling Xue
(UNSW, Australia; Institute of Computing Technology at Chinese Academy of Sciences, China; Beijing University of Posts and Telecommunications, China)
Dynamic-shape DNNs are widely used in applications such as variable-resolution image processing and language modeling with variable-length sequences. Existing DL (Deep-Learning) compilers apply rule-based rewriting to either transform a subgraph into a fixed variant at compile time (leading to sub-optimal performance) or generate multiple variants at runtime, incurring significant overhead. The challenge is discovering and applying shape-dependent subgraph variants that maintain high efficiency across diverse inputs with minimal runtime cost.
We propose DyPARS, a dynamic-shape DL compiler approach that discovers high-performance subgraph variants at compile time and applies the best ones at runtime. Leveraging Pareto-aware MCTS, DyPARS identifies shape-aware variants, incorporating shape-dependent kernel adaptations. These variants are integrated into a prediction-enhanced computational graph, enabling efficient variant selection based on input shapes with minimal overhead. DyPARS achieves average speedups of 1.31× and 1.80× over TorchInductor (JIT) and BladeDISC (non-JIT), respectively, across five DNN models, demonstrating robust efficiency across diverse inputs.
Published Artifact
Synthesizing Specialized Sparse Tensor Accelerators for FPGAs via High-Level Functional Abstractions
Hamza Javed and
Christophe Dubach
(McGill University, Canada)
Sparsity is inherent in many applications such as machine learning and graph analytics. However, achieving high efficiency in sparse computations requires specialized hardware accelerators like FPGAs, as traditional accelerators typically cater to dense data. While high-level synthesis enables the automatic generation of FPGA-based accelerators, generic solutions produced via C-based synthesis flows often demand extensive development time, leading designers to prioritize broad applicability over fine-grained structural specialization. Consequently, these accelerators fail to fully exploit the FPGA’s reconfigurability, leaving substantial performance and efficiency gains untapped.
This paper pushes the boundary by automatically generating specialized accelerators that match a given fixed sparse structure (e.g., in static graph analytics and pruned neural networks). It achieves this by leveraging functional abstractions within high-level synthesis, an approach that has already proven effective in automating the generation of specialized dense tensor accelerators. Tensor shapes are encoded directly in the type system and specialized primitives for irregular data are introduced. Together, these innovations enable a concise specification of sparse accelerators and drive advanced optimizations—including dynamic partitioning and vector sharding—to produce hardware precisely tailored to the sparsity pattern of the underlying tensors.
Compared to state-of-the-art generic accelerators (HiSparse, HiSpMV and GraphLily), the approach achieves up to a 2.8× improvement in bandwidth efficiency for sparse matrix computations and a 1.8× speedup on graph algorithms. Against the hls4ml neural network acceleration framework, it achieves up to a 1.8× improvement in throughput with a 4× reduction in resource usage, enabling scaling to larger networks. These results establish this approach as a flexible, powerful, and rapid solution for designing high-performance specialized sparse accelerators.
Flow-Graph-Aware Tiling and Rescheduling for Memory-Efficient On-Device Inference
Yeonoh Jeong,
Taehyeong Park, and
Yongjun Park
(Yonsei University, Republic of Korea)
With the increasing popularity of artificial intelligence (AI) applications, deep neural networks (DNNs) are in demand for on-device serving in various real-life fields.
Running DNN inference on a resource-constrained edge device requires aggressive memory optimization.
While several recent tiling-based techniques reduce peak memory usage by partitioning large tensors into micro-tensors, they are specialized for the MCU environment and do not scale to other edge platforms.
Moreover, they greedily select tiling targets during partitioning without considering the memory flow across the whole model.
In this paper, we propose OKO, a compiler-based optimization technique that minimizes peak memory usage by considering both tiling and the corresponding operation rescheduling.
OKO estimates the memory savings from the tiling method based on the lifetime and dependencies of the tensor, and then algorithmically selects the optimal tiling strategy.
It further maximizes the reuse of memory spaces by efficiently reordering operations and immediately releasing unnecessary tensors.
Evaluations on various edge devices show that OKO achieves effective memory savings of up to 80% and an average of 59% with no loss of accuracy and negligible overhead, supporting memory-efficient inference across a broad range of target devices.
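The objective OKO optimizes, peak footprint under tensor lifetimes with eager release, can be sketched with a toy estimator. The schedules and sizes below are invented, and OKO's actual tiling and rescheduling algorithm is far more involved.

```python
def peak_memory(schedule, sizes, live_inputs=()):
    """Peak memory of a schedule given as (output, inputs) ops. A tensor is
    allocated when produced and released immediately after its last use
    (eager release); `live_inputs` are external tensors live from the start."""
    last = {}
    for i, (out, ins) in enumerate(schedule):
        for t in ins:
            last[t] = i
    live = {t: sizes[t] for t in live_inputs}
    peak = sum(live.values())
    for i, (out, ins) in enumerate(schedule):
        live[out] = sizes[out]
        peak = max(peak, sum(live.values()))
        for t in list(live):
            if last.get(t, -1) == i and t != out:
                del live[t]
    return peak

sizes = {'x': 4, 'a': 8, 'b': 8, 'c': 8, 'd': 1}
# Same dataflow, two legal orders: scheduling c's branch late keeps fewer
# large tensors alive at once.
order1 = [('a', ['x']), ('b', ['a']), ('c', ['x']), ('d', ['b', 'c'])]
order2 = [('c', ['x']), ('a', ['x']), ('b', ['a']), ('d', ['b', 'c'])]
assert peak_memory(order1, sizes, ('x',)) == 20
assert peak_memory(order2, sizes, ('x',)) == 24
```

Rescheduling alone changes the peak by 20% here; combining it with tiling (shrinking the per-step tensor sizes) is what yields OKO's larger savings.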
VFlatten: Selective Value-Object Flattening using Hybrid Static and Dynamic Analysis
Arjun H. Kumar,
Bhavya Hirani,
Hang Shao,
Tobi Ajila,
Vijay Sundaresan,
Daryl Maier, and
Manas Thakur
(IIT Mandi, India; Sardar Vallabhbhai National Institute of Technology, Surat, India; IBM, Canada; IIT Bombay, India)
Object flattening is a non-trivial optimization that inlines the fields of an object inside its containers. Owing to its direct applicability to immutable objects, Java will soon allow programmers to mark compatible classes as “value types”, and Java Virtual Machines (JVMs) to transparently flatten their instances (value objects). Expectations include reduced memory footprint, faster field access, and overall improvement in performance. This paper describes the surprises and challenges we faced while experimenting with value types and object flattening on a real-world JVM, and presents the design of an efficient strategy that selectively flattens profitable value objects, using a novel combination of static and dynamic analyses.
Our value-object flattening strategy is based on insights that span the source program, the just-in-time (JIT) compiler employed by the JVM, as well as the underlying hardware. The first insight identifies source-level patterns that favour and oppose value-object flattening. The second insight finds an interesting dependence of object flattening on object scalarization, and estimates the capability of the JIT in avoiding overheads using escape analysis. Finally, the third insight correlates container objects with cache-line size, based on the load semantics of object fields. In order to develop an efficient strategy to flatten potentially profitable objects, we capture these insights in a tool called VFLATTEN that uses a novel combination of static and dynamic analyses and flattens value objects selectively in a production Java runtime.
Published Artifact
Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs
Xiao Zhang,
Yaoyao Ding,
Bolin Sun,
Yang Hu,
Tatiana Shpeisman, and
Gennady Pekhimenko
(University of Toronto, Canada; NVIDIA, Canada; Vector Institute, Canada)
Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping functions describing tensors on GPUs are known as tensor layouts. Low-level programming frameworks, such as CUTLASS and Hidet, provide expressive layout abstractions but often require considerable programming effort to manually specify optimal layouts. High-level GPU programming languages, such as Triton, rely on compiler heuristics to generate dataflow, layouts, and pipelining strategies in GPU programs. However, the heuristics for dataflow and pipelining strategies are not generalizable to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. This approach enables systematic exploration of optimal layouts and instructions.
Our evaluation shows that Hexcute matches the performance of libraries like cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing the amount of code by 1.27×-7.94× compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46× over Triton. In the end-to-end evaluations of vLLM, Hexcute delivers up to 2.60× speedup on DeepSeek-R1-AWQ and 2.04× on a Mamba-based model.
Published Artifact
BIT: Empowering Binary Analysis through the LLVM Toolchain
Puzhuo Liu,
Peng Di,
Jingling Xue, and
Yu Jiang
(Ant Group, China; Tsinghua University, China; UNSW, Australia)
Binary analysis plays a critical role in software comprehension and security analysis, especially when source code is unavailable or difficult to analyze. Lifting binaries to LLVM IR enables reuse of the rich LLVM toolchain for downstream binary analyses. However, existing binary lifters often fail to produce syntactically valid LLVM IR or to restore sufficient semantics, making downstream analyses unreliable or infeasible. This paper introduces BIT, a novel binary lifter designed to ensure syntactic compliance as well as semantic adequacy. BIT achieves this through a multistage approach that includes anchor variable identification, analysis context collection, and IR refinement. In our evaluation, BIT achieved excellent results across multiple downstream analyses when compared with various lifters. In static analysis, its F1 score for bug detection is 0.85, better than Plankton’s 0.81; in symbolic execution, it outperforms McSema by 3,049× in path exploration and by 1.36× in test case generation; in reanalysis, BIT completes all tasks and is consistent with the state-of-the-art McSema. These results highlight BIT’s ability to bridge the gap between binary-level analysis and the LLVM toolchain.
Multidirectional Propagation of Sparsity Information across Tensor Slices
Kaio Andrade,
Danila Seliayeu,
J. Nelson Amaral, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil; University of Alberta, Canada)
A computational graph in a tensor compiler represents a program as a set of kernels connected by edges that describe how tensors flow between them. Tensors may be dense or sparse, the latter storing only nonzero values. Recognizing sparsity allows the compiler to avoid redundant computation and reduce memory usage. This paper introduces Sparsity Propagation Analysis (SPA), a static analysis that conservatively infers sparsity in tensor slices: sets of tensor elements that share a fixed subset of indices. SPA estimates which slices can be treated as zero, via forward, backward, and lateral propagation of sparsity information. SPA applies to tensors of arbitrary dimension and to operators defined over general semirings (e.g., addition and multiplication, MAX and MIN, logical AND and OR). Its runtime grows with the number of slices rather than the number of elements, making it asymptotically faster than executing even a single iteration of the graph. We implement SPA in the Tensor Algebra Compiler (TACO) and evaluate it on benchmarks from the Einsum Collection. Results indicate that multidirectional propagation increases the precision of the sparsity analysis, reduces memory consumption, and enables optimizations that improve runtime performance.
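For elementwise kernels, the forward propagation rules follow directly from semiring algebra: the additive identity 0 annihilates the semiring product, so a product slice is zero if either input slice is, while a sum slice is provably zero only when both are. This is a simplification of SPA, which also propagates backward and laterally and handles contractions.

```python
def forward_zero_slices(op, za, zb):
    """Forward-propagate 'this slice is entirely the semiring zero' facts
    through an elementwise binary kernel. za, zb are sets of slice indices
    (e.g. row numbers) known to be all-zero in each input."""
    return za | zb if op == 'mul' else za & zb

# Soundness check on concrete rows over the ordinary (+, *) semiring.
rows_a = [[0, 0], [1, 2], [0, 0]]
rows_b = [[3, 0], [0, 0], [0, 5]]
za = {i for i, r in enumerate(rows_a) if not any(r)}
zb = {i for i, r in enumerate(rows_b) if not any(r)}
actual_zero = {i for i in range(3)
               if not any(x * y for x, y in zip(rows_a[i], rows_b[i]))}
assert forward_zero_slices('mul', za, zb) <= actual_zero   # conservative
assert forward_zero_slices('add', za, zb) == set()
```

The same rules hold verbatim for (OR, AND) over booleans or (MIN, +) over the tropical semiring, which is why stating the analysis over general semirings costs nothing in precision.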
Published Artifact
On the Precision of Dynamic Program Fingerprints Based on Performance Counters
Anderson Faustino da Silva,
Sérgio Queiroz de Medeiros,
Marcelo Borges Nogueira,
Jeronimo Castrillon, and
Fernando Magno Quintão Pereira
(State University of Maringá, Brazil; Federal University of Rio Grande do Norte, Brazil; TU Dresden, Germany; Federal University of Minas Gerais, Brazil)
Task classification is the challenge of determining whether two binary programs perform the same task. This problem is essential in scenarios such as malware identification, plagiarism detection, and redundancy elimination. Classification can be performed statically or dynamically. In the former case, the classifier analyzes the binary image of the program, whereas in the latter it observes the program's execution. Recent research has demonstrated that dynamic classification is more accurate, particularly in adversarial settings where programs may be obfuscated. This remains true even when both classifiers use the exact same representation of programs, such as histograms of instruction opcodes. The superior accuracy of dynamic classification stems from its ability to disregard dead code inserted during the obfuscation process. However, state-of-the-art dynamic techniques, such as Valgrind plugins, can slow down program execution by as much as 100 times due to binary instrumentation. This paper proposes to eliminate this overhead by replacing program instrumentation with the sampling of hardware performance counters. Our findings reveal both advantages and limitations of this approach. On the positive side, classifiers based on hardware counters impose almost no runtime overhead while retaining greater accuracy than purely static classifiers, particularly in the presence of obfuscation. On the downside, counter-based classifiers are slightly less accurate than instrumentation-based approaches and offer coarser granularity, being limited to whole-program classification rather than individual functions. Despite these limitations, our results challenge the conventional belief that dynamic code classifiers are too costly to be deployed in environments such as online servers, operating systems, and virtual machines.
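Why dynamic histograms resist dead-code obfuscation can be shown with a toy similarity check. The histograms below are invented, and real fingerprints use richer event vectors than two opcodes, but the effect is the same: junk code inflates the static histogram while the dynamic profile never sees it.

```python
from math import sqrt

def cosine(h1, h2):
    """Cosine similarity between two opcode/event histograms."""
    keys = set(h1) | set(h2)
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in keys)
    n1 = sqrt(sum(v * v for v in h1.values()))
    n2 = sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2)

clean       = {'add': 100, 'mul': 50}              # original program
obf_static  = {'add': 100, 'mul': 50, 'xor': 500}  # dead xor junk in the binary
obf_dynamic = {'add': 100, 'mul': 50}              # dead code never executes
assert cosine(clean, obf_dynamic) > cosine(clean, obf_static)
```

Counter-based sampling buys the dynamic histogram at near-zero overhead, at the cost of whole-program granularity, which is exactly the tradeoff the paper quantifies.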
Published Artifact
Automatic Data Enumeration for Fast Collections
Tommy McMichen and
Simone Campanoni
(Northwestern University, USA; Google, USA)
Data collections provide a powerful abstraction to organize data, simplifying development and maintenance. Choosing an implementation for each collection is a critical decision, with performance, memory and energy tradeoffs that need to be balanced for each use case. Specialized implementations offer significant benefits over their general-purpose counterparts, but also require certain properties of the data they store, such as uniqueness or ordering. To employ them, developers must either possess domain knowledge or transform their data to exhibit the desired property, which is a tedious, manual process. One such transformation—commonly used in data mining and program analysis—is data enumeration, where data items are assigned unique identifiers to enable fast equality checks and compact memory layout. In this paper, we present an automated approach to data enumeration, eliminating the need for manual developer effort. Our implementation in the MEMOIR compiler achieves speedups of 2.16× on average (up to 8.72×) and reduces peak memory consumption by 5.6% on average (up to 50.7%). This work shows that automated techniques can manufacture data properties to unlock specialized collection implementations, pushing the envelope of collection-oriented optimization.
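The underlying transformation, replacing structured items with dense integer identifiers, looks like the following hand-written Python; the paper's MEMOIR-based pass automates the equivalent rewrite in compiled code, so this sketch shows only the manual pattern it eliminates.

```python
class Enumerator:
    """Assign each distinct item a dense integer id, enabling O(1) integer
    equality checks and compact array/bitset-backed collections instead of
    deep structural comparisons."""
    def __init__(self):
        self.ids = {}
        self.items = []

    def enumerate(self, item):
        if item not in self.ids:
            self.ids[item] = len(self.items)
            self.items.append(item)
        return self.ids[item]

    def lookup(self, i):
        return self.items[i]

e = Enumerator()
a = e.enumerate(('x', (1, 2, 3)))
b = e.enumerate(('x', (1, 2, 3)))  # structurally equal -> same id
assert a == b == 0
assert e.lookup(a) == ('x', (1, 2, 3))
```

Once items are enumerated, a set of them becomes a set of small integers, which is the property that unlocks the specialized collection implementations the abstract mentions.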
Published Artifact
Compiler-Runtime Co-operative Chain of Verification for LLM-Based Code Optimization
Hyunho Kwon,
Sanggyu Shin,
Ju Min Lee,
Hoyun Youm,
Seungbin Song,
Seongho Kim,
Hanwoong Jung,
Seungwon Lee, and
Hanjun Kim
(Yonsei University, Republic of Korea; SAIT, Republic of Korea)
Large Language Models (LLMs) have recently shown promise in compiler optimizations such as loop vectorization and memory access restructuring. However, due to their generative nature, LLM-optimized code may contain syntax errors or semantic inconsistencies. While state-of-the-art compilers using LLMs employ symbolic verification to ensure correctness, they fail to fully utilize LLM-based optimizations due to the limited and unreliable verification coverage. This work introduces CoV, a compiler-runtime co-operative Chain of Verification framework that safely integrates LLM-based code transformations into modern compilation workflows. CoV employs a multi-stage verification pipeline that begins with lightweight static checks such as syntax validation and profiling-based checksum filtering, and then applies symbolic equivalence verification using tools like Alive2. For code fragments that cannot be statically verified, CoV inserts runtime verification mechanisms to ensure correctness during execution. These runtime checks are optimized through verification parallelization and batching to minimize overhead. This work implements a prototype CoV framework atop an LLM-based automatic vectorizer within LLVM, and evaluates it using 151 loops in the TSVC benchmark suite and three realistic applications. CoV expands vectorization coverage by 13.9% and 10.6% over LLVM and GCC -O3 vectorization, respectively. In addition, CoV successfully vectorizes loops in three realistic applications that are not handled by the -O3 vectorization.
Eliminating Redundancy: Ultra-compact Code Generation for Programmable Dataflow Accelerators
Prasanth Chatarasi,
Alex Gatea,
Bardia Mahjour,
Jintao Zhang,
Alberto Mannari,
Chris Bowler,
Shubham Jain,
Masoud Ataei Jaliseh,
Nicole Khoun,
Kamlesh Kumar,
Viji Srinivasan, and
Swagath Venkataramani
(IBM Research, USA; IBM, USA; IBM, Canada; Unaffiliated, USA)
Modern AI accelerators adopt dataflow architectures to achieve both high peak throughput (TOPS) and energy efficiency (TOPS/W). These designs feature wide datapaths and hierarchical scratchpad memories that supply dense compute arrays with high-bandwidth data access and extensive operand reuse. Complementing the compute–memory subsystem is a lightweight control path that orchestrates data movement, program loading, and register initialization. To reduce energy and area overheads, conventional processor features—such as instruction caches, execution stacks, and branch speculation—are deliberately omitted. While this streamlined design maximizes efficiency, it shifts a critical responsibility onto the compiler: transforming complex kernels into highly compact instruction streams that must fit entirely within the limited instruction buffers (IBUFFs) of the accelerator’s programmable units.
In this paper, we introduce two novel compiler transformations, Loop Absorption (LA) and Loop Index Set Merging (LISM), for ultra-compact code generation. Loop Absorption merges isomorphic sibling operations into a single loop body, while LISM unifies adjacent loops with similar bodies into a single iteration space. Together, these complementary techniques eliminate redundant code patterns and produce compact hierarchical loop nests. We implement LA and LISM in the IBM Spyre compiler and evaluate them on diverse deep-learning workloads including ResNet-50, Inception-v3, SSD, and BERT-Large. Across these models, our combined approach achieves a geometric mean compression of 1.48× over the baseline, enabling layers that previously exceeded IBUFF capacity to compile successfully.
Dr.avx: A Dynamic Compilation System for Seamlessly Executing Hardware-Unsupported Vectorization Instructions
Yue Tang,
Mianzhi Wu,
Yufeng Li,
Haoyu Liao,
Jianmei Guo, and
Bo Huang
(East China Normal University, China)
Modern processors are breaking a fundamental rule: backward compatibility within their own ISA families. We term this Generational ISA Fragmentation (GIF), where newer processors cannot execute instructions supported by prior generations within the same ISA family. This phenomenon is exemplified by Intel’s removal of AVX-512 from Alder Lake processors after years of deployment, ARM’s inconsistent support for SVE across cores, and RISC-V’s incompatible vector specifications. GIF causes illegal instruction crashes when running applications optimized for earlier processors on newer hardware, threatening the foundation of software portability that has underpinned decades of computing evolution.
We introduce Dr.avx, a dynamic compilation system that enables seamless execution of AVX-512 instructions on hardware that lacks native support. Dr.avx addresses the most instructive GIF instance, x86 AVX-512 fragmentation, by targeting the integer and floating-point operations that dominate real workloads. Our rewrite engine performs a fine-grained classification of AVX-512 opcode-operand patterns and employs three complementary strategies: Instr Mirroring, AVX Lowering, and Scalar Fallback. Experiments show that Dr.avx incurs a geometric mean overhead of 1.44× on SPEC CINT2017 relative to native AVX-512 execution, 17.3% better than Intel’s closed-source SDE. On production databases, Dr.avx sustains 75%–88% (MySQL) and 86%–99% (MongoDB) of native throughput, yielding 2.0–2.7× higher throughput than SDE. For LLM inference (llama.cpp), Dr.avx keeps 95%–99% of native tokens/s and delivers 2.5–4.8× speedup over SDE. Unlike Intel’s proprietary SDE, which provides no visibility into its implementation details, Dr.avx achieves functional correctness while providing an open, extensible, near-native performance implementation. Our work offers both a remedy for AVX-512 fragmentation and a blueprint for addressing similar compatibility challenges emerging across all major ISAs.
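The "Scalar Fallback" strategy can be illustrated with a toy emulation: a 512-bit packed 32-bit-integer add (in the spirit of `vpaddd` on zmm registers) decomposed into 16 scalar lane adds with hardware-style wraparound. This is an editorial sketch, not Dr.avx's actual rewrite engine.

```python
# Toy "Scalar Fallback": emulate a 512-bit packed 32-bit add as 16
# scalar lane adds (illustration only, not Dr.avx's implementation).

LANES = 16           # 512 bits / 32-bit elements
MASK32 = 0xFFFFFFFF  # 32-bit wraparound, matching packed-add semantics

def vpaddd_scalar_fallback(zmm_a, zmm_b):
    assert len(zmm_a) == len(zmm_b) == LANES
    return [(a + b) & MASK32 for a, b in zip(zmm_a, zmm_b)]

a = list(range(16))
b = [MASK32] * 16    # adding 0xFFFFFFFF wraps: lane i becomes i - 1 mod 2^32
print(vpaddd_scalar_fallback(a, b)[:3])  # [4294967295, 0, 1]
```

A real rewrite engine would instead prefer "Instr Mirroring" or "AVX Lowering" where narrower vector units exist, reserving a fallback like this for patterns with no vector equivalent.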
PriTran: Privacy-Preserving Inference for Transformer-Based Language Models under Fully Homomorphic Encryption
Yuechen Mu,
Guangli Li,
Shiping Chen, and
Jingling Xue
(UNSW, Australia; Institute of Computing Technology at Chinese Academy of Sciences, China; CSIRO’s Data61, Australia)
Transformer-based language models power many cloud services, but inference on sensitive data raises confidentiality concerns. Fully Homomorphic Encryption (FHE) enables computation on encrypted inputs while preserving privacy, but at high computational cost, making Transformers difficult to deploy. This paper presents PriTran, an efficient CKKS-based library for privacy-preserving Transformer inference on CPUs. Complementing the only prior work, RoLe, which supports only BERT-Tiny (2 encoders), PriTran introduces two novel algorithms with optimized data layouts that accelerate ciphertext–plaintext (CP) and ciphertext–ciphertext (CC) matrix multiplications (MMs) across all BERT models by reducing costly rotations and multiplications. On the MNLI dataset, RoLe fails on inputs longer than 36 tokens within a 5-hour per-token budget, while PriTran achieves average speedups of 29.3% and 22.2% for CP- and CC-MMs, respectively, and 24.1% end-to-end. We further evaluate PriTran on scaled BERT-Tiny variants with additional encoders and on BERT-Mini (4 encoders), demonstrating correctness and scalability beyond RoLe's limits. Within current FHE limits, these gains and RoLe's failure on longer inputs underscore PriTran's promise as a practical approach for FHE-based Transformer inference.
Partial-Evaluation Templates: Accelerating Partial Evaluation with Pre-compiled Templates
Florian Huemer,
Aleksandar Prokopec,
David Leopoldseder,
Raphael Mosaner, and
Hanspeter Mössenböck
(JKU Linz, Austria; Oracle Labs, Zurich, Switzerland; Oracle Labs, Vienna, Austria; Oracle Labs, Linz, Austria)
Partial evaluation allows specializing programs by substituting some of their inputs with static values and evaluating the resulting static expressions. When applied to a language interpreter, it specializes the interpreter to a fixed source program, resulting in a specialized interpreter that existing compilers can optimize and generate efficient machine code for. Although this approach achieves good peak performance in just-in-time compilers, partial evaluation is costly with respect to compilation time. In the case of bytecode interpreters, partial evaluation must process and transform each bytecode instruction. Since instructions with the same opcodes share common functionality, partial evaluation applies the same transformations repeatedly. Consequently, partial evaluation significantly increases the compilation time of just-in-time compilation.
We propose partial-evaluation templates, i.e., reusable collections of ahead-of-time-compiled graphs that are generated by a hand-written cogen approach. The templates are parametric in their static inputs, and cover common functionality in a bytecode interpreter implementation. At compile time, templates specialize themselves by substituting their static inputs, reducing themselves to a single pre-compiled graph that the compiler inlines and optimizes. This reduces the size of the intermediate representation and speeds up partial evaluation, as well as optimizations.
We integrated our approach into GraalVM, a state-of-the-art high-performance polyglot virtual machine, and GraalWasm, a bytecode-interpreter-based WebAssembly runtime. We extended the existing GraalVM compiler with a binding-time analysis that generates the templates and introduced a specializer for the templates into the existing partial evaluator. We enabled template generation for nearly all opcodes in GraalWasm, which reduced partial-evaluation time by up to 36% and warmup time by up to 17% without impacting peak performance.
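The core idea of a template that is parametric in its static inputs can be sketched with a Python closure analogy (ours, not GraalVM's cogen machinery): substituting the static input once yields a residual function with no interpreter dispatch left, ready to be inlined and reused.

```python
# Toy analogy for a partial-evaluation template (not GraalVM's
# implementation): a template for an "ADD-constant" bytecode is
# parametric in its static input and specializes to residual code.

def add_const_template(static_const):
    """Specialize the template by substituting its static input,
    yielding a single pre-compiled residual function."""
    def specialized(x):          # residual code: no dispatch remains
        return x + static_const
    return specialized

# Plain partial evaluation would re-derive this transformation for
# every ADD 5 bytecode; the template substitutes the static input once.
add5 = add_const_template(5)
print(add5(37))  # 42
```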
FHEFusion: Enabling Operator Fusion in FHE Compilers for Depth-Efficient DNN Inference
Tianxiang Sui,
Jianxin Lai,
Long Li,
Peng Yuan,
Yan Liu,
Qing Zhu,
Xiaojing Zhang,
Linjie Xiao,
Mingzhe Zhang, and
Jingling Xue
(Ant Group, China; UNSW, Australia)
Operator fusion is essential for accelerating FHE-based DNN inference because it reduces multiplicative depth and, in turn, lowers the cost of ciphertext operations by keeping them at lower ciphertext levels. Existing approaches either rely on manual optimizations, which miss cross-operator opportunities, or on compiler pattern matching, which lacks generality. Standard DNN graphs omit FHE-specific behaviors, while fully lowering to primitive FHE operations introduces excessive granularity and obstructs effective optimization.
We present FHEFusion, a compiler framework for the CKKS scheme that enables fusion through a new IR. This IR preserves high-level DNN semantics while introducing FHE-aware operators—masking and compaction (Strided_Slice)—that are central to CKKS, thereby exposing broader fusion opportunities. Guided by algebraic rules and an FHE-aware cost model, FHEFusion reduces multiplicative depth and identifies profitable fusions. Integrated into ANT-ACE, a state-of-the-art FHE compiler, FHEFusion outperforms nGraph, the only framework with graph-level fusion, achieving up to 3.02× (average 1.40×) speedup across seven DNNs (13 variants from different ReLU approximations) on CPUs, while maintaining inference accuracy.
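Why fusion saves multiplicative depth can be shown with a toy depth accounting (an editorial sketch, not ANT-ACE's cost model): only multiplies that touch a ciphertext consume a CKKS level, so folding adjacent plaintext factors (e.g., a mask into a weight matrix) removes a level.

```python
# Toy CKKS multiplicative-depth accounting. Expressions are nested
# ("mul"|"add", lhs, rhs) tuples over "ct"/"pt" leaves; plaintext-only
# subtrees fold at compile time and consume no level.

def mul_depth(expr):
    """Return (multiplicative depth, is-plaintext-only) for expr."""
    if isinstance(expr, str):
        return 0, expr == "pt"
    op, lhs, rhs = expr
    dl, pl = mul_depth(lhs)
    dr, pr = mul_depth(rhs)
    pt_only = pl and pr
    depth = max(dl, dr)
    if op == "mul" and not pt_only:   # plaintext folds are free
        depth += 1
    return depth, pt_only

# Unfused: mask the ciphertext, then multiply by weights -> two levels.
unfused = ("mul", ("mul", "ct", "pt"), "pt")
# Fused: fold mask and weights into one plaintext first -> one level.
fused = ("mul", "ct", ("mul", "pt", "pt"))
print(mul_depth(unfused)[0], mul_depth(fused)[0])  # 2 1
```

Keeping ciphertexts one level lower makes every subsequent ciphertext operation cheaper, which is the source of the end-to-end speedups the paper reports.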
Proton: Towards Multi-level, Adaptive Profiling for Triton
Keren Zhou,
Tianle Zhong,
Hao Wu,
Jihyeong Lee,
Yue Guan,
Yufei Ding,
Corbin Robeck,
Yuanwei Fang,
Jeff Niu, and
Philippe Tillet
(George Mason University, USA; OpenAI, USA; University of Virginia, USA; University of California at San Diego, USA; Meta, USA)
Domain-specific languages (DSLs) such as Triton enable developers to write high-performance GPU kernels in a Python-friendly manner; however, profiling these kernels with existing tools often incurs runtime and storage overhead while failing to deliver actionable insights for both kernel authors and framework developers.
We present Proton, a multi-level, adaptive profiler tailored for the Triton programming language and compiler. Proton provides frontend APIs to selectively profile relevant regions, aggregate results, capture custom metrics not available through hardware counters, and query profiles using a SQL-like language. Proton's backend design unifies vendor profiling APIs with instrumentation-based profiling, ensuring portability and extensibility. Using Proton, users are able to query custom and hardware metrics across the relevant levels of abstraction—full end-to-end model execution, isolated neural network layers, language-specific Triton operators, and compiler intermediate representations. We demonstrate the tool's effectiveness through case studies on production-grade kernel development, continuous integration, multi-GPU analysis, language model inference, and intra-kernel profiling. Our evaluations on end-to-end workloads, as well as standalone Triton kernels, demonstrate that Proton imposes lower runtime overhead and delivers significant reductions in profile sizes relative to existing framework and vendor profilers while being fully open source.
Published Artifact
Unlocking Python Multithreading Capabilities using OpenMP-Based Programming with OMP4Py
César Piñeiro and
Juan C. Pichel
(University of Santiago de Compostela, Spain)
Python exhibits inferior performance relative to traditional high-performance computing (HPC) languages such as C, C++, and Fortran. This performance gap is largely due to Python's interpreted nature and the Global Interpreter Lock (GIL), which restricts multithreading efficiency. However, the introduction of a GIL-free variant in the Python interpreter opens the door to more effective exploitation of multithreading parallelism in Python. Based on this important new feature, we introduce OMP4Py with the aim of bringing OpenMP's familiar directive-based parallelization paradigm to Python. Its dual-runtime architecture design combines the benefits of a pure Python implementation with the performance and low-level capabilities required to maximize efficiency in compute-intensive tasks. In this way, OMP4Py offers both full Python support and the high performance required by HPC workloads.
Published Artifact
Towards Path-Aware Coverage-Guided Fuzzing
Giacomo Priamo,
Daniele Cono D'Elia,
Mathias Payer, and
Leonardo Querzoni
(Sapienza University of Rome, Italy; EPFL, Switzerland)
Automated fuzz testing is now standard practice, yet key blind spots persist. Coverage-guided fuzzers typically rely on edge coverage as a lightweight proxy for program behavior. However, this metric captures path variations only weakly: it cannot differentiate executions that follow distinct control-flow paths but traverse the same edges—causing many path-dependent bugs to go undetected. Path awareness would offer a richer coverage view but has been considered too costly for fuzzing.
We introduce a lightweight method for tracking intra-procedural execution paths, enabling efficient path-aware feedback. This enhances the fuzzer’s ability to detect subtle bugs, even in well-tested software. To counter the resulting seed explosion, we evaluate two strategies—culling and opportunistic path-aware fuzzing—that balance precision and throughput. Our findings show that path-aware fuzzing, when properly guided, uncovers more bugs and reveals untapped potential in fuzzing research.
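The blind spot the paper targets can be shown concretely: two executions can traverse exactly the same set of CFG edges yet follow different intra-procedural paths, so edge coverage cannot distinguish them while a rolling path hash can. The encoding below is an editorial toy, not the paper's implementation.

```python
# Edge coverage vs. path awareness on a loop: same edge set, different
# paths. Basic blocks are single-letter names in a recorded trace.

def edge_set(trace):
    return set(zip(trace, trace[1:]))

def path_hash(trace):
    h = 0
    for block in trace:                        # toy rolling hash
        h = (h * 31 + ord(block)) & 0xFFFFFFFF
    return h

run1 = ["A", "B", "A", "C"]                    # loop body executed once
run2 = ["A", "B", "A", "B", "A", "C"]          # loop body executed twice

print(edge_set(run1) == edge_set(run2))        # True: same edge feedback
print(path_hash(run1) == path_hash(run2))      # False: paths distinguished
```

An edge-coverage fuzzer would discard `run2` as uninteresting; path-aware feedback keeps it, at the cost of the seed explosion the culling and opportunistic strategies are designed to contain.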
Published Artifact
The Parallel-Semantics Program Dependence Graph for Parallel Optimization
Yian Su,
Brian Homerding,
Haocheng Gao,
Federico Sossai,
Yebin Chon,
David I. August, and
Simone Campanoni
(Northwestern University, USA; Argonne National Laboratory, USA; Princeton University, USA)
Modern shared-memory parallel programming models, such as OpenMP and Cilk, enable developers to encode a parallel execution plan within their code. Existing compilers, including Clang and GCC, directly lower or add additional compatible parallelism on top of the developers' plan. However, when better parallel execution plans exist that are incompatible with the original plan, compilers lack the ability to disregard it and replace it with a better one. To address this problem, this paper introduces the parallel-semantics program dependence graph (PS-PDG), an extension of the program dependence graph (PDG) abstraction that can simultaneously represent parallel semantics derived from both the developer's original plan and the compiler's own analysis. To demonstrate the power of PS-PDG, this paper also introduces GINO, an LLVM-based compiler capable of optimizing parallel execution plans using PS-PDG. Through exploring, reasoning about, and implementing better parallel execution plans unlocked by the PS-PDG, GINO outperforms the developer's original parallel execution plan by up to 46.6%, and by 15% on average, over 56 cores across 8 benchmarks from the NAS benchmark suite.
Published Artifact
PASTA: A Modular Program Analysis Tool Framework for Accelerators
Mao Lin,
Hyeran Jeon, and
Keren Zhou
(University of California at Merced, USA; George Mason University, USA)
The increasing complexity and diversity of hardware accelerators in modern computing systems demand flexible, low-overhead program analysis tools. We present PASTA, a low-overhead and modular Program AnalysiS Tool Framework for Accelerators. PASTA abstracts over low-level profiling APIs and diverse deep learning frameworks, offering users a unified interface to capture and analyze runtime events at multiple levels. Its extensible design enables researchers and practitioners to rapidly prototype custom tools with minimal overhead. We demonstrate the utility of PASTA by developing several analysis tools, including tools for deep learning workload characterization and UVM optimization. Through extensive evaluations on mainstream deep learning workloads tested on NVIDIA and AMD GPUs under both single- and multi-GPU scenarios, we demonstrate PASTA’s broad applicability. On NVIDIA GPUs, we further show that PASTA provides detailed performance insights with significantly lower overhead (up to 1.3×10^4× faster) than conventional analysis tools, thanks to its GPU-accelerated backend. PASTA strikes a practical balance between usability, extensibility, and efficiency, making it well-suited for modern accelerator-based computing environments.
Published Artifact
Enabling Automatic Compiler-Driven Vectorization of Transformers
Shreya Alladi,
Alberto Ros, and
Alexandra Jimborean
(University of Murcia, Spain; Unaffiliated, Spain)
Compiling neural networks and Transformers for edge devices faces significant challenges due to, among other factors, resource constraints and the reliance on manually optimized operations for performance. These limitations hinder the scalability and portability of neural network applications on resource-constrained platforms, such as edge devices utilizing the RISC-V ecosystem. Addressing these issues, this paper introduces innovative techniques to overcome the inefficiencies of current compilation methods and reduce dependence on manual optimizations.
This work proposes a novel compilation flow, ONNX-MLIR-LLVM (OML), which leverages MLIR and LLVM IR to enable automatic optimizations and generate stand-alone RISC-V binaries. Through comprehensive analysis, we identify key barriers preventing the auto-vectorizer from handling vectorization-friendly operators, particularly reduction operations and vectorization-unfriendly data layouts. We address these through a versatile MLIR reduction detection pass and a compile-time transpose pass, respectively.
Our automatic transformations (OML-vect) unlock the capabilities of the MLIR affine super-vectorizer, reducing reliance on manual vectorization. Evaluations on both x86 and RISC-V across eight neural networks and Transformer models demonstrate that automatic vectorization via OML-vect achieves, on average, 94% and 91% of baseline performance on x86 and RISC-V, respectively, and 2% and 8% higher performance than manually vectorized libraries on x86 and RISC-V, respectively, offering an efficient and portable solution for edge device deployments.
Published Artifact
Towards Threading the Needle of Debuggable Optimized Binaries
Cristian Assaiante,
Simone Di Biasio,
Snehasish Kumar,
Giuseppe Antonio Di Luna,
Daniele Cono D'Elia, and
Leonardo Querzoni
(Sapienza University of Rome, Italy; Google, USA)
Compiler optimizations may lead to loss of debug information, hampering developer productivity and techniques that rely on binary-to-source mappings, such as sampling-based feedback-directed optimization. While recent endeavors exposed debug information correctness and completeness bugs in compiler transformations, understanding where a complex optimizing pipeline “loses” debug information is an understudied problem.
In this paper, we first rectify accuracy issues in methods for measuring the availability of debug information, and show that the synthetic programs evaluated so far lead to metric values that differ from those we observe for real-world programs. Building on this, we present DebugTuner, a framework for systematically analyzing the impact of individual compiler optimization passes on debug information, and assemble a test suite of programs for collecting more realistic metrics. Using DebugTuner and the test suite, we identify the transformations in gcc and clang that cause the most debug information loss, and construct modified optimization levels that improve debuggability while retaining competitive performance. We obtain levels that outperform gcc’s Og for both debuggability and performance, and make recommendations for constructing an Og level for clang. Finally, we present a case study on AutoFDO where, by disabling selected passes in the profiling stage, the final optimized binary is more performant due to the improved quality of the binary-to-source mapping.
Published Artifact
Selene: Cross-Level Barrier-Free Pipelining for Irregular Nested Loops in High-Level Synthesis
Sungwoo Yun,
Seonyoung Cheon,
Dongkwan Kim,
Heelim Choi,
Kunmo Jeong,
Chan Lee,
Yongwoo Lee, and
Hanjun Kim
(Yonsei University, Republic of Korea; DGIST, Republic of Korea)
The growing demand for domain-specific accelerators in fields such as machine learning, graph analytics, and scientific computing has highlighted the need for productive and efficient hardware design methodologies. High-level synthesis (HLS) offers an attractive solution by generating hardware from high-level code, with loop pipelining as a cornerstone for maximizing throughput in regular computations. However, existing static and dynamic HLS approaches fail to achieve high pipeline utilization for irregular loop nests characterized by data-dependent bounds, unpredictable memory access patterns, and loop-carried dependencies.
To address this pipeline underutilization in irregular loop nests, this work proposes a new barrier-free pipeline architecture with a cross-level scheduling strategy, together with Selene, an HLS compiler that automatically synthesizes the proposed architecture. Our approach introduces a fine-grained pipeline controller and an outer-loop iteration interleaving mechanism, enabling concurrent execution of multiple outer-loop iterations and efficient handling of data dependencies. Implemented within a commercial HLS flow (Vitis HLS), Selene delivers significant speedups of 4.74× and 5.46× over standard static and dynamic HLS tools, respectively, on a range of irregular workload benchmarks, demonstrating its effectiveness in overcoming challenges to efficient hardware generation for data-dependent applications.
From Threads to Tiles: T2T, a Compiler for CUDA-to-NPU Translation via 2D Vectorization
Shuaijiang Li,
Jiacheng Zhao,
Ying Liu,
Shuoming Zhang,
Lei Chen,
Yijin Li,
Yangyu Zhang,
Zhicheng Li,
Runyu Zhou,
Xiyu Shi,
Chunwei Xia,
Yuan Wen,
Xiaobing Feng, and
Huimin Cui
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Leeds, UK; University of Aberdeen, UK)
CUDA’s programming model, exposing massive parallelism via fine-grained scalar threads, has become the de facto standard for GPU computing. Concurrently, NPUs are emerging as highly efficient accelerators, but their architecture is fundamentally different, relying on coarse-grained, explicit 2-D tile-based instructions. This creates a critical challenge: bridging the semantic gap “From Threads to Tiles”. A direct translation is infeasible, as it requires lifting the implicit parallelism of CUDA’s scalar model into the explicit, multi-dimensional vector space of NPUs, a problem we formalize as a lifting challenge.
This paper introduces T2T, a compiler framework that automates this “Threads to Tiles” translation via the 2-D Vectorization technique. T2T first transforms a CUDA kernel’s implicit SIMT parallelism into a structured, explicit loop nest via our Unified Parallelism Abstraction (UPA), making the parallelism analyzable. From this representation, T2T’s core vectorization engine systematically selects optimal pairs of loops and maps them onto the NPU’s 2-D tile instructions to maximize hardware utilization. To ensure correctness and handle performance-critical CUDA features, a final set of semantics-preserving optimizations is applied, including efficient control-flow management and vectorization of warp-level intrinsics.
We implement T2T based on Polygeist and evaluate it on representative NPU architectures. On a diverse set of benchmarks, kernels translated by T2T achieve up to 73% of native CUDA performance on an A100 GPU and outperform baseline translation approaches by up to 6.9×. Our work demonstrates that a systematic, compiler-driven approach to 2-D vectorization is a principled and high-performance path for porting the rich CUDA ecosystem to the evolving landscape of NPU accelerators.
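The "threads to tiles" lifting can be sketched as follows: a SIMT kernel's implicit per-thread parallelism becomes an explicit loop nest whose innermost TILE×TILE body is what would map onto a single 2-D NPU tile operation. This is an editorial illustration in Python; T2T itself operates on CUDA via Polygeist.

```python
# Toy lifting of a per-thread SIMT add into an explicit 2-D tiled nest.
N, TILE = 8, 4

def simt_kernel(tid_x, tid_y, a, b, c):
    c[tid_y][tid_x] = a[tid_y][tid_x] + b[tid_y][tid_x]  # one scalar thread

def tiled_kernel(a, b, c):
    for ty in range(0, N, TILE):                # explicit outer loops ...
        for tx in range(0, N, TILE):
            for y in range(ty, ty + TILE):      # ... whose TILE x TILE body
                for x in range(tx, tx + TILE):  # becomes one 2-D tile op
                    c[y][x] = a[y][x] + b[y][x]

a = [[x + y for x in range(N)] for y in range(N)]
b = [[x * y for x in range(N)] for y in range(N)]
c1 = [[0] * N for _ in range(N)]
c2 = [[0] * N for _ in range(N)]
for ty in range(N):              # the GPU would launch N*N scalar threads
    for tx in range(N):
        simt_kernel(tx, ty, a, b, c1)
tiled_kernel(a, b, c2)
print(c1 == c2)  # True: the lifted nest preserves the kernel's semantics
```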
Ember: A Compiler for Embedding Operations on Decoupled Access-Execute Architectures
Marco Siracusa,
Olivia Hsu,
Víctor Soria-Pardos,
Joshua Randall,
Arnaud Grasset,
Eric Biscondi,
Douglas J. Joseph,
Randy Allen,
Fredrik Kjolstad,
Miquel Moreto, and
Adrià Armejach
(Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain; Stanford University, USA; Carnegie Mellon University, USA; Arm, USA)
Decoupled Access-Execute (DAE) architectures separate memory accesses from computation in two specialized units. This design is becoming increasingly popular among hyperscalers to accelerate irregular embedding lookups in recommendation models. In this paper, we first broaden the scope by demonstrating the benefits of DAE architectures across a wider range of irregular embedding operations in several machine learning models. Then, we propose the Ember compiler to automatically compile all of these embedding operations to DAE architectures. Unlike other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
Published Artifact
Compiler-Assisted Instruction Fusion
Ravikiran Ravindranath Reddy,
Sawan Singh,
Arthur Perais,
Alberto Ros, and
Alexandra Jimborean
(University of Murcia, Spain; CNRS, France; Unaffiliated, Spain)
Hardware instruction fusion combines multiple architectural instructions into a single operation, improving performance by freeing up resources. While fusion typically involves consecutive instructions, there are proposals to fuse non-consecutive instructions to maximize its potential. However, such approaches require complex and costly hardware to predict and either validate fusion or unfuse, which significantly increases the cost of fusion. In this work, we propose CAIF (Compiler-Assisted Instruction Fusion), a compiler technique for fusion-aware instruction scheduling. CAIF identifies fusible but non-consecutive memory operations and reorders eligible pairs of instructions so that they appear consecutively in the instruction stream.
Our experiments demonstrate that for neural network workloads, hardware that only fuses consecutive instructions obtains a 1.2% average performance improvement over a no-fusion baseline when applications are compiled with a standard compiler, and 19.6% when compiled with CAIF. In addition, when non-consecutive hardware fusion (Helios) is enabled, CAIF boosts performance from 6.6% to 20.3%. Moreover, CAIF can effectively handle statically challenging general-purpose applications, boosting performance on SPEC CPU 2017 from 2.4% to 6.4% and from 14.4% to 17.7%, respectively, on the two hardware configurations above.
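The scheduling idea can be sketched on a toy three-address IR (ours, not the paper's implementation): a fusible consumer is hoisted next to its producer only when no intervening instruction interferes with its operands or result.

```python
# Toy fusion-aware reordering: make fusible producer/consumer pairs
# consecutive. Instructions are (op, dst, srcs) triples.

def reorder_fusible(instrs, fusible):
    """fusible: set of (producer_op, consumer_op) pairs the hardware
    can fuse when the two instructions are consecutive."""
    out = list(instrs)
    for i, (p_op, p_dst, _) in enumerate(out):
        for j in range(i + 2, len(out)):          # non-consecutive only
            c_op, c_dst, c_srcs = out[j]
            if (p_op, c_op) in fusible and p_dst in c_srcs:
                between = out[i + 1:j]
                # legality: nothing in between writes the consumer's
                # sources, reads its result, or writes its destination
                safe = all(d not in c_srcs and c_dst not in (d, *s)
                           for _, d, s in between)
                if safe:                          # hoist next to producer
                    out.insert(i + 1, out.pop(j))
                break
    return out

prog = [("add", "r1", ("r2", "r3")),   # producer of r1
        ("mul", "r5", ("r6", "r7")),   # independent filler
        ("load", "r4", ("r1",))]       # fusible consumer of r1
sched = reorder_fusible(prog, fusible={("add", "load")})
print([op for op, _, _ in sched])  # ['add', 'load', 'mul']
```

After the reorder, hardware that only checks adjacent instruction pairs can fuse the `add`/`load` pair with no prediction or unfuse machinery.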
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
Hongzheng Chen,
Bin Fan,
Alexander Collins,
Bastian Hagedorn,
Evghenii Gaburov,
Masahiro Masuda,
Matthew Brookhart,
Chris Sullivan,
Jason Knight,
Zhiru Zhang, and
Vinod Grover
(Cornell University, USA; NVIDIA, USA; NVIDIA, UK; NVIDIA, Germany)
Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines—a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1× speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2× speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
LLM-VeriOpt: Verification-Guided Reinforcement Learning for LLM-Based Compiler Optimization
Xiangxin Fang,
Jiaqin Kang,
Rodrigo C. O. Rocha,
Sam Ainsworth, and
Lev Mukhanov
(Queen Mary University of London, UK; University of Edinburgh, UK)
Large Language Models (LLMs) for compiler optimization have recently emerged as a frontier research direction, with many studies demonstrating their potential to automate and improve low-level code transformations. While various techniques have been proposed to enhance LLMs’ ability to optimize LLVM IR or assembly code, ensuring the semantic equivalence of transformed instructions remains a fundamental prerequisite for safe and effective performance improvement. At the same time, code generated by LLMs is often so far from being correct that it is very difficult to work out how to improve it, and hence how to proceed with generating optimizations from the output.
In this work, we present LLM-VeriOpt, a novel reinforcement-learning methodology that incorporates feedback from a formal verifier, Alive2, to guide the training of a small-scale model, Qwen-3B. This facilitates the use of Guided Reinforcement via Group Relative Policy Optimization (GRPO), using semantic-equivalence signals from the Alive2 formal verification tool as part of the reward function. This allows the model to observe correctness feedback during training and learn to self-correct, yielding high code coverage by successfully transforming large amounts of code while also optimizing it significantly.
We demonstrate our technique by designing an LLM-based peephole optimizer over LLVM IR. Our method significantly improves the correctness of IR optimizations versus the base LLM Qwen-3B applied with just a prompt and no fine-tuning—achieving a 5.4× improvement in code successfully modified. The resulting model produces verifiably correct output 90% of the time, comfortably outperforming larger state-of-the-art LLMs, including Meta’s LLM Compiler. This yields speedups of 2.3× over O0-optimized code, comparable to the handwritten LLVM -instcombine pass, and produces emergent optimizations that outperform it in 20% of cases.
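A verifier-shaped reward of the kind described can be sketched as follows (the weights and thresholds here are hypothetical; the paper's actual reward is built from Alive2 verdicts on LLVM IR rewrites): candidates that do not compile are penalized, unverified candidates earn no performance credit, and only verified candidates collect a speedup bonus.

```python
# Toy verification-guided reward shaping, GRPO-style (illustrative
# values only): correctness gates the performance term.

def reward(verified, compiles, speedup):
    if not compiles:
        return -1.0              # malformed IR: strong negative signal
    if not verified:
        return 0.0               # semantically unproven: no credit
    return 1.0 + max(0.0, speedup - 1.0)   # correctness + perf bonus

candidates = [
    (True,  True,  2.3),   # verified and faster
    (False, True,  3.0),   # fast but not proven equivalent
    (False, False, 0.0),   # does not even compile
]
rewards = [reward(v, c, s) for v, c, s in candidates]
# Group-relative advantage, as in GRPO: center rewards within the group.
mean = sum(rewards) / len(rewards)
advantages = [r - mean for r in rewards]
print(rewards)  # [2.3, 0.0, -1.0]
```

Gating the speedup term on verification is what lets the policy learn that an unproven 3× "optimization" is worth less than a verified 2.3× one.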
Published Artifact
Thinking Fast and Correct: Automated Rewriting of Numerical Code through Compiler Augmentation
Siyuan Brant Qian,
Vimarsh Sathia,
Ivan R. Ivanov,
Jan Hückelheim,
Paul Hovland, and
William S. Moses
(University of Illinois at Urbana-Champaign, USA; Institute of Science Tokyo, Japan; RIKEN RCCS, Japan; Argonne National Laboratory, USA)
Floating-point numbers are finite-precision approximations to real numbers and are ubiquitous in computer applications in nearly every field. Selecting the right floating-point representation that balances performance and numerical accuracy is a difficult task – one that has become even more critical as hardware trends toward high-performance, low-precision operations. Although the common wisdom around changing floating-point precision implies that accuracy and performance are inversely correlated, more advanced techniques can often circumvent this tradeoff. Applying complex numerical optimizations to real-world code, however, is an arduous engineering task that requires expertise in numerical analysis and performance engineering, and application-specific numerical context. While there is a plethora of existing tools that partially automate this process, they are limited in the scope of optimization techniques or still require substantial human intervention. We present Poseidon, a modular and extensible framework that fully automates floating-point optimizations for real-world applications within a production compiler. Our key insight is that a small surrogate profile often reveals sufficient numerical context to drive effective rewrites. Poseidon operates as a two-phase compiler: the first compilation instruments the program to capture numerical context; the second compilation consumes profiled data, generates and evaluates candidate rewrites, and solves for optimal performance/accuracy tradeoffs. Poseidon’s interoperability with standard compiler analyses and optimizations grants it analysis and optimization advantages unavailable to existing source- and binary-level approaches. On multiple large-scale applications, Poseidon leads to outsized benefits in performance without substantially changing accuracy, and outsized accuracy benefits without diminishing performance. On a quaternion differentiator, Poseidon enables a 1.46× speedup with a relative error of 10^−7. On DOE’s LULESH hydrodynamics application, Poseidon improves program accuracy to exactly match a 512-bit simulation run without substantially reducing performance.
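The two-phase profile-then-rewrite flow can be illustrated with a toy precision check (function names are ours, not Poseidon's API): a double-precision surrogate profile serves as the reference, and a lowering to float32 is accepted only if the observed relative error stays below a threshold on the profiled inputs.

```python
import struct

# Toy profile-guided precision selection (illustrative only).

def to_f32(x):
    """Round a Python float to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def safe_to_lower(ref_fn, low_fn, samples, tol=1e-6):
    """Accept the lowered rewrite iff its relative error vs. the
    double-precision reference stays within tol on the profile."""
    return all(abs(low_fn(x) - ref_fn(x)) <= tol * abs(ref_fn(x))
               for x in samples if ref_fn(x) != 0.0)

# Candidate 1: x*x + 1 stays accurate with every intermediate rounded
# to float32, so the profile accepts the lowering.
def ref1(x): return x * x + 1.0
def low1(x): return to_f32(to_f32(to_f32(x) * to_f32(x)) + 1.0)

# Candidate 2: (x + 1e9) - 1e9 cancels catastrophically in float32 for
# small x, so the profile rejects it.
def ref2(x): return (x + 1e9) - 1e9
def low2(x): return to_f32(to_f32(to_f32(x) + to_f32(1e9)) - to_f32(1e9))

print(safe_to_lower(ref1, low1, [0.5, 2.0, 3.25]))     # True
print(safe_to_lower(ref2, low2, [1e-4, 3e-3, 0.125]))  # False
```

This per-expression accept/reject decision is the simplest instance of the performance/accuracy tradeoff that a framework in this space must solve globally.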
Published Artifact
Practical: SecSwift, a Compiler-Based Framework for Software Countermeasures in Cybersecurity
François de Ferrière,
Yves Janin, and
Sirine Mechmech
(STMicroelectronics, France; Grenoble INP, France)
Fault injection attacks are a growing concern for the security of cyber systems. To mitigate these threats, hardware protections are often complemented by a series of software countermeasures, the implementation of which can be complex, time-consuming, and error-prone even for cybersecurity experts. In this article, we introduce SecSwift, a compiler-based framework dedicated to the generation of these software countermeasures. SecSwift offers a carefully balanced architecture that augments the usual code generation flow of the LLVM compiler at selected points in the front-end, middle-end and back-end. In its current state, SecSwift leverages classical control flow integrity and data flow protections, generalized to their interprocedural variants.
A simulation-based instruction-set-level fault injection test campaign validates the proposed approach and demonstrates the robustness of the generated countermeasures. This campaign notably features a comparison between SecSwift and nZDC, which is widely considered a benchmark for countermeasure code generation. Experimental results indicate that SecSwift is on a par with, and often better than, nZDC in terms of fault detection and code size.
Finally, we show that SecSwift can easily generate countermeasures for 32-bit RISC-V RV32E processors, which have only 16 general-purpose registers. Clearly, SecSwift’s high-level approach is a distinct advantage for architectures with small register files—an area where low-level approaches based on partitioning between main and shadow registers struggle.
PolyUFC: Polyhedral Compilation Meets Roofline Analysis for Uncore Frequency Capping
Nilesh Rajendra Shah,
M V V S Manoj Kumar,
Dhairya Baxi, and
Ramakrishna Upadrasta
(IIT Hyderabad, India)
We present PolyUFC, an MLIR-based compilation flow for uncore frequency capping that combines (performance and power) roofline analyses with polyhedral compilation-based static analysis to characterize affine programs. We introduce a parametric mathematical model that links operational intensity and uncore frequency to derive frequency caps, validated through empirical evaluation on real hardware. By embedding these caps into Pluto-optimized code generated by Polygeist, we achieve improvements in Energy Delay Product (EDP) of up to 42% on compute-bound and up to 54% on bandwidth-bound programs, carefully selected from ML models in the vision/NLP domains and from PolyBench, over the Intel UFS driver. Our framework is retargetable across multiple micro-architectures, can handle multiple optimization goals such as performance, energy, and EDP, and is applicable both across and within dialects.
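As background for the roofline terminology above, operational intensity and the attainable-performance cap it implies can be sketched in a few lines (an illustrative sketch with hypothetical machine numbers, not PolyUFC's parametric model):

```python
# Illustrative roofline arithmetic with hypothetical machine numbers
# (100 GFLOP/s peak, 50 GB/s memory bandwidth); not PolyUFC's model.

def attainable_gflops(oi, peak_gflops, mem_bw_gbs):
    """Roofline cap: performance is bounded by the compute peak or by
    memory bandwidth times operational intensity, whichever is lower."""
    return min(peak_gflops, mem_bw_gbs * oi)

def daxpy_oi(n):
    """Operational intensity (FLOPs per byte) of daxpy y = a*x + y over
    n doubles: 2n FLOPs against 3 * 8n bytes of memory traffic."""
    return (2 * n) / (3 * 8 * n)

oi = daxpy_oi(1_000_000)                   # 1/12 FLOP/byte
perf = attainable_gflops(oi, 100.0, 50.0)  # bandwidth-bound, far below peak
ridge = 100.0 / 50.0                       # above 2 FLOP/byte, compute-bound
```

A kernel whose operational intensity sits below the ridge point is bandwidth-bound, which is precisely the regime where capping uncore frequency can save energy without costing performance.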
Progressive Low-Precision Approximation of Tensor Operators on GPUs: Enabling Greater Trade-Offs between Performance and Accuracy
Fan Luo,
Guangli Li,
Zhaoyang Hao,
Xueying Wang,
Xiaobing Feng,
Huimin Cui, and
Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; Beijing University of Posts and Telecommunications, China; UNSW, Australia)
Recent GPUs integrate specialized hardware for low-precision arithmetic (e.g., FP16, INT8), offering substantial speedups for tensor operations. However, existing methods typically rely on coarse, operator-level trial-and-error tuning, which restricts the performance-accuracy trade-off space and limits achievable gains.
We present Platensor, a progressive low-precision approximation framework that expands this trade-off space through fine-grained, tile-level strategies. The key idea is to exploit the tiled computation patterns of GPUs to enable flexible precision control and richer optimization opportunities. Platensor performs a two-phase exploration: a fast rule-based pass that selects promising tile-level configurations, followed by an evolutionary search that refines them. It then automatically generates optimized kernels that combine tiles of different precisions.
Experiments on GEMM operators and representative applications, including kNN, LLMs, and HPL-MxP, show that Platensor significantly broadens the attainable performance-accuracy trade-offs and more fully leverages low-precision arithmetic on modern GPUs compared to operator-level tuning.
Pyls: Enabling Python Hardware Synthesis with Dynamic Polymorphism via LCRS Encoding
Bolei Tong,
Yongyan Fang,
Chaorui Wang,
Qingan Li,
Jingling Xue, and
Mengting Yuan
(Wuhan University, China; UNSW, Australia)
Python dominates AI development and is the most widely used dynamic programming language, but synthesizing its polymorphic functions into hardware remains challenging. Existing HLS solutions support only static subsets of Python, forcing CPU offload with costly communication overhead. We present Pyls, the first framework that synthesizes dynamically polymorphic Python into monolithic hardware via Left-Child Right-Sibling (LCRS) encoding. Key to our approach is representing all Python objects as LCRS trees, enabling uniform hardware handling of dynamic types. Pyls automatically converts objects to fixed-width formats, generates XLS IR designs, and implements a tree memory architecture for efficient runtime type resolution. On FPGA platforms, Pyls demonstrates speedups of 5.19× and 3.98× over two ASIC CPUs, 303.29× over a soft-core processor, and 282.66× over a heterogeneous SoC design.
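The Left-Child Right-Sibling encoding at the heart of this approach is a classic data-structure transformation: an arbitrary-arity tree is rewritten so every node has exactly two pointers, which is what makes a fixed-width hardware layout possible. A minimal sketch of the idea (not Pyls's implementation):

```python
class Node:
    """N-ary tree node with arbitrarily many children."""
    def __init__(self, val, children=()):
        self.val = val
        self.children = list(children)

class LCRSNode:
    """Binary-shaped node: 'left' points to the first child,
    'right' points to the next sibling."""
    def __init__(self, val):
        self.val = val
        self.left = None   # first child
        self.right = None  # next sibling

def to_lcrs(node):
    """Encode an n-ary tree in left-child right-sibling form, giving
    every node a fixed two-pointer layout regardless of arity."""
    enc = LCRSNode(node.val)
    prev = None
    for child in node.children:
        c = to_lcrs(child)
        if prev is None:
            enc.left = c       # first child hangs off 'left'
        else:
            prev.right = c     # later children chain off 'right'
        prev = c
    return enc

# A node with three children collapses to fixed-width two-pointer cells:
t = Node("obj", [Node("a"), Node("b"), Node("c")])
e = to_lcrs(t)
# e.left is "a"; "b" and "c" hang off the sibling chain via 'right'
```

Because every encoded node occupies the same two-pointer shape, dynamic objects of any arity can be stored and traversed uniformly, which is the property hardware synthesis needs.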
TPDE: A Fast Adaptable Compiler Back-End Framework
Tobias Schwarz,
Tobias Kamm, and
Alexis Engelke
(TU Munich, Germany)
Fast machine code generation is especially important for fast start-up in just-in-time compilation, where compilation time is part of the end-to-end latency. However, widely used compiler frameworks like LLVM do not prioritize fast compilation and require an extra IR translation step that increases latency even further; rolling a custom code generator, meanwhile, is a large engineering effort, especially when targeting multiple architectures.
Therefore, in this paper, we present TPDE, a compiler back-end framework that adapts to existing code representations in SSA form. Using an IR-specific adapter providing canonical access to IR data structures and a specification of the IR semantics, the framework performs one analysis pass and then performs the compilation in just a single pass, combining instruction selection, register allocation, and instruction encoding. The generated target instructions are primarily derived from code written in a high-level language through LLVM's Machine IR, easing portability to different architectures while enabling optimizations during code generation.
To show the generality of our framework, we build a new back-end for LLVM from scratch targeting x86-64 and AArch64. Performance results on SPECint 2017 show that we can compile LLVM-IR 8–26x faster than LLVM -O0 while being on-par in terms of run-time performance. We also demonstrate the benefits of adapting to domain-specific IRs in JIT contexts, particularly WebAssembly and database query compilation, where avoiding the extra IR translation further reduces compilation latency.
Published Artifact
Space-Time Optimisations for Early Fault-Tolerant Quantum Computation
Sanaa Sharma and
Prakash Murali
(University of Cambridge, UK)
Fault-tolerance is the future of quantum computing, ensuring error-corrected quantum computation that can be used for practical applications. Resource requirements for fault-tolerant quantum computing (FTQC) are daunting, and hence, compilation techniques must be designed to ensure resource efficiency. There is a growing need for compilation strategies tailored to the early FTQC regime, which refers to the first generation of fault-tolerant machines operating under stringent resource constraints of fewer physical qubits and limited distillation capacity. Present-day compilation techniques are largely focused on overprovisioning of routing paths and make liberal assumptions regarding the availability of distillation factories. Our work develops compilation techniques that are tailored to the needs of early FTQC systems, including distillation-adaptive qubit layouts and routing techniques. In particular, we show that simple greedy heuristics are extremely effective for this problem, offering significant reduction in the number of qubits compared to prior works. Our techniques incur an average execution-time overhead of 1.2× in exchange for a 53% reduction in qubits relative to the theoretical lower bounds. As the industry develops early FTQC systems with tens to hundreds of logical qubits over the coming years, our work has the potential to be widely useful for optimising program executions.
Tensor Program Superoptimization through Cost-Guided Symbolic Program Synthesis
Alexander Brauckmann,
Aarsh Chaube,
José Wesley de Souza Magalhães,
Elizabeth Polgreen, and
Michael F. P. O’Boyle
(University of Edinburgh, UK)
Modern tensor compiler frameworks like JAX and PyTorch accelerate numerical programs by compiling or mapping Domain-Specific Language (DSL) code to efficient executables. However, they rely on a fixed set of transformation rules and heuristics, which means they can miss profitable optimization opportunities. This leaves significant optimization potential unused for programs that fall outside these fixed patterns.
This paper presents STENSO, a tensor DSL program superoptimizer that discovers such missing rewrites. STENSO's core is a search algorithm based on symbolic program synthesis that systematically explores the space of equivalent programs. By combining symbolic execution and sketch-based program synthesis, it generates equivalent candidate implementations. To make the search computationally tractable, STENSO further integrates a cost model with a branch-and-bound scheme. This effectively prunes the search space, arriving at optimal solutions in reasonable time.
We evaluate STENSO on over 30 benchmarks. The discovered programs achieve geometric mean speedups of 3.8x over NumPy and 1.6x over state-of-the-art compilers like JAX and PyTorch-Inductor. These results underscore the limitations of heuristic-based compilation and demonstrate STENSO's effectiveness in finding such optimizations automatically.
Published Artifact
Synthesizing Instruction Selection Back-Ends from ISA Specifications Made Practical
Florian Drescher and
Alexis Engelke
(TU Munich, Germany)
Instruction selection in compiler backends traditionally depends on huge handwritten rule libraries that map IR patterns to target instruction sequences. Porting to new architectures or extending them for new hardware features is a very error-prone process that demands significant development effort from a dedicated expert and even requires long-term commitment to patch previously introduced bugs.
Automatic synthesis of these rule libraries from formal ISA and IR specifications eliminates the initial development effort and reduces the long-term maintenance effort, as synthesized rules are correct by construction. Prior work on synthesizing instruction selectors either requires synthesis times of multiple days or falls significantly behind the code quality of an optimizing handwritten backend; neither is applicable in practice.
We introduce a term canonicalization and indexing approach that accelerates finding rules for syntactically similar bitvector terms while falling back on SMT solving to ensure completeness in all other cases. Combined with search bounds derived from LLVM's existing pattern base, this reduces synthesis times from multiple days to under two hours for AArch64 and RISC-V.
We integrated the synthesized instruction selection rules for AArch64 and RISC-V into LLVM's GlobalISel backend and achieved almost on-par performance with the existing, industry-standard code generation backends in LLVM on the SPEC 2017 Integer benchmark suite (within 4% of LLVM GlobalISel).
Accelerating App Recompilation across Android System Updates by Code Reusing
Hongtao Wu,
Yu Chen,
Mengfei Xie,
Futeng Yang,
Jun Yan,
Jiang Ma,
Jianming Fu,
Chun Jason Xue, and
Qingan Li
(Wuhan University, China; Guangdong OPPO Mobile Telecommunications, China; Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates)
Android utilizes Ahead-of-Time (AOT) compilation technology to precompile applications and stores the compiled code in OAT files, thereby improving app performance. When the Android system is updated, the old OAT files are invalidated and applications fall back to interpreted execution, resulting in degraded performance. To accommodate frequent Android system updates, which commonly occur on a monthly basis for most smartphone manufacturers, apps must be frequently recompiled into OAT files to promptly restore optimal app performance. However, recompiling applications is a resource-consuming process that cannot be completed quickly. Users have to endure issues such as device overheating and lag, which are caused by performance degradation after system updates.
This paper evaluates popular Android apps across different system updates and makes an important observation: up to 99% of the compiled code can be reused across different system updates, rendering most existing recompilation effort unnecessary. Based on this observation, this paper proposes a method to accelerate app recompilation across Android system updates by reusing the old OAT files.
We evaluated the proposed method with eight popular apps, on ten open-source Android system pairs and one closed-source Android system pair provided by a smartphone manufacturer. These Android system pairs have the same Android Runtime (ART) version and execute AOT compilation in both speed and speed-profile modes. Experimental results show that the proposed method reuses approximately 95% of compiled methods, achieving average speedups of 2.12× in CPU time and 1.39× in wall-clock time in speed-profile mode. In speed mode, the proposed method reuses about 99% of compiled methods, achieving average speedups of 5.15× in CPU time and 2.80× in wall-clock time, respectively. The proposed method not only accelerates app recompilation but also generates OAT files identical to those generated by native AOT compilation, without introducing security issues. Therefore, it holds significant promise for real-world deployment and has the potential to enhance user experience by speeding up the generation of new OAT files for applications.
Compilation of Generalized Matrix Chains with Symbolic Sizes
Francisco López,
Lars Karlsson, and
Paolo Bientinesi
(Umeå University, Sweden)
Generalized Matrix Chains (GMCs) are products of matrices where each matrix carries features (e.g., general, symmetric, triangular, positive-definite) and is optionally transposed and/or inverted. GMCs are commonly evaluated via sequences of calls to BLAS and LAPACK kernels. When matrix sizes are known, one can craft a sequence of kernel calls to evaluate a GMC that minimizes some cost, e.g., the number of floating-point operations (FLOPs). Even in these circumstances, high-level languages and libraries, upon which users usually rely, typically perform a suboptimal mapping of the input GMC onto a sequence of kernels. In this work, we go a step further and consider matrix sizes to be symbolic (unknown); this changes the nature of the problem since no single sequence of kernel calls is optimal for all possible combinations of matrix sizes. We design and evaluate a code generator for GMCs with symbolic sizes that relies on multi-versioning. At compile-time, when the GMC is known but the sizes are not, code is generated for a few carefully selected sequences of kernel calls. At run-time, when sizes become known, the best generated variant for the matrix sizes at hand is selected and executed. The code generator uses new theoretical results that guarantee that the cost is within a constant factor from optimal for all matrix sizes and an empirical tuning component that further tightens the gap to optimality in practice. In experiments, we found that the increase above optimal in both FLOPs and execution time of the generated code was less than 15% for 95% of the tested chains.
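The symbolic-size setting generalizes the classic fixed-size matrix-chain ordering problem, where the FLOP-minimizing parenthesization is found by the standard interval dynamic program. A textbook-style sketch (not the paper's code) shows why ordering alone can change cost by an order of magnitude:

```python
def matrix_chain_flops(dims):
    """Minimum scalar-multiplication count for A1..An, where Ai has
    shape dims[i-1] x dims[i], via the classic interval DP."""
    n = len(dims) - 1
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):          # subchain length
        for i in range(1, n - length + 2):  # subchain start
            j = i + length - 1              # subchain end
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i - 1] * dims[k] * dims[j]  # cost of final product
                for k in range(i, j)               # split point
            )
    return cost[1][n]

# (10x100)(100x5)(5x50): the order ((A1 A2) A3) costs 7500 multiplies,
# versus 75000 for (A1 (A2 A3)) -- a 10x difference from ordering alone.
best = matrix_chain_flops([10, 100, 5, 50])
```

Once sizes are symbolic, no single parenthesization wins everywhere, which motivates the paper's multi-versioning: generate a few variants at compile time and pick among them once the sizes are known.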
Published Artifact
SkeleShare: Algorithmic Skeletons and Equality Saturation for Hardware Resource Sharing
Jonathan Van der Cruysse,
Tzung-Han Juang,
Shakiba Bolbolian Khah, and
Christophe Dubach
(McGill University, Canada)
Compiling functional programs into efficient Field Programmable Gate Array (FPGA) designs is difficult.
Hardware resources must be explicitly allocated and shared to maximize resource efficiency.
This requires careful orchestration of several transformations to expose and exploit sharing opportunities.
This paper introduces SkeleShare, a novel approach that automates the problem of resource allocation and sharing.
It leverages equality saturation and algorithmic skeletons to expose sharing opportunities across abstraction levels.
A solver-based extractor then selects a design that consolidates computations, meeting resource constraints while maintaining performance.
This approach is evaluated on neural networks and image processing targeting a real FPGA.
The paper shows how SkeleShare is used to express the various algorithmic patterns and transformation rules inherent in neural network operators.
The experimental evaluation demonstrates that SkeleShare's fully automated resource allocation and sharing matches and exceeds the performance of prior work, which involves expert manual extraction of sharing opportunities.
FRUGAL: Pushing GPU Applications beyond Memory Limits
Lingqi Zhang,
Tengfei Wang,
Jiajun Huang,
Chen Zhuang,
Ivan R. Ivanov,
Peng Chen,
Toshio Endo, and
Mohamed Wahib
(RIKEN RCCS, Japan; Google Cloud, Japan; University of South Florida, USA; Institute of Science Tokyo, Japan)
GPUs power modern scientific and AI applications, but their limited memory capacity restricts scalability. Buying GPUs with larger HBM is prohibitively expensive and still bounded by market limits. Existing solutions either exploit application-specific knowledge through out-of-core techniques, which lack generality, or rely on system-level page faulting, which is transparent but inefficient. We propose FRUGAL, an application-agnostic framework and methodology that reduces GPU memory footprint while sustaining high performance. FRUGAL formulates memory management as an optimization over an application’s execution graph, encompassing prefetching, kernel execution, and offloading. Using static analysis and profiling, FRUGAL applies a two-phase scheduling and migration strategy, solving an otherwise intractable optimization efficiently. Evaluations on Tiled Cholesky Decomposition, Tiled LU Decomposition, Tiny-CUDA-NN, and QuEST show that FRUGAL significantly reduces maximum GPU memory usage by 80.21%, 80.20%, 64.75% and 60.86% with only a geometric mean of 28.31% slowdown. FRUGAL allows applications to exceed hardware-imposed limits, and maintains strong performance scalability beyond existing GPU memory constraints, without additional hardware cost.
SparseX: Synergizing GPU Libraries for Sparse Matrix Multiplication on Heterogeneous Processors
Ruifeng Zhang,
Xiangwei Wang,
Ang Li, and
Xipeng Shen
(North Carolina State University, USA; Pacific Northwest National Laboratory, USA; University of Washington, USA)
Sparse Matrix-Matrix Multiplication (SpMM) on GPU is critical to applications ranging from scientific simulations to Graph Neural Networks (GNNs) and Deep Neural Networks (DNNs). Modern GPUs offer diverse processing units, such as CUDA cores, Tensor Cores, and Sparse Tensor Cores. Many SpMM libraries have been built to harness those different types of processors. Although impressive performance has been reported by each, including cuSparse, Sputnik, CLASP, and Jigsaw, a systematic study in this work shows that no single library is a clear winner across all matrices and scenarios. Based on the empirical observations, this work proposes the first solution to synergize the various libraries to best harness the heterogeneous processors and the array of cutting-edge libraries. The solution is an extensible framework, namely SparseX, that can automatically select the best matrix multiplication library (and types of processors) on the fly for a given sparse matrix on a GPU through an agile, accurate predictive model. Experiments show that SparseX speeds up sparse matrix multiplications on thousands of real-world matrices significantly over the SOTA GPU libraries, achieving speedups of as much as 95.34× over cuSparse. Its extensible design makes it easy to extend to cover new libraries and hardware architectures.
Published Artifact
Fast Autoscheduling for Sparse ML Frameworks
Bobby Yan,
Alexander J Root,
Trevor Gale,
David Broman, and
Fredrik Kjolstad
(Stanford University, USA; KTH Royal Institute of Technology, Sweden)
The rapid growth in the size of deep learning models strains the capabilities of dense computation paradigms. Leveraging sparse computation has become increasingly popular for training and deploying large-scale models, but existing deep learning frameworks lack extensive support for sparse operations. Current approaches either require manual scheduling expertise or rely on exhaustive search taking hours to days, which are both incompatible with the interactive development essential to machine learning research. We present three algorithmic contributions that enable fast, automatic optimization of sparse tensor computations. First, we develop a heuristic-based loop ordering algorithm that avoids asymptotic performance cliffs while compiling in milliseconds rather than hours. Second, we introduce a tiling algorithm specialized for mixed sparse-dense computations that achieves performance comparable to hand-optimized kernels. Third, we present a format inference algorithm that automatically selects appropriate sparse tensor formats for intermediate and output tensors based on operation semantics. These algorithms are grounded in the computational properties of sparse tensor algebra, making them predictable and robust across diverse workloads. We implement these techniques in Scorch, a prototype sparse tensor compiler for PyTorch that demonstrates their practical effectiveness. With only minimal code changes, our approach achieves 1.05–5.80× speedups over PyTorch Sparse on end-to-end tasks including graph neural networks, sparse autoencoders, and sparse transformers, with compilation times fast enough for interactive ML development.
LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping
Amir Mohammad Tavakkoli,
Cosmin E. Oancea, and
Mary Hall
(University of Utah, USA; University of Copenhagen, Denmark)
We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO’s integration with the Triton and MLIR compilers, and with CUDA templates. We show that LEGO is capable of deriving performance competitive with Triton, and shows broad applicability for data and thread layout mapping optimizations in its integration with CUDA and MLIR.
OpenQudit: Extensible and Accelerated Numerical Quantum Compilation via a JIT-Compiled DSL
Ed Younis
(Lawrence Berkeley National Laboratory, USA)
High-performance numerical quantum compilers rely on classical optimization, but are limited by slow numerical evaluations and a design that makes extending them with new instructions a difficult, error-prone task for domain experts. This paper introduces OpenQudit, a compilation framework that solves these problems by allowing users to define quantum operations symbolically in the Qudit Gate Language (QGL), a mathematically natural DSL. OpenQudit's ahead-of-time compiler uses a tensor network representation and an e-graph-based pass for symbolic simplification before a runtime tensor network virtual machine (TNVM) JIT-compiles the expressions into high-performance native code. The evaluation shows that this symbolic approach is highly effective, accelerating the core instantiation task by up to ~20x on common quantum circuit synthesis problems compared to state-of-the-art tools.
Published Artifact
Pushing Tensor Accelerators beyond MatMul in a User-Schedulable Language
Yihong Zhang,
Derek Gerstmann,
Andrew Adams, and
Maaz Ahmad
(University of Washington, USA; Adobe, USA)
Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, leading developers to use vendor-provided kernel libraries that support tensor accelerators. As a result, the usage of tensor accelerators is limited to the provided interface, mainly designed for traditional ML and scientific computing workloads.
In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore utilize such specialized hardware. This is nonetheless hindered by the difficulties in programming tensor accelerators. We tackle this problem with compiler-based techniques. We use the Halide user-schedulable language and express operations as Halide algorithms succinctly. To this end, we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines.
Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by 6.1× by utilizing Tensor Cores on an Nvidia RTX 4070 GPU.
Published Artifact