23rd ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2025),
March 01–05, 2025,
Las Vegas, NV, USA
Frontmatter
Welcome from the Program Chairs
On behalf of the Program Committee of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (CGO), we welcome you to this edition of the conference, held in Las Vegas. CGO’25 continued the very successful dual-submission model introduced last year. For CGO’25, 58 submissions were received in the first round and 106 in the second round, including 8 papers that were recommended for revision from the first round. In total, 48 papers were accepted for the program, including eight in the “Tools” category and three “Practical Experience” papers.
Papers First Round
Combining MLIR Dialects with Domain-Specific Architecture for Efficient Regular Expression Matching
Andrea Somaini,
Filippo Carloni,
Giovanni Agosta,
Marco D. Santambrogio, and
Davide Conficconi
(Politecnico di Milano, Italy)
Pattern matching based on Regular Expressions (REs) is a pervasive and challenging computational kernel used in several applications to identify critical information in a data stream. Due to the sequential data dependencies of REs and ever-growing data volumes, hardware acceleration is gaining attention as a way to address the limitations of general-purpose architectures. RE-oriented Domain-Specific Architectures (DSAs) combine the flexibility of translating REs into binary code with the efficiency of a specialized architecture, filling the gap between frozen hardware accelerators and the versatility of CPUs/GPUs. However, existing DSAs focus mainly on the challenge of efficient execution while missing the optimization opportunities that a structured compilation infrastructure can provide. This paper proposes an RE-tailored multi-level intermediate representation strategy, embodied in the MLIR framework, that exploits optimizations at different abstraction levels via two domain-specific dialects, one targeting the abstract representation of REs and the other targeting the underlying domain-specific ISA. Moreover, this paper proposes a novel architectural organization of an open-source state-of-the-art DSA to maximize its parallelization capabilities. Overall, the proposed approach significantly improves execution time by up to 2.26×, energy efficiency by up to 2.30×, and resource usage.
Artifacts Available
Artifacts Reusable
Results Reproduced
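To make the paper's two-dialect idea concrete, here is a minimal Python sketch that lowers a tiny regex subset through two invented abstraction levels, an abstract-RE form and a made-up RE-matching ISA; neither dialect nor any instruction name below is taken from the paper.

def parse(regex):
    """Parse a tiny regex subset (literals and '*') into an abstract RE IR."""
    ir, i = [], 0
    while i < len(regex):
        if i + 1 < len(regex) and regex[i + 1] == "*":
            ir.append(("star", regex[i]))
            i += 2
        else:
            ir.append(("char", regex[i]))
            i += 1
    return ir

def lower(ir):
    """Lower the abstract RE IR to a hypothetical RE-matching ISA."""
    asm, label = [], 0
    for op, c in ir:
        if op == "char":
            asm.append(f"MATCH '{c}'")                  # consume c or fail
        else:                                           # star: loop while c matches
            asm.append(f"L{label}: SPLIT L{label}_end")
            asm.append(f"MATCH '{c}'")
            asm.append(f"JMP L{label}")
            asm.append(f"L{label}_end:")
            label += 1
    asm.append("ACCEPT")
    return asm

print("\n".join(lower(parse("ab*c"))))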
Qiwu: Exploiting Ciphertext-Level SIMD Parallelism in Homomorphic Encryption Programs
Zhongcheng Zhang,
Ying Liu,
Yuyang Zhang,
Zhenchuan Chen,
Jiacheng Zhao,
Xiaobing Feng,
Huimin Cui, and
Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; UNSW, Australia)
Fully Homomorphic Encryption (FHE), particularly the CKKS scheme, enables computation on encrypted data, facilitating secure task offloading to untrusted servers. CKKS allows packing multiple complex values into a single ciphertext, crucial for fixed-point arithmetic in machine learning, while leveraging SIMD parallelism at the plaintext level. However, operations such as reductions can degrade performance by creating a large number of bubbles (or gaps) in intermediate ciphertexts, leading to wasted computational resources. We introduce Qiwu, a ciphertext-level vectorization approach that enhances performance by fusing multiple ciphertexts containing bubbles. Qiwu uses a DSL to specify zero bubbles in input ciphertexts and nonzero bubbles in output ciphertexts, employs data-flow analysis to track them, and formulates a fusion plan guided by a cost-benefit assessment. Implemented in an existing FHE compiler, Qiwu was evaluated on four applications (including three machine learning tasks) and three kernels. It achieves speedups of up to 18.0× on CPUs, averaging 3.4× (geometric mean), compared to the state-of-the-art compiler that exploits only plaintext-level parallelism.
Artifacts Available
Artifacts Functional
Results Reproduced
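The fusion opportunity Qiwu targets can be illustrated with a minimal Python sketch in which plain lists stand in for ciphertext slot vectors; the names and the fusibility test are ours, not Qiwu's DSL or API.

def data_slots(ct):
    """Slot positions that carry useful data (everything else is a bubble)."""
    return {i for i, v in enumerate(ct) if v != 0}

def try_fuse(ct_a, ct_b):
    """Merge two slot vectors whose data slots do not collide."""
    if data_slots(ct_a) & data_slots(ct_b):
        return None                        # data would overlap: not fusable
    return [a + b for a, b in zip(ct_a, ct_b)]

# Reductions left zero bubbles at alternating positions of each ciphertext:
ct_a = [3, 0, 5, 0]                        # data in even slots
ct_b = [0, 7, 0, 2]                        # data in odd slots
fused = try_fuse(ct_a, ct_b)
assert fused == [3, 7, 5, 2]
# One homomorphic squaring of `fused` now does the work of two operations:
squared = [v * v for v in fused]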
Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture
Olivia Hsu,
Alexander Rucker,
Tian Zhao,
Varun Desai,
Kunle Olukotun, and
Fredrik Kjolstad
(Stanford University, USA)
We introduce Stardust, a compiler from sparse tensor algebra languages to a sparse reconfigurable dataflow architecture via a parallel-patterns programming model. Stardust lets performance engineers specify the placement of data into memories separately from the placement of computation onto compute units. Users first schedule data placement onto an abstract memory model, and then Stardust binds that data to complex, on-chip physical memories. With guidance from user schedules, Stardust binds computation using these on-chip data structures to the appropriate parallel patterns. Through cycle-accurate simulation, we show that Stardust generates nine more tensor algebra kernels than the original Capstan sparse RDA work. The generated kernels perform, on average, 138× better than generated CPU kernels and 41× better than generated GPU kernels.
SySTeC: A Symmetric Sparse Tensor Compiler
Radha Patel,
Willow Ahrens, and
Saman Amarasinghe
(Massachusetts Institute of Technology, USA)
Symmetric and sparse tensors arise naturally in many domains including linear algebra, statistics, physics, chemistry, and graph theory. Symmetric tensors are equal to their transposes, so in the n-dimensional case we can save up to a factor of n! by avoiding redundant operations. Sparse tensors, on the other hand, are mostly zero, and we can save asymptotically by processing only nonzeros. Unfortunately, specializing for both symmetry and sparsity at the same time is uniquely challenging. Optimizing for symmetry requires consideration of n! transpositions of a triangular kernel, which can be complex and error-prone. Considering multiple transposed iteration orders and triangular loop bounds also complicates iteration through intricate sparse tensor formats. Additionally, since each combination of symmetry and sparse tensor formats requires a specialized implementation, this leads to a combinatorial number of cases. A compiler is needed, but existing compilers cannot take advantage of both symmetry and sparsity within the same kernel. In this paper, we describe the first compiler that can automatically generate symmetry-aware code for sparse or structured tensor kernels. We introduce a taxonomy for symmetry in tensor kernels, and show how to target each kind of symmetry. Our implementation demonstrates significant speedups ranging from 1.36x for SSYMV to 30.4x for a 5-dimensional MTTKRP over the non-symmetric state of the art.
Artifacts Available
Artifacts Reusable
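As a flavor of the kind of kernel SySTeC automates, here is a hand-written sparse symmetric matrix-vector multiply (SSYMV) in Python that stores only the upper triangle and mirrors each off-diagonal entry; the COO storage format and names are illustrative, not SySTeC's output.

def ssymv_upper(n, rows, cols, vals, x):
    """y = A @ x for symmetric A given as COO entries of its upper triangle."""
    y = [0.0] * n
    for i, j, v in zip(rows, cols, vals):   # only entries with i <= j
        y[i] += v * x[j]
        if i != j:                          # mirror the off-diagonal entry
            y[j] += v * x[i]
    return y

# A = [[2, 1, 0], [1, 3, 0], [0, 0, 4]], stored as its upper triangle:
y = ssymv_upper(3, [0, 0, 1, 2], [0, 1, 1, 2], [2.0, 1.0, 3.0, 4.0],
                [1.0, 1.0, 1.0])
assert y == [3.0, 4.0, 4.0]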
Cage: Hardware-Accelerated Safe WebAssembly
Martin Fink,
Dimitrios Stavrakakis,
Dennis Sprokholt,
Soham Chakraborty,
Jan-Erik Ekberg, and
Pramod Bhatotia
(TU Munich, Germany; TU Delft, Netherlands; Huawei, Finland)
WebAssembly (WASM) is an immensely versatile and increasingly popular compilation target. It executes applications written in several languages (e.g., C/C++) with near-native performance in various domains (e.g., mobile, edge, cloud). Despite WASM's sandboxing feature, which isolates applications from other instances and the host platform, WASM does not inherently provide any memory safety guarantees for applications written in low-level, unsafe languages.
To this end, we propose Cage, a hardware-accelerated toolchain for WASM that supports unmodified applications compiled to WASM and utilizes diverse Arm hardware features aiming to enrich the memory safety properties of WASM. Specifically, Cage leverages Arm's Memory Tagging Extension (MTE) to (i) provide spatial and temporal memory safety for heap and stack allocations and (ii) improve the performance of WASM's sandboxing mechanism. Cage further employs Arm's Pointer Authentication (PAC) to prevent leaked pointers from being reused by other WASM instances, thus enhancing WASM's security properties.
We implement our system based on 64-bit WASM. We provide a WASM compiler and runtime with support for Arm's MTE and PAC. On top of that, Cage's LLVM-based compiler toolchain transforms unmodified applications to provide spatial and temporal memory safety for stack and heap allocations and prevent function pointer reuse. Our evaluation on real hardware shows that Cage incurs minimal runtime (<5.8%) and memory (<3.7%) overheads and can improve the performance of WASM's sandboxing mechanism, achieving a speedup of over 5.1%, while offering efficient memory safety guarantees.
Artifacts Available
Artifacts Reusable
Results Reproduced
Stack Filtering: Elevating Precision and Efficiency in Rust Pointer Analysis
Wei Li,
Dongjie He,
Wenguang Chen, and
Jingling Xue
(UNSW, Australia; Chongqing University, China; Tsinghua University, China)
Context-sensitive pointer analysis tends to generate excessive spurious points-to relations, causing inefficiency and imprecision. We introduce stack filtering, a novel approach using Rust's stack object lifetime information to address this issue. It identifies and eliminates context-sensitive points-to relations involving variables pointing to inactive stack objects beyond their lifetimes. Stack filtering is a lightweight two-phase process. It first assesses stack object liveness based on function reachability, leveraging an efficient call graph generated by Rapid Type Analysis (RTA). Then, this filtering is applied during the main pointer analysis.
We have integrated stack filtering into Rupta, a context-sensitive pointer analysis framework for Rust. Evaluations on real-world Rust applications show that stack filtering enhances analysis efficiency, reducing time and memory consumption while improving precision. This enhanced context-sensitive pointer analysis with stack filtering is poised to become a vital tool for Rust's program analysis and verification, given Rust's reliance on automatic memory management and a preference for static stack allocation.
Artifacts Available
Artifacts Reusable
Results Reproduced
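The two-phase idea can be sketched in a few lines of Python: a cheap reachability pass over an RTA-style call graph marks live functions, and points-to targets that are stack objects of non-live functions are filtered out. All names and the representation below are invented for illustration.

def reachable(callgraph, root):
    seen, work = set(), [root]
    while work:
        f = work.pop()
        if f not in seen:
            seen.add(f)
            work.extend(callgraph.get(f, []))
    return seen

def filter_points_to(pts, stack_obj_owner, live_funcs):
    """Drop points-to targets that are stack objects of non-live functions."""
    return {var: {o for o in objs
                  if stack_obj_owner.get(o) in live_funcs
                  or o not in stack_obj_owner}       # heap objects are kept
            for var, objs in pts.items()}

callgraph = {"main": ["parse"], "parse": [], "dump": []}   # dump unreachable
stack_obj_owner = {"buf@dump": "dump", "tok@parse": "parse"}
pts = {"p": {"buf@dump", "tok@parse", "heap#1"}}
print(filter_points_to(pts, stack_obj_owner, reachable(callgraph, "main")))
# {'p': {'tok@parse', 'heap#1'}}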
The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think
Martin Paul Lücke,
Oleksandr Zinenko,
William S. Moses,
Michel Steuwer, and
Albert Cohen
(University of Edinburgh, UK; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA; Technische Universität Berlin, Germany)
To take full advantage of a specific hardware target, performance engineers need to gain control over compilers in order to leverage their domain knowledge about the program and hardware. Yet, modern compilers are poorly controlled, usually by configuring a sequence of coarse-grained monolithic black-box passes, or by means of predefined compiler annotations/pragmas. These can be effective, but often do not let users precisely optimize their varying compute loads. As a consequence, performance engineers have to resort to implementing custom passes for a specific optimization heuristic, requiring expert compiler-engineering knowledge.
In this paper, we present a technique that provides fine-grained control of general-purpose compilers by introducing the Transform dialect, a controllable IR-based transformation system implemented in MLIR. The Transform dialect empowers performance engineers to optimize their various compute loads by composing and reusing existing---but currently hidden---compiler features without the need to implement new passes or even rebuild the compiler.
We demonstrate in five case studies that the Transform dialect enables precise, safe composition of compiler transformations and allows for straightforward integration with state-of-the-art search methods.
Artifacts Available
Artifacts Reusable
Results Reproduced
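The shape of the approach, transformations expressed as IR that operates on handles to payload IR, can be mimicked in a toy Python interpreter; the op names and script encoding below are invented and are not the Transform dialect's actual syntax.

payload = [
    {"op": "linalg.matmul", "tiled": None, "unrolled": False},
    {"op": "linalg.fill",   "tiled": None, "unrolled": False},
]

def run_transform_script(script, payload):
    handles = {}
    for step in script:
        if step[0] == "match":                 # ("match", %h, op_name)
            _, h, name = step
            handles[h] = [p for p in payload if p["op"] == name]
        elif step[0] == "tile":                # ("tile", %h, size)
            _, h, size = step
            for p in handles[h]:
                p["tiled"] = size
        elif step[0] == "unroll":              # ("unroll", %h)
            for p in handles[step[1]]:
                p["unrolled"] = True
    return payload

script = [("match", "%mm", "linalg.matmul"),
          ("tile", "%mm", 32),
          ("unroll", "%mm")]
print(run_transform_script(script, payload))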
Honey Potion: An eBPF Backend for Elixir
Kael Soares Augusto,
Vinícius Pacheco,
Marcos A. Vieira,
Rodrigo Geraldo Ribeiro, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil; Cadence Design Systems, Brazil; Federal University of Ouro Preto, Brazil)
The Extended Berkeley Packet Filter (eBPF) is a sandboxed virtual machine that runs on operating systems with kernel privileges. Currently, eBPF programs are either translated from a subset of C, called Restricted C, or from bindings available for languages such as Rust or Python. This paper describes Honey Potion, a compiler that translates Elixir to eBPF binaries. Translation is challenging, for it must not only preserve semantics, but also satisfy the eBPF verifier, which requires proofs of in-bounds memory accesses and termination. The translator relies heavily on this last constraint---ensured termination---to implement different optimizations: constant propagation, type specialization, and partial evaluation. Honey Potion is publicly available, and has been used in the development of many eBPF applications, such as packet routers, process monitors, and event loggers. To the best of our knowledge, Honey Potion is the first translator of a functional programming language to eBPF.
Artifacts Available
Artifacts Reusable
Results Reproduced
ANT-ACE: An FHE Compiler Framework for Automating Neural Network Inference
Long Li,
Jianxin Lai,
Peng Yuan,
Tianxiang Sui,
Yan Liu,
Qing Zhu,
Xiaojing Zhang,
Linjie Xiao,
Wenguang Chen, and
Jingling Xue
(Ant Group, China; Tsinghua University, China; UNSW, Australia; Ant Group, Australia)
Fully Homomorphic Encryption (FHE) facilitates computations on encrypted data without requiring access to the decryption key, offering substantial privacy benefits for deploying neural network applications in sensitive sectors such as healthcare and finance. Nonetheless, programming these applications within the FHE framework is complex and demands extensive cryptographic expertise to guarantee correctness, performance, and security.
In this paper, we present ANT-ACE, a production-quality, open-source FHE compiler designed to automate neural network inference on encrypted data. ANT-ACE accepts ONNX models and generates C/C++ programs, leveraging its custom open-source FHE library. We explore the design challenges encountered in the development of ANT-ACE, which is engineered to support a variety of input formats and architectures across diverse FHE schemes through a novel Intermediate Representation (IR) that facilitates multiple levels of abstraction. Comprising 44,000 lines of C/C++ code, ANT-ACE efficiently translates ONNX models into C/C++ programs for encrypted inference on CPUs, specifically utilizing the RNS-CKKS scheme. Preliminary evaluations on a single CPU indicate that ANT-ACE achieves significant speed enhancements in ResNet models, surpassing expert manual implementations and fulfilling our design goals.
Artifacts Available
Artifacts Reusable
Results Reproduced
GoFree: Reducing Garbage Collection via Compiler-Inserted Freeing
Haoran Peng,
Yu Zhang,
Michael D. Ernst,
Jinbao Chen, and
Boyao Ding
(University of Science and Technology of China, China; University of Washington, USA)
In a memory-managed programming language, programmers allocate memory by creating new objects, but programmers never free memory. A garbage collector (GC) periodically reclaims memory used by unreachable objects. As an optimization based on escape analysis, some memory can be freed explicitly by instructions inserted by the compiler. This optimization reduces the cost of garbage collection, without changing the programming model.
We designed and implemented this explicit freeing optimization for the Go language. We devised a new escape analysis that is both powerful and fast (O(N^2) time). Our escape analysis identifies short-lived heap objects that can be safely explicitly deallocated. We also implemented a freeing primitive that is safe for use in concurrent environments.
We evaluated our system, GoFree, on 6 open-source Go programs. GoFree did not observably slow down compilation. At run time, GoFree deallocated on average 14% of allocated heap memory. It reduced GC frequency by 7%, GC time by 13%, wall-clock time by 2%, and heap size by 4%. We open-source GoFree.
Artifacts Available
Artifacts Reusable
Results Reproduced
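A toy version of the idea, on an invented mini-IR rather than Go SSA: an allocation that never escapes its function (neither returned nor stored to a global) can have an explicit free inserted before the function exits, sparing the GC that work.

def insert_frees(func):
    """Insert ('free', x) for allocations that never escape `func`."""
    allocs, escaped = set(), set()
    for instr in func:
        if instr[0] == "alloc":                        # ("alloc", x)
            allocs.add(instr[1])
        elif instr[0] in ("return", "store_global"):   # value escapes
            escaped.add(instr[1])
    body = [i for i in func if i[0] != "return"]
    frees = [("free", x) for x in sorted(allocs - escaped)]
    rets = [i for i in func if i[0] == "return"]
    return body + frees + rets

func = [("alloc", "a"), ("alloc", "b"), ("store_global", "b"), ("return", "b")]
print(insert_frees(func))
# ('free', 'a') is inserted; 'b' escapes and is left to the GC.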
Memory Safety Instrumentations in Practice: Usability, Performance, and Security Guarantees
Tina Jung,
Fabian Ritter, and
Sebastian Hack
(Saarland University, Germany)
Memory safety violations due to C's undefined behavior, although well researched, still cause security breaches year after year. The most dangerous reported violations are spatial safety violations, where objects are accessed outside of their bounds. A wide variety of spatial safety sanitizers promise easy usage, broad security guarantees, and low execution time overhead. However, only a few of them are actually used. Instead of proposing yet another sanitizer, we dig deep into Low-Fat Pointers and SoftBound, two approaches that generate fast-to-execute safe programs with strong safety guarantees, and identify pain points in their usage. We found that seemingly small simplifying assumptions or limitations of the approaches often lead to spurious error reports.
On top of analyzing usability issues, we set up a framework that abstracts common tasks of memory safety instrumentations, such as finding locations for checks and eliminating redundant checks. This abstraction allows us to draw a fair comparison between approaches when it comes to execution time and the number of safe accesses. We use this framework to give novel insights into how many accesses are provably safe and where to attribute execution time overhead. Our findings help future research on memory safety instrumentations by identifying issues that current approaches face in practical application. We make our LLVM-based instrumentation framework available to reduce the effort required to implement new instrumentations and to ease comparisons to Low-Fat Pointers and SoftBound.
Artifacts Available
Improving Native-Image Startup Performance
Matteo Basso,
Aleksandar Prokopec,
Andrea Rosà, and
Walter Binder
(USI Lugano, Switzerland; Oracle Labs, Switzerland)
With the increasing popularity of Serverless computing and Function as a Service---where typical workloads have a short lifetime---the research community is increasingly focusing on startup performance optimization. To reduce the startup time of managed language runtime systems, related work proposes strategies to move runtime environment initialization ahead-of-time. For instance, GraalVM Native Image allows one to create a binary file from a Java application that embeds a snapshot of the pre-initialized heap memory and can run without instantiating a Java Virtual Machine. However, the program startup needs to be further optimized, because the cloud runtime often starts the program while responding to the request. Thus, the program startup time impacts the service-level agreement.
In this paper, we improve the startup time of Native-Image binaries by changing their layout during compilation, reducing I/O traffic. We propose a profile-guided binary-reordering approach and a profiling methodology to obtain the execution-order profiles of methods and objects. Thanks to these profiles, we first reduce page faults related to the code section. Then, we propose three ordering strategies to reduce page faults related to accessing the objects in the heap snapshot. Since the object identities and the heap-snapshot contents are not persistent across Native-Image builds of the same program, we propose a method of matching objects from the profile against the objects in the profile-guided build. Experimental results show that our ordering strategies lead to an average page-fault reduction factor of 1.61× and an average execution-time speedup of 1.59×.
Artifacts Available
Artifacts Reusable
Results Reproduced
Vectron: A Dynamic Programming Auto-vectorization Framework
Sourena Naser Moghaddasi,
Haris Smajlović,
Ariya Shajii, and
Ibrahim Numanagić
(University of Victoria, Canada; Exaloop, USA)
Dynamic programming (DP) is a fundamental algorithmic strategy that decomposes large problems into manageable subproblems. It is a cornerstone of many important computational methods in diverse fields, especially computational genomics, where it is used for sequence comparison. However, as the scale of the data keeps increasing, these algorithms are becoming a major computational bottleneck, and there is a need for strategies that can improve their performance. Here, we present Vectron, a novel auto-vectorization suite that targets array-based DP implementations written in Python and converts them to vectorized counterparts that efficiently process multiple problem instances in parallel. Leveraging Single Instruction Multiple Data (SIMD) capabilities in modern CPUs, along with Graphics Processing Units (GPUs), Vectron delivers significant speedups, ranging from 10% to more than 20x, over conventional C++ implementations as well as manually vectorized and domain-specific state-of-the-art implementations, without necessitating large algorithm or code changes. Vectron's generality enables automatic vectorization of any array-based DP algorithm and, as a result, presents an attractive solution to optimization challenges inherent to DP algorithms.
Artifacts Available
Artifacts Functional
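Vectron's execution model, keeping the ordinary DP loops but letting each cell operation act on a whole batch of problem instances, can be hand-written for one DP (edit distance), with NumPy standing in for the SIMD/GPU code Vectron emits; the layout and names are illustrative.

import numpy as np

def batched_edit_distance(A, B):
    """A, B: int arrays of shape (batch, m) and (batch, n)."""
    batch, m = A.shape
    _, n = B.shape
    D = np.zeros((batch, m + 1, n + 1), dtype=np.int64)
    D[:, :, 0] = np.arange(m + 1)
    D[:, 0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):            # scalar loops, vector cells
            sub = D[:, i-1, j-1] + (A[:, i-1] != B[:, j-1])
            D[:, i, j] = np.minimum(sub,
                         np.minimum(D[:, i-1, j] + 1, D[:, i, j-1] + 1))
    return D[:, m, n]

# Two problem instances solved in lockstep:
A = np.array([[1, 2, 3], [1, 1, 1]])
B = np.array([[1, 2, 4], [2, 2, 2]])
print(batched_edit_distance(A, B))   # [1 3]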
LLM-Vectorizer: LLM-Based Verified Loop Vectorizer
Jubi Taneja,
Avery Laird,
Cong Yan,
Madan Musuvathi, and
Shuvendu K. Lahiri
(Microsoft Research, USA; University of Toronto, Canada)
Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that demands deep knowledge of specific architectures and compilers. In this paper, we evaluate the potential of large language models (LLMs) to generate vectorized (Single Instruction Multiple Data) code from scalar programs that process individual array elements. We propose a novel finite-state-machine-based multi-agent approach that harnesses LLMs and test-based feedback to generate vectorized code. Our findings indicate that LLMs are capable of producing high-performance vectorized code with run-time speedups ranging from 1.1x to 9.4x compared to state-of-the-art compilers such as the Intel Compiler, GCC, and Clang. To verify the correctness of vectorized code, we use Alive2, a leading bounded translation validation tool for LLVM IR. We describe a few domain-specific techniques to improve the scalability of Alive2 on our benchmark dataset. Overall, our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
Papers Second Round
Janitizer: Rethinking Binary Tools for Practical and Comprehensive Security
Mahwish Arif,
Sam Ainsworth, and
Timothy M. Jones
(University of Cambridge, UK; University of Edinburgh, UK)
Comprehensive application security can only be ensured if all code that an application executes is protected: any unprotected code, either from libraries or the application itself, becomes a potential attack surface. Compilers contain extensive suites of tools to aid in this, but require source availability that is often infeasible. Existing static and dynamic binary rewriting techniques that retrofit for security either lack code coverage or soundness, or incur very high performance overhead.
We present a case for adopting hybrid static-dynamic mechanisms to ensure comprehensive security for binaries, providing sound and practical solutions. We highlight the limitations of existing hybrid tools in their use for security purposes, and provide insights to re-architect them. We provide a framework, Janitizer, that enables sound and comprehensive code coverage for entire applications, presenting hybrid binary implementations for two important classes of security schemes: a memory sanitizer and a control flow integrity scheme. These achieve the coverage and correctness of high-overhead dynamic techniques, while maintaining the performance levels of low-coverage static techniques.
VEGA: Automatically Generating Compiler Backends using a Pre-trained Transformer Model
Ming Zhong,
Fang Lv,
Lulin Wang,
Lei Qiu,
Yingying Wang,
Ying Liu,
Huimin Cui,
Xiaobing Feng, and
Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia)
We introduce VEGA, an AI-driven system aimed at easing the development of compiler backends for new targets. Our approach involves categorizing functions from existing backends into function groups, each comprising various target-specific implementations of a standard compiler interface function, abstracted as a single function template. Therefore, generating a new backend involves customizing these function templates to specific target requirements. To capitalize on AI's capabilities in code generation, VEGA maps statements in a target-specific version of a function template into feature vectors, distinguishing between target-independent and target-specific properties. Leveraging a pre-trained model, VEGA can efficiently auto-generate a version of each function template tailored to a specific target, thereby enabling the construction of a complete compiler backend for a new target based solely on its target description files.
We evaluated VEGA on three distinct targets: a CPU processor (RISC-V), a customized processor with instruction extensions (RI5CY), and an IoT processor (xCORE). VEGA demonstrated high efficiency, generating compiler backends in under an hour, which can substantially enhance developer productivity. Across the three targets, VEGA achieved accuracy rates of 71.5%, 73.2%, and 62.2% for all generated functions, significantly outperforming the traditional fork-flow method, which yielded less than 8% accuracy. Moreover, VEGA provides explicit confidence scores for generated functions and statements, allowing developers to easily identify areas requiring minimal manual intervention. This research has the potential to improve the effectiveness of traditional compiler backend development.
Artifacts Available
Artifacts Reusable
Results Reproduced
SkipFlow: Improving the Precision of Points-to Analysis using Primitive Values and Predicate Edges
David Kozak,
Codrut Stancu,
Tomáš Vojnar, and
Christian Wimmer
(Oracle Labs, Czechia; Brno University of Technology, Czechia; Oracle Labs, Switzerland; Masaryk University, Czechia; Oracle Labs, USA)
A typical points-to analysis such as Andersen’s or Steensgaard’s may lose precision because it ignores the branching structure of the analyzed program. Moreover, points-to analysis typically focuses on objects only, not considering instructions manipulating primitive values. We argue that such an approach leads to an unnecessary precision loss, for example, when primitive constants true and false flow out of method calls. We propose a novel lightweight points-to analysis called SkipFlow that interprocedurally tracks the flow of both primitives and objects, and explicitly captures the branching structure of the code using predicate edges. At the same time, however, SkipFlow is as lightweight and scalable as possible, unlike a traditional flow-sensitive analysis. We apply SkipFlow to GraalVM Native Image, a closed-world solution to building standalone binaries for Java applications. We evaluate the implementation using a set of microservice applications as well as well-known benchmark suites. We show that SkipFlow reduces the size of the application in terms of reachable methods by 9% on average without significantly increasing the analysis time.
Artifacts Available
Artifacts Reusable
Results Reproduced
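The intuition behind predicate edges can be sketched as follows (representation invented, not SkipFlow's): a flow edge guarded by a branch on a tracked primitive is skipped whenever the required boolean value is not among the values the analysis has seen for that variable.

def propagate(flows, facts):
    """flows: (src, dst, pred) triples; pred = (bool_var, required_value)."""
    changed = True
    while changed:
        changed = False
        for src, dst, pred in flows:
            if pred is not None:
                var, want = pred
                if want not in facts.get(var, set()):
                    continue               # predicate can never hold: skip
            before = facts.get(dst, set()).copy()
            facts.setdefault(dst, set()).update(facts.get(src, set()))
            changed |= facts[dst] != before
    return facts

facts = {"isEnabled": {False}, "obj": {"Impl"}}
flows = [("obj", "usedObj", ("isEnabled", True))]      # flow guarded by flag
print(propagate(flows, facts).get("usedObj", set()))   # set(): edge skipped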
MTE4JNI: A Memory Tagging Method to Protect Java Heap Memory from Illicit Native Code Access
Huinan Chen,
Jiang Ma,
Chun Jason Xue, and
Qingan Li
(Wuhan University, China; OPPO Electronics, China; MBZUAI, United Arab Emirates)
With the proliferation of mobile devices in daily life, ensuring the security and performance of these devices has become crucial. On Android, the Java Native Interface (JNI) acts as a bridge, allowing native libraries to directly access Java heap memory via raw pointers, bypassing Java's built-in safety checks. While this offers powerful functionality and performance, it also threatens the memory safety of the Java heap. Recently, the Memory Tagging Extension (MTE) has been introduced into the Arm architecture to enhance memory safety, reducing software vulnerabilities caused by illegal memory operations. This paper proposes MTE4JNI, an MTE-based JNI checking method, to protect Java heap memory from illicit native code access. Experimental results on real Android devices demonstrate that, compared to the currently employed guarded copy method, the proposed MTE4JNI method provides superior memory safety protection, while significantly reducing the runtime overhead, on average by 11x and 27x for single-threaded and multi-threaded environments, respectively.
Pattern Matching in AI Compilers and Its Formalization
Joseph W. Cutler,
Alex Collins,
Bin Fan,
Mahesh Ravishankar, and
Vinod Grover
(University of Pennsylvania, USA; NVIDIA, USA; NVIDIA, UK)
PyPM is a Python-based domain specific language (DSL) for building rewrite-based optimization passes on machine learning computation graphs. Users define individual optimizations by writing (a) patterns that match subgraphs of a computation graph and (b) corresponding rules which replace a matched subgraph with an optimized kernel. PyPM is distinguished from the many other DSLs for defining rewriting passes by its complex and novel pattern language, which borrows concepts from logic programming. PyPM patterns can be recursive, nondeterministic, and can require checking domain-specific constraints such as the shapes of tensors. The PyPM implementation is thus similarly complicated, consisting of thousands of lines of C++ code. In this paper, we present our work on building PyPM, as well as formalizing and distilling this complexity into an understandable mathematical core. We have developed a formal core calculus expressing the main operations of the PyPM pattern language. We define both a declarative semantics, describing which patterns match which terms, and an algorithmic semantics, an idealized version of the PyPM pattern interpreter, and prove their equivalence. The development is fully mechanized in the Coq proof assistant.
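The kind of rewrite pass PyPM automates looks, in miniature, like the following Python sketch: a pattern matches a multiply feeding an add, and a rule replaces the pair with a fused multiply-add. The graph encoding and function names are ours, not PyPM's DSL.

def match_mul_add(graph):
    for nid, (op, ins) in graph.items():
        if op == "add":
            for pred in ins:
                if graph[pred][0] == "mul":
                    return nid, pred
    return None

def rewrite_fma(graph):
    while (m := match_mul_add(graph)) is not None:
        add_id, mul_id = m
        a, b = graph[mul_id][1]
        (c,) = [i for i in graph[add_id][1] if i != mul_id]
        graph[add_id] = ("fma", [a, b, c])     # replace matched subgraph
        del graph[mul_id]
    return graph

g = {"x": ("input", []), "y": ("input", []), "z": ("input", []),
     "t": ("mul", ["x", "y"]), "r": ("add", ["t", "z"])}
print(rewrite_fma(g))   # r becomes ('fma', ['x', 'y', 'z'])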
Postiz: Extending Post-increment Addressing for Loop Optimization and Code Size Reduction
Enming Fan,
Xiaofeng Guan,
Fan Hu,
Heng Shi,
Hao Zhou, and
Jianguo Yao
(Shanghai Enflame Technology, China; Shanghai Jiao Tong University, China)
Memory access instructions with auto-addressing modes are prevalent in various Instruction Set Architectures (ISAs), yet their use in compilers remains limited. Existing methods address code optimization in one of two ways: they either focus on reducing code size, but are constrained to basic block-level optimizations and may not fully exploit architectural benefits, or they optimize loop performance, often neglecting the advantages of post-increment instructions and focusing primarily on innermost loops while leaving outer loops unoptimized. To address these shortcomings and meet the needs of real-world Machine Learning (ML) applications, we introduce Postiz, a novel post-increment loop optimization technique. Postiz extends post-increment optimizations beyond traditional limits, incorporating enhancements for inner loops, cross-loop regions, and nested loop structures. Through a profitability analysis, Postiz optimizes code judiciously, leveraging architectural advantages and reducing code size without compromising improvements made by other optimizations. Our experiments show that Postiz is effective, achieving an optimization coverage of 98.04% on MobileNet and BERT benchmarks. In comparison to default LLVM optimization, Postiz generates approximately four times more post-increment instructions. Moreover, it reduces code size by an average of 9.45% across various platforms. These improvements represent significant advancements over current methods, showcasing Postiz’s potential to enhance compiler optimizations in a meaningful way.
Teapot: Efficiently Uncovering Spectre Gadgets in COTS Binaries
Fangzheng Lin,
Zhongfa Wang, and
Hiroshi Sasaki
(Institute of Science Tokyo, Japan)
Speculative execution is crucial in enhancing modern processor performance but can introduce Spectre-type vulnerabilities that may leak sensitive information. Detecting Spectre gadgets from programs has been a research focus to enhance the analysis and understanding of Spectre attacks. However, one of the problems of existing approaches is that they rely on the presence of source code (or are impractical in terms of run-time performance and gadget detection ability).
This paper presents Teapot, the first Spectre gadget scanner that works on COTS binaries with comparable performance to compiler-based alternatives. As its core principle, we introduce Speculation Shadows, a novel approach that separates the binary code for normal execution and speculation simulation in order to improve run-time efficiency.
Teapot is based on static binary rewriting. It instruments the program to simulate the effects of speculative execution and also adds integrity checks to detect Spectre gadgets at run time. By leveraging fuzzing, Teapot succeeds in efficiently detecting Spectre gadgets. Evaluations show that Teapot outperforms a previously proposed binary-based approach in both performance (by more than 20x) and gadget detection ability.
Artifacts Available
Artifacts Reusable
Qubit Movement-Optimized Program Generation on Zoned Neutral Atom Processors
Enhyeok Jang,
Youngmin Kim,
Hyungseok Kim,
Seungwoo Choi,
Yipeng Huang, and
Won Woo Ro
(Yonsei University, South Korea; Rutgers University, USA)
A zoned neutral atom architecture achieves exceptional fidelity by segregating the execution spaces of 1- and 2-qubit gates, making it a promising candidate for high-accuracy quantum systems. Unfortunately, naïvely applying programs designed for static qubit topologies to zoned architectures may result in most execution time being consumed by intra-zone travels of atoms. To address this, we introduce Mantra (Minimizing trAp movemeNts for aTom aRray Architectures), which rewrites quantum programs to reduce the interleaving of single- and two-qubit gates. Mantra incorporates three strategies: (i) a fountain-shaped controlled-Z (CZ) chain, (ii) a ZZ-interaction protocol without a 1-qubit gate, and (iii) preemptive gate scheduling. Mantra reduces inter-zone movements by 68%, physical gate counts by 35%, and improves circuit fidelities by 17% compared to the standard executions.
FastFlip: Compositional SDC Resiliency Analysis
Keyur Joshi,
Rahul Singh,
Tommaso Bassetto,
Sarita Adve,
Darko Marinov, and
Sasa Misailovic
(University of Illinois Urbana-Champaign, USA)
To efficiently harden programs susceptible to Silent Data Corruptions (SDCs), developers need to invoke error injection analyses to find particularly vulnerable instructions and then selectively protect them using appropriate compiler-level SDC detection mechanisms. However, these error injection analyses are both expensive and monolithic: they must be run from scratch after even small changes to the code, such as optimizations or bug fixes. This high recurring cost keeps such software-directed resiliency analyses out of standard software engineering practices such as regression testing. We present FastFlip, the first approach tailored to seamlessly incorporate resiliency analysis within the iterative software development workflow. FastFlip combines empirical error injection and symbolic SDC propagation analyses to enable fast and compositional error injection analysis of evolving programs. When developers modify a program, FastFlip often has to re-analyze only the modified program sections, which can save a significant amount of analysis time. We evaluated FastFlip with five benchmark programs. In our experiments, for each benchmark, we analyzed the original version plus two modified versions. The compositional nature of FastFlip speeds up the analysis of the incrementally modified versions by 3.2× (geomean) and up to 17.2×. The results demonstrate that FastFlip can effectively select a set of instructions to protect against SDCs that minimizes the runtime protection cost while protecting against a developer-specified target fraction of all tested SDC-causing errors.
Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-in-Time Compilation
Giorgis Georgakoudis,
Konstantinos Parasyris, and
David Beckingsale
(Lawrence Livermore National Laboratory, USA)
In high-performance computing (HPC), fast application execution is the primary objective. HPC software is written in high-performance languages (C/C++, Fortran) and is statically compiled Ahead-of-Time (AOT) using optimizing compilers to generate fast code. AOT compilation optimizes source code with only limited information available at compile time, which precludes possible optimizations leveraging runtime information. We propose Proteus, an easy-to-use, portable, and lightweight Just-In-Time (JIT) compilation approach to optimize GPU kernels at runtime. Proteus dynamically extracts, compiles, and optimizes language-agnostic LLVM IR to reduce compilation overhead while enhancing portability and versatility compared to language-specific solutions. Using a minimally intrusive annotation-based interface, Proteus specializes GPU kernels for input arguments and launch parameters. Evaluation on a diverse set of programs on AMD and NVIDIA GPUs shows that Proteus achieves significant end-to-end speedup, up to 2.8× on AMD and 1.78× on NVIDIA, over AOT optimization, while outperforming the CUDA-specific Jitify with an average 1.23× speedup, thanks to reduced overhead and faster binary code in certain cases.
Artifacts Available
Artifacts Reusable
Results Reproduced
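The runtime-specialization idea behind Proteus can be transplanted to Python for illustration (Proteus itself JIT-compiles LLVM IR for GPU kernels; nothing below is its real API): regenerate and cache a kernel with its runtime-constant arguments baked in, so a compiler could fold and unroll them.

import functools

def specialize(kernel_src, **constants):
    """Compile kernel source with the given arguments fixed as constants."""
    ns = dict(constants)
    exec(kernel_src, ns)              # stands in for a JIT compile step
    return ns["kernel"]

@functools.lru_cache(maxsize=None)    # one compile per specialization
def get_kernel(n):
    src = (
        "def kernel(xs):\n"
        "    # N is now a compile-time constant; a real JIT could\n"
        "    # fully unroll this loop and vectorize it.\n"
        "    return [x * N for x in xs]\n"
    )
    return specialize(src, N=n)

fast = get_kernel(3)                  # compiled once for N == 3 ...
print(fast([1, 2]), fast([5]))        # ... then reused: [3, 6] [15]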
Automatic Synthesis of Specialized Hash Functions
Renato B. Hoffmann,
Leonardo G. Faé,
Dalvan Griebler,
Xinliang David Li, and
Fernando Magno Quintão Pereira
(PUC-RS, Brazil; Google, USA; Federal University of Minas Gerais, Brazil)
This paper introduces a technique for synthesizing hash functions specialized to particular byte formats. This code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging over predetermined sequences of bytes. Code generation involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. Comparative analysis demonstrates that the synthetic functions outperform the general-purpose hashes in the C++ Standard Template Library and the Google Abseil Library when keys follow ascending, normal, or uniform distributions. In applications where low-mixing hashes are acceptable, the synthetic functions achieve speedups ranging from 2% to 11% on full benchmarks, and speedups of almost 50x when only hashing speed is considered.
Artifacts Available
Artifacts Reusable
Results Reproduced
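The flavor of a synthesis result can be shown directly (the generator itself is omitted, and none of the names below come from the paper): once key examples reveal the fixed format "user-" followed by four digits, the specialized hash can skip the constant prefix and mix only the four varying bytes.

def generic_hash(key: bytes) -> int:
    """General-purpose baseline: FNV-1a over every byte of the key."""
    h = 0xcbf29ce484222325
    for b in key:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def specialized_hash(key: bytes) -> int:
    """Synthesized for keys matching user-\d{4}: bytes 0..4 are constant,
    so only bytes 5..8 carry entropy and need mixing."""
    h = key[5] | key[6] << 8 | key[7] << 16 | key[8] << 24
    return (h * 0x9E3779B1) & 0xFFFFFFFF

keys = [f"user-{i:04d}".encode() for i in range(3)]
assert len({specialized_hash(k) for k in keys}) == len(keys)  # no collisions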
Weaver: A Retargetable Compiler Framework for FPQA Quantum Architectures
Oğuzcan Kırmemiş,
Francisco Romão,
Emmanouil Giortamis, and
Pramod Bhatotia
(TU Munich, Germany)
While the prominent quantum computing architectures are based on superconducting technology, new quantum hardware technologies are emerging, such as Trapped Ions, Neutral Atoms (or FPQAs), Silicon Spin Qubits, etc. This diverse set of technologies presents fundamental trade-offs in terms of scalability, performance, manufacturing, and operating expenses. To manage these diverse quantum technologies, there is a growing need for a retargetable compiler that can efficiently adapt existing code to these emerging hardware platforms. Such a retargetable compiler must be extensible to support new and rapidly evolving technologies, performant with fast compilation times and high-fidelity execution, and verifiable through rigorous equivalence checking to ensure the functional equivalence of the retargeted code. To this end, we present Weaver, the first extensible, performant, and verifiable retargetable quantum compiler framework with a focus on FPQAs due to their unique, promising features. Weaver introduces wQasm, the first formal extension of the standard OpenQASM quantum assembly with FPQA-specific instructions to support their distinct capabilities. Next, Weaver implements the wOptimizer, an extensible set of FPQA-specific optimization passes to improve execution quality. Last, the wChecker automatically checks for equivalence between the original and the retargeted code. Our evaluation shows that Weaver improves compilation times by 103×, execution times by 4.4×, and execution fidelity by 10%, on average, compared to superconducting and state-of-the-art (non-retargetable) FPQA compilers.
Artifacts Available
Artifacts Reusable
Results Reproduced
Speeding up the Local C++ Development Cycle with Header Substitution
Nader Al Awar,
Zijian Yi,
George Biros, and
Milos Gligoric
(University of Texas at Austin, USA)
C++ remains one of the most widely used languages in various computing fields, from embedded programming to high-performance computing. While new features are constantly being added to C++, an important aspect of the language that is often overlooked is its compilation time. Merely including a few header files can cause compilation time to increase significantly. An alternative to including header files is using forward declarations; however, the rules for forward declaring classes and functions are non-obvious and confusing to most developers. Additionally, forward declaring methods, as well as functions that accept lambdas as arguments, is not possible. In this paper, we present a novel technique, termed Header Substitution, to automatically detect opportunities for forward declarations with the goal of replacing includes of header files and improving compilation time. Header Substitution also introduces function wrappers as an alternative to forward declaring methods and functions with lambda arguments. We implemented Header Substitution in a tool, dubbed Yalla, and applied it to various C++ projects in order to speed up the development cycle, i.e., the debugging, editing, compiling, and rerunning loop, achieving up to a 24.5x speedup when compiling C++ files and a 4.68x speedup of the development cycle.
Artifacts Functional
Results Reproduced
ASDF: A Compiler for Qwerty, a Basis-Oriented Quantum Programming Language
Austin J. Adams,
Sharjeel Khan,
Arjun S. Bhamra,
Ryan R. Abusaada,
Anthony M. Cabrera,
Cameron C. Hoechst,
Travis S. Humble,
Jeffrey S. Young, and
Thomas M. Conte
(Georgia Institute of Technology, USA; Oak Ridge National Laboratory, USA)
Qwerty is a high-level quantum programming language built on bases and functions rather than circuits. This new paradigm introduces new challenges in compilation, namely synthesizing circuits from basis translations and automatically specializing adjoint or predicated forms of functions. This paper presents ASDF, an open-source compiler for Qwerty that answers these challenges in compiling basis-oriented languages. Enabled with a novel high-level quantum IR implemented in the MLIR framework, our compiler produces OpenQASM 3 or QIR for either simulation or execution on hardware. Our compiler is evaluated by comparing the fault-tolerant resource requirements of generated circuits with other compilers, finding that ASDF produces circuits with comparable cost to prior circuit-oriented compilers.
Artifacts Available
Artifacts Reusable
Results Reproduced
CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning
Guoliang He and
Eiko Yoneki
(University of Cambridge, UK)
Large language models (LLMs) are marked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize GPU utilization. However, those specialized kernels may still leave performance on the table, as CUDA assembly experts show that manual optimization of GPU SASS schedules can lead to better performance, and trial-and-error is largely employed to manually find the best GPU SASS schedules.
In this work, we employ an automatic approach to optimize GPU SASS schedules, which can thus be integrated into existing compiler frameworks. The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling. To this end, we formulate an assembly game, where RL agents can play to find the best GPU SASS schedules. The assembly game starts from an -O3-optimized SASS schedule, and the RL agents can iteratively apply actions to mutate the current schedules. Positive rewards are generated if the mutated schedules achieve higher throughput when executed on GPUs. Experiments show that CuAsmRL can further improve the performance of existing specialized CUDA kernels transparently by up to 26%, and on average by 9%. Moreover, it can be used as a tool to reveal potential optimization moves learned automatically.
Artifacts Available
Artifacts Reusable
Results Reproduced
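A skeleton of the assembly game, with a greedy mutation loop standing in for the learned RL policy and a stub in place of GPU measurement; the instruction names and the dependence check below are invented.

import random

def measure(schedule):                # stand-in for running on the GPU
    return -schedule.index("FFMA")    # toy: earlier FFMA issues faster

def independent(a, b):                # toy dependence check
    return a != b

def play(schedule, steps=100, rng=random.Random(0)):
    best, best_score = schedule[:], measure(schedule)
    for _ in range(steps):
        i = rng.randrange(len(best) - 1)
        if not independent(best[i], best[i + 1]):
            continue                  # illegal action: skip
        cand = best[:]
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        score = measure(cand)
        if score > best_score:        # positive reward: keep the mutation
            best, best_score = cand, score
    return best

print(play(["LDG", "IADD", "FFMA", "STG"]))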
IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
Zixuan Ma,
Haojie Wang,
Jingze Xing,
Shuhong Huang,
Liyan Zheng,
Chen Zhang,
Huanqi Cao,
Kezhao Huang,
Mingshu Zhai,
Shizhi Tang,
Penghan Wang, and
Jidong Zhai
(Tsinghua University, China; Qingcheng.AI, China)
Tensor compilers play a critical role in optimizing deep neural networks (DNNs), with memory performance emerging as a key bottleneck in code generation for DNN models. Existing tensor compilers are constrained by inefficient auto-tuning algorithms. They either must deploy coarse-grained descriptions, thus missing potential optimizations, or struggle with vast search spaces, rendering auto-tuning inapplicable. Tensor compilers require a more holistic optimization of memory performance to overcome these constraints. To address this issue, we focus our optimization objective on memory performance, which allows us to design monotonic optimization methods, significantly enhancing the efficiency of auto-tuning and thus enabling auto-tuning on a fine-granularity description. Based on these observations, we propose IntelliGen, a tensor compiler with instruction-level auto-tuning and monotonic memory optimization. We design an instruction-level graph description, along with a monotonic optimization method over it. Benefiting from auto-tuning techniques with a fine-grained description, IntelliGen demonstrates significant speedups of up to 3.13×, 3.55×, and 16.9× (averaging 1.46×, 1.85×, and 2.30×, respectively) on NVIDIA GPUs, AMD GPUs, and Cambricon MLUs over the most efficient existing frameworks.
CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries
Yoon Noh Lee,
Yongseung Yu, and
Yongjun Park
(Yonsei University, South Korea)
Large Language Models (LLMs) have recently emerged as state-of-the-art learning models with a wide range of applications in diverse computing environments. Among the various computational operations that comprise an LLM, the GEneral Matrix Multiplication (GEMM) operation is the most frequently utilized. GEMM libraries such as cuBLAS and CUTLASS provide a variety of optimization techniques to achieve optimal GEMM performance in GPU-enabled computing environments. In particular, the CUTLASS open-source library for GPUs within the CUDA programming environment provides users with the capability to optimize templates for high performance. Previous research has demonstrated the effectiveness of CUTLASS-based GEMMs in improving the performance of real-world deep neural networks on various deep learning platforms. However, these studies have not considered different model parameters for modern LLMs, nor have they explored the impact of diverse GPU computing environments. This paper presents CUrator, an efficient LLM execution engine that can achieve optimal end-to-end LLM performance using both the cuBLAS and CUTLASS libraries on different GPUs for modern LLMs such as BERT, GPT, and Llama. CUrator first generates CUTLASS-/cuBLAS-friendly graph IRs of various LLMs on the TVM framework to maximize mapping coverage. On the CUTLASS mapping path, it performs a comprehensive search for programmable tuning parameters in the CUTLASS library with the objective of deriving optimal kernels for all GEMMs within each LLM. CUrator further introduces two optimization techniques: 1) build-time reduction key initialization support for CUTLASS Split-K GEMMs, and 2) Split-K support for CUTLASS Batch GEMMs. Finally, CUrator selects the best-performing mapping path between the cuBLAS and CUTLASS paths. The experimental results show that CUrator achieves inference speedups of 1.50× and 4.99× for representative LLMs on the A100 GPU in single and half precision, respectively, compared to the baseline. We strongly believe that the CUrator framework can provide the best direction for next-generation tuning frameworks by showing the maximum end-to-end performance of various LLMs on various GPUs.
Artifacts Available
Artifacts Reusable
Results Reproduced
xDSL: Sidekick Compilation for SSA-Based Compilers
Mathieu Fehr,
Michel Weber,
Christian Ulmann,
Alexandre Lopoukhine,
Martin Paul Lücke,
Théo Degioanni,
Christos Vasiladiotis,
Michel Steuwer, and
Tobias Grosser
(University of Edinburgh, UK; ETH Zurich, Switzerland; University of Cambridge, UK; ENS Rennes, France; Technische Universität Berlin, Germany)
Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototyping in a production compiler can be cumbersome, as such compilers are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition from a prototype compiler to production requires significant engineering work. To bridge this gap, we introduce the concept of sidekick compiler frameworks, an approach that uses multiple frameworks that interoperate with each other by leveraging textual interchange formats and declarative descriptions of abstractions. Each such compiler framework is specialized for specific use cases, such as performance or prototyping. Abstractions are by design shared across frameworks, simplifying the transition from prototyping to production. We demonstrate this idea with xDSL, a sidekick for MLIR focused on prototyping and teaching. xDSL interoperates with MLIR through a shared textual IR and the exchange of IRs through an IR Definition Language. The benefits of sidekick compiler frameworks are evaluated by showing, on three use cases, how xDSL impacts their development: teaching, DSL compilation, and rewrite system prototyping. We also investigate the trade-offs that xDSL offers, and demonstrate how we simplify the transition between frameworks using the IRDL dialect. With sidekick compilation, we envision a future in which engineers minimize the cost of development by choosing a framework built for their immediate needs, and later transitioning to production with minimal overhead.
Artifacts Available
Artifacts Reusable
Results Reproduced
Parallaft: Runtime-Based CPU Fault Tolerance via Heterogeneous Parallelism
Boyue Zhang,
Sam Ainsworth,
Lev Mukhanov, and
Timothy M. Jones
(University of Cambridge, UK; University of Edinburgh, UK; Queen Mary University of London, UK)
The increasing vulnerability of microprocessors due to frequent silicon faults greatly exacerbates the risks of silent data corruption. Existing software-based schemes to detect these suffer from high power and performance overhead. State-of-the-art hardware fault-tolerance techniques exploit processor heterogeneity to minimize power, performance, and area overhead, but have not seen deployment in production due to their complexity.
This paper shows for the first time that the same insights of heterogeneous parallelism can be repurposed without any hardware support. We present Parallaft, a parallel software-based error detection technique taking the insights of state-of-the-art hardware techniques and repurposing them with tools more suited to the hardware of today, such as copy-on-write checkpointing, dirty-page tracking, and performance-counter synchronization. This allows error checking to be offloaded to the little cores of an Apple M2 heterogeneous processor, achieving half the energy cost while maintaining performance comparable to the homogeneous duplication mechanism RAFT.
Artifacts Available
Artifacts Reusable
Results Reproduced
An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks
Haoke Xu,
Yulin Zhang,
Zitong Cheng, and
Xiaoming Li
(University of Delaware, USA; Minzu University of China, China)
Convolution is the most time-consuming computation kernel in Convolutional Neural Network (CNN) applications and the majority of Graph Neural Network (GNN) applications. To achieve good convolution performance, current NN libraries such as cuDNN usually transform the naive convolution problem into either the matrix-vector multiplication form, called the im2col+MM approach, or the Fourier domain representation, namely the FFT approach. Both approaches either introduce a significant amount of data redundancy to reduce the number of data passes and to leverage highly tuned linear algebra libraries, or avoid data redundancy but require multiple passes over the input. Therefore, no implementation of convolution can outperform the others in all cases. In this paper, we introduce a novel transformation that reinterprets the convolution in NNs as a polynomial multiplication problem. By carefully constructing a number of conceptual polynomials, NN convolution essentially becomes the well-known problem of calculating the coefficients in the product of two polynomials. We evaluate our approach in multiple ways: at the API level, comparing it with one of the most widely used NN libraries, cuDNN, as well as implementing it in PyTorch and comparing the performance with benchmark networks. On three NVIDIA GPUs (3090 Ti, V100, and A10G), our approach outperforms cuDNN over a broad range of parameters with maximum speedups of 34.6%, 43.1%, and 33.6% on these GPUs, respectively.
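The paper's core observation can be checked numerically in a few lines: the outputs of a 1-D convolution are exactly the coefficients of the product of two polynomials whose coefficients are the signal and the kernel. This worked example is ours; the paper's contribution is a fast GPU realization of the idea.

import numpy as np

x = np.array([1.0, 2.0, 3.0])          # signal  ->  1 + 2t + 3t^2
w = np.array([4.0, 5.0])               # kernel  ->  4 + 5t

# (1 + 2t + 3t^2)(4 + 5t) = 4 + 13t + 22t^2 + 15t^3
poly_prod = np.polynomial.polynomial.polymul(x, w)
conv = np.convolve(x, w, mode="full")
assert np.allclose(poly_prod, conv)    # [ 4. 13. 22. 15.]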
Synthesis of Quantum Simulators by Compilation
Meisam Tarabkhah,
Mahshid Delavar,
Mina Doosti, and
Amir Shaikhha
(University of Edinburgh, UK; University of Sheffield, UK)
Quantum simulation plays a critical role in advancing our understanding and development of quantum algorithms, quantum computing hardware, and quantum information science. Despite the availability of various quantum circuit simulators, they often face challenges in terms of maintainability and extensibility. In this paper, we introduce Lightweight Functional Quantum Simulator (LFQS), a compilation-based framework that addresses these issues by leveraging an intermediate language to synthesize efficient quantum simulators in C and CUDA. The intermediate language exploits the underlying structured sparsity behind the matrix representation of quantum gates. Our framework generates efficient code for multi-core CPUs and GPUs and outperforms state-of-the-art simulators, such as Qiskit, Catalyst, and QuEST, as well as a representative of dense tensor frameworks, NumPy, in both micro-benchmark and macro-benchmark circuits.
Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures
Reza Ghanbari,
Henry Kao,
João P. L. De Carvalho,
Ehsan Amiri, and
J. Nelson Amaral
(University of Alberta, Canada; Huawei Technologies, Canada)
Most compilers convert all iterations of a vectorizable loop into vector operations to decrease processing time. This paper proposes Scalar Interpolation, a technique that inserts scalar operations into vectorized loops to increase the utilization of execution units in processors with distinct pipelines for scalar and vector processing. Scalar interpolation inserts scalar operations for an entire iteration of the sequential loop to avoid data movements between vector and scalar registers. A challenge to introducing scalar interpolation is creating a static cost model to guide the compiler’s decision to interpolate scalar operations in a loop. An alternative to a static cost model is to perform auto-tuning in a loop to dynamically discover a sweet spot for the scalar interpolation factor. A performance study on an LLVM-based prototype reveals speedups of up to 30% on Intel Xeon (x86) with the static cost model, and 43% on Kunpeng-920 (AArch64) with auto-tuning.
Article Search
Artifacts Available
Artifacts Functional
Results Reproduced
A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications
Lukas Trümper,
Philipp Schaad,
Berke Ates,
Alexandru Calotoiu,
Marcin Copik, and
Torsten Hoefler
(Daisytuner, Germany; ETH Zurich, Switzerland)
The same computations are often expressed differently across software projects and programming languages. In particular, how computations involving loops are expressed varies due to the many possible ways to permute and compose loops. Since each variant may have unique performance properties, automatic approaches to loop scheduling must support many different optimization recipes. In this paper, we propose a priori loop nest normalization, which aligns loop nests to reduce this variation before optimization. Specifically, we define and apply normalization criteria that map loop nests with different memory access patterns to the same canonical form. Since the memory access pattern is sensitive to loop variations and critical for performance, this normalization allows many loop nests to be optimized by the same optimization recipe. To evaluate our approach, we apply the normalization together with optimizations designed only for the canonical form, improving the performance of many different loop nest variants. Across multiple implementations of 15 benchmarks in different languages, we outperform a baseline C compiler on average by a factor of 21.13, state-of-the-art auto-schedulers such as Polly and the Tiramisu auto-scheduler by factors of 2.31 and 2.89, and performance-oriented Python-based frameworks such as NumPy, Numba, and DaCe by factors of 9.04, 3.92, and 1.47. Furthermore, we apply the concept to the CLOUDSC cloud microphysics scheme, an actively used component of the Integrated Forecasting System, achieving a 10% speedup over the highly tuned Fortran code.
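To make the idea concrete, here is a toy sketch (the paper's normalization operates on dataflow representations, not Python source): variants such as jik or kji of a row-major matrix multiplication stride across rows in their innermost loop, and normalization permutes every variant to one canonical form so a single optimization recipe applies to all of them.

```python
def matmul_ikj(A, B):
    """Canonical form: innermost loop walks B[k][j] and C[i][j] with stride 1."""
    n, m, p = len(A), len(A[0]), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a_ik = A[i][k]                 # invariant in the inner loop
            for j in range(p):
                C[i][j] += a_ik * B[k][j]  # stride-1 memory accesses
    return C

# Variants like jik or kji compute the same C but with strided innermost
# accesses; a normalizing pass maps them all to the ikj form above, after
# which one recipe (e.g., tiling + vectorizing the j loop) fits every variant.
```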
Article Search
A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions
Alexandre Lopoukhine,
Federico Ficarelli,
Christos Vasiladiotis,
Anton Lydike,
Josse Van Delm,
Alban Dutilleul,
Luca Benini,
Marian Verhelst, and
Tobias Grosser
(University of Cambridge, UK; University of Bologna, Italy; Cineca, Italy; University of Edinburgh, UK; KU Leuven, Belgium; ENS Rennes, France; ETH Zurich, Switzerland)
High-performance micro-kernels must fully exploit today’s diverse and specialized hardware to deliver peak performance to deep neural networks (DNNs). While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstractions, unstructured intermediate representation (IR), and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We think it is time to widen the classical hourglass backend and embrace progressive lowering across a diverse set of structured abstractions to bring domain-specific code generation to compiler backends. We demonstrate this concept by implementing a custom backend for a RISC-V-based accelerator with hardware loops and streaming registers, leveraging knowledge about the hardware at levels of abstraction that match its custom instruction set architecture (ISA). We use incremental register allocation over structured IRs, while dropping classical spilling heuristics, and show up to 90% floating-point unit (FPU) utilization across key DNN kernels. By breaking the backend hourglass model, we reopen the path from domain-specific abstractions to specialized hardware.
Article Search
Artifacts Available
Artifacts Reusable
Results Reproduced
Accelerating LLMs using an Efficient GEMM Library and Target-Aware Optimizations on Real-World PIM Devices
Hyeoncheol Kim,
Taehoon Kim,
Taehyeong Park,
Donghyeon Kim,
Yongseung Yu,
Hanjun Kim, and
Yongjun Park
(Yonsei University, South Korea; Rebellions, South Korea; Hanyang University, South Korea)
Real-time processing of deep learning models on conventional systems, such as CPUs and GPUs, is highly challenging due to memory bottlenecks. This is exacerbated in Large Language Models (LLMs), whose execution is dominated by General Matrix Multiplication (GEMM) operations, which are more memory-intensive than convolution operations. Processing-in-Memory (PIM), which provides high internal bandwidth, is a promising alternative for LLM serving. However, since current PIM systems do not fully replace traditional memory, data transfer between the host and PIM-side memory is essential. Therefore, minimizing this transfer cost is crucial for serving LLMs efficiently on PIM.
In this paper, we propose PIM-LLM, an end-to-end framework that accelerates LLMs using an efficient tiled GEMM library and several key target-aware optimizations on real-world PIM systems. We first propose PGEMMlib, which provides tiling techniques optimized for PIM, taking architecture-specific characteristics into account to minimize unnecessary data transfer overhead and maximize parallelism. In addition, Tile-Selector uses an analytical model to explore optimized parameters and techniques for different GEMM shapes and the available resources of PIM systems. To accelerate LLMs with PGEMMlib, we integrate it into the TVM deep learning compiler framework. We further optimize LLM execution by applying several key optimizations: build-time memory layout adjustment, PIM resource pooling, CPU/PIM cooperation support, and QKV generation fusion. Evaluation shows that PIM-LLM achieves significant performance gains of up to 45.75× over the TVM baseline for several well-known LLMs. We believe this work provides key insights for efficient LLM serving on real PIM devices.
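The tiling intuition can be sketched as follows (illustrative Python; the tile sizes, loop order, and transfer accounting are stand-ins for what PGEMMlib's Tile-Selector would actually choose per GEMM shape): each weight tile crosses the host-PIM boundary once and is reused across all row blocks.

```python
import numpy as np

def tiled_gemm(A, B, TM=64, TN=64, TK=64):
    """C = A @ B; each B tile is 'sent' to PIM once, reused across row blocks."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    transfers = 0
    for k0 in range(0, K, TK):
        for n0 in range(0, N, TN):
            b_tile = B[k0:k0 + TK, n0:n0 + TN]   # one host->PIM transfer
            transfers += 1
            for m0 in range(0, M, TM):           # tile reused for every row block
                C[m0:m0 + TM, n0:n0 + TN] += A[m0:m0 + TM, k0:k0 + TK] @ b_tile
    return C, transfers

A, B = np.random.rand(128, 256), np.random.rand(256, 192)
C, t = tiled_gemm(A, B)
assert np.allclose(C, A @ B)
```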
Article Search
Calibro: Compilation-Assisted Linking-Time Binary Code Outlining for Code Size Reduction in Android Applications
Zhanhao Liang,
Hanming Sun,
Wenhan Shang,
Mengting Yuan,
Jingqin Fu,
Jiang Ma,
Chun Jason Xue, and
Qingan Li
(Wuhan University, China; Wuhan Broadcasting and Television Station, China; Guangdong OPPO Mobile Telecommunications, China; MBZUAI, United Arab Emirates)
Recent Android systems employ pre-compilation technology to boost app launch speed and runtime performance. However, this generates large OAT files that over-consume scarce memory and storage resources on mobile devices. This paper evaluates code redundancy in popular production Android applications and observes that it reaches up to 25%. To reduce code size by eliminating this redundancy, the paper proposes Calibro, a compilation-assisted linking-time binary code outlining method. Calibro consists of two parts: Compilation-Time code Outlining (CTO) and Linking-Time Binary code Outlining (LTBO), the latter guided by information collected at compilation time. Additionally, a parallelized suffix-tree method is proposed to reduce the build-time overhead, and a hot-function filtering method is proposed to mitigate the run-time performance degradation caused by code outlining. Experimental results show that, compared to the baseline (the original AOSP version with all available code size optimizations enabled), the proposed approach reduces code size in Android applications by 15.19% on average, with negligible runtime performance degradation and tolerable build-time overhead. Hence, the proposed code outlining approach is promising for production deployment.
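The size-reduction mechanism behind outlining can be illustrated with a naive scan (a toy sketch; real systems such as Calibro use suffix trees over much larger instruction streams, and the mnemonics below are made up):

```python
def longest_repeat(instrs, min_len=2):
    """Longest subsequence occurring at least twice without overlap."""
    for length in range(len(instrs) // 2, min_len - 1, -1):
        seen = {}
        for i in range(len(instrs) - length + 1):
            key = tuple(instrs[i:i + length])
            j = seen.setdefault(key, i)
            if j != i and j + length <= i:   # earlier, non-overlapping match
                return key
    return None

def outline(instrs):
    """Replace every occurrence of the best repeat with a call to one copy."""
    seq = longest_repeat(instrs)
    out, i = [], 0
    while i < len(instrs):
        if seq and tuple(instrs[i:i + len(seq)]) == seq:
            out.append("call OUTLINED_0")
            i += len(seq)
        else:
            out.append(instrs[i])
            i += 1
    return out, (list(seq) + ["ret"]) if seq else None

prog = ["ld r1", "add r1, r2", "st r1", "mul r3, r4",
        "ld r1", "add r1, r2", "st r1", "sub r5, r6"]
print(outline(prog))  # two copies collapse into one outlined function
```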
Article Search
Tensorize: Fast Synthesis of Tensor Programs from Legacy Code using Symbolic Tracing, Sketching and Solving
Alexander Brauckmann,
Luc Jaulmes,
José W. de Souza Magalhães,
Elizabeth Polgreen, and
Michael F. P. O’Boyle
(University of Edinburgh, UK)
Tensor domain-specific languages (DSLs) achieve substantial performance due to high-level compiler optimization and hardware acceleration. However, achieving such performance for existing applications requires the programmer to manually rewrite their legacy code in evolving Tensor DSLs. Prior efforts to automate this translation face significant scalability issues, which greatly reduce their applicability to real-world code. This paper presents Tensorize, a novel MLIR-based compiler approach that automatically lifts legacy code to high-level Tensor DSLs using program synthesis. Tensorize uses a symbolic trace of the legacy program as a specification and automatically selects sketches from the target Tensor DSLs to drive the program synthesis. It uses an algebraic solver to rapidly simplify the specification, resulting in a fast, automatic approach that is correct by design. We evaluate Tensorize on several legacy code benchmarks and compare it against state-of-the-art techniques. Tensorize lifts more code than prior schemes, is an order of magnitude faster in synthesis time, and guarantees correctness by construction.
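The trace-as-specification idea can be sketched with off-the-shelf symbolic execution (a toy using SymPy; Tensorize itself works over MLIR with its own algebraic solver): run the legacy loop nest on symbolic inputs, then check the trace against a candidate DSL sketch.

```python
import sympy as sp

n = 3
A = sp.Matrix(n, n, lambda i, j: sp.Symbol(f"a{i}{j}"))
x = sp.Matrix(n, 1, lambda i, _: sp.Symbol(f"x{i}"))

def legacy_kernel(A, x):              # legacy loop nest to be lifted
    y = [sp.Integer(0)] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i, j] * x[j, 0]
    return sp.Matrix(y)

trace = legacy_kernel(A, x)           # symbolic trace acts as the spec
sketch = A * x                        # candidate Tensor-DSL op (matvec)
assert sp.simplify(trace - sketch) == sp.zeros(n, 1)   # lift validated
```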
Article Search
Artifacts Available
Artifacts Reusable
Results Reproduced
DialEgg: Dialect-Agnostic MLIR Optimizer using Equality Saturation with Egglog
Abd-El-Aziz Zayed and
Christophe Dubach
(McGill University, Canada; MILA, Canada)
MLIR’s ability to optimize programs at multiple levels of abstraction is key to enabling domain-specific optimizing compilers. However, expressing optimizations remains tedious, and optimizations can interact in unexpected ways, making it hard to achieve full performance.
Equality saturation promises to solve these challenges. First, it simplifies the expression of optimizations using rewrite rules. Second, it considers all possible optimization interactions through saturation, selecting the best program variant. Despite these advantages, equality saturation remains absent from production compilers such as MLIR.
This paper proposes to integrate Egglog, a recent equality saturation engine, with MLIR in a dialect-agnostic manner. It shows how the main MLIR constructs, such as operations, types, and attributes, can be modeled in Egglog. It also presents DialEgg, a tool that pre-defines a large set of common MLIR constructs in Egglog and automatically translates between the MLIR and Egglog program representations. Through a few use cases, the paper demonstrates the potential of combining equality saturation and MLIR.
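A tiny illustration of the phase-ordering problem that equality saturation removes (a greedy term rewriter in Python; DialEgg expresses such rules in Egglog over MLIR): applying strength reduction before cancellation destroys the better rewrite, whereas an e-graph keeps both alternatives alive until extraction.

```python
def match(pat, expr, env):
    """Structurally match `pat` (with ?-variables) against `expr`."""
    if isinstance(pat, str) and pat.startswith("?"):
        return env if env.setdefault(pat, expr) == expr else None
    if not (isinstance(pat, tuple) and isinstance(expr, tuple)):
        return env if pat == expr else None
    if len(pat) != len(expr):
        return None
    for p, e in zip(pat, expr):
        if match(p, e, env) is None:
            return None
    return env

def subst(pat, env):
    if isinstance(pat, str) and pat.startswith("?"):
        return env[pat]
    return tuple(subst(p, env) for p in pat) if isinstance(pat, tuple) else pat

def rewrite(expr, rule):
    """Greedily apply `rule` top-down, everywhere it matches, once."""
    env = match(rule[0], expr, {})
    if env is not None:
        return subst(rule[1], env)
    return tuple(rewrite(e, rule) for e in expr) if isinstance(expr, tuple) else expr

CANCEL   = (("/", ("*", "?x", ("const", 2)), ("const", 2)), "?x")   # (x*2)/2 -> x
STRENGTH = (("*", "?x", ("const", 2)), ("<<", "?x", ("const", 1)))  # x*2 -> x<<1

e = ("/", ("*", ("var", "a"), ("const", 2)), ("const", 2))
print(rewrite(e, CANCEL))                     # ('var', 'a')  -- the good outcome
print(rewrite(rewrite(e, STRENGTH), CANCEL))  # the shift blocks cancellation
```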
Article Search
Artifacts Available
Artifacts Reusable
Results Reproduced
GraalNN: Context-Sensitive Static Profiling with Graph Neural Networks
Lazar Milikic,
Milan Cugurovic, and
Vojin Jovanovic
(Oracle Labs, Switzerland; Oracle Labs, Serbia)
Accurate static profile prediction is crucial for achieving optimal program performance in the absence of dynamic profiles. However, existing static profiling methods struggle to fully exploit the complex structure of the compiler’s intermediate representation and fail to effectively utilize the calling context information needed for accurate profile inference. To address these limitations, we introduce GraalNN, a Graph Neural Network-based static profiling framework that directly learns structural information from control-flow graphs. This reduces the reliance on handcrafted features and minimizes the effort required for feature engineering while improving the model’s ability to predict profiles. GraalNN adopts a two-stage approach: it predicts context-insensitive profiles during parsing and uses contextual information from the call graph to refine profiles during inlining. This methodology achieves a 10.13% runtime speedup across a diverse set of industry-standard benchmarks, surpassing state-of-the-art static profiling techniques by more than 2.5%. Furthermore, GraalNN improves throughput by 3.7% over other static profiling methods on real-world microservices.
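Schematically, such a prediction looks like message passing over the CFG, as in the following toy (untrained weights and invented features; GraalNN's actual models, feature set, and context handling are its contribution):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical diamond CFG over four basic blocks: 0->1, 0->2, 1->3, 2->3.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
h = rng.normal(size=(4, 8))        # per-block features (e.g. opcode counts)
W = 0.1 * rng.normal(size=(8, 8))  # untrained weights, purely illustrative
w_out = 0.1 * rng.normal(size=8)

for _ in range(2):                 # two rounds of neighbor aggregation
    deg = adj.sum(1, keepdims=True) + 1
    h = np.tanh((((adj @ h) + h) / deg) @ W)

hotness = 1 / (1 + np.exp(-(h @ w_out)))   # per-block score in (0, 1)
print(hotness)
```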
Article Search
PreFix: Optimizing the Performance of Heap-Intensive Applications
Chaitanya Mamatha Ananda,
Rajiv Gupta,
Sriraman Tallam,
Han Shen, and
Xinliang David Li
(University of California at Riverside, USA; Google, USA)
Analyses of heap-intensive applications show that a small fraction of heap objects account for the majority of heap accesses and data cache misses. Prior works like HDS and HALO have shown that allocating hot objects in separate memory regions can improve spatial locality, leading to better application performance. However, these techniques are constrained in two primary ways that limit their gains. First, they achieve imperfect separation, polluting the hot memory region with many cold objects. Second, they cannot reorder objects across allocations, as the original object allocation order is preserved. This paper presents a novel technique that achieves near-perfect separation of hot objects via a new context mechanism that identifies hot objects efficiently and with high precision. The technique, named PreFix, is based upon Preallocating memory for a Fixed small number of hot objects. Guided by profiles, the program is instrumented to compute context information, derived from dynamic object identifiers, that precisely identifies hot object allocations, which are then placed at predetermined locations in the preallocated memory. The preallocated memory region provides the flexibility to reorder objects across allocations and allows colocation of objects that are part of a hot data stream (HDS), improving spatial locality. The runtime overhead of identifying hot objects is low because the optimization focuses on a small number of static hot allocation sites and dynamic hot objects. While the program's memory footprint increases, the increase is manageable and can be controlled by limiting the size of the preallocated memory. In addition, PreFix incorporates an object-recycling optimization that reuses the same preallocated space for different objects whose lifetimes are not expected to overlap. Our experiments with 13 heap-intensive applications yield reductions in execution time ranging from 2.77% to 74%. On average, PreFix reduces execution time by 21.7%, compared to 7.3% for HDS and 14% for HALO, owing to its precise hot-object identification, hot-object colocation, and low runtime overhead.
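The placement mechanism can be caricatured in a few lines (the site names, contexts, and slot map below are invented for illustration; PreFix derives them from profiles and dynamic object identifiers):

```python
ARENA_SLOTS = 1024
arena = [None] * ARENA_SLOTS          # stands in for one contiguous hot region

# Profile-derived placement: (allocation site, context) -> fixed arena slot.
hot_slots = {("site_17", "ctx_A"): 0,
             ("site_17", "ctx_B"): 1,
             ("site_42", "ctx_A"): 2}

def prefix_alloc(site, ctx, make_obj):
    slot = hot_slots.get((site, ctx))
    if slot is not None:              # hot object: fixed, colocated placement
        arena[slot] = make_obj()      # slot may be recycled across lifetimes
        return arena[slot]
    return make_obj()                 # cold object: ordinary heap allocation

obj = prefix_alloc("site_17", "ctx_A", lambda: {"payload": 0})
```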
Article Search
Towards Efficient Compiler Auto-tuning: Leveraging Synergistic Search Spaces
Haolin Pan,
Yuanyu Wei,
Mingjie Xing,
Yanjun Wu, and
Chen Zhao
(Institute of Software at Chinese Academy of Sciences, China; Hangzhou Institute for Advanced Study at University of Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Determining the optimal sequence of compiler optimization passes is challenging due to the extensive and intricate search space. Traditional auto-tuning techniques, such as iterative compilation and machine learning methods, are often limited by high computational costs and difficulty generalizing to new programs. These approaches can be inefficient and may not fully address the varying optimization needs of different programs. This paper introduces a novel approach that leverages the synergistic relationships between optimization passes to effectively reduce the search space. Focusing on chained synergy pass pairs that jointly optimize a specific target, our method uses K-means clustering to capture common optimization patterns across programs and groups these pairs into coresets. Leveraging a supervised learning model trained on these coresets, we predict the most beneficial coreset for a new program, streamlining the search for optimal sequences. By integrating various search strategies, our method quickly converges to near-optimal solutions. Our approach achieves state-of-the-art performance on ten benchmark datasets, including MiBench, CBench, NPB, and CHStone, demonstrating an average reduction of 7.5% in Intermediate Representation (IR) instruction count compared to Oz. Furthermore, the set of chained synergy pass pairs is well suited to iterative-search studies by other researchers: with a simple search strategy that takes only about 5 seconds, it achieves an average code-size reduction of 13.9% compared to Oz, outperforming existing search-based techniques in the initial pass search space across five datasets.
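The clustering step can be sketched as follows (a toy; the pass names are real LLVM passes, but the pairings, benefit features, and top-2 coreset selection are invented for illustration): programs are represented by the measured benefit of each synergy pass pair, K-means groups similar programs, and each cluster's most beneficial pairs form its coreset.

```python
import numpy as np

rng = np.random.default_rng(1)
pairs = [("sroa", "instcombine"), ("gvn", "simplifycfg"), ("licm", "gvn")]
benefit = rng.random((50, len(pairs)))   # per-program benefit of each pair

def kmeans(X, k=3, iters=20):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels, centers

labels, centers = kmeans(benefit)
coresets = {c: [pairs[i] for i in np.argsort(centers[c])[::-1][:2]]
            for c in range(len(centers))}    # top-2 pairs per cluster
print(coresets)
```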
Article Search
Artifacts Available
Artifacts Reusable
Results Reproduced