CC 2025 – Author Index
Abu-Yosef, Raneem

Raneem Abu-Yosef and Martin Kong (Ohio State University, USA)
Distributed-memory programs that use the Message Passing Interface (MPI) often introduce various kinds of correctness anomalies. This work focuses on the type of anomalies detectable through data-flow modeling. We present a new tool and Domain-Specific Language to describe the data-flow of computations based on collective operations, such as the broadcast or all-gather in MPI. Our tool, CollectCall, models key aspects of distributed-memory algorithms, namely the processor space, symbolic communicators, data, its partitioning and mapping, and a set of communication primitives. Using these concepts, we build constraint systems that model the initial data placement and communication steps of the algorithm. Systems are built and solved with the Z3 SMT solver and the Integer Set Library (ISL) to decide the correctness of sequences of collective operations. We formalize the correctness requirements for a class of collective communication operations, and demonstrate the effectiveness of our approach on several micro-benchmarks and on well-known distributed algorithms from the literature, comparing against the state-of-the-art tools ITAC, MPI-Checker, and PSE.
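As a rough illustration of the constraint-based checking described above, the sketch below (not the CollectCall DSL itself; the rank count, root, and variable names are invented for the example) encodes the placement semantics of a single broadcast in Z3 and asks whether any execution violates a follow-up collective's input requirement.

```python
# Minimal sketch: model data placement before/after an MPI_Bcast as Boolean
# constraints and let Z3 search for a violating execution.
from z3 import Bool, Solver, And, Not, sat

RANKS = range(4)
ROOT = 0

# has_before[r] / has_after[r]: rank r holds the broadcast buffer before/after the call
has_before = {r: Bool(f"has_before_{r}") for r in RANKS}
has_after = {r: Bool(f"has_after_{r}") for r in RANKS}

s = Solver()
s.add(has_before[ROOT])                                    # initial placement: only the root owns the data
s.add(*[Not(has_before[r]) for r in RANKS if r != ROOT])
s.add(*[has_after[r] == has_before[ROOT] for r in RANKS])  # Bcast semantics

# requirement of a follow-up collective: every rank must now hold the data
requirement = And(*[has_after[r] for r in RANKS])
s.add(Not(requirement))                                    # search for a counterexample
print("sequence correct" if s.check() != sat else "possible anomaly")
```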
Ainsworth, Sam

Minli Liao, Sam Ainsworth, Lev Mukhanov, and Timothy M. Jones (University of Cambridge, UK; University of Edinburgh, UK; Queen Mary University of London, UK)
Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as particle strikes or voltage spikes, require redundant execution, where copies of each instruction in a program are executed twice and compared. The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve “near-zero silent data corruption” through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide set of false positives, false negatives, compiler bugs, and run-time crashes, which made it impossible to benchmark against. This document details the wide set of fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2× and 1.6× geomean overhead, respectively, relative to execution with no fault tolerance.
Amorim, Guilherme

Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil)
Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques for hunting bugs in compiler implementations is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of the Jasper® App, production formal verification software used in the Electronic Design Automation (EDA) industry. The two frameworks target distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs.
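For a flavour of grammar-directed random HDL generation in the spirit of the SystemVerilog Generator script, here is a toy sketch; it is not the Cadence tool, and the module shape and operators are arbitrary choices for the example.

```python
# Toy random SystemVerilog generator: emits a small combinational module whose
# source text can be fed to the compiler under test.
import random

def random_expr(wires, depth=2):
    """Build a random combinational expression over already-declared wires."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(wires)
    op = random.choice(["&", "|", "^"])
    return f"({random_expr(wires, depth - 1)} {op} {random_expr(wires, depth - 1)})"

def random_module(name="rand_mod", n_inputs=3, n_assigns=4):
    inputs = [f"i{k}" for k in range(n_inputs)]
    lines = [f"module {name}(input logic {', '.join(inputs)}, output logic o);"]
    wires = list(inputs)
    for k in range(n_assigns):
        wire = f"w{k}"
        lines.append(f"  logic {wire};")
        lines.append(f"  assign {wire} = {random_expr(wires)};")
        wires.append(wire)
    lines.append(f"  assign o = {wires[-1]};")
    lines.append("endmodule")
    return "\n".join(lines)

print(random_module())
```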
Baghdadi, Riyadh

Chunting Liu and Riyadh Baghdadi (NYU Abu Dhabi, United Arab Emirates)
Performance models are essential for automatic code optimization, enabling compilers to predict the effects of code transformations on performance and guide the search for optimal transformations. Building state-of-the-art performance models with deep learning, however, requires vast labeled datasets of random programs – an expensive and time-consuming process stretching over months. This paper introduces a self-supervised pre-training scheme with autoencoders to reduce the need for labeled data. By pre-training on a large dataset of random programs, the autoencoder learns representations of code and transformations, which are then used to embed programs for the performance model. Implemented in the Tiramisu autoscheduler, our approach improves model accuracy with less data. For example, to achieve a MAPE of 20.72%, the original model requires 18 million data points, whereas our method achieves a similar MAPE of 22.44% with only 3.6 million data points, reducing data requirements by 5×.
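The pre-training idea can be sketched in a few lines of PyTorch, assuming programs and transformations have already been lowered to fixed-size feature vectors; the sizes, data, and training loop below are placeholders rather than the Tiramisu model.

```python
# Sketch: self-supervised autoencoder pre-training, then reuse of the encoder
# to embed programs for a small labeled performance model.
import torch
import torch.nn as nn

FEAT = 256

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(FEAT, 64), nn.ReLU(), nn.Linear(64, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, FEAT))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# 1) pre-train on unlabeled random programs (reconstruction loss only)
ae = AutoEncoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
unlabeled = torch.randn(1024, FEAT)              # stand-in for program/transformation features
for _ in range(10):
    loss = nn.functional.mse_loss(ae(unlabeled), unlabeled)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 2) the encoder embeds programs for the (much smaller) labeled speedup model
speedup_head = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
with torch.no_grad():
    embeddings = ae.encoder(unlabeled[:8])
print(speedup_head(embeddings).shape)            # torch.Size([8, 1])
```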
Borin, Edson

Michael Canesche, Vanderson Martins do Rosario, Edson Borin, and Fernando Magno Quintão Pereira (Cadence Design Systems, Brazil; Cadence Design Systems, USA; State University of Campinas, Brazil; Federal University of Minas Gerais, Brazil)
Tensor compilers like XLA, TVM, and TensorRT operate on computational graphs, where vertices represent operations and edges represent data flow between these operations. Operator fusion is an optimization that merges operators to improve their efficiency. This paper presents the operator fusion algorithm recently deployed in the Xtensa Neural Network Compiler (XNNC), Cadence Tensilica's tensor compiler. The algorithm clusters nodes within the computational graph and iteratively grows these clusters until reaching a fixed point. A priority queue, sorted by the estimated profitability of merging cluster candidates, guides this iterative process. The algorithm balances precision and practicality, producing models 39% faster than XNNC's previous fusion approach, which was based on a depth-first traversal of the computational graph. Moreover, unlike recently proposed exhaustive or evolutionary search methods, it terminates quickly while often yielding equally efficient models.
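A compact sketch of the priority-driven cluster growing described above is given below; the union-find bookkeeping and the constant profitability estimate are illustrative stand-ins, not the XNNC implementation.

```python
# Sketch: grow fusion clusters by repeatedly merging the most profitable pair,
# until no merge with positive estimated gain remains (a fixed point).
import heapq

def fuse(graph, profit):
    """graph: {node: set of successors}; profit(a, b): estimated gain of merging."""
    cluster = {n: n for n in graph}                  # every node starts in its own cluster

    def find(n):
        while cluster[n] != n:
            n = cluster[n]
        return n

    heap = [(-profit(a, b), a, b) for a in graph for b in graph[a]]
    heapq.heapify(heap)                              # most profitable merge first (negated gains)
    while heap:
        gain, a, b = heapq.heappop(heap)
        if -gain <= 0 or find(a) == find(b):
            continue
        cluster[find(b)] = find(a)                   # merge the two clusters
        for succ in graph.get(b, ()):                # re-rank edges leaving the merged cluster
            heapq.heappush(heap, (-profit(a, succ), a, succ))
    return {n: find(n) for n in graph}

g = {"conv": {"relu"}, "relu": {"add"}, "add": set()}
print(fuse(g, profit=lambda a, b: 1.0))              # all three operators end up in one cluster
```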
Brauckmann, Alexander

Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA)
Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.
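The initialization/transfer/meet decomposition that DFA-Net mirrors is exactly the structure of a classical worklist solver; the sketch below spells it out for forward reachability on a toy CFG (the analysis choice and the lambdas are illustrative, not the paper's benchmark).

```python
# Classic worklist data flow analysis, factored into the three operations that
# DFA-Net models with separate neural networks: initialize, transfer, and meet.
def reachability(cfg, entry):
    """cfg: {block: [successors]}; returns the blocks reachable from entry."""
    preds = {b: set() for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)

    init = lambda b: b == entry            # initialization
    transfer = lambda fact: fact           # transfer (identity for reachability)
    meet = lambda facts: any(facts)        # meet (set union, here a logical "or")

    fact = {b: init(b) for b in cfg}
    worklist = list(cfg)
    while worklist:                        # propagate until a fixed point
        b = worklist.pop()
        new = init(b) or meet(transfer(fact[p]) for p in preds[b])
        if new != fact[b]:
            fact[b] = new
            worklist.extend(cfg[b])        # successors must be revisited
    return {b for b, reachable in fact.items() if reachable}

cfg = {"entry": ["a"], "a": ["b"], "b": ["a"], "dead": ["b"]}
print(reachability(cfg, "entry"))          # {'entry', 'a', 'b'}
```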
Canesche, Michael

Michael Canesche, Vanderson Martins do Rosario, Edson Borin, and Fernando Magno Quintão Pereira (Cadence Design Systems, Brazil; Cadence Design Systems, USA; State University of Campinas, Brazil; Federal University of Minas Gerais, Brazil)
Tensor compilers like XLA, TVM, and TensorRT operate on computational graphs, where vertices represent operations and edges represent data flow between these operations. Operator fusion is an optimization that merges operators to improve their efficiency. This paper presents the operator fusion algorithm recently deployed in the Xtensa Neural Network Compiler (XNNC), Cadence Tensilica's tensor compiler. The algorithm clusters nodes within the computational graph and iteratively grows these clusters until reaching a fixed point. A priority queue, sorted by the estimated profitability of merging cluster candidates, guides this iterative process. The algorithm balances precision and practicality, producing models 39% faster than XNNC's previous fusion approach, which was based on a depth-first traversal of the computational graph. Moreover, unlike recently proposed exhaustive or evolutionary search methods, it terminates quickly while often yielding equally efficient models.
Castrillon, Jeronimo

Anderson Faustino da Silva, Jeronimo Castrillon, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; TU Dresden, Germany; Federal University of Minas Gerais, Brazil)
Classifying programs based on their tasks is essential in fields such as plagiarism detection, malware analysis, and software auditing. Traditionally, two classification approaches exist: static classifiers analyze program syntax, while dynamic classifiers observe their execution. Although dynamic analysis is regarded as more precise, it is often considered impractical due to high overhead, leading the research community to largely dismiss it. In this paper, we revisit this perception by comparing static and dynamic analyses using the same classification representation: opcode histograms. We show that dynamic histograms---generated from instructions actually executed---are only marginally (4-5%) more accurate than static histograms in non-adversarial settings. However, if an adversary is allowed to obfuscate programs, the accuracy of the dynamic classifier is twice as high as that of the static one, due to its ability to avoid observing dead code. Obtaining dynamic histograms with a state-of-the-art Valgrind-based tool incurs an 85x slowdown; however, once we account for the time to produce the representations for static analysis of executables, the overall slowdown drops to 4x: a result significantly lower than previously reported in the literature.

Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA)
Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.
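To make the opcode-histogram representation from the first paper above concrete, here is a small sketch; the opcode streams are placeholders for a disassembly listing (static) or an execution trace (dynamic), and the nearest-neighbour classifier is only one possible choice.

```python
# Sketch: normalised opcode histograms plus cosine-similarity classification.
from collections import Counter
import math

def histogram(opcodes):
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}

def cosine(h1, h2):
    dot = sum(h1.get(k, 0.0) * h2.get(k, 0.0) for k in set(h1) | set(h2))
    norm = lambda h: math.sqrt(sum(v * v for v in h.values()))
    return dot / (norm(h1) * norm(h2))

def classify(sample, labelled):
    """labelled: {task_name: histogram}; return the most similar known task."""
    return max(labelled, key=lambda task: cosine(sample, labelled[task]))

static_hist = histogram(["add", "mul", "load", "store", "jmp", "add"])   # from disassembly
dynamic_hist = histogram(["add", "add", "mul", "load", "add", "load"])   # from an execution trace
print(classify(dynamic_hist, {"matmul": static_hist,
                              "sort": histogram(["cmp", "jmp", "mov"])}))
```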
Cavalini, Lucas

Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil)
Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques for hunting bugs in compiler implementations is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of the Jasper® App, production formal verification software used in the Electronic Design Automation (EDA) industry. The two frameworks target distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs.
Chen, Changbin

Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan)
Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in the source code can be directly compiled without translation. Leveraging source-level information as additional input, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite.
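The symbol-table idea can be pictured as follows: every function address named in the symbol table becomes a candidate indirect-jump target, and the translator bakes a dispatcher over those targets into its output instead of relying on a runtime system. The addresses and names below are invented for illustration, and Biotite emits LLVM IR rather than Python.

```python
# Sketch: generate a static dispatcher for indirect jumps from the symbol table.
symbol_table = {                       # guest address -> translated function
    0x1000: "translated_main",
    0x1040: "translated_qsort_cmp",
    0x10a0: "translated_handler",
}

def emit_dispatch_table(symbols):
    """Produce the dispatcher a static translator would emit for indirect jumps."""
    lines = ["def indirect_jump(target_addr):"]
    for addr, fn in sorted(symbols.items()):
        lines.append(f"    if target_addr == {hex(addr)}: return {fn}()")
    lines.append("    raise RuntimeError('untranslated indirect-jump target')")
    return "\n".join(lines)

print(emit_dispatch_table(symbol_table))
```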
Crepalde, Mirlaine

Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil)
Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques for hunting bugs in compiler implementations is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of the Jasper® App, production formal verification software used in the Electronic Design Automation (EDA) industry. The two frameworks target distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs.
Cummins, Chris

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France)
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimizing potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times.

Davide Italiano and Chris Cummins (Meta, USA)
Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less attention has been paid to ensuring that the compiler produces performant code. In this work, we adapt differential testing to the task of identifying missed optimization opportunities in compilers. We develop a novel testing approach which combines large language models (LLMs) with a series of differential testing strategies and use them to find missing code size optimizations in C / C++ compilers. The advantage of our approach is its simplicity. We offload the complex task of generating random code to an off-the-shelf LLM, and use heuristics and analyses to identify anomalous compiler behavior. Our approach requires fewer than 150 lines of code to implement. This simplicity makes it extensible. By simply changing the target compiler and initial LLM prompt, we port the approach from C / C++ to Rust and Swift, finding bugs in both. To date we have reported 24 confirmed bugs in production compilers, and conclude that LLM-assisted testing is a promising avenue for detecting optimization bugs in real-world compilers.
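The second paper's differential code-size test can be approximated in a few lines; the snippet stands in for LLM-generated code, the compiler pair and the 1.5× threshold are assumptions for the example, and real experiments would compare emitted code more carefully than whole object files.

```python
# Sketch: compile the same snippet with two compilers at -Os and flag a large
# size gap as a potential missed optimization (requires gcc and clang on PATH).
import os
import subprocess
import tempfile

SNIPPET = "int square(int x) { return x * x; }\n"   # stand-in for LLM-generated code

def object_size(compiler, source):
    with tempfile.TemporaryDirectory() as d:
        src, obj = os.path.join(d, "t.c"), os.path.join(d, "t.o")
        with open(src, "w") as f:
            f.write(source)
        subprocess.run([compiler, "-Os", "-c", src, "-o", obj], check=True)
        return os.path.getsize(obj)

sizes = {cc: object_size(cc, SNIPPET) for cc in ("gcc", "clang")}
smallest = min(sizes.values())
for cc, size in sizes.items():
    if size > 1.5 * smallest:                       # crude anomaly heuristic
        print(f"possible missed size optimization in {cc}: {size} vs {smallest} bytes")
```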
Do Rosario, Vanderson Martins

Michael Canesche, Vanderson Martins do Rosario, Edson Borin, and Fernando Magno Quintão Pereira (Cadence Design Systems, Brazil; Cadence Design Systems, USA; State University of Campinas, Brazil; Federal University of Minas Gerais, Brazil)
Tensor compilers like XLA, TVM, and TensorRT operate on computational graphs, where vertices represent operations and edges represent data flow between these operations. Operator fusion is an optimization that merges operators to improve their efficiency. This paper presents the operator fusion algorithm recently deployed in the Xtensa Neural Network Compiler (XNNC), Cadence Tensilica's tensor compiler. The algorithm clusters nodes within the computational graph and iteratively grows these clusters until reaching a fixed point. A priority queue, sorted by the estimated profitability of merging cluster candidates, guides this iterative process. The algorithm balances precision and practicality, producing models 39% faster than XNNC's previous fusion approach, which was based on a depth-first traversal of the computational graph. Moreover, unlike recently proposed exhaustive or evolutionary search methods, it terminates quickly while often yielding equally efficient models.
Faustino da Silva, Anderson

Anderson Faustino da Silva, Jeronimo Castrillon, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; TU Dresden, Germany; Federal University of Minas Gerais, Brazil)
Classifying programs based on their tasks is essential in fields such as plagiarism detection, malware analysis, and software auditing. Traditionally, two classification approaches exist: static classifiers analyze program syntax, while dynamic classifiers observe their execution. Although dynamic analysis is regarded as more precise, it is often considered impractical due to high overhead, leading the research community to largely dismiss it. In this paper, we revisit this perception by comparing static and dynamic analyses using the same classification representation: opcode histograms. We show that dynamic histograms---generated from instructions actually executed---are only marginally (4-5%) more accurate than static histograms in non-adversarial settings. However, if an adversary is allowed to obfuscate programs, the accuracy of the dynamic classifier is twice as high as that of the static one, due to its ability to avoid observing dead code. Obtaining dynamic histograms with a state-of-the-art Valgrind-based tool incurs an 85x slowdown; however, once we account for the time to produce the representations for static analysis of executables, the overall slowdown drops to 4x: a result significantly lower than previously reported in the literature.

Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA)
Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.
Gehring, Jonas

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France)
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimizing potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times.
Gorlatch, Sergei

Richard Schulze, Sergei Gorlatch, and Ari Rasch (University of Münster, Germany)
We introduce pyATF, a new, language-independent, open-source auto-tuning tool that fully automatically determines optimized values of performance-critical program parameters. A major feature of pyATF is its support for constrained parameters, e.g., where the value of one parameter has to divide the value of another parameter. A further major feature of pyATF is its user interface, which is designed with a particular focus on expressivity and usability for real-world demands, and which is offered in the increasingly popular Python programming language. We experimentally confirm the practicality of pyATF using real-world studies from the areas of quantum chemistry, image processing, data mining, and deep learning: we show that pyATF auto-tunes the complex parallel implementations of our studies to higher performance than achieved by state-of-practice approaches, including hand-optimized vendor libraries.
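To illustrate what a constrained parameter means in practice, the sketch below random-searches a space where the inner tile size must divide the outer tile size; it deliberately does not use the pyATF API, and the cost function is synthetic.

```python
# Sketch: random search over a constrained parameter space (inner divides outer).
import random

def run_kernel(outer_tile, inner_tile):
    """Placeholder standing in for compiling and timing a real kernel."""
    return abs(outer_tile - 64) + abs(inner_tile - 8)   # synthetic "runtime"

def sample_config():
    outer = random.choice([16, 32, 64, 128, 256])
    divisors = [d for d in range(1, outer + 1) if outer % d == 0]
    inner = random.choice(divisors)                     # constraint: inner divides outer
    return outer, inner

best = min((sample_config() for _ in range(200)), key=lambda cfg: run_kernel(*cfg))
print("best configuration found:", best)
```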
Grubisic, Dejan

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France)
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimizing potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times.
Hack, Sebastian

Marcel Ullrich, Sebastian Hack, and Roland Leißa (Saarland University, Germany; University of Mannheim, Germany)
Automatic Differentiation (AD) is at the core of all machine learning frameworks and has applications in scientific computing as well. Theoretical research on reverse-mode AD focuses on functional, higher-order languages, enabling AD to be formulated as a series of local, concise program rewrites. These theoretical approaches focus on correctness but disregard efficiency. Practical implementations, however, employ mutation and taping techniques to enhance efficiency, which necessitates intricate, low-level, and non-local program transformations. In this work, we introduce MimIrADe, a functionally inspired AD technique implemented within a higher-order, graph-based ("sea of nodes") intermediate representation (IR). Our method consists of a streamlined implementation and incorporates standard optimizations, resulting in an efficient AD system. The higher-order nature of the IR enables us to utilize concise functional AD methods, expressing AD through local rewrites. This locality facilitates modular high-level extensions, such as matrix operations, in a straightforward manner. Additionally, the graph-based structure of the IR ensures that critical implementation aspects---particularly the handling of shared pullback invocations---are managed naturally and efficiently. Our AD pass supports a comprehensive set of features, including non-scalar types, pointers, and higher-order recursive functions. We demonstrate through standard benchmarks that a suite of common optimizations effectively eliminates the overhead typically associated with functional AD approaches, producing differentiated code that performs on par with leading mutation and taping techniques. At the same time, MimIrADe's implementation is an order of magnitude less complex than its contenders.
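The "local rewrite" view of reverse-mode AD is easy to see in miniature: each primitive is rewritten into a value-plus-pullback pair, and the pullbacks compose locally. This is only the functional idea the paper builds on, not MimIrADe's IR-level implementation.

```python
# Sketch: functional reverse-mode AD where every primitive returns (value, pullback).
def add(x, y):
    def pullback(g):              # d(x+y): the adjoint flows unchanged to both inputs
        return g, g
    return x + y, pullback

def mul(x, y):
    def pullback(g):              # d(x*y)/dx = y, d(x*y)/dy = x
        return g * y, g * x
    return x * y, pullback

def f(x, y):
    """f(x, y) = (x + y) * x, written against the augmented primitives."""
    s, pb_add = add(x, y)
    p, pb_mul = mul(s, x)
    def pullback(g):
        g_s, g_x2 = pb_mul(g)     # pullbacks of intermediate results compose locally
        g_x1, g_y = pb_add(g_s)
        return g_x1 + g_x2, g_y
    return p, pullback

value, pb = f(3.0, 2.0)
print(value, pb(1.0))             # 15.0 (8.0, 3.0): df/dx = 2x + y, df/dy = x
```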
He, Jihong

Shaobai Yuan, Jihong He, Yihui Xie, Feng Wang, and Jie Zhao (Hunan University, China)
This paper introduces PLOS, a novel Post-Link Outlining approach designed to enhance code Size reduction for resource-constrained environments. Built on top of the post-link optimizer BOLT, PLOS maintains a holistic view of the whole-program structure and behavior, utilizing runtime information while preserving standard build system flows. The approach includes a granular outlining algorithm that matches and replaces repeated instruction sequences within and across modules and outlined functions, along with careful stack frame management to ensure correct function call handling. By integrating profiling information, PLOS balances the trade-off between code size and execution efficiency. The evaluation using eight MiBench benchmarks on an ARM-based Phytium FCT662 core demonstrates that PLOS achieves a mean code size reduction of 10.88% (up to 43.53%) and 6.61% (up to 14.78%) compared to LLVM’s and GCC's standard optimizations, respectively, 1.76% (up to 4.75%) over LLVM’s aggressive code size reduction optimizations, and 2.88% (up to 8.56%) over a link-time outliner. The experimental results also show that PLOS can achieve a favorable balance between code size reduction and performance regression.
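The core outlining step can be sketched as finding an instruction window that repeats across functions and replacing each occurrence with a call; the sketch below ignores the stack-frame management and profile guidance that PLOS adds, and the "instructions" are just strings.

```python
# Sketch: outline one repeated 3-instruction sequence into a shared function.
def find_repeated(funcs, window=3):
    """Return one instruction window occurring in more than one function, if any."""
    seen = {}
    for name, insns in funcs.items():
        for i in range(len(insns) - window + 1):
            seen.setdefault(tuple(insns[i:i + window]), set()).add(name)
    repeated = [k for k, owners in seen.items() if len(owners) > 1]
    return repeated[0] if repeated else None

def outline(funcs):
    target = find_repeated(funcs)
    if target is None:
        return funcs
    result = {"__outlined_0": list(target) + ["ret"]}
    for name, insns in funcs.items():
        out, i = [], 0
        while i < len(insns):
            if tuple(insns[i:i + len(target)]) == target:
                out.append("bl __outlined_0")        # call the outlined sequence instead
                i += len(target)
            else:
                out.append(insns[i])
                i += 1
        result[name] = out
    return result

prog = {"f": ["mov x0, #1", "add x0, x0, x1", "str x0, [sp]", "ret"],
        "g": ["mov x0, #1", "add x0, x0, x1", "str x0, [sp]", "mul x0, x0, x2", "ret"]}
for name, insns in outline(prog).items():
    print(name, insns)
```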
Irie, Hidetsugu

Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan)
Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in the source code can be directly compiled without translation. Leveraging source-level information as additional input, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite.
Italiano, Davide

Davide Italiano and Chris Cummins (Meta, USA)
Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less attention has been paid to ensuring that the compiler produces performant code. In this work, we adapt differential testing to the task of identifying missed optimization opportunities in compilers. We develop a novel testing approach which combines large language models (LLMs) with a series of differential testing strategies and use them to find missing code size optimizations in C / C++ compilers. The advantage of our approach is its simplicity. We offload the complex task of generating random code to an off-the-shelf LLM, and use heuristics and analyses to identify anomalous compiler behavior. Our approach requires fewer than 150 lines of code to implement. This simplicity makes it extensible. By simply changing the target compiler and initial LLM prompt, we port the approach from C / C++ to Rust and Swift, finding bugs in both. To date we have reported 24 confirmed bugs in production compilers, and conclude that LLM-assisted testing is a promising avenue for detecting optimization bugs in real-world compilers.
Jacob, Dejice

Duncan Lowther, Dejice Jacob, Jacob Trevor, and Jeremy Singer (University of Glasgow, UK)
The lean MicroPython runtime is a widely adopted high-level programming framework for embedded microcontroller systems. However, the existing MicroPython codebase has limited security features, rendering it a fundamentally insecure runtime environment. This is a critical problem, given the growing deployment of highly interconnected IoT systems on which society depends. Malicious actors seek to compromise such embedded infrastructure using sophisticated attack vectors. We have implemented a novel variant of MicroPython, adding support for the runtime security features provided by the CHERI RISC-V architecture as instantiated by the CHERIoT-RTOS system. Our new MicroPython port supports hardware-enabled spatial memory safety, mitigating a large set of common runtime memory attacks. We have also compartmentalized the MicroPython runtime, to prevent untrusted code from elevating its permissions and taking control of the entire system. We perform a multi-faceted evaluation of our work, involving a qualitative security-focused case study and a quantitative performance analysis. The case study explores the full set of five publicly reported MicroPython vulnerabilities (CVEs). We demonstrate that the enhanced security provided by CHERIoT MicroPython mitigates two heap buffer overflow CVEs. Our performance analysis shows a geometric mean runtime overhead of 48% for secure execution across a set of ten standard Python benchmarks, although we argue this is indicative of worst-case overhead on our prototype platform and that a realistic deployment overhead would be significantly lower. This work opens up a new, secure-by-design approach to IoT application development.
Jones, Timothy M.

Minli Liao, Sam Ainsworth, Lev Mukhanov, and Timothy M. Jones (University of Cambridge, UK; University of Edinburgh, UK; Queen Mary University of London, UK)
Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as particle strikes or voltage spikes, require redundant execution, where copies of each instruction in a program are executed twice and compared. The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve “near-zero silent data corruption” through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide set of false positives, false negatives, compiler bugs, and run-time crashes, which made it impossible to benchmark against. This document details the wide set of fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2× and 1.6× geomean overhead, respectively, relative to execution with no fault tolerance.
Kim, Eun Jung

Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA)
Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit not only from a deterministic but also a globally distinguishable value across runs, to simplify bookkeeping and guarantee complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incurs considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively.
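The "running call path count" idea resembles Ball-Larus path numbering applied to the static call graph: each call edge gets an increment such that the running sum along any call path from main is a unique, deterministic identifier. The sketch below shows such a numbering for an acyclic toy call graph; it is one reading of the abstract, not the authors' implementation.

```python
# Sketch: assign call-edge increments so every call path from the root gets a
# distinct integer, computed purely from the static call graph.
def num_paths(cg, node, memo):
    """Number of distinct call paths starting at node (acyclic call graph)."""
    if node not in memo:
        callees = cg.get(node, [])
        memo[node] = 1 if not callees else sum(num_paths(cg, c, memo) for c in callees)
    return memo[node]

def edge_increments(cg):
    memo, incs = {}, {}
    for caller, callees in cg.items():
        running = 0
        for callee in callees:
            incs[(caller, callee)] = running       # added when this call edge is taken
            running += num_paths(cg, callee, memo)
    return incs

call_graph = {"main": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
incs = edge_increments(call_graph)

def encode(path):                                  # e.g. ["main", "a", "c"]
    return sum(incs[(path[i], path[i + 1])] for i in range(len(path) - 1))

for p in (["main", "a", "c"], ["main", "b", "c"]):
    print(p, "->", encode(p))                      # distinct contexts get distinct IDs
```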
Kim, Sungkeun

Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA)
Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit not only from a deterministic but also a globally distinguishable value across runs, to simplify bookkeeping and guarantee complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incurs considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively.
Kong, Martin

Raneem Abu-Yosef and Martin Kong (Ohio State University, USA)
Distributed-memory programs that use the Message Passing Interface (MPI) often introduce various kinds of correctness anomalies. This work focuses on the type of anomalies detectable through data-flow modeling. We present a new tool and Domain-Specific Language to describe the data-flow of computations based on collective operations, such as the broadcast or all-gather in MPI. Our tool, CollectCall, models key aspects of distributed-memory algorithms, namely the processor space, symbolic communicators, data, its partitioning and mapping, and a set of communication primitives. Using these concepts, we build constraint systems that model the initial data placement and communication steps of the algorithm. Systems are built and solved with the Z3 SMT solver and the Integer Set Library (ISL) to decide the correctness of sequences of collective operations. We formalize the correctness requirements for a class of collective communication operations, and demonstrate the effectiveness of our approach on several micro-benchmarks and on well-known distributed algorithms from the literature, comparing against the state-of-the-art tools ITAC, MPI-Checker, and PSE.
Leather, Hugh

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France)
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimizing potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times.

Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA)
Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.
Lee, Jaewoo

Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA)
Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit not only from a deterministic but also a globally distinguishable value across runs, to simplify bookkeeping and guarantee complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incurs considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively.
Leißa, Roland

Marcel Ullrich, Sebastian Hack, and Roland Leißa (Saarland University, Germany; University of Mannheim, Germany)
Automatic Differentiation (AD) is at the core of all machine learning frameworks and has applications in scientific computing as well. Theoretical research on reverse-mode AD focuses on functional, higher-order languages, enabling AD to be formulated as a series of local, concise program rewrites. These theoretical approaches focus on correctness but disregard efficiency. Practical implementations, however, employ mutation and taping techniques to enhance efficiency, which necessitates intricate, low-level, and non-local program transformations. In this work, we introduce MimIrADe, a functionally inspired AD technique implemented within a higher-order, graph-based ("sea of nodes") intermediate representation (IR). Our method consists of a streamlined implementation and incorporates standard optimizations, resulting in an efficient AD system. The higher-order nature of the IR enables us to utilize concise functional AD methods, expressing AD through local rewrites. This locality facilitates modular high-level extensions, such as matrix operations, in a straightforward manner. Additionally, the graph-based structure of the IR ensures that critical implementation aspects---particularly the handling of shared pullback invocations---are managed naturally and efficiently. Our AD pass supports a comprehensive set of features, including non-scalar types, pointers, and higher-order recursive functions. We demonstrate through standard benchmarks that a suite of common optimizations effectively eliminates the overhead typically associated with functional AD approaches, producing differentiated code that performs on par with leading mutation and taping techniques. At the same time, MimIrADe's implementation is an order of magnitude less complex than its contenders.
Liao, Minli

Minli Liao, Sam Ainsworth, Lev Mukhanov, and Timothy M. Jones (University of Cambridge, UK; University of Edinburgh, UK; Queen Mary University of London, UK)
Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as particle strikes or voltage spikes, require redundant execution, where copies of each instruction in a program are executed twice and compared. The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve “near-zero silent data corruption” through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide set of false positives, false negatives, compiler bugs, and run-time crashes, which made it impossible to benchmark against. This document details the wide set of fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2× and 1.6× geomean overhead, respectively, relative to execution with no fault tolerance.
Liu, Chunting

Chunting Liu and Riyadh Baghdadi (NYU Abu Dhabi, United Arab Emirates)
Performance models are essential for automatic code optimization, enabling compilers to predict the effects of code transformations on performance and guide the search for optimal transformations. Building state-of-the-art performance models with deep learning, however, requires vast labeled datasets of random programs – an expensive and time-consuming process stretching over months. This paper introduces a self-supervised pre-training scheme with autoencoders to reduce the need for labeled data. By pre-training on a large dataset of random programs, the autoencoder learns representations of code and transformations, which are then used to embed programs for the performance model. Implemented in the Tiramisu autoscheduler, our approach improves model accuracy with less data. For example, to achieve a MAPE of 20.72%, the original model requires 18 million data points, whereas our method achieves a similar MAPE of 22.44% with only 3.6 million data points, reducing data requirements by 5×.
Lowther, Duncan

Duncan Lowther, Dejice Jacob, Jacob Trevor, and Jeremy Singer (University of Glasgow, UK)
The lean MicroPython runtime is a widely adopted high-level programming framework for embedded microcontroller systems. However, the existing MicroPython codebase has limited security features, rendering it a fundamentally insecure runtime environment. This is a critical problem, given the growing deployment of highly interconnected IoT systems on which society depends. Malicious actors seek to compromise such embedded infrastructure using sophisticated attack vectors. We have implemented a novel variant of MicroPython, adding support for the runtime security features provided by the CHERI RISC-V architecture as instantiated by the CHERIoT-RTOS system. Our new MicroPython port supports hardware-enabled spatial memory safety, mitigating a large set of common runtime memory attacks. We have also compartmentalized the MicroPython runtime, to prevent untrusted code from elevating its permissions and taking control of the entire system. We perform a multi-faceted evaluation of our work, involving a qualitative security-focused case study and a quantitative performance analysis. The case study explores the full set of five publicly reported MicroPython vulnerabilities (CVEs). We demonstrate that the enhanced security provided by CHERIoT MicroPython mitigates two heap buffer overflow CVEs. Our performance analysis shows a geometric mean runtime overhead of 48% for secure execution across a set of ten standard Python benchmarks, although we argue this is indicative of worst-case overhead on our prototype platform and that a realistic deployment overhead would be significantly lower. This work opens up a new, secure-by-design approach to IoT application development.
Madsen, Magnus

Joseph Tan and Magnus Madsen (Stanford University, USA; Aarhus University, Denmark)
The dot is a highly valued piece of syntactic real estate. Programming languages use the dot for a range of programming constructs: In object-oriented languages, the dot is used for field and method access on classes and objects. In imperative languages, the dot is often used for field access on structs. In functional languages, the dot is often used for field access on records. And finally, in some other languages, the dot is used to access the length of arrays and to index into tuples. The dot is minimal yet aesthetically pleasing. More importantly, the dot is the great enabler of auto-completion. Programmers with a specific entity in mind (e.g., an object, struct, or record) can use the dot to trigger auto-completion and be presented with a list of operations available on the entity. This simple feature improves the programming experience and makes it easier to learn new APIs. In this paper, we explore how to overload the dot to support various programming constructs within one language while supporting type inference. We present a principled approach to overloading based on type classes extended with associated types, associated effects, and compiler-generated type classes and instances. We implement the design in the Flix programming language and evaluate its impact on compiler performance and on the quality of error messages.
Mafra, Augusto

Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil)
Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques for hunting bugs in compiler implementations is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of the Jasper® App, production formal verification software used in the Electronic Design Automation (EDA) industry. The two frameworks target distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs.
Martins, Lucas

Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil)
Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques for hunting bugs in compiler implementations is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of the Jasper® App, production formal verification software used in the Electronic Design Automation (EDA) industry. The two frameworks target distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs.
Mukhanov, Lev |
Minli Liao, Sam Ainsworth, Lev Mukhanov, and Timothy M. Jones (University of Cambridge, UK; University of Edinburgh, UK; Queen Mary University of London, UK) Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as particle strikes or voltage spikes, require redundant execution, where each instruction in a program is executed twice and the results are compared. The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve “near-zero silent data corruption” through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide range of false positives, false negatives, compiler bugs, and run-time crashes, which made it impossible to benchmark against. This document details the wide set of fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2× and 1.6× geomean overhead, respectively, relative to execution with no fault tolerance. |
|
Muzahid, Abdullah |
Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA) Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit from a value that is not only deterministic but also globally distinguishable across runs, simplifying bookkeeping and guaranteeing complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incur considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively. |
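As a worked illustration of numbering contexts by running call-path counts (a sketch that assumes an acyclic static call graph and is not the paper's implementation; `encode` and `topo_order` are invented helpers), each call site receives an increment such that summing increments along a root-to-function call path yields a distinct, deterministic id.

```python
from collections import defaultdict

def topo_order(succs, root):
    """Reverse post-order DFS = a topological order of the (acyclic) call graph."""
    seen, order = set(), []
    def dfs(f):
        if f in seen:
            return
        seen.add(f)
        for g in succs.get(f, []):
            dfs(g)
        order.append(f)
    dfs(root)
    return list(reversed(order))

def encode(call_edges, root="main"):
    """call_edges: one (caller, callee) pair per call site.
    Returns per-call-site increments and, per function, its number of contexts."""
    succs = defaultdict(list)
    for caller, callee in call_edges:
        succs[caller].append(callee)

    paths = defaultdict(int)   # number of call paths from root reaching each function
    paths[root] = 1
    incr = {}                  # call-site index -> increment added on that call
    for f in topo_order(succs, root):
        for idx, (caller, callee) in enumerate(call_edges):
            if caller != f:
                continue
            incr[idx] = paths[callee]     # contexts already routed into callee
            paths[callee] += paths[f]
    return incr, dict(paths)

# main calls a twice (two call sites) and b once; a and b both call c.
edges = [("main", "a"), ("main", "a"), ("main", "b"), ("a", "c"), ("b", "c")]
incr, paths = encode(edges)
print(paths["c"])  # 3: three distinct calling contexts reach c
print(incr)        # a context id is the sum of increments along its call path
```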
|
Nabi, Syed Waqar |
Robert Szafarczyk, Syed Waqar Nabi, and Wim Vanderbauwhede (University of Glasgow, UK) Irregular codes are bottlenecked by memory and communication latency. Decoupled access/execute (DAE) is a common technique to tackle this problem. It relies on the compiler to separate memory address generation from the rest of the program; however, such a separation is not always possible due to control and data dependencies between the access and execute slices, resulting in a loss of decoupling. In this paper, we present compiler support for speculation in DAE architectures that preserves decoupling in the face of control dependencies. We speculate memory requests in the access slice and poison mis-speculations in the execute slice without the need for replays or synchronization. Our transformation works on arbitrary, reducible control flow and is proven to preserve sequential consistency. We show that our approach applies to a wide range of architectural work on CPU/GPU prefetchers, CGRAs, and accelerators, enabling DAE on a wider range of codes than before. |
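To illustrate the poison-based decoupling at a very high level (a conceptual sketch only, not the paper's compiler transformation; `access_slice`, `execute_slice`, and `POISON` are invented names), the access slice below issues loads ahead of a bound it has not yet resolved, and the execute slice simply drops the poisoned mis-speculations instead of replaying them.

```python
from collections import deque

POISON = object()   # sentinel standing in for a hardware poison tag

def access_slice(data, start, spec_depth, queue):
    """Issue loads speculatively, spec_depth elements ahead, without waiting to
    learn whether the consumer still needs them; invalid requests are poisoned."""
    for j in range(start, start + spec_depth):
        queue.append(data[j] if 0 <= j < len(data) else POISON)

def execute_slice(queue):
    """Consume values; poisoned (mis-speculated) entries are dropped, not replayed."""
    total = 0
    while queue:
        v = queue.popleft()
        if v is not POISON:
            total += v
    return total

data = [3, 1, 4, 1, 5]
q = deque()
access_slice(data, start=3, spec_depth=4, queue=q)   # indices 3..6; 5 and 6 are past the end
print(execute_slice(q))                              # 6 = data[3] + data[4]
```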
|
Nada, Yotaro |
Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan) Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in source code can be directly compiled without translation. Leveraging source-level information as additional input, in this paper, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite. |
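A minimal sketch of the symbol-table idea (not Biotite's actual translation scheme; the addresses, names, and `emit_indirect_call` helper are invented): every function entry known from the symbol table becomes a case in statically generated dispatch code, so a register-indirect jump needs no runtime translation cache.

```python
SYMBOL_TABLE = {          # address -> function symbol (assumed to come from the binary)
    0x1000: "main",
    0x1040: "helper",
    0x1080: "callback_a",
    0x10c0: "callback_b",
}

def emit_indirect_call(target_reg):
    """Emit C-like dispatch code for an indirect call through `target_reg`,
    enumerating the statically known entry points."""
    lines = [f"switch ({target_reg}) {{"]
    for addr, name in sorted(SYMBOL_TABLE.items()):
        lines.append(f"  case 0x{addr:x}: {name}(); break;")
    lines.append('  default: abort();  /* target not in the symbol table */')
    lines.append("}")
    return "\n".join(lines)

print(emit_indirect_call("x10"))
```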
|
Nguyen, Khanh |
Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA) Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit from a value that is not only deterministic but also globally distinguishable across runs, simplifying bookkeeping and guaranteeing complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incur considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively. |
|
O’Boyle, Michael F. P. |
Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA) Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly. |
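To ground the initialization/transfer/meet decomposition that DFA-Net mirrors with specialized networks, here is the classical worklist formulation it is modeled on (a generic sketch, not the neural architecture; `solve` and the toy CFG are invented), instantiated for reachability.

```python
def solve(cfg, entry, init, transfer, meet):
    """Generic forward data flow solver. cfg: node -> list of successor nodes."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)

    value = {n: init(n, n == entry) for n in cfg}     # initialization
    worklist = list(cfg)
    while worklist:
        n = worklist.pop()
        in_val = meet([value[p] for p in preds[n]], n == entry)   # meet over predecessors
        out_val = transfer(n, in_val)                             # transfer through the node
        if out_val != value[n]:
            value[n] = out_val
            worklist.extend(cfg[n])   # re-process successors until a fixed point
    return value

# Reachability from the entry: init marks only the entry, meet is logical-or,
# and the transfer function is the identity.
cfg = {"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}
reach = solve(
    cfg, "A",
    init=lambda n, is_entry: is_entry,
    transfer=lambda n, x: x,
    meet=lambda vals, is_entry: is_entry or any(vals),
)
print(reach)   # every node reachable from A maps to True
```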
|
Peixoto, Fabiano |
Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil) Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques to hunt bugs in a compiler implementation is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of Jasper® App, a production formal verification tool used in the Electronic Design Automation (EDA) industry. The two frameworks impact distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs. |
|
Pereira, Fernando Magno Quintão |
Anderson Faustino da Silva, Jeronimo Castrillon, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; TU Dresden, Germany; Federal University of Minas Gerais, Brazil) Classifying programs based on their tasks is essential in fields such as plagiarism detection, malware analysis, and software auditing. Traditionally, two classification approaches exist: static classifiers analyze program syntax, while dynamic classifiers observe their execution. Although dynamic analysis is regarded as more precise, it is often considered impractical due to high overhead, leading the research community to largely dismiss it. In this paper, we revisit this perception by comparing static and dynamic analyses using the same classification representation: opcode histograms. We show that dynamic histograms---generated from instructions actually executed---are only marginally (4-5%) more accurate than static histograms in non-adversarial settings. However, if an adversary is allowed to obfuscate programs, the accuracy of the dynamic classifier is twice that of the static one, due to its ability to avoid observing dead code. Obtaining dynamic histograms with a state-of-the-art Valgrind-based tool incurs an 85x slowdown; however, once we account for the time to produce the representations for static analysis of executables, the overall slowdown reduces to 4x: a result significantly lower than previously reported in the literature. |
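A minimal sketch of opcode-histogram classification (illustrative only; the opcode names and reference programs are invented): count opcode occurrences, normalize, and label a program by its most similar reference histogram. The same representation serves both static histograms (opcodes present in the binary) and dynamic ones (opcodes actually executed).

```python
from collections import Counter
import math

def histogram(opcodes):
    """Normalized opcode frequency histogram."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}

def cosine(h1, h2):
    dot = sum(h1.get(op, 0.0) * h2.get(op, 0.0) for op in set(h1) | set(h2))
    norm = (math.sqrt(sum(v * v for v in h1.values()))
            * math.sqrt(sum(v * v for v in h2.values())))
    return dot / norm if norm else 0.0

def classify(sample, references):
    """Label a histogram by its most similar reference histogram."""
    return max(references, key=lambda label: cosine(sample, references[label]))

references = {
    "crypto": histogram(["xor", "rol", "xor", "add", "xor"]),
    "matmul": histogram(["mul", "add", "load", "mul", "add", "store"]),
}
trace = ["mul", "add", "load", "mul", "store", "add"]  # dynamically executed opcodes
print(classify(histogram(trace), references))          # -> matmul
```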
|
Quintão Pereira, Fernando Magno |
Michael Canesche, Vanderson Martins do Rosario, Edson Borin, and Fernando Magno Quintão Pereira (Cadence Design Systems, Brazil; Cadence Design Systems, USA; State University of Campinas, Brazil; Federal University of Minas Gerais, Brazil) Tensor compilers like XLA, TVM, and TensorRT operate on computational graphs, where vertices represent operations and edges represent data flow between these operations. Operator fusion is an optimization that merges operators to improve their efficiency. This paper presents the operator fusion algorithm recently deployed in the Xtensa Neural Network Compiler (XNNC) - Cadence Tensilica's tensor compiler. The algorithm clusters nodes within the computational graph and iteratively grows these clusters until reaching a fixed point. A priority queue, sorted by the estimated profitability of merging cluster candidates, guides this iterative process. It balances precision and practicality, producing models 39% faster than XNNC's previous fusion approach, which was based on a depth-first traversal of the computational graph. Moreover, unlike recently proposed exhaustive or evolutionary search methods, this algorithm terminates quickly while often yielding equally efficient models. |
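As a sketch of the priority-queue-driven clustering style described above (a simplified illustration, not XNNC's algorithm; `fuse` and the toy `profit` function are invented), candidate merges are popped best-estimate-first, re-checked against the current clusters, and applied until no profitable merge remains.

```python
import heapq

def fuse(edges, profit):
    """edges: (producer, consumer) operator pairs; profit maps two clusters
    (sets of ops) to the estimated benefit of merging them."""
    parent, members = {}, {}
    for a, b in edges:
        for op in (a, b):
            parent.setdefault(op, op)
            members.setdefault(op, frozenset([op]))

    def find(x):                      # union-find root with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    heap = [(-profit(members[a], members[b]), a, b) for a, b in edges]
    heapq.heapify(heap)               # best estimated profit first
    while heap:
        _, a, b = heapq.heappop(heap)
        ra, rb = find(a), find(b)
        if ra == rb:
            continue                  # endpoints already fused
        if profit(members[ra], members[rb]) <= 0:
            continue                  # re-estimated on the grown clusters: no longer worth it
        parent[rb] = ra
        members[ra] = members[ra] | members[rb]
    return {find(op) for op in parent}, members

# Toy profit: merging is worthwhile only while the fused cluster stays below 3 ops.
edges = [("conv", "relu"), ("relu", "pool"), ("pool", "fc")]
roots, members = fuse(edges, lambda x, y: 3 - len(x | y))
print(sorted(sorted(members[r]) for r in roots))   # [['conv', 'relu'], ['fc', 'pool']]
```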
|
Rasch, Ari |
Richard Schulze, Sergei Gorlatch, and Ari Rasch (University of Münster, Germany) We introduce pyATF -- a new, language-independent, open-source auto-tuning tool that fully automatically determines optimized values of performance-critical program parameters. A major feature of pyATF is its support for constrained parameters, e.g., the value of one parameter has to divide the value of another parameter. A further major feature of pyATF is its user interface, which is designed with a particular focus on expressivity and usability for real-world demands, and which is offered in the increasingly popular Python programming language. We experimentally confirm the practicality of pyATF using real-world studies from the areas of quantum chemistry, image processing, data mining, and deep learning: we show that pyATF auto-tunes the complex parallel implementations of our studies to higher performance than achieved by state-of-practice approaches, including hand-optimized vendor libraries. |
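To illustrate tuning over a constrained parameter space in isolation (a generic sketch, not pyATF's API; `candidates`, `cost`, and `tune` are invented stand-ins), the search below enumerates only (tile, block) pairs where block divides tile and then random-searches that space under a time budget.

```python
import random
import time

def candidates(max_tile=64):
    """Enumerate (tile, block) pairs satisfying the constraint: block divides tile."""
    return [(tile, block)
            for tile in range(1, max_tile + 1)
            for block in range(1, tile + 1)
            if tile % block == 0]

def cost(tile, block):
    """Stand-in for compiling, running, and timing the program under test."""
    return abs(tile - 32) + abs(block - 8) + random.random()

def tune(budget_s=0.1):
    space = candidates()
    best, best_cost = None, float("inf")
    deadline = time.time() + budget_s
    while time.time() < deadline:
        tile, block = random.choice(space)
        c = cost(tile, block)
        if c < best_cost:
            best, best_cost = (tile, block), c
    return best, best_cost

print(tune())   # the divisibility constraint held for every candidate tried
```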
|
Roziere, Baptiste |
Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France) Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimising potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times. |
|
Sakai, Shuichi |
Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan) Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in source code can be directly compiled without translation. Leveraging source-level information as additional input, in this paper, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite. |
|
Santos, Pedro Henrique |
Mirlaine Crepalde, Augusto Mafra, Lucas Cavalini, Lucas Martins, Guilherme Amorim, Pedro Henrique Santos, and Fabiano Peixoto (Cadence Design Systems, Brazil) Random test case generation is a challenging subject in compiler testing. Due to the structured and strict nature of the languages required for compiler inputs, using randomization techniques to hunt bugs in a compiler implementation is a significant challenge that requires trading off correctness and generation biases against fuzzing techniques for broader exploratory randomization. This paper shares the technology and the practical industry experience behind two random testing frameworks developed for the Hardware Description Language (HDL) compiler of Jasper® App, a production formal verification tool used in the Electronic Design Automation (EDA) industry. The two frameworks impact distinct parts of the compiler stack and provide different features and strengths for randomization: the SystemVerilog Generator script, which creates random and formally provable HDL code, and Fuzz HDL Testing, a fuzzing solution applying LLVM’s libFuzzer to explore random textual inputs. |
|
Schulze, Richard |
Richard Schulze, Sergei Gorlatch, and Ari Rasch (University of Münster, Germany) We introduce pyATF -- a new, language-independent, open-source auto-tuning tool that fully automatically determines optimized values of performance-critical program parameters. A major feature of pyATF is its support for constrained parameters, e.g., the value of one parameter has to divide the value of another parameter. A further major feature of pyATF is its user interface, which is designed with a particular focus on expressivity and usability for real-world demands, and which is offered in the increasingly popular Python programming language. We experimentally confirm the practicality of pyATF using real-world studies from the areas of quantum chemistry, image processing, data mining, and deep learning: we show that pyATF auto-tunes the complex parallel implementations of our studies to higher performance than achieved by state-of-practice approaches, including hand-optimized vendor libraries. |
|
Seeker, Volker |
Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France) Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimising potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times. |
|
Shioya, Ryota |
Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan) Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in source code can be directly compiled without translation. Leveraging source-level information as additional input, in this paper, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite. |
|
Singer, Jeremy |
Duncan Lowther, Dejice Jacob, Jacob Trevor, and Jeremy Singer (University of Glasgow, UK) The lean MicroPython runtime is a widely adopted high-level programming framework for embedded microcontroller systems. However, the existing MicroPython codebase has limited security features, rendering it a fundamentally insecure runtime environment. This is a critical problem, given the growing deployment of highly interconnected IoT systems on which society depends. Malicious actors seek to compromise such embedded infrastructure, using sophisticated attack vectors. We have implemented a novel variant of MicroPython, adding support for runtime security features provided in the CHERI RISC-V architecture as instantiated by the CHERIoT-RTOS system. Our new MicroPython port supports hardware-enabled spatial memory safety, mitigating a large set of common runtime memory attacks. We have also compartmentalized the MicroPython runtime, to prevent untrusted code from elevating its permissions and taking control of the entire system. We perform a multi-faceted evaluation of our work, involving a qualitative security-focused case study and a quantitative performance analysis. The case study explores the full set of five publicly reported MicroPython vulnerabilities (CVEs). We demonstrate that the enhanced security provided by CHERIoT MicroPython mitigates two heap buffer overflow CVEs. Our performance analysis shows a geometric mean runtime overhead of 48% for secure execution across a set of ten standard Python benchmarks, although we argue this is indicative of worst-case overhead on our prototype platform and a realistic deployment overhead would be significantly lower. This work opens up a new, secure-by-design approach to IoT application development. |
|
Sugita, Shu |
Changbin Chen, Shu Sugita, Yotaro Nada, Hidetsugu Irie, Shuichi Sakai, and Ryota Shioya (University of Tokyo, Japan) Research on novel Instruction Set Architectures (ISAs) is actively pursued; however, it requires extensive effort to develop and maintain comprehensive compilation toolchains for each new ISA. Binary translation can provide a practical solution for ISA researchers to port target programs to novel ISAs when such a level of toolchain support is not available. However, to ensure the correct handling of indirect jumps, existing binary translators rely on complex runtime systems, whose implementation on primitive research ISAs demands significant effort. ISA researchers generally have access to additional source-level information, including the symbol table and source code, when using binary translators. The symbol table can provide potential jump targets for optimizing indirect jumps, and ISA-independent functions in source code can be directly compiled without translation. Leveraging source-level information as additional input, in this paper, we propose Biotite, a high-performance static binary translator that correctly handles arbitrary indirect jumps. Currently, Biotite supports the translation of RV64GC Linux binaries to self-contained LLVM IR. Our evaluation shows that Biotite successfully translates all benchmarks in SPEC CPU 2017 and achieves a 2.346× performance improvement over QEMU for the integer benchmark suite. |
|
Synnaeve, Gabriel |
Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather (Meta, USA; Meta, France) Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce LLM Compiler, a suite of robust, openly available, pre-trained models specifically designed for compiler tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models have been trained on a vast corpus of 546 billion tokens of LLVM-IR and assembly code and have undergone instruction fine-tuning to interpret compiler behavior. To demonstrate the utility of these research tools, we also present fine-tuned versions of the models with enhanced capabilities in optimizing code size and disassembling from x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimising potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. Our aim is to provide scalable, cost-effective foundational models for further research and development in compiler optimization by both academic researchers and industry practitioners. Since we released LLM Compiler, the community has quantized, repackaged, and downloaded the models over 250k times.
Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O’Boyle, Jeronimo Castrillon, and Hugh Leather (University of Edinburgh, UK; State University of Maringá, Brazil; Meta, France; TU Dresden, Germany; Meta, USA) Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly. |
|
Szafarczyk, Robert |
Robert Szafarczyk, Syed Waqar Nabi, and Wim Vanderbauwhede (University of Glasgow, UK) Irregular codes are bottlenecked by memory and communication latency. Decoupled access/execute (DAE) is a common technique to tackle this problem. It relies on the compiler to separate memory address generation from the rest of the program; however, such a separation is not always possible due to control and data dependencies between the access and execute slices, resulting in a loss of decoupling. In this paper, we present compiler support for speculation in DAE architectures that preserves decoupling in the face of control dependencies. We speculate memory requests in the access slice and poison mis-speculations in the execute slice without the need for replays or synchronization. Our transformation works on arbitrary, reducible control flow and is proven to preserve sequential consistency. We show that our approach applies to a wide range of architectural work on CPU/GPU prefetchers, CGRAs, and accelerators, enabling DAE on a wider range of codes than before. |
|
Tan, Joseph |
Joseph Tan and Magnus Madsen (Stanford University, USA; Aarhus University, Denmark) The dot is a highly valued piece of syntactic real estate. Programming languages use the dot for a range of programming constructs: In object-oriented languages, the dot is used for field and method access on classes and objects. In imperative languages, the dot is often used for field access on structs. In functional languages, the dot is often used for field access on records. And finally in some other languages, the dot is used to access the length of arrays and to index into tuples. The dot is minimal yet aesthetically pleasing. More importantly, the dot is the great enabler of auto-completion. Programmers with a specific entity in mind (e.g., an object, struct, or record) can use the dot to trigger auto-completion to be presented with a list of operations available on the entity. This simple feature improves the programming experience and makes it easier to learn new APIs. In this paper, we explore how to overload the dot to support various programming constructs within one language while supporting type inference. We present a principled approach to overloading based on type classes extended with associated types, associated effects, and compiler-generated type classes and instances. We implement the design in the Flix programming language and evaluate its impact on compiler performance and on the quality of error messages. |
|
Trevor, Jacob |
Duncan Lowther, Dejice Jacob, Jacob Trevor, and Jeremy Singer (University of Glasgow, UK) The lean MicroPython runtime is a widely adopted high-level programming framework for embedded microcontroller systems. However, the existing MicroPython codebase has limited security features, rendering it a fundamentally insecure runtime environment. This is a critical problem, given the growing deployment of highly interconnected IoT systems on which society depends. Malicious actors seek to compromise such embedded infrastructure, using sophisticated attack vectors. We have implemented a novel variant of MicroPython, adding support for runtime security features provided in the CHERI RISC-V architecture as instantiated by the CHERIoT-RTOS system. Our new MicroPython port supports hardware-enabled spatial memory safety, mitigating a large set of common runtime memory attacks. We have also compartmentalized the MicroPython runtime, to prevent untrusted code from elevating its permissions and taking control of the entire system. We perform a multi-faceted evaluation of our work, involving a qualitative security-focused case study and a quantitative performance analysis. The case study explores the full set of five publicly reported MicroPython vulnerabilities (CVEs). We demonstrate that the enhanced security provided by CHERIoT MicroPython mitigates two heap buffer overflow CVEs. Our performance analysis shows a geometric mean runtime overhead of 48% for secure execution across a set of ten standard Python benchmarks, although we argue this is indicative of worst-case overhead on our prototype platform and a realistic deployment overhead would be significantly lower. This work opens up a new, secure-by-design approach to IoT application development. |
|
Tsai, Chia-Che |
Sungkeun Kim, Khanh Nguyen, Chia-Che Tsai, Jaewoo Lee, Abdullah Muzahid, and Eun Jung Kim (Samsung Research, South Korea; Texas A&M University, USA) Calling context is crucial for improving the precision of program analyses in various use cases (clients), such as profiling, debugging, optimization, and security checking. Often the calling context is encoded using a numerical value. We have observed that many clients benefit from a value that is not only deterministic but also globally distinguishable across runs, simplifying bookkeeping and guaranteeing complete uniqueness. However, existing work only guarantees determinism, not global distinguishability. Clients need to develop auxiliary helpers, which incur considerable overhead, to distinguish encoded values among all calling contexts. In this paper, we propose Deterministic Distinguishable Calling Context Encoding, which enables both properties of calling context encoding natively. The key idea is to leverage the static call graph and encode each calling context as the running call path count. Thereby, a mapping is established statically and can be readily used by the clients. Our experiments with two client tools show that our encoding has overhead comparable to two state-of-the-art encoding schemes, PCCE and PCC, and further avoids the expensive overheads of collision detection, up to 2.1× and 50%, for Splash-3 and SPEC CPU 2017, respectively. |
|
Ullrich, Marcel |
Marcel Ullrich, Sebastian Hack, and Roland Leißa (Saarland University, Germany; University of Mannheim, Germany) Automatic Differentiation (AD) is at the core of all machine learning frameworks and has applications in scientific computing as well. Theoretical research on reverse-mode AD focuses on functional, higher-order languages, enabling AD to be formulated as a series of local, concise program rewrites. These theoretical approaches focus on correctness but disregard efficiency. Practical implementations, however, employ mutation and taping techniques to enhance efficiency; this approach necessitates intricate, low-level, and non-local program transformations. In this work, we introduce MimIrADe, a functionally inspired AD technique implemented within a higher-order, graph-based ("sea of nodes") intermediate representation (IR). Our method consists of a streamlined implementation and incorporates standard optimizations, resulting in an efficient AD system. The higher-order nature of the IR enables us to utilize concise functional AD methods, expressing AD through local rewrites. This locality facilitates modular high-level extensions, such as matrix operations, in a straightforward manner. Additionally, the graph-based structure of the IR ensures that critical implementation aspects---particularly the handling of shared pullback invocations---are managed naturally and efficiently. Our AD pass supports a comprehensive set of features, including non-scalar types, pointers, and higher-order recursive functions. We demonstrate through standard benchmarks that a suite of common optimizations effectively eliminates the overhead typically associated with functional AD approaches, producing differentiated code that performs on par with leading mutation and taping techniques. At the same time, MimIrADe's implementation is an order of magnitude less complex than its contenders. |
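A minimal functional sketch of the pullback style referenced above (illustrative only, not MimIrADe's IR-level rewrites; `add`, `mul`, and `f` are toy primitives): each primitive returns its value together with a pullback, and composing primitives composes pullbacks locally, with the shared use of x summing its cotangents.

```python
def add(x, y):
    # Value plus a pullback mapping an output cotangent to input cotangents.
    return x + y, lambda ct: (ct, ct)

def mul(x, y):
    return x * y, lambda ct: (ct * y, ct * x)

def f(x, y):
    """f(x, y) = (x * y) + x, written as a chain of primitives."""
    v1, pb1 = mul(x, y)
    v2, pb2 = add(v1, x)
    def pullback(ct):
        d_v1, d_x2 = pb2(ct)      # cotangents for the add's inputs
        d_x1, d_y = pb1(d_v1)     # cotangents for the mul's inputs
        return d_x1 + d_x2, d_y   # shared use of x: its cotangents are summed
    return v2, pullback

value, pullback = f(3.0, 4.0)
print(value)            # 15.0
print(pullback(1.0))    # (5.0, 3.0): df/dx = y + 1, df/dy = x
```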
|
Vanderbauwhede, Wim |
Robert Szafarczyk, Syed Waqar Nabi, and Wim Vanderbauwhede (University of Glasgow, UK) Irregular codes are bottlenecked by memory and communication latency. Decoupled access/execute (DAE) is a common technique to tackle this problem. It relies on the compiler to separate memory address generation from the rest of the program; however, such a separation is not always possible due to control and data dependencies between the access and execute slices, resulting in a loss of decoupling. In this paper, we present compiler support for speculation in DAE architectures that preserves decoupling in the face of control dependencies. We speculate memory requests in the access slice and poison mis-speculations in the execute slice without the need for replays or synchronization. Our transformation works on arbitrary, reducible control flow and is proven to preserve sequential consistency. We show that our approach applies to a wide range of architectural work on CPU/GPU prefetchers, CGRAs, and accelerators, enabling DAE on a wider range of codes than before. |
|
Wang, Feng |
Shaobai Yuan, Jihong He, Yihui Xie, Feng Wang, and Jie Zhao (Hunan University, China) This paper introduces PLOS, a novel Post-Link Outlining approach designed to enhance code Size reduction for resource-constrained environments. Built on top of the post-link optimizer BOLT, PLOS maintains a holistic view of the whole-program structure and behavior, utilizing runtime information while preserving standard build system flows. The approach includes a granular outlining algorithm that matches and replaces repeated instruction sequences within/across modules and outlined functions, along with careful stack frame management to ensure correct function call handling. By integrating profiling information, PLOS balances the trade-off between code size and execution efficiency. The evaluation using eight MiBench benchmarks on an ARM-based Phytium FCT662 core demonstrates that PLOS achieves a mean code size reduction of 10.88% (up to 43.53%) and 6.61% (up to 14.78%) compared to LLVM’s and GCC's standard optimizations, respectively, 1.76% (up to 4.75%) over LLVM’s aggressive code size reduction optimizations, and 2.88% (up to 8.56%) over a link-time outliner. The experimental results also show that PLOS can achieve a favorable balance between code size reduction and performance regression. |
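To make the outlining transformation concrete (a toy sketch, not PLOS; `outline` and the instruction strings are invented, and the stack-frame management and profitability checks the paper describes are omitted), a repeated instruction sequence shared by two functions is replaced by calls to a single outlined helper.

```python
def outline(functions, seq_len=3):
    """functions: name -> list of instruction strings.
    Returns the rewritten functions plus any outlined helpers."""
    sites = {}
    for name, insns in functions.items():
        for i in range(len(insns) - seq_len + 1):
            seq = tuple(insns[i:i + seq_len])
            sites.setdefault(seq, []).append((name, i))

    out = dict(functions)
    helpers = {}
    for seq, occurrences in sites.items():
        if len(occurrences) < 2:
            continue                          # only repeated sequences are worth outlining
        helper = f"outlined_{len(helpers)}"
        helpers[helper] = list(seq) + ["ret"]
        for name, i in occurrences:
            insns = out[name]
            if tuple(insns[i:i + seq_len]) == seq:   # occurrence still intact after earlier rewrites?
                out[name] = insns[:i] + [f"bl {helper}"] + insns[i + seq_len:]
        break                                 # one outlined sequence is enough for this sketch
    out.update(helpers)
    return out

funcs = {
    "f": ["mov x0, #1", "add x1, x1, x0", "str x1, [sp]", "ret"],
    "g": ["mov x0, #1", "add x1, x1, x0", "str x1, [sp]", "mov x2, #0", "ret"],
}
for name, insns in outline(funcs).items():
    print(name, insns)
```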
|
Xie, Yihui |
Shaobai Yuan, Jihong He, Yihui Xie, Feng Wang, and Jie Zhao (Hunan University, China) This paper introduces PLOS, a novel Post-Link Outlining approach designed to enhance code Size reduction for resource-constrained environments. Built on top of the post-link optimizer BOLT, PLOS maintains a holistic view of the whole-program structure and behavior, utilizing runtime information while preserving standard build system flows. The approach includes a granular outlining algorithm that matches and replaces repeated instruction sequences within/across modules and outlined functions, along with careful stack frame management to ensure correct function call handling. By integrating profiling information, PLOS balances the trade-off between code size and execution efficiency. The evaluation using eight MiBench benchmarks on an ARM-based Phytium FCT662 core demonstrates that PLOS achieves a mean code size reduction of 10.88% (up to 43.53%) and 6.61% (up to 14.78%) compared to LLVM’s and GCC's standard optimizations, respectively, 1.76% (up to 4.75%) over LLVM’s aggressive code size reduction optimizations, and 2.88% (up to 8.56%) over a link-time outliner. The experimental results also show that PLOS can achieve a favorable balance between code size reduction and performance regression. |
|
Yuan, Shaobai |
Shaobai Yuan, Jihong He, Yihui Xie, Feng Wang, and Jie Zhao (Hunan University, China) This paper introduces PLOS, a novel Post-Link Outlining approach designed to enhance code Size reduction for resource-constrained environments. Built on top of the post-link optimizer BOLT, PLOS maintains a holistic view of the whole-program structure and behavior, utilizing runtime information while preserving standard build system flows. The approach includes a granular outlining algorithm that matches and replaces repeated instruction sequences within/across modules and outlined functions, along with careful stack frame management to ensure correct function call handling. By integrating profiling information, PLOS balances the trade-off between code size and execution efficiency. The evaluation using eight MiBench benchmarks on an ARM-based Phytium FCT662 core demonstrates that PLOS achieves a mean code size reduction of 10.88% (up to 43.53%) and 6.61% (up to 14.78%) compared to LLVM’s and GCC's standard optimizations, respectively, 1.76% (up to 4.75%) over LLVM’s aggressive code size reduction optimizations, and 2.88% (up to 8.56%) over a link-time outliner. The experimental results also show that PLOS can achieve a favorable balance between code size reduction and performance regression. |
|
Zhao, Jie |
Shaobai Yuan, Jihong He, Yihui Xie, Feng Wang, and Jie Zhao (Hunan University, China) This paper introduces PLOS, a novel Post-Link Outlining approach designed to enhance code Size reduction for resource-constrained environments. Built on top of the post-link optimizer BOLT, PLOS maintains a holistic view of the whole-program structure and behavior, utilizing runtime information while preserving standard build system flows. The approach includes a granular outlining algorithm that matches and replaces repeated instruction sequences within/across modules and outlined functions, along with careful stack frame management to ensure correct function call handling. By integrating profiling information, PLOS balances the trade-off between code size and execution efficiency. The evaluation using eight MiBench benchmarks on an ARM-based Phytium FCT662 core demonstrates that PLOS achieves a mean code size reduction of 10.88% (up to 43.53%) and 6.61% (up to 14.78%) compared to LLVM’s and GCC's standard optimizations, respectively, 1.76% (up to 4.75%) over LLVM’s aggressive code size reduction optimizations, and 2.88% (up to 8.56%) over a link-time outliner. The experimental results also show that PLOS can achieve a favorable balance between code size reduction and performance regression. |
69 authors