CGO 2024 – Author Index |
Contents -
Abstracts -
Authors
|
A B C D E F G H I J K L M O P Q R S T V W X Y Z
Abellán, José L. |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Agostini, Nicolas Bohm |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Alves, Artur Cesar Araujo |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Amarasinghe, Saman |
CGO '24: "AskIt: Unified Programming ..."
AskIt: Unified Programming Interface for Programming with Large Language Models
Katsumi Okuda and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Mitsubishi Electric Corporation, Japan) Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks, from text summarization to code generation. While these abilities open up novel avenues in software design and crafting, their incorporation presents substantial challenges. Developers face decisions regarding the use of LLMs for directly performing tasks within applications as well as for generating and executing code to accomplish these tasks. Moreover, effective prompt design becomes a critical concern, given the necessity of extracting data from natural language outputs. To address these complexities, this paper introduces AskIt, a domain-specific language (DSL) specifically designed for LLMs. AskIt simplifies LLM integration by providing a unified interface that not only allows for direct task execution using LLMs but also supports the entire cycle of code generation and execution. This dual capability is achieved through (1) type-guided output control, (2) template-based function definitions, and (3) prompt generation for both usage modes. Our evaluations underscore AskIt's effectiveness. Across 50 tasks, AskIt generated concise prompts, achieving a 16.14 % reduction in prompt length compared to benchmarks. Additionally, by enabling a seamless transition between using LLMs directly in applications and for generating code, AskIt achieved significant efficiency improvements, as observed in our GSM8K benchmark experiments. The implementations of AskIt in TypeScript and Python are available at https://github.com/katsumiok/ts-askit and https://github.com/katsumiok/pyaskit, respectively. @InProceedings{CGO24p41, author = {Katsumi Okuda and Saman Amarasinghe}, title = {AskIt: Unified Programming Interface for Programming with Large Language Models}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {41--54}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Ancourt, Corinne |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Armengol-Estapé, Jordi |
CGO '24: "SLaDe: A Portable Small Language ..."
SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle (University of Edinburgh, United Kingdom; Meta AI Research, USA) Decompilation is a well-studied area with numerous high- quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. Nevertheless, to-date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence Transformer trained over real-world code and augmented with a type inference engine. We utilize a novel tokenizer, dropout-free regularization, and type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 ExeBench functions on two ISAs and at two optimization levels. SLaDe is up to 6× more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler and up to 4× more accurate than the large language model ChatGPT and generates significantly more readable code than both. @InProceedings{CGO24p67, author = {Jordi Armengol-Estapé and Jackson Woodruff and Chris Cummins and Michael F. P. O'Boyle}, title = {SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {67--80}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Bao, Guoqing |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Barollet, Théo |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Barthou, Denis |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Bastoul, Cédric |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Béchennec, Jean-Luc |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Bellavita, Julian |
CGO '24: "Tackling the Matrix Multiplication ..."
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, and Héctor Martínez (Universitat Politècnica de València, Spain; Cornell University, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA; Universidad de Córdoba, Spain) The optimization of the matrix multiplication (or gemm) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The gemm is usually implemented following the GotoBLAS philosophy, which tiles the gemm operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the exo compiler that perform close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions. @InProceedings{CGO24p182, author = {Adrián Castelló and Julian Bellavita and Grace Dinh and Yuka Ikarashi and Héctor Martínez}, title = {Tackling the Matrix Multiplication Micro-kernel Generation with Exo}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {182--192}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable |
|
Bora, Utpal |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Bouchez-Tichadou, Florent |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Briday, Mikaël |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Brodman, James |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Broquedis, François |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Burgstaller, Bernd |
CGO '24: "BEC: Bit-Level Static Analysis ..."
BEC: Bit-Level Static Analysis for Reliability against Soft Errors
Yousun Ko and Bernd Burgstaller (Yonsei University, South Korea) Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally, or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios, fault injection campaign pruning, and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04 % of exhaustive fault injection campaigns (13.71 % on average), without loss of accuracy. Program vulnerability was reduced by up to 13.11 % (4.94 % on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture. @InProceedings{CGO24p283, author = {Yousun Ko and Bernd Burgstaller}, title = {BEC: Bit-Level Static Analysis for Reliability against Soft Errors}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {283--295}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Campanoni, Simone |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cano, José |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cao, Liqing |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Carloni, Filippo |
CGO '24: "One Automaton to Rule Them ..."
One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution
Luisa Cicolini, Filippo Carloni, Marco D. Santambrogio, and Davide Conficconi (Politecnico di Milano, Italy) Regular Expressions (REs) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, resulting paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push the pattern matching performance when considering multiple REs together. Thus, exploiting REs morphological similarities within the same dataset allows memory reduction when storing the patterns and drastically improves the dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA) that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s). Furthermore, we extend iNFAnt algorithm for MFSAs execution with the novel iMFAnt engine. Our evaluation investigates the MFSA size-reduction impact and the execution throughput compared with the one of multiple FSA in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, best iMFAnt achieves a geomean 5.99× throughput improvement and 4.05× speedup against single and multiple parallel FSAs. @InProceedings{CGO24p193, author = {Luisa Cicolini and Filippo Carloni and Marco D. Santambrogio and Davide Conficconi}, title = {One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {193--206}, doi = {}, year = {2024}, } Published Artifact Video Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Castelló, Adrián |
CGO '24: "Tackling the Matrix Multiplication ..."
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, and Héctor Martínez (Universitat Politècnica de València, Spain; Cornell University, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA; Universidad de Córdoba, Spain) The optimization of the matrix multiplication (or gemm) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The gemm is usually implemented following the GotoBLAS philosophy, which tiles the gemm operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the exo compiler that perform close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions. @InProceedings{CGO24p182, author = {Adrián Castelló and Julian Bellavita and Grace Dinh and Yuka Ikarashi and Héctor Martínez}, title = {Tackling the Matrix Multiplication Micro-kernel Generation with Exo}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {182--192}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable |
|
Chabbi, Milind |
CGO '24: "EasyView: Bringing Performance ..."
EasyView: Bringing Performance Profiles into Integrated Development Environments
Qidong Zhao, Milind Chabbi, and Xu Liu (North Carolina State University, USA; Scalable Machines Research, USA) Dynamic program analysis (also known as profiling) is well-known for its powerful capabilities of identifying performance inefficiencies in software packages. Although a large number of dynamic program analysis techniques are developed in academia and industry, very few of them are widely used by software developers in their regular software developing activities. There are three major reasons. First, the dynamic analysis tools (also known as profilers) are disjoint from the coding environments such as IDEs and editors; frequently switching focus between them significantly complicates the entire cycle of software development. Second, mastering various tools to interpret their analysis results requires substantial efforts; even worse, many tools have their own design of graphical user interfaces (GUI) for data presentation, which steepens the learning curves. Third, most existing tools expose few interfaces to support user-defined analysis, which makes the tools less customizable to fulfill diverse user demands. We develop EasyView, a general solution to integrate the interpretation and visualization of various profiling results in the coding environments, which bridges software developers with profilers to provide easy and intuitive dynamic analysis during the code development cycle. The novelty of EasyView is three-fold. First, we develop a generic data format, which enables EasyView to support mainstream profilers for different languages. Second, we develop a set of customizable schemes to analyze and visualize the profiles in intuitive ways. Third, we tightly integrate EasyView with popular coding environments, such as Microsoft Visual Studio Code, with easy code exploration and user interaction. Our evaluation shows that EasyView is able to support various profilers for different languages and provide unique insights into performance inefficiencies in different domains. Our user studies show that EasyView can effectively help software developers facilitate their coding and debugging efforts. @InProceedings{CGO24p386, author = {Qidong Zhao and Milind Chabbi and Xu Liu}, title = {EasyView: Bringing Performance Profiles into Integrated Development Environments}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {386--398}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional Results Reproduced CGO '24: "Unveiling and Vanquishing ..." Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach Georgian-Vlad Saioc, Dmitriy Shirchenko, and Milind Chabbi (Aarhus University, Denmark; Uber Technologies, Denmark; Uber Technologies, USA) Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go with lightweight “goroutines” as the building blocks of concurrent execution. Go advocates message-passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in “partial deadlock” (interchangeably called a goroutine leak), a subtle concurrency bug where a blocked sender (receiver) never finds a corresponding receiver (sender), causing the blocked goroutine to leak memory, via its call stack and objects reachable from the stack. In this paper, we systematically study the prevalence of message passing and the resulting partial deadlocks in ≈75 Million lines of Uber’s Go monorepo hosting ≈2500 microservices. We develop two lightweight, dynamic analysis tools: GOLEAK and LEAKPROF, designed to identify partial deadlocks. GOLEAK detects partial deadlocks during unit testing and prevents the introduction of new bugs. Conversely, LEAKPROF uses goroutine profiles obtained from services deployed in production to pinpoint intricate bugs arising from complex control flow, unexplored interleavings, or the absence of test coverage. We share our experience and insights deploying these tools in developer workflows in a large industrial setting. Using GOLEAK we unearthed 857 pre-existing goroutine leaks in the legacy code and prevented the introduction of ≈260 new leaks over one year period. Using LEAKPROF we found 24 and fixed 21 goroutine leaks, which resulted in up to 34% speedup and 9.2× memory reduction in some of our production services. @InProceedings{CGO24p411, author = {Georgian-Vlad Saioc and Dmitriy Shirchenko and Milind Chabbi}, title = {Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {411--422}, doi = {}, year = {2024}, } Video |
|
Chadwick, Alex W. |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Chen, Ciyong |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cheng, Xianhang |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cicolini, Luisa |
CGO '24: "One Automaton to Rule Them ..."
One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution
Luisa Cicolini, Filippo Carloni, Marco D. Santambrogio, and Davide Conficconi (Politecnico di Milano, Italy) Regular Expressions (REs) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, resulting paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push the pattern matching performance when considering multiple REs together. Thus, exploiting REs morphological similarities within the same dataset allows memory reduction when storing the patterns and drastically improves the dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA) that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s). Furthermore, we extend iNFAnt algorithm for MFSAs execution with the novel iMFAnt engine. Our evaluation investigates the MFSA size-reduction impact and the execution throughput compared with the one of multiple FSA in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, best iMFAnt achieves a geomean 5.99× throughput improvement and 4.05× speedup against single and multiple parallel FSAs. @InProceedings{CGO24p193, author = {Luisa Cicolini and Filippo Carloni and Marco D. Santambrogio and Davide Conficconi}, title = {One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {193--206}, doi = {}, year = {2024}, } Published Artifact Video Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Cole, Murray |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } |
|
Conficconi, Davide |
CGO '24: "One Automaton to Rule Them ..."
One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution
Luisa Cicolini, Filippo Carloni, Marco D. Santambrogio, and Davide Conficconi (Politecnico di Milano, Italy) Regular Expressions (REs) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, resulting paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push the pattern matching performance when considering multiple REs together. Thus, exploiting REs morphological similarities within the same dataset allows memory reduction when storing the patterns and drastically improves the dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA) that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s). Furthermore, we extend iNFAnt algorithm for MFSAs execution with the novel iMFAnt engine. Our evaluation investigates the MFSA size-reduction impact and the execution throughput compared with the one of multiple FSA in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, best iMFAnt achieves a geomean 5.99× throughput improvement and 4.05× speedup against single and multiple parallel FSAs. @InProceedings{CGO24p193, author = {Luisa Cicolini and Filippo Carloni and Marco D. Santambrogio and Davide Conficconi}, title = {One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {193--206}, doi = {}, year = {2024}, } Published Artifact Video Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Consolaro, Gianpietro |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cui, Jingze |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Cui, Jinku |
CGO '24: "DrPy: Pinpointing Inefficient ..."
DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications
Jinku Cui, Qidong Zhao, Yueming Hao, and Xu Liu (North Carolina State University, USA) Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by the application developers and committed to the application repositories. @InProceedings{CGO24p245, author = {Jinku Cui and Qidong Zhao and Yueming Hao and Xu Liu}, title = {DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {245--257}, doi = {}, year = {2024}, } Artifacts Functional |
|
Cummins, Chris |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } CGO '24: "SLaDe: A Portable Small Language ..." SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle (University of Edinburgh, United Kingdom; Meta AI Research, USA) Decompilation is a well-studied area with numerous high- quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. Nevertheless, to-date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence Transformer trained over real-world code and augmented with a type inference engine. We utilize a novel tokenizer, dropout-free regularization, and type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 ExeBench functions on two ISAs and at two optimization levels. SLaDe is up to 6× more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler and up to 4× more accurate than the large language model ChatGPT and generates significantly more readable code than both. @InProceedings{CGO24p67, author = {Jordi Armengol-Estapé and Jackson Woodruff and Chris Cummins and Michael F. P. O'Boyle}, title = {SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {67--80}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Dehnavi, Maryam Mehri |
CGO '24: "A Framework for Fine-Grained ..."
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, and Olli Saarikivi (Microsoft Research, USA; University of Toronto, Canada) Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes. @InProceedings{CGO24p93, author = {Abhinav Jangda and Saeed Maleki and Maryam Mehri Dehnavi and Madan Musuvathi and Olli Saarikivi}, title = {A Framework for Fine-Grained Synchronization of Dependent GPU Kernels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {93--105}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Dinh, Grace |
CGO '24: "Tackling the Matrix Multiplication ..."
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, and Héctor Martínez (Universitat Politècnica de València, Spain; Cornell University, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA; Universidad de Córdoba, Spain) The optimization of the matrix multiplication (or gemm) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The gemm is usually implemented following the GotoBLAS philosophy, which tiles the gemm operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the exo compiler that perform close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions. @InProceedings{CGO24p182, author = {Adrián Castelló and Julian Bellavita and Grace Dinh and Yuka Ikarashi and Héctor Martínez}, title = {Tackling the Matrix Multiplication Micro-kernel Generation with Exo}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {182--192}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable |
|
Domke, Jens |
CGO '24: "Retargeting and Respecializing ..."
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses (Tokyo Institute of Technology, Japan; RIKEN R-CCS, Japan; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA) In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher-performance and lower costs have led to a significant diversification of architecture designs across, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using more newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs performing automatic translation from CUDA and simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 16% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. @InProceedings{CGO24p119, author = {Ivan R. Ivanov and Oleksandr Zinenko and Jens Domke and Toshio Endo and William S. Moses}, title = {Retargeting and Respecializing GPU Workloads for Performance Portability}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {119--132}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Dong, Rongchao |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Du, Longsheng |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Dubach, Christophe |
CGO '24: "Latent Idiom Recognition for ..."
Latent Idiom Recognition for a Minimalist Functional Array Language using Equality Saturation
Jonathan Van der Cruysse and Christophe Dubach (McGill University, Canada) Accelerating programs is typically done by recognizing code idioms matching high-performance libraries or hardware interfaces. However, recognizing such idioms automatically is challenging. The idiom recognition machinery is difficult to write and requires expert knowledge. In addition, slight variations in the input program might hide the idiom and defeat the recognizer. This paper advocates for the use of a minimalist functional array language supporting a small, but expressive, set of operators. The minimalist design leads to a tiny sets of rewrite rules, which encode the language semantics. Crucially, the same minimalist language is also used to encode idioms. This removes the need for hand-crafted analysis passes, or for having to learn a complex domain-specific language to define the idioms. Coupled with equality saturation, this approach is able to match the core functions from the BLAS and PyTorch libraries on a set of computational kernels. Compared to reference C kernel implementations, the approach produces a geometric mean speedup of 1.46$× for C programs using BLAS, when generating such programs from the high-level minimalist language. @InProceedings{CGO24p270, author = {Jonathan Van der Cruysse and Christophe Dubach}, title = {Latent Idiom Recognition for a Minimalist Functional Array Language using Equality Saturation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {270--282}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Endo, Toshio |
CGO '24: "Retargeting and Respecializing ..."
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses (Tokyo Institute of Technology, Japan; RIKEN R-CCS, Japan; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA) In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher-performance and lower costs have led to a significant diversification of architecture designs across, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using more newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs performing automatic translation from CUDA and simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 16% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. @InProceedings{CGO24p119, author = {Ivan R. Ivanov and Oleksandr Zinenko and Jens Domke and Toshio Endo and William S. Moses}, title = {Retargeting and Respecializing GPU Workloads for Performance Portability}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {119--132}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Engelke, Alexis |
CGO '24: "Compile-Time Analysis of Compiler ..."
Compile-Time Analysis of Compiler Frameworks for Query Compilation
Alexis Engelke and Tobias Schwarz (TU Munich, Germany) Low compilation times are highly important in contexts of Just-in-time compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary measure for achieving high throughput in combination with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuning for low compilation time. Cranelift achieves a similar run-time performance to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance. @InProceedings{CGO24p233, author = {Alexis Engelke and Tobias Schwarz}, title = {Compile-Time Analysis of Compiler Frameworks for Query Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {233--244}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Erdős, Márton |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Fan, Yuyou |
CGO '24: "High-Throughput, Formal-Methods-Assisted ..."
High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM
Yuyou Fan and John Regehr (University of Utah, USA) It is very difficult to thoroughly test a compiler, and as a consequence it is common for released versions of production compilers to contain bugs that cause them to crash and to emit incorrect object code. We created alive-mutate, a mutation-based fuzzing tool that takes test cases written by humans and randomly modifies them, based on the hypothesis that while compiler developers are fundamentally good at writing tests, they also tend to miss corner cases. Alive-mutate is integrated with the Alive2 translation validation tool for LLVM, which is useful because it checks the behavior of optimizations for all possible values of input variables. Alive-mutate is also integrated with the LLVM middle-end, allowing it to perform mutations, optimizations, and formal verification of the optimizations all within a single program---avoiding numerous sources of overhead. Alive-mutate's fuzzing throughput is 12x higher, on average, than a fuzzing workflow that runs mutation, optimization, and formal verification in separate processes. So far we have used alive-mutate to find and report 33 previously unknown bugs in LLVM. @InProceedings{CGO24p349, author = {Yuyou Fan and John Regehr}, title = {High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {349--358}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available |
|
Faucou, Sébastien |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Franke, Björn |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } |
|
Fu, Hao |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Fu, Qiang |
CGO '24: "JITSPMM: Just-in-Time Instruction ..."
JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
Qiang Fu, Thomas B. Rolinger, and H. Howie Huang (Advanced Micro Devices, USA; NVIDIA, USA; George Washington University, USA) Achieving high performance for Sparse MatrixMatrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the aheadof-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerated SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimizing memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8× and 1.4×, respectively. @InProceedings{CGO24p448, author = {Qiang Fu and Thomas B. Rolinger and H. Howie Huang}, title = {JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {448--459}, doi = {}, year = {2024}, } |
|
Gabrielli, Giacomo |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Gao, Lin |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Geeson, Luke |
CGO '24: "Compiler Testing with Relaxed ..."
Compiler Testing with Relaxed Memory Models
Luke Geeson and Lee Smith (University College London, United Kingdom; Arm, United Kingdom) Finding bugs is key to the correctness of compilers in wide use today. If the behaviour of a compiled program, as allowed by its architecture memory model, is not a behaviour of the source program under its source model, then there is a bug. This holds for all programs, but we focus on concurrency bugs that occur only with two or more threads of execution. We focus on testing techniques that detect such bugs in C/C++ compilers. We seek a testing technique that automatically covers concurrency bugs up to fixed bounds on program sizes and that scales to find bugs in compiled programs with many lines of code. Otherwise, a testing technique can miss bugs. Unfortunately, the state-of-the-art techniques are yet to satisfy all of these properties. We present the Téléchat compiler testing tool for concurrent programs. Téléchat compiles a concurrent C/C++ program and compares source and compiled program behaviours using source and architecture memory models. We make three claims: Téléchat improves the state-of-the-art at finding bugs in code generation for multi-threaded execution, it is the first public description of a compiler testing tool for concurrency that is deployed in industry, and it is the first tool that takes a significant step towards the desired properties. We provide experimental evidence suggesting Téléchat finds bugs missed by other state-of-the-art techniques, case studies indicating that Téléchat satisfies the properties, and reports of our experience deploying Téléchat in industry regression testing. @InProceedings{CGO24p334, author = {Luke Geeson and Lee Smith}, title = {Compiler Testing with Relaxed Memory Models}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {334--348}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Georgakoudis, Giorgis |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Gibson, Perry |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Goli, Mehdi |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Gordon, Vahl Scott |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Greiner, Nathan |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Guan, Xiaofeng |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Guillon, Christophe |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Guo, Yuxin |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Hakimi, Milad |
CGO '24: "TapeFlow: Streaming Gradient ..."
TapeFlow: Streaming Gradient Tapes in Automatic Differentiation
Milad Hakimi and Arrvindh Shriraman (Simon Fraser University, Canada) Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse mode AD, an auxiliary structure, the tape, is used to transfer intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy since it has a high reuse distance, lacks temporal locality, and inflates working set by 2 — 4×. We introduce Tapeflow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad. This reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from the DRAM by organizing it into a FIFO of tiles. The tape operands arrive just-in-time for each layer. Tapeflow, running on the same hardware, outperforms Enzyme, the state-of-the-art compiler, by 1.3—2.5×, reduces on-chip SRAM usage by 5— 40×, and saves 8× on-chip energy. We demonstrate Tapeflow on a wide range of algorithms written in general-purpose language. Index Terms—Automatic differentiation, Gradients, Streaming Algorithms, Back propagation @InProceedings{CGO24p81, author = {Milad Hakimi and Arrvindh Shriraman}, title = {TapeFlow: Streaming Gradient Tapes in Automatic Differentiation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {81--92}, doi = {}, year = {2024}, } Video |
|
Han, Ruobing |
CGO '24: "Enabling Fine-Grained Incremental ..."
Enabling Fine-Grained Incremental Builds by Making Compiler Stateful
Ruobing Han, Jisheng Zhao, and Hyesoon Kim (Georgia Institute of Technology, USA) Incremental builds are commonly employed in software development, involving minor changes to existing source code that is then frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds primarily rely on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do undergo changes, these build systems simply invoke compilers to recompile them from scratch. This approach reveals a fundamental asymmetry in the system: while build systems operate in a stateful manner, compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, focusing on entire source files, rather than at a more fine-grained level that considers individual code sections. In this paper, we propose an innovative approach for enabling the fine-grained incremental build by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation process of modified source files, thereby reducing overall build time. Specifically, the stateful compiler retains dormant information of compiler passes executed in previous builds and uses this data to bypass dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%. @InProceedings{CGO24p221, author = {Ruobing Han and Jisheng Zhao and Hyesoon Kim}, title = {Enabling Fine-Grained Incremental Builds by Making Compiler Stateful}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {221--232}, doi = {}, year = {2024}, } |
|
Hao, Yueming |
CGO '24: "DrPy: Pinpointing Inefficient ..."
DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications
Jinku Cui, Qidong Zhao, Yueming Hao, and Xu Liu (North Carolina State University, USA) Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by the application developers and committed to the application repositories. @InProceedings{CGO24p245, author = {Jinku Cui and Qidong Zhao and Yueming Hao and Xu Liu}, title = {DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {245--257}, doi = {}, year = {2024}, } Artifacts Functional |
|
Haris, Jude |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Hashemian, Shideh |
CGO '24: "A Tensor Algebra Compiler ..."
A Tensor Algebra Compiler for Sparse Differentiation
Amir Shaikhha, Mathieu Huot, and Shideh Hashemian (University of Edinburgh, United Kingdom; University of Oxford, United Kingdom) Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets. @InProceedings{CGO24p1, author = {Amir Shaikhha and Mathieu Huot and Shideh Hashemian}, title = {A Tensor Algebra Compiler for Sparse Differentiation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {1--12}, doi = {}, year = {2024}, } |
|
Hazelwood, Kim |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } |
|
He, Wenlei |
CGO '24: "Revamping Sampling-Based PGO ..."
Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation
Wenlei He, Hongtao Yu, Lei Wang, and Taewook Oh (Meta, USA) The ever increasing scale of modern data center demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and its environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique to improve application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but the performance gains are not as substantial as the instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. @InProceedings{CGO24p322, author = {Wenlei He and Hongtao Yu and Lei Wang and Taewook Oh}, title = {Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {322--333}, doi = {}, year = {2024}, } |
|
Huang, H. Howie |
CGO '24: "JITSPMM: Just-in-Time Instruction ..."
JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
Qiang Fu, Thomas B. Rolinger, and H. Howie Huang (Advanced Micro Devices, USA; NVIDIA, USA; George Washington University, USA) Achieving high performance for Sparse MatrixMatrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the aheadof-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerated SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimizing memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8× and 1.4×, respectively. @InProceedings{CGO24p448, author = {Qiang Fu and Thomas B. Rolinger and H. Howie Huang}, title = {JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {448--459}, doi = {}, year = {2024}, } |
|
Huot, Mathieu |
CGO '24: "A Tensor Algebra Compiler ..."
A Tensor Algebra Compiler for Sparse Differentiation
Amir Shaikhha, Mathieu Huot, and Shideh Hashemian (University of Edinburgh, United Kingdom; University of Oxford, United Kingdom) Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets. @InProceedings{CGO24p1, author = {Amir Shaikhha and Mathieu Huot and Shideh Hashemian}, title = {A Tensor Algebra Compiler for Sparse Differentiation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {1--12}, doi = {}, year = {2024}, } |
|
Hutton, Josh |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Ikarashi, Yuka |
CGO '24: "Tackling the Matrix Multiplication ..."
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, and Héctor Martínez (Universitat Politècnica de València, Spain; Cornell University, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA; Universidad de Córdoba, Spain) The optimization of the matrix multiplication (or gemm) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The gemm is usually implemented following the GotoBLAS philosophy, which tiles the gemm operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the exo compiler that perform close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions. @InProceedings{CGO24p182, author = {Adrián Castelló and Julian Bellavita and Grace Dinh and Yuka Ikarashi and Héctor Martínez}, title = {Tackling the Matrix Multiplication Micro-kernel Generation with Exo}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {182--192}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable |
|
Ivanov, Ivan R. |
CGO '24: "Retargeting and Respecializing ..."
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses (Tokyo Institute of Technology, Japan; RIKEN R-CCS, Japan; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA) In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher-performance and lower costs have led to a significant diversification of architecture designs across, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using more newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs performing automatic translation from CUDA and simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 16% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. @InProceedings{CGO24p119, author = {Ivan R. Ivanov and Oleksandr Zinenko and Jens Domke and Toshio Endo and William S. Moses}, title = {Retargeting and Respecializing GPU Workloads for Performance Portability}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {119--132}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Jangda, Abhinav |
CGO '24: "A Framework for Fine-Grained ..."
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, and Olli Saarikivi (Microsoft Research, USA; University of Toronto, Canada) Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes. @InProceedings{CGO24p93, author = {Abhinav Jangda and Saeed Maleki and Maryam Mehri Dehnavi and Madan Musuvathi and Olli Saarikivi}, title = {A Framework for Fine-Grained Synchronization of Dependent GPU Kernels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {93--105}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Jayaweera, Malith |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced CGO '24: "Energy-Aware Tile Size Selection ..." Energy-Aware Tile Size Selection for Affine Programs on GPUs Malith Jayaweera, Martin Kong, Yanzhi Wang, and David Kaeli (Northeastern University, USA; Ohio State University, USA) Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels. @InProceedings{CGO24p13, author = {Malith Jayaweera and Martin Kong and Yanzhi Wang and David Kaeli}, title = {Energy-Aware Tile Size Selection for Affine Programs on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {13--27}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Jiang, Jinhu |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Jin, Baihui |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Jones, Timothy M. |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Kaeli, David |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced CGO '24: "Energy-Aware Tile Size Selection ..." Energy-Aware Tile Size Selection for Affine Programs on GPUs Malith Jayaweera, Martin Kong, Yanzhi Wang, and David Kaeli (Northeastern University, USA; Ohio State University, USA) Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels. @InProceedings{CGO24p13, author = {Malith Jayaweera and Martin Kong and Yanzhi Wang and David Kaeli}, title = {Energy-Aware Tile Size Selection for Affine Programs on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {13--27}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Kerbow, Austin |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Kim, Hyesoon |
CGO '24: "Enabling Fine-Grained Incremental ..."
Enabling Fine-Grained Incremental Builds by Making Compiler Stateful
Ruobing Han, Jisheng Zhao, and Hyesoon Kim (Georgia Institute of Technology, USA) Incremental builds are commonly employed in software development, involving minor changes to existing source code that is then frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds primarily rely on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do undergo changes, these build systems simply invoke compilers to recompile them from scratch. This approach reveals a fundamental asymmetry in the system: while build systems operate in a stateful manner, compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, focusing on entire source files, rather than at a more fine-grained level that considers individual code sections. In this paper, we propose an innovative approach for enabling the fine-grained incremental build by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation process of modified source files, thereby reducing overall build time. Specifically, the stateful compiler retains dormant information of compiler passes executed in previous builds and uses this data to bypass dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%. @InProceedings{CGO24p221, author = {Ruobing Han and Jisheng Zhao and Hyesoon Kim}, title = {Enabling Fine-Grained Incremental Builds by Making Compiler Stateful}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {221--232}, doi = {}, year = {2024}, } |
|
Ko, Yousun |
CGO '24: "BEC: Bit-Level Static Analysis ..."
BEC: Bit-Level Static Analysis for Reliability against Soft Errors
Yousun Ko and Bernd Burgstaller (Yonsei University, South Korea) Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally, or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios, fault injection campaign pruning, and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04 % of exhaustive fault injection campaigns (13.71 % on average), without loss of accuracy. Program vulnerability was reduced by up to 13.11 % (4.94 % on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture. @InProceedings{CGO24p283, author = {Yousun Ko and Bernd Burgstaller}, title = {BEC: Bit-Level Static Analysis for Reliability against Soft Errors}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {283--295}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Kong, Martin |
CGO '24: "Energy-Aware Tile Size Selection ..."
Energy-Aware Tile Size Selection for Affine Programs on GPUs
Malith Jayaweera, Martin Kong, Yanzhi Wang, and David Kaeli (Northeastern University, USA; Ohio State University, USA) Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels. @InProceedings{CGO24p13, author = {Malith Jayaweera and Martin Kong and Yanzhi Wang and David Kaeli}, title = {Energy-Aware Tile Size Selection for Affine Programs on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {13--27}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Laguna, Ignacio |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Lavery, Dan |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Leather, Hugh |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } |
|
Li, Handong |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Li, Haofeng |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Li, Jianhui |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Li, Lian |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Li, Xiang-Yang |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Liang, Chaoyi |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Liao, Chunhua |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Lin, Eric |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Linck, Bruce |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Liu, Xu |
CGO '24: "EasyView: Bringing Performance ..."
EasyView: Bringing Performance Profiles into Integrated Development Environments
Qidong Zhao, Milind Chabbi, and Xu Liu (North Carolina State University, USA; Scalable Machines Research, USA) Dynamic program analysis (also known as profiling) is well-known for its powerful capabilities of identifying performance inefficiencies in software packages. Although a large number of dynamic program analysis techniques are developed in academia and industry, very few of them are widely used by software developers in their regular software developing activities. There are three major reasons. First, the dynamic analysis tools (also known as profilers) are disjoint from the coding environments such as IDEs and editors; frequently switching focus between them significantly complicates the entire cycle of software development. Second, mastering various tools to interpret their analysis results requires substantial efforts; even worse, many tools have their own design of graphical user interfaces (GUI) for data presentation, which steepens the learning curves. Third, most existing tools expose few interfaces to support user-defined analysis, which makes the tools less customizable to fulfill diverse user demands. We develop EasyView, a general solution to integrate the interpretation and visualization of various profiling results in the coding environments, which bridges software developers with profilers to provide easy and intuitive dynamic analysis during the code development cycle. The novelty of EasyView is three-fold. First, we develop a generic data format, which enables EasyView to support mainstream profilers for different languages. Second, we develop a set of customizable schemes to analyze and visualize the profiles in intuitive ways. Third, we tightly integrate EasyView with popular coding environments, such as Microsoft Visual Studio Code, with easy code exploration and user interaction. Our evaluation shows that EasyView is able to support various profilers for different languages and provide unique insights into performance inefficiencies in different domains. Our user studies show that EasyView can effectively help software developers facilitate their coding and debugging efforts. @InProceedings{CGO24p386, author = {Qidong Zhao and Milind Chabbi and Xu Liu}, title = {EasyView: Bringing Performance Profiles into Integrated Development Environments}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {386--398}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional Results Reproduced CGO '24: "DrPy: Pinpointing Inefficient ..." DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications Jinku Cui, Qidong Zhao, Yueming Hao, and Xu Liu (North Carolina State University, USA) Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by the application developers and committed to the application repositories. @InProceedings{CGO24p245, author = {Jinku Cui and Qidong Zhao and Yueming Hao and Xu Liu}, title = {DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {245--257}, doi = {}, year = {2024}, } Artifacts Functional |
|
Lomüller, Victor |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Lossing, Nelson |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Lu, Jie |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Mahmud, Jalal |
CGO '24: "Seer: Predictive Runtime Kernel ..."
Seer: Predictive Runtime Kernel Selection for Irregular Problems
Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud (AMD, USA) Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert selects the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproduceable, and understandable decision tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study in Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2× over the best single iteration kernel across the entire SuiteSparse Matrix Collection dataset. @InProceedings{CGO24p133, author = {Ryan Swann and Muhammad Osama and Karthik Sangaiah and Jalal Mahmud}, title = {Seer: Predictive Runtime Kernel Selection for Irregular Problems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {133--142}, doi = {}, year = {2024}, } Artifacts Reusable Results Reproduced |
|
Maleki, Saeed |
CGO '24: "A Framework for Fine-Grained ..."
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, and Olli Saarikivi (Microsoft Research, USA; University of Toronto, Canada) Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes. @InProceedings{CGO24p93, author = {Abhinav Jangda and Saeed Maleki and Maryam Mehri Dehnavi and Madan Musuvathi and Olli Saarikivi}, title = {A Framework for Fine-Grained Synchronization of Dependent GPU Kernels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {93--105}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Malyshenko, Vladislav |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Martínez, Héctor |
CGO '24: "Tackling the Matrix Multiplication ..."
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, and Héctor Martínez (Universitat Politècnica de València, Spain; Cornell University, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA; Universidad de Córdoba, Spain) The optimization of the matrix multiplication (or gemm) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The gemm is usually implemented following the GotoBLAS philosophy, which tiles the gemm operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the exo compiler that perform close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions. @InProceedings{CGO24p182, author = {Adrián Castelló and Julian Bellavita and Grace Dinh and Yuka Ikarashi and Héctor Martínez}, title = {Tackling the Matrix Multiplication Micro-kernel Generation with Exo}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {182--192}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable |
|
McMichen, Tommy |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Mei, Yijie |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Meng, Haining |
CGO '24: "Boosting the Performance of ..."
Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations
Haofeng Li, Jie Lu, Haining Meng, Liqing Cao, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China; TianqiSoft, China) The IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are popularly used to solve a wide range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers, one forward solver to propagate tainted data-flow facts, and one backward solver to solve alias relations at the same time. For such problems, large amount of additional data-flow facts need to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android APPs. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision and soundness: the run time performance is sped up by up to 17.89X and memory usage is optimized by up to 9X. @InProceedings{CGO24p296, author = {Haofeng Li and Jie Lu and Haining Meng and Liqing Cao and Lian Li and Lin Gao}, title = {Boosting the Performance of Multi-solver IFDS Algorithms with Flow-Sensitivity Optimizations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {296--307}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Moses, William S. |
CGO '24: "Retargeting and Respecializing ..."
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses (Tokyo Institute of Technology, Japan; RIKEN R-CCS, Japan; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA) In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher-performance and lower costs have led to a significant diversification of architecture designs across, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using more newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs performing automatic translation from CUDA and simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 16% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. @InProceedings{CGO24p119, author = {Ivan R. Ivanov and Oleksandr Zinenko and Jens Domke and Toshio Endo and William S. Moses}, title = {Retargeting and Respecializing GPU Workloads for Performance Portability}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {119--132}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Murtovi, Alnis |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Musuvathi, Madan |
CGO '24: "A Framework for Fine-Grained ..."
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, and Olli Saarikivi (Microsoft Research, USA; University of Toronto, Canada) Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes. @InProceedings{CGO24p93, author = {Abhinav Jangda and Saeed Maleki and Maryam Mehri Dehnavi and Madan Musuvathi and Olli Saarikivi}, title = {A Framework for Fine-Grained Synchronization of Dependent GPU Kernels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {93--105}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Muyan-Özçelik, Pınar |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
O'Boyle, Michael F. P. |
CGO '24: "SLaDe: A Portable Small Language ..."
SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle (University of Edinburgh, United Kingdom; Meta AI Research, USA) Decompilation is a well-studied area with numerous high- quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. Nevertheless, to-date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence Transformer trained over real-world code and augmented with a type inference engine. We utilize a novel tokenizer, dropout-free regularization, and type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 ExeBench functions on two ISAs and at two optimization levels. SLaDe is up to 6× more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler and up to 4× more accurate than the large language model ChatGPT and generates significantly more readable code than both. @InProceedings{CGO24p67, author = {Jordi Armengol-Estapé and Jackson Woodruff and Chris Cummins and Michael F. P. O'Boyle}, title = {SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {67--80}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Oh, Taewook |
CGO '24: "Revamping Sampling-Based PGO ..."
Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation
Wenlei He, Hongtao Yu, Lei Wang, and Taewook Oh (Meta, USA) The ever increasing scale of modern data center demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and its environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique to improve application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but the performance gains are not as substantial as the instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. @InProceedings{CGO24p322, author = {Wenlei He and Hongtao Yu and Lei Wang and Taewook Oh}, title = {Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {322--333}, doi = {}, year = {2024}, } |
|
Okuda, Katsumi |
CGO '24: "AskIt: Unified Programming ..."
AskIt: Unified Programming Interface for Programming with Large Language Models
Katsumi Okuda and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Mitsubishi Electric Corporation, Japan) Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks, from text summarization to code generation. While these abilities open up novel avenues in software design and crafting, their incorporation presents substantial challenges. Developers face decisions regarding the use of LLMs for directly performing tasks within applications as well as for generating and executing code to accomplish these tasks. Moreover, effective prompt design becomes a critical concern, given the necessity of extracting data from natural language outputs. To address these complexities, this paper introduces AskIt, a domain-specific language (DSL) specifically designed for LLMs. AskIt simplifies LLM integration by providing a unified interface that not only allows for direct task execution using LLMs but also supports the entire cycle of code generation and execution. This dual capability is achieved through (1) type-guided output control, (2) template-based function definitions, and (3) prompt generation for both usage modes. Our evaluations underscore AskIt's effectiveness. Across 50 tasks, AskIt generated concise prompts, achieving a 16.14 % reduction in prompt length compared to benchmarks. Additionally, by enabling a seamless transition between using LLMs directly in applications and for generating code, AskIt achieved significant efficiency improvements, as observed in our GSM8K benchmark experiments. The implementations of AskIt in TypeScript and Python are available at https://github.com/katsumiok/ts-askit and https://github.com/katsumiok/pyaskit, respectively. @InProceedings{CGO24p41, author = {Katsumi Okuda and Saman Amarasinghe}, title = {AskIt: Unified Programming Interface for Programming with Large Language Models}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {41--54}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Oppermann, Julian |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Osama, Muhammad |
CGO '24: "Seer: Predictive Runtime Kernel ..."
Seer: Predictive Runtime Kernel Selection for Irregular Problems
Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud (AMD, USA) Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert selects the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproduceable, and understandable decision tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study in Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2× over the best single iteration kernel across the entire SuiteSparse Matrix Collection dataset. @InProceedings{CGO24p133, author = {Ryan Swann and Muhammad Osama and Karthik Sangaiah and Jalal Mahmud}, title = {Seer: Predictive Runtime Kernel Selection for Irregular Problems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {133--142}, doi = {}, year = {2024}, } Artifacts Reusable Results Reproduced |
|
Parasyris, Konstantinos |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Patel, Atmn |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Pérez, Víctor |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Puaut, Isabelle |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Qin, Zhennan |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Ramirez-Ortega, Ronaldo |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Rastello, Fabrice |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Razanajato, Harenome |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Regehr, John |
CGO '24: "High-Throughput, Formal-Methods-Assisted ..."
High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM
Yuyou Fan and John Regehr (University of Utah, USA) It is very difficult to thoroughly test a compiler, and as a consequence it is common for released versions of production compilers to contain bugs that cause them to crash and to emit incorrect object code. We created alive-mutate, a mutation-based fuzzing tool that takes test cases written by humans and randomly modifies them, based on the hypothesis that while compiler developers are fundamentally good at writing tests, they also tend to miss corner cases. Alive-mutate is integrated with the Alive2 translation validation tool for LLVM, which is useful because it checks the behavior of optimizations for all possible values of input variables. Alive-mutate is also integrated with the LLVM middle-end, allowing it to perform mutations, optimizations, and formal verification of the optimizations all within a single program---avoiding numerous sources of overhead. Alive-mutate's fuzzing throughput is 12x higher, on average, than a fuzzing workflow that runs mutation, optimization, and formal verification in separate processes. So far we have used alive-mutate to find and report 33 previously unknown bugs in LLVM. @InProceedings{CGO24p349, author = {Yuyou Fan and John Regehr}, title = {High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {349--358}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available |
|
Reymond, Hugo |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Rohou, Erven |
CGO '24: "SCHEMATIC: Compile-Time Checkpoint ..."
SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
Hugo Reymond, Jean-Luc Béchennec, Mikaël Briday, Sébastien Faucou, Isabelle Puaut, and Erven Rohou (Université de Rennes - Inria - CNRS - IRISA, France; Nantes Université - École Centrale Nantes - CNRS - LS2N - UMR 6004, France) Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present SCHEMATIC, a compiler technique that automates checkpoint placement and memory allocation to minimize the overall energy consumption. SCHEMATIC ensures that programs will eventually terminate (forward progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. SCHEMATIC takes advantage of volatile memory (VM) to reduce the energy consumed, by automatically placing the most used variables in VM. We tested SCHEMATIC for different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51 % compared to related techniques. @InProceedings{CGO24p258, author = {Hugo Reymond and Jean-Luc Béchennec and Mikaël Briday and Sébastien Faucou and Isabelle Puaut and Erven Rohou}, title = {SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {258--269}, doi = {}, year = {2024}, } Video |
|
Rolinger, Thomas B. |
CGO '24: "JITSPMM: Just-in-Time Instruction ..."
JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
Qiang Fu, Thomas B. Rolinger, and H. Howie Huang (Advanced Micro Devices, USA; NVIDIA, USA; George Washington University, USA) Achieving high performance for Sparse MatrixMatrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the aheadof-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerated SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimizing memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM and compare it two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides an average improvement of 3.8× and 1.4×, respectively. @InProceedings{CGO24p448, author = {Qiang Fu and Thomas B. Rolinger and H. Howie Huang}, title = {JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {448--459}, doi = {}, year = {2024}, } |
|
Rubin, Norm |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Saarikivi, Olli |
CGO '24: "A Framework for Fine-Grained ..."
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, and Olli Saarikivi (Microsoft Research, USA; University of Toronto, Canada) Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes. @InProceedings{CGO24p93, author = {Abhinav Jangda and Saeed Maleki and Maryam Mehri Dehnavi and Madan Musuvathi and Olli Saarikivi}, title = {A Framework for Fine-Grained Synchronization of Dependent GPU Kernels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {93--105}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Saioc, Georgian-Vlad |
CGO '24: "Unveiling and Vanquishing ..."
Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach
Georgian-Vlad Saioc, Dmitriy Shirchenko, and Milind Chabbi (Aarhus University, Denmark; Uber Technologies, Denmark; Uber Technologies, USA) Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go with lightweight “goroutines” as the building blocks of concurrent execution. Go advocates message-passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in “partial deadlock” (interchangeably called a goroutine leak), a subtle concurrency bug where a blocked sender (receiver) never finds a corresponding receiver (sender), causing the blocked goroutine to leak memory, via its call stack and objects reachable from the stack. In this paper, we systematically study the prevalence of message passing and the resulting partial deadlocks in ≈75 Million lines of Uber’s Go monorepo hosting ≈2500 microservices. We develop two lightweight, dynamic analysis tools: GOLEAK and LEAKPROF, designed to identify partial deadlocks. GOLEAK detects partial deadlocks during unit testing and prevents the introduction of new bugs. Conversely, LEAKPROF uses goroutine profiles obtained from services deployed in production to pinpoint intricate bugs arising from complex control flow, unexplored interleavings, or the absence of test coverage. We share our experience and insights deploying these tools in developer workflows in a large industrial setting. Using GOLEAK we unearthed 857 pre-existing goroutine leaks in the legacy code and prevented the introduction of ≈260 new leaks over one year period. Using LEAKPROF we found 24 and fixed 21 goroutine leaks, which resulted in up to 34% speedup and 9.2× memory reduction in some of our production services. @InProceedings{CGO24p411, author = {Georgian-Vlad Saioc and Dmitriy Shirchenko and Milind Chabbi}, title = {Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {411--422}, doi = {}, year = {2024}, } Video |
|
Sangaiah, Karthik |
CGO '24: "Seer: Predictive Runtime Kernel ..."
Seer: Predictive Runtime Kernel Selection for Irregular Problems
Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud (AMD, USA) Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert selects the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproduceable, and understandable decision tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study in Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2× over the best single iteration kernel across the entire SuiteSparse Matrix Collection dataset. @InProceedings{CGO24p133, author = {Ryan Swann and Muhammad Osama and Karthik Sangaiah and Jalal Mahmud}, title = {Seer: Predictive Runtime Kernel Selection for Irregular Problems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {133--142}, doi = {}, year = {2024}, } Artifacts Reusable Results Reproduced |
|
Santambrogio, Marco D. |
CGO '24: "One Automaton to Rule Them ..."
One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution
Luisa Cicolini, Filippo Carloni, Marco D. Santambrogio, and Davide Conficconi (Politecnico di Milano, Italy) Regular Expressions (REs) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, resulting paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push the pattern matching performance when considering multiple REs together. Thus, exploiting REs morphological similarities within the same dataset allows memory reduction when storing the patterns and drastically improves the dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA) that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s). Furthermore, we extend iNFAnt algorithm for MFSAs execution with the novel iMFAnt engine. Our evaluation investigates the MFSA size-reduction impact and the execution throughput compared with the one of multiple FSA in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, best iMFAnt achieves a geomean 5.99× throughput improvement and 4.05× speedup against single and multiple parallel FSAs. @InProceedings{CGO24p193, author = {Luisa Cicolini and Filippo Carloni and Marco D. Santambrogio and Davide Conficconi}, title = {One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {193--206}, doi = {}, year = {2024}, } Published Artifact Video Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Schwarz, Tobias |
CGO '24: "Compile-Time Analysis of Compiler ..."
Compile-Time Analysis of Compiler Frameworks for Query Compilation
Alexis Engelke and Tobias Schwarz (TU Munich, Germany) Low compilation times are highly important in contexts of Just-in-time compilation. This not only applies to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary measure for achieving high throughput in combination with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuning for low compilation time. Cranelift achieves a similar run-time performance to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance. @InProceedings{CGO24p233, author = {Alexis Engelke and Tobias Schwarz}, title = {Compile-Time Analysis of Compiler Frameworks for Query Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {233--244}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Seeker, Volker |
CGO '24: "Revealing Compiler Heuristics ..."
Revealing Compiler Heuristics through Automated Discovery and Optimization
Volker Seeker, Chris Cummins, Murray Cole, Björn Franke, Kim Hazelwood, and Hugh Leather (Meta AI Research, USA; University of Edinburgh, United Kingdom) Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: When modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide using a Differential Testing approach if we found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline we reduce binary sizes by up to 11.6% considering single heuristics only and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application. @InProceedings{CGO24p55, author = {Volker Seeker and Chris Cummins and Murray Cole and Björn Franke and Kim Hazelwood and Hugh Leather}, title = {Revealing Compiler Heuristics through Automated Discovery and Optimization}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {55--66}, doi = {}, year = {2024}, } |
|
Selva, Manuel |
CGO '24: "EasyTracker: A Python Library ..."
EasyTracker: A Python Library for Controlling and Inspecting Program Execution
Théo Barollet, Christophe Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, and Fabrice Rastello (University Grenoble Alpes - Inria - CNRS - Grenoble INP - LIG, France) Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program. @InProceedings{CGO24p359, author = {Théo Barollet and Christophe Guillon and Manuel Selva and François Broquedis and Florent Bouchez-Tichadou and Fabrice Rastello}, title = {EasyTracker: A Python Library for Controlling and Inspecting Program Execution}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {359--372}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Shaikhha, Amir |
CGO '24: "A Tensor Algebra Compiler ..."
A Tensor Algebra Compiler for Sparse Differentiation
Amir Shaikhha, Mathieu Huot, and Shideh Hashemian (University of Edinburgh, United Kingdom; University of Oxford, United Kingdom) Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets. @InProceedings{CGO24p1, author = {Amir Shaikhha and Mathieu Huot and Shideh Hashemian}, title = {A Tensor Algebra Compiler for Sparse Differentiation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {1--12}, doi = {}, year = {2024}, } |
|
Shirchenko, Dmitriy |
CGO '24: "Unveiling and Vanquishing ..."
Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach
Georgian-Vlad Saioc, Dmitriy Shirchenko, and Milind Chabbi (Aarhus University, Denmark; Uber Technologies, Denmark; Uber Technologies, USA) Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go with lightweight “goroutines” as the building blocks of concurrent execution. Go advocates message-passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in “partial deadlock” (interchangeably called a goroutine leak), a subtle concurrency bug where a blocked sender (receiver) never finds a corresponding receiver (sender), causing the blocked goroutine to leak memory, via its call stack and objects reachable from the stack. In this paper, we systematically study the prevalence of message passing and the resulting partial deadlocks in ≈75 Million lines of Uber’s Go monorepo hosting ≈2500 microservices. We develop two lightweight, dynamic analysis tools: GOLEAK and LEAKPROF, designed to identify partial deadlocks. GOLEAK detects partial deadlocks during unit testing and prevents the introduction of new bugs. Conversely, LEAKPROF uses goroutine profiles obtained from services deployed in production to pinpoint intricate bugs arising from complex control flow, unexplored interleavings, or the absence of test coverage. We share our experience and insights deploying these tools in developer workflows in a large industrial setting. Using GOLEAK we unearthed 857 pre-existing goroutine leaks in the legacy code and prevented the introduction of ≈260 new leaks over one year period. Using LEAKPROF we found 24 and fixed 21 goroutine leaks, which resulted in up to 34% speedup and 9.2× memory reduction in some of our production services. @InProceedings{CGO24p411, author = {Georgian-Vlad Saioc and Dmitriy Shirchenko and Milind Chabbi}, title = {Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {411--422}, doi = {}, year = {2024}, } Video |
|
Shobaki, Ghassan |
CGO '24: "Instruction Scheduling for ..."
Instruction Scheduling for the GPU on the GPU
Ghassan Shobaki, Pınar Muyan-Özçelik, Josh Hutton, Bruce Linck, Vladislav Malyshenko, Austin Kerbow, Ronaldo Ramirez-Ortega, and Vahl Scott Gordon (California State University, Sacramento, USA; Advanced Micro Devices, USA) In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD's production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU. @InProceedings{CGO24p435, author = {Ghassan Shobaki and Pınar Muyan-Özçelik and Josh Hutton and Bruce Linck and Vladislav Malyshenko and Austin Kerbow and Ronaldo Ramirez-Ortega and Vahl Scott Gordon}, title = {Instruction Scheduling for the GPU on the GPU}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {435--447}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Shriraman, Arrvindh |
CGO '24: "TapeFlow: Streaming Gradient ..."
TapeFlow: Streaming Gradient Tapes in Automatic Differentiation
Milad Hakimi and Arrvindh Shriraman (Simon Fraser University, Canada) Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse mode AD, an auxiliary structure, the tape, is used to transfer intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy since it has a high reuse distance, lacks temporal locality, and inflates working set by 2 — 4×. We introduce Tapeflow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad. This reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from the DRAM by organizing it into a FIFO of tiles. The tape operands arrive just-in-time for each layer. Tapeflow, running on the same hardware, outperforms Enzyme, the state-of-the-art compiler, by 1.3—2.5×, reduces on-chip SRAM usage by 5— 40×, and saves 8× on-chip energy. We demonstrate Tapeflow on a wide range of algorithms written in general-purpose language. Index Terms—Automatic differentiation, Gradients, Streaming Algorithms, Back propagation @InProceedings{CGO24p81, author = {Milad Hakimi and Arrvindh Shriraman}, title = {TapeFlow: Streaming Gradient Tapes in Automatic Differentiation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {81--92}, doi = {}, year = {2024}, } Video |
|
Smith, Lee |
CGO '24: "Compiler Testing with Relaxed ..."
Compiler Testing with Relaxed Memory Models
Luke Geeson and Lee Smith (University College London, United Kingdom; Arm, United Kingdom) Finding bugs is key to the correctness of compilers in wide use today. If the behaviour of a compiled program, as allowed by its architecture memory model, is not a behaviour of the source program under its source model, then there is a bug. This holds for all programs, but we focus on concurrency bugs that occur only with two or more threads of execution. We focus on testing techniques that detect such bugs in C/C++ compilers. We seek a testing technique that automatically covers concurrency bugs up to fixed bounds on program sizes and that scales to find bugs in compiled programs with many lines of code. Otherwise, a testing technique can miss bugs. Unfortunately, the state-of-the-art techniques are yet to satisfy all of these properties. We present the Téléchat compiler testing tool for concurrent programs. Téléchat compiles a concurrent C/C++ program and compares source and compiled program behaviours using source and architecture memory models. We make three claims: Téléchat improves the state-of-the-art at finding bugs in code generation for multi-threaded execution, it is the first public description of a compiler testing tool for concurrency that is deployed in industry, and it is the first tool that takes a significant step towards the desired properties. We provide experimental evidence suggesting Téléchat finds bugs missed by other state-of-the-art techniques, case studies indicating that Téléchat satisfies the properties, and reports of our experience deploying Téléchat in industry regression testing. @InProceedings{CGO24p334, author = {Luke Geeson and Lee Smith}, title = {Compiler Testing with Relaxed Memory Models}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {334--348}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Sommer, Lukas |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Song, Yunfei |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Sossai, Federico |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Steffen, Bernhard |
CGO '24: "Enhancing Performance through ..."
Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
Alnis Murtovi, Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, Ignacio Laguna, and Bernhard Steffen (TU Dortmund, Germany; Lawrence Livermore National Laboratory, USA) Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%. @InProceedings{CGO24p106, author = {Alnis Murtovi and Giorgis Georgakoudis and Konstantinos Parasyris and Chunhua Liao and Ignacio Laguna and Bernhard Steffen}, title = {Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {106--118}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Susungi, Adilla |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Swann, Ryan |
CGO '24: "Seer: Predictive Runtime Kernel ..."
Seer: Predictive Runtime Kernel Selection for Irregular Problems
Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud (AMD, USA) Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert selects the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproduceable, and understandable decision tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study in Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2× over the best single iteration kernel across the entire SuiteSparse Matrix Collection dataset. @InProceedings{CGO24p133, author = {Ryan Swann and Muhammad Osama and Karthik Sangaiah and Jalal Mahmud}, title = {Seer: Predictive Runtime Kernel Selection for Irregular Problems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {133--142}, doi = {}, year = {2024}, } Artifacts Reusable Results Reproduced |
|
Tchoulak, Nassim |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Tiotto, Ettore |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Titzer, Ben L. |
CGO '24: "Whose Baseline Compiler Is ..."
Whose Baseline Compiler Is It Anyway?
Ben L. Titzer (Carnegie Mellon University, USA) Compilers face an intrinsic tradeoff between compilation speed and code quality. The tradeoff is particularly stark in a dynamic setting where JIT compilation time contributes to application runtime. Many systems now employ multiple compilation tiers, where one tier offers fast compile speed while another has much slower compile speed but produces higher quality code. With proper heuristics on when to use each, the overall performance is better than using either compiler in isolation. At the introduction of WebAssembly into the Web platform in 2017, most engines employed optimizing compilers and pre-compiled entire modules before execution. Yet since that time, all Web engines have introduced new ”baseline” compiler tiers for Wasm to improve startup time. Further, many new non-web engines have appeared, some of which also employ simple compilers. In this paper, we demystify single-pass compilers for Wasm, explaining their internal algorithms and tradeoffs, as well as providing a detailed empirical study of those employed in production. We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm. @InProceedings{CGO24p207, author = {Ben L. Titzer}, title = {Whose Baseline Compiler Is It Anyway?}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {207--220}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Tsang, Whitney |
CGO '24: "Experiences Building an MLIR-Based ..."
Experiences Building an MLIR-Based SYCL Compiler
Ettore Tiotto, Víctor Pérez, Whitney Tsang, Lukas Sommer, Julian Oppermann, Victor Lomüller, Mehdi Goli, and James Brodman (Intel Corporation, Canada; Codeplay Software, United Kingdom; Intel Corporation, USA) Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, allows to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future. @InProceedings{CGO24p399, author = {Ettore Tiotto and Víctor Pérez and Whitney Tsang and Lukas Sommer and Julian Oppermann and Victor Lomüller and Mehdi Goli and James Brodman}, title = {Experiences Building an MLIR-Based SYCL Compiler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {399--410}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Tumeo, Antonino |
CGO '24: "AXI4MLIR: User-Driven Automatic ..."
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, and David Kaeli (Northeastern University, USA; Pacific Northwest National Laboratory, USA; University of Glasgow, United Kingdom; University of Murcia, Spain) This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65× speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. @InProceedings{CGO24p143, author = {Nicolas Bohm Agostini and Jude Haris and Perry Gibson and Malith Jayaweera and Norm Rubin and Antonino Tumeo and José L. Abellán and José Cano and David Kaeli}, title = {AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {143--157}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Van der Cruysse, Jonathan |
CGO '24: "Latent Idiom Recognition for ..."
Latent Idiom Recognition for a Minimalist Functional Array Language using Equality Saturation
Jonathan Van der Cruysse and Christophe Dubach (McGill University, Canada) Accelerating programs is typically done by recognizing code idioms matching high-performance libraries or hardware interfaces. However, recognizing such idioms automatically is challenging. The idiom recognition machinery is difficult to write and requires expert knowledge. In addition, slight variations in the input program might hide the idiom and defeat the recognizer. This paper advocates for the use of a minimalist functional array language supporting a small, but expressive, set of operators. The minimalist design leads to a tiny sets of rewrite rules, which encode the language semantics. Crucially, the same minimalist language is also used to encode idioms. This removes the need for hand-crafted analysis passes, or for having to learn a complex domain-specific language to define the idioms. Coupled with equality saturation, this approach is able to match the core functions from the BLAS and PyTorch libraries on a set of computational kernels. Compared to reference C kernel implementations, the approach produces a geometric mean speedup of 1.46$× for C programs using BLAS, when generating such programs from the high-level minimalist language. @InProceedings{CGO24p270, author = {Jonathan Van der Cruysse and Christophe Dubach}, title = {Latent Idiom Recognition for a Minimalist Functional Array Language using Equality Saturation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {270--282}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Vougioukas, Ilias |
CGO '24: "OptiWISE: Combining Sampling ..."
OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, and Timothy M. Jones (University of Cambridge, United Kingdom; Arm, USA; Arm, United Kingdom) Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements. @InProceedings{CGO24p373, author = {Yuxin Guo and Alex W. Chadwick and Márton Erdős and Utpal Bora and Ilias Vougioukas and Giacomo Gabrielli and Timothy M. Jones}, title = {OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {373--385}, doi = {}, year = {2024}, } Published Artifact Video Artifacts Available Artifacts Reusable Results Reproduced |
|
Wang, Lei |
CGO '24: "Revamping Sampling-Based PGO ..."
Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation
Wenlei He, Hongtao Yu, Lei Wang, and Taewook Oh (Meta, USA) The ever increasing scale of modern data center demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and its environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique to improve application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but the performance gains are not as substantial as the instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. @InProceedings{CGO24p322, author = {Wenlei He and Hongtao Yu and Lei Wang and Taewook Oh}, title = {Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {322--333}, doi = {}, year = {2024}, } |
|
Wang, Wenwen |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Wang, Yanzhi |
CGO '24: "Energy-Aware Tile Size Selection ..."
Energy-Aware Tile Size Selection for Affine Programs on GPUs
Malith Jayaweera, Martin Kong, Yanzhi Wang, and David Kaeli (Northeastern University, USA; Ohio State University, USA) Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5× and 1.2× improvement and obtain up to 6.3× improvement on non-Polybench high-dimensional affine kernels. @InProceedings{CGO24p13, author = {Malith Jayaweera and Martin Kong and Yanzhi Wang and David Kaeli}, title = {Energy-Aware Tile Size Selection for Affine Programs on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {13--27}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional |
|
Woodruff, Jackson |
CGO '24: "SLaDe: A Portable Small Language ..."
SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, and Michael F. P. O'Boyle (University of Edinburgh, United Kingdom; Meta AI Research, USA) Decompilation is a well-studied area with numerous high- quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. Nevertheless, to-date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence Transformer trained over real-world code and augmented with a type inference engine. We utilize a novel tokenizer, dropout-free regularization, and type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 ExeBench functions on two ISAs and at two optimization levels. SLaDe is up to 6× more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler and up to 4× more accurate than the large language model ChatGPT and generates significantly more readable code than both. @InProceedings{CGO24p67, author = {Jordi Armengol-Estapé and Jackson Woodruff and Chris Cummins and Michael F. P. O'Boyle}, title = {SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {67--80}, doi = {}, year = {2024}, } Published Artifact Artifacts Available |
|
Wu, Jun |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Xie, Wei |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Yang, Zhaohui |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Yao, Jianguo |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Ye, Jason |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Yew, Pen-Chung |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Yu, Hongtao |
CGO '24: "Revamping Sampling-Based PGO ..."
Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation
Wenlei He, Hongtao Yu, Lei Wang, and Taewook Oh (Meta, USA) The ever increasing scale of modern data center demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and its environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique to improve application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but the performance gains are not as substantial as the instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. @InProceedings{CGO24p322, author = {Wenlei He and Hongtao Yu and Lei Wang and Taewook Oh}, title = {Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {322--333}, doi = {}, year = {2024}, } |
|
Zhang, Chi |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Zhang, Renwei |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhang, Weihua |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Zhang, Yan |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhang, Yifei |
CGO '24: "oneDNN Graph Compiler: A Hybrid ..."
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, Yan Zhang, Jason Ye, Eric Lin, and Dan Lavery (Intel, USA; Intel, China) With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives doesn’t fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors. @InProceedings{CGO24p460, author = {Jianhui Li and Zhennan Qin and Yijie Mei and Jingze Cui and Yunfei Song and Ciyong Chen and Yifei Zhang and Longsheng Du and Xianhang Cheng and Baihui Jin and Yan Zhang and Jason Ye and Eric Lin and Dan Lavery}, title = {oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {460--470}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhang, Zhen |
CGO '24: "PolyTOPS: Reconfigurable and ..."
PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cédric Bastoul (Huawei Technologies, France; Mines Paris-PSL, France; Huawei Technologies, China) Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario. @InProceedings{CGO24p28, author = {Gianpietro Consolaro and Zhen Zhang and Harenome Razanajato and Nelson Lossing and Nassim Tchoulak and Adilla Susungi and Artur Cesar Araujo Alves and Renwei Zhang and Denis Barthou and Corinne Ancourt and Cédric Bastoul}, title = {PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {28--40}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhao, Jisheng |
CGO '24: "Enabling Fine-Grained Incremental ..."
Enabling Fine-Grained Incremental Builds by Making Compiler Stateful
Ruobing Han, Jisheng Zhao, and Hyesoon Kim (Georgia Institute of Technology, USA) Incremental builds are commonly employed in software development, involving minor changes to existing source code that is then frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds primarily rely on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do undergo changes, these build systems simply invoke compilers to recompile them from scratch. This approach reveals a fundamental asymmetry in the system: while build systems operate in a stateful manner, compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, focusing on entire source files, rather than at a more fine-grained level that considers individual code sections. In this paper, we propose an innovative approach for enabling the fine-grained incremental build by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation process of modified source files, thereby reducing overall build time. Specifically, the stateful compiler retains dormant information of compiler passes executed in previous builds and uses this data to bypass dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%. @InProceedings{CGO24p221, author = {Ruobing Han and Jisheng Zhao and Hyesoon Kim}, title = {Enabling Fine-Grained Incremental Builds by Making Compiler Stateful}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {221--232}, doi = {}, year = {2024}, } |
|
Zhao, Qidong |
CGO '24: "EasyView: Bringing Performance ..."
EasyView: Bringing Performance Profiles into Integrated Development Environments
Qidong Zhao, Milind Chabbi, and Xu Liu (North Carolina State University, USA; Scalable Machines Research, USA) Dynamic program analysis (also known as profiling) is well-known for its powerful capabilities of identifying performance inefficiencies in software packages. Although a large number of dynamic program analysis techniques are developed in academia and industry, very few of them are widely used by software developers in their regular software developing activities. There are three major reasons. First, the dynamic analysis tools (also known as profilers) are disjoint from the coding environments such as IDEs and editors; frequently switching focus between them significantly complicates the entire cycle of software development. Second, mastering various tools to interpret their analysis results requires substantial efforts; even worse, many tools have their own design of graphical user interfaces (GUI) for data presentation, which steepens the learning curves. Third, most existing tools expose few interfaces to support user-defined analysis, which makes the tools less customizable to fulfill diverse user demands. We develop EasyView, a general solution to integrate the interpretation and visualization of various profiling results in the coding environments, which bridges software developers with profilers to provide easy and intuitive dynamic analysis during the code development cycle. The novelty of EasyView is three-fold. First, we develop a generic data format, which enables EasyView to support mainstream profilers for different languages. Second, we develop a set of customizable schemes to analyze and visualize the profiles in intuitive ways. Third, we tightly integrate EasyView with popular coding environments, such as Microsoft Visual Studio Code, with easy code exploration and user interaction. Our evaluation shows that EasyView is able to support various profilers for different languages and provide unique insights into performance inefficiencies in different domains. Our user studies show that EasyView can effectively help software developers facilitate their coding and debugging efforts. @InProceedings{CGO24p386, author = {Qidong Zhao and Milind Chabbi and Xu Liu}, title = {EasyView: Bringing Performance Profiles into Integrated Development Environments}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {386--398}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Functional Results Reproduced CGO '24: "DrPy: Pinpointing Inefficient ..." DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications Jinku Cui, Qidong Zhao, Yueming Hao, and Xu Liu (North Carolina State University, USA) Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by the application developers and committed to the application repositories. @InProceedings{CGO24p245, author = {Jinku Cui and Qidong Zhao and Yueming Hao and Xu Liu}, title = {DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {245--257}, doi = {}, year = {2024}, } Artifacts Functional |
|
Zhong, Peter |
CGO '24: "Representing Data Collections ..."
Representing Data Collections in an SSA Form
Tommy McMichen, Nathan Greiner, Peter Zhong, Federico Sossai, Atmn Patel, and Simone Campanoni (Northwestern University, USA) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. @InProceedings{CGO24p308, author = {Tommy McMichen and Nathan Greiner and Peter Zhong and Federico Sossai and Atmn Patel and Simone Campanoni}, title = {Representing Data Collections in an SSA Form}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {308--321}, doi = {}, year = {2024}, } Published Artifact Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhou, Hao |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Zhou, Zhongjun |
CGO '24: "A System-Level Dynamic Binary ..."
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, and Weihua Zhang (Fudan University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode. @InProceedings{CGO24p423, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {423--434}, doi = {}, year = {2024}, } |
|
Zhu, Liang |
CGO '24: "PresCount: Effective Register ..."
PresCount: Effective Register Allocation for Bank Conflict Reduction
Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, and Jianguo Yao (Shanghai Jiao Tong University, China; Shanghai Enflame Technology, China) Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs. @InProceedings{CGO24p170, author = {Xiaofeng Guan and Hao Zhou and Guoqing Bao and Handong Li and Liang Zhu and Jianguo Yao}, title = {PresCount: Effective Register Allocation for Bank Conflict Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {170--181}, doi = {}, year = {2024}, } |
|
Zhu, Mingzheng |
CGO '24: "Ecmas: Efficient Circuit Mapping ..."
Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
Mingzheng Zhu, Hao Fu, Jun Wu, Chi Zhang, Wei Xie, and Xiang-Yang Li (University of Science and Technology of China, China) As the leading candidate of quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing the circuit's execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics: Circuit Parallelism Degree and Chip Communication Capacity to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both double defect and lattice surgery models. Furthermore, we provide an additional version Ecmas-ReSu for sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods by reducing the execution time by 51.5% on average for double defect model. Ecmas can reach the optimal result in most benchmarks, reducing the execution time by up to 13.9% for lattice surgery model. @InProceedings{CGO24p158, author = {Mingzheng Zhu and Hao Fu and Jun Wu and Chi Zhang and Wei Xie and Xiang-Yang Li}, title = {Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {158--169}, doi = {}, year = {2024}, } |
|
Zinenko, Oleksandr |
CGO '24: "Retargeting and Respecializing ..."
Retargeting and Respecializing GPU Workloads for Performance Portability
Ivan R. Ivanov, Oleksandr Zinenko, Jens Domke, Toshio Endo, and William S. Moses (Tokyo Institute of Technology, Japan; RIKEN R-CCS, Japan; Google DeepMind, France; University of Illinois at Urbana-Champaign, USA; Google DeepMind, USA) In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher-performance and lower costs have led to a significant diversification of architecture designs across, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using more newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs performing automatic translation from CUDA and simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 16% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program. @InProceedings{CGO24p119, author = {Ivan R. Ivanov and Oleksandr Zinenko and Jens Domke and Toshio Endo and William S. Moses}, title = {Retargeting and Respecializing GPU Workloads for Performance Portability}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {119--132}, doi = {}, year = {2024}, } Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced |
176 authors
proc time: 3.73