CGO 2022
2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
Powered by
Conference Publishing Consulting

2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), April 02–06, 2022, Seoul, South Korea

CGO 2022 – Proceedings

Contents - Abstracts - Authors


Title Page

Welcome from the General Chair
Welcome to the 20th edition of the IEEE/ACM International Symposium on Code Generation and optimization (CGO) organized in Seoul, South Korea, but attended from all around the globe!

Welcome from the Program Chairs
We are pleased to welcome you to CGO 2022. On behalf of the Program Committee, we are pleased to present an exciting and stimulating program for the 2022 International Symposium on Code Generation and Optimization.

CGO 2022 Organization

Report from the Artifact Evaluation Committee
CGO 2022 included two separate categories of papers—main conference papers and tool & practical experience papers—as it did in the previous year. The acceptance criterion of the tool papers continued to adopt the minimum requirement to test the functionality of the tool from the Artifact Evaluation Committee.

CGO 2022 Sponsors


A Compiler Framework for Optimizing Dynamic Parallelism on GPUs
Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, and Izzat El Hajj
(American University of Beirut, Lebanon; ETH Zurich, Switzerland; University of Illinois at Urbana-Champaign, USA; NVIDIA, USA)
Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization.
To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid.
Thresholding is sometimes applied manually by programmers in the context of dynamic parallelism. We automate it in the compiler and discuss the challenges associated with doing so. Coarsening is sometimes applied as an optimization in other contexts. We propose to apply coarsening in the context of dynamic parallelism and automate it in the compiler as well. Aggregation has been automated in the compiler by prior work. We enhance aggregation by proposing a new aggregation technique that uses multi-block granularity. We also integrate these three optimizations into an open-source compiler framework to simplify the process of optimizing dynamic parallelism code.
Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0× over applications that use dynamic parallelism, 8.7× over applications that do not use dynamic parallelism, and 3.6× over applications that use dynamic parallelism with aggregation alone as proposed in prior work.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Automatic Horizontal Fusion for GPU Kernels
Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long
(Carnegie Mellon University, USA; University of Toronto, Canada)
We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose goal is to eliminate intermediate data round trips, our horizontal fusion technique aims to increase the thread-level parallelism to hide instruction latencies. We also present HFuse, a new source to source CUDA compiler that implements automatic horizontal fusion. Our experimental results show that the horizontal fusion can speed up the running time by 2.5%-60.8%. Our results reveal that the horizontal fusion is especially beneficial for fusing kernels with instructions that require different kinds of GPU resources (e.g., a memory-intensive kernel and a compute-intensive kernel).

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
DARM: Control-Flow Melding for SIMT Thread Divergence Reduction
Charitha Saumya, Kirshanthan Sundararajah, and Milind KulkarniORCID logo
(Purdue University, USA)
GPGPUs use the Single-Instruction-Multiple-Thread (SIMT) execution model where a group of threads— wavefront or warp—execute instructions in lockstep. When threads in a group encounter a branching instruction, not all threads in the group take the same path, a phenomenon known as control-flow divergence. The control-flow divergence causes performance degradation because both paths of the branch must be executed one after the other. Prior research has primarily addressed this issue through architectural modifications. We observe that certain GPGPU kernels with control-flow divergence have similar control-flow structures with similar instructions on both sides of a branch. This structure can be exploited to reduce control-flow divergence by melding the two sides of the branch allowing threads to reconverge early, reducing divergence. In this work, we present DARM, a compiler analysis and transformation framework that can meld divergent control-flow structures with similar instruction sequences. We show that DARM can reduce the performance degradation from control-flow divergence.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Efficient Execution of OpenMP on GPUs
Joseph Huber, Melanie Cornelius, Giorgis Georgakoudis, Shilei Tian, Jose M Monslave Diaz, Kuter Dinel, Barbara Chapman, and Johannes Doerfert
(Oak Ridge National Laboratory, USA; Illinois Institute of Technology, USA; Lawrence Livermore National Laboratory, USA; Stony Brook University, USA; Argonne National Laboratory, USA; Düzce University, Turkey)
OpenMP is the preferred choice for CPU parallelism in High- Performance-Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting existing (CPU) code onto GPUs, including well established CPU parallelism paradigms. However, there are architectural differences be- tween CPU and GPU execution which make common pat- terns, like forking and joining threads, single threaded execu- tion, or sharing of local (stack) variables, in general costly on the latter. So far it was left to the user to identify and avoid non-efficient code patterns, most commonly by writing their OpenMP offloading codes in a kernel-language style which resembles CUDA more than it does traditional OpenMP.
In this work we present OpenMP-aware program analy- ses and optimizations that allow efficient execution of the generic, CPU-centric parallelism model provided by OpenMP on GPUs. Our implementation in LLVM/Clang maps various common OpenMP patterns found in real world applications efficiently to the GPU. As static analysis is inherently limited we provide actionable and informative feedback to the user about the performed and missed optimizations, together with ways for the user to annotate the program for better results. Our extensive evaluation using several HPC proxy applica- tions shows significantly improved GPU kernel times and reduction in resources requirements, such as GPU registers.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Reproduced

Domain-Specific Compilation

GraphIt to CUDA Compiler in 2021 LOC: A Case for High-Performance DSL Implementation via Staging with BuilDSL
Ajay BrahmakshatriyaORCID logo and Saman AmarasingheORCID logo
(Massachusetts Institute of Technology, USA)
Domain-Specific Languages (DSLs) provide the optimum balance between generalization and specialization that is crucial to getting the best performance for a particular domain. DSLs like Halide and GraphIt and their rich scheduling languages allow users to generate an implementation best suited for the algorithm and input. DSLs also provide the right abstraction for generating code for diverse architectures like GPUs, CPUs, and hardware accelerators. DSL compilers are massive, typically spanning tens of thousands of lines of code and need a frontend, some analysis and transformation passes, and target-specific code generation. These implementations usually require a great deal of compiler knowledge and domain experts cannot prototype DSLs without getting compiler experts involved.
Using multi-stage programming in a high-level language like Scala, OCaml, or C++, is a great solution because it provides easy-to-use frontend and automatic code generation abilities. The DSL writers typically implement their abstraction as a library in the multi-stage programming language and use it to generate specialized code by providing partial inputs. This solves the problem only partially because DSLs like GraphIt have shown that several domain-specific analyses and transformations need t be performed to get the best performance. Special care has to be taken when targeting massively parallel architectures like GPUs where factors like load balancing, warp divergence, coalesced memory accesses play a critical role.
In this paper, we demonstrate how to build an end-to-end DSL compiler framework and a graph DSL using multi-stage programming in C++. We show how the staged types can be extended to perform domain-specific data flow and control flow analyses and transformations. We also show how our generated CUDA code matches the performance of the code generated from the state-of-the-art graph DSL, GraphIt. We achieve all this in a very small fraction (8.4%) of the code size required to implement the traditional DSL compiler.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
A Compiler for Sound Floating-Point Computations using Affine Arithmetic
Joao Rivera, Franz Franchetti, and Markus Püschel
(ETH Zurich, Switzerland; Carnegie Mellon University, USA)
Floating-point arithmetic is extensively used in scientific and engineering applications to approximate real arithmetic. Unfortunately, floating-point arithmetic is not a sound implementation of real arithmetic, i.e., it may produce different results, does not provide error guarantees, and the errors can become arbitrarily large. In this paper, we introduce SafeGen, a source-to-source compiler that rewrites a given C program using floating-point arithmetic to an efficient C program performing the same computation soundly, i.e., it returns an error bound that is guaranteed to contain the correct result of the program if it had been executed in real arithmetic. Equivalently, it gives a precision certificate on the number of correct bits in the result. SafeGen uses affine arithmetic (AA) that keeps accuracy high compared to interval arithmetic by preserving linear correlations between variables. To mitigate its high cost, SafeGen combines a novel form of static analysis to identify these correlations with a flexible policy-based approach for their selection. SafeGen supports SIMD intrinsics in the input and can output SIMD-optimized code. Our results show that SafeGen-generated code is 30--70 times faster than manually rewritten code using AA libraries. Equivalently, SafeGen can offer many more bits of certified accuracy within a reduced time budget.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Aggregate Update Problem for Multi-clocked Dataflow Languages
Hannes Kallwies, Martin Leucker, Torben Scheffel, Malte Schmitz, and Daniel Thoma
(University of Lübeck, Germany)
Dataflow languages have, as well as functional languages, immutable semantics, which is often implemented by copying values. A common compiler optimization known from functional languages involves analyzing which data structures can be modified in-place instead of copying them. This paper presents a novel algorithm to this so called Aggregate Update Problem for multi-clocked dataflow languages, i.e. those that allow streams to have events at disjoint timestamps, like e.g. Lucid, Lustre and Signal. Unrestricted multi-clocked languages require a static triggering analysis on how events and hence data values are read, written and replicated. We use TeSSLa as a generic stream transformation language with a small set of operators to develop our ideas. We implemented the solution in a TeSSLa compiler targeting the Java VM via Scala code generation which combines persistent data structures and mutable data structures for those data values which allow in-place editing. Our empirical evaluation shows considerable speedup for use cases where queues, maps or sets are dominant data structures.

Publisher's Version Article Search Info Artifacts Available Artifacts Reusable Results Reproduced


CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research
Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather
(Meta, USA)
Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field.
We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling researchers to interact with compilers using Python and a familiar API.
We describe the CompilerGym architecture and implementation, characterize the optimization spaces and computational efficiencies of three included compiler environments, and provide extensive empirical evaluations. Compared to prior works, CompilerGym offers larger datasets and optimization spaces, is 27x more computationally efficient, is fault-tolerant, and capable of detecting reproducibility bugs in the underlying compilers.
In making it easy for anyone to experiment with compilers -- irrespective of their background -- we aim to accelerate progress in the AI and compiler research domains.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
PALMED: Throughput Characterization for Superscalar Architectures
Nicolas Derumigny, Théophile Bastian, Fabian Gruber, Guillaume Iooss, Christophe Guillon, Louis-Noël Pouchet, and Fabrice Rastello
(Grenoble Alps University, France; Inria, France; CNRS, France; Grenoble INP, France; Colorado State University, USA; STMicroelectronics, France)
In a super-scalar architecture, the scheduler dynamically assigns micro-operations (µOp) to execution ports. The port mapping of an architecture describes how an instruction decomposes into µOps and lists for each µOp the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop.
This paper introduces a dual equivalent representation: The resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters, and relies solely on runtime measurements.
We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic-blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found comparable accuracy to state-of-the art tools, achieving sub-10 % mean square error rate on this workload on Intel’s Skylake microarchitecture.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
SRTuner: Effective Compiler Optimization Customization by Exposing Synergistic Relations
Sunghyun Park, Salar Latifi, Yongjun Park, Armand Behroozi, Byungsoo Jeon, and Scott Mahlke
(University of Michigan at Ann Arbor, USA; Hanyang University, South Korea; Carnegie Mellon University, USA)
Despite ceaseless efforts, extremely large and complex optimization space makes even the state-of-the-art compilers fail in delivering the most performant setting that can fully utilize the underlying hardware. Although this inefficiency suggests opportunity for tuning, it has been challenging for prior tuning methods to consider the complex interactions between optimizations and maximize the tuning quality while handling local optima efficiently.
To tackle this problem, we suggest an intelligent auto-tuning strategy, called SRTuner, which searches for the best optimization setting by exposing important optimization interactions and directly using them to focus on promising subspaces. To reveal high-impact inter-optimization relations, SRTuner proposes a multistage structure and a distribution-based estimation method that approximates the impact of an optimization effectively. Besides, to efficiently handle local optima, our technique defines optimization decisions as a series of multi-armed bandit problems to formulate the exploration-exploitation dilemma.
SRTuner is evaluated with three representative compilers from various domains on different target hardware: GCC (traditional C/C++ compiler) on CPU, TVM (domain-specific machine learning compiler) on GPU, and OpenCL compilers (kernel compiler for heterogeneous computing) on both CPU/GPU. Results show that SRTuner accelerates target executions by 1.24×, 2.03× and 34.4× compared to the highest level of optimization provided by each compiler and outperforms state-of-the-art works by 1.04×-1.14×.
As a byproduct of our unique tuning strategy, SRTuner can offer synergistic optimizations for each workload, which allows it to in part identify why it outperformed current compilers. With this information, we are able to find important optimizations that each compiler misused and demonstrate how this information can benefit future tuning strategies.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Reproduced

Binary Techniques

Recovering Container Class Types in C++ Binaries
Xudong Wang, Xuezheng Xu, Qingan Li, Mengting Yuan ORCID logo, and Jingling Xue
(UNSW, Australia; Wuhan University, China)
We present Tiara, a novel approach to recovering container classes in C++ binaries. Given a variable address in a C++ binary, Tiara first applies a new type-relevant slicing algorithm incorporated with a decay function, TSlice, to obtain an inter-procedural forward slice of instructions expressed as a CFG to summarize how the variable is used in the binary (as our primary contribution). Tiara then makes use of a GCN (Graph Convolutional Network) to learn and predict the container type for the variable (as our secondary contribution). According to our evaluation, Tiara can advance the state of the art in inferring commonly used container types in a set of eight large real-world COTS C++ binaries efficiently (in terms of the overall analysis time) and effectively (in terms of precision, recall and F1 score).

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Automatic Generation of Debug Headers through BlackBox Equivalence Checking
Vaibhav Kiran Kurhe, Pratik Karia, Shubhani Gupta, Abhishek Rose, and Sorav Bansal
(IIT Delhi, India)
Modern compiler optimization pipelines are large and complex, and it is rather cumbersome and error-prone for compiler developers to preserve debugging information across optimization passes. An optimization can add, remove, or reorder code and variables, which makes it difficult to associate the generated code statements and values with the source code statements and values. Moreover recent proposals for automatic generation of optimizations (e.g., through search algorithms) have not previously considered the preservation of debugging information. We demonstrate the application of a blackbox equivalence checker to automatically populate the debugging information in the debug headers of the optimized executables compiled from C programs. A blackbox equivalence checker can automatically compute equivalence proofs between the original source code and the optimized executable code without the knowledge of the exact transformations performed by the compiler/optimizer. We present an algorithm that uses these formal equivalence proofs to improve the executable’s debugging headers. We evaluate this approach on benchmarks derived from the Testsuite of Vectoriz- ing Compilers (TSVC) compiled through three different compil- ers: GCC, Clang/LLVM, and ICC. We demonstrate significant improvements in the debuggability of the optimized executable code in these experiments. The benefits of these improvements can be transparently realized through any standard debugger, such as GDB, to debug the updated executable.

Publisher's Version Article Search Artifacts Available Artifacts Functional
Gadgets Splicing: Dynamic Binary Transformation for Precise Rewriting
Linan Tian, Yangyang Shi, Liwei Chen, Yanqi Yang, and Gang Shi
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Many systems and applications depend on binary rewriting technology to analyze and retrofit software binaries when source code is not available, including binary instrumentation, profiling and security policy reinforcement. However, the investigations have found that many static binary rewriters still fail to accurately transform all legal instructions in binaries. Dynamic binary rewriters allow for accuracy, but coverage and rewriting efficiency are limited. Therefore, the existing binary rewriting technology cannot meet all the needs of binary rewriting. In this paper, we present GRIN, a novel binary rewriting tool that allows for high-precision instruction identification. In GRIN, we propose a gadget-based entry address analysis technique. It identifies the entry addresses of the basic blocks in the binary by gathering and executing the basic blocks related to the computation of the entry addresses of the basic blocks. By traversing from these entries as the new entries of the program, we guarantee the correctness of the identified instructions. We have implemented the prototype of GRIN and evaluated on the SPEC2006 and the whole set of GNU Coreutils. We demonstrate that the precision of GRIN is improved to 99.92% compared to current state-of-the-art techniques.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Reproduced

IR, Encryption, and Compression

Lambda the Ultimate SSA: Optimizing Functional Programs in SSA
Siddharth Bhat and Tobias Grosser ORCID logo
(IIIT Hyderabad, India; University of Edinburgh, UK)
Single Static Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA, and prefer to use intermediate representations based on minimal lambda calculi due to SSA’s inability to express higher order constructs. We exploit a new SSA construct — regions — in order to express functional optimizations via classical SSA-based reasoning. Region optimization currently relies on ad-hoc analyses and transforms on imperative programs. These ad-hoc transformations are sufficient for imperative languages as regions are used in a limited fashion. In contrast, we use regions pervasively to model sub-expressions in our functional IR. This motivates us to systematize region optimizations. We extend classical SSA reasoning to regions for functional-style analyses and transformations. We implement a new SSA+region based backend for LEAN4, a theorem prover that implements a purely functional, dependently typed programming language. Our backend is feature complete, and handles all constructs of LEAN4’s functional intermediate representation λrc within the SSA framework. We evaluate our proposed region optimizations by optimizing λrc within an SSA+regions based framework implemented in MLIR and demonstrating performance parity with the current LEAN4 backend. We believe our work will pave the way for a unified optimization framework capable of representing, analyzing, and optimizing both functional and imperative languages.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Reproduced
NOELLE Offers Empowering LLVM Extensions
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, Brian Homerding, Tommy McMichen, David I. August, and Simone Campanoni ORCID logo
(Northwestern University, USA; Princeton University, USA)
Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Publisher's Version Article Search Info Artifacts Available Artifacts Functional Results Reproduced
HECATE: Performance-Aware Scale Optimization for Homomorphic Encryption Compiler
Yongwoo Lee, Seonyeong Heo, Seonyoung Cheon, Shinnung Jeong, Changsu Kim, Eunkyung Kim, Dongyoon Lee, and Hanjun Kim
(Yonsei University, South Korea; ETH Zurich, Switzerland; Seoul National University, South Korea; Samsung SDS, South Korea; Stony Brook University, USA)
Despite the benefit of Fully Homomorphic Encryption (FHE) that supports encrypted computation, writing an efficient FHE application is challenging due to magnitude scale management. Each FHE operation increases scales of ciphertext and leaving the scales high harms performance of the following FHE operations. Thus, rescaling ciphertext is inevitable to optimize an FHE application, but since FHE requires programmers to match the rescaling levels of operands of each FHE operation, programmers should rescale ciphertext reflecting the entire FHE application. Although recently proposed FHE compilers reduce the programming burden by automatically manipulating ciphertext scales, they fail to fully optimize the FHE application because they greedily rescale the ciphertext without considering their performance impacts throughout the entire application. This work proposes HECATE, a new FHE compiler framework that optimizes scales of ciphertext reflecting their rescaling levels and performance impact. With a new type system that embeds the scale and rescaling level, and a new rescaling operation called downscale, HECATE makes various scale management plans, analyzes their expected performance, and finds the optimal rescaling points throughout the entire FHE application. This work implements HECATE on top of the MLIR framework with a Python frontend and shows that HECATE achieves 27% speedup over the state-of-the-art approach for various FHE applications.

Publisher's Version Article Search
Unified Compilation for Lossless Compression and Sparse Computing
Daniel Donenfeld, Stephen Chou, and Saman AmarasingheORCID logo
(Massachusetts Institute of Technology, USA)
This paper shows how to extend sparse tensor algebra compilers to support lossless compression techniques, including variants of run-length encoding and Lempel-Ziv compression. We develop new abstractions to represent losslessly compressed data as a generalized form of sparse tensors, with repetitions of values (which are compressed out in storage) represented by non-scalar, dynamic fill values. We then show how a compiler can use these abstractions to emit efficient code that computes on losslessly compressed data. By unifying lossless compression with sparse tensor algebra, our technique is able to generate code that computes with both losslessly compressed data and sparse data, as well as generate code that computes directly on compressed data without needing to first decompress it.
Our evaluation shows our technique generates efficient image and video processing kernels that compute on losslessly compressed data. We find that the generated kernels are up to 16.3× faster than equivalent dense kernels generated by TACO, a tensor algebra compiler, and up to 16.1× faster than OpenCV, a widely used image processing library.

Publisher's Version Article Search Artifacts Reusable Results Reproduced

Program Analysis and Optimization

Loop Rolling for Code Size Reduction
Rodrigo C. O. Rocha, Pavlos Petoumenos ORCID logo, Björn Franke, Pramod Bhatotia, and Michael O'Boyle ORCID logo
(University of Edinburgh, UK; University of Manchester, UK; TU Munich, Germany)
Code size is critical for resource-constrained devices, where memory and storage are limited. Compilers, therefore, should offer optimizations aimed at code reduction. One such optimization is loop rerolling, which transforms a partially unrolled loop into a fully rolled one. However, existing techniques are limited and rarely applicable to real-world programs. They are incapable of handling partial rerolling or straight-line code.
In this paper, we propose RoLAG, a novel code-size optimization that creates loops out of straight-line code. It identifies isomorphic code by aligning SSA graphs in a bottom-up fashion. The aligned code is later rolled into a loop. In addition, we propose several optimizations that increase the amount of aligned code by identifying specific patterns of code. Finally, an analysis is used to estimate the profitability of the rolled loop before deciding which version should be kept in the code.
Our evaluation of RoLAG on full programs from MiBench and SPEC 2017 show absolute reductions of up to 88~KB while LLVM's technique is hardly distinguishable from the baseline with no rerolling. Finally, our results show that RoLAG is highly applicable to real-world code extracted from popular GitHub repositories. RoLAG is triggered several orders of magnitude more often than LLVM's rerolling, resulting in meaningful reductions on real-world functions.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Solving PBQP-Based Register Allocation using Deep Reinforcement Learning
Minsu Kim, Jeong-Keun Park, and Soo-Mook Moon
(Seoul National University, South Korea)
Irregularly structured registers are hard to abstract and allocate. Partitioned Boolean quadratic programming (PBQP) is a useful abstraction to represent complex register constraints, even those in highly irregular processors of automated test equipment (ATE) of DRAM memory chips. The PBQP problem is NP-hard, requiring a heuristic solution. If no spill is allowed as in ATE, however, we have to enumerate more to find a solution rather than to approximate, since a spill means a total compilation failure. We propose solving the PBQP problem with deep reinforcement learning (Deep-RL), more specifically, a model-based approach using Monte Carlo tree search and deep neural network as used in Alphazero, a proven Deep-RL technology. Through elaborate training with random PBQP graphs, our Deep-RL solver could cut the search space sharply, making an enumeration-based solution more affordable. Furthermore, by employing backtracking with a proper coloring order, Deep-RL can find a solution with modestly-trained neural networks with even less search space. Our experiments show that Deep-RL can successfully find a solution for 10 product-level ATE programs while searching much fewer (e.g., 1/3,500) states than the previous PBQP enumeration solver. Also, when applied to C programs in llvm-test-suite for regular CPUs, it achieves a competitive performance to the existing PBQP register allocator in LLVM.

Publisher's Version Article Search
F3M: Fast Focused Function Merging
Sean Stirling, Rodrigo C. O. Rocha, Kim Hazelwood, Hugh Leather, Michael O'Boyle ORCID logo, and Pavlos Petoumenos ORCID logo
(Codeplay, UK; University of Edinburgh, UK; Facebook, USA; University of Manchester, UK)
From IoT devices to datacenters, code size is important, motivating ongoing research in binary reduction. A key technique is the merging of similar functions to reduce code redundancy. Success, however, depends on accurately identifying functions that can be profitably merged. Attempting to merge all function pairs is prohibitively expensive. Current approaches, therefore, employ summaries to estimate similarity. However these summaries often give little information about how well two programs will merge. To make things worse, they rely on exhaustive search across all summaries; impractical for real-world programs. In this work, we propose a new technique for matching similar functions. We use a hash-based approach that better captures code similarity and, at the same time, significantly reduces the search space by focusing on the most promising candidates. Experimental results show that our similarity metric has a better correlation with merging profitability. This improves the average code size reduction by 6 percentage points, while it reduces the overhead of function merging by 1.8x on average and by as much as 597x for large applications. Faster merging and reduced code size to compile at later stages mean that our approach introduces little to no compile time overhead, while in many cases it makes compilation faster by up to 30%.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Sound, Precise, and Fast Abstract Interpretation with Tristate Numbers
Harishankar Vishwanathan, Matan Shachnai, Srinivas Narayana, and Santosh NagarakatteORCID logo
(Rutgers University, USA)
Extended Berkeley Packet Filter (BPF) is a language and run-time system that allows non-superusers to extend the Linux and Windows operating systems by downloading user code into the kernel. To ensure that user code is safe to run in kernel context, BPF relies on a static analyzer that proves properties about the code, such as bounded memory access and the absence of operations that crash. The BPF static analyzer checks safety using abstract interpretation with several abstract domains. Among these, the domain of tnums (tristate numbers) is a key domain used to reason about the bitwise uncertainty in program values. This paper formally specifies the tnum abstract domain and its arithmetic operators. We provide the first proofs of soundness and optimality of the abstract arithmetic operators for tnum addition and subtraction used in the BPF analyzer. Further, we describe a novel sound algorithm for multiplication of tnums that is more precise and efficient (runs 33% faster on average) than the Linux kernel's algorithm. Our tnum multiplication is now merged in the Linux kernel.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced

Natural-Language Techniques

M3V: Multi-modal Multi-view Context Embedding for Repair Operator Prediction
Xuezheng Xu, Xudong Wang, and Jingling Xue
(UNSW, Australia)
We address the problem of finding context embeddings for faulty locations to allow a learning-based APR tool to learn and predict the repair operators used at the faulty locations. We introduce M3V, a new multi-modal multi-view context embedding approach, which represents the context of a faulty location in two modalities: (1) texts that capture its signature in a natural language using the tree-LSTM model, and (2) graphs that capture its structure with two views, data and control dependences, using the GNN model. We then fuse these two modalities to learn a probabilistic classifier from correct code that, once given a faulty location, will produce a probabilistic distribution over a set of repair operators. We have evaluated M3V against the state-of-the-art context embedding approaches in repairing two common types of bugs in Java, null pointer exceptions (NPE) and index out of bounds (OOB). Trained and tested with 75673 code samples from 20 real-world projects, a learning-based APR tool can predict repair operators more effectively with our context embeddings in repairing NPE bugs, by achieving higher accuracies (11% – 41%) and higher F1 scores (16% – 143%). For OOB bugs, these improvements are 9% – 30% and 15% – 79%, respectively.

Publisher's Version Article Search
Enabling Near Real-Time NLU-Driven Natural Language Programming through Dynamic Grammar Graph-Based Translation
Zifan Nan, Xipeng ShenORCID logo, and Hui Guan
(North Carolina State University, USA; University of Massachusetts at Amherst, USA)
Recently, natural language (NL)-based program synthesis has drawn increasing interest. Conventional methods that depend on some predefined domain-specific rules suffer from the lack of robustness and generality. Recent efforts on adopting deep learning to map queries to code requires a large number of labeled examples, making them not applicable on domains with scarce labeled examples. Although a third alternative, natural language understanding (NLU)-driven approach addresses the problems, the long response time hinders its adoption in practice, especially in an interactive scenario. This paper presents a solution to enable near real-time NLU-driven NL programming. The solution features a new algorithm, dynamic grammar graph-based translation (DGGT), for identifying the best grammar tree for a query via dynamic programming. It also introduces two new optimizations, grammar-based pruning and orphan node relocation, to further reduce the search space and address the special complexities from queries. Evaluations on two domains, text editing and program source code analysis, show that the DGGT algorithm and the optimizations shortens the response time of a state-of-the-art NLU-driven synthesizer by up to 1887× (25-133× on average) while improving the accuracy by 2-12%.

Publisher's Version Article Search

AI Systems

SPNC: An Open-Source MLIR-Based Compiler for Fast Sum-Product Network Inference on CPUs and GPUs
Lukas SommerORCID logo, Cristian Axenie, and Andreas Koch ORCID logo
(TU Darmstadt, Germany; Huawei Research, Germany)
Sum-Product Networks (SPNs) are an alternative to the widely used Neural Networks (NNs) for machine learning. SPNs can not only reason about (un)certainty by qualifying their output with a probability, they also allow fast (tractable) inference by having run-times that are just linear w.r.t. the network size.
We present SPNC, the first tool flow for generating fast native code for SPN inference on both CPUs and GPUs, including the use of vectorized/SIMD execution. To this end, we add two SPN-specific dialects to the MLIR framework and discuss their lowering towards the execution targets.
We evaluate our approach on two applications, for which we consider performance, scaling to very large SPNs, and compile vs execution-time trade-offs. In this manner, we achieve multiple orders of magnitude in speed-ups over existing SPN support libraries.

Publisher's Version Article Search Info
Distill: Domain-Specific Compilation for Cognitive Models
Jan Vesely, Raghavendra Pradyumna Pothukuchi, Ketaki Joshi, Samyak Gupta, Jonathan D. Cohen, and Abhishek Bhattacharjee
(Yale University, USA; Princeton University, USA)
Computational models of cognition enable a better understanding of the human brain and behavior, psychiatric and neurological illnesses, clinical interventions to treat illnesses, and also offer a path towards human-like artificial intelligence. Cognitive models are also, however, laborious to develop, requiring composition of many types of computational tasks, and suffer from poor performance as they are generally designed using high-level languages like Python. In this work, we present Distill, a domain-specific compilation tool to accelerate cognitive models while continuing to offer cognitive scientists the ability to develop their models in flexible high-level languages. Distill uses domain-specific knowledge to compile Python-based cognitive models into LLVM IR, carefully stripping away features like dynamic typing and memory management that add performance overheads without being necessary for the underlying computation of the models. The net effect is an average of 27× performance improvement in model execution over state-of-the-art techniques using Pyston and PyPy. Distill also repurposes classical compiler data flow analyses to reveal properties about data flow in cognitive models that are useful to cognitive scientists. Distill is publicly available, integrated in the PsyNeuLink cognitive modeling environment, and is already being used by researchers in the brain sciences.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Reproduced
Optimizing GPU Deep Learning Operators with Polyhedral Scheduling Constraint Injection
Cédric BastoulORCID logo, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Adilla Susungi, Javier de Juan, Etienne Filhol, Baptiste Jarry, Gianpietro Consolaro, and Renwei Zhang
(Huawei Technologies, France; Huawei Technologies, China)
Automatic parallel code generation from high-level abstractions such as those manipulated by artificial intelligence and deep learning (AI/DL) frameworks heavily rely on compiler techniques for automatic parallelization and optimization. Many recent advances rely on the polyhedral framework for this task because of its ability to model and to apply a wide range of loop transformations. However, modeling the complexity of the target architecture and of efficient cost models to decide about the best transformation is in general out of reach for a framework based on linear/affine constraints. In this work, we propose to decouple the polyhedral framework into linear and non-linear components. We introduce the constraint tree abstraction which may be generated by a non-linear optimizer and injected to the polyhedral optimization process to build better solutions. We present how to benefit from such a mechanism to generate efficient codes for GPU in the context of AI/DL operators. Our constraint injection allows to drive the polyhedral scheduler towards efficient solutions for load/store vectorization relying both on memory coalescing and vector types. We implemented our scheduler supporting constraint injection and our constraint construction system within a production AI/DL framework. Experiments on well known neural networks show the efficiency of this approach with respect to state-of-the-art polyhedral scheduling for GPU.

Publisher's Version Article Search
Comprehensive Accelerator-Dataflow Co-design Optimization for Convolutional Neural Networks
Miheer Vaidya, Aravind Sukumaran-Rajam ORCID logo, Atanas Rountev, and P. Sadayappan ORCID logo
(University of Utah, USA; Washington State University, USA; Ohio State University, USA)
The design space of possible schedules for mapping a Convolutional Neural Network layer onto a spatial accelerator array, referred as the dataflow, is enormous. The co-design of key architectural parameters (such as number of processing elements, sizes of register files and scratchpad memories) along with the dataflow to optimize the implementation of one or more CNN stages makes the design space explosively larger. Several recent efforts have addressed the design-space exploration problem for CNN accelerators via heuristics or limited search strategies. In this paper we develop the first optimization approach that uses analytical modeling and the solution of constrained nonlinear optimization problems for comprehensive algorithm-architecture co-design optimization. Using the Timeloop accelerator modeling framework, we demonstrate that the new optimization methodology can enable significant improvements over prior accelerator designs for both energy minimization and performance maximization.

Publisher's Version Article Search

proc time: 12.19