2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), February 27 – March 3, 2021, Virtual, Republic of Korea

CGO 2021 – Proceedings


Frontmatter

Title Page


Message from the General Chair
Welcome to the 19th edition (and the first “virtual” edition) of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO)! We originally planned to hold the conference in Seoul, South Korea, but due to the COVID-19 pandemic we decided to go virtual this year. We plan to return to Seoul for CGO 2022.

Message from the Program Chairs
We are pleased to welcome you to CGO 2021, the first virtual CGO conference. The Program Committee meeting was also held virtually because of the worldwide spread of the coronavirus. On behalf of the Program Committee, we are pleased to present an exciting and stimulating program for the 2021 International Symposium on Code Generation and Optimization.

CGO 2021 Organization
Committee Listings

Report from the Artifact Evaluation Committee
CGO 2021 included two separate categories of papers—main conference papers and tool & practical experience papers—as it did in the previous year. However, this year the acceptance criteria for tool papers required validation of the tool by the Artifact Evaluation Committee.
All authors of accepted CGO 2021 standard papers and conditionally accepted Tools & Practical Experience papers were given an opportunity to participate in the artifact evaluation process by submitting a research artifact.

CGO 2021 Sponsors

Keynote

Data Layout and Data Representation Optimizations to Reduce Data Movement (Keynote)
Mary Hall
(University of Utah, USA)
Code generation and optimization for the diversity of current and future architectures must focus on reducing data movement to achieve high performance. How data is laid out in memory, and representations that compress data (e.g., reduced floating point precision) have a profound impact on data movement. Moreover, the cost of data movement in a program is architecture-specific, and consequently, optimizing data layout and data representation must be performed by a compiler once the target architecture is known. With this context in mind, this talk will provide examples of data layout and data representation optimizations, and call for integrating these data properties into code generation and optimization systems.

Compiler Infrastructure
(Chair: Michael Kruse, Argonne National Laboratory, USA)

MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko
(Google, USA; Indian Institute of Science, India; Google, France)
This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, improves compilation for heterogeneous hardware, significantly reduces the cost of building domain specific compilers, and aids in connecting existing compilers together.
MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by this novel design point in design, semantics, optimization specification, system, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures, and semantics.

Artifacts Available Artifacts Reusable
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal
(Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK)
Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions.
We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements.

Artifacts Available Artifacts Functional Results Reproduced
Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs
Thomas Koehler and Michel Steuwer
(University of Glasgow, UK; University of Edinburgh, UK)
Halide and many similar projects have demonstrated the great potential of domain specific optimizing compilers. They enable programs to be expressed at a convenient high-level, while generating high-performance code for parallel architectures. As domains of interest expand towards deep learning, probabilistic programming and beyond, it becomes increasingly clear that it is unsustainable to redesign domain specific compilers for each new domain. In addition, the growing diversity of hardware architectures to optimize for poses great challenges for designing these compilers.
In this paper, we show how to extend a unifying domain-extensible compiler with domain-specific as well as hardware-specific optimizations. The compiler operates on generic patterns that have proven flexible enough to express a wide range of computations. Optimizations are not hard-coded into the compiler but are expressed as user-defined rewrite rules that are composed into strategies controlling the optimization process. Crucially, both computational patterns and optimization strategies are extensible without modifying the core compiler implementation.
We demonstrate that this domain-extensible compiler design is capable of expressing image processing pipelines and well-known image processing optimizations. Our results on four mobile ARM multi-core CPUs, often used for image processing tasks, show that the code generated for the Harris operator outperforms the image processing library OpenCV by up to 16× and achieves performance close to, or even up to 1.4× better than, the state-of-the-art image processing compiler Halide.

Artifacts Available Artifacts Reusable Results Reproduced
BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++
Ajay Brahmakshatriya and Saman Amarasinghe
(Massachusetts Institute of Technology, USA)
The simplest implementation of a domain-specific language is to embed it in an existing language using operator overloading. This way, the DSL can inherit parsing, syntax and type checking, error handling, and the toolchain of debuggers and IDEs from the host language. A natural host language choice for most high-performance DSLs is the de-facto high-performance language, C++. However, DSL designers quickly run into the problem of not being able to extract control flows due to a lack of introspection in C++ and have to resort to special functions with lambdas to represent loops and conditionals. This approach introduces unnecessary syntax and does not capture the side effects of updates inside the lambdas in a safe way. We present BuildIt, a type-based multi-stage execution framework that solves this problem by extracting all control flow operators like if-then-else conditionals and for and while loops using a pure library approach. BuildIt achieves this by repeated execution of the program to explore all control flow paths and construct the AST piece by piece. We show that BuildIt can do this without exponential blow-up in terms of output size and execution time. We apply BuildIt’s staging capabilities to the state-of-the-art tensor compiler TACO to generate low-level IR for custom-level formats. Thus, BuildIt offers a way to get both generalization and programmability for the user while generating specialized and efficient code. We also demonstrate that BuildIt can generate rich control-flow from relatively simple code by using it to stage an interpreter for an esoteric language. BuildIt changes the way we think about multi-staging as a problem by reducing the PL complexity of a supposedly harder problem requiring features like introspection or specialized compiler support to a set of common features found in most languages.
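The type-based staging approach described above can be illustrated with a minimal C++ sketch: a staged type whose operator overloads record an expression tree instead of performing arithmetic, so that running the first-stage program emits code for the second stage. This is not BuildIt's actual API; the names dyn_int and gen_axpy are invented for illustration.

```cpp
#include <string>
#include <utility>

// dyn_int mimics the idea behind a staged type such as BuildIt's dyn<T>:
// using it in ordinary-looking C++ arithmetic emits IR (here, just a
// string) rather than executing the computation.
struct dyn_int {
    std::string code;
    explicit dyn_int(std::string c) : code(std::move(c)) {}
};

dyn_int operator+(const dyn_int& a, const dyn_int& b) {
    return dyn_int("(" + a.code + " + " + b.code + ")");
}
dyn_int operator*(const dyn_int& a, const dyn_int& b) {
    return dyn_int("(" + a.code + " * " + b.code + ")");
}

// A "first-stage" function: executing it produces second-stage code.
std::string gen_axpy() {
    dyn_int a("a"), x("x"), y("y");
    dyn_int r = a * x + y;            // ordinary-looking arithmetic
    return "return " + r.code + ";";  // emitted second-stage code
}
```

Extracting control flow (if/for/while over staged values) is the hard part that BuildIt solves by re-executing the program along each control-flow path; the sketch above only shows the expression-recording half of the idea.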

Artifacts Available Artifacts Reusable Results Reproduced

Dealing with Precision
(Chair: Uma Srinivasan, Twitter)

An Interval Compiler for Sound Floating-Point Computations
Joao Rivera, Franz Franchetti, and Markus Püschel
(ETH Zurich, Switzerland; Carnegie Mellon University, USA)
Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, which can be a serious concern in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsics in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8× when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results.
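The soundness guarantee of interval arithmetic can be sketched in a few lines of C++ (illustrative only, not IGen's implementation): after each operation, both bounds are widened outward by one ulp so that the true real-valued result always stays enclosed. A production compiler would instead use hardware directed-rounding modes, which are tighter and faster.

```cpp
#include <algorithm>
#include <cmath>

// A value is represented as a closed interval [lo, hi] that is
// guaranteed to contain the exact real result.
struct Interval { double lo, hi; };

// Widening each bound by one ulp conservatively covers the rounding
// error of the preceding double-precision operation.
static Interval widen(double lo, double hi) {
    return { std::nextafter(lo, -HUGE_VAL), std::nextafter(hi, HUGE_VAL) };
}

Interval iadd(Interval a, Interval b) {
    return widen(a.lo + b.lo, a.hi + b.hi);
}

Interval imul(Interval a, Interval b) {
    // With mixed signs, any pairing of bounds can be extremal.
    double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    return widen(*std::min_element(p, p + 4), *std::max_element(p, p + 4));
}
```

The width of the result interval directly measures the accumulated rounding error, which is what motivates IGen's double-double mode when plain double intervals grow too wide.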

Artifacts Available Artifacts Reusable Results Reproduced
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot
(Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France)
Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP.

Artifacts Functional Results Reproduced
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki
(University of California at Los Angeles, USA; Amazon, USA)
Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when it comes to many tensorized instructions.
In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.
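The "reduce multiple low precision elements into one high precision element" idiom that instructions such as Intel VNNI and ARM DOT implement in hardware can be written down as a scalar reference in C++ (illustrative only; the function name is invented). This is the loop-body shape that a compiler like UNIT must detect and rewrite into a tensorized instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference of the tensorized-instruction idiom: products of
// low-precision (int8) elements are widened and reduced into a single
// high-precision (int32) accumulator, avoiding intermediate overflow.
int32_t dot_i8_accumulate(const int8_t* a, const int8_t* b,
                          size_t n, int32_t acc) {
    for (size_t i = 0; i < n; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);  // widen before multiply-add
    return acc;
}
```

A VNNI-style instruction performs several of these widen-multiply-accumulate steps per lane in one operation, which is why naive vectorization of the int8 loop cannot match it.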

Artifacts Available Artifacts Functional
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China)
Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand on computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss in a controlled manner. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing its tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while reducing the precision loss incurred significantly at acceptable overheads.
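The residual-based refinement idea described in the abstract can be sketched for a scalar dot product in C++ (an illustrative sketch, not the QUANTENSOR implementation; dot_quant_refined is an invented name). The first pass uses a low-precision quantization of the input; a second pass over the residual a - dequant(quant(a)) recovers most of the quantization error, trading extra work for precision.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Quantized dot product with optional residual refinement.
// scale is the (assumed given) quantization step for vector a.
double dot_quant_refined(const std::vector<double>& a,
                         const std::vector<double>& x,
                         double scale, bool refine) {
    double approx = 0.0, correction = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        int8_t q = (int8_t)std::lround(a[i] / scale);  // quantize to int8 grid
        approx += scale * q * x[i];                    // low-precision pass
        correction += (a[i] - scale * q) * x[i];       // residual pass
    }
    return refine ? approx + correction : approx;
}
```

On Tensor Cores both passes run as fast low-precision GEMMs; the refinement pass is what lets the tradeoff between performance and precision be tuned.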

Binary Profiling, Tracing, Sampling
(Chair: Wei Wang, University of Texas at San Antonio, USA)

Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring
Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, and Timothy M. Jones
(University of Cambridge, UK; University of Sussex, UK)
Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error.
To address this, we introduce Cinnamon, a domain-specific language designed to write programs for binary profiling and monitoring. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex lower-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon to C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use-cases.

GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey
(Rice University, USA)
Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.58×, with a geometric mean of 1.22×.

Artifacts Available Artifacts Reusable Results Reproduced
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson
(Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore)
We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, there is no fast-forwarding to the region of interest needed or the uncertainty of reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks.

Artifacts Functional Results Reproduced
Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation
David Pankratz, Tyler Nowicki, Ahmed Eltantawy, and José Nelson Amaral
(University of Alberta, Canada; Huawei Technologies, Canada)
While there are mature performance monitoring, profiling and instrumentation tools to help understanding the dynamic behaviour of general-purpose GPU applications, the abstract programming models of graphics applications have limited the development of such tools for graphics. This paper introduces Vulkan Vision (V-Vision), a framework for collecting detailed GPU execution data from Vulkan applications to guide hardware-informed improvements. A core contribution of V-Vision is providing out-of-the-box data collection for capturing complete dynamic warp and thread execution traces. V-Vision also provides analyses for the following purposes: identifying and visualizing application hotspots to guide optimization, characterizing application behaviour and estimating the effect of architectural modifications. This paper demonstrates the potential for these analyses in applications that utilize the recent ray tracing extension in Vulkan and describes new insights about the applications and the underlying hardware.

Artifacts Functional Results Reproduced

Parallelism - Optimizing, Modeling, Testing
(Chair: Michael O'Boyle, University of Edinburgh, UK)

Loop Parallelization using Dynamic Commutativity Analysis
Christos Vasiladiotis, Roberto Castañeda Lozano, Murray Cole, and Björn Franke
(University of Edinburgh, UK)
Automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems outside the numerical domain. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. Using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints limiting existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our novel parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks which contain almost 1400 loops, DCA is able to identify as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. We then apply DCA to complex pointer-based loops, where it can successfully detect parallelism, while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6× (and up to 55×) across the NAS benchmarks on a 72-core host, and up to 36.9× for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops.
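The commutativity criterion underlying DCA can be illustrated with a toy C++ check (not the paper's implementation; iterations_commute is an invented name): two loop iterations are candidates for parallel execution if running them in either order leaves the observed state identical, regardless of whether a static dependence test would conservatively reject them.

```cpp
#include <functional>
#include <vector>

// Runs two loop iterations i and j in both orders on copies of the
// state and reports whether the final states agree. In the paper this
// idea is applied dynamically to profiled executions of whole loops.
bool iterations_commute(std::vector<int> state,
                        const std::function<void(std::vector<int>&, int)>& body,
                        int i, int j) {
    std::vector<int> ab = state, ba = state;
    body(ab, i); body(ab, j);  // order i then j
    body(ba, j); body(ba, i);  // order j then i
    return ab == ba;
}
```

A reduction such as s[0] += i commutes even though every iteration reads and writes the same location, which is exactly the kind of loop a strict dependence test would refuse to parallelize.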

Fine-Grained Pipeline Parallelization for Network Function Programs
Seungbin Song, Heelim Choi, and Hanjun Kim
(Yonsei University, South Korea)
Network programming languages enable programmers to implement new network functions on various hardware and software stacks in the domain of Software Defined Networking (SDN). Although the languages extend the flexibility of network devices, existing compilers do not fully optimize the network programs due to their coarse-grained parallelization methods. The compilers consider each packet processing table that consists of match and action functions as a unit of tasks and parallelize the programs without decomposing match and action functions. This work proposes a new fine-grained pipeline parallelization compiler for network programming languages, named PSDN. First, the PSDN compiler decouples match and action functions from packet processing tables and analyzes dependencies among the matches and actions. While respecting the dependencies, the compiler efficiently schedules each match and action function into a pipeline with clock cycle estimation and fuses functions to reduce synchronization overheads. This work implements the PSDN compiler that translates a P4 network program to a Xilinx PX program, which is synthesizable to NetFPGA-SUME hardware. The proposed compiler reduces packet processing latency by 12.1% and utilization by 3.5% compared to previous work.

YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein
(University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany)
The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time consuming, expensive and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs.

Artifacts Available Artifacts Functional Results Reproduced
GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs
Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia)
Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, the rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 out of these 82 bugs and 36 additional bugs reported in a recent study to preserve their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements.

Artifacts Available Artifacts Functional Results Reproduced

Memory Optimization and Safeness
(Chair: Eunjung Park, Los Alamos National Laboratory, USA)

Memory-Safe Elimination of Side Channels
Luigi Soares and Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil)
A program is said to be isochronous if its running time does not depend on classified information. The programming languages literature contains much work that transforms programs to ensure isochronicity. The current state-of-the-art approach is a code transformation technique due to Wu et al., published in 2018. That technique has an important virtue: it ensures that the transformed program runs exactly the same set of operations, regardless of inputs. However, in this paper we demonstrate that it also has a shortcoming: it might add out-of-bounds memory accesses into programs that were originally memory sound. From this observation, we show how to deliver the same runtime guarantees that Wu et al. provide, in a memory-safe way. In addition to being safer, our LLVM-based implementation is more efficient than its original inspiration, achieving shorter repairing times, and producing code that is smaller and faster.
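The basic building block of such transformations, selecting between two values without a secret-dependent branch, can be shown in a short C++ sketch (a minimal illustration of isochronous code, not the paper's repair technique; ct_select is an invented name). Both inputs are always computed, and a constant-time mask picks the result, so the instruction sequence does not depend on the secret condition.

```cpp
#include <cstdint>

// Branchless select: returns a if cond is nonzero, else b.
// mask is all-ones when cond holds, all-zeros otherwise, so the same
// instructions execute for both outcomes.
uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)(-(int32_t)(cond != 0));
    return (a & mask) | (b & ~mask);
}
```

The memory-safety issue the paper addresses arises when such rewrites force loads from both branches of a guarded access: the branch that was never meant to execute may dereference an out-of-bounds address.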

Artifacts Available Artifacts Reusable Results Reproduced
Variable-Sized Blocks for Locality-Aware SpMV
Naveen Namashivayam, Sanyam Mehta, and Pen-Chung Yew
(HPE, USA; University of Minnesota at Twin Cities, USA)
Blocking is an important optimization option available to mitigate the data movement overhead and improve the temporal locality in SpMV, a sparse BLAS kernel with an irregular memory reference pattern. In this work, we propose an analytical model to determine the effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized as opposed to constant-sized in most existing SpMV algorithms.
We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on the Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows a speedup of up to 2.62× (with an average of 1.73×) and 2.02× (with an average of 1.18×) over the highly vendor-tuned SpMV implementation in Intel’s Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively.
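For reference, the unblocked baseline that such schemes improve on is plain CSR SpMV, sketched below in C++ (the paper's CVB format itself is different and column-based). The irregular column indices in the inner loop are what cause poor temporal locality on the x vector, and what blocking tries to organize.

```cpp
#include <cstddef>
#include <vector>

// y = A * x with A in Compressed Sparse Row (CSR) form:
// rowptr[r]..rowptr[r+1] delimit row r's entries in col/val.
std::vector<double> spmv_csr(const std::vector<size_t>& rowptr,
                             const std::vector<size_t>& col,
                             const std::vector<double>& val,
                             const std::vector<double>& x) {
    std::vector<double> y(rowptr.size() - 1, 0.0);
    for (size_t r = 0; r + 1 < rowptr.size(); ++r)
        for (size_t k = rowptr[r]; k < rowptr[r + 1]; ++k)
            y[r] += val[k] * x[col[k]];  // x[col[k]] is the irregular access
    return y;
}
```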

Object Versioning for Flow-Sensitive Pointer Analysis
Mohamad Barbar, Yulei Sui, and Shiping Chen
(University of Technology Sydney, Australia; CSIRO's Data61, Australia)
Flow-sensitive points-to analysis provides better precision than its flow-insensitive counterpart. Traditionally performed on the control-flow graph, it incurs heavy analysis overhead. For performance, staged flow-sensitive analysis (SFS) is conducted on a pre-computed def-use (value-flow) graph, where points-to sets of variables are propagated sparsely across def-use chains rather than across the control-flow graph. SFS makes the propagation of different objects’ points-to sets sparse (multiple-object sparsity); however, it suffers from redundant propagation of the same object’s points-to set between instructions (a lack of single-object sparsity). The points-to set of an object is often duplicated, resulting in redundant propagation and storage, especially in real-world heap-intensive programs. We observe that a simple graph prelabelling extension can identify much of this redundancy in a pre-analysis. With this pre-analysis, multiple nodes (instructions) in the value-flow graph can share an individual memory object’s points-to set rather than each node maintaining its own points-to set for that single object. We present object versioning for flow-sensitive points-to analysis, a finer single-object sparsity technique that maintains the same precision while avoiding much of the redundancy in propagating and storing points-to sets. Our experiments on 15 open-source programs show that, compared with SFS, our approach runs up to 26.22× faster (5.31× on average) and reduces memory usage by up to 5.46× (2.11× on average).

Artifacts Available Artifacts Functional Results Reproduced
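The sharing idea can be sketched in miniature (illustrative only, not the paper's algorithm; all names are invented): if a pre-analysis proves that several value-flow-graph nodes observe the same "version" of an object, they can all alias one shared points-to set instead of each storing a copy.

```python
class VersionedPointsTo:
    def __init__(self):
        self.version_sets = {}   # (obj, version) -> one shared points-to set
        self.node_version = {}   # (node, obj)    -> version label

    def assign(self, node, obj, version):
        # Record which version of `obj` this node observes; nodes with the
        # same version transparently share a single underlying set.
        self.node_version[(node, obj)] = version
        self.version_sets.setdefault((obj, version), set())

    def pts(self, node, obj):
        return self.version_sets[(obj, self.node_version[(node, obj)])]

vp = VersionedPointsTo()
vp.assign("n1", "o", version=1)
vp.assign("n2", "o", version=1)   # same version: no duplicated storage
vp.pts("n1", "o").add("target")   # an update made through n1...
print(vp.pts("n2", "o"))          # ...is visible through n2 for free
```

The saving is twofold, mirroring the abstract: one stored set instead of many, and no propagation needed between nodes that share a version.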
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China)
The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To address this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memorizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disk efficiently.
We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps that require from 10 to 128 GB of RAM under FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM, with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM under FlowDroid, DiskDroid can analyze each app in 3 hours under the same 10 GB memory budget. This makes the tool deployable in normal desktop environments. The tool is publicly available at https://github.com/HaofLi/DiskDroid.

Artifacts Available Artifacts Functional
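The "swap to disk above a memory threshold" idea can be sketched with a toy spilling table (the paper's scheduling schemes are far more sophisticated; this class and its eviction policy are illustrative): entries beyond an in-memory budget are pickled to a temporary directory and read back on demand.

```python
import os
import pickle
import tempfile

class SpillingTable:
    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.mem = {}                      # hot, in-memory entries
        self.disk_dir = tempfile.mkdtemp() # cold entries spilled here

    def put(self, key, value):
        self.mem[key] = value
        if len(self.mem) > self.max_in_memory:
            # Evict an arbitrary entry to disk; a real solver would schedule
            # evictions carefully to minimize re-reads.
            k, v = self.mem.popitem()
            with open(os.path.join(self.disk_dir, str(hash(k))), "wb") as f:
                pickle.dump((k, v), f)

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        with open(os.path.join(self.disk_dir, str(hash(key))), "rb") as f:
            return pickle.load(f)[1]

t = SpillingTable(max_in_memory=2)
for i in range(5):
    t.put(f"k{i}", i)
print(t.get("k0"), t.get("k4"))  # served from memory and from disk
```

Recomputation (the paper's other lever) trades CPU for memory in the same spirit: anything cheap to rebuild need not be stored at all.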

Compiling Graph Algorithms, Compiling for GPUs
(Chair: Maria Garzaran, Intel Corporation and University of Illinois at Urbana-Champaign, USA)

Compiling Graph Applications for GPUs with GraphIt
Ajay BrahmakshatriyaORCID logo, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman AmarasingheORCID logo
(Massachusetts Institute of Technology, USA; Adobe, USA)
The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or hardware platform works well across all settings. To achieve high performance, the programmer must carefully select both the set of optimizations and the hardware platform to use. The GraphIt programming language makes it easy for developers to write an algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations to achieve high performance. We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 of the 90 experiments.

Artifacts Available Artifacts Reusable Results Reproduced
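The EdgeBlocking idea from the abstract above can be sketched as follows (a hypothetical illustration; the function name and parameters are invented, and the real implementation operates on GPU kernels): edges are bucketed by destination-vertex range so that each processing pass confines its random accesses to one cache-sized window of the vertex data.

```python
def edge_block(edges, num_vertices, block_size):
    """Group (src, dst) edges into buckets by dst range of width block_size,
    then emit them bucket by bucket."""
    num_blocks = (num_vertices + block_size - 1) // block_size
    buckets = [[] for _ in range(num_blocks)]
    for src, dst in edges:
        buckets[dst // block_size].append((src, dst))
    # Processing one bucket at a time keeps dst-side accesses inside a
    # single block of the vertex array (the cache-resident window).
    return [e for bucket in buckets for e in bucket]

edges = [(0, 7), (1, 2), (2, 5), (3, 0)]
print(edge_block(edges, num_vertices=8, block_size=4))
# → [(1, 2), (3, 0), (0, 7), (2, 5)]
```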
Efficient Execution of Graph Algorithms on CPU with SIMD Extensions
Ruohuang Zheng and Sreepathi Pai ORCID logo
(University of Rochester, USA)
Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare against a GPU. Evaluation results show that on an 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework, GraphIt, by 1.53x, averaged across the 5 common benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU's virtual memory performs better when graphs are much bigger than available physical memory.

Artifacts Available Artifacts Reusable Results Reproduced
r3d3: Optimized Query Compilation on GPUs
Alexander Krolik, Clark Verbrugge ORCID logo, and Laurie Hendren ORCID logo
(McGill University, Canada)
Query compilation is an effective approach to improving the performance of repeated database queries. GPU-based approaches have significant promise, but face difficulties in managing compilation time and data transfer costs, and in addressing a reasonably comprehensive range of SQL operations. In this work we describe a hybrid AoT/JIT approach to GPU-based query compilation. We use multiple optimizations to reduce execution, compile, and data transfer times, improving performance over both other GPU-based approaches and CPU-based query compilers. Our design addresses a wide range of SQL queries, sufficient to demonstrate the practicality of using GPUs for query optimization.

Info Artifacts Available Artifacts Functional Results Reproduced
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes ORCID logo, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee
(Intel Corporation, USA)
The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications, as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface for exploiting the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU.

Artifacts Available Artifacts Functional Results Reproduced

Compiling for Spatial, Quantum, and Embedded Devices
(Chair: Wei-Fen Lin, National Cheng Kung University, Taiwan)

Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits
Ji Liu ORCID logo, Luciano Bello ORCID logo, and Huiyang Zhou ORCID logo
(North Carolina State University, USA; IBM Research, USA)
As in classical computing, compilers play an important role in quantum computing. Quantum processors typically support a limited set of primitive operations, or quantum gates, and have certain hardware-related limitations. A quantum compiler is responsible for adapting a quantum program to these constrained environments and decomposing quantum gates into sequences of primitive ones. During the compilation process, it is also critical for the compiler to optimize the quantum circuits in order to reduce the noise in the computation results. Since noise is introduced by operations and decoherence, reducing the gate count is key to improving performance.
In this paper, we propose a novel quantum compiler optimization, named relaxed peephole optimization (RPO), for quantum computers. RPO leverages the single-qubit state information that can be determined statically by the compiler. We define a qubit to be in a basis state when, at a given point in time, its state lies in the X-, Y-, or Z-basis (|+>/|−>, |L>/|R>, and |0>/|1>). When basis-state qubits are used as inputs to quantum gates, there exist opportunities for strength reduction, which replaces quantum operations with equivalent but less expensive ones. Compared to the existing peephole optimization for quantum programs, the difference is that our proposed optimization does not require an identical unitary matrix, hence the name ‘relaxed’ peephole optimization. We also extend our approach to optimize the quantum gates when some input qubits are in known pure states. Both optimizations, namely Quantum Basis-state Optimization (QBO) and Quantum Pure-state Optimization (QPO), are implemented in IBM’s Qiskit transpiler. Our experimental results show that our proposed optimization pass is fast and effective. The circuits optimized with our compiler optimizations obtain up to 18.0% (11.7% on average) fewer CNOT gates and up to 8.2% (7.1% on average) lower transpilation time than the most aggressive optimization level in the Qiskit compiler. When running on real quantum computers, the success rate of the 3-qubit quantum phase estimation algorithm improves by 2.30X due to the reduced gate counts.

Artifacts Available Artifacts Reusable Results Reproduced
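A minimal illustration of the basis-state idea (pure Python, not Qiskit code, and not the paper's implementation): if the compiler statically knows a CNOT's control qubit is in state |0>, the gate acts as the identity on the reachable states and can be deleted outright, a strength reduction on the circuit.

```python
def cnot(state):
    """Apply CNOT (control = qubit 0, target = qubit 1) to a 2-qubit
    amplitude vector [a00, a01, a10, a11]: it swaps a10 and a11."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

# Control statically known to be |0>: only the a00/a01 amplitudes can be
# non-zero, and CNOT leaves them untouched.
psi = [0.6, 0.8, 0.0, 0.0]
print(cnot(psi) == psi)   # True: the gate is a no-op and may be elided

# With the control in |1>, the gate genuinely flips the target.
print(cnot([0.0, 0.0, 1.0, 0.0]))  # → [0.0, 0.0, 0.0, 1.0]
```

The "relaxed" aspect is visible here: the optimized and original circuits are not equal as unitaries, only on the states the compiler has proven reachable.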
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht ORCID logo, Andreas Kuster ORCID logo, Tiziano De MatteisORCID logo, Tal Ben-NunORCID logo, Dominic Hofer ORCID logo, and Torsten Hoefler ORCID logo
(ETH Zurich, Switzerland; MeteoSwiss, Switzerland)
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into the architecture characteristics required for their efficient execution in practice.

Artifacts Available Artifacts Reusable Results Reproduced
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho ORCID logo, Yongwoo Lee, William Song, Youngsok Kim ORCID logo, and Hanjun Kim
(POSTECH, South Korea; Yonsei University, South Korea)
In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers do not fit embedded devices well, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. The experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by the state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurement results show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance.

JIT and Binary Translation; Optimizing for Code-Size
(Chair: Probir Roy, University of Michigan at Dearborn, USA)

HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale
Guilherme Ottoni and Bin Liu
(Facebook, USA)
Just-In-Time (JIT) compilation is often employed in Virtual Machines (VMs) to translate their virtual-machine languages into real-machine code. This approach not only brings portability, but also enables aggressive compiler optimizations based on runtime behavior observed via profiling. The downside of JIT compilation, compared to ahead-of-time native compilation, is that the profiling and compilation overheads are incurred during execution. To mitigate these overheads, previous work has proposed sharing either profile data or final JIT-compiled code across VM executions. Unfortunately, these techniques have drawbacks, including steady-state performance degradation and difficulty of use. To address these issues, this paper presents the Jump-Start mechanism implemented inside the HipHop Virtual Machine (HHVM). Jump-Start is a practical approach to sharing VM profile data at a large scale, and is used to power one of the largest websites in the world. In this paper, we argue for HHVM’s Jump-Start approach, describe it in detail, and present steady-state optimizations built on top of it. Running the Facebook website, we demonstrate that Jump-Start effectively solves the warmup problem in HHVM, reducing the server capacity loss during warmup by 54.9% while also improving steady-state performance by 5.4%.

Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi ZhaoORCID logo, Zhang Jiang, Ying Chen ORCID logo, Xiaoli Gong ORCID logo, Wenwen Wang, and Pen-Chung Yew
(Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA)
Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-Link/Store-Conditional (LL/SC), found mostly in reduced instruction set computer (RISC) architectures, and Compare-and-Swap (CAS), found mostly in complex instruction set computer (CISC) architectures. However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns.
As a result, some may cause the well-known ABA problem, which can lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS) often crashes within 2 seconds.
Although attempts have been made to provide correct emulation for such atomic instructions, they either incur heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes provide correct emulation and, for the best solution, achieve minimum, maximum, and geometric-mean speedups of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme.

Info Artifacts Available Artifacts Reusable Results Reproduced
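The ABA hazard mentioned above can be reproduced in a self-contained sketch (illustrative only; the simulated `cas` helper and the thread interleaving are hand-scripted, not real concurrency): a CAS on the stack's top pointer succeeds because the *value* matches, even though the node was popped and re-pushed in between, whereas LL/SC would detect the intervening writes.

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def cas(container, field, expected, new):
    # Simulated compare-and-swap: compares object identity, just as a
    # hardware CAS compares raw pointer bits.
    if getattr(container, field) is expected:
        setattr(container, field, new)
        return True
    return False

class Stack:
    def __init__(self):
        self.top = None

a, b = Node("A"), Node("B")
s = Stack(); a.next = b; s.top = a          # stack: A -> B

# Thread 1 begins a pop: reads top = A and A.next = B, then is preempted.
seen_top, seen_next = s.top, s.top.next

# Thread 2 runs: pops A, pops B, then pushes A back.
# Top is again A, but A.next is now None and B is off the stack.
s.top = b; s.top = None; a.next = None; s.top = a

# Thread 1 resumes: its CAS succeeds (top is still A, value-wise) and
# installs the stale next pointer, resurrecting the retired node B.
assert cas(s, "top", seen_top, seen_next)
print(s.top.value)   # "B": the stack now points at a node that was popped
```

This is precisely why a naive LL/SC-to-CAS translation is unsound: LL/SC fails if *any* write touched the location, while CAS only checks the final value.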
An Experience with Code-Size Optimization for Production iOS Mobile Applications
Milind Chabbi ORCID logo, Jin Lin, and Raj Barik
(Uber Technologies, USA)
Modern mobile application binaries are bulky for many reasons: software and its dependencies, the fast-paced addition of new features, high-level language constructs, and statically linked platform libraries. A reduced application size is critical not only for the end-user experience but also for meeting vendors' download-size limitations. Moreover, download-size restrictions may impact revenue for critical businesses.
In this paper, we highlight some of the key reasons for code-size bloat in iOS mobile applications, specifically apps written using a mix of Swift and Objective-C. Our observations reveal that machine-code sequences systematically repeat throughout an app’s binary. We highlight source-code patterns and high-level language constructs that lead to an increase in code size. We propose whole-program, fine-grained machine-code outlining as an effective optimization to constrain code-size growth. We evaluate the effectiveness of our new optimization pipeline on the UberRider iOS app used by millions of customers daily. Our optimizations reduce the code size by 23%. The impact of our optimizations on the code size grows in magnitude over time as the code evolves. For a set of performance spans defined by the app developers, the optimizations do not statistically regress production performance. We applied the same optimizations to Uber’s UberDriver and UberEats apps and gained 17% and 19% size savings, respectively.

Info Artifacts Available Artifacts Reusable
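Machine-code outlining, the optimization described above, can be shown in a toy form (illustrative only; real outliners such as the one in this paper operate on machine instructions with careful cost modeling, and the instruction strings and `OUT0` label here are invented): a repeated instruction sequence is replaced at each occurrence by a call to a single outlined copy.

```python
def outline(code, seq):
    """Replace every occurrence of `seq` in `code` (a list of instruction
    strings) with a call to one outlined copy, and emit that copy once."""
    out, i, n = [], 0, len(seq)
    while i < len(code):
        if code[i:i + n] == seq:
            out.append("CALL OUT0")   # one call replaces n instructions
            i += n
        else:
            out.append(code[i])
            i += 1
    body = ["OUT0:"] + seq + ["RET"]  # the single shared copy
    return out, body

prologue = ["push rbp", "mov rbp, rsp", "sub rsp, 32"]
code = prologue + ["mov eax, 1"] + prologue + ["mov eax, 2"]
new_code, outlined = outline(code, prologue)
print(new_code)  # → ['CALL OUT0', 'mov eax, 1', 'CALL OUT0', 'mov eax, 2']
```

Each outlined occurrence trades n instructions for one call, so the savings compound with how often the sequence repeats, which is why systematic repetition across a large binary yields the double-digit reductions reported above.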
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva ORCID logo, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira ORCID logo
(State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil)
A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model that determines its actions when it faces unknown code. One of the challenges of predictive compilation is finding good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on the Hindley-Milner algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any compliant C compiler. Therefore, they can be used to train compilers for code-size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite that are over 10% smaller than those of clang -Oz. It even compresses code that is impervious to the state-of-the-art Function Sequence Alignment technique published in 2019, since it does not require large binaries to work well.

Info
