CGO 2019

2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), February 16–20, 2019, Washington, DC, USA

CGO 2019 – Proceedings


Frontmatter

Title Page


Message from the General Chair
It is my great pleasure to welcome you to the 17th edition of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), held in Washington, DC, a wonderful city on the Potomac River, home to iconic museums and performing-arts centers as well as the Washington Monument and the US federal government buildings.

Message from the Program Chairs
We would like to welcome you to CGO 2019 in the beautiful city of Washington, DC. On behalf of the Program Committee, we are delighted to present the program for the 2019 International Symposium on Code Generation and Optimization. We hope that you find it informative and stimulating.

CGO 2019 Organization
This summarizes the organizational leadership of CGO 2019.

Report from the Artifact Evaluation Committee
Authors of accepted papers were given the opportunity to participate in the artifact evaluation process by submitting a research artifact. ACM defines an artifact as “a digital object that was either created by the authors to be used as part of the study or generated by the experiment itself”. The artifact evaluation process checks if the submitted artifact supports the claims made in the paper. Ultimately, artifact evaluation is intended to encourage researchers to take extra care in conducting experiments in a reproducible way and to package experimental workflows and all related materials for making them accessible for others.

Sponsors
This lists the sponsors for CGO 2019.

ACM Student Research Competition
As in previous years, the 2019 International Symposium on Code Generation and Optimization (CGO) hosts the ACM Student Research Competition (SRC). We have selected twelve abstracts to compete in the ACM SRC. All submissions were reviewed by five members of the selection committee. Each reviewer assigned up to 5 points to a submission and provided a short comment. The submissions were then ranked according to their total score. We would like to thank the General Co-Chairs and Program Co-Chairs for helping with our inquiries and for their support during the preparation of the SRC. Special thanks go to the members of the selection committee for reviewing all the abstracts on time, and to all the authors who submitted a poster abstract, as well as all students who participated in the SRC.


Keynote

Rethinking Compilation in a Heterogeneous World (Keynote)
Michael O'Boyle
(University of Edinburgh, UK)
Moore’s Law has been the main driver behind the extraordinary success of computer systems. However, with the technology roadmap showing a decline in transistor scaling and hence the demise of Moore’s Law, computer systems will become increasingly specialised and diverse. The consistent ISA contract is beginning to break down. As it stands, software will simply not fit. Current compiler technology, whose role is to map software to the underlying hardware, is incapable of doing this. This looming crisis requires a fundamental rethink of how we design, program and use heterogeneous systems. This talk examines new ways of tackling heterogeneity so that, rather than deny and fear the end of Moore’s Law, we embrace and exploit it.


Research Papers

Binary Optimization

BOLT: A Practical Binary Optimizer for Data Centers and Beyond
Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni
(Facebook, USA)
Performance optimization for large-scale applications has recently become more important as computation continues to move towards data centers. Data-center applications are generally very large and complex, which makes code layout an important optimization to improve their performance. This has motivated recent investigation of practical techniques to improve code layout at both compile time and link time. Although post-link optimizers had some success in the past, no recent work has explored their benefits in the context of modern data-center applications. In this paper, we present BOLT, an open-source post-link optimizer built on top of the LLVM framework. Utilizing sample-based profiling, BOLT boosts the performance of real-world applications even for highly optimized binaries built with both feedback-driven optimizations (FDO) and link-time optimizations (LTO). We demonstrate that post-link performance improvements are complementary to conventional compiler optimizations, even when the latter are done at a whole-program level and in the presence of profile information. We evaluated BOLT on both Facebook data-center workloads and open-source compilers. For data-center applications, BOLT achieves up to 7.0% performance speedups on top of profile-guided function reordering and LTO. For the GCC and Clang compilers, our evaluation shows that BOLT speeds up their binaries by up to 20.4% on top of FDO and LTO, and up to 52.1% if the binaries are built without FDO and LTO.
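
To illustrate the profile-guided code-layout idea that BOLT builds on, the following toy Python sketch performs a Pettis-Hansen-style function reordering from sampled call counts. It is illustrative only, not BOLT's actual algorithm (which rewrites binaries via LLVM), and the function names and counts are invented.

    # Toy Pettis-Hansen-style function reordering from sampled call counts.
    # Hot caller/callee pairs end up adjacent in the final layout.
    def reorder_functions(call_counts):
        """call_counts: {(caller, callee): samples} -> ordered list of functions."""
        clusters = {}                       # function -> its current cluster (a list)
        for (a, b) in call_counts:
            clusters.setdefault(a, [a])
            clusters.setdefault(b, [b])

        # Merge clusters along call-graph edges, hottest edge first.
        for (a, b), _weight in sorted(call_counts.items(), key=lambda kv: -kv[1]):
            ca, cb = clusters[a], clusters[b]
            if ca is cb:
                continue                    # already laid out together
            merged = ca + cb
            for f in merged:
                clusters[f] = merged

        seen, layout = set(), []            # deduplicate clusters, then flatten
        for c in clusters.values():
            if id(c) not in seen:
                seen.add(id(c))
                layout.extend(c)
        return layout

    profile = {("main", "parse"): 900, ("parse", "lex"): 850, ("main", "report"): 10}
    print(reorder_functions(profile))       # the hot chain main/parse/lex stays together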

Artifacts Available, Artifacts Reusable, Results Replicated
Janus: Statically-Driven and Profile-Guided Automatic Dynamic Binary Parallelisation
Ruoyu Zhou and Timothy M. Jones
(University of Cambridge, UK)
We present Janus, a framework that addresses the challenge of automatic binary parallelisation. Janus uses same-ISA dynamic binary modification to optimise application binaries, controlled by static analysis with judicious use of software speculation and runtime checks that ensure the safety of the optimisations. A static binary analyser first examines a binary executable to determine the loops that are amenable to parallelisation and the transformations required. These are encoded as a series of rewrite rules, the steps needed to convert a serial loop into parallel form. The Janus dynamic binary modifier reads both the original executable and the rewrite rules and carries out the transformations on a per-basic-block basis just in time before execution. Lifting static analysis out of the runtime enables global and profile-guided views of the application; ambiguities from static binary analysis can in turn be addressed through a combination of dynamic runtime checks and speculation to guard against data-dependence violations. This allows us to parallelise even those loops containing dynamically discovered code. We demonstrate Janus by parallelising a range of optimised SPEC CPU 2006 benchmarks, achieving average speedups of 2.1× and 6.0× in the best case.

Artifacts Available, Artifacts Functional, Results Replicated

Bugs and Security

Smokestack: Thwarting DOP Attacks with Runtime Stack Layout Randomization
Misiker Tadesse Aga and Todd Austin
(University of Michigan, USA)
Memory corruption vulnerabilities in type-unsafe languages are often exploited to perform a control-flow hijacking attack, in which an attacker uses vulnerabilities to corrupt control data in the program to eventually gain control over its execution. However, widespread adoption of control-flow attack defenses such as Control-Flow Integrity (CFI) has led attackers to exploit memory errors to corrupt non-control data in ways that cannot be detected by these defenses. Non-control data attacks can be used to corrupt security-critical data or leak sensitive information. Moreover, recent attacks such as data-oriented programming (DOP) have generalized non-control data attacks to achieve Turing-complete computation capabilities within the programmer-specified control-flow graph, leaving previously proposed control-flow protections unable to stop these attacks.
In this paper, we present a stack-layout randomization scheme that can effectively thwart DOP attacks. Our approach, called Smokestack, provides each function invocation with a randomly permuted ordering of its local stack organization. In addition, we utilize true-random value sources combined with disclosure-resistant pseudo-random number generation to ensure that an adversary cannot anticipate the permutation of automatic variables used by any given function invocation. Our evaluation on SPEC benchmarks and various real-world applications shows that Smokestack can stop DOP attacks with minimal overhead.
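
The per-invocation randomization idea can be pictured with a short sketch. Smokestack permutes real stack frames at compile/run time; the Python below only permutes the offsets assigned to a function's local variables using an OS-backed randomness source, and the variable names are invented.

    # Conceptual model of per-invocation stack-slot permutation. Each call to
    # assign_slots draws a fresh permutation from an OS-backed RNG, so an
    # attacker cannot rely on fixed offsets between local variables.
    import secrets

    def assign_slots(local_vars, slot_size=8):
        """Return {variable: frame offset} with a fresh random ordering."""
        order = list(local_vars)
        secrets.SystemRandom().shuffle(order)      # disclosure-resistant RNG stand-in
        return {var: i * slot_size for i, var in enumerate(order)}

    # Two invocations of the same "function" see different local layouts.
    print(assign_slots(["buf", "len", "authenticated"]))
    print(assign_slots(["buf", "len", "authenticated"]))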

Automatic Equivalence Checking for Assembly Implementations of Cryptography Libraries
Jay P. Lim and Santosh Nagarakatte
(Rutgers University, USA)
This paper presents an approach and a tool, CASM-Verify, to automatically check the equivalence of highly optimized assembly implementations of cryptographic algorithms. The key idea of this paper is to decompose the equivalence checking problem into several small sub-problems using a combination of concrete and symbolic evaluation. Given a reference and an optimized implementation, CASM-Verify concretely executes the two implementations on randomly generated inputs and identifies likely equivalent variables. Subsequently, it uses symbolic verification using an SMT solver to determine whether the identified variables are indeed equivalent. Further, it decomposes the original query into small sub-queries using a collection of optimizations for memory accesses. These techniques enable CASM-Verify to verify the equivalence of assembly implementations (e.g., x86 and SSE) of various algorithms such as SHA-256, ChaCha20, and AES-128 for a message block.
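
A miniature version of the two-phase approach, concrete random testing followed by an SMT query, can be written with the z3-solver Python package (an assumption of this sketch). CASM-Verify itself targets x86 and SSE assembly; the two toy "implementations" below are stand-ins.

    # Phase 1: concrete evaluation on random inputs suggests the two versions
    # agree. Phase 2: a bit-vector SMT query proves it for all 32-bit inputs.
    import random
    from z3 import BitVec, Solver, unsat

    ref = lambda x: (x << 1) & 0xFFFFFFFF          # reference: multiply by 2
    opt = lambda x: (x + x) & 0xFFFFFFFF           # "optimized" variant

    assert all(ref(v) == opt(v) for v in (random.getrandbits(32) for _ in range(1000)))

    x = BitVec("x", 32)
    s = Solver()
    s.add(x << 1 != x + x)                         # look for any distinguishing input
    print("proved equivalent" if s.check() == unsat else s.model())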

Artifacts Available, Artifacts Functional, Results Replicated
CSOD: Context-Sensitive Overflow Detection
Hongyu Liu, Sam Silvestro, Xiaoyin Wang, Lide Duan, and Tongping Liu
(University of Texas at San Antonio, USA)
Buffer overflow is possibly the most well-known memory issue. It can cause erratic program behavior, such as incorrect outputs and crashes, and can be exploited to issue security attacks. Detecting buffer overflows has drawn significant research attention for almost three decades. However, the prevalence of security attacks due to buffer overflows indicates that existing tools are still not widely utilized in production environments, possibly due to their high performance overhead or limited effectiveness.
This paper proposes CSOD, a buffer overflow detection tool designed for production environments. CSOD introduces a novel context-sensitive overflow detection technique that can dynamically adjust its detection strategy based on the behavior of different allocation calling contexts, enabling it to effectively detect overflows in millions of objects using only four hardware watchpoints. It can correctly report the root causes of buffer over-writes and over-reads, without any additional manual effort. Furthermore, CSOD introduces only 6.7% performance overhead on average, which makes it appealing as an always-on approach for production software.

Reasoning about the Node.js Event Loop using Async Graphs
Haiyang Sun, Daniele Bonetta, Filippo Schiavio, and Walter Binder
(USI Lugano, Switzerland; Oracle Labs, USA)
With the popularity of Node.js, asynchronous, event-driven programming has become widespread in server-side applications. While conceptually simple, event-based programming can be tedious and error-prone. The complex semantics of the Node.js event loop, coupled with the different flavors of asynchronous execution in JavaScript, easily leads to bugs. This paper introduces a new model called Async Graph to reason about the runtime behavior of applications and their interactions with the Node.js event loop. Based on the model, we have developed AsyncG, a tool to automatically build and analyze the Async Graph of a running application, and to identify bugs related to all sources of asynchronous execution in Node.js. AsyncG is compatible with the latest ECMAScript language features and can be (de)activated at runtime. In our evaluation, we show how AsyncG can be used to identify bugs in real-world Node.js applications.

Artifacts Available, Artifacts Functional, Results Replicated

GPUs and Tensors

Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs
Simon Garcia De Gonzalo, Sitao Huang, Juan Gómez-Luna, Simon Hammond, Onur Mutlu, and Wen-mei Hwu
(University of Illinois at Urbana-Champaign, USA; ETH Zurich, Switzerland; Sandia National Laboratories, USA)
Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Since application performance heavily depends on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or the microarchitectural support for existing instructions is improved (e.g., atomic instructions). Library developers, besides re-tuning the code for new hardware features, deal with the performance portability issue by hand-writing multiple algorithm versions that leverage different instruction sets and microarchitectures. High-level programming frameworks and Domain Specific Languages (DSLs) do not typically support low-level instructions (e.g., warp shuffle and atomic instructions), so it is painful or even impossible for these programming systems to take advantage of the latest architectural improvements. In this work, we design a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs. Our transformations enable warp shuffle instructions and atomic instructions (on global and shared memories) to be easily generated. We show a practical implementation of these transformations by building on Tangram, a high-level kernel synthesis framework. Using our new language and compiler extensions, we implement parallel reduction, a fundamental building block used in a wide range of algorithms. Parallel reduction is representative of the performance portability challenge, as its performance heavily depends on the latest hardware improvements. We compare our synthesized parallel reduction to another high-level programming framework and a hand-written high-performance library across three generations of GPU architectures, and show up to 7.8x speedup (2x on average) over hand-written code.
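
The shuffle-based reduction pattern that such generated code relies on can be sketched outside CUDA. The Python below merely simulates 32 lanes combining values at halving offsets, which is what a warp-shuffle tree reduction does on the GPU; it is a conceptual model, not generated code.

    # Pure-Python simulation of a warp-level tree reduction: each step, lane i
    # adds the value held by lane i + offset (a shuffle-down on real hardware),
    # so the sum reaches lane 0 after log2(32) = 5 steps.
    WARP_SIZE = 32

    def warp_reduce_sum(lanes):
        assert len(lanes) == WARP_SIZE
        vals = list(lanes)
        offset = WARP_SIZE // 2
        while offset > 0:
            vals = [vals[i] + (vals[i + offset] if i + offset < WARP_SIZE else 0)
                    for i in range(WARP_SIZE)]
            offset //= 2
        return vals[0]                      # lane 0 holds the warp-wide sum

    print(warp_reduce_sum(list(range(32))))  # 496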

A Code Generator for High-Performance Tensor Contractions on GPUs
Jinsung Kim, Aravind Sukumaran-Rajam, Vineeth Thumma, Sriram Krishnamoorthy, Ajay Panyala, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan
(Ohio State University, USA; Pacific Northwest National Laboratory, USA; Colorado State University, USA)
Tensor contractions are higher dimensional generalizations of matrix-matrix multiplication. They form the compute-intensive core of many applications in computational science and data science. In this paper, we describe a high-performance GPU code generator for arbitrary tensor contractions. It exploits domain-specific properties about data reuse in tensor contractions to devise an effective code generation schema, coupled with an effective model-driven search, to determine parameters for mapping of computation to threads and staging of data through the GPU memory hierarchy. Experimental evaluation using a set of tensor contraction benchmarks demonstrates performance improvement and/or significantly reduced code generation time over other state-of-the-art tensor contraction libraries and code generators.
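
For readers unfamiliar with tensor contractions, a small numpy example shows the kind of computation such a generator targets. The index names and sizes are arbitrary, and numpy serves only as a readable reference semantics, not as the paper's code generator.

    # A representative contraction: C[i,j] = sum over k,l of A[i,k,l] * B[l,j,k].
    # The paper's generator maps loop nests like the one below onto GPU threads
    # and stages operands through the memory hierarchy.
    import numpy as np

    I, J, K, L = 4, 5, 6, 7
    A = np.random.rand(I, K, L)
    B = np.random.rand(L, J, K)

    C_ref = np.einsum("ikl,ljk->ij", A, B)   # concise reference semantics

    C = np.zeros((I, J))                     # the same contraction as explicit loops
    for i in range(I):
        for j in range(J):
            for k in range(K):
                for l in range(L):
                    C[i, j] += A[i, k, l] * B[l, j, k]

    assert np.allclose(C, C_ref)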

Artifacts Available, Artifacts Functional, Results Replicated

Potpourri

Transforming Query Sequences for High-Throughput B+ Tree Processing on Many-Core Processors
Ruiqin Tian, Junqiao Qiu, Zhijia Zhao, Xu Liu, and Bin Ren
(College of William and Mary, USA; University of California at Riverside, USA)
The throughput of B+ tree query processing is critical to many databases, file systems, and cloud applications. Based on bulk synchronous parallel (BSP) processing, latch-free B+ tree query processing has shown promise by handling queries in small batches and avoiding the use of locks. As the number of cores on CPUs increases, it becomes possible to process larger batches in parallel without adding any extra delays. In this work, we argue that as the batch size increases, more optimization opportunities are exposed beyond parallelism, especially when the query distributions are highly skewed. These include opportunities to avoid evaluating a large fraction of redundant or unnecessary queries.
To rigorously exploit these new opportunities, this work introduces a query sequence analysis and transformation framework, QTrans. QTrans can systematically reason about redundancies at a deep level and automatically remove them from the query sequence. QTrans has interesting resemblances to the classic data-flow analysis and transformation that have been widely used in compilers. To confirm its benefits, this work integrates QTrans into an existing BSP-based B+ tree query processing system, PALM tree, to automatically eliminate redundant and unnecessary queries. Evaluation shows that, by transforming the query sequence, QTrans can substantially improve the throughput of query processing on both real-world and synthesized datasets, by up to 16×.
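
The flavor of redundancy that a query-sequence transformation can remove is easy to sketch. The toy Python pre-pass below handles only two simple cases, duplicate lookups and lookups answered by an earlier insert in the same batch; QTrans generalizes this with a systematic data-flow-style analysis, and the batch contents here are invented.

    # Batch pre-pass: answer lookups that repeat an earlier lookup, or that
    # follow a put of the same key within the batch, without touching the tree.
    def transform_batch(batch):
        """batch: list of ("get", key) or ("put", key, value) operations."""
        first_get = {}        # key -> position of the lookup that must run
        written   = {}        # key -> value put earlier in this batch
        to_execute, resolved = [], {}

        for pos, op in enumerate(batch):
            if op[0] == "put":
                _, key, value = op
                written[key] = value
                to_execute.append(op)
            else:                                        # ("get", key)
                _, key = op
                if key in written:
                    resolved[pos] = written[key]         # answered by earlier put
                elif key in first_get:
                    resolved[pos] = ("same as", first_get[key])  # duplicate lookup
                else:
                    first_get[key] = pos
                    to_execute.append(op)
        return to_execute, resolved

    batch = [("put", 7, "a"), ("get", 7), ("get", 3), ("get", 3)]
    print(transform_batch(batch))   # only ("put", 7, "a") and the first ("get", 3) remain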

Artifacts Available, Artifacts Functional, Results Replicated
Quantifying and Reducing Execution Variance in STM via Model Driven Commit Optimization
Girish Mururu, Ada Gavrilovska, and Santosh Pande
(Georgia Institute of Technology, USA)
Simplified parallel programming coupled with an ability to express speculative computation is realized with Software Transactional Memory (STM). Although STMs are gaining popularity because of significant improvements in parallel performance, they exhibit enormous variation in transaction execution with non-repeatable performance behavior, which is unacceptable in many application domains, especially those in which frame rates and responsiveness should be predictable. In other domains, reproducible transactional behavior helps with system provisioning. Thus, reducing execution variance in STM is an important performance goal that has been mostly overlooked. In this work, we minimize the variance in execution time of threads in STM by reducing the non-determinism exhibited due to speculation. We define the state of the STM and use it to first quantify non-determinism and then generate an automaton that models the execution behavior of threads in the STM. We finally use the state automaton to guide the STM away from non-predictable transactional behavior, reducing non-determinism in rollbacks, which in turn reduces variance. On the STAMP benchmark suite, we observed an average reduction in the variance of thread execution time of up to 74% on 16 cores and 53% on 8 cores, by reducing non-determinism by up to 24% on 16 cores and 44% on 8 cores, respectively, while experiencing average slowdowns of 4.8% on 8 cores and 19.2% on 16 cores. We also reduced the variance in frame rate by a maximum of 65% on a version of the real-world game Quake3 without degrading timing.

Artifacts Available, Artifacts Functional, Results Replicated
White-Box Program Tuning
Wen-Chuan Lee, Yingqi Liu, Peng Liu, Shiqing Ma, Hongjun Choi, Xiangyu Zhang, and Rajiv Gupta
(Purdue University, USA; University of California at Riverside, USA)
Many programs or algorithms are largely parameterized, especially those based on heuristics. The quality of the results depends on the parameter setting, and different inputs often have different optimal settings. Program tuning is hence of great importance. Existing tuning techniques treat the program as a black box and hence cannot leverage internal program states to achieve better tuning. We propose a white-box tuning technique that is implemented as a library. The user can compose complex program tuning tasks by adding a small number of library calls to the original program and providing a few callback functions. Our experiments on 13 widely used real-world programs show that our technique substantially improves data processing results and outperforms OpenTuner, the state-of-the-art black-box tuning technique.
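
The white-box, library-based style of tuning can be pictured with a hypothetical API; the names below (tune, run, quality) are invented for illustration and are not the paper's actual interface. The key point is that the quality callback inspects internal program state rather than a black-box output.

    # Hypothetical white-box tuning loop: the application exposes a tunable
    # parameter space and a quality callback over its internal state, and a
    # simple random search drives repeated execution.
    import random

    def tune(run, quality, space, budget=50):
        best_cfg, best_score = None, float("-inf")
        for _ in range(budget):
            cfg = {name: random.choice(choices) for name, choices in space.items()}
            state = run(cfg)              # white-box: run returns internal state
            score = quality(state)        # user callback scores that state
            if score > best_score:
                best_cfg, best_score = cfg, score
        return best_cfg, best_score

    def run(cfg):                         # example "program": a thresholding heuristic
        items = [random.random() for _ in range(1000)]
        return {"kept": [x for x in items if x > cfg["threshold"]]}

    best = tune(run,
                quality=lambda state: -abs(len(state["kept"]) - 500),  # aim for ~500 kept
                space={"threshold": [0.1, 0.3, 0.5, 0.7, 0.9]})
    print(best)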

Generation of In-Bounds Inputs for Arrays in Memory-Unsafe Languages
Marcus Rodrigues, Breno Guimarães, and Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil)
This paper presents a technique to generate in-bounds inputs for arrays used in memory-unsafe programming languages, such as C and C++. We show that most memory indexation found in actual C programs follows patterns that are easy to analyze statically. Based on this observation, we show how symbolic range analysis can be used to establish contracts between the arguments of a function and the arrays used within that function. To demonstrate the effectiveness of our ideas, we use them to implement Griffin-TG, a tool to stress-test C programs whose source code might be partially available. We show how Griffin-TG improves Aprof, a well-known algorithmic profiling tool, and we show how it lets us enrich Polybench with a large set of new inputs.


Code Generation

Function Merging by Sequence Alignment
Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, and Hugh Leather
(University of Edinburgh, UK; Lancaster University, UK)
Resource-constrained devices for embedded systems are becoming increasingly important. In such systems, memory is highly restrictive, making code size in most cases even more important than performance. Compared to more traditional platforms, memory is a larger part of the cost and code occupies much of it. Despite that, compilers make little effort to reduce code size. One key technique attempts to merge the bodies of similar functions. However, production compilers only apply this optimization to identical functions, while research compilers improve on that by merging the few functions with identical control-flow graphs and signatures. Overall, existing solutions are insufficient and we end up having to either increase cost by adding more memory or remove functionality from programs.
We introduce a novel technique that can merge arbitrary functions through sequence alignment, a bioinformatics algorithm for identifying regions of similarity between sequences. We combine this technique with an intelligent exploration mechanism to direct the search towards the most promising function pairs. Our approach is more than 2.4× better than the state-of-the-art, reducing code size by up to 25%, with an overall average of 6%, while introducing an average compilation-time overhead of only 15%. When aided by profiling information, this optimization can be deployed without any significant impact on the performance of the generated code.
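
The core mechanism, aligning two instruction sequences so that identical regions can be emitted once, is classic Needleman-Wunsch alignment. The sketch below aligns toy opcode lists rather than compiler IR and omits the merging and profitability steps described in the paper.

    # Needleman-Wunsch alignment of two "instruction" sequences. Matched pairs
    # could share one body in a merged function; mismatches and gaps would be
    # guarded by a function-identifier parameter.
    def align(f, g, match=2, mismatch=-1, gap=-1):
        n, m = len(f), len(g)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if f[i - 1] == g[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

        aligned, i, j = [], n, m                  # trace back the aligned pairs
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                    match if f[i - 1] == g[j - 1] else mismatch):
                aligned.append((f[i - 1], g[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                aligned.append((f[i - 1], None))
                i -= 1
            else:
                aligned.append((None, g[j - 1]))
                j -= 1
        return list(reversed(aligned))

    f1 = ["load", "add", "mul", "store", "ret"]
    f2 = ["load", "sub", "mul", "store", "ret"]
    print(align(f1, f2))   # only the add/sub position needs a guard in the merged body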

Artifacts Available, Artifacts Functional, Results Replicated
An Optimization-Driven Incremental Inline Substitution Algorithm for Just-in-Time Compilers
Aleksandar Prokopec, Gilles Duboscq, David Leopoldseder, and Thomas Würthinger
(Oracle Labs, Switzerland; JKU Linz, Austria)
Inlining is one of the most important compiler optimizations. It reduces call overheads and widens the scope of other optimizations. But inlining is somewhat of a black art of an optimizing compiler, and has been characterized as a computationally intractable problem. Intricate heuristics, tuned during countless hours of compiler engineering, are often at the core of an inliner implementation. And despite decades of research, well-established inlining heuristics are still missing.
In this paper, we describe a novel inlining algorithm for JIT compilers that incrementally explores a program’s call graph, and alternates between inlining and optimizations. We devise three novel heuristics that guide our inliner: adaptive decision thresholds, callsite clustering, and deep inlining trials. We implement the algorithm inside Graal, a dynamic JIT compiler for the HotSpot JVM. We evaluate our algorithm on a set of industry-standard benchmarks, including Java DaCapo, Scalabench, Spark-Perf, STMBench7 and other benchmarks, and we conclude that it significantly improves performance, surpassing state-of-the-art inlining approaches with speedups ranging from 5% up to 3×.
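
A heavily simplified sketch of incremental, priority-driven inlining with an adaptive threshold follows; the real algorithm additionally performs callsite clustering and deep inlining trials inside Graal, and the call graph and benefit/cost numbers here are invented.

    # Simplified incremental inliner: repeatedly pick the most promising
    # callsite, inline it, enqueue the callsites it exposes, and tighten the
    # benefit/cost threshold as the compilation unit grows.
    import heapq

    def inline(root, call_graph, benefit, cost, budget=100):
        inlined, size = [], cost[root]
        heap = [(-benefit[c] / cost[c], c) for c in call_graph.get(root, [])]
        heapq.heapify(heap)

        while heap:
            neg_density, callee = heapq.heappop(heap)
            threshold = 1.0 + size / budget     # adaptive: stricter as the unit grows
            if -neg_density < threshold:
                break                           # nothing promising enough remains
            inlined.append(callee)
            size += cost[callee]
            for c in call_graph.get(callee, []):        # newly exposed callsites
                heapq.heappush(heap, (-benefit[c] / cost[c], c))
        return inlined

    cg      = {"main": ["helper", "log"], "helper": ["tiny"]}
    benefit = {"main": 0, "helper": 30, "log": 2, "tiny": 10}
    cost    = {"main": 20, "helper": 10, "log": 8, "tiny": 2}
    print(inline("main", cg, benefit, cost))    # ['helper', 'tiny']; 'log' is not worth it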

Artifacts Available, Artifacts Functional, Results Replicated
Tensor Algebra Compilation with Workspaces
Fredrik Kjolstad, Peter Ahrens, Shoaib Kamil, and Saman Amarasinghe
(Massachusetts Institute of Technology, USA; Adobe, USA)
This paper shows how to extend sparse tensor algebra compilers to introduce temporary tensors called workspaces to avoid inefficient sparse data structure accesses. We develop an intermediate representation (IR) for tensor operations called concrete index notation that specifies when sub-computations occur and where they are stored. We then describe the workspace transformation in this IR, how to programmatically invoke it, and how the IR is compiled to sparse code. Finally, we show how the transformation can be used to optimize sparse tensor kernels, including sparse matrix multiplication, sparse tensor addition, and the matricized tensor times Khatri-Rao product (MTTKRP).
Our results show that the workspace transformation brings the performance of these kernels on par with hand-optimized implementations. For example, we improve the performance of MTTKRP with dense output by up to 35%, and enable generating sparse matrix multiplication and MTTKRP with sparse output, neither of which were supported by prior tensor algebra compilers.
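
The workspace idea shows up in the classic row-by-row sparse matrix multiply: each output row is accumulated into a dense temporary and then compressed, instead of merging sparse partial results directly. A minimal CSR sketch, independent of the paper's compiler, follows (presence in the workspace is tracked with a simple zero test for brevity).

    # Sparse matrix multiply (CSR x CSR) with a dense per-row workspace `w`,
    # the access pattern the workspace transformation introduces automatically.
    def spgemm(A_pos, A_idx, A_val, B_pos, B_idx, B_val, n_cols):
        C_pos, C_idx, C_val = [0], [], []
        w = [0.0] * n_cols                       # dense workspace for one output row
        for i in range(len(A_pos) - 1):
            touched = []
            for pa in range(A_pos[i], A_pos[i + 1]):         # nonzeros A[i,k]
                k, a = A_idx[pa], A_val[pa]
                for pb in range(B_pos[k], B_pos[k + 1]):     # nonzeros B[k,j]
                    j = B_idx[pb]
                    if w[j] == 0.0:
                        touched.append(j)
                    w[j] += a * B_val[pb]
            for j in sorted(touched):            # compress the workspace into sparse C
                C_idx.append(j)
                C_val.append(w[j])
                w[j] = 0.0
            C_pos.append(len(C_idx))
        return C_pos, C_idx, C_val

    # A = [[1,0],[0,2]], B = [[0,3],[4,0]] in CSR; C should be [[0,3],[8,0]].
    print(spgemm([0, 1, 2], [0, 1], [1.0, 2.0],
                 [0, 1, 2], [1, 0], [3.0, 4.0], n_cols=2))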

Kernel Optimization

Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe
(Massachusetts Institute of Technology, USA; Politecnico di Milano, Italy; Google, USA; Adobe, USA)
This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.

Artifacts Available, Artifacts Reusable, Results Replicated
Super-Node SLP: Optimized Vectorization for Code Sequences Containing Operators and Their Inverse Elements
Vasileios Porpodas, Rodrigo C. O. Rocha, Evgueni Brevnov, Luís F. W. Góes, and Timothy Mattson
(Intel, USA; University of Edinburgh, UK; PUC-MG, Brazil)
SLP Auto-vectorization converts straight-line code into vector code. It scans input code for groups of instructions that can be combined into vectors and replaces them with their corresponding vector instructions.
This work introduces Super-Node SLP (SN-SLP), a new SLP-style algorithm optimized for expressions that include a commutative operator (such as addition) and its corresponding inverse element (subtraction). SN-SLP uses the algebraic properties of commutative operators and their inverse elements to enable additional transformations that extend auto-vectorization to cases difficult for state-of-the-art auto-vectorizing compilers. We implemented SN-SLP in LLVM. Our evaluation on a real system demonstrates considerable performance improvements on benchmark code with no significant change in compilation time.
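
The algebraic identity being exploited can be shown with numpy: a group of statements mixing an operator with its inverse becomes one uniform vector operation once the operands feeding the inverse are negated. This is only the arithmetic idea, not the LLVM implementation.

    # Scalar group mixing + and its inverse -:
    #   c0 = a0 + b0;  c1 = a1 - b1;  c2 = a2 + b2;  c3 = a3 - b3
    # Negating the lanes that feed the subtraction makes the group uniform,
    # so a single vector add covers all four statements.
    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([10.0, 20.0, 30.0, 40.0])

    scalar = np.array([a[0] + b[0], a[1] - b[1], a[2] + b[2], a[3] - b[3]])

    sign = np.array([1.0, -1.0, 1.0, -1.0])   # which lanes used the inverse element
    vectorized = a + sign * b                 # one uniform vector operation

    assert np.allclose(scalar, vectorized)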

Locus: A System and a Language for Program Optimization
Thiago S. F. X. Teixeira, Corinne Ancourt, David Padua, and William Gropp
(University of Illinois at Urbana-Champaign, USA; MINES ParisTech, France)
We discuss the design and the implementation of Locus, a system and a language to orchestrate the optimization of applications. The increasing complexity of machines and the large space of program variants, produced by the many transformations available, conspire to make compilers deliver unsatisfactory performance. As a result, optimization experts must intervene to manually explore the space of program variants seeking the best version for each target machine. This intervention is unproductive, and maintaining and managing sequences of transformations as new architectures are adopted and new application features are incorporated is challenging.
Locus allows collections of program transformation sequences to be specified separately from the application code. The language is able to represent, in a clear notation, complex collections of transformations that are applied to code regions selected by the programmer. The system integrates multiple optimization modules as well as search modules that facilitate the efficient traversal of the space of program variants. Locus is intended to help experts in the optimization process, especially for complex, long-lived applications that are to be executed on different environments. Four examples are presented to illustrate the power and simplicity of the language. Although not the primary focus of this paper, the examples also show that exploring the space of variants typically leads to better-performing codes than those produced by conventional compiler optimizations that are based on heuristics.

GPUs

Decoding CUDA Binary
Ari B. Hayes, Fei Hua, Jin Huang, Yanhao Chen, and Eddy Z. Zhang
(Rutgers University, USA)
NVIDIA's software does not offer translation of assembly code to binary for their GPUs, since the specifications are closed-source. This work fills that gap. We develop a systematic method of decoding the Instruction Set Architectures (ISAs) of NVIDIA's GPUs, and generating assemblers for different generations of GPUs. Our framework enables cross-architecture binary analysis and transformation. Making the ISA accessible in this manner opens up a world of opportunities for developers and researchers, enabling numerous optimizations and explorations that are unachievable at the source-code level. Our infrastructure has already benefited and been adopted in important applications including performance tuning, binary instrumentation, resource allocation, and memory protection.

Artifacts Available, Artifacts Reusable, Results Replicated
From Loop Fusion to Kernel Fusion: A Domain-Specific Approach to Locality Optimization
Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich
(University of Erlangen-Nuremberg, Germany)
Optimizing data-intensive applications such as image processing for GPU targets with complex memory hierarchies requires exploring the tradeoffs among locality, parallelism, and computation. Loop fusion, one of the classical optimization techniques, has proven effective at improving locality at the function level. Algorithms in image processing are increasing in complexity and generally consist of many kernels in a pipeline. The inter-kernel communication is intensive and exhibits another opportunity for locality improvement at the system level. The scope of this paper is an optimization technique called kernel fusion for data-locality improvement. We present a formal description of the problem by defining an objective function for locality optimization. By transforming the fusion problem into a graph partitioning problem, we propose a solution based on the minimum-cut technique to search fusible kernels recursively. In addition, we develop an analytic model to quantitatively estimate potential locality improvement by incorporating domain-specific knowledge and architecture details. The proposed technique is implemented in Hipacc, an image processing DSL and source-to-source compiler, and evaluated over six image processing applications on three Nvidia GPUs. A geometric mean speedup of up to 2.52 can be observed in our experiments.
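
The reduction of fusion to graph partitioning can be sketched with the networkx package (an assumption of this example). Edge capacities stand in for the paper's analytic estimate of inter-kernel data traffic, and the kernel names and numbers are invented.

    # Min-cut over a kernel dependence graph: edges are weighted by estimated
    # data traffic between kernels, so a minimum cut separates the kernels
    # whose fusion would save the least, keeping heavy producer/consumer
    # pairs on the same side as fusion candidates.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edge("blur_x", "blur_y", capacity=100)      # large intermediate image
    G.add_edge("blur_y", "gradient", capacity=100)
    G.add_edge("gradient", "threshold", capacity=5)   # small intermediate

    cut_value, (side_a, side_b) = nx.minimum_cut(G, "blur_x", "threshold")
    print(cut_value)                        # 5: the cheap edge is cut
    print(sorted(side_a), sorted(side_b))   # blur_x/blur_y/gradient stay together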

Artifacts Available, Artifacts Reusable, Results Replicated
IGC: The Open Source Intel Graphics Compiler
Anupama Chandrasekhar, Gang Chen, Po-Yu Chen, Wei-Yu Chen, Junjie Gu, Peng Guo, Shruthi Hebbur Prasanna Kumar, Guei-Yuan Lueh, Pankaj Mistry, Wei Pan, Thomas Raoux, and Konrad Trifunovic
(Intel, USA; Intel, Poland)
With increasing general purpose programming capability, GPUs have become the mainstay for a wide variety of compute intensive tasks from cloud to edge computing. Because of its availability on nearly every desktop and mobile processor that Intel ships, Intel integrated GPU offers a plethora of opportunities for researchers and application developers to make significant real-world impact.
In this paper we present the Intel Graphics Compiler (IGC), the LLVM-based production compiler for Intel HD and Iris graphics. IGC supports all major graphics and compute APIs, and its OpenCL compute stack, including the compute runtime, compiler frontend and backend, and architecture specification, is fully open source, giving developers a unique opportunity to optimize the entire stack. We highlight several custom optimizations that address the challenges of GPU compilation. Examples include SIMD size selection, divergence analysis, instruction scheduling, addressing mode selection, and redundant copy elimination. These optimizations take advantage of features in the Intel GPU architecture such as a larger register file, indirect register addressing, and multiple memory addressing modes. Experimental results show that our optimizations deliver significant speedup on a number of OpenCL benchmarks; compared to the baseline, we see a geometric mean of 12% speedup across benchmarks, with a peak gain of 45%.


Student Research Competition

Undergraduate

Automatic Parallelization of Irregular x86-64 Loops
Brandon Neth and Michelle Mills Strout
(University of Arizona, USA)
Productivity languages such as Python and R are growing in popularity especially in the development of data analysis algorithms. While these languages provide powerful tools to accelerate the development process, they incur heavy overhead due to their interpreted nature. One approach to remove this overhead is to specialize the interpreter for a given script. Then, loops in the specialized code can be parallelized to further improve performance. In contrast to previous work that targets LLVM loops with affine memory access patterns, we are investigating the problem of parallelizing loops with irregular access patterns at the x86 level. To do so, we split hot loops into a sequential master thread that sends tasks to parallel worker threads.

Graduate

A Shared BTB Design for Multicore Systems
Moumita Das, Ansuman Banerjee, and Bhaskar Sardar
(Jadavpur University, India; Indian Statistical Institute, India)
With increasing use of runtime polymorphism and reliance on runtime type interpretation, the presence and importance of indirect branches has risen considerably in recent workloads. Evidently, accurate target prediction for indirect branches has emerged as an important problem. While direction prediction of direct branches has received considerable research attention, leading to efficient prediction policies and hardware structures implemented inside modern processors, proposals for target prediction of indirect branches have been relatively few. Accurate target prediction for indirect branches is significantly harder, since these branches transfer control to an address stored in a register that is known only at runtime. Unlike conditional direct branches, indirect branches can have more than two targets to be resolved at runtime, for which prediction requires a full 32-bit / 64-bit address, in contrast to just a taken or not-taken decision for direct branches. Recent research shows that indirect branches, being mispredicted more frequently, can start to dominate the overall branch misprediction cost.

In modern processors, the only hardware structure available to facilitate target address prediction for indirect branches is a fixed-size Branch Target Buffer (BTB). The BTB is often designed as a set-associative cache that stores recent target addresses for branch instructions encountered during execution, with the goal of reusing the same addresses for future instances and thereby saving latency cycles. Evidently, designing efficient indexing schemes and replacement mechanisms for BTB structures is crucial, more so for indirect branches, since the BTB serves as their only prediction handle.

In this paper, our objective is to examine a hierarchical BTB design for multicores with total size comparable to what exists today, aiming to improve prediction accuracy by facilitating collaborative, constructive learning between programs executing on different cores that encounter similar histories. Specifically, we propose a small on-chip L1 BTB inside each core, supported by a larger off-chip L2 BTB shared among the cores. Our motivation for a hierarchical BTB design is twofold. On one hand, in a multiprogramming environment where the same program is executed on multiple cores with different test cases, there is significant potential for target-information reuse for indirect branches, as is evident from our experiments on SPEC 2006 workloads. This is typically useful for machine learning workloads, where the training phase is often executed on different cores with the same neural network being trained on different training sets. On the other hand, with different programs on different cores, there is some chance of reuse as well, due to sharing of system libraries. The essence of our idea rests on the fact that programs executed on different cores have similar history patterns that can similarly influence the target addresses. Our design aims to decrease the on-chip L1 BTB size while investing more storage in the off-chip shared L2 BTB. Suitable allocation and replacement policies support our hierarchical design. Initial experiments show the expected accuracy benefits.

Optimizing RNA-RNA Interaction Computations
Swetha Varadarajan
(Colorado State University, USA)
RNA-RNA (Ribonucleic Acid) Interaction (RRI) modeling has led to significant successes in cancer treatment. However, programs that model these interactions are computationally and memory intensive: for sequences of lengths N and M, space is Θ(N²M²) and time is Θ(N³M³). Conducting genome-wide RNA-RNA interaction studies can therefore take an impractically long time, so there is a need to speed up these computations. On examining two RRI applications (piRNA and IRIS), we find that the dominant computations fit the polyhedral model, a widely accepted tool for loop nest optimization. Applying existing automatic polyhedral tools to the whole application is close to impossible. Hence, we analyzed a surrogate kernel that captures the main dependence pattern found in piRNA and IRIS. But even on this kernel, a state-of-the-art automatic polyhedral tool, Pluto, does not significantly improve the performance of the baseline implementation, whereas with simple manual loop permutation and skewing techniques we were able to achieve an average of 17× (sequential) and 112× (parallel on a 6-core Intel Broadwell processor) speedup over the baseline. This performance represents 75% (respectively, 88%) of attainable single-core and multi-core L1 bandwidth. Preliminary results from tiling show that there is room for a two- to three-fold improvement when the data footprint required to compute the inner three loop dimensions fits in main memory.

Code Generation from Formal Models for Automatic RTOS Portability
Renata Martins Gomes and Marcel Baunach
(Graz University of Technology, Austria)
Current approaches for portability of real-time operating systems (RTOSs) for embedded systems are largely based on manual coding, which is arduous and error prone. With increasing dependability requirements for cyber-physical systems, especially within the Internet of Things (IoT), along with the expected great diversity of hardware platforms, software platforms will only remain competitive in the long run if they guarantee correct operation and easy deployment to every hardware platform. In this scenario, a new approach to the development and portability of RTOSs that guarantees correct implementations for all current and future devices and hardware architectures becomes indispensable.
We present a framework for automatic RTOS portability that integrates model-based design and formal methods into dependable embedded software development. We focus especially on modeling the interaction between software and hardware in order to generate low-level code. This enables automatic portability for hardware-related parts of the OS (e.g., context switching, memory management, security aspects) as well as for on-chip peripheral drivers (e.g., timers, I/O).
With our framework we will be able to prove the consistency of the refinements as well as that the RTOS model fulfills various functional and non-functional requirements. Automatic code generation guarantees that the model is correctly translated to machine language, avoiding implementation mistakes common in manual coding. Changes to the software, for bug fixes or testing of new concepts, for example, do not require knowledge of the target architectures, since they are made on the model and are immediately reflected in all implementations upon code generation, assuring consistency across platforms.

Understanding RDMA Behavior in NUMA Systems
Jacob Nelson and Roberto Palmieri
(Lehigh University, USA)
Most high performance computing clusters are nowadays composed of large multicore machines that expose Non-Uniform Memory Access (NUMA), and they are interconnected using modern communication paradigms, such as Remote Direct Memory Access (RDMA). In this work we perform a study outlining the performance impact of these two technologies, NUMA and RDMA, when combined. Findings show that the system's software architecture should be designed for NUMA and RDMA; otherwise major performance penalties occur.

Translating Traditional SIMD Instructions to Vector Length Agnostic Architectures
Sheng-Yu Fu and Wei-Chung Hsu
(National Taiwan University, Taiwan)
One interesting trend in SIMD architecture is towards Vector Length Agnostic (VLA) designs. For example, ARM's new vector ISA, the Scalable Vector Extension (SVE), and the RISC-V vector extension are adopting such a direction. VLA decouples the vector register length from the compiled binary so that the same executable can run on implementations with different vector lengths. However, in the current application world, the majority of SIMD code is fixed-length based, such as ARM-NEON, x86-SSE, x86-AVX, and traditional vector machines. Therefore, how to migrate legacy SIMD code to VLA architectures is an interesting and important technical challenge.

Accelerating GPU Computing at Runtime with Binary Optimization
Guangli Li, Lei Liu, and Xiaobing Feng
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Nowadays, many applications use GPUs (Graphics Processing Units) to achieve high performance. When GPU servers are used, their idle CPU resources are often ignored. In this paper, we explore the idea of using these idle CPU resources to speed up GPU programs. We design a dynamic binary optimization framework for accelerating GPU computing at runtime. A template-based binary optimization method is proposed to optimize kernels, which avoids the high cost of kernel compilation. This method replaces determined variables with constant values and generates an optimized binary kernel. Based on the analysis of optimization opportunities, we replace the original kernels with optimized kernels during program execution. The experimental results show that it is feasible to accelerate GPU programs via binary optimization. After applying binary optimization to five convolution layers of deep neural networks, the average performance improvement can reach 20%.

Extending LLVM for Lightweight SPMD Vectorization: Using SIMD and Vector Instructions Easily from Any Language
Robin Kruppe, Julian Oppermann, Lukas Sommer, and Andreas Koch
(TU Darmstadt, Germany)
Popular parallel programming environments such as CUDA or OpenMP require considerable compiler support and runtime libraries and are therefore only available for a few programming languages and/or targets. We present an approach to vectorizing kernels written in an existing general-purpose language that requires minimal changes to compiler front-ends. Programmers annotate parallel (SPMD) code regions with a few intrinsic functions, which then guide an ordinary automatic vectorization algorithm. This mechanism allows programming SIMD and vector processors without many of the complexities of more powerful parallel programming environments. Our prototype implementation, based on a custom LLVM pass, is integrated into C, C++ and Rust compilers using only 29–37 lines of frontend-specific code each.

Multi-target Compiler for the Deployment of Machine Learning Models
Oscar Castro-Lopez and Ines F. Vega-Lopez
(Autonomous University of Sinaloa, Mexico)
The deployment of machine learning models into production environments is a crucial task. Its seamless integration with the operational system can be quite challenging, as it must adhere to the same software requirements, such as memory management, latency, and scalability, as the rest of the system. Unfortunately, none of these requirements are taken into consideration when inferring new models from data. A straightforward approach for deployment consists of building a pipeline connecting the modeling tools to the operational system. This approach follows a client-server architecture, and it may not address the design requirements of the software in production, especially the ones related to efficiency. An alternative is to manually generate the source code implementing the model in the programming language that was originally used to develop the software in production. However, this approach is usually avoided because it is a time-consuming and error-prone task. To circumvent the aforementioned problems, we propose to automate the process of machine learning model deployment. For this, we have developed a special-purpose compiler. Machine learning models can be formally defined using a standard language. We use this formal description as an input for our compiler, which translates it into the source code that implements the model. Our proposed compiler generates code for different programming languages. Furthermore, with this compiler we can generate source code that exploits specific characteristics of the system's hardware architecture, such as multi-core CPUs and graphics processing units. We have conducted experiments indicating that automated code generation for deploying machine learning models is not only feasible but also efficient.

A Tool for Performance Analysis of GPU-Accelerated Applications
Keren Zhou and John Mellor-Crummey
(Rice University, USA)
Architectures for High-Performance Computing (HPC) now commonly employ accelerators such as Graphics Processing Units (GPUs). High-level programming abstractions for accelerated computing include OpenMP as well as RAJA and Kokkos, programming abstractions based on C++ templates. Such programming models hide GPU architectural details and generate sophisticated GPU code organized as many small procedures. For example, a dot product kernel expressed using RAJA atop NVIDIA's thrust templates yields 35 procedures. Existing performance tools are ill-suited for analyzing such complex kernels because they lack a comprehensive profile view. At best, tools such as NVIDIA's nvvp provide a profile view that shows only limited CPU calling contexts and omits both calling contexts and loops in GPU code. To address this problem, we extended Rice University's HPCToolkit to build a complete profile view for GPU-accelerated applications.

Kernel Fusion/Decomposition for Automatic GPU-Offloading
Alok Mishra, Martin Kong, and Barbara Chapman
(Stony Brook University, USA; Brookhaven National Laboratory, USA)
The massively parallel architecture of GPU accelerators is being harnessed to expedite computational workloads in cutting-edge scientific research. Unfortunately, writing applications for GPUs requires extensive knowledge of the underlying architecture, the application, and the interfacing programming model. Moreover, (re-)writing kernels using lower-level programming models such as CUDA and OpenCL is a burden for application scientists. A more appealing strategy is to leverage a programming model layered on directive-based optimization: OpenMP, whose recent specification significantly extends its accelerator functionality.
Despite this, it is still quite challenging to optimize large-scale applications, since "pragmatizing" each kernel is a repetitive and complex task. In large-scale applications, most of the operations may be small, may lack enough computational work to justify GPU execution, may be deeply buried in the library specification, or may be evenly spread throughout the application. Thus, we seek to design and build a compiler framework that can automatically and profitably offload regions of code with these characteristics.
The driving principle of our work resides in generating numerous kernel variants that result from fusing and/or decomposing existing function bodies. We analyze the program's call graph to determine the "proximity" of kernel calls and evaluate the degree of data reuse among adjacent or "close-enough" calls. When such patterns are detected, we generate several scenarios, eventually producing a single variant whose footprint is near the capacity of the GPU.
To compare the potential performance of the various kernel variants generated, we are designing an adaptive cost model. The precision of this cost model will depend upon the analyzability of the program. We are also building upon existing cost models, such as the model of Baghsorkhi et al., which proposed a work-flow-graph-based analytical model, and a recent model by Hong et al., which proposes the use of abstract kernel emulations to help identify the performance bottlenecks of a GPU program execution. In addition to these, we introduce GPU initialization and data-transfer costs into the model.
Once the profitable kernel variants are detected, we automatically insert pertinent OpenMP directives and provide newly generated code supporting GPU offloading.

Translating CUDA to OpenCL for Hardware Generation using Neural Machine Translation
Yonghae Kim and Hyesoon Kim
(Georgia Institute of Technology, USA)
Hardware generation from high-level languages like C/C++ has been one of the dreams of software and hardware engineers for decades. Several high-level synthesis (HLS) tools and domain-specific languages (DSLs) have been developed to reduce the gap between high-level languages and hardware description languages. However, each language tends to target specific applications, or there is a steep learning curve in learning DSLs, which ends up producing many programming languages and tool chains. To address these challenges, we propose the use of source-to-source translation to pick and choose which framework to use, so that the hardware designer chooses the best target HLS/DSL that can be synthesized to the best-performing hardware. In this work, we present source-to-source translation from CUDA to OpenCL using neural machine translation (NMT), which we call PLNMT. The contribution of our work is that it develops techniques to generate training inputs. To generate a training dataset, we extract CUDA API usages from CUDA examples and write corresponding OpenCL API usages. With a pair of API usages acquired, we construct API usage trees that help users find unseen usages from new samples and easily add them to a training input. Our initial results show that we can translate many applications from benchmarks such as CUDA SDK, polybench-gpu, and Rodinia. Furthermore, we show that translated kernel code from CUDA applications can be run in an OpenCL FPGA framework, which suggests a new direction for HLS.
