CGO 2020

18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO), February 22–26, 2020, San Diego, CA, USA

CGO 2020 – Proceedings


Frontmatter

Title Page


Welcome from the Steering Committee
On behalf of the Steering Committee, we welcome you to the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’20) and to San Diego.

Message from the Program Chairs
Welcome to the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020) in the beautiful city of San Diego. This year’s symposium continues its tradition of being the premier forum for presentation of research results on leading-edge issues in code generation and optimization.

CGO 2020 Organization

Report from the Artifact Evaluation Committee
Authors of accepted papers were given the opportunity to participate in the artifact evaluation process by submitting a research artifact. ACM defines an artifact as “a digital object that was either created by the authors to be used as part of the study or generated by the experiment itself”. The artifact evaluation process checks if the submitted artifact supports the claims made in the paper. Ultimately, artifact evaluation is intended to encourage researchers to take special care in conducting reproducible experiments and to package experimental workflows including related materials for making them accessible for others.

Sponsors


Dynamic Languages

Efficient Nursery Sizing for Managed Languages on Multi-core Processors with Shared Caches
Mohamed Ismail and G. Edward Suh
(Cornell University, USA)
In modern programming languages, automatic memory management has become a standard feature for allocating and freeing memory. In this paper, we show that the performance of today’s managed languages can degrade significantly due to cache contention among multiple concurrent applications that share a cache. To address this problem, we propose to change the programs’ memory access patterns by adjusting the nursery size. We propose Dynamic Nursery Allocator (DNA), an online dynamic scheme that automatically adjusts the nursery sizes of multiple managed-language programs running concurrently without any prior knowledge or offline profiling. The experimental results on a native Intel machine show that DNA can significantly improve the system throughput by 16.3% on average and as much as 73% over today’s nursery sizing scheme when four applications run concurrently sharing the last-level cache.
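The core idea of adjusting nursery sizes online can be pictured as a feedback loop around a cache-contention signal. The Python sketch below is purely illustrative and not DNA's actual algorithm; the measure_llc_miss_rate and set_nursery_size hooks, thresholds, and step sizes are hypothetical.

# Illustrative feedback loop for online nursery sizing (not DNA's actual algorithm).
# measure_llc_miss_rate() and set_nursery_size() are hypothetical hooks into the
# runtime and the hardware performance counters.

def tune_nursery(measure_llc_miss_rate, set_nursery_size,
                 size_mb=32, lo=0.02, hi=0.10, step=4, min_mb=4, max_mb=256):
    """Grow the nursery while cache pressure is low; shrink it when the
    shared last-level cache shows contention."""
    while True:
        miss_rate = measure_llc_miss_rate()        # e.g. LLC misses per instruction
        if miss_rate > hi and size_mb > min_mb:
            size_mb = max(min_mb, size_mb - step)  # nursery too big: it thrashes the LLC
        elif miss_rate < lo and size_mb < max_mb:
            size_mb = min(max_mb, size_mb + step)  # room to grow: fewer minor collections
        set_nursery_size(size_mb)
        yield size_mb                              # hand control back to the runtime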

Publisher's Version Artifacts Reusable Results Replicated
Type Freezing: Exploiting Attribute Type Monomorphism in Tracing JIT Compilers
Lin Cheng, Berkin Ilbeyi, Carl Friedrich Bolz-Tereick, and Christopher Batten
(Cornell University, USA; University of Düsseldorf, Germany)
Dynamic programming languages continue to increase in popularity. While just-in-time (JIT) compilation can improve the performance of dynamic programming languages, a significant performance gap remains with respect to ahead-of-time compiled languages. Existing JIT compilers exploit type monomorphism through type specialization, and use runtime checks to ensure correctness. Unfortunately, these checks can introduce non-negligible overhead. In this paper, we present type freezing, a novel software solution for exploiting attribute type monomorphism. Type freezing "freezes" type monomorphic attributes of user-defined types, and eliminates the necessity of runtime type checks when performing reads from these attributes. Instead, runtime type checks are done when writing these attributes to validate type monomorphism. We implement type freezing as an extension to PyPy, a state-of-the-art tracing JIT compiler for Python. Our evaluation shows type freezing can improve performance and reduce dynamic instruction count for those applications with a significant number of attribute accesses.
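The asymmetry that type freezing exploits, checking types on the rare writes so that the frequent reads need no guard, can be illustrated with a toy Python object model. This is only a conceptual sketch, not how PyPy's tracing JIT implements the optimization.

# Toy illustration of type freezing: validate the attribute's type on writes,
# so reads can skip a per-read type guard. Conceptual sketch only.

class FrozenAttr:
    """Descriptor that keeps an attribute type-monomorphic."""
    def __init__(self, name, frozen_type):
        self.name = "_" + name
        self.frozen_type = frozen_type

    def __set__(self, obj, value):
        # The only type check happens here, on the (rarer) write path.
        if type(value) is not self.frozen_type:
            raise TypeError(f"{self.name} is frozen to {self.frozen_type.__name__}")
        setattr(obj, self.name, value)

    def __get__(self, obj, objtype=None):
        # Read path: no type check needed, the invariant was enforced on write.
        return getattr(obj, self.name)

class Point:
    x = FrozenAttr("x", int)
    y = FrozenAttr("y", int)
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
print(p.x + p.y)   # reads need no guard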

Publisher's Version Artifacts Reusable Results Replicated

Safety and Reliability

Low-Cost Prediction-Based Fault Protection Strategy
Sunghyun Park, Shikai Li, Ze Zhang, and Scott Mahlke
(University of Michigan, USA)
Increasing failures from transient faults necessitate a cost-efficient protection mechanism that can remain active at all times. Thus, we propose a novel prediction-based transient fault protection strategy as a low-cost, software-only technique. Instead of re-executing expensive computations for validation, an output prediction is used to cheaply determine an approximate value for a sequence of computations. When the actual computation and the prediction agree within a predefined acceptable range, the computation is assumed to be fault-free, and the expensive re-computation can be skipped. With our approach, a significant reduction in dynamic instruction count is possible. Missed faults may occur, but their occurrence can be explicitly kept small with a properly chosen acceptable range. For evaluation, we build an automatic compilation system, called RSkip, that transforms a program into a resilient executable with the prediction-based protection scheme. Prior instruction-replication work incurs 2.33x execution time relative to unreliable execution over nine compute-intensive benchmarks. With a controlled loss in protection rate, RSkip reduces the protection overhead to 1.27x by skipping redundant computation in our target loops at a rate of 81.10%.
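The skip-on-agreement scheme can be sketched in a few lines: produce a cheap prediction, run the computation once, and only fall back to redundant execution when the result strays outside the acceptable range. The predictor, tolerance, and fallback below are hypothetical placeholders, not RSkip's generated code.

# Sketch of prediction-based validation (not RSkip's generated code).
# predict() stands in for a cheap value predictor, e.g. linear extrapolation
# from previous loop iterations.

def protected_eval(compute, predict, tolerance):
    result = compute()
    predicted = predict()
    if abs(result - predicted) <= tolerance:
        return result              # prediction agrees: skip re-execution
    # Disagreement: fall back to classic redundant execution and voting.
    recheck = compute()
    return result if result == recheck else compute()

# Example: one iteration of a smooth computation, predicted by extrapolation.
history = [1.0, 1.1]
value = protected_eval(
    compute=lambda: 1.21,
    predict=lambda: 2 * history[-1] - history[-2],   # linear extrapolation -> 1.2
    tolerance=0.05,
)
print(value)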

Publisher's Version
Secure Automatic Bounds Checking: Prevention Is Simpler Than Cure
Ejebagom John Ojogbo, Mithuna Thottethodi, and T. N. Vijaykumar
(Purdue University, USA)
Recent Spectre attacks exploit hardware speculative execution to read forbidden data. The attacks speculatively load forbidden data on misspeculated paths, creating a side channel via microarchitectural state that is not cleaned up after a misspeculation. The side channel then leaks the data. We focus on the most challenging Spectre variant (Spectre-v1), which exploits sandboxing through bounds checking. Because the forbidden data can be accessed in only three ways, only one of which remains challenging (Spectre-v1), whereas the data can be leaked through numerous side channels, all of which must be plugged, preventing the access in the first place is more practical. Recent hardware schemes plug some side channels but incur significant complexity and performance loss, and remain susceptible to other side channels. Most current software mitigations are architecture-dependent, have performance or semantic-uncertainty problems, or both. We propose a compiler-based mitigation, called Secure Automatic Bounds Checking (SABC), which uses a simple sequence of three instructions to prevent forbidden access. The instructions have straightforward semantics and are found in all 32- and 64-bit architectures. An alternative, architecture-independent technique that leverages process boundaries (site isolation) incurs 1.8x memory overhead and 30% performance overhead over a baseline with no isolation. SABC is architecture-independent, has assured semantics, incurs little performance overhead, and renders current and future side channels useless for Spectre-v1.
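For intuition, branchless bounds enforcement generally works by masking the index arithmetically so that even a mis-speculated access can only touch a safe location. The Python sketch below shows that generic masking idiom only; it is not SABC's actual three-instruction sequence, and it assumes a non-negative index.

# Generic branchless index-masking idiom, shown in Python for illustration.
# Not SABC's exact instruction sequence; the point is that the index is forced
# to a safe value arithmetically, without a conditional branch that hardware
# could mis-speculate past. Assumes index >= 0.

def masked_load(array, index):
    in_bounds = index < len(array)       # boolean computed without branching
    mask = -int(in_bounds)               # all-ones if in bounds, all-zeros otherwise
    safe_index = index & mask            # out-of-bounds indices collapse to 0
    value = array[safe_index]
    return value & mask                  # and the returned value itself is zeroed

data = [10, 20, 30, 40]
print(masked_load(data, 2))    # 30
print(masked_load(data, 999))  # 0: nothing secret reaches the caller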

Publisher's Version
Aloe: Verifying Reliability of Approximate Programs in the Presence of Recovery Mechanisms
Keyur Joshi, Vimuth Fernando, and Sasa Misailovic
(University of Illinois at Urbana-Champaign, USA)
Modern hardware is becoming increasingly susceptible to silent data corruptions. As general methods for detection and recovery from errors are time and energy consuming, selective detection and recovery are promising alternatives for applications that have the freedom to produce results with a variable level of accuracy. Several programming languages have provided specialized constructs for expressing detection and recovery operations, but the existing static analyses of safety and quantitative analyses of programs do not have the proper support for such language constructs.
This work presents Aloe, a quantitative static analysis of the reliability of programs with recovery blocks, a construct that checks for errors and, if necessary, applies the corresponding recovery strategy. The analysis supports reasoning about both reliable and potentially unreliable detection and recovery mechanisms. It implements a novel precondition generator for recovery blocks, built on top of Rely, a state-of-the-art quantitative reliability analysis for imperative programs. Aloe can reason about programs with scalar and array expressions, if-then-else conditionals, and bounded loops without early exits. The analyzed computation is assumed to be idempotent, and the recovery code re-executes the original computation.
We implemented Aloe and applied it to a set of eight programs previously used in approximate computing research. Our results show significantly higher reliability and better scalability compared to the existing Rely analysis. Moreover, the verified computations exhibit only small end-to-end accuracy losses.
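The recovery-block pattern that Aloe reasons about, run a computation, check a detector, and re-execute the idempotent computation when an error is flagged, has roughly the following shape. The detector and retry bound are placeholders, not Aloe's language syntax.

# Rough shape of a recovery block: run the computation, check for errors,
# and re-execute the idempotent computation if the check fails.
# 'detector' and 'max_retries' are placeholders, not Aloe's actual syntax.
import math

def recovery_block(compute, detector, max_retries=1):
    result = compute()
    for _ in range(max_retries):
        if detector(result):        # no error detected
            break
        result = compute()          # recovery: re-execute the original computation
    return result

# Example: a detector that rejects non-finite values.
out = recovery_block(lambda: sum([1.0, 2.0, 3.0]), detector=math.isfinite)
print(out)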

Publisher's Version
Interactive Debugging of Concurrent Programs under Relaxed Memory Models
Aakanksha Verma, Pankaj Kumar Kalita, Awanish Pandey, and Subhajit Roy
(IIT Kanpur, India)
Programming environments for sequential programs provide strong debugging support. However, concurrent programs, especially under relaxed memory models, lack powerful interactive debugging tools. In this work, we present Gambit, an interactive debugging environment that uses gdb to run a concrete debugging session on a concurrent program while simultaneously running a symbolic execution of the program trace in the background. The symbolic execution is analysed by a theorem prover to answer queries from the programmer about possible scenarios resulting from alternate thread interleavings or from reorderings under other relaxed memory models.
Gambit can help programmers understand complex bug scenarios in alien environments, that is, when the program is deployed in the field under different concurrent schedulers and diverse memory models. While Gambit currently packages three memory models (PSO, TSO, and SC) and provides support for constructs like fences and atomic blocks, the framework is extensible, allowing support for other memory models and programming constructs. We have evaluated Gambit on multiple programs and found that it responds to debug queries quite fast (about a second on average across our benchmark programs).

Publisher's Version

Best Paper Finalists

Testing Static Analyses for Precision and Soundness
Jubi Taneja, Zhengyang Liu, and John Regehr
(University of Utah, USA)
Static analyses compute properties of programs that are true in all executions, and compilers use these properties to justify optimizations such as dead code elimination. Each static analysis in a compiler should be as precise as possible while remaining sound and being sufficiently fast. Unsound static analyses typically lead to miscompilations, whereas imprecisions typically lead to missed optimizations. Neither kind of bug is easy to track down.
Our research uses formal methods to help compiler developers create better static analyses. Our contribution is the design and evaluation of several algorithms for computing sound and maximally precise static analysis results using an SMT solver. These methods are too slow to use at compile time, but they can be used offline to find soundness and precision errors in a production compiler such as LLVM. We found no new soundness bugs in LLVM, but we can discover previously-fixed soundness errors that we re-introduced into the code base. We identified many imprecisions in LLVM’s static analyses, some of which have been fixed as a result of our work.
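As a flavor of how an SMT solver can check a static analysis fact, the snippet below uses Z3 (assuming the z3-solver Python package) to verify that a simple known-zero-bits fact for 8-bit bitwise AND is sound. It is only a toy illustration of the approach; the paper's algorithms compute maximally precise results for LLVM's actual analyses.

# Toy illustration: use an SMT solver to check soundness of a "known zero bits"
# transfer result for bitwise AND. Requires the z3-solver package.
from z3 import BitVec, BitVecVal, And, Implies, prove

x, y = BitVec("x", 8), BitVec("y", 8)

# Abstract facts: a mask of bits known to be zero in each input.
kz_x = BitVecVal(0b11110000, 8)   # high nibble of x is known zero
kz_y = BitVecVal(0b00001111, 8)   # low nibble of y is known zero

# Transfer result under test: known-zero bits of (x & y) include kz_x | kz_y.
kz_out = kz_x | kz_y

claim = Implies(And(x & kz_x == 0, y & kz_y == 0),
                (x & y) & kz_out == 0)
prove(claim)   # prints "proved": the transfer result is sound for these facts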

Publisher's Version Artifacts Reusable Results Replicated
HALO: Post-Link Heap-Layout Optimisation
Joe Savage and Timothy M. Jones
(University of Cambridge, UK)
Today, general-purpose memory allocators dominate the landscape of dynamic memory management. While these solutions can provide reasonably good behaviour across a wide range of workloads, it is an unfortunate reality that their behaviour for any particular workload can be highly suboptimal. By catering primarily to average and worst-case usage patterns, these allocators deny programs the advantages of domain-specific optimisations, and thus may inadvertently place data in a manner that hinders performance, generating unnecessary cache misses and load stalls.
To help alleviate these issues, we propose HALO: a post-link profile-guided optimisation tool that can improve the layout of heap data to reduce cache misses automatically. Profiling the target binary to understand how allocations made in different contexts are related, we specialise memory-management routines to allocate groups of related objects from separate pools to increase their spatial locality. Unlike other solutions of its kind, HALO employs novel grouping and identification algorithms which allow it to create tight-knit allocation groups using the entire call stack and to identify these efficiently at runtime. Evaluation of HALO on contemporary out-of-order hardware demonstrates speedups of up to 28% over jemalloc, out-performing a state-of-the-art data placement technique from the literature.
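The grouping idea, drawing allocations that come from the same allocation context out of the same pool so related objects land near each other, can be sketched as follows. The Python model below is illustrative only; HALO identifies call-stack contexts efficiently at runtime and specializes native memory-management routines post-link.

# Sketch of context-based allocation grouping: requests made from the same
# call-stack context draw from the same pool, improving spatial locality.
# This models the idea only; HALO specializes allocation routines in native code.
import traceback
from collections import defaultdict

class PooledAllocator:
    def __init__(self, pool_size=64):
        self.pools = defaultdict(list)   # allocation context -> pool of objects
        self.pool_size = pool_size

    def allocate(self, make_obj):
        # Use the call stack as the allocation context (Python's traceback is
        # only a stand-in for HALO's efficient runtime context identification).
        context = tuple(frame.name for frame in traceback.extract_stack()[:-1])
        pool = self.pools[context]
        if not pool:
            pool.extend(make_obj() for _ in range(self.pool_size))  # refill as a group
        return pool.pop()

alloc = PooledAllocator()
node = alloc.allocate(lambda: {"next": None, "payload": 0})
print(type(node), len(alloc.pools))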

Publisher's Version Artifacts Reusable Results Replicated
Efficient and Scalable Cross-ISA Virtualization of Hardware Transactional Memory
Wenwen Wang, Pen-Chung Yew, Antonia Zhai, and Stephen McCamant
(University of Georgia, USA; University of Minnesota, USA)
System virtualization is a key enabling technology. However, existing virtualization techniques suffer from a significant limitation due to their limited cross-ISA support for emerging architecture-specific hardware extensions. To address this issue, we make the first attempt at cross-ISA virtualization of hardware transactional memory (HTM), which is supported by modern multi-core processors and used by more and more applications to simplify concurrent programming. In particular, we propose an efficient and scalable mechanism to support cross-ISA virtualization of HTMs. The mechanism emulates guest HTMs using host HTMs and tries to preserve, as much as possible, the performance and scalability of guest applications. Experimental results on STAMP benchmarks show that average performance speedups of 2.3X and 12.6X can be achieved for x86_64 and PowerPC64 guest applications, respectively, on an x86_64 host machine. Moreover, it attains scalability similar to native execution of the applications.

Publisher's Version

GPUs

Speculative Reconvergence for Improved SIMT Efficiency
Sana Damani, Daniel R. Johnson, Mark Stephenson, Stephen W. Keckler, Eddie Yan, Michael McKeown, and Olivier Giroux
(Georgia Institute of Technology, USA; NVIDIA, USA; University of Washington, USA; Esperanto Technologies, USA)
GPUs perform most efficiently when all threads in a warp execute the same sequence of instructions convergently. However, when threads in a warp encounter a divergent branch, the hardware serializes the execution of diverged paths. We consider a class of convergence opportunities wherein multiple threads are expected to eventually execute a given segment of code, but not all threads arrive at the same time, resulting in serialized duplicate execution of common code subsequences such as function calls and loop bodies. Our goal is to promote convergence by helping threads that execute common code arrive together before allowing execution to proceed. We propose a new user-guided compiler mechanism, Speculative Reconvergence, to help identify and exploit previously untapped convergence opportunities that increase SIMT efficiency and improve performance. For the set of workloads we study, we see improvements ranging from 10% to 3× in both SIMT efficiency and in performance.

Publisher's Version
Optimizing Occupancy and ILP on the GPU using a Combinatorial Approach
Ghassan Shobaki, Austin Kerbow, and Stanislav Mekhanoshin
(California State University at Sacramento, USA; Advanced Micro Devices, USA)
This paper presents the first general solution to the problem of optimizing both occupancy and Instruction-Level Parallelism (ILP) when compiling for a Graphics Processing Unit (GPU). Exploiting ILP (minimizing schedule length) requires using more registers, but using more registers decreases occupancy (the number of thread groups that can be run in parallel). The problem of balancing these two conflicting objectives to achieve the best overall performance is a challenging open problem in code optimization. In this paper, we present a two-pass Branch-and-Bound (B&B) algorithm for solving this problem by treating occupancy as a primary objective and ILP as a secondary objective. In the first pass, the algorithm searches for a maximum-occupancy schedule, while in the second pass it iteratively searches for the shortest schedule that gives the maximum occupancy found in the first pass. The proposed scheduling algorithm was implemented in the LLVM compiler and applied to an AMD GPU. The algorithm’s performance was evaluated using benchmarks from the PlaidML machine learning framework relative to LLVM’s scheduling algorithm, AMD’s production scheduling algorithm and an existing B&B scheduling algorithm that uses a different approach. The results show that the proposed B&B scheduling algorithm speeds up almost every benchmark by up to 35% relative to LLVM’s scheduler, up to 31% relative to AMD’s scheduler and up to 18% relative to the existing B&B scheduler. The geometric-mean improvements are 16.3% relative to LLVM’s scheduler, 5.5% relative to AMD’s production scheduler and 6.2% relative to the existing B&B scheduler. If more compile time can be tolerated, a geometric-mean improvement of 6.3% relative to AMD’s scheduler can be achieved.

Publisher's Version

Compilation for Specialized Domains

Multi-layer Optimizations for End-to-End Data Analytics
Amir Shaikhha, Maximilian Schleich, Alexandru Ghita, and Dan Olteanu
(University of Oxford, UK)
We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.

Publisher's Version
Optimizing Ordered Graph Algorithms with GraphIt
Yunming Zhang, Ajay Brahmakshatriya, Xinyi Chen, Laxman Dhulipala, Shoaib Kamil, Saman Amarasinghe, and Julian Shun
(Massachusetts Institute of Technology, USA; Carnegie Mellon University, USA; Adobe Research, USA)
Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, to simplify writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in 1.2×–3× speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. With the extension, GraphIt achieves up to 3× speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.
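Ordered algorithms of this kind process vertices bucket by bucket according to a priority such as the tentative distance. The short Python sketch below shows a sequential bucket-based shortest-paths loop to convey the execution model; it is not code generated by GraphIt.

# Minimal bucket-based (Dial-style) shortest paths with non-negative integer
# weights, illustrating the priority/bucket execution model of ordered graph
# algorithms. Illustration only, not GraphIt-generated code.
import math

def bucket_sssp(graph, source):
    """graph: {u: [(v, w), ...]} with small non-negative integer weights."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    max_dist = sum(w for edges in graph.values() for _, w in edges)
    buckets = [[] for _ in range(max_dist + 1)]
    buckets[0].append(source)

    for d, bucket in enumerate(buckets):     # process vertices in priority order
        for u in bucket:
            if d != dist[u]:                 # stale entry: u was settled earlier
                continue
            for v, w in graph[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    buckets[d + w].append(v) # move v to the bucket of its new distance
        # (Roughly speaking, GraphIt's bucket fusion would fuse consecutive
        #  rounds here to cut synchronization overhead in the parallel version.)
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(bucket_sssp(g, "a"))   # {'a': 0, 'b': 2, 'c': 3}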

Publisher's Version Artifacts Functional Results Replicated
A Performance-Optimizing Compiler for Cyber-Physical Digital Microfluidic Biochips
Tyson Loveless, Jason Ott, and Philip Brisk
(University of California at Riverside, USA)
This paper introduces a compiler optimization strategy for Software-Programmable Laboratories-on-a-Chip (SP-LoCs), which miniaturize and automate a wide variety of benchtop laboratory experiments. The compiler targets a specific class of SP-LoCs that manipulate discrete liquid droplets on a 2D grid, with cyber-physical feedback provided by integrated sensors and/or video monitoring equipment. The optimization strategy employed here aims to reduce the overhead of transporting fluids between operations, and explores tradeoffs between the latency and resource requirements of mixing operations: allocating more space for mixing shortens mixing time, but reduces the amount of spatial parallelism available to other operations. The compiler is empirically evaluated using a cycle-accurate simulator that mimics the behavior of the target SP-LoC. Our results show that a coalescing strategy, inspired by graph coloring register allocation, effectively reduces droplet transport latencies while speeding up the compiler and reducing its memory footprint. For biochemical reactions that are dominated by mixing operations, we observe a linear correlation between a preliminary result using a default mixing operation resource allocation and the percentage decrease in execution time that is achieved via resizing.

Publisher's Version
CogniCryptGEN: Generating Code for the Secure Usage of Crypto APIs
Stefan Krüger, Karim Ali, and Eric Bodden
(University of Paderborn, Germany; University of Alberta, Canada; Fraunhofer IEM, Germany)
Many software applications are insecure because they misuse cryptographic APIs. Prior attempts to address such misuses have focused on detecting them after the fact. However, avoiding misuses in the first place would significantly reduce development cost. In this paper, we present CogniCryptGEN, a code generator that proactively assists developers in using Java crypto APIs correctly. CogniCryptGEN accepts as input a code template and API-usage rules defined in the specification language CrySL. The code templates in CogniCryptGEN are minimal, comprising only simple glue code. All security-sensitive code is generated fully automatically from the CrySL rules that the templates merely refer to. That way, generated code is provably correct and secure with respect to the CrySL definitions. CogniCryptGEN supports the implementation of the most common cryptographic use cases, ranging from password-based encryption to digital signatures. We have empirically evaluated CogniCryptGEN from the perspectives of both crypto-API developers and application developers. Our results show that CogniCryptGEN is fast enough to be used during development. Compared to a state-of-the-art template-based solution, implementing use cases with CogniCryptGEN requires only a fourth of the development effort, without any additional language skills. Real-world developers see CogniCryptGEN as significantly simpler to use than the same template-based solution.

Publisher's Version Artifacts Functional Results Replicated

Tool and Practical Experience Papers

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, and Satoshi Matsuoka
(Barcelona Supercomputing Center, Spain; Edgecortix, Japan; AIST, Japan; Tokyo Institute of Technology, Japan; RIKEN CCS, Japan)
Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.

Publisher's Version
The Design and Implementation of the Wolfram Language Compiler
Abdul Dakkak, Tom Wickham-Jones, and Wen-mei Hwu
(University of Illinois at Urbana-Champaign, USA; Wolfram Research, UK)
The popularity of data- and scientific-oriented applications, abundance of on-demand compute resources, and scarcity of domain expert programmers have given rise to high-level scripting languages. These high-level scripting languages offer a fast way to translate ideas into code, but tend to pay a heavy performance overhead. In order to alleviate the performance penalty, each implementation of these languages often offers a compilation path to a subset of the language.
In this paper we present the design and implementation of the Wolfram Language compiler, the production compiler for the Wolfram Language. We show how popular language features and runtime behavior, expected by Wolfram Language developers, are efficiently implemented within the compiler. We then show how the compiler provides a frictionless path to migrate programs from the interpreter to the compiler. We evaluate the compiler and show that compiled code matches the performance of highly tuned hand-written C code. The compiler has been released as a prominent feature of Wolfram Engine v1212 and is readily available to developers.

Publisher's Version Artifacts Functional
SIMD Support in .NET: Abstract and Concrete Vector Types and Operations
Carol Eidt and Tanner Gooding
(Microsoft, USA)
This paper describes SIMD (Single Instruction Multiple Data) APIs for .NET that expose the parallel execution capabilities available on modern processors. These APIs include both platform-independent and platform-specific APIs that expose the SIMD capabilities of the target hardware. The platform-independent APIs abstract both the element type and the length of the maximum-width vector available on the target hardware, while the platform-specific APIs map more directly to the target instructions. The APIs are consumable by high-level languages such as C# and F#, and the just-in-time (JIT) compiler for .NET translates the operations to the appropriate hardware instructions directly in the referencing method, avoiding overhead due to calls or managed/native transitions.

Publisher's Version Artifacts Reusable Results Replicated

Code Optimization

NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning
Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica
(University of California at Berkeley, USA; Intel Labs, USA)
One of the key challenges arising when compilers vectorize loops for today’s SIMD-compatible architectures is to decide whether vectorization or interleaving is beneficial. The compiler then has to determine the number of instructions to pack together and the interleaving level (stride). Today’s compilers use fixed cost models based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependencies, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time.
In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a Deep RL agent, which dynamically determines the vectorization factors for all the loops. We further extend our framework to support random search, decision trees, supervised neural networks, and nearest-neighbor search. We evaluate our approaches against the currently used LLVM vectorizer and loop polyhedral optimization techniques. Our experiments show 1.29×−4.73× performance speedup compared to baseline and only 3% worse than the brute-force search on a wide range of benchmarks.
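The chosen factors ultimately surface as per-loop pragmas for the compiler. The sketch below shows only that final annotation step; the learned policy is elided, and choose_factors is a hypothetical placeholder for the loop-embedding plus RL agent.

# Sketch of the last step of per-loop factor selection: injecting the chosen
# vectorization and interleaving factors as pragmas above a C loop.
# choose_factors is a hypothetical stand-in for the learned policy.

def choose_factors(loop_src):
    # Placeholder for the learned policy: embed the loop, query the RL agent,
    # and return (vectorization factor, interleaving factor).
    return 4, 2

def annotate_loop(loop_src):
    vf, itl = choose_factors(loop_src)
    pragma = (f"#pragma clang loop vectorize_width({vf}) "
              f"interleave_count({itl})\n")
    return pragma + loop_src

loop = "for (int i = 0; i < n; i++)\n  a[i] = b[i] + c[i];\n"
print(annotate_loop(loop))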

Publisher's Version Info Artifacts Reusable Results Replicated
Introducing the Pseudorandom Value Generator Selection in the Compilation Toolchain
Michael Leonard and Simone Campanoni
(Northwestern University, USA)
As interest in randomization has grown within the computing community, the number of pseudorandom value generators (PRVGs) at developers' disposal dramatically increased. Today, developers lack the tools necessary to obtain optimal behavior from their PRVGs. We provide the first deep study into the tradeoffs among the PRVGs in the C++ standard, finding no silver bullet for all programs and architectures. With this in mind, we have built PRV Jeeves, the first fully automatic PRVG selector. We demonstrate that when compiling widely-used, highly optimized programs with PRV Jeeves, we are able to cut execution time by 34% on average. This enhancement comes at no cost to developers.

Publisher's Version

Heterogeneity and Parallelism

COLAB: A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors
Teng Yu, Pavlos Petoumenos, Vladimir Janjic, Hugh Leather, and John Thomson
(University of St. Andrews, UK; University of Manchester, UK; University of Edinburgh, UK)
Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads.
This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time.
We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25%, and 5% to 15% on average, depending on the hardware setup.

Publisher's Version
PreScaler: An Efficient System-Aware Precision Scaling Framework on Heterogeneous Systems
Seokwon Kang, Kyunghwan Choi, and Yongjun Park
(Hanyang University, South Korea)
Graphics processing units (GPUs) have been commonly utilized to accelerate multiple emerging applications, such as big data processing and machine learning. While GPUs are proven to be effective, approximate computing, to trade off performance with accuracy, is one of the most common solutions for further performance improvement. Precision scaling of originally high-precision values into lower-precision values has recently been the most widely used GPU-side approximation technique, including hardware-level half-precision support. Although several approaches to find optimal mixed-precision configuration of GPU-side kernels have been introduced, total program performance gain is often low because total execution time is the combination of data transfer, type conversion, and kernel execution. As a result, kernel-level scaling may incur high type-conversion overhead of the kernel input/output data. To address this problem, this paper proposes an automatic precision scaling framework called PreScaler that maximizes the program performance at the memory object level by considering whole OpenCL program flows. The main difficulty is that the best configuration cannot be easily predicted due to various application- and system-specific characteristics. PreScaler solves this problem using search space minimization and decision-tree-based search processes. First, it minimizes the number of test configurations based on the information from system inspection and dynamic profiling. Then, it finds the best memory-object level mixed-precision configuration using a decision-tree-based search. PreScaler achieves an average performance gain of 1.33x over the baseline while maintaining the target output quality level.

Publisher's Version Artifacts Functional Results Replicated
ATMem: Adaptive Data Placement in Graph Applications on Heterogeneous Memories
Yu Chen, Ivy B. Peng, Zhen Peng, Xu Liu, and Bin Ren
(College of William and Mary, USA; Lawrence Livermore National Laboratory, USA)
Active development in new memory devices, such as non-volatile memories and high-bandwidth memories, brings heterogeneous memory systems (HMS) as a promising solution for implementing large-scale memory systems with cost, area, and power limitations. Typical HMS consists of a small-capacity high-performance memory and a large-capacity low-performance memory. Data placement on such systems plays a critical role in performance optimization. Existing efforts have explored coarse-grained data placement in applications with dense data structures; however, a thorough study of applications that are based on graph data structures is still missing.
This work proposes ATMem—a runtime framework for adaptive granularity data placement optimization in graph applications. ATMem consists of a lightweight profiler, an analyzer using a novel m-ary tree-based strategy to identify sampled and estimated critical data chunks, and a high-bandwidth migration mechanism using a multi-stage multi-threaded approach. ATMem is evaluated in five applications on two HMS hardware, including the Intel Optane byte-addressable NVM and MCDRAM. Experimental results show that ATMem selects 5%-18% data to be placed on high-performance memory and achieves an average of 1.7×-3.4× speedup on NVM-DRAM and 1.2×-2.0× speedup on MCDRAM-DRAM, over the baseline that places all data on the large-capacity memory. On NVM-DRAM, ATMem achieves performance comparable to a full-DRAM system with as low as 9%-54% slowdown.

Publisher's Version

Code Generation and Transformation

Automatic Generation of High-Performance Quantized Machine Learning Kernels
Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze
(University of Washington, USA; University of Texas at Austin, USA)
Quantization optimizes machine learning inference for resource constrained environments by reducing the precision of its computation. In the extreme, even single-bit computations can produce acceptable results at dramatically lower cost. But this ultra-low-precision quantization is difficult to exploit because extracting optimal performance requires hand-tuning both high-level scheduling decisions and low-level implementations. As a result, practitioners settle for a few predefined quantized kernels, sacrificing optimality and restricting their ability to adapt to new hardware.
This paper presents a new automated approach to implementing quantized inference for machine learning models. We integrate the choice of how to lay out quantized values into the scheduling phase of a machine learning compiler, allowing it to be optimized in concert with tiling and parallelization decisions. After scheduling, we use program synthesis to automatically generate efficient low-level operator implementations for the desired precision and data layout. We scale up synthesis using a novel reduction sketch that exploits the structure of matrix multiplication. On a ResNet18 model, our generated code outperforms an optimized floating-point baseline by up to 3.9×, and a state-of-the-art quantized implementation by up to 16.6×.
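At the single-bit extreme, the central kernel is a binarized dot product: with values in {-1, +1} packed into bits, the dot product reduces to an XOR followed by a popcount. The small Python reference below illustrates that arithmetic only; the paper's contribution is automatically synthesizing fast, layout-aware implementations of such kernels.

# Reference arithmetic for a binarized (1-bit) dot product: values in {-1, +1}
# are packed as bits (+1 -> 1, -1 -> 0), and the dot product becomes
# n - 2 * popcount(a XOR b). Illustrates the math only.

def pack(values):
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binarized_dot(a_bits, b_bits, n):
    disagreements = bin(a_bits ^ b_bits).count("1")
    return n - 2 * disagreements

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binarized_dot(pack(a), pack(b), len(a)))   # 0
print(sum(x * y for x, y in zip(a, b)))          # 0, the reference answer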

Publisher's Version Artifacts Functional Results Replicated
Deriving Parametric Multi-way Recursive Divide-and-Conquer Dynamic Programming Algorithms using Polyhedral Compilers
Mohammad Mahdi Javanmard, Zafar Ahmad, Martin Kong, Louis-Noël Pouchet, Rezaul Chowdhury, and Robert Harrison
(Stony Brook University, USA; University of Oklahoma, USA; Colorado State University, USA)
We present a novel framework to automatically derive highly efficient parametric multi-way recursive divide-&-conquer algorithms for a class of dynamic programming (DP) problems. Standard two-way or any fixed R-way recursive divide-&-conquer algorithms may not fully exploit many-core processors. To run efficiently on a given machine, the value of R may need to be different for every level of recursion based on the number of processors available and the sizes of memory/caches at different levels of the memory hierarchy. The set of R values that work well on a given machine may not work efficiently on another machine with a different set of machine parameters. To improve portability and efficiency, Multi-way Autogen generates parametric multi-way recursive divide-&-conquer algorithms where the value of R can be changed on the fly for every level of recursion. We present experimental results demonstrating the performance and scalability of the parallel programs produced by our framework.
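A parametric multi-way recursion, where the fan-out R can differ at every level, has roughly the skeleton below, shown here for a trivially decomposable reduction. Deriving such structures for dynamic-programming recurrences, as the paper does, is considerably harder; the fan-out list here is just an illustrative parameter.

# Skeleton of a parametric multi-way divide-and-conquer: the fan-out R is chosen
# per recursion level (e.g. from the core count and cache size at that level).
# Shown for a simple reduction; the paper derives analogous structures for
# dynamic-programming recurrences automatically.

def multiway_reduce(data, combine, fanouts, base_size=4):
    if len(data) <= base_size or not fanouts:
        result = data[0]
        for x in data[1:]:
            result = combine(result, x)
        return result
    r, rest = fanouts[0], fanouts[1:]          # R for this level, then deeper levels
    chunk = -(-len(data) // r)                 # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    partials = [multiway_reduce(p, combine, rest, base_size) for p in parts]
    return multiway_reduce(partials, combine, rest, base_size)

print(multiway_reduce(list(range(100)), lambda a, b: a + b, fanouts=[8, 4]))  # 4950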

Publisher's Version
