CGO 2017 – Proceedings

Messages from the Chairs
Howdy and welcome to the 2017 International Symposium on Code Generation and Optimization (CGO) in Austin, TX! The hilly terrain on the west side of Austin coupled with technology companies in Austin has earned Austin its unique name as “Silicon Hills,” sparking friendly rivalry with its Silicon Valley sibling. We are excited to have CGO in Austin as it is a premier venue that brings together researchers and practitioners working at the interface of hardware and software on a broad range of optimization and code generation techniques and related issues. The CGO conference spans a variety of topics from purely static to fully dynamic approaches, and from simple software-based methods to specific architectural features and support for code generation and optimization. Since its inception in 2003 CGO has paved the way for several foundational contributions in program compilation and hardware-software co-design.

Report from the Artifact Evaluation Committee
Authors of accepted papers were given the option of participating in the joint CGO-PPoPP Artifact Evaluation (AE). It is intended to encourage researchers to take extra care in conducting experiments in a reproducible way, to package experimental workflows and all related materials for the purposes of making them more broadly available, and ultimately, to enable a fair comparison of published techniques.

Main Research Papers

Shared Memory

Legato: End-to-End Bounded Region Serializability Using Commodity Hardware Transactional Memory
Aritra Sengupta, Man Cao, Michael D. Bond

, and Milind Kulkarni

(Ohio State University, USA; Purdue University, USA)
Shared-memory languages and systems provide strong guarantees only for well-synchronized (data-race-free) programs. Prior work introduces support for memory consistency based on region serializability of executing code regions, but all approaches incur serious limitations such as adding high run-time overhead or relying on complex custom hardware.
This paper explores the potential for leveraging widely available, commodity hardware transactional memory to provide an end-to-end memory consistency model called dynamically bounded region serializability (DBRS). To amortize high per-transaction costs, yet mitigate the risk of unpredictable, costly aborts, we introduce dynamic runtime support called Legato that executes multiple dynamically bounded regions (DBRs) in a single transaction. Legato varies the number of DBRs per transaction on the fly, based on the recent history of committed and aborted transactions. Legato outperforms existing commodity enforcement of DBRS, and its costs are less sensitive to a program's shared-memory communication patterns. These results demonstrate the potential for providing always-on strong memory consistency using commodity transactional hardware.

Automatic Detection of Extended Data-Race-Free Regions
Alexandra Jimborean, Jonatan Waern, Per Ekemark, Stefanos Kaxiras, and Alberto Ros
(Uppsala University, Sweden; University of Murcia, Spain)
Data-race-free (DRF) parallel programming becomes a standard as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement.
We propose compiler techniques that automatically delineate extended data-race-free regions (xDRF), namely regions of code which provide the same guarantees as the synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges and preserve the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. Our compiler techniques precisely analyze the threads' memory accessing behavior and data sharing in shared-memory, general-purpose parallel applications and can therefore infer the limits of xDRF code regions.
We evaluate the potential of our technique by employing the xDRF region classification in a state-of-the-art, dual-mode cache coherence protocol. Larger xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.8%) and energy efficiency (11.7%) compared to a standard directory-based coherence protocol.

FinePar: Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures
Feng Zhang, Bo Wu

, Jidong Zhai, Bingsheng He, and Wenguang Chen

(Tsinghua University, China; Colorado School of Mines, USA; National University of Singapore, Singapore)
The integrated architecture that features both CPU and GPU on the same die is an emerging and promising architecture for fine-grained CPU-GPU collaboration. However, the integration also brings forward several programming and system optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization of running irregular applications on integrated architectures. Furthermore, fine-grained co-processing on the CPU and GPU is still an open problem. Particularly, in this paper, we show that the previous workload partitioning for CPU-GPU co-processing is far from ideal in terms of resource utilization and performance. To solve this problem, we propose a system software named FinePar, which considers architectural differences of the CPU and GPU and leverages fine-grained collaboration enabled by integrated architectures. Through irregularity-aware performance modeling and online auto-tuning, FinePar partitions irregular workloads and achieves both device-level and thread-level load balance. We evaluate FinePar with 8 irregular applications on an AMD integrated architecture and compare it with state-of-the-art partitioning approaches. Results show that FinePar demonstrates better resource utilization and achieves an average of 1.38X speedup over the optimal coarse-grained partitioning method.

Info

GPU Optimization

TwinKernels: An Execution Model to Improve GPU Hardware Scheduling at Compile Time
Xiang Gong, Zhongliang Chen, Amir Kavyan Ziabari, Rafael Ubal, and David Kaeli

(Northeastern University, USA)
As throughput-oriented accelerators, GPUs provide tremendous processing power by running a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, leaving the GPU’s resources often under-utilized.
Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks. Assuming an SIMT execution model, GPU applications tend to send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units, such as the wavefront scheduler, are used to manage such requests. However, the scheduler struggles when computational operations are not abundant enough to effectively hide the long latency of memory operations.
In this paper, we propose a Twin Kernel Multiple Thread (TKMT) execution model, a compiler-centric solution that improves hardware scheduling at compile time. TKMT better distributes the burst of memory requests in some of the wavefronts through static instruction scheduling. Our results show that TKMT can offer a 12% average improvement over the baseline SIMT implementation on a variety of benchmarks on AMD Radeon systems.

Taming Warp Divergence
Jayvant Anantpur and R. Govindarajan
(IISc Bangalore, India)
Graphics Processing Units (GPU) are designed to run a large number of threads in parallel. These threads run on Streaming
Multiprocessors (SM) which consist of a few tens of SIMD cores. A kernel is launched on the GPU with an execution
configuration, called grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated
to and de-allocated from SMs at the granularity of a TB, whereas, scheduled and executed as a group of consecutive 32 threads,
called warps. Due to various reasons, such as, different amounts of work, memory access latencies, etc., warps of a TB may
finish the kernel execution at different points in time, causing the faster warps to wait for their slower sibling warps. This, in
effect, reduces the utilization of resources of SMs and hence the performance of the GPU.
We propose a simple and elegant technique to eliminate the waiting time of warps at the end of kernel execution and
improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual warps, and
enables warps finishing earlier to execute the kernel again for another logical (user specified) thread block, without waiting
for their sibling warps. We propose simple source to source transformations to use virtual thread blocks and virtual warps.
Further, this technique enables us to design a warp scheduling algorithm that is aware of the progress made by the virtual
thread blocks and virtual warps, and uses this knowledge to prioritise warps effectively. Evaluation on a diverse set of
kernels from Rodinia, Parboil and GPGPU-SIM benchmark suites on the GPGPU-Sim simulator showed a geometric mean
improvement of 1.06x over the baseline architecture that uses Greedy Then Old (GTO) warp scheduler and 1.09x over Loose
Round Robin (LRR) warp scheduler.

Dynamic Buffer Overflow Detection for GPGPUs
Christopher Erb, Mike Collins, and Joseph L. Greathouse
(AMD Research, USA)
Buffer overflows are a common source of program crashes, data corruption, and security problems. In this work, we demonstrate that GPU-based workloads can also cause buffer overflows, a problem that was traditionally ignored because CPUs and GPUs had separate memory spaces. Modern GPUs share virtual, and sometimes physical, memory with CPUs, meaning that GPU-based buffer overflows are capable of producing the same program crashes, data corruption, and security problems as CPU-based overflows. While there are many tools to find buffer overflows in CPU-based applications, the shift towards GPU-enhanced programs has expanded the problem beyond their capabilities.
This paper describes a tool that uses canaries to detect buffer overflows caused by GPGPU kernels. It wraps OpenCL API calls and alerts users to any kernel that writes outside of a memory buffer. We study a variety of optimizations, including using the GPU to perform the canary checks, which allow our tool to run at near application speeds. The resulting runtime overhead, which scales with the number of buffers used by the kernel, is 14% across 175 applications in 16 GPU benchmark suites. In these same suites, we found 13 buffer overflows in 7 benchmarks.

Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation
Michel Steuwer

, Toomas Remmelg, and Christophe Dubach
(University of Edinburgh, UK)
Parallel patterns (e.g., map, reduce) have gained traction as an abstraction for targeting parallel accelerators and are a promising answer to the performance portability problem. However, compiling high-level programs into efficient low- level parallel code is challenging. Current approaches start from a high-level parallel IR and proceed to emit GPU code directly in one big step. Fixed strategies are used to optimize and map parallelism exploiting properties of a particular GPU generation leading to performance portability issues.
We introduce the Lift IR, a new data-parallel IR which encodes OpenCL-specific constructs as functional patterns. Our prior work has shown that this functional nature simplifies the exploration of optimizations and mapping of parallelism from portable high-level programs using rewrite-rules.
This paper describes how Lift IR programs are compiled into efficient OpenCL code. This is non-trivial as many performance sensitive details such as memory allocation, array accesses or synchronization are not explicitly represented in the Lift IR. We present techniques which overcome this challenge by exploiting the pattern’s high-level semantics. Our evaluation shows that the Lift IR is flexible enough to express GPU programs with complex optimizations achieving performance on par with manually optimized code.

Best Paper Nominees

Synthesizing Benchmarks for Predictive Modeling
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather
(University of Edinburgh, UK; Lancaster University, UK)
Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space.
We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code.
We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27×. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30×.

Info

Formalizing the Concurrency Semantics of an LLVM Fragment
Soham Chakraborty and Viktor Vafeiadis

(MPI-SWS, Germany)
The LLVM compiler follows closely the concurrency model of C/C++ 2011, but with a crucial difference. While in C/C++ a data race between a non-atomic read and a write is declared to be undefined behavior, in LLVM such a race has defined behavior: the read returns the special `undef' value. This subtle difference in the semantics of racy programs has profound consequences on the set of allowed program transformations, but it has been not formally been studied before.
This work closes this gap by providing a formal memory model for a substantial fragment of LLVM and showing that it is correct as a concurrency model for a compiler intermediate language: (1) it is stronger than the C/C++ model, (2) weaker than the known hardware models, and (3) supports the expected program transformations. In order to support LLVM's semantics for racy accesses, our formal model does not work on the level of single executions as the hardware and the C/C++ models do, but rather uses more elaborate structures called event structures.

ThinLTO: Scalable and Incremental LTO
Teresa Johnson, Mehdi Amini, and Xinliang David Li

(Google, USA; Apple, USA)
Cross-Module Optimization (CMO) is an effective means for improving runtime performance, by extending the scope of optimizations across source module boundaries. Two CMO approaches are Link-Time Optimization (LTO) and Lightweight Inter-Procedural Optimization (LIPO). However, each of these solutions has limitations that prevent it from being enabled by default. ThinLTO is a new approach that attempts to address these limitations, with a goal of being enabled more broadly. ThinLTO aims to be as scalable as a regular non-LTO build, enabling CMO on large applications and machines without large memory configurations, while also integrating well with distributed and incremental build systems. This is achieved through fast purely summary-based Whole-Program Analysis (WPA), the only serial step, without reading or writing the program’s Intermediate Representation (IR). Instead, CMO is applied during fully parallel optimization backends.
This paper describes the motivation behind ThinLTO, its overall design, and current implementation in LLVM. Results from SPEC cpu2006 benchmarks and several large real-world applications illustrate that ThinLTO can scale as well as a non-LTO build while enabling most of the CMO performed with a full LTO build.

Automatic Generation of Fast BLAS3-GEMM: A Portable Compiler Approach
Xing Su, Xiangke Liao, and Jingling Xue

(National University of Defense Technology, China; UNSW, Australia)
GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). Therefore, either performance or portability suffers.
We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in an architecture-independent manner, without involving domain experts. The key insight is to leverage a wide range of architecture-specific abstractions already available in LLVM, by first generating a vectorized micro-kernel in the architecture-independent LLVM IR and then improving its performance by applying a series of domain-specific yet architecture-independent optimizations. The optimized micro-kernel drops easily in existing GEMM frameworks such as BLIS and OpenBLAS. Validation focuses on optimizing GEMM in double precision on two architectures. On Intel Sandybridge and AArch64 Cortex-A57, Poca’s micro-kernels outperform expert-crafted assembly code by 2.35% and 7.54%, respectively, and both BLIS and OpenBLAS achieve competitive or better performance once their micro-kernels are replaced by Poca’s.

Memory Dependencies

Pointer Disambiguation via Strict Inequalities
Maroua Maalej, Vitor Paisante, Pedro Ramos, Laure Gonnord

, and Fernando Magno Quintão Pereira

(University of Lyon, France; LIP, France; Federal University of Minas Gerais, Brazil)
The design and implementation of static analyses that disambiguate pointers has been a focus of research since the early days of compiler construction. One of the challenges that arise in this context is the analysis of languages that support pointer arithmetics, such as C, C++ and assembly dialects. This paper contributes to solve this challenge. We start from an obvious, yet unexplored, observation: if a pointer is strictly less than another, they cannot alias. Motivated by this remark, we use abstract interpretation to build strict less-than relations between pointers. We construct a program representation that bestows the Static Single Information (SSI) property onto our dataflow analysis. SSI gives us a sparse algorithm, whose correctness is easy to ensure. We have implemented our static analysis in LLVM. It runs in time linear on the number of program variables, and, depending on the benchmark, it can be as much as six times more precise than the pointer disambiguation techniques already in place in that compiler.

Info

A Collaborative Dependence Analysis Framework
Nick P. Johnson, Jordan Fix, Stephen R. Beard, Taewook Oh, Thomas B. Jablin, and David I. August

(Princeton University, USA; University of Illinois at Urbana-Champaign, USA; Multicoreware, USA)
Compiler optimizations discover facts about program behavior by querying static analysis. However, developing or extending precise analysis is difficult. Some prior works implement analysis with a single algorithm, but the algorithm becomes more complex as it is extended for greater precision. Other works achieve modularity by implementing several simple algorithms and trivially composing them to report the best result from among them. Such a modular approach has limited precision because it employs only one algorithm in response to one query, without synergy between algorithms. This paper presents a framework for dependence analysis algorithms to collaborate and achieve precision greater than the trivial combination of those algorithms. With this framework, developers can achieve the high precision of complex analysis algorithms through collaboration of simple and orthogonal algorithms, without sacrificing the ease of implementation of the modular approach. Results demonstrate that collaboration of simple analyses enables advanced compiler optimizations.

Characterizing Data Organization Effects on Heterogeneous Memory Architectures
Apan Qasem, Ashwin M. Aji, and Gregory Rodgers
(Texas State University, USA; AMD Research, USA)
Layout and placement of shared data structures is critical to achieving scalable performance on heterogeneous memory architectures. While recent research has established the importance of data organization and developed mechanisms for data layout conversion, a general strategy for when to make layout changes and where to map data segments in a heterogeneous environment, has not yet emerged.

In this paper, we present a cross-platform study that characterizes the performance impact of data organization on candidate HPC node architectures with heterogeneous, multi-level memory systems. We systematically explore a multi dimensional space of alternate code variants, identify program attributes that have the most impact on data organization decisions and establish a set of performance imperatives to guide data layout and placement across different architectures. The results of the study provide several new insights. The study shows that the conventional approach of using a structure-of-arrays layout for device-mapped data structures is not always profitable. Results also show that in addition to memory divergence, data layout choices are impacted by a variety of factors including arithmetic intensity and register pressure. We use the results to develop a new data layout strategy, called compressed-array layout that addresses the limitations of current approaches and yields up to an order-of-magnitude speedup on some architectures.

Accelerators and Binary Translation

Clairvoyance: Look-Ahead Compile-Time Scheduling
Kim-Anh Tran, Trevor E. Carlson, Konstantinos Koukos, Magnus Själander, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean
(Uppsala University, Sweden; Norwegian University of Science and Technology, Norway)
To enhance the performance of memory-bound applications, hardware designs have been developed to hide memory latency, such as the out-of-order (OoO) execution engine, at the price of increased energy consumption. Contemporary processor cores span a wide range of performance and energy efficiency options: from fast and power-hungry OoO processors to efficient, but slower in-order processors. The more memory-bound an application is, the more aggressive the OoO execution engine has to be to hide memory latency. This proposal targets the middle ground, as seen in a simple OoO core, which strikes a good balance between performance and energy efficiency and currently dominates the market for mobile, hand-held devices and high-end embedded systems. We show that these simple, more energy-efficient OoO cores, equipped with the appropriate compile-time support, considerably boost the performance of single-threaded execution and reach new levels of performance for memory-bound applications. Clairvoyance generates code that is able to hide memory latency and better utilize the OoO engine, thus delivering higher performance at lower energy. To this end, Clairvoyance overcomes restrictions which yielded conventional compile-time techniques impractical: (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Thus, Clairvoyance achieves a geomean execution time improvement of 7% for memory-bound applications with a conservative approach and 13% with a speculative but safe approach, on top of standard O3 optimizations, while maintaining compute-bound applications’ high-performance.

Phase-Aware Optimization in Approximate Computing
Subrata Mitra, Manish K. Gupta, Sasa Misailovic

, and Saurabh Bagchi
(Purdue University, USA; Microsoft, USA; University of Illinois at Urbana-Champaign, USA)
This paper shows that many applications exhibit execution-phase-specific sensitivity towards approximation of the internal subcomputations. Therefore, approximation in certain phases can be more beneficial than others. Further, this paper presents OPPROX, a novel system for application's execution-phase-aware approximation. For a user provided error budget and target input parameters, OPPROX identifies different program phases and searches for profitable approximation settings for each phase of the application execution. Our evaluation with five benchmarks and four existing transformations show that our phase-aware optimization on average does 14% less work for a 5% error tolerance bound and 42% less work for a 20% tolerance bound.

A Space- and Energy-Efficient Code Compression/Decompression Technique for Coarse-Grained Reconfigurable Architectures
Bernhard Egger

, Hochan Lee, Duseok Kang, Mansureh S. Moghaddam, Youngchul Cho, Yeonbok Lee, Sukjin Kim, Soonhoi Ha, and Kiyoung Choi
(Seoul National University, South Korea; Samsung Electronics, South Korea)
We present an effective code compression technique to reduce the area and energy overhead of the configuration memory for coarse-grained reconfigurable architectures (CGRA). Based on a statistical analysis of existing code, the proposed method reorders the storage locations of the reconfigurable entities and splits the wide configuration memory into a number of partitions. Code compression is achieved by removing consecutive duplicated lines in each partition. Compressibility is increased by an optimization phase in the compiler. The optimization minimizes the number of configuration changes for individual reconfigurable entities. Decompression is performed by a simple hardware decoder logic that is able to decode lines with no additional latency and negligible area overhead. Experiments with over 190 loop kernels from different application domains show that the proposed method achieves a memory reduction of over 40% on average with four partitions. The compressibility of yet unseen code is only slightly lower with 35% on average. In addition, executing compressed code results in a 22 to 47% reduction in the configuration logic's energy consumption.

Cross-ISA Machine Emulation for Multicores
Emilio G. Cota, Paolo Bonzini, Alex Bennée, and Luca P. Carloni
(Columbia University, USA; Red Hat, Italy; Linaro, UK)
Speed, portability and correctness have traditionally been the main requirements for dynamic binary translation (DBT) systems. Given the increasing availability of multi-core machines as both emulation guests and hosts, scalability has emerged as an additional design objective. It has however been an elusive goal for two reasons: contention on common data structures such as the translation cache is difficult to avoid without hurting performance, and instruction set architecture (ISA) disparities between guest and host (such as mismatches in the memory consistency model and the semantics of atomic operations) can compromise correctness.

In this paper we address these challenges in a simple and memory-efficient way, demonstrating a multi-threaded DBT-based emulator that scales in an architecture-independent manner. Furthermore, we explore the trade-offs that exist when emulating atomic operations across ISAs, and present a novel approach for correct and scalable emulation of load-locked/store-conditional instructions based on hardware transactional memory (HTM). By adding around 1000 lines of code to QEMU, we demonstrate the scalability of both user-mode and full-system emulation on a 64-core x86_64 host running x86_64 guest code, and a 12-core, 96-thread POWER8 host running x86_64 and Aarch64 guest code.

Feedback Directed and Whole Program Optimization

Incremental Whole Program Optimization and Compilation
Patrick W. Sathyanathan, Wenlei He, and Ten H. Tzen
(Microsoft, USA)
Most modern compilers for statically typed high level languages perform whole program optimization and compilation in order to generate highly optimized code. Whole program analysis and optimization has the advantage of providing the compiler with visibility to the entire program. This allows information to be propagated across procedures to improve the quality of the generated code. The main disadvantage of existing systems with this capability however, is that recompiling a program after an edit incurs the cost of reanalyzing and regenerating code for all functions.
Compiler throughput is crucial for developers' everyday edit-compile-test cycle as well as build lab's automated rolling build. Improving the throughput of such builds significantly improves developer productivity and lab efficiency.
We present a practical and extensible framework for incremental whole program optimization (WPO) and compilation. It uses two program abstractions, namely a dependence graph representing the program entities that the code for each function depends upon, and lattices of data-flow information that affect the code generated. These abstractions are used to minimize the number of functions that must be reanalyzed and recompiled after a program edit. The framework uses a simple and fast checksum technique for detecting edits to functions and variables, and experimental evidence that this technique works well in practice. We also present a novel mechanism for achieving the no code generation diff requirement in the presence of multi-level inline expansion.
The system has been successfully implemented in the state-of-the-art commercial quality Visual C/C++ compiler and achieves up to 7X compilation speedup for typical edits.

Optimizing Function Placement for Large-Scale Data-Center Applications
Guilherme Ottoni and Bertrand Maher
(Facebook, USA)
Modern data-center applications often comprise a large amount of code, with substantial working sets, making them good candidates for code-layout optimizations. Although recent work has evaluated the impact of profile-guided intra-module optimizations and some cross-module optimizations, no recent study has evaluated the benefit of function placement for such large-scale applications. In this paper, we study the impact of function placement in the context of a simple tool we created that uses sample-based profiling data. By using sample-based profiling, this methodology follows the same principle behind AutoFDO, i.e. using profiling data collected from unmodified binaries running in production, which makes it applicable to large-scale binaries. Using this tool, we first evaluate the impact of the traditional Pettis-Hansen ‍(PH) function-placement algorithm on a set of widely deployed data-center applications. Our experiments show that using the PH algorithm improves the performance of the studied applications by an average of 2.6%. In addition to that, this paper also evaluates the impact of two improvements on top of the PH technique. The first improvement is a new algorithm, called C³, which addresses a fundamental weakness we identified in the PH algorithm. We not only qualitatively illustrate how C³ overcomes this weakness in PH, but also present experimental results confirming that C³ performs better than PH in practice, boosting the performance of our workloads by an average of 2.9% on top of PH. The second improvement we evaluate is the selective use of huge pages. Our evaluation shows that, although aggressively mapping the entire code section of a large binary onto huge pages can be detrimental to performance, judiciously using huge pages can further improve performance of our applications by 2.0% on average.

Minimizing the Cost of Iterative Compilation with Active Learning
William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, and Hugh Leather
(University of Edinburgh, UK; Lancaster University, UK)
Since performance is not portable between platforms, engineers must fine-tune heuristics for each processor in turn. This is such a laborious task that high-profile compilers, supporting many architectures, cannot keep up with hardware innovation and are actually out-of-date. Iterative compilation driven by machine learning has been shown to be efficient at generating portable optimization models automatically. However, good quality models require costly, repetitive, and extensive training which greatly hinders the wide adoption of this powerful technique.
In this work, we show that much of this cost is spent collecting training data, runtime measurements for different optimization decisions, which contribute little to the final heuristic. Current implementations evaluate randomly chosen, often redundant, training examples a pre-configured, almost always excessive, number of times – a large source of wasted effort. Our approach optimizes not only the selection of training examples but also the number of samples per example, independently. To evaluate, we construct 11 high-quality models which use a combination of optimization settings to predict the runtime of benchmarks from the SPAPT suite. Our novel, broadly applicable, methodology is able to reduce the training overhead by up to 26x compared to an approach with a fixed number of sample runs, transforming what is potentially months of work into days.

Removing Checks in Dynamically Typed Languages through Efficient Profiling
Gem Dot, Alejandro Martínez, and Antonio González
(Universitat Politècnica de Catalunya, Spain; ARM, UK)
Dynamically typed languages increase programmer’s productivity at the expense of some runtime overheads to manage the types of variables, since they are not declared at compile time and can change at runtime. One of the most important overheads is due to very frequent checks that are introduced in the specialized code to identify the type of the variables.
In this paper, we present a HW/SW hybrid mechanism that allows the removal of checks executed in the optimized code by performing a HW profiling of the types of object variables. To demonstrate the benefits of the proposed technique, we implement it in a JavaScript engine and show that it produces 7.1% speedup on average for optimized JavaScript code (up to 34% for some applications) and 6.5% energy reduction.

Reductions and Loops

Discovery and Exploitation of General Reductions: A Constraint Based Approach
Philip Ginsbach and Michael F. P. O'Boyle

(University of Edinburgh, UK)
Discovering and exploiting scalar reductions in programs has been studied for many years. The discovery of more complex reduction operations has, however, received less attention. Such reductions contain compile-time unknown parameters, indirect memory accesses and dynamic control flow, which are challenging for existing approaches. In this paper we develop a new compiler based approach that automatically detects a wide class of reductions. This approach is based on a constraint formulation of the reduction idiom and has been implemented as an LLVM pass. We use a custom constraint solver to identify program subsets that adhere to the constraint specification. Once discovered, we automatically generate parallel code to exploit the reduction. This approach is robust and was evaluated on C versions of three well known benchmark suites: NAS, Parboil and Rodinia. We detected 84 scalar reductions and 6 histograms, outperforming existing approaches. We show that exploiting histograms gives significant performance improvement.

Parallel Associative Reductions in Halide
Patricia Suriana, Andrew Adams, and Shoaib Kamil
(Google, USA; Adobe, USA)
Halide is a domain-specific language for fast image processing that separates pipelines into the algorithm, which defines what values are computed, and the schedule, which defines how they are computed. Changes to the schedule are guaranteed to not change the results. While Halide supports parallelizing and vectorizing naturally data-parallel operations, it does not support the same scheduling for reductions. Instead, the programmer must create data parallelism by manually factoring reductions into multiple stages. This manipulation of the algorithm can introduce bugs, impairs readability and portability, and makes it impossible for automatic scheduling methods to parallelize reductions.
We describe a new Halide scheduling primitive rfactor which moves this factoring transformation into the schedule, as well as a novel synthesis-based technique that takes serial Halide reductions and synthesizes an equivalent binary associative reduction operator and its identity. This enables us to automatically replace the original pipeline stage with a pair of stages which first compute partial results over slices of the reduction domain, and then combine them. Our technique permits parallelization and vectorization of Halide algorithms which previously required manipulating both the algorithm and schedule.

Optimistic Loop Optimization
Johannes Doerfert, Tobias Grosser, and Sebastian Hack

(Saarland University, Germany; ETH Zurich, Switzerland)
Compilers use static analyses to justify program optimizations. As every optimization must preserve the semantics of the original program, static analysis typically fall-back to conservative approximations. Consequently, the set of states for which the optimization is invalid is overapproximated and potential optimization opportunities are missed. Instead of justifying the optimization statically, a compiler can also synthesize preconditions that imply the correctness of the optimizations and are checked at the runtime of the program.
In this paper, we present a framework to collect, generalize, and simplify assumptions based on Presburger arithmetic. We introduce different assumptions necessary to enable a variety of complex loop transformations and derive a (close to) minimal set of preconditions to validate them at runtime. Our evaluation shows that the runtime verification introduces negligible overhead and that the assumptions we propose almost always hold true. On a large benchmark set including SPEC and NPB our technique increases the number of modeled non-trivial loop nests by a factor of 3.9×.

Software Prefetching for Indirect Memory Accesses
Sam Ainsworth and Timothy M. Jones

(University of Cambridge, UK)
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before being required. However, these are difficult to insert to effectively improve performance, and techniques for automatic insertion are currently limited.
This paper develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which gain benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× and 1.1× for an Intel Haswell processor and an ARM Cortex-A57, both out-of-order cores, and performance improvements of 2.1× and 3.7× for the in-order ARM Cortex-A53 and Intel Xeon Phi.

CGO 2017 – Proceedings

Frontmatter

Main Research Papers

Shared Memory

GPU Optimization

Best Paper Nominees

Memory Dependencies

Accelerators and Binary Translation

Feedback Directed and Whole Program Optimization

Reductions and Loops