CGO 2018
2018 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
Powered by
Conference Publishing Consulting

2018 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), February 24–28, 2018, Vienna, Austria

CGO 2018 – Proceedings

Contents - Abstracts - Authors


Title Page

Message from the General Chairs
Welcome to the 16th International Symposium on Code Generation and Optimization (CGO), held at the Austria Trend Eventhotel Pyramide, located just to the south of Vienna, Austria, February 24-28, 2018. CGO provides a premier venue to bring together researchers and practitioners working at the interface of hardware and software on a wide range of optimization and code generation techniques and related issues. The conference spans the spectrum from purely static to fully dynamic approaches, and from pure software-based methods to specific architectural features and support for code generation and optimization.

Message from the Program Chairs
We would like to welcome you to CGO 2018 in the beautiful city of Vienna. On behalf of the Program Committee, we are delighted to present the program for the 2018 International Symposium on Code Generation and Optimization. We hope that you find it informative and stimulating.

CGO 2018 Organization
Committee Listings

Report from the Artifact Evaluation Committee
Authors of accepted papers were given the opportunity to participate in the artifact evaluation process by submitting a research artifact. ACM defines an artifact as "a digital object that was either created by the authors to be used as part of the study or generated by the experiment itself". The artifact evaluation process checks if the submitted artifact supports the claims made in the paper. Ultimately, artifact evaluation is intended to encourage researchers to take extra care in conducting experiments in a reproducible way and to package experimental workflows and all related materials for making them accessible for others.

Sponsors and Supporters
Sponsors and Supporters

Student Research Competition
As in previous years, the 2018 International Symposium on Code Generation and Optimization (CGO) hosts the ACM Student Research Competition (SRC). We received eight poster abstracts, of which six were selected to compete in the ACM SRC. Out of these six submissions, four were selected in the graduate category and two were selected in the undergraduate category. All submissions were reviewed by five members of the selection committee. Each reviewer attributed up to 30 points to a submission and provided a short comment. The submissions were then ranked according to their total score. We would like to thank the General Co-Chairs and Program Co-Chairs for helping with our inquiries and for their initiative during the preparations of the SRC. Special thanks go to the members of the selection committee for reviewing all the abstracts on time and to all the authors who submitted a poster abstract, as well as all students who participated in the SRC.


Biological Computation (Keynote)
Sara-Jane Dunn
(Microsoft Research, UK)
Unlike engineered systems, living cells self-generate, self-organise and self-repair, they undertake massively parallel operations with slow and noisy components in a noisy environment, they sense and actuate at molecular scales, and most intriguingly, they blur the line between software and hardware. Understanding this biological computation presents a huge challenge to the scientific community. Yet the ultimate destination and prize at the culmination of this scientific journey is the promise of revolutionary and transformative technology: the rational design and implementation of biological function, or more succinctly, the ability to program life.

Publisher's Version Article Search

Managed Runtimes

SIMD Intrinsics on Managed Language Runtimes
Alen Stojanov, Ivaylo Toskov, Tiark Rompf, and Markus Püschel
(ETH Zurich, Switzerland; Purdue University, USA)
Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Replicated
CollectionSwitch: A Framework for Efficient and Dynamic Collection Selection
Diego Costa and Artur Andrzejak
(University of Heidelberg, Germany)
Selecting collection data structures for a given application is a crucial aspect of the software development. Inefficient usage of collections has been credited as a major cause of performance bloat in applications written in Java, C++ and C#. Furthermore, a single implementation might not be optimal throughout the entire program execution. This demands an adaptive solution that adjusts at runtime the collection implementations to varying workloads.
We present CollectionSwitch, an application-level framework for efficient collection adaptation. It selects at runtime collection implementations in order to optimize the execution and memory performance of an application. Unlike previous works, we use workload data on the level of collection allocation sites to guide the optimization process. Our framework identifies allocation sites which instantiate suboptimal collection variants, and selects optimized variants for future instantiations. As a further contribution we propose adaptive collection implementations which switch their underlying data structures according to the size of the collection.
We implement this framework in Java, and demonstrate the improvements in terms of time and memory behavior across a range of benchmarks. To our knowledge, it is the first approach which is capable of runtime performance optimization of Java collections with very low overhead.

Publisher's Version Article Search
Analyzing and Optimizing Task Granularity on the JVM
Andrea Rosà, Eduardo Rosales, and Walter Binder
(University of Lugano, Switzerland)
Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, resulting in missed parallelization opportunities. In this paper, we provide a better understanding of task granularity for applications running on a Java Virtual Machine. We present a novel profiler which measures the granularity of every executed task. Our profiler collects carefully selected metrics from the whole system stack with only little overhead, and helps the developer locate performance problems. We analyze task granularity in the DaCapo and ScalaBench benchmark suites, revealing several inefficiencies related to fine-grained and coarse-grained tasks. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in two benchmarks, achieving speedups up to 1.53x.

Publisher's Version Article Search

Resilience and Security

Automating Efficient Variable-Grained Resiliency for Low-Power IoT Systems
Sara S. Baghsorkhi and Christos Margiolas
(Intel, USA)
New trends in edge computing encourage pushing more of the compute and analytics to the outer edge and processing most of the data locally. We explore how to transparently provide resiliency for heavy duty edge applications running on low-power devices that must deal with frequent and unpredictable power disruptions. Complicating this process further are (a) memory usage restrictions in tiny low-power devices, that affect not only performance but efficacy of the resiliency techniques, and (b) differing resiliency requirements across deployment environments. Nevertheless, an application developer wants the ability to write an application once, and have it be reusable across all low-power platforms and across all different deployment settings. In response to these challenges, we have devised a transparent roll-back recovery mechanism that performs incremental checkpoints with minimal execution time overhead and at variable granularities. Our solution includes the co-design of firmware, runtime and compiler transformations for providing seamless fault-tolerance, along with an auto-tuning layer that automatically generates multiple resilient variants of an application. Each variant spreads application’s execution over atomic transactional regions of a certain granularity. Variants with smaller regions provide better resiliency, but incur higher overhead; thus, there is no single best option, but rather a Pareto optimal set of configurations. We apply these strategies across a variety of edge device applications and measure the execution time overhead of the framework on a TI MSP430FR6989. When we restrict unin- terrupted atomic intervals to 100ms, our framework keeps geomean overhead below 2.48x.

Publisher's Version Article Search
Resilient Decentralized Android Application Repackaging Detection Using Logic Bombs
Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, and Zhoujun Li
(Temple University, USA; University of South Carolina, USA; University of California at Riverside, USA; Beihang University, China)
Application repackaging is a severe threat to Android users and the market. Existing countermeasures mostly detect repackaging based on app similarity measurement and rely on a central party to perform detection, which is unscalable and imprecise. We instead consider building the detection capability into apps, such that user devices are made use of to detect repackaging in a decentralized fashion. The main challenge is how to protect repackaging detection code from attacks. We propose a creative use of logic bombs, which are regularly used in malware, to conquer the challenge. A novel bomb structure is invented and used: the trigger conditions are constructed to exploit the differences between the attacker and users, such that a bomb that lies dormant on the attacker side will be activated on one of the user devices, while the repackaging detection code, which is packed as the bomb payload, is kept inactive until the trigger conditions are satisfied. Moreover, the repackaging detection code is woven into the original app code and gets encrypted; thus, attacks by modifying or deleting suspicious code will corrupt the app itself. We have implemented a prototype, named BombDroid, that builds the repackaging detection into apps through bytecode instrumentation, and the evaluation shows that the technique is effective, efficient, and resilient to various adversary analysis including symbol execution, multi-path exploration, and program slicing.

Publisher's Version Article Search
nAdroid: Statically Detecting Ordering Violations in Android Applications
Xinwei Fu, Dongyoon Lee, and Changhee Jung
(Virginia Tech, USA)
Modern mobile applications use a hybrid concurrency model. In this model, events are handled sequentially by event loop(s), and long-running tasks are offloaded to other threads. Concurrency errors in this hybrid concurrency model can take multiple forms: traditional atomicity and ordering violations between threads, as well as ordering violations between event callbacks on a single event loop.
This paper presents nAdroid, a static ordering violation detector for Android applications. Using our threadification technique, nAdroid statically models event callbacks as threads. Threadification converts ordering violations between event callbacks into ordering violations between threads, after which state-of-the-art thread-based race detection tools can be applied. nAdroid then applies a combination of sound and unsound filters, based on the Android concurrency model and its happens-before relation, to prune out false and benign warnings.
We evaluated nAdroid with 27 open source Android applications. Experimental results show that nAdroid detects 88 (at least 58 new) harmful ordering violations, and outperforms the state-of-the-art static technique with fewer false negatives and false positives.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Replicated
SGXElide: Enabling Enclave Code Secrecy via Self-Modification
Erick Bauman, Huibo Wang, Mingwei Zhang, and Zhiqiang Lin
(University of Texas at Dallas, USA; Intel Labs, USA)
Intel SGX provides a secure enclave in which code and data are hidden from the outside world, including privileged code such as the OS or hypervisor. However, by default, enclave code prior to initialization can be disassembled and therefore no secrets can be embedded in the binary. This is a problem for developers wishing to protect code secrets. This paper introduces SGXElide, a nearly-transparent framework that enables enclave code confidentiality. The key idea is to treat program code as data and dynamically restore secrets after an enclave is initialized. SGXElide can be integrated into any enclave, providing a mechanism to securely decrypt or deliver the secret code with the assistance of a developer-controlled trusted remote party. We have implemented SGXElide atop a recently released version of the Linux SGX SDK, and our evaluation with a number of programs shows that SGXElide can be used to protect the code secrecy of practical applications with no overhead after enclave initialization.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated

Best Paper Finalists

Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding
Feng Zhang and Jingling Xue
(UNSW, Australia)
We propose POKER, a permutation-based vectorization approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, POKER represents a general-purpose solution for vectorizing the queries over indexing trees on multi-core processors equipped with SIMD units. For a set of five representative benchmarks evaluated with 24 configurations each, POKER outperforms the state-of-the-art by 2.11x with one single thread and 2.28x with eight threads on an Intel Broadwell processor that supports 256-bit AVX2, on average.

Publisher's Version Article Search Artifacts Available Artifacts Reusable Results Replicated
High Performance Stencil Code Generation with Lift
Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach
(University of Münster, Germany; University of Edinburgh, UK; University of Glasgow, UK)
Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers.
Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains.
This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.

Publisher's Version Article Search Info Artifacts Available Artifacts Reusable Results Replicated
Qubit Allocation
Marcos Yukio Siraichi, Vinícius Fernandes dos Santos, Sylvain Collange, and Fernando Magno Quintao Pereira
(Federal University of Minas Gerais, Brazil; Inria, France; University of Rennes, France; CNRS, France; IRISA, France)
In May of 2016, IBM Research has made a quantum processor available in the cloud to the general public. The possibility of programming an actual quantum device has elicited much enthusiasm. Yet, quantum programming still lacks the compiler support that modern programming languages enjoy today. To use universal quantum computers like IBM's, programmers must design low-level circuits. In particular, they must map logical qubits into physical qubits that need to obey connectivity constraints. This task resembles the early days of programming, in which software was built in machine languages. In this paper, we formally introduce the qubit allocation problem and provide an exact solution to it. This optimal algorithm deals with the simple quantum machinery available today; however, it cannot scale up to the more complex architectures scheduled to appear. Thus, we also provide a heuristic solution to qubit allocation, which is faster than the current solutions already implemented to deal with this problem.

Publisher's Version Article Search Info Artifacts Available Artifacts Reusable Results Replicated
Dominance-Based Duplication Simulation (DBDS): Code Duplication to Enable Compiler Optimizations
David Leopoldseder, Lukas Stadler, Thomas Würthinger, Josef Eisl, Doug Simon, and Hanspeter Mössenböck
(JKU Linz, Austria; Oracle Labs, Austria; Oracle Labs, Switzerland)
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control-flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks to their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time intensive analysis. This imposes a challenge on modern (just-in-time) compilers: Duplicating instructions tentatively at every control flow merge is not feasible because excessive duplication leads to uncontrolled code growth and compile time increases. Therefore, compilers need to find out whether a duplication is beneficial enough to be performed.
This paper proposes a novel approach to determine which duplication operations should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade-off between different success metrics including peak performance, code size and compile time.
We implemented the approach on top of the GraalVM and evaluated it with the benchmarks Java DaCapo, Scala DaCapo, JavaScript Octane and a micro-benchmark suite, in terms of performance, compilation time and code size increase.
We show that our optimization can reach peak performance improvements of up to 40% with a mean peak performance increase of 5.89%, while it generates a mean code size increase of 9.93% and mean compile time increase of 18.44%.

Publisher's Version Article Search

Linear Algebra and Vectorization

The Generalized Matrix Chain Algorithm
Henrik Barthels, Marcin Copik, and Paolo Bientinesi
(RWTH Aachen University, Germany)
In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product M := A1 A2An that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 9. The motivation for this work comes from the fact that—despite great advances in the development of compilers—the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.

Publisher's Version Article Search
CVR: Efficient Vectorization of SpMV on X86 Processors
Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang
(Institute of Computing Technology at Chinese Academy of Sciences, China; University at Chinese Academy of Sciences, China; College of William and Mary, USA; Princeton University, USA)
Sparse Matrix-vector Multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits SpMV’s parallelism with vectorization operations. Existing work achieves limited locality and vectorization efficiency with large preprocessing overheads. To address this issue, we present the Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. The CVR simultaneously processes multiple rows within the input matrix to increase cache efficiency and separates them into multiple SIMD lanes so as to take the advantage of vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of SpMV, and thus able to deal with various scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches through using 58 scale-free and HPC sparse matrices. Experimental results show that CVR can achieve a speedup up to 1.70 × (1.33× on average) and a speedup up to 1.57× (1.10× on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead compared with state-of-the-art approaches.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated
Look-Ahead SLP: Auto-vectorization in the Presence of Commutative Operations
Vasileios Porpodas, Rodrigo C. O. Rocha, and Luís F. W. Góes
(Intel, USA; University of Edinburgh, UK; PUC-MG, Brazil)
Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.

Publisher's Version Article Search
Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances
Peng Jiang and Gagan Agrawal
(Ohio State University, USA; The Ohio State University, USA)
Irregular applications that involve indirect memory accesses were traditionally considered unsuitable for SIMD processing. Though some progress has been made in recent years, the existing approaches require either expensive data reorganization or favorable input distribution to deliver good performance. In this work, we propose a novel vectorization approach called in-vector reduction that can efficiently accelerate a class of associative irregular applications. This approach exploits associativity in the irregular reductions to resolve the data conflicts within SIMD vectors. We implement in-vector reduction with the new conflict detecting instructions that are supported in Intel AVX-512 instruction set and provide a programming interface to facilitate the vectorization of such associative irregular applications. Compared with previous approaches, in-vector reduction eliminates a large part of the overhead of data reorganization and achieves high SIMD utilization even under adverse input distributions. The evaluation results show that our approach is efficient in vectorizing a diverse set of irregular applications, including graph algorithms, particle simulation codes, and hash-based aggregation. Our vectorization achieves 1.5x to 5.5x speedups over the original sequential codes on a single core of Intel Xeon Phi and outperforms a competing approach, conflict-masking based vectorization, by 1.4x to 11.8x.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated

Static and Dynamic Analysis

Scalable Concurrency Debugging with Distributed Graph Processing
Long Zheng, Xiaofei Liao, Hai Jin, Jieshan Zhao, and Qinggang Wang
(Huazhong University of Science and Technology, China)
Existing constraint-solving-based technique enables an efficient and high-coverage concurrency debugging. Yet, there remains a significant gap between the state of the art and the state of the programming practices for scaling to handle long-running execution of programs.
In this paper, we revisit the scalability problem of state-of-the-art constraint-solving-based technique. Our key insight is that concurrency debugging for many real-world bugs can be turned into a graph traversal problem. We therefore present GraphDebugger, a novel debugging framework to enable the scalable concurrency analysis on program graphs via a tailored graph-parallel analysis in a distributed environment. It is verified that GraphDebugger is more capable than CLAP in reproducing the real-world bugs that involve a complex concurrency analysis. Our extensive evaluation on 7 real-world programs shows that, GraphDebugger (deployed on an 8-node EC2 like cluster) is significantly efficient to reproduce concurrency bugs within 1∼8 minutes while CLAP does so with 1∼30 hours, or even without returning solutions.

Publisher's Version Article Search
Lightweight Detection of Cache Conflicts
Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, and Xu Liu
(College of William and Mary, USA; Pacific Northwest National Laboratory, USA)
In memory hierarchies, caches perform an important role in reducing average memory access latency. Minimizing cache misses can yield significant performance gains. As set-associative caches are widely used in modern architectures, capacity and conflict cache misses co-exist. These two types of cache misses require different optimization strategies. While cache misses are commonly studied using cache simulators, state-of-the-art simulators usually incur hundreds to thousands of times a program's execution runtime. Moreover, a simulator has difficulty in simulating complex real hardware. To overcome these limitations, measurement methods are proposed to directly monitor program execution on real hardware via performance monitoring units. However, existing measurement-based tools either focus on capacity cache misses or do not distinguish capacity and conflict cache misses. In this paper, we design and implement CCProf, a lightweight measurement-based profiler that identifies conflict cache misses and associates them with program source code and data structures. CCProf incurs moderate runtime overhead that is at least an order of magnitude lower than simulators. With the evaluation on a number of representative programs, CCProf is able to guide optimizations on cache conflict misses and obtain nontrivial speedups.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated
CUDAAdvisor: LLVM-Based Runtime Profiling for Modern GPUs
Du Shen, Shuaiwen Leon Song, Ang Li, and Xu Liu
(College of William and Mary, USA; Pacific Northwest National Laboratory, USA)
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools are either limited by insufficient insights or lacking in support across different GPU architectures, runtime and driver versions. This paper presents CUDAAdvisor, a profiling framework to guide code optimization in modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on the profiling results from GPU kernels, such as memory-level analysis (e.g., reuse distance and memory divergence), control flow analysis (e.g., branch divergence) and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and Pascal architecture. We demonstrate several case studies that derive significant insights to guide GPU code optimization for performance improvement.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated
May-Happen-in-Parallel Analysis with Static Vector Clocks
Qing Zhou, Lian Li, Lei Wang, Jingling Xue, and Xiaobing Feng
(Institute of Computing Technology at Chinese Academy of Sciences, China; University at Chinese Academy of Sciences, China; UNSW, Australia)
May-Happen-in-Parallel (MHP) analysis computes whether two statements in a multi-threaded program may execute concurrently or not. It works as a basis for many analyses and optimization techniques of concurrent programs. This paper proposes a novel approach for MHP analysis, by statically computing vector clocks. Static vector clocks extend the classic vector clocks algorithm to handle the complex control flow structures in static analysis, and we have developed an efficient context-sensitive algorithm to compute them. To the best of our knowledge, this is the first attempt to compute vector clocks statically. Using static vector clocks, we can drastically improve the efficiency of existing MHP analyses, without loss of precision: the performance speedup can be up to 1828X, with a much smaller memory footprint (reduced by up to 150X). We have implemented our analysis in a static data race detector, and experimental results show that our MHP analysis can help remove up to 88% of spurious data race pairs.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated

Memory Usage Optimisation

DeLICM: Scalar Dependence Removal at Zero Memory Cost
Michael Kruse and Tobias Grosser
(Inria, France; ETH Zurich, Switzerland)
Increasing data movement costs motivate the integration of polyhedral loop optimizers in the standard flow (-O3) of production compilers. While polyhedral optimizers have been shown to be effective when applied as source-to-source transformation, the single static assignment form used in modern compiler mid-ends makes such optimizers less effective. Scalar dependencies (dependencies carried over a single memory location) are the main obstacle preventing effective optimization. We present DeLICM, a set of transformations which, backed by a polyhedral value analysis, eliminate problematic scalar dependences by 1) relocating scalar memory references to unused array locations and by 2) forwarding computations that otherwise cause scalar dependences. Our experiments show that DeLICM effectively eliminates dependencies introduced by compiler-internal canonicalization passes, human programmers, optimizing code generators, or inlining -- without the need for any additional memory allocation. As a result, polyhedral loop optimizations can be better integrated into compiler pass pipelines which is essential for metaprogramming optimization.

Publisher's Version Article Search Info Artifacts Available Artifacts Functional Results Replicated
Loop Transformations Leveraging Hardware Prefetching
Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers
(Eindhoven University of Technology, Netherlands)
Memory-bound applications heavily depend on the bandwidth of the system in order to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not a trivial task, due to the fact that most of them alter the memory access pattern of the application and as a result interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized stressing its temporal or spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.

Publisher's Version Article Search
Transforming Loop Chains via Macro Dataflow Graphs
Eddie C. Davis, Michelle Mills Strout, and Catherine Olschanowsky
(Boise State University, USA; University of Arizona, USA)
This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized.
We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.

Publisher's Version Article Search
Local Memory-Aware Kernel Perforation
Daniel Maier, Biagio Cosenza, and Ben Juurlink
(TU Berlin, Germany)
Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the quality of approximation with data reconstruction techniques. This paper introduces local memory-aware perforation techniques specifically designed for the acceleration and approximation of GPU kernels. We propose a local memory-aware kernel perforation technique that first skips the loading of parts of the input data from global memory, and later uses reconstruction techniques on local memory to reach higher accuracy while having performance similar to state-of-the-art techniques. Experiments show that our approach is able to accelerate the execution of a variety of applications from 1.6× to 3× while introducing an average error of 6%, which is much smaller than that of other approaches. Results further show how much the error depends on the input data and application scenario, the impact of local memory tuning and different parameter configurations.

Publisher's Version Article Search

Program Generation and Synthesis

AutoPA: Automatically Generating Active Driver from Original Passive Driver Code
Jia-Ju Bai, Yu-Ping Wang, and Shi-Min Hu
(Tsinghua University, China)
Original device drivers are often passive in common operating systems, and they should correctly handle synchronization when concurrently invoked by multiple external threads. However, many concurrency bugs have occurred in drivers due to incautious synchronization. To solve concurrency problems, active driver is proposed to replace original passive driver. An active driver has its own thread and does not need to handle synchronization, thus the occurrence probability of many concurrency bugs can be effectively reduced. But previous approaches of active driver have some limitations. The biggest limitation is that original passive driver code needs to be manually rewritten. In this paper, we propose a practical approach, AutoPA, to automatically generate efficient active driver from original passive driver code. AutoPA uses function analysis and code instrumentation to perform automated driver generation, and it uses an improved active driver architecture to reduce performance degradation. We have evaluated AutoPA on 20 Linux drivers. The results show that AutoPA can automatically and successfully generate usable active drivers from original driver code. And generated active drivers can work normally with or without the synchronization primitives in original driver code. To check the effect of AutoPA on driver reliability, we perform fault injection testing on the generated active drivers, and find that all injected concurrency faults are well tolerated and the drivers can work normally. And the performance of generated active drivers is not obviously degraded compared to original passive drivers.

Publisher's Version Article Search
Synthesizing an Instruction Selection Rule Library from Semantic Specifications
Sebastian Buchwald, Andreas Fried, and Sebastian Hack
(KIT, Germany; Saarland University, Germany)
Instruction selection is the part of a compiler that transforms intermediate representation (IR) code into machine code. Instruction selectors build on a library of hundreds if not thousands of rules. Creating and maintaining these rules is a tedious and error-prone manual process.
In this paper, we present a fully automatic approach to create provably correct rule libraries from formal specifications of the instruction set architecture and the compiler IR. We use a hybrid approach that combines enumerative techniques with template-based counterexample-guided inductive synthesis (CEGIS). Thereby, we overcome several shortcomings of existing approaches, which were not able to handle complex instructions in a reasonable amount of time. In particular, we efficiently model memory operations.
Our tool synthesized a large part of the integer arithmetic rules for the x86 architecture within a few days where existing techniques could not deliver a substantial rule library within weeks. Using the rule library, we generate a prototype instruction selector that produces code on par with a manually-tuned instruction selector. Furthermore, using 63012 test cases generated from the rule library, we identified 29498 rules that both Clang and GCC miss.

Publisher's Version Article Search Info Artifacts Available Artifacts Reusable Results Replicated
Synthesizing Programs That Expose Performance Bottlenecks
Luca Della Toffola, Michael Pradel, and Thomas R. Gross
(ETH Zurich, Switzerland; TU Darmstadt, Germany)
Software often suffers from performance bottlenecks, e.g., because some code has a higher computational complexity than expected or because a code change introduces a performance regression. Finding such bottlenecks is challenging for developers and for profiling techniques because both rely on performance tests to execute the software, which are often not available in practice. This paper presents PerfSyn, an approach for synthesizing test programs that expose performance bottlenecks in a given method under test. The basic idea is to repeatedly mutate a program that uses the method to systematically increase the amount of work done by the method. We formulate the problem of synthesizing a bottleneck-exposing program as a combinatorial search and show that it can be effectively and efficiently addressed using well known graph search algorithms. We evaluate the approach with 147 methods from seven Java code bases. PerfSyn automatically synthesizes test programs that expose 22 bottlenecks. The bottlenecks are due to unexpectedly high computational complexity and due to performance differences between different versions of the same code.

Publisher's Version Article Search
Program Generation for Small-Scale Linear Algebra Applications
Daniele G. Spampinato, Diego Fabregat-Traver, Paolo Bientinesi, and Markus Püschel
(ETH Zurich, Switzerland; RWTH Aachen University, Germany)
We present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications and higher level operations including linear systems solvers, Cholesky and LU factorizations. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with Intel icc and clang with a polyhedral optimizer, as well as library-based and template-based implementations.

Publisher's Version Article Search

Compilation for Specialised Domains

Optimal DNN Primitive Selection with Partitioned Boolean Quadratic Programming
Andrew Anderson and David Gregg
(Trinity College Dublin, Ireland)
Deep Neural Networks (DNNs) require very large amounts of computation, and many different algorithms have been proposed to implement their most expensive layers, each of which has a large number of variants with different trade-offs of parallelism, locality, memory footprint, and execution time. In addition, specific algorithms operate much more efficiently on specialized data layouts.
We state the problem of optimal primitive selection in the presence of data layout transformations, and show that it is NP-hard by demonstrating an embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP). We propose an analytic solution via a PBQP solver, and evaluate our approach experimentally by optimizing several popular DNNs using a library of more than 70 DNN primitives, on an embedded platform and a general purpose platform. We show experimentally that significant gains are possible versus the state of the art vendor libraries by using a principled analytic solution to the problem of primitive selection in the presence of data layout transformations.

Publisher's Version Article Search Artifacts Available Artifacts Functional Results Replicated
Register Allocation for Intel Processor Graphics
Wei-Yu Chen, Guei-Yuan Lueh, Pratik Ashar, Kaiyu Chen, and Buqi Cheng
(Intel, USA; Intel, India)
Register allocation is a well-studied problem, but surprisingly little work has been published on assigning registers for GPU architectures. In this paper we present the register allocator in the production compiler for Intel HD and Iris Graphics. Intel GPUs feature a large byte-addressable register file organized into banks, an expressive instruction set that supports variable SIMD-sizes and divergent control flow, and high spill overhead due to relatively long memory latencies. These distinctive characteristics impose challenges for register allocation, as input programs may have arbitrarily-sized variables, partial updates, and complex control flow. Not only should the allocator make a program spill-free, but it must also reduce the number of register bank conflicts and anti-dependencies. Since compilation occurs in a JIT environment, the allocator also needs to incur little overhead.
To manage compilation overhead, our register allocation framework adopts a hybrid approach that separates the assignment of local and global variables. Several extensions are introduced to the traditional graph-coloring algorithm to support variables with different sizes and to accurately model liveness under divergent branches. Different assignment polices are applied to exploit the trade-offs between minimizing register usage and avoiding bank conflicts and anti-dependencies. Experimental results show our framework produces very few spilling kernels and can improve RA JIT time by up to 4x over pure graph-coloring. Our round-robin and bank-conflict-reduction assignment policies can also achieve up to 20% runtime improvements.

Publisher's Version Article Search
A Compiler for Cyber-Physical Digital Microfluidic Biochips
Christopher Curtis, Daniel Grissom, and Philip Brisk
(University of California at Riverside, USA; Azusa Pacific University, USA)
Programmable microfluidic laboratories-on-a-chip (LoCs) offer the benefits of automation and miniaturization to the life sciences. This paper presents an updated version of the BioCoder language and a fully static (offline) compiler that can target an emerging class of LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of liquid on a 2D electrode grid. The BioCoder language and runtime execution engine leverage advances in sensor integration to enable specification, compilation, and execution of assays (bio-chemical procedures) that feature online decision-making based on sensory data acquired during assay execution. The compiler features a novel hybrid intermediate representation (IR) that interleaves fluidic operations with computations performed on sensor data. The IR extends the traditional notions of liveness and interference to fluidic variables and operations, as needed to target the DMFB, which itself can be viewed as a spatially reconfigurable array. The code generator converts the IR into the following: (1) a set of electrode activation sequences for each basic block in the control flow graph (CFG); (2) a set of computations performed on sensor data, which dynamically determine the result of each control flow operation; and (3) a set of electrode activation sequences for each control flow transfer operation (CFG edge). The compiler is validated using a software simulator which produces animated videos of realistic bioassay execution on a DMFB.

Publisher's Version Article Search Artifacts Available

proc time: 12.99