CC 2016 – Proceedings
25th International Conference on Compiler Construction (CC 2016), March 17–18, 2016, Barcelona, Spain

Frontmatter

Title Page


Messages from the Chairs
This volume contains the proceedings of the 25th International Conference on Compiler Construction (CC), held on March 17–18, 2016, in Barcelona, Spain. The conference was co-located with the International Symposium on Code Generation and Optimization (CGO), the International Symposium on High-Performance Computer Architecture (HPCA), the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), and the 2016 European LLVM Developers Meeting (EuroLLVM).

Organization

Sponsors and Supporters

Keynote Abstract
Axiomatic Programming is the practice of structured Generic Programming à la Stepanov. The overarching idea is that good programming is mathematics: we want our programs to look as much as possible like the mathematical formulation of the algorithms they implement. Effective generic programming at scale therefore requires direct support for expressing axioms and sets of related properties, called concepts, directly in code. I report on progress made over the last decade to bring structured generic programming to the masses via the popular C++ programming language. Some of the results are now directly available as an ISO C++ Technical Specification, and practically in production C++ compilers. Others, still requiring more exploration, are available in the Liz programming language, built on top of the core of C++ semantics, marrying practical proof theory and type theory.

Speculation

Thread-Level Speculation with Kernel Support
Clemens Hammacher, Kevin Streit, Andreas Zeller, and Sebastian Hack
(Saarland University, Germany)
Runtime systems for speculative parallelization can be substantially sped up by implementing them with kernel support. We describe a novel implementation of a thread-level speculation (TLS) system that uses virtual memory to isolate speculative state, implemented in a Linux kernel module. This design choice not only maximizes performance, but also makes it possible to guarantee soundness in the presence of system calls, such as I/O. Its ability to maintain speedups even on programs with frequent mis-speculation significantly extends its usability, for instance in speculative parallelization. We demonstrate the advantage of kernel-based TLS on a number of programs from the Cilk suite, where this approach is superior to the state of the art in every single case (7.28x on average). All systems described in this paper are made available as open source.
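
The kernel-module machinery itself is beyond the scope of an abstract, but the core idea of using virtual memory to isolate speculative state can be sketched at user level with fork(), whose copy-on-write pages give a speculative task a private view of memory. This is only an illustrative analogue (speculate() and validate() are hypothetical placeholders); the paper's system works in kernel space and additionally handles system calls soundly.

```c
/* User-level analogue of VM-isolated speculation (illustrative only).
 * fork() gives the child a copy-on-write view of memory, so all of its
 * speculative writes are private until we decide to commit. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int speculate(void) { return 42; }      /* hypothetical speculative work */
static int validate(int r) { return r == 42; } /* hypothetical validation       */

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {                   /* child: private COW copy of all state */
        int r = speculate();
        _exit(validate(r) ? 0 : 1);   /* exit status signals commit/rollback  */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        puts("speculation validated; a real system would now merge the state");
    else
        puts("mis-speculation: child state is simply discarded");
    return 0;
}
```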

Reducing Memory Buffering Overhead in Software Thread-Level Speculation
Zhen Cao and Clark Verbrugge
(McGill University, Canada)
Software-based, automatic parallelization through Thread-Level Speculation (TLS) has significant practical potential, but also high overhead costs. Traditional "lazy" buffering mechanisms enable strong isolation of speculative threads, but imply large memory overheads, while more recent "eager" mechanisms improve scalability, but are more sensitive to data dependencies and have higher rollback costs. We describe an integrated system that incorporates the best of both designs, automatically selecting the best buffering mechanism. Our approach builds on well-optimized designs for both techniques, and we describe specific optimizations that improve both lazy and eager buffer management as well. We implement our design within MUTLS, a software-TLS system based on the LLVM compiler framework. Results show that we achieve 75% of the geometric-mean performance of OpenMP versions on 9 memory-intensive benchmarks. Application of these optimizations is thus a useful part of the optimization stack needed for effective and practical software TLS.
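
The trade-off between the two buffering disciplines is easiest to see in code. The sketch below is not MUTLS code and uses invented names; it contrasts a lazy side buffer (cheap rollback, writes published only at commit; speculative reads would also have to consult the buffer, which is omitted here) with an eager undo log (cheap commit, writes go in place, rollback replays saved values).

```c
/* Sketch of lazy vs. eager speculative write buffering (illustrative). */
#include <stddef.h>

#define BUF_MAX 1024
typedef struct { int *addr; int val; } Write;

/* Lazy: writes go to a side buffer; commit publishes them. */
typedef struct { Write buf[BUF_MAX]; size_t n; } LazyBuf;

void lazy_write(LazyBuf *b, int *addr, int val) {
    b->buf[b->n++] = (Write){ addr, val };
}
void lazy_commit(LazyBuf *b) {
    for (size_t i = 0; i < b->n; i++) *b->buf[i].addr = b->buf[i].val;
    b->n = 0;
}
void lazy_rollback(LazyBuf *b) { b->n = 0; }   /* cheap: drop the buffer */

/* Eager: writes go in place; an undo log enables rollback. */
typedef struct { Write undo[BUF_MAX]; size_t n; } UndoLog;

void eager_write(UndoLog *l, int *addr, int val) {
    l->undo[l->n++] = (Write){ addr, *addr };  /* save old value first */
    *addr = val;
}
void eager_commit(UndoLog *l) { l->n = 0; }    /* cheap: discard the log */
void eager_rollback(UndoLog *l) {
    while (l->n) { Write w = l->undo[--l->n]; *w.addr = w.val; }
}
```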

Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops
Niall Murphy, Timothy Jones, Robert Mullins, and Simone Campanoni
(University of Cambridge, UK; Northwestern University, USA)
Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improved static data dependence analysis alone cannot reach this additional performance, because the existing loop-carried dependences hold only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism.

Run-Time Techniques

Safe and Flexible Adaptation via Alternate Data Structure Representations
Amlan Kusum, Iulian Neamtiu, and Rajiv Gupta
(University of California at Riverside, USA; New Jersey Institute of Technology, USA)
The choice of data structures is crucial for achieving high performance. For applications that are long-running and/or operate on large data sets, the best choice for main data structures can change multiple times over the course of a single execution. For example, in a graph-processing application where the graph evolves over time, the best data structure for representing the graph may change as the program executes. Similarly, in a database or a key-value store application, with changes in relative frequencies of different types of queries over time, the most efficient data structure changes as well. We introduce an approach that allows applications to adapt to current conditions (input characteristics, operations on data, state) by switching their data structures on-the-fly with little overhead and without the developer worrying about safety or specifying adaptation points (this is handled by our compiler infrastructure). We use our approach on different classes of problems that are compute- and memory-intensive: graph algorithms, database indexing, and two real-world applications, the Memcached object cache and the Space Tyrant online game server. Our results show that off-the-shelf applications can be transformed into adaptive applications with modest programmer effort; that the adaptive versions outperform the original, fixed-representation versions; and that adaptation can be performed on-the-fly safely and with very little runtime overhead.
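
One plausible shape for such on-the-fly switching is an indirection through a table of operations, with representation changes performed at quiescent points. The sketch below is illustrative only (SetOps, set_switch, and rebuild are invented names); the paper's contribution is precisely that the compiler infrastructure finds the safe adaptation points and generates the transfer code automatically.

```c
/* Sketch: switching a set's representation behind a stable interface. */
#include <stdbool.h>
#include <stdlib.h>

typedef struct SetOps {
    void (*insert)(void *repr, int key);
    bool (*lookup)(void *repr, int key);
} SetOps;

typedef struct Set {
    const SetOps *ops;   /* operations of the current representation */
    void *repr;          /* current representation's data            */
} Set;

/* At a quiescent (safe) point, rebuild under the new representation;
 * all subsequent calls dispatch to the new code. */
void set_switch(Set *s, const SetOps *new_ops, void *(*rebuild)(void *old)) {
    void *new_repr = rebuild(s->repr);  /* copy the elements across */
    free(s->repr);
    s->repr = new_repr;
    s->ops = new_ops;
}
```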

Relaxed Dependence Tracking for Parallel Runtime Support
Minjia Zhang, Swarnendu Biswas, and Michael D. Bond
(Ohio State University, USA)
It is notoriously difficult to achieve both correctness and scalability for many shared-memory parallel programs. To improve correctness and scalability, researchers have developed various kinds of parallel runtime support such as multithreaded record & replay and software transactional memory. Existing forms of runtime support slow programs significantly in order to track an execution's cross-thread dependences accurately. This paper investigates the potential for runtime support to hide latency introduced by dependence tracking, by tracking dependences in a relaxed way---meaning that not all dependences are tracked accurately. The key challenge in relaxing dependence tracking is preserving both the program's semantics and the runtime support's guarantees. We present an approach called relaxed dependence tracking (RT) and demonstrate its potential by building two types of RT-based runtime support. Our evaluation shows that RT hides much of the latency incurred by dependence tracking, although RT-based runtime support incurs costs and complexity in order to handle relaxed dependence information. By demonstrating how to relax dependence tracking to hide latency while preserving correctness, this work shows the potential for addressing a key cost of dependence tracking, thus advancing knowledge in the design of parallel runtime support.

Kindergarten Cop: Dynamic Nursery Resizing for GHC
Henrique Ferreiro, Laura Castro, Vladimir Janjic, and Kevin Hammond
(Universidade da Coruña, Spain; University of St. Andrews, UK)
Generational garbage collectors are among the most popular garbage collectors used in programming language runtime systems. Their performance is known to depend heavily on choosing the appropriate size of the area where new objects are allocated (the nursery). In imperative languages, it is usual to make the nursery as large as possible, within the limits imposed by the heap size. Functional languages, however, have quite different memory behaviour. In this paper, we study the effect that the nursery size has on the performance of lazy functional programs, through the interplay between cache locality and the frequency of collections. We demonstrate that, in contrast with imperative programs, having large nurseries is not always the best solution. Based on these results, we propose two novel algorithms for dynamic nursery resizing that aim to achieve a compromise between good cache locality and the frequency of garbage collections. We present an implementation of these algorithms in the state-of-the-art GHC compiler for the functional language Haskell, and evaluate them using an extensive benchmark suite. In the best case, we demonstrate a reduction in total execution times of up to 88.5%, i.e., an 8.7x overall speedup, compared to using the production GHC garbage collector. On average, our technique gives an improvement of 9.3% in overall performance across a standard suite of 63 benchmarks for the production GHC compiler.

Verified Compilation

Verified Construction of Static Single Assignment Form
Sebastian Buchwald, Denis Lohner, and Sebastian Ullrich
(KIT, Germany)
Modern compilers use intermediate representations in static single assignment (SSA) form, which simplifies many optimizations. However, the high implementation complexity of efficient SSA construction algorithms poses a challenge to verified compilers. In this paper, we consider a variant of the recent SSA construction algorithm by Braun et al. that combines simplicity and efficiency, and is therefore a promising candidate to tackle this challenge. We prove the correctness of the algorithm using the theorem prover Isabelle/HOL. Furthermore, we prove that the algorithm constructs pruned SSA form and, in the case of a reducible control flow graph, minimal SSA form. To the best of our knowledge, these are the first formal proofs regarding the quality of an SSA construction algorithm. Finally, we replace the SSA construction of the CompCertSSA project with code extracted by Isabelle's code generator to demonstrate the applicability of our approach to real-world programs.
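
For readers unfamiliar with Braun et al.'s algorithm, the following greatly simplified C sketch shows its recursive core for a single variable on a fully constructed CFG, omitting block sealing and trivial-phi removal. Recording the phi before recursing over predecessors is what breaks cycles at loop headers.

```c
/* Greatly simplified sketch of Braun et al.-style SSA construction:
 * one variable, a fully built CFG, no sealing, no trivial-phi removal.
 * def == 0 means "no definition recorded in this block yet". */
#define MAXP 4

typedef struct Block {
    int npreds;
    struct Block *preds[MAXP];
    int def;                 /* SSA value defining the variable here */
} Block;

static int next_val = 1;     /* fresh SSA value numbers */

static int read_variable(Block *b);

static int add_phi(Block *b) {
    int phi = next_val++;
    b->def = phi;            /* record BEFORE recursing: breaks loop cycles */
    for (int i = 0; i < b->npreds; i++)
        (void)read_variable(b->preds[i]);   /* these become phi operands */
    return phi;
}

static int read_variable(Block *b) {
    if (b->def) return b->def;               /* local definition found     */
    if (b->npreds == 1)                       /* single predecessor: no phi */
        return b->def = read_variable(b->preds[0]);
    return add_phi(b);                        /* join point: place a phi    */
}
```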

Mechanizing Conventional SSA for a Verified Destruction with Coalescing
Delphine Demange and Yon Fernandez de Retana
(University of Rennes 1, France; IRISA, France; Inria, France)
Modern optimizing compilers rely on the Static Single Assignment (SSA) form to make optimizations fast and simpler to implement. From a semantic perspective, the SSA form is nowadays fairly well understood, as witnessed by recent advances in the field of formally verified compilers. The destruction of the SSA form, however, remains a difficult problem, even in a non-verified environment. In fact, the out-of-SSA transformation has been revisited, for correctness and performance issues, up until recently. Unsurprisingly, state-of-the-art compiler formalizations thus either completely ignore, only partially handle, or implement naively the SSA destruction. This paper reports on the implementation of such a destruction within a verified compiler. First, we formally define and prove the properties of the generation of Conventional SSA (CSSA) that make its destruction simple to implement and prove. Second, we implement and prove correct a coalescing destruction of CSSA, à la Boissinot et al., where variables can be coalesced according to a refined notion of interference. This formalization work extends the CompCertSSA compiler, whose correctness proof is mechanized in the Coq proof assistant. Our CSSA-based, coalescing destruction removes, on average, more than 99% of introduced copies, and leads to encouraging results concerning spilling during post-SSA register allocation.

Reachability and Error Diagnosis in LR(1) Parsers
François Pottier
(Inria, France)
Given an LR(1) automaton, what are the states in which an error can be detected? For each such "error state", what is a minimal input sentence that causes an error in this state? We propose an algorithm that answers these questions. This allows building a collection of pairs of an erroneous input sentence and a (handwritten) diagnostic message, ensuring that this collection covers every error state, and maintaining this property as the grammar evolves. We report on an application of this technique to the CompCert ISO C99 parser, and discuss its strengths and limitations.

Verification

Automatic Fault Location for Data Structures
Vineet Singh, Rajiv Gupta, and Iulian Neamtiu
(University of California at Riverside, USA; New Jersey Institute of Technology, USA)
Specification-based data structure verification is a powerful debugging technique. In this work we combine specification-based data structure verification with automatic detection of the faulty program statements that corrupt data structures. The user specifies the consistency constraints for dynamic data structures as relationships among the nodes of a memory graph. Our system detects constraint violations to identify corrupted data structures during program execution and then automatically locates the faulty code responsible for the corruption. Our approach offers two main advantages: (1) a highly precise automatic fault location method, and (2) a simple specification language. We employ incremental constraint checking for time-efficient constraint matching and fault location. On average, while the Tarantula statistical debugging technique narrows the fault down to 10 statements, our technique narrows it down to ≈4 statements.
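
As a flavor of what such a specification means operationally, here is a hand-coded analogue of one consistency constraint, sortedness of a singly linked list, expressed as a C predicate over the memory graph. The paper's system instead takes a declarative specification and checks it incrementally during execution.

```c
/* Hand-coded analogue of a consistency constraint over a memory graph:
 * every node of a sorted singly linked list must satisfy
 *   node->next == NULL || node->val <= node->next->val.
 * A violation pinpoints where the data structure was corrupted. */
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { int val; struct Node *next; } Node;

bool consistent(const Node *head) {
    for (const Node *n = head; n; n = n->next)
        if (n->next && n->val > n->next->val)
            return false;   /* constraint violated at this node */
    return true;
}
```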

Sparse Representation of Implicit Flows with Applications to Side-Channel Detection
Bruno Rodrigues, Fernando Magno Quintão Pereira, and Diego F. Aranha
(Federal University of Minas Gerais, Brazil; UNICAMP, Brazil)
Information flow analyses traditionally use the Program Dependence Graph (PDG) as a supporting data structure. This graph relies on Ferrante et al.'s notion of control dependences to represent implicit flows of information. A limitation of this approach is that it may create O(|I| × |E|) implicit flow edges in the PDG, where I is the set of instructions in a program and E is the set of edges in its control flow graph. This paper shows that it is possible to compute information flow analyses using a different notion of implicit dependence, which yields a number of edges linear in the number of definitions plus uses of variables. Our algorithm computes these dependences in a single traversal of the program's dominance tree. This efficiency is possible due to a key property of programs in Static Single Assignment form: the definition of a variable dominates all its uses. Our algorithm correctly implements Hunt and Sands' system of security types. Contrary to their original formulation, which required O(|I| × |I|) space and time for structured programs, we require only O(|I|). We have used our ideas to build FlowTracker, a tool that uncovers side-channel vulnerabilities in cryptographic algorithms. FlowTracker handles programs with over one million assembly instructions in less than 200 seconds, and creates 24% fewer implicit flow edges than Ferrante et al.'s technique. FlowTracker has detected an issue in a constant-time implementation of Elliptic Curve Cryptography; it has found several time-variant constructions in OpenSSL and one issue in TrueCrypt, and it has validated the isochronous behavior of the NaCl library.
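
A minimal example of the implicit flows in question: no assignment ever copies secret into pub, yet the branch makes pub's final value depend on secret, so the analysis must record a dependence from the branch condition to the assignment it guards.

```c
/* An implicit flow: the information moves through control, not data.
 * A sound analysis needs a dependence edge from the branch condition
 * to the assignment it controls. */
int leak(int secret) {
    int pub = 0;
    if (secret > 0)   /* control dependence carries the information */
        pub = 1;
    return pub;       /* reveals whether secret > 0 */
}
```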

Energy and Dynamic Checking

Multiversioned Decoupled Access-Execute: The Key to Energy-Efficient Compilation of General-Purpose Programs
Konstantinos Koukos, Per Ekemark, Georgios Zacharopoulos, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean
(Uppsala University, Sweden; University of Lugano, Switzerland)
Computer architecture design faces an era of great challenges in the attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management have become severely limited, and thus compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method has proven efficient for statically analyzable codes, general-purpose applications pose significant challenges due to pointer aliasing, complex control flow, and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique achieves energy-delay-product (EDP) improvements of over 20% on average (over 15% in energy and over 5% in performance) across 14 benchmarks from the SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.
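
A hand-written illustration of the decoupling (not the paper's generated code): a loop is split into a memory-bound access phase that only warms the cache and a compute-bound execute phase, in chunks, so that DVFS can run each phase at an appropriate frequency. __builtin_prefetch is the GCC/Clang prefetch builtin; the chunk size is illustrative.

```c
/* Sketch of decoupled access-execute on a simple loop. */
#include <stddef.h>

void daxpy_dae(double *y, const double *x, double a, size_t n) {
    enum { CHUNK = 4096 };
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = base + CHUNK < n ? base + CHUNK : n;
        /* Access phase: memory-bound, no side effects -> low frequency. */
        for (size_t i = base; i < end; i++) {
            __builtin_prefetch(&x[i]);
            __builtin_prefetch(&y[i]);
        }
        /* Execute phase: hits in cache -> high frequency. */
        for (size_t i = base; i < end; i++)
            y[i] += a * x[i];
    }
}
```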

Heap Bounds Protection with Low Fat Pointers
Gregory J. Duck and Roland H. C. Yap
(National University of Singapore, Singapore)
Heap buffer overflow (underflow) errors are a common source of security vulnerabilities. One prevention mechanism is to add object-bounds meta-information and to instrument the program with explicit bounds checks for all memory accesses. The so-called "fat pointers" approach is one method for maintaining and propagating the meta-information, where native machine pointers are replaced with "fat" objects that explicitly store object bounds. Another approach is "low-fat pointers", which encodes the meta-information within a native pointer itself, eliminating space overheads as well as code compatibility issues. This paper presents a new low-fat pointer encoding that is fully compatible with existing libraries (e.g., pre-compiled libraries unaware of the encoding) and standard hardware (e.g., x86_64). We show that our approach has very low memory overhead and is competitive with existing state-of-the-art bounds instrumentation solutions.
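
To convey the flavor of such an encoding (the constants and layout below are invented for illustration and are not the paper's exact scheme): if the heap is carved into large, aligned regions, each dedicated to one allocation size, then an object's size and base can be recomputed from the pointer's bits alone, with no per-object metadata.

```c
/* Illustrative low-fat-style encoding. The real scheme differs and, in
 * particular, avoids the division/modulo below with cheaper per-size
 * arithmetic. */
#include <stdint.h>

#define REGION_SHIFT 32          /* 4 GiB regions (illustrative) */
static const uint64_t region_size[16] =
    { 0, 16, 32, 64, 128, 256, 512, 1024 };  /* 0 = not a low-fat region */

/* A bounds check the compiler would insert before an n-byte access at p: */
static inline int in_bounds(uint64_t p, uint64_t n) {
    uint64_t s = region_size[(p >> REGION_SHIFT) & 15];
    if (s == 0) return 1;        /* non-low-fat pointer: skip the check */
    uint64_t base = p - (p % s); /* object base, from the bits alone    */
    return p + n <= base + s;
}
```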

Static and Dynamic Optimization

Register Allocation and Promotion through Combined Instruction Scheduling and Loop Unrolling
Łukasz Domagała, Duco van Amstel, Fabrice Rastello, and P. Sadayappan
(Inria, France; Ohio State University, USA)
Register allocation is a much-studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering. In this paper, we develop an approach to register allocation and promotion in a unified optimization framework that simultaneously considers the impact of loop unrolling and instruction scheduling. This is done via a novel instruction tiling approach where instructions within a loop are represented along one dimension and innermost loop iterations along the other dimension. By exploiting the regularity along the loop dimension, and imposing essential dependence based constraints on intra-tile execution order, the problem of optimizing register pressure is cast in a constraint programming formalism. Experimental results are provided from thousands of innermost loops extracted from the SPEC benchmarks, demonstrating improvements over the current state-of-the-art.
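
A two-function example of why unrolling, scheduling, and allocation interact: in the rolled loop, the value loaded as a[i+1] is reloaded as a[i] one iteration later; after unrolling by two, it can be promoted to a register and each array element is loaded only once. This is only an illustration of the interplay the paper optimizes jointly.

```c
/* Rolled: a[i+1] loaded now is reloaded as a[i] on the next iteration. */
void smooth(float *b, const float *a, int n) {
    for (int i = 0; i + 1 < n; i++)
        b[i] = a[i] + a[i + 1];
}

/* Unrolled by 2 with the shared value promoted to registers (x, y). */
void smooth_unrolled(float *b, const float *a, int n) {
    int i = 0;
    if (i + 1 < n) {
        float x = a[0];
        for (; i + 2 < n; i += 2) {
            float y = a[i + 1];
            b[i] = x + y;        /* reuses x from the previous iteration */
            x = a[i + 2];
            b[i + 1] = y + x;    /* y stays in a register                */
        }
        if (i + 1 < n)
            b[i] = x + a[i + 1]; /* epilogue for the leftover iteration  */
    }
}
```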

On Fusing Recursive Traversals of K-d Trees
Samyam Rajbhandari, Jinsung Kim, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, Robert J. Harrison, and P. Sadayappan
(Ohio State University, USA; Pacific Northwest National Laboratory, USA; Inria, France; Stony Brook University, USA)
Loop fusion is a key program transformation for data locality optimization that is implemented in production compilers. But optimizing compilers for imperative languages currently cannot exploit fusion opportunities across a set of recursive tree traversal computations with producer-consumer relationships. In this paper, we develop a compile-time approach to dependence characterization and program transformation to enable fusion across recursively specified traversals over k-d trees. We present the FuseT source-to-source code transformation framework to automatically generate fused composite recursive operators from an input program containing a sequence of primitive recursive operators. We use our framework to implement fused operators for MADNESS, the Multi-resolution Adaptive Numerical Environment for Scientific Simulation. We show that locality optimization through fusion can offer significant performance improvement.
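
The shape of the transformation, on a toy binary tree rather than a real k-d tree: two producer-consumer traversals become one traversal that applies both operators while each node is hot in cache. FuseT derives such fused operators automatically; this hand-fused version only illustrates the idea.

```c
/* Two recursive operators with a producer-consumer relationship, and
 * their fusion into a single traversal. */
typedef struct Tree { double v; struct Tree *left, *right; } Tree;

void apply_f(Tree *t) { if (!t) return; t->v *= 2.0; apply_f(t->left); apply_f(t->right); }
void apply_g(Tree *t) { if (!t) return; t->v += 1.0; apply_g(t->left); apply_g(t->right); }

void fused(Tree *t) {                 /* one traversal instead of two  */
    if (!t) return;
    t->v = t->v * 2.0 + 1.0;          /* f then g, while the node is hot */
    fused(t->left);
    fused(t->right);
}
```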

Restrictification of Function Arguments
Victor Hugo Sperle Campos, Péricles Rafael Alves, Henrique Nazaré Santos, and Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil)
Pointer aliasing still hinders compiler optimizations, in spite of years of research on pointer disambiguation. Because the automatic disambiguation of pointers is a difficult endeavor, several programming languages offer programmers mechanisms to distinguish memory references, such as the “restrict” keyword in C. However, the use of such mechanisms is prone to human mistakes. In this paper we present a suite of automatic techniques that mitigate this problem. We have designed, implemented and tested three different ways to disambiguate pointers passed as arguments of functions. Our techniques combine static analyses, to infer symbolic bounds of memory regions, with code versioning. We generate a clone for each function whose arguments we can disambiguate and optimize it assuming the absence of aliasing among formal parameters. At runtime, we use the results of the symbolic interval tests to decide which version of a function to call: the “restricted” clone whenever we can prove that no aliasing can occur, and the original otherwise. An implementation of our restrictification methods in LLVM shows that we can vectorize up to 63% more operations than what could be accomplished using the -O3 optimization level of said compiler. When applying the optimization on OpenCV benchmarks, we have observed speedups as great as 40%.
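
A hand-written sketch of the versioning scheme on a single function (the paper infers the symbolic range test statically; here it is written out by hand): the clone is compiled under a no-alias assumption expressed with restrict, and a runtime disjointness test selects the version.

```c
#include <stddef.h>
#include <stdint.h>

/* Clone compiled under the no-alias assumption: freely vectorizable. */
static void add_noalias(float *restrict dst, const float *restrict src,
                        size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Original: must assume dst may overlap src. */
static void add_generic(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Dispatcher: are [dst, dst+n) and [src, src+n) disjoint?
 * (Integer comparison keeps the test portable.) */
void add(float *dst, const float *src, size_t n) {
    if ((uintptr_t)(dst + n) <= (uintptr_t)src ||
        (uintptr_t)(src + n) <= (uintptr_t)dst)
        add_noalias(dst, src, n);   /* "restricted" clone */
    else
        add_generic(dst, src, n);   /* original version   */
}
```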

Static Analysis

Static Deadlock Detection for Concurrent Go by Global Session Graph Synthesis
Nicholas Ng and Nobuko Yoshida
(Imperial College London, UK)
Go is a programming language developed at Google, with channel-based concurrent features based on CSP. Go can detect global communication deadlocks at runtime when all threads of execution are blocked, but deadlocks in other paths of execution could be undetected. We present a new static analyser for concurrent Go code to find potential communication errors such as communication mismatch and deadlocks at compile time. Our tool extracts the communication operations as session types, which are then converted into Communicating Finite State Machines (CFSMs). Finally, we apply a recent theoretical result on choreography synthesis to generate a global graph representing the overall communication pattern of a concurrent program. If the synthesis is successful, then the program is free from communication errors. We have implemented the technique in a tool, and applied it to analyse common Go concurrency patterns and an open source application with over 700 lines of code.

Static Detection of Energy Defect Patterns in Android Applications
Haowei Wu, Shengqian Yang, and Atanas Rountev
(Ohio State University, USA)
For static analysis researchers, Android software presents a wide variety of interesting challenges. The target of our work is static detection of energy-drain defects in Android applications. The management of energy-intensive resources (e.g., GPS) creates various opportunities for software defects. Our goal is to detect statically “missing deactivation” energy-drain defects in the user interface of the application. First, we define precisely two patterns of run-time energy-drain behaviors, based on modeling of Android GUI control-flow paths and energy-related listener leaks along such paths. Next, we define a static detection algorithm targeting these patterns. The analysis considers valid interprocedural control-flow paths in a callback method and its transitive callees, in order to detect operations that add or remove listeners. Sequences of callbacks are then analyzed for possible listener leaks. Our evaluation considers the detection of GUI-related energy-drain defects reported in prior work, as well as new defects not discovered by prior approaches. In summary, the detection is very effective and precise, suggesting that the proposed analysis is suitable for practical use in static checking tools for Android.

On Fast Large-Scale Program Analysis in Datalog
Bernhard Scholz, Herbert Jordan, Pavle Subotić, and Till Westmann
(Oracle Labs, Australia; University College London, UK; Oracle Labs, USA)
Designing and crafting a static program analysis is challenging due to the complexity of the task at hand. Among the challenges are modelling the semantics of the input language, finding suitable abstractions for the analysis, and handwriting efficient code for the analysis in a traditional imperative language such as C++. Hence, the development of static program analysis tools is costly in terms of development time and resources for real-world languages. To overcome, or at least alleviate, the costs of developing a static program analysis, Datalog has been proposed as a domain-specific language (DSL). With Datalog, a designer expresses a static program analysis in the form of a logical specification. While a domain-specific-language approach aids the ease of development of program analyses, it is commonly accepted that such an approach has worse runtime performance than handcrafted static analysis tools. In this work, we introduce a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers. The synthesis technique reinterprets semi-naive evaluation as a scaffolding for translation using partial evaluation. To achieve high performance, we employ staged-compilation techniques and specialize the underlying relational data structures for a given Datalog specification. Experimentation on benchmarks for large-scale program analysis validates the superior performance of our approach over available Datalog tools and demonstrates our competitiveness with state-of-the-art handcrafted tools.
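
To make "semi-naive evaluation as scaffolding" concrete, here is transitive closure, path(x,z) :- path(x,y), edge(y,z), specialized by hand to C in the semi-naive style: each round joins only the newly derived tuples (the delta) against edge, instead of re-joining the whole relation. A real synthesizer emits specialized indexed data structures rather than these dense arrays.

```c
/* Semi-naive evaluation of transitive closure, specialized to C. */
#include <stdbool.h>

#define N 64                        /* domain size (illustrative) */
static bool edge[N][N], path[N][N];

void transitive_closure(void) {
    static bool delta[N][N], next[N][N];
    /* path(x,y) :- edge(x,y). */
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++) {
            path[x][y] = delta[x][y] = edge[x][y];
            next[x][y] = false;
        }
    for (;;) {
        bool changed = false;
        /* path(x,z) :- path(x,y), edge(y,z) -- join the delta only. */
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                if (delta[x][y])
                    for (int z = 0; z < N; z++)
                        if (edge[y][z] && !path[x][z]) {
                            path[x][z] = next[x][z] = true;
                            changed = true;
                        }
        if (!changed) break;
        for (int x = 0; x < N; x++)          /* delta = freshly derived */
            for (int y = 0; y < N; y++) {
                delta[x][y] = next[x][y];
                next[x][y] = false;
            }
    }
}
```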

Improved MHP Analyses
Aravind Sankar, Soham Chakraborty, and V. Krishna Nandivada
(IIT Madras, India; MPI-SWS, Germany)
May-Happen-in-Parallel (MHP) analysis is becoming the backbone of many parallel analyses and optimizations. In this paper, we present new approaches to MHP analysis for X10-like languages that support async-finish-atomic parallelism. We present a fast incremental MHP algorithm that derives all the statements that may run in parallel with a given statement. We also extend the MHP algorithm of Agarwal et al. (which answers whether two given X10 statements may run in parallel, and under what condition) to improve its computational complexity, without compromising on precision.

Data Layout and Polyhedral Techniques

Extended Lattice-Based Memory Allocation
Alain Darte, Alexandre Isoard, and Tomofumi Yuki
(CNRS, France; ENS de Lyon, France; Inria, France)
This work extends lattice-based memory allocation, an earlier work on memory reuse through array contraction. Such an optimization is used for optimizing high-level programming languages where storage mapping may be abstracted away from programmers and to complement code transformations that introduce intermediate buffers. The main motivation for this extension is to improve the handling of more general forms of specifications we see today, e.g., with loop tiling, pipelining, and other forms of parallelism available in explicitly-parallel languages. Specifically, we handle the case when conflicting constraints (those that describe the array indices that cannot share the same location) are specified as a (non-convex) union of polyhedra. The choice of directions (or basis) of array reuse becomes important when dealing with non-convex specifications. We extend the two dual approaches in the original work to handle unions of polyhedra, and to select a suitable basis. Our final approach relies on a combination of the two, also revealing their links with, on one hand, the construction of multi-dimensional schedules for parallelism and tiling (but with a fundamental difference that we identify) and, on the other hand, the construction of universal occupancy vectors (UOV), which was only used so far in a specific context, for schedule-independent mapping.
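
The simplest instance of the kind of mapping being generalized: if iteration i conflicts only with iteration i-1 on array a, the modular mapping a[i] -> a[i mod 2] is a valid allocation and contracts the array to two cells. The code below is an invented illustration of this base case, not the paper's extension.

```c
/* Array contraction via a modular mapping: only the last two values of
 * a are ever live at once, so two cells suffice. (Assumes n >= 1.) */
void contracted(float *out, const float *in, int n) {
    float a[2];                             /* was: a full array of size n */
    a[0] = in[0] * in[0];
    out[0] = a[0];
    for (int i = 1; i < n; i++) {
        a[i % 2] = in[i] * in[i];           /* write a[i]   -> a[i mod 2]  */
        out[i] = a[i % 2] - a[(i - 1) % 2]; /* read  a[i-1] via the mapping */
    }
}
```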

Mapping Deviation: A Technique to Adapt or to Guard Loop Transformation Intuitions for Legality
Cédric Bastoul
(University of Strasbourg, France; Inria, France)
Parallel architectures are now omnipresent in mainstream electronic devices, and exploiting them efficiently is a challenge for all developers. Hence, developers need the support of languages, libraries, and tools to assist them in the optimization or parallelization task. Compilers can provide major help by automating this work. However, they are fragile black boxes: a compiler may take a bad optimization decision because of imprecise heuristics, or may turn off an optimization because of imprecise analyses, without providing much control or feedback to the end user. To address this issue, we introduce mapping deviation, a new compiler technique that aims at providing useful feedback on the semantics of a given program restructuring. Starting from a transformation intuition a user or a compiler wants to apply, our algorithm studies its correctness and can suggest changes or conditions to make it possible, rather than being limited to the classical go/no-go answer. This algorithm builds on state-of-the-art polyhedral representations of programs and provides high flexibility. We present two example applications of this technique: improving semi-automatic optimization tools for programmers, and automatically designing runtime tests to check the correctness of a transformation for compilers.

Automatic Data Layout Generation and Kernel Mapping for CPU+GPU Architectures
Deepak Majeti, Kuldeep S. Meel, Rajkishore Barik, and Vivek Sarkar
(Rice University, USA; Intel, USA)
The ubiquity of hybrid CPU+GPU architectures has led to renewed interest in automatic data layout generation, owing to the fact that data layouts have a large impact on performance and that different data layouts yield the best performance on CPUs vs. GPUs. Unfortunately, current programming models still fail to provide an effective solution to the problem of automatic data layout generation for CPU+GPU processors. Specifically, the interaction among whole-program data layout optimizations, data movement optimizations, and the mapping of kernels across heterogeneous cores poses a major challenge to current programming systems. In this paper, we introduce a novel two-level hierarchical formulation of the data layout and kernel mapping problem for modern heterogeneous architectures. The bottom-level formulation deals with the data layout problem for a parallel code region on a given processor, which is NP-hard; we provide a greedy algorithm that uses an affinity graph to obtain approximate solutions. The top-level formulation targets data layouts and kernel mapping for the entire program, for which we provide a polynomial-time solution using a graph-based shortest-path algorithm that uses the data layouts for the code regions (sections) for a given processor computed in the bottom-level formulation. We implement this data layout transformation in the new Heterogeneous Habanero-C (H2C) parallel programming framework and propose performance models to characterize the data layout impact on both the CPU and the GPU. Our data layout framework shows significant performance improvements of up to 2.9x (geometric mean 1.5x) on a multicore CPU+GPU compared to manually specified layouts for a set of parallel programs running on a heterogeneous platform consisting of an Intel Xeon CPU and an NVIDIA GPU. Further, our framework also shows performance improvements of up to 2.7x (geometric mean 1.6x) on just the multicore CPU, demonstrating the applicability of our approach to both heterogeneous and homogeneous hardware platforms.
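
The two extremes of the layout space being searched are easy to state in C: array-of-structs favors code that touches whole records at a time, while struct-of-arrays gives unit-stride (and, on GPUs, coalesced) access to a single field. H2C chooses per-region layouts; these definitions only mark the endpoints of that choice.

```c
/* Array-of-structs vs. struct-of-arrays for the same data. */
#define N 1024

struct BodyAoS  { float x, y, z, m; };             /* record-at-a-time */
struct BodiesSoA { float x[N], y[N], z[N], m[N]; };/* field-at-a-time  */

float total_mass_aos(const struct BodyAoS *b, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += b[i].m;  /* strided: 16-byte stride */
    return s;
}
float total_mass_soa(const struct BodiesSoA *b, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += b->m[i]; /* contiguous: unit stride */
    return s;
}
```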

Input Space Splitting for OpenCL
Simon Moll, Johannes Doerfert, and Sebastian Hack
(Saarland University, Germany)
The performance of OpenCL programs suffers from memory and control-flow divergence. Therefore, OpenCL compilers employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent, and hence can be observed for some, but not all, inputs. In these cases, vectorizing compilers have to generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results, i.e., it can classify more code parts as non-divergent. Consequently, specializing the kernel for the input space partitions allows for generating better SIMD code because of reduced divergence. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluate it on a range of OpenCL benchmarks. We observe speedups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
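
In plain C terms (the kernel and partitions below are invented for illustration), the idea is: a branch on a kernel input is divergent only where the input space straddles its boundary, so splitting the input space lets each partition run a uniform, branch-free specialization that vectorizes cleanly.

```c
/* Generic kernel: the branch depends on the input 'boundary'. */
void kernel(float *out, const float *in, int i, int boundary) {
    if (i < boundary) out[i] = in[i] * 2.0f;  /* divergent in general */
    else              out[i] = 0.0f;
}

/* Specializations, each uniform (no branch) within its partition: */
static void kernel_low(float *out, const float *in, int i) { out[i] = in[i] * 2.0f; }
static void kernel_high(float *out, int i)                 { out[i] = 0.0f; }

/* Driver enumerating the two partitions (assumes 0 <= boundary <= n): */
void run(float *out, const float *in, int n, int boundary) {
    for (int i = 0; i < boundary; i++) kernel_low(out, in, i);
    for (int i = boundary; i < n; i++) kernel_high(out, i);
}
```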

Tool Demonstrations

GreenThumb: Superoptimizer Construction Framework
Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati
(University of California at Berkeley, USA; Google, USA; University of Washington, USA; Qualcomm Research, USA)
Developing an optimizing compiler backend remains a laborious process, especially for the nontraditional ISAs that have been appearing recently. Superoptimization sidesteps the need for many code transformations by searching for an optimal instruction sequence semantically equivalent to the original code fragment. Even though superoptimization discovers the best machine-specific code optimizations, it has yet to become widely used. We propose GreenThumb, an extensible framework that reduces the cost of constructing superoptimizers and provides a fast search algorithm that can be reused for any ISA, exploiting the unique strengths of enumerative, stochastic, and symbolic (SAT-solver-based) search algorithms. To extend GreenThumb to a new ISA, it is only necessary to implement an emulator for the ISA and provide some ISA-specific search utility functions.

Register Allocation and Instruction Scheduling in Unison
Roberto Castañeda Lozano, Mats Carlsson, Gabriel Hjort Blindell, and Christian Schulte
(Swedish Institute of Computer Science, Sweden; KTH, Sweden)
This paper describes Unison, a simple, flexible, and potentially optimal software tool that performs integrated register allocation and instruction scheduling using combinatorial optimization. The tool can be used as an alternative or as a complement to traditional approaches, which are fast but complex and suboptimal. Unison is most suitable whenever high-quality code is required and longer compilation times can be tolerated (such as in embedded systems or library releases), or when the targeted processors are so irregular that traditional compilers fail to generate satisfactory code.

SVF: Interprocedural Static Value-Flow Analysis in LLVM
Yulei Sui and Jingling Xue
(UNSW, Australia)
This paper presents SVF, a tool that enables scalable and precise interprocedural Static Value-Flow analysis for C programs by leveraging recent advances in sparse analysis. SVF, which is fully implemented in LLVM, allows value-flow construction and pointer analysis to be performed in an iterative manner, thereby providing increasingly improved precision for both. SVF accepts points-to information generated by any pointer analysis (e.g., Andersen’s analysis) and constructs an interprocedural memory SSA form, in which the def-use chains of both top-level and address-taken variables are captured. Such value-flows can be subsequently exploited to support various forms of program analysis or enable more precise pointer analysis (e.g., flow-sensitive analysis) to be performed sparsely. By dividing a pointer analysis into three loosely coupled components: Graph, Rules and Solver, SVF provides an extensible interface for users to write their own solutions easily. SVF is publicly available at http://unsw-corg.github.io/SVF.
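
The distinction between top-level and address-taken variables that such a memory SSA captures can be seen on a tiny C input program; the comments mark where def-use edges are direct and where they must flow through loads and stores guided by points-to information. The comments reflect my reading of the general technique, not SVF's exact output.

```c
/* An input program annotated with the two kinds of def-use chains. */
int example(void) {
    int x = 0;        /* x becomes address-taken below                   */
    int *p = &x;      /* def of top-level p                              */
    int *q = p;       /* direct def-use edge: p -> q                     */
    *q = 42;          /* store: indirect def of x, guided by pts(q)      */
    return x;         /* load: value-flow edge from the store to here    */
}
```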

Iguana: A Practical Data-Dependent Parsing Framework
Ali Afroozeh and Anastasia Izmaylova
(CWI, Netherlands)
Data-dependent grammars extend context-free grammars with arbitrary computation, variable binding, and constraints. These features provide the user with the freedom and power to express syntactic constructs outside the realm of context-free grammars, e.g., indentation rules in Haskell and type definitions in C. Data-dependent grammars have been recently presented by Jim et al. as a grammar formalism that enables construction of parsers from a rich format specification. Although some features of data-dependent grammars are available in current parsing tools, e.g., semantic predicates in ANTLR, data-dependent grammars have not yet fully found their way into practice.
In this paper we present Iguana, a data-dependent parsing framework, implemented on top of the GLL parsing algorithm. In addition to basic features of data-dependent grammars, Iguana also provides high-level syntactic constructs, e.g., for operator precedence and indentation rules, which are implemented as desugaring to data-dependent grammars. These high-level constructs enable a concise and declarative way to define the syntax of programming languages. Moreover, Iguana's extensible data-dependent grammar API allows the user to easily add new high-level constructs or modify existing ones. We have used Iguana to parse various real programming languages, such as OCaml, Haskell, Java, and C#. In this paper we describe the architecture and features of Iguana, and report on its current implementation status.

SYCO: A Systematic Testing Tool for Concurrent Objects
Elvira Albert, Miguel Gómez-Zamalloa, and Miguel Isabel
(Complutense University of Madrid, Spain)
We present the concepts, usage and prototypical implementation of SYCO: a SYstematic testing tool for Concurrent Objects. The system receives as input a program, a selection of the method to be tested, and a set of initial values for its parameters. SYCO offers a visual web interface to carry out the testing process and visualize the results of the different executions, as well as the sequences of scheduled tasks, as a sequence diagram. Its kernel includes state-of-the-art partial-order reduction techniques to avoid redundant computations during testing. In addition, SYCO incorporates an option to effectively catch deadlock errors. In particular, it uses advanced techniques which guide the execution towards potential deadlock paths and discard paths that are guaranteed to be deadlock-free.
