CC 2019
28th International Conference on Compiler Construction (CC 2019)
Powered by
Conference Publishing Consulting

28th International Conference on Compiler Construction (CC 2019), February 16–17, 2019, Washington, DC, USA

CC 2019 – Proceedings

Contents - Abstracts - Authors

Frontmatter

Title Page

Welcome from the General Chair
Over the years, there have been many insights, connections, relations that started with interactions at CC. Sometimes inspired by an insightful observation in a presentation, often from a discussion with a colleague during a coffee break or over a meal. I hope that for all the participants for whom this is their first CC that it will be inspirational and rewarding, and I hope to see you again in future editions of CC. For our returning colleagues, welcome back.
Welcome from the Program Chair
Welcome to the 28th International Conference on Compiler Construction (CC), held in Washington, DC, February 16--17 2019. CC continues its focus on processing programs in the most general sense: analyzing, transforming or executing input that describes how a system operates. I think our program demonstrates the breadth of work that CC welcomes, ranging from classic code generation concerns like vectorization to new APIs to help speed up web applications.
CC 2019 Conference Organization
Listing of: Chairs, Program Committee, Steering Committee, External Reviewers
Sponsors
CC 2019 Sponsors

Keynote

The Sparse Tensor Algebra Compiler (Keynote)
Saman Amarasinghe
(Massachusetts Institute of Technology, USA)
Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering, and science. Increasingly often the tensors are sparse, which means most components are zeros. To get the best performance, currently programmers are left to write kernels for every operation, with different mixes of sparse and dense tensors in different formats. There are countless combinations, which makes it impossible to manually implement and optimize them all. The Tensor Algebra Compiler (TACO) is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations. For more information, see http://tensor-compiler.org.
Publisher's Version Article Search

Vectors and Accelerators

PPOpenCL: A Performance-Portable OpenCL Compiler with Host and Kernel Thread Code Fusion
Ying Liu, Lei Huang, Mingchuan Wu, Huimin Cui, Fang Lv, Xiaobing Feng, and Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia)
OpenCL offers code portability but no performance portability. Given an OpenCL program X specifically written for one platform P, existing OpenCL compilers, which usually optimize its host and kernel codes individually, often yield poor performance for another platform Q. Instead of obtaining a performance-improved version of X for Q via manual tuning, we aim to achieve this automatically by a source-to-source OpenCL compiler framework, PPOpenCL. By fusing X's host and kernel thread codes (with the operations in different work-items in the same work-group represented explicitly), we are able to apply data flow analyses, and subsequently, performance-enhancing optimizations on a fused control flow graph specifically for platform Q. Validation against OpenCL benchmarks shows that PPOpenCL (implemented in Clang 3.9.1) can achieve significantly improved portable performance on seven platforms considered.
Publisher's Version Article Search
Enabling Prefix Sum Parallelism Pattern for Recurrences with Principled Function Reconstruction
Yang Xia, Peng Jiang, and Gagan Agrawal
(Ohio State University, USA)
Much research work has been done to parallelize loops with recurrences over the last several decades. Recently, sampling-and-reconstruction method was proposed to parallelize a broad class of loops with recurrences in an automated fashion, with a practical runtime approach. Although the parallelized codes achieve linear scalability across multi-cores architectures, the sequential merge inherent to this method makes it not scalable on many core architectures, such as GPUs. At the same time, existing parallel merge approaches used for simple reduction loops cannot be directly and correctly applied to this method. Based on this observation, we propose new methods to merge partial results in parallel on GPUs and achieve linear scalability. Our approach involves refined runtime-checking rules to avoid unnecessary runtime check failures and reduce the overhead of reprocessing. We also propose sample converge technique to reduce the number of sample points so that communication and computation overhead is reduced. Finally, based on GPU architectural features, we develop optimization techniques to further improve performance. Our evaluation results of a set of representative algorithms show that our parallel merge implementation is substantially more efficient than sequential merge, and achieves linear scalability on different GPUs.
Publisher's Version Article Search
Revec: Program Rejuvenation through Revectorization
Charith Mendis, Ajay Jain, Paras Jain, and Saman Amarasinghe
(Massachusetts Institute of Technology, USA; University of California at Berkeley, USA)

Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD) or vector instructions which expose data level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in their code. However, once programs are written using vector intrinsics of a specific instruction set, the code becomes non-portable. Modern compilers are unable to analyze and retarget the code to newer vector instruction sets. Hence, programmers have to manually rewrite the same code using vector intrinsics of a newer generation to exploit higher data widths and capabilities of new instruction sets. This process is tedious, error-prone and requires maintaining multiple code bases. We propose Revec, a compiler optimization pass which revectorizes already vectorized code, by retargeting it to use vector instructions of newer generations. The transformation is transparent, happening at the compiler intermediate representation level, and enables performance portability of hand-vectorized code.

Revec can achieve performance improvements in real-world performance critical kernels. In particular, Revec achieves geometric mean speedups of 1.160× and 1.430× on fast integer unpacking kernels, and speedups of 1.145× and 1.195× on hand-vectorized x265 media codec kernels when retargeting their SSE-series implementations to use AVX2 and AVX-512 vector instructions respectively. We also extensively test Revec’s impact on 216 intrinsic-rich implementations of image processing and stencil kernels relative to hand-retargeting.


Publisher's Version Article Search

Effective Compilation

To Unify or Not to Unify: A Case Study on Unified Builds (in WebKit)
Takafumi Kubota, Yusuke Suzuki, and Kenji Kono
(Keio University, Japan)

Unified builds are a simple but effective technique to reduce the build time of large software projects. Unified builds generate large compiler tasks by bundling multiple source files into one, resulting in a significant reduction in build time through removal of redundant work incurred by shared headers. However, unified builds have a negative effect on incremental builds because each compiler task gets larger. An ad-hoc unification strategy causes an excessive slowdown in incremental builds. A rough report from WebKit says the worst slowdown is 20% (6s → 7s), but our investigation shows it is as high as 479% (19s → 110s).

In this paper, we investigate the characteristics of unified builds to find a sweet spot, which generates compiler tasks that reduce the full build time without increasing the incremental build time. An in-depth analysis of WebKit reveals 1) source files with higher header similarity should be unified, 2) source files that have significant differences in compile times should not be unified, and 3) source files that are not frontend-intensive should not be unified. Our case study shows the total build time is reduced by 2.66%, and the worst slowdown falls from 479% to 129%. These findings will be of help in deriving a more intelligent strategy of unification and give a basis for discussions on future build systems, compilers, and module systems that cooperatively generate efficient compiler tasks.


Publisher's Version Article Search
A Static Slicing Method for Functional Programs and Its Incremental Version
Prasanna Kumar K., Amitabha Sanyal, Amey Karkare, and Saswat Padhi
(IIT Bombay, India; IIT Kanpur, India; University of California at Los Angeles, USA)
An effective static slicing technique for functional programs must have two features. Its handling of function calls must be context sensitive without being inefficient, and, because of the widespread use of algebraic datatypes, it must take into account structure transmitted dependences. It has been shown that any analysis that combines these two characteristics is undecidable, and existing slicing methods drop one or the other. We propose a slicing method that only weakens (and not entirely drop) the requirement of context-sensitivity and that too for some and not all programs. We then consider applications that require the same program to be sliced with respect to several slicing criteria. We propose an incremental version of our slicing method to handle such situations efficiently. The incremental version consists of a one time pre-computation step that uses the non-incremental version to slice the program with respect to a fixed default slicing criterion and processes the results to a canonical form. Presented with a slicing criterion, a low-cost incremental step uses the results of the pre-computation to obtain the slice. Our experiments with a prototype incremental slicer confirms the expected benefits---the cost of incremental slicing, even when amortized over only a few slicing criteria, is much lower than the cost of non-incremental slicing.
Publisher's Version Article Search
Codestitcher: Inter-procedural Basic Block Layout Optimization
Rahman Lavaee, John Criswell, and Chen Ding
(University of Rochester, USA)
Modern software executes a large amount of code. Previous techniques of code layout optimization were developed one or two decades ago and have become inadequate to cope with the scale and complexity of new types of applications such as compilers, browsers, interpreters, language VMs and shared libraries. This paper presents Codestitcher, an inter-procedural basic block code layout optimizer which reorders basic blocks in an executable to benefit from better cache and TLB performance. Codestitcher provides a hierarchical framework which can be used to improve locality in various layers of the memory hierarchy. Our evaluation shows that Codestitcher improves the performance of the original program (already optimized with O3 and link time optimizations) by 3% to 25% (on average, by 10%) on 5 widely used applications with large code sizes: MySQL, Clang, Firefox, PHP server, and Python. It gives an additional improvement of 4% over LLVM's PGO and 3% over PGO combined with the best function reordering technique. For profiling, Codestitcher does not need instrumentation. Instead it uses branch history samples which are collected during the execution of the original program. Codestitcher's profiling and trace processing together incur an average overhead of 22.5%, compared to an average overhead of 90% from LLVM's PGO.
Publisher's Version Article Search

Embedded, IoT, and Web Programming

Low-Cost Deterministic C++ Exceptions for Embedded Systems
James Renwick, Tom Spink, and Björn Franke
(University of Edinburgh, UK)
The C++ programming language offers a strong exception mechanism for error handling at the language level, improving code readability, safety, and maintainability. However, current C++ implementations are targeted at general-purpose systems, often sacrificing code size, memory usage, and resource determinism for the sake of performance. This makes C++ exceptions a particularly undesirable choice for embedded applications where code size and resource determinism are often paramount. Consequently, embedded coding guidelines either forbid the use of C++ exceptions, or embedded C++ tool chains omit exception handling altogether. In this paper, we develop a novel implementation of C++ exceptions that eliminates these issues, and enables their use for embedded systems. We combine existing stack unwinding techniques with a new approach to memory management and run-time type information (RTTI). In doing so we create a compliant C++ exception handling implementation, providing bounded runtime and memory usage, while reducing code size requirements by up to 82%, and incurring only a minimal runtime overhead for the common case of no exceptions.
Publisher's Version Article Search
Spinal Code: Automatic Code Extraction for Near-User Computation in Fogs
Bongjun Kim, Seonyeong Heo, Gyeongmin Lee, Seungbin Song, Jong Kim, and Hanjun Kim
(POSTECH, South Korea; Yonsei University, South Korea)
In the Internet of Things (IoT) environments, cloud servers integrate various IoT devices including sensors and actuators, and provide new services that assist daily lives of users interacting with the physical world. While response time is a crucial factor of quality of the services, supporting short response time is challenging for the cloud servers due to a growing number and amount of connected devices and their communication. To reduce the burden of the cloud servers, fog computing is a promising alternative to offload computation and communication overheads from the cloud servers to fog nodes. However, since existing fog computing frameworks do not extract codes for fog nodes fully automatically, programmers should manually write and analyze their applications for fog computing. This work proposes Spinal Code, a new compiler-runtime framework for near-user computation that automatically partitions an original cloud-centric program into distributed sub-programs running over the cloud and fog nodes. Moreover, to reduce response time in the physical world, Spinal Code allows programmers to annotate latency sensitive actuators in a program, and optimizes the critical paths from required sensors to the actuators when it generates the sub-programs. This work implements 9 IoT programs across 4 service domains: healthcare, smart home, smart building and smart factory, and demonstrates that Spinal Code successfully reduces 44.3% of response time and 79.9% of communication on the cloud compared with a cloud-centric model.
Publisher's Version Article Search
Property Caches Revisited
Manuel Serrano and Marc Feeley
(Inria, France; University of Côte d'Azur, France; Université de Montréal, Canada)

Property caches are a well-known technique invented over 30 years ago to improve dynamic object accesses. They have been adapted to JavaScript, which they have greatly contributed to accelerate. However, this technique is applicable only when some constraints are satisfied by the objects, the properties, and the property access sites. In this work, we propose enhancements to improve two common usage patterns: prototype accesses and megamorphic accesses. We have implemented these in the Hopc AOT JavaScript compiler and we have measured their impact. We observe that they effectively complement traditional caches. They reduce cache misses and consequently accelerate execution. Moreover, they do not cause a slowdown in the handling of the other usage patterns.


Publisher's Version Article Search
Accelerating Web Application Loading with Snapshot of Event and DOM Handling
JiHwan Yeo, JinSeok Oh, and Soo-Mook Moon
(Seoul National University, South Korea)

Reducing the loading time of a web app is important for a better user experience. The loading time includes a large amount of JavaScript execution, often composed of the execution of the global code in the script tags followed by the execution of event handlers. One approach to accelerate the app loading is saving the already-loaded execution state of JavaScript objects in a file called the snapshot in advance. Then, we start an app by copying the objects in the snapshot to the JavaScript heap directly, instead of executing JavaScript code to create them. Unfortunately, existing works save only the execution state of the global code in the snapshot, not that of the event handlers. Also, JavaScript code whose execution may change the DOM (Document Object Model) tree is not saved in the snapshot, limiting the coverage of the snapshot.

In this paper, we introduce snapshot+, an extended version of the snapshot where we save the execution state of the event handlers as well as that of the global code. Unlike the global code whose execution order is statically determined, the event handlers have a more dynamic execution order, depending on the triggering order of the events and the registration order of the event handlers. This makes it somewhat complicated to decide when to create and restore the snapshot. Snapshot+ provides a new API for the app developer to specify and register the event handlers eligible for snapshot, so the browser can create and restore the snapshot at the right place automatically. Also, snapshot+ can save a changed DOM tree in a serialized HTML string and restore it during app loading, allowing the DOM-changing JavaScript executions to be saved in the snapshot. Our experimental results show that snapshot+ can improve the app loading time by 52% (AngularJS apps), 42% (Ext JS apps), 48% (jQuery apps), and 18% (MooTools apps), compared to the original snapshot.


Publisher's Version Article Search

Static and Dynamic Analysis

GPU-Accelerated Fixpoint Algorithms for Faster Compiler Analyses
Thorsten Blaß and Michael Philippsen
(University of Erlangen-Nuremberg, Germany)
Inter-procedural data-flow analyses are slow. We parallelize these predicate propagation fixpoint algorithms efficiently on a GPU. Our approach is (mostly) synchronization free even though the processed graphs in general are cyclic and have nodes with fan-in and fan-out degrees above 1. We detect and fix any data races that occur while propagating predicates in a SIMD fashion. Additionally, we solve the parallel termination problem by means of heuristics. The GPU-codes of five data-flow analyses are up to 89.8 times faster than their sequential LLVM variants. Offloading the analyses to the GPU saves up to 26.5% of the total compilation time.
Publisher's Version Article Search
Compare Less, Defer More: Scaling Value-Contexts Based Whole-Program Heap Analyses
Manas Thakur and V. Krishna Nandivada
(IIT Madras, India)
The precision of heap analyses determines the precision of several associated optimizations, and has been a prominent area in compiler research. It has been shown that context-sensitive heap analyses are more precise than the insensitive ones, but their scalability continues to be a cause of concern. Though the value-contexts approach improves the scalability of classical call-string based context-sensitive analyses, it still does not scale well for several popular whole-program heap analyses. In this paper, we propose a three-stage analysis approach that lets us scale complex whole-program value-contexts based heap analyses for large programs, without losing their precision. Our approach is based on a novel idea of level-summarized relevant value-contexts (LSRV-contexts), which take into account an important observation that we do not need to compare the complete value-contexts at each call-site. Our overall approach consists of three stages: (i) a fast pre-analysis stage that finds the portion of the caller-context which is actually needed in the callee; (ii) a main-analysis stage which uses LSRV-contexts to defer the analysis of methods that do not impact the callers' heap and analyze the rest efficiently; and (iii) a post-analysis stage that analyzes the deferred methods separately. We demonstrate the usefulness of our approach by using it to perform whole-program context-, flow- and field-sensitive thread-escape analysis and control-flow analysis of Java programs. Our evaluation of the two analyses against their traditional value-contexts based versions shows that we not only reduce the analysis time and memory consumption significantly, but also succeed in analyzing otherwise unanalyzable programs in less than 40 minutes.
Publisher's Version Article Search
Valence: Variable Length Calling Context Encoding
Tong Zhou, Michael R. Jantz, Prasad A. Kulkarni, Kshitij A. Doshi, and Vivek Sarkar
(Georgia Institute of Technology, USA; University of Tennessee, USA; University of Kansas, USA; Intel, USA)
Many applications, including program optimizations, debugging tools, and event loggers, rely on calling context to gain additional insight about how a program behaves during execution. One common strategy for determining calling contexts is to use compiler instrumentation at each function call site and return sites to encode the call paths and store them in a designated area of memory. While recent works have shown that this approach can generate precise calling context encodings with low overhead, the encodings can grow to hundreds or even thousands of bytes to encode a long call path, for some applications. Such lengthy encodings increase the costs associated with storing, detecting, and decoding call path contexts, and can limit the effectiveness of this approach for many usage scenarios. This work introduces a new compiler-based strategy that significantly reduces the length of calling context encoding with little or no impact on instrumentation costs for many applications. Rather than update or store an entire word at each function call and return, our approach leverages static analysis and variable length instrumentation to record each piece of the calling context using only a small number of bits, in most cases. We implemented our approach as an LLVM compiler pass, and compared it directly to the state-of-the-art calling context encoding strategy (PCCE) using a standard set of C/C++ applications from SPEC CPU 2017. Overall, our approach reduces the length of calling context encoding from 4.3 words to 1.6 words on average (> 60% reduction), thereby improving the efficiency of applications that frequently store or query calling contexts.
Publisher's Version Article Search
Path Sensitive MFP Solutions in Presence of Intersecting Infeasible Control Flow Path Segments
Komal Pathade and Uday P. Khedker
(Tata Consultancy Services, India; IIT Bombay, India)

Data flow analysis computes Maximum Fix Point (MFP) solution which represents an over approximation of the data reaching a program point along all control flow paths (cfps). Some of these cfps may be infeasible; meaning, the necessary pre-condition for execution of cfp is not satisfiable in any run of the program. Approximations that do not discern data along infeasible cfps may lead to imprecision, because they include spurious information.

Recent methods progressively separate the data along feasible and infeasible prefixes of infeasible cfps to ignore data corresponding to prefix that is infeasible. A criteria called minimal infeasible path segment is used to identify the cluster of infeasible cfps which can be considered equivalent for maintaining separate data. Clustering is useful because it avoids the possibly exponential cost of keeping the data along each infeasible cfp separate. The recent clustering approach is imprecise in presence of shared edges between cfps from two different clusters.

In this work, we formalize the interaction between clusters and provide a more general and effective criteria for clustering the infeasible cfps. Our experiments indicate up to 2-3 times increase in the precision over previous approach, with average 100% increase in memory and time required for the analysis. This is possible because our empirical observation indicates that on average 70% clusters overlap with other clusters.


Publisher's Version Article Search

Scientific Computing Concerns

Automatic Adaptive Approximation for Stencil Computations
Maxime Schmitt, Philippe Helluy, and Cédric Bastoul
(University of Strasbourg, France; Inria, France)
Approximate computing is necessary to meet deadlines in some compute-intensive applications like simulation. Building them requires a high level of expertise from the application designers as well as a significant development effort. Some application programming interfaces greatly facilitate their conception but they still heavily rely on the developer's domain-specific knowledge and require many modifications to successfully generate an approximate version of the program. In this paper we present new techniques to semi-automatically discover relevant approximate computing parameters. We believe that superior compiler-user interaction is the key to improved productivity. After pinpointing the region of interest to optimize, the developer is guided by the compiler in making the best implementation choices. Static analysis and runtime monitoring are used to infer approximation parameter values for the application. We evaluated these techniques on multiple application kernels that support approximation and show that with the help of our method, we achieve similar performance as non-assisted, hand-tuned version while requiring minimal intervention from the user.
Publisher's Version Article Search
Efficiency and Expressiveness in UW-OpenMP
Raghesh Aloor and V. Krishna Nandivada
(IIT Madras, India)
OpenMP uses the efficient ‘team of workers’ model, where workers are given chunks of tasks (iterations of a parallel-for-loop, or sections in a parallel-sections block) to execute, and worker (not tasks) can be synchronized using barriers. Thus, OpenMP restricts the invocation of barriers in these tasks; as otherwise, the behavior of the program would be dependent on the number of runtime workers. To address such a restriction which can adversely impact programmability and readability, Aloor and Nandivada proposed UW-OpenMP by taking inspiration from the more intuitive interaction of tasks and barriers in newer task parallel languages like X10, HJ, Chapel and so on. UW-OpenMP gives the programmer an impression that each parallel task is executed by a unique worker, and importantly these parallel tasks can be synchronized using a barrier construct. Though UW-OpenMP is a useful extension of OpenMP (more expressive and efficient), it does not admit barriers within recursive functions invoked from parallel-for-loops, because of the inherent challenges in handing them. In this paper, we extend UW-OpenMP (we call it UWOmp++) to address this challenging limitation and in the process also realize more efficient programs. We propose a source to source transformation scheme to translate UWOmp++ C programs to equivalent OpenMP C programs that are guaranteed not to invoke barriers in any task. Our translation uses a novel intermediate representation called UWOmpCPS, which represents a parallel program written in OpenMP in an extended CPS format (admits parallel-for-loops and barriers). The use of this intermediate representation leads us to handle recursive functions within parallel-for-loops efficiently. We have implemented our proposed translation scheme in the ROSE compiler framework. Our preliminary evaluation shows that the proposed language extension to allow recursion helps plug an important gap in expressiveness, without compromising on the efficiency resulting from the ‘team of workers’ model.
Publisher's Version Article Search
Efficient Concolic Testing of MPI Applications
Hongbo Li, Zizhong Chen, and Rajiv Gupta
(University of California at Riverside, USA)

Software testing is widely used in industry, but its application in the High Performance Computing area has been scarce. Concolic testing, that automates testing via generation of inputs, has been highly successful for desktop applications and thus recent work on the COMPI tool has extended it to MPI programs. However, COMPI has two limitations. First, it requires the user to specify an upper limit on input size – if the chosen limit is too big, considerable time is wasted and if the chosen limit is too small, the branch coverage achieved is limited. Second, COMPI does not support floating point arithmetic that is common in HPC applications.

In this paper, we overcome the above limitations as follows. We propose input tuning that eliminates the need for users to set hard limits and generates inputs such that the testing achieves high coverage while avoiding waste of testing time by selecting suitable input sizes. Moreover, we enable handling of floating point data types and operations and demonstrate that the efficiency of constraint solving can be improved if we rely on the use of reals instead of floating point values. Our evaluation demonstrates that with input tuning the coverage we achieve in 10 minutes is typically higher than the coverage achieved in 1 hour when input tuning is not used. Without input tuning, 9.6-57.1% loss in coverage occurs for a real-world physics simulation program. For the physics simulation program, using our floating-point extension that uses reals covers 46 more branches than without using the extension. Also, we cover 122 more branches when solving floating-point constraints using reals rather than directly using floating-point numbers.


Publisher's Version Article Search

proc time: 2.77