Powered by
26th International Conference on Compiler Construction (CC 2017),
February 5–6, 2017,
Austin, TX, USA
Frontmatter
Concurrency and Parallelism
Partially Redundant Fence Elimination for x86, ARM, and Power Processors
Robin Morisset and Francesco Zappa Nardelli
(ENS, France; Inria, France)
We show how partial redundancy elimination (PRE) can be instantiated to perform provably correct fence elimination for multi-threaded programs running on top of the x86, ARM and IBM Power relaxed memory models. We have implemented our algorithm in the backends of the LLVM compiler infrastructure. The optimisation does not induce an observable overhead at compile-time and can result in up-to 10% speedup on some benchmarks.
@InProceedings{CC17p1,
author = {Robin Morisset and Francesco Zappa Nardelli},
title = {Partially Redundant Fence Elimination for x86, ARM, and Power Processors},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {1--10},
doi = {},
year = {2017},
}
Lightweight Data Race Detection for Production Runs
Swarnendu Biswas, Man Cao,
Minjia Zhang,
Michael D. Bond, and Benjamin P. Wood
(University of Texas at Austin, USA; Ohio State University, USA; Microsoft Research, USA; Wellesley College, USA)
To detect data races that harm production systems, program analysis must target production runs. However, sound and precise data race detection adds too much run-time overhead for use in production systems. Even existing approaches that provide soundness or precision incur significant limitations.
This work addresses the need for soundness (no missed races) and precision (no false races) by introducing novel, efficient production-time analyses that address each need separately. (1) Precise data race detection is useful for developers, who want to fix bugs but loathe false positives. We introduce a precise analysis called RaceChaser that provides low, bounded run-time overhead. (2) Sound race detection benefits analyses and tools whose correctness relies on knowledge of all potential data races. We present a sound, efficient approach called Caper that combines static and dynamic analysis to catch all data races in observed runs. RaceChaser and Caper are useful not only on their own; we introduce a framework that combines these analyses, using Caper as a sound filter for precise data race detection by RaceChaser.
Our evaluation shows that RaceChaser and Caper are efficient and effective, and compare favorably with existing state-of-the-art approaches. These results suggest that RaceChaser and Caper enable practical data race detection that is precise and sound, respectively, ultimately leading to more reliable software systems.
@InProceedings{CC17p11,
author = {Swarnendu Biswas and Man Cao and Minjia Zhang and Michael D. Bond and Benjamin P. Wood},
title = {Lightweight Data Race Detection for Production Runs},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {11--21},
doi = {},
year = {2017},
}
Optimized Two-Level Parallelization for GPU Accelerators using the Polyhedral Model
Jun Shirako, Akihiro Hayashi, and Vivek Sarkar
(Rice University, USA)
While GPUs play an increasingly important role in today’s high-performance computers, optimizing GPU performance continues to impose large burdens upon programmers. A major challenge in optimizing codes for GPUs stems from the two levels of hardware parallelism, blocks and threads; each of these levels has significantly different characteristics, requiring different optimization strategies.
In this paper, we propose a novel compiler optimization algorithm for GPU parallelism. Our approach is based on the polyhedral model, which has enabled significant advances in program analysis and transformation compared to traditional AST-based frameworks. We extend polyhedral schedules to enable two-level parallelization through the idea of superposition, which integrates separate schedules for block-level and thread-level parallelism. Our experimental results demonstrate that our proposed compiler optimization framework can deliver 1.8× and 2.1× geometric mean improvements on NVIDIA Tesla M2050 and K80 GPUs, compared to a state-of-the-art polyhedral parallel code generator (PPCG) for GPGPUs.
@InProceedings{CC17p22,
author = {Jun Shirako and Akihiro Hayashi and Vivek Sarkar},
title = {Optimized Two-Level Parallelization for GPU Accelerators using the Polyhedral Model},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {22--33},
doi = {},
year = {2017},
}
Optimization Space Pruning without Regrets
Ulysse Beaugnon, Antoine Pouille,
Marc Pouzet,
Jacques Pienaar, and Albert Cohen
(ENS, France; Google, USA; Inria, France)
Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures.
We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation.
The unique features of this performance model allow to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures.
We implemented our algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.
@InProceedings{CC17p34,
author = {Ulysse Beaugnon and Antoine Pouille and Marc Pouzet and Jacques Pienaar and Albert Cohen},
title = {Optimization Space Pruning without Regrets},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {34--44},
doi = {},
year = {2017},
}
Compilers
Compile-Time Function Memoization
Arjun Suresh,
Erven Rohou, and André Seznec
(Ohio State University, USA; Inria, France)
Memoization is the technique of saving the results of computations so that future executions can be omitted when the same inputs repeat. Recent work showed that memoization can be applied to dynamically linked pure functions using a load-time technique and results were encouraging for the demonstrated transcendental functions. A restriction of the proposed framework was that memoization was restricted only to dynamically linked functions and the functions must be determined beforehand. In this work, we propose function memoization using a compile-time technique thus extending the scope of memoization to user defined functions as well as making it transparently applicable to any dynamically linked functions. Our compile-time technique allows static linking of memoization code and this increases the benefit due to memoization by leveraging the inlining capability for the memoization wrapper. Our compile-time analysis can also handle functions with pointer parameters, and we handle constants more efficiently. Instruction set support can also be considered, and we propose associated hardware leading to additional performance gain.
@InProceedings{CC17p45,
author = {Arjun Suresh and Erven Rohou and André Seznec},
title = {Compile-Time Function Memoization},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {45--54},
doi = {},
year = {2017},
}
One Compiler: Deoptimization to Optimized Code
Christian Wimmer,
Vojin Jovanovic, Erik Eckstein, and
Thomas Würthinger
(Oracle Labs, USA; Oracle Labs, Switzerland; Oracle Labs, Austria)
A multi-tier virtual machine (VM) deoptimizes and transfers last-tier execution to the first-tier execution when a speculative optimization is invalidated.
The first-tier target of deoptimization is either an interpreter or code compiled by a baseline compiler.
Because such a first-tier execution uses a fixed stack frame layout, this complicates all VM components that need to walk the stack.
We propose to use the optimizing compiler also to compile deoptimization target code, i.e., the non-speculative first-tier code where execution continues after a deoptimization.
Deoptimization entry points are described with the same scope descriptors used to describe the origin of the deoptimization, i.e., deoptimization is a two-way matching of two scope descriptors describing the same abstract frame at the same virtual program counter.
We evaluate this deoptimization approach in a high-performance JavaScript VM.
It strictly uses a one-compiler approach, i.e., all frames on the stack originate from the same compiler.
@InProceedings{CC17p55,
author = {Christian Wimmer and Vojin Jovanovic and Erik Eckstein and Thomas Würthinger},
title = {One Compiler: Deoptimization to Optimized Code},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {55--64},
doi = {},
year = {2017},
}
Static Optimization in PHP 7
Nikita Popov, Biagio Cosenza, Ben Juurlink, and Dmitry Stogov
(TU Berlin, Germany; Zend Technologies, Russia)
PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications.
Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook.
PHP's highly dynamic nature, while providing useful language features, also makes it hard to optimize statically.
This paper reports on the implementation of purely static bytecode optimizations for PHP 7, the last major version of PHP. We discuss the challenge of integrating classical compiler optimizations, which have been developed in the context of statically-typed languages, into a programming language that is dynamically and weakly typed, and supports a plethora of dynamic language features. Based on a careful analysis of language semantics, we adapt static single assignment (SSA) form for use in PHP. Combined with type inference, this allows type-based specialization of instructions, as well as the application of various classical SSA-enabled compiler optimizations such as constant propagation or dead code elimination.
We evaluate the impact of the proposed static optimizations on a wide collection of programs, including micro-benchmarks, libraries and web frameworks.
Despite the dynamic nature of PHP, our approach achieves an average speedup of 50% on micro-benchmarks, 13% on computationally intensive libraries, as well as 1.1% (MediaWiki) and 3.5% (WordPress) on web applications.
@InProceedings{CC17p65,
author = {Nikita Popov and Biagio Cosenza and Ben Juurlink and Dmitry Stogov},
title = {Static Optimization in PHP 7},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {65--75},
doi = {},
year = {2017},
}
From Functional Programs to Pipelined Dataflow Circuits
Richard Townsend, Martha A. Kim, and Stephen A. Edwards
(Columbia University, USA)
We present a translation from programs expressed in a functional IR into dataflow networks as an intermediate step within a Haskell-to-Hardware compiler. Our networks exploit pipeline parallelism, particularly across multiple tail-recursive calls, via non-strict function evaluation. To handle the long-latency memory operations common to our target applications, we employ a latency-insensitive methodology that ensures arbitrary delays do not change the functionality of the circuit. We present empirical results comparing our networks against their strict counterparts, showing that non-strictness can mitigate small increases in memory latency and improve overall performance by up to 2×.
@InProceedings{CC17p76,
author = {Richard Townsend and Martha A. Kim and Stephen A. Edwards},
title = {From Functional Programs to Pipelined Dataflow Circuits},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {76--86},
doi = {},
year = {2017},
}
Types
Granullar: Gradual Nullable Types for Java
Dan Brotherston,
Werner Dietl, and
Ondřej Lhoták
(University of Waterloo, Canada)
Object-oriented languages like Java and
C# allow the null value for all references.
This supports many flexible patterns, but
has led to many errors, security vulnerabilities, and system crashes.
%
Static type systems can prevent null-pointer exceptions at compile
time, but require annotations, in particular for used libraries.
Conservative defaults choose the most restrictive typing, preventing many
errors, but requiring a large annotation effort.
Liberal defaults choose the most flexible typing, requiring less
annotations, but giving weaker guarantees.
Trusted annotations can be provided, but are not checked and require a
large manual effort.
None of these approaches provide a strong guarantee that the
checked part of the program is isolated from the unchecked part: even
with conservative defaults, null-pointer exceptions can occur in the
checked part.
This paper presents Granullar, a gradual type system for
null-safety. Developers start out verifying null-safety for
the most important components of their applications. At the boundary
to unchecked components, runtime checks are inserted by Granullar to guard the
verified system from being polluted by unexpected null values. This
ensures that null-pointer exceptions can only occur within the
unchecked code or at the boundary to checked code; the checked code is
free of null-pointer exceptions.
We present Granullar for Java, define the
checked-unchecked boundary, and how runtime checks are generated.
We evaluate our approach on real world software annotated for null-safety.
We demonstrate the runtime checks, and acceptable
compile-time and run-time performance impacts.
Granullar enables combining a checked core with
untrusted libraries in a safe manner, improving on the practicality of
such a system.
@InProceedings{CC17p87,
author = {Dan Brotherston and Werner Dietl and Ondřej Lhoták},
title = {Granullar: Gradual Nullable Types for Java},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {87--97},
doi = {},
year = {2017},
}
Let It Recover: Multiparty Protocol-Induced Recovery
Rumyana Neykova and
Nobuko Yoshida
(Imperial College London, UK)
Fault-tolerant communication systems rely on recovery strategies which are often error-prone (e.g. a programmer manually specifies recovery strategies) or inefficient (e.g. the whole system is restarted from the beginning). This paper proposes a static analysis based on multiparty session types that can efficiently compute a safe global state from which a system of interacting processes should be recovered. We statically analyse the communication flow of a program, given as a multiparty protocol, to extract the causal dependencies between processes and to localise failures. We formalise our recovery algorithm and prove its safety. A recovered communication system is free from deadlocks, orphan messages and reception errors. Our recovery algorithm incurs less communication cost (only affected processes are notified) and overall execution time (only required states are repeated). On top of our analysis, we design and implement a runtime framework in Erlang where failed processes and their dependencies are soundly restarted from a computed safe state. We evaluate our recovery framework on message-passing benchmarks and a use case for crawling webpages. The experimental results indicate our framework outperforms a built-in static recovery strategy in Erlang when a part of the protocol can be safely recovered.
@InProceedings{CC17p98,
author = {Rumyana Neykova and Nobuko Yoshida},
title = {Let It Recover: Multiparty Protocol-Induced Recovery},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {98--108},
doi = {},
year = {2017},
}
Program Analysis
Data Structure-Aware Heap Partitioning
Nouraldin Jaber and
Milind Kulkarni
(Purdue University, USA)
There are many applications of program (or heap) partitioning, such as computation offloading, region-based memory management, and OS-driven memory locality optimizations. Although these applications are conceptually different, fundamentally, they must generate code such that objects in the heap (and hence the code that operates on those objects) get partitioned depending on how those objects are used. Regardless of the intended application goal, the granularity at which the heap is partitioned is the key factor in partition quality, and hence it needs to be carefully chosen.
Previous work suggested two main granularities: class-based and allocation site--based, where objects from the same class (or those allocated at the same allocation site) are co-located. Both approaches share a critical drawback: data structures that are used in different ways can share the same class, or the same allocation sites for internal objects, and hence are forced to be co-located despite their different usage patterns.
We introduce the notion of data structure--aware partitioning to allow different data structures to be placed in different partitions, even by existing tools and analyses that inherently operate in a class-based or allocation site--based manner. Our strategy consists of an analysis that infers ownership properties between objects to identify data structures, and a code generation phase that encodes this ownership information into objects' data types and allocation sites without changing the semantics of the code.
We evaluate the quality of data structure--aware partitions by comparing it to the state-of-the-art allocation site--based partitioning on a subset of the DaCapo Benchmarks. Across a set of randomized trials, we had a median range of 5% to 25% reduction of cross-partition accesses, and, depending on partitioning decisions, up to a 95% reduction.
@InProceedings{CC17p109,
author = {Nouraldin Jaber and Milind Kulkarni},
title = {Data Structure-Aware Heap Partitioning},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {109--119},
doi = {},
year = {2017},
}
Dynamic Symbolic Execution for Polymorphism
Lian Li, Yi Lu, and
Jingling Xue
(Oracle Labs, Australia; Institute of Computing Technology at Chinese Academy of Sciences, China; UNSW, Australia)
Symbolic execution is an important program analysis technique that provides auxiliary execution semantics to execute programs with symbolic rather than concrete values. There has been much recent interest in symbolic execution for automatic test case generation and security vulnerability detection, resulting in various tools being deployed in academia and industry. Nevertheless, (subtype or dynamic) polymorphism of object-oriented programs has been neglected: existing symbolic execution techniques can explore different targets of conditional branches but not different targets of method invocations. We address the problem of how this polymorphism can be expressed in a symbolic execution framework. We propose the notion of symbolic types, which make object types symbolic. With symbolic types,[ various targets of a method invocation can be explored systematically by mutating the type of the receiver object of the method during automatic test case generation. To the best of our knowledge, this is the first attempt to address polymorphism in symbolic execution. Mutation of method invocation targets is critical for effectively testing object-oriented programs, especially libraries. Our experimental results show that symbolic types are significantly more effective than existing symbolic execution techniques in achieving test coverage and finding bugs and security vulnerabilities in OpenJDK.
@InProceedings{CC17p120,
author = {Lian Li and Yi Lu and Jingling Xue},
title = {Dynamic Symbolic Execution for Polymorphism},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {120--130},
doi = {},
year = {2017},
}
rev.ng: A Unified Binary Analysis Framework to Recover CFGs and Function Boundaries
Alessandro Di Federico,
Mathias Payer, and Giovanni Agosta
(Politecnico di Milano, Italy; Purdue University, USA)
Static binary analysis is a key tool to assess the security of third-party binaries and legacy programs. Most forms of binary analysis rely on the availability of two key pieces of information: the program's control-flow graph and function boundaries. However, current tools struggle to provide accurate and precise results, in particular when dealing with hand-written assembly functions and non-trivial control-flow transfer instructions, such as tail calls. In addition, most of the existing solutions are ad-hoc, rely on hand-coded heuristics, and are tied to a specific architecture.
In this paper we highlight the challenges faced by an architecture agnostic static binary analysis framework to provide accurate information about a program's CFG and function boundaries without employing debugging information or symbols. We propose a set of analyses to address predicate instructions, noreturn functions, tail calls, and context-dependent CFG.
rev.ng, our binary analysis framework based on QEMU and LLVM, handles all the 17 architectures supported by QEMU and produces a compilable LLVM IR. We implement our described analyses on top of LLVM IR. In an extensive evaluation, we test our tool on binaries compiled for MIPS, ARM, and x86-64 using GCC and clang and compare them to the industry's state of the art tool, IDA Pro, and two well-known academic tools, BAP/ByteWeight and angr. In all cases, the quality of the CFG and function boundaries produced by rev.ng is comparable to or improves over the alternatives.
@InProceedings{CC17p131,
author = {Alessandro Di Federico and Mathias Payer and Giovanni Agosta},
title = {rev.ng: A Unified Binary Analysis Framework to Recover CFGs and Function Boundaries},
booktitle = {Proc.\ CC},
publisher = {ACM},
pages = {131--141},
doi = {},
year = {2017},
}
Info
proc time: 1.23