14th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2016),
March 12–18, 2016,
Barcelona, Spain
Frontmatter
Messages from the Chairs
On behalf of the Organizing Committee, welcome to the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’16), held at the Hotel Princesa Sofia in Barcelona, Spain, March 12–18, 2016. As in previous years, CGO’16 is co-located with the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) and the International Symposium on High-Performance Computer Architecture (HPCA).
Poster Abstracts
As in previous years, the 2016 International Symposium on Code Generation and Optimization (CGO) hosts a combined poster session and ACM Student Research Competition (SRC). We received 32 poster abstracts, of which 15 competed in the ACM SRC.
Profiling Feedback
Cheetah: Detecting False Sharing Efficiently and Effectively
Tongping Liu and Xu Liu
(University of Texas at San Antonio, USA; College of William and Mary, USA)
False sharing is a notorious performance problem that may occur in multithreaded programs running on ubiquitous multicore hardware. It can degrade performance by up to an order of magnitude, significantly hurting scalability. Identifying false sharing in complex programs is challenging. Existing tools either incur significant performance overhead or do not provide adequate information to guide code optimization. To address these problems, we develop Cheetah, a profiler that detects false sharing both efficiently and effectively. Cheetah leverages the lightweight hardware performance monitoring units (PMUs) available in most modern CPU architectures to sample memory accesses. Cheetah develops the first approach to quantify the optimization potential of false sharing instances without actual fixes, based on the latency information collected by PMUs. Cheetah precisely reports false sharing and provides insightful optimization guidance for programmers, while adding less than 7% runtime overhead on average. Cheetah is ready for real deployment.
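To make the problem concrete, the following minimal sketch (not taken from the paper) shows the kind of false-sharing pattern a profiler such as Cheetah would report, together with the padding fix such a report would motivate; the 64-byte cache-line size is an assumption.

#include <thread>

struct Counters {
    long a;   // written only by thread 0
    long b;   // written only by thread 1; shares a cache line with 'a'
};

struct PaddedCounters {
    alignas(64) long a;   // assumes 64-byte cache lines: each counter
    alignas(64) long b;   // now occupies its own line
};

int main() {
    Counters c{0, 0};
    // The two threads never touch the same variable, yet every increment
    // invalidates the other core's copy of the shared cache line.
    std::thread t0([&] { for (int i = 0; i < 10000000; ++i) ++c.a; });
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) ++c.b; });
    t0.join(); t1.join();

    PaddedCounters p{0, 0};
    // With padding, the same workload no longer ping-pongs cache lines.
    std::thread u0([&] { for (int i = 0; i < 10000000; ++i) ++p.a; });
    std::thread u1([&] { for (int i = 0; i < 10000000; ++i) ++p.b; });
    u0.join(); u1.join();
    return 0;
}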
@InProceedings{CGO16p1,
author = {Tongping Liu and Xu Liu},
title = {Cheetah: Detecting False Sharing Efficiently and Effectively},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {1--11},
doi = {},
year = {2016},
}
AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
Dehao Chen,
David Xinliang Li, and Tipp Moseley
(Google, USA)
AutoFDO is a system to simplify real-world deployment of feedback-directed optimization (FDO). The system works by sampling hardware performance monitors on production machines and using those profiles to guide optimization. Profile data is stale by design, and we have implemented compiler features to deliver stable speedup across releases. The resulting performance is a geomean of 10.5% improvement on our benchmarks. AutoFDO achieves 85% of the gains of traditional FDO, despite imprecision due to sampling and information lost in the compilation pipeline. The system is deployed to hundreds of binaries at Google, and it is extremely easy to enable; users need only add some flags to their release build. To date, AutoFDO has increased the number of FDO users at Google by 8X and has doubled the number of cycles spent in FDO-optimized binaries. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries.
@InProceedings{CGO16p12,
author = {Dehao Chen and David Xinliang Li and Tipp Moseley},
title = {AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {12--23},
doi = {},
year = {2016},
}
Portable Performance on Asymmetric Multicore Processors
Ivan Jibaja, Ting Cao,
Stephen M. Blackburn, and
Kathryn S. McKinley
(University of Texas at Austin, USA; Pure Storage, USA; Institute of Computing Technology at Chinese Academy of Sciences, China; Australian National University, Australia; Microsoft Research, USA)
Static and dynamic power constraints are steering chip manufacturers to build single-ISA Asymmetric Multicore Processors (AMPs) with big and small cores. To deliver on their energy efficiency potential, schedulers must consider core sensitivity, load balance, and the critical path. Applying these criteria effectively is challenging, especially for complex and non-scalable multithreaded applications. We demonstrate that runtimes for managed languages, which are now ubiquitous, provide a unique opportunity to abstract over AMP complexity and inform scheduling with rich semantics such as thread priorities, locks, and parallelism: information not directly available to the hardware, OS, or application. We present the WASH AMP scheduler, which (1) automatically identifies and accelerates critical threads in concurrent, but non-scalable applications; (2) respects thread priorities; (3) considers core availability and thread sensitivity; and (4) proportionally schedules threads on big and small cores to optimize performance and energy. We introduce new dynamic analyses that identify critical threads and classify applications as sequential, scalable, or non-scalable. Compared to prior work, WASH improves performance by 20% and energy by 9% or more on frequency-scaled AMP hardware (not simulation). Performance advantages grow to 27% when asymmetry increases. Performance advantages are robust to a complex multithreaded adversary independently scheduled by the OS. WASH effectively identifies and optimizes a wider class of workloads than prior work.
@InProceedings{CGO16p24,
author = {Ivan Jibaja and Ting Cao and Stephen M. Blackburn and Kathryn S. McKinley},
title = {Portable Performance on Asymmetric Multicore Processors},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {24--35},
doi = {},
year = {2016},
}
Data Layout and Vectorization
StructSlim: A Lightweight Profiler to Guide Structure Splitting
Probir Roy and Xu Liu
(College of William and Mary, USA)
Memory access latency continues to be a dominant bottleneck in a large class of applications on modern architectures. To optimize memory performance, it is important to utilize locality in the memory hierarchy. Structure splitting can significantly improve memory locality. However, pinpointing inefficient code and providing insightful guidance for structure splitting is challenging. Existing tools typically leverage heavyweight memory instrumentation, which hinders their applicability to real long-running programs. To address this issue, we develop StructSlim, a profiler to pinpoint top candidates that benefit from structure splitting. StructSlim makes three unique contributions. First, it adopts lightweight address sampling to collect and analyze memory traces. Second, StructSlim employs a set of novel methods to determine memory access patterns to guide structure splitting. We also formally prove that our method has high accuracy even with sparse memory access samples. Third, StructSlim scales on multithreaded machines. StructSlim works on fully optimized, unmodified binary executables, independently of their compiler and language, incurring around 7% runtime overhead. To evaluate StructSlim, we study seven sequential and parallel benchmarks. With the guidance of StructSlim, we are able to significantly improve all these benchmarks; the speedup is up to 1.37x.
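As a hedged illustration (not from the paper), the sketch below shows the transformation a StructSlim report is meant to motivate: splitting a rarely used payload out of a hot structure so that traversals of the hot field touch far fewer cache lines. Field names and sizes are made up.

#include <array>
#include <vector>

// Before splitting: hot and cold data interleaved. Summing 'key' drags the
// cold 'payload' bytes through the cache as well.
struct RecordAoS {
    int  key;
    char payload[60];
};

// After splitting: hot fields packed densely, cold fields in a parallel array.
struct RecordsSplit {
    std::vector<int>                  keys;      // hot
    std::vector<std::array<char, 60>> payloads;  // cold, same index
};

long sum_keys_aos(const std::vector<RecordAoS>& r) {
    long s = 0;
    for (const auto& rec : r) s += rec.key;   // about one useful int per line
    return s;
}

long sum_keys_split(const RecordsSplit& r) {
    long s = 0;
    for (int k : r.keys) s += k;              // 16 useful ints per 64-byte line
    return s;
}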
@InProceedings{CGO16p36,
author = {Probir Roy and Xu Liu},
title = {StructSlim: A Lightweight Profiler to Guide Structure Splitting},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {36--46},
doi = {},
year = {2016},
}
Exploiting Recent SIMD Architectural Advances for Irregular Applications
Linchuan Chen, Peng Jiang, and Gagan Agrawal
(Ohio State University, USA)
A broad class of applications involve indirect or data-dependent memory accesses and are referred to as irregular applications. Recent developments in SIMD architectures – specifically, the emergence of wider SIMD lanes, the combination of SIMD parallelism with many-core MIMD parallelism, and more flexible programming APIs – are providing new opportunities as well as challenges for this class of applications. In this paper, we propose a general optimization methodology to effectively optimize different subclasses of irregular applications. Based on the observation that all applications with indirect memory accesses can be viewed as sparse matrix computations, we design an optimization methodology which includes three sub-steps: 1) locality enhancement through tiling, 2) data access pattern identification, and 3) write conflict removal at both SIMD and MIMD levels. This method has been applied to unstructured grids, molecular dynamics, and graph applications, in addition to sparse matrix computations. The speedup achieved by our single-threaded vectorized code over serial code is up to 9.05, whereas the overall speedup while utilizing both SIMD and MIMD (61 cores in Intel Xeon Phi) with our approach is up to 467.1. Further optimization using matrix reordering on irregular reductions and graph algorithms is able to achieve an incremental speedup of up to 1.69, though at a relatively high preprocessing cost. Moreover, SpMM using our approach outperforms routines from a highly optimized commercial library by up to 2.81x.
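For readers unfamiliar with the term, the following small sketch (not from the paper) shows the canonical irregular kernel in this space, a CSR sparse matrix-vector product: the gather through x[col[j]] is the indirect, data-dependent access that complicates vectorization and motivates the tiling and conflict-removal steps described above.

#include <vector>

struct CSR {
    std::vector<int>    row_ptr;  // size nrows + 1
    std::vector<int>    col;      // column index of each nonzero
    std::vector<double> val;      // value of each nonzero
};

void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    for (int i = 0; i < nrows; ++i) {
        double acc = 0.0;
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            acc += A.val[j] * x[A.col[j]];   // indirect (gather) access
        y[i] = acc;
    }
}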
@InProceedings{CGO16p47,
author = {Linchuan Chen and Peng Jiang and Gagan Agrawal},
title = {Exploiting Recent SIMD Architectural Advances for Irregular Applications},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {47--58},
doi = {},
year = {2016},
}
Exploiting Mixed SIMD Parallelism by Reducing Data Reorganization Overhead
Hao Zhou and
Jingling Xue
(UNSW, Australia)
Existing loop vectorization techniques can exploit either intra- or inter-iteration SIMD parallelism alone in a code region if one part of the region vectorized for one type of parallelism has data dependences (called mixed-parallelism-inhibiting dependences) on the other part of the region vectorized for the other type of parallelism. In this paper, we consider a class of loops that exhibit both types of parallelism (i.e., mixed SIMD parallelism) in their code regions containing mixed-parallelism-inhibiting data dependences. We present a new compiler approach for exploiting such mixed SIMD parallelism effectively by reducing the data reorganization overhead incurred when one type of parallelism is switched to the other. Our auto-vectorizer is simple and has been implemented in LLVM (3.5.0). We evaluate it on seven benchmarks with mixed SIMD parallelism selected from the SPEC and NAS benchmark suites and demonstrate its performance advantages over the state-of-the-art.
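As a rough, hypothetical illustration (not an example from the paper), the loop below mixes the two kinds of parallelism: the first statement is independent across iterations (inter-iteration SIMD), while the short fixed-length inner loop is vectorizable within a single iteration (intra-iteration SIMD); the dependence carrying t from the first part to the second is the kind of dependence that forces data reorganization when switching between the two.

// Hypothetical loop shape; array extents and names are made up.
void kernel(const float* a, const float (*b)[4], float (*c)[4], int n) {
    for (int i = 0; i < n; ++i) {
        float t = a[i] * 2.0f;        // independent across iterations of i
        for (int k = 0; k < 4; ++k)   // short, fixed-length: vectorizable
            c[i][k] = b[i][k] + t;    // within one iteration; depends on t
    }
}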
@InProceedings{CGO16p59,
author = {Hao Zhou and Jingling Xue},
title = {Exploiting Mixed SIMD Parallelism by Reducing Data Reorganization Overhead},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {59--69},
doi = {},
year = {2016},
}
GPU
A Black-Box Approach to Energy-Aware Scheduling on Integrated CPU-GPU Systems
Rajkishore Barik, Naila Farooqui, Brian T. Lewis, Chunling Hu, and Tatiana Shpeisman
(Intel Labs, USA; Georgia Institute of Technology, USA)
Energy efficiency is now a top design goal for all computing systems, from fitness trackers and tablets, where it affects battery life, to cloud computing centers, where it directly impacts operational cost, maintainability, and environmental impact. Today's widespread integrated CPU-GPU processors combine a CPU and a GPU compute device with different power-performance characteristics. For these integrated processors, hardware vendors implement automatic power management policies that are typically not exposed to the end-user. Furthermore, these policies often vary between different processor generations and SKUs. As a result, it is challenging to design a generally-applicable energy-aware runtime to schedule work onto both the CPU and GPU of such integrated CPU-GPU processors to optimize energy consumption. We propose a new black-box scheduling technique to reduce energy use by effectively partitioning work across the CPU and GPU cores of integrated CPU-GPU processors. Our energy-aware scheduler combines a power model with information about the runtime behavior of a specific workload. This power model is computed once for each processor to characterize its power consumption for different kinds of workloads. On two widely different platforms, a high-end desktop system and a low-power tablet, our energy-aware runtime yields an energy-delay product that is 96% and 93%, respectively, of the near-ideal Oracle energy-delay product on a diverse set of workloads.
@InProceedings{CGO16p70,
author = {Rajkishore Barik and Naila Farooqui and Brian T. Lewis and Chunling Hu and Tatiana Shpeisman},
title = {A Black-Box Approach to Energy-Aware Scheduling on Integrated CPU-GPU Systems},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {70--81},
doi = {},
year = {2016},
}
Portable and Transparent Software Managed Scheduling on Accelerators for Fair Resource Sharing
Christos Margiolas and
Michael F. P. O'Boyle
(University of Edinburgh, UK)
Accelerators, such as Graphics Processing Units (GPUs), are popular components of modern parallel systems. Their energy-efficient performance makes them attractive components for modern data center nodes. However, they lack control for fair resource sharing amongst multiple users. This paper presents a runtime and Just-In-Time compiler that enable resource sharing control and software managed scheduling on accelerators. Our approach is portable and transparent, requiring no modification or recompilation of existing systems or user applications. We provide an extensive evaluation of our scheme with over 40,000 different workloads on 2 platforms and we deliver fairness improvements ranging from 6.8x to 13.66x. In addition, we also deliver system throughput speedups ranging from 1.13x to 1.31x.
@InProceedings{CGO16p82,
author = {Christos Margiolas and Michael F. P. O'Boyle},
title = {Portable and Transparent Software Managed Scheduling on Accelerators for Fair Resource Sharing},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {82--93},
doi = {},
year = {2016},
}
Communication-Aware Mapping of Stream Graphs for Multi-GPU Platforms
Dong Nguyen and Jongeun Lee
(Ulsan National Institute of Science and Technology, South Korea)
Stream graphs provide a natural way to represent many applications in the multimedia and DSP domains. Though the exposed parallelism of stream graphs makes it relatively easy to map them to general-purpose GPUs (GPGPUs), very large stream graphs, as well as the question of how best to exploit multi-GPU platforms for scalable performance, pose great challenges for stream graph mapping. Previous work either considers only a single GPU or is based on a crude heuristic that achieves a very low degree of workload balancing, and thus shows only limited scalability. In this paper we present a highly scalable GPGPU mapping technique for large stream graphs with the following highlights: (1) an accurate GPU performance estimation model for subsets of stream graphs, (2) a novel partitioning heuristic exploiting a stream graph's structural properties, and (3) an ILP (Integer Linear Programming) formulation of the mapping problem. Our experimental results on a real GPU platform demonstrate that our technique generates scalable performance for up to 4 GPUs with large stream graphs, and produces highly optimized multi-GPU code, especially for compute-bound graphs.
@InProceedings{CGO16p94,
author = {Dong Nguyen and Jongeun Lee},
title = {Communication-Aware Mapping of Stream Graphs for Multi-GPU Platforms},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {94--104},
doi = {},
year = {2016},
}
gpucc: An Open-Source GPGPU Compiler
Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary,
Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt
(Google, USA)
Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high performance computing. It performs various general and CUDA-specific optimizations to generate high performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. Compilation is 8% faster than with NVIDIA’s toolchain (nvcc), and gpucc reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), and Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.
@InProceedings{CGO16p105,
author = {Jingyue Wu and Artem Belevich and Eli Bendersky and Mark Heffernan and Chris Leary and Jacques Pienaar and Bjarke Roune and Rob Springer and Xuetian Weng and Robert Hundt},
title = {gpucc: An Open-Source GPGPU Compiler},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {105--116},
doi = {},
year = {2016},
}
Affine Programs
A Basic Linear Algebra Compiler for Structured Matrices
Daniele G. Spampinato and
Markus Püschel
(ETH Zurich, Switzerland)
Many problems in science and engineering are in practice modeled and solved through matrix computations. Often, the matrices involved have structure, such as symmetric or triangular form, which reduces the operation count needed to perform the computation. For example, dense linear systems of equations are solved by first converting to triangular form, and optimization problems may yield matrices with any kind of structure. The well-known BLAS (basic linear algebra subroutine) interface provides a small set of structured matrix computations, chosen to serve a certain set of higher-level functions (LAPACK). However, if a user encounters a computation or structure that is not supported, she loses the benefits of the structure and must fall back to a generic library. In this paper, we address this problem by providing a compiler that translates a given basic linear algebra computation on structured matrices into optimized C code, optionally vectorized with intrinsics. Our work combines prior work on the Spiral-like LGen compiler with techniques from polyhedral compilation to mathematically capture matrix structures. In the paper we consider upper/lower triangular and symmetric matrices, but the approach is extensible to a much larger set including blocked structures. We run experiments on a modern Intel platform against the Intel MKL library and a baseline implementation, showing competitive performance results for both BLAS and non-BLAS functionalities.
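To make the benefit of structure concrete, here is a minimal, hedged sketch (hand-written, not output of the compiler described above) of a lower-triangular matrix-vector product: iterating only over the nonzero triangle performs roughly half the work of a general matrix-vector product, and it is this kind of saving that a structure-aware code generator preserves where a generic dense routine cannot.

#include <vector>

// y = L * x, where L is lower triangular and stored row-major in a dense
// n-by-n buffer. Only the nonzero triangle is traversed.
void trmv_lower(const std::vector<double>& L, const std::vector<double>& x,
                std::vector<double>& y, int n) {
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int j = 0; j <= i; ++j)          // skip the zero upper triangle
            acc += L[i * n + j] * x[j];
        y[i] = acc;
    }
}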
@InProceedings{CGO16p117,
author = {Daniele G. Spampinato and Markus Püschel},
title = {A Basic Linear Algebra Compiler for Structured Matrices},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {117--127},
doi = {},
year = {2016},
}
Opening Polyhedral Compiler's Black Box
Lénaïc Bagnères, Oleksandr Zinenko, Stéphane Huot, and Cédric Bastoul
(INRIA, France; University of Paris-Saclay, France; University of Strasbourg, France)
While compilers offer a fair trade-off between productivity and executable performance in single-threaded execution, their optimizations remain fragile when addressing compute-intensive code for parallel architectures with deep memory hierarchies. Moreover, these optimizations operate as black boxes, impenetrable to the user, leaving them with no alternative to time-consuming and error-prone manual optimization when an imprecise cost model or a weak analysis results in a bad optimization decision. To address this issue, we propose a technique that automatically translates an arbitrary polyhedral optimization, used internally by the loop-level optimization frameworks of several modern compilers, into a sequence of comprehensible syntactic transformations, as long as the optimization focuses on scheduling loop iterations. This approach opens the black box of polyhedral frameworks, enabling users to examine, refine, replay, and even design complex optimizations semi-automatically in partnership with the compiler.
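As a hedged illustration (not an example from the paper), the pair of loop nests below shows the flavor of output such a translation could produce: a polyhedral schedule change is reported back as a single readable syntactic transformation, here a loop interchange that makes the inner loop unit-stride.

// Original nest: schedule (i, j); the inner loop strides through 'a' by n.
void before(double* a, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[j * n + i] += 1.0;
}

// After the reported transformation "interchange(i, j)": schedule (j, i);
// the inner loop is now unit-stride, improving spatial locality.
void after(double* a, int n) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            a[j * n + i] += 1.0;
}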
@InProceedings{CGO16p128,
author = {Lénaïc Bagnères and Oleksandr Zinenko and Stéphane Huot and Cédric Bastoul},
title = {Opening Polyhedral Compiler's Black Box},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {128--138},
doi = {},
year = {2016},
}
Trace-Based Affine Reconstruction of Codes
Gabriel Rodríguez, José M. Andión, Mahmut T. Kandemir, and Juan Touriño
(Universidade da Coruña, Spain; Pennsylvania State University, USA)
Complete comprehension of loop codes is desirable for a variety of program optimizations. Compilers perform static code analyses and transformations, such as loop tiling or memory partitioning, by constructing and manipulating formal representations of the source code. Runtime systems observe and characterize application behavior to drive resource management and allocation, including dependence detection and parallelization, or scheduling. However, the source codes of target applications are not always available to the compiler or runtime system in an analyzable form. It becomes necessary to find alternate ways to model application behavior. This paper presents a novel mathematical framework to rebuild loops from their memory access traces. An exploration engine traverses a tree-like solution space, driven by the access strides in the trace. It is guaranteed that the engine will find the minimal affine nest capable of reproducing the observed sequence of accesses by exploring this space in a brute force fashion, but most real traces will not be tractable in this way. Methods for an efficient solution space traversal based on mathematical properties of the equation systems which model the solution space are proposed. The experimental evaluation shows that these strategies achieve efficient loop reconstruction, processing hundreds of gigabytes of trace data in minutes. The proposed approach is capable of correctly and minimally reconstructing 100% of the static control parts in PolyBench/C applications. As a side effect, the trace reconstruction process can be used to efficiently compress trace files. The proposed tool can also be used for dynamic access characterization, predicting over 99% of future memory accesses.
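For intuition, the hypothetical sketch below (not from the paper) shows the kind of instrumented loop and address trace the reconstruction engine works from: every recorded address follows the affine function base + 8*(64*i + j) over a triangular iteration domain, and it is exactly this affine structure, visible in the strides between consecutive addresses, that the engine recovers.

#include <cstdint>
#include <vector>

// Records the address trace of an affine loop nest accessing A[i][j]
// for 0 <= i < N, 0 <= j < i (a triangular, affine iteration domain),
// where A has 64 columns of doubles.
std::vector<std::uintptr_t> record_trace(const double (*A)[64], int N) {
    std::vector<std::uintptr_t> trace;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < i; ++j)
            trace.push_back(reinterpret_cast<std::uintptr_t>(&A[i][j]));
    return trace;
}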
@InProceedings{CGO16p139,
author = {Gabriel Rodríguez and José M. Andión and Mahmut T. Kandemir and Juan Touriño},
title = {Trace-Based Affine Reconstruction of Codes},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {139--149},
doi = {},
year = {2016},
}
Static Analysis
Inference of Peak Density of Indirect Branches to Detect ROP Attacks
Mateus Tymburibá, Rubens E. A. Moreira, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil)
A program subject to a Return-Oriented Programming (ROP) attack usually presents an execution trace with a high frequency of indirect branches. From this observation, several researchers have proposed to monitor the density of these instructions to detect ROP attacks. These techniques use universal thresholds: the density of indirect branches that characterizes an attack is the same for every application. This paper shows that universal thresholds are easy to circumvent. As an alternative, we introduce an inter-procedural semi-context-sensitive static code analysis that estimates the maximum density of indirect branches possible for a program. This analysis determines detection thresholds for each application, thus making it more difficult for attackers to compromise programs via ROP. We have used an implementation of our technique in LLVM to find specific thresholds for the programs in SPEC CPU2006. By comparing these thresholds against actual execution traces of corresponding programs, we demonstrate the accuracy of our approach. Furthermore, our algorithm is practical: it finds an approximate solution to a theoretically undecidable problem, and handles programs with up to 700 thousand assembly instructions in 25 minutes.
@InProceedings{CGO16p150,
author = {Mateus Tymburibá and Rubens E. A. Moreira and Fernando Magno Quintão Pereira},
title = {Inference of Peak Density of Indirect Branches to Detect ROP Attacks},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {150--159},
doi = {},
year = {2016},
}
Sparse Flow-Sensitive Pointer Analysis for Multithreaded Programs
Yulei Sui, Peng Di, and
Jingling Xue
(UNSW, Australia)
For C programs, flow-sensitivity is important to enable pointer analysis to achieve highly usable precision. Despite significant recent advances in scaling flow-sensitive pointer analysis sparsely for sequential C programs, relatively little progress has been made for multithreaded C programs. In this paper, we present FSAM, a new Flow-Sensitive pointer Analysis that achieves its scalability for large Multithreaded C programs by performing sparse analysis on top of a series of thread interference analysis phases. We evaluate FSAM with 10 multithreaded C programs (with more than 100K lines of code for the largest) from Phoenix-2.0, Parsec-3.0 and open-source applications. For two programs, raytrace and x264, the traditional data-flow-based flow-sensitive pointer analysis is unscalable (failing to finish within two hours), but our analysis spends just under 5 minutes on raytrace and 9 minutes on x264. For the rest, our analysis is 12x faster and uses 28x less memory.
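For readers less familiar with the terminology, the tiny example below (hypothetical, not from the paper) shows why flow-sensitivity pays off: a flow-insensitive analysis concludes that p may point to either a or b at the final store, whereas a flow-sensitive analysis knows that only b can be written there.

int a, b;

void f(void) {
    int* p = &a;   // flow-sensitive points-to set of p here: {a}
    *p = 1;        // writes a
    p = &b;        // from this point on, points-to set of p: {b}
    *p = 2;        // flow-sensitive: only b can be written here;
                   // flow-insensitive: p may point to {a, b}
}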
@InProceedings{CGO16p160,
author = {Yulei Sui and Peng Di and Jingling Xue},
title = {Sparse Flow-Sensitive Pointer Analysis for Multithreaded Programs},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {160--170},
doi = {},
year = {2016},
}
Symbolic Range Analysis of Pointers
Vitor Paisante, Maroua Maalej, Leonardo Barbosa,
Laure Gonnord, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil; University of Lyon, France; LIP, France)
Alias analysis is one of the most fundamental techniques that compilers use to optimize languages with pointers. However, in spite of all the attention that this topic has received, the current state-of-the-art approaches inside compilers still face challenges regarding precision and speed. In particular, pointer arithmetic, a key feature in C and C++, is yet to be handled satisfactorily. This paper presents a new alias analysis algorithm to solve this problem. The key insight of our approach is to combine alias analysis with symbolic range analysis. This combination lets us disambiguate fields within arrays and structs, effectively achieving more precision than traditional algorithms. To validate our technique, we have implemented it on top of the LLVM compiler. Tests on a vast suite of benchmarks show that we can disambiguate several kinds of C idioms that current state-of-the-art analyses cannot deal with. In particular, we can disambiguate 1.35x more queries than the alias analysis currently available in LLVM. Furthermore, our analysis is very fast: we can go over one million assembly instructions in 10 seconds.
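The hypothetical snippet below (not from the paper) shows the kind of pointer-arithmetic idiom that symbolic ranges can disambiguate: p and q are derived from the same base object, so a purely points-to-based analysis reports a possible alias, but their symbolic ranges [buf, buf+n) and [buf+n, buf+2n) never overlap.

// buf has at least 2*n elements; the two halves are processed disjointly.
void split_halves(int* buf, int n) {
    int* p = buf;        // accesses fall in the range [buf, buf + n)
    int* q = buf + n;    // accesses fall in the range [buf + n, buf + 2n)
    for (int i = 0; i < n; ++i)
        p[i] = q[i] + 1; // range analysis proves p[i] and q[i] never alias
}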
@InProceedings{CGO16p171,
author = {Vitor Paisante and Maroua Maalej and Leonardo Barbosa and Laure Gonnord and Fernando Magno Quintão Pereira},
title = {Symbolic Range Analysis of Pointers},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {171--181},
doi = {},
year = {2016},
}
Programming Models
Towards Automatic Significance Analysis for Approximate Computing
Vassilis Vassiliadis, Jan Riehme, Jens Deussen, Konstantinos Parasyris, Christos D. Antonopoulos, Nikolaos Bellas, Spyros Lalis, and Uwe Naumann
(CERTH, Greece; University of Thessaly, Greece; RWTH Aachen University, Germany)
Several applications may trade off output quality for energy efficiency by computing only an approximation of their output. Current approaches to software-based approximate computing often require the programmer to specify the parts of the code or data structures that can be approximated. A largely unaddressed challenge is how to automate the analysis of the significance of code for the output quality. To this end, we propose a methodology and toolset for automatic significance analysis. We use interval arithmetic and algorithmic differentiation in our profile-driven yet mathematical approach to evaluate the significance of input and intermediate variables for the output of a computation. Our methodology effectively matches the decisions of a domain expert in significance characterization for a set of benchmarks, and in some cases offers new insights. Evaluation of the software infrastructure on a multicore x86 platform shows energy reductions ranging from 31% to 91% (with a mean of 56%) compared to fully accurate execution, with graceful quality degradation.
@InProceedings{CGO16p182,
author = {Vassilis Vassiliadis and Jan Riehme and Jens Deussen and Konstantinos Parasyris and Christos D. Antonopoulos and Nikolaos Bellas and Spyros Lalis and Uwe Naumann},
title = {Towards Automatic Significance Analysis for Approximate Computing},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {182--193},
doi = {},
year = {2016},
}
Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns
Kevin J. Brown, HyoukJoong Lee,
Tiark Rompf, Arvind K. Sujeeth, Christopher De Sa, Christopher Aberger, and
Kunle Olukotun
(Stanford University, USA; Google, USA; Purdue University, USA)
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and to combine them in ad-hoc ways. To optimize distributed applications both for modern hardware and for modern programmers, we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately, existing systems tend to fall short of these requirements. In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage, as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually-optimized counterparts in multiple distributed programming models.
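As a loose illustration (the paper's intermediate language is pattern-based, not C++), the nested shape below is the kind of computation that nested-pattern transformations restructure: an outer map whose body is a reduce, which can be split so the map distributes across nodes while each inner reduce runs on a local device.

#include <numeric>
#include <vector>

// Outer map over rows, inner reduce over each row's elements.
std::vector<double> row_sums(const std::vector<std::vector<double>>& m) {
    std::vector<double> out(m.size());
    for (std::size_t i = 0; i < m.size(); ++i)                    // map
        out[i] = std::accumulate(m[i].begin(), m[i].end(), 0.0);  // reduce
    return out;
}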
@InProceedings{CGO16p194,
author = {Kevin J. Brown and HyoukJoong Lee and Tiark Rompf and Arvind K. Sujeeth and Christopher De Sa and Christopher Aberger and Kunle Olukotun},
title = {Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {194--205},
doi = {},
year = {2016},
}
NRG-Loops: Adjusting Power from within Applications
Melanie Kambadur and Martha A. Kim
(Columbia University, USA)
NRG-Loops are source-level abstractions that allow an application to dynamically manage its power and energy through adjustments to functionality, performance, and accuracy. The adjustments, which come in the form of truncated, adapted, or perforated loops, are conditionally enabled as runtime power and energy constraints dictate. NRG-Loops are portable across different hardware platforms and operating systems and are complementary to existing system-level efficiency techniques, such as DVFS and idle states. Using a prototype C library supported by commodity hardware energy meters (and with no modifications to the compiler or operating system), this paper demonstrates four NRG-Loop applications that, with 2-6 lines of source code changes, can save up to 55% power and 90% energy, resulting in up to 12X better energy efficiency than system-level techniques.
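The sketch below conveys the flavor of a perforated loop guarded by an energy check; the function power_above_budget() and the stride adjustment are hypothetical stand-ins, not the actual NRG-Loops API from the paper.

#include <cstddef>
#include <vector>

// Hypothetical placeholder for a hardware energy-meter query.
static bool power_above_budget() { return false; }

double mean(const std::vector<double>& v) {
    // Perforation: when over budget, sample every 4th element instead of all.
    const std::size_t stride = power_above_budget() ? 4 : 1;
    double sum = 0.0;
    std::size_t count = 0;
    for (std::size_t i = 0; i < v.size(); i += stride) {
        sum += v[i];
        ++count;
    }
    return count ? sum / count : 0.0;
}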
@InProceedings{CGO16p206,
author = {Melanie Kambadur and Martha A. Kim},
title = {NRG-Loops: Adjusting Power from within Applications},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {206--215},
doi = {},
year = {2016},
}
Correctness
Validating Optimizations of Concurrent C/C++ Programs
Soham Chakraborty and
Viktor Vafeiadis
(MPI-SWS, Germany)
We present a validator for checking the correctness of LLVM compiler optimizations on C11 programs as far as concurrency is concerned. Our validator checks that optimizations do not change memory accesses in ways disallowed by the C11 and/or LLVM memory models. We use a custom C11 concurrent program generator to trigger multiple LLVM optimizations and evaluate the efficacy of our validator. Our experiments highlighted the difference between the C11 and LLVM memory models, and uncovered a number of previously unknown compilation errors in the LLVM optimizations involving the C11 concurrency primitives.
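To give a sense of what such a validator checks, here is a classic, well-known example (not necessarily one of the paper's test cases) of a transformation that is invalid under the C11 model: turning a conditional store into an unconditional one introduces a write (and a read) that the source program did not perform, which can create a data race with another thread.

// Before: x is written only when c is true.
void before(bool c, int& x) {
    if (c) x = 1;
}

// After an invalid "if-conversion": x is always read and written, even when
// c is false, so a concurrent access to x that was race-free in the source
// becomes racy in the transformed program.
void after(bool c, int& x) {
    int tmp = x;
    x = c ? 1 : tmp;
}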
@InProceedings{CGO16p216,
author = {Soham Chakraborty and Viktor Vafeiadis},
title = {Validating Optimizations of Concurrent C/C++ Programs},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {216--226},
doi = {},
year = {2016},
}
IPAS: Intelligent Protection against Silent Output Corruption in Scientific Applications
Ignacio Laguna, Martin Schulz, David F. Richards, Jon Calhoun, and Luke Olson
(Lawrence Livermore National Laboratory, USA; University of Illinois at Urbana-Champaign, USA)
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes—we call these errors silent output corruption (SOC) errors. Thus, applications require duplication only for code that, when affected by a fault, yields SOC. We use machine learning to learn code instructions that must be protected to avoid SOC, and, using a compiler, we protect only those vulnerable instructions by duplication, thus significantly reducing the overhead that is introduced by instruction duplication. In our experiments with five workloads, IPAS reduces the percentage of SOC by up to 90% with a slowdown that ranges between 1.04x and 1.35x, which corresponds to as much as 47% less slowdown than state-of-the-art instruction duplication techniques.
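The minimal sketch below (hand-written, not output of the IPAS compiler pass) illustrates the underlying idea of selective instruction duplication: only a computation judged SOC-prone is executed twice and compared, and a mismatch signals a likely silent corruption.

#include <cstdio>
#include <cstdlib>

double critical_update(double a, double b) {
    double r       = a * b + a;   // original computation
    double r_check = a * b + a;   // compiler-inserted duplicate
    if (r != r_check) {           // detector: a transient fault in either
                                  // copy makes the two results disagree
        std::fprintf(stderr, "possible silent data corruption\n");
        std::abort();
    }
    return r;
}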
@InProceedings{CGO16p227,
author = {Ignacio Laguna and Martin Schulz and David F. Richards and Jon Calhoun and Luke Olson},
title = {IPAS: Intelligent Protection against Silent Output Corruption in Scientific Applications},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {227--238},
doi = {},
year = {2016},
}
Atomicity Violation Checker for Task Parallel Programs
Adarsh Yoga and
Santosh Nagarakatte
(Rutgers University, USA)
Task based programming models (e.g., Cilk, Intel TBB, X10, Java Fork-Join tasks) simplify multicore programming in contrast to programming with threads. In a task based model, the programmer specifies parallel tasks and the runtime maps these tasks to hardware threads. The runtime automatically balances the load using work-stealing and provides performance portability. However, interference between parallel tasks can result in concurrency errors.
This paper proposes a dynamic analysis technique to detect atomicity violations in task parallel programs that could occur in different schedules for a given input, without performing interleaving exploration. Our technique leverages the series-parallel dynamic execution structure of a task parallel program to identify parallel accesses. It also maintains access history metadata with each shared memory location to identify parallel accesses that can cause atomicity violations in different schedules. To streamline metadata management, the access history metadata is split into global metadata that is shared by all tasks and local metadata that is specific to each task. The global metadata tracks a fixed number of access histories for each shared memory location, capturing all possible access patterns necessary for an atomicity violation. Our prototype tool for Intel Threading Building Blocks (TBB) detects atomicity violations that can potentially occur in different interleavings for a given input, with performance overheads similar to the Velodrome atomicity checker for thread-based programs.
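For context, the snippet below shows a textbook atomicity violation of the kind such a checker targets, written with std::thread for self-containment rather than with the TBB tasks the tool actually analyzes: each access to balance is individually protected, yet the read-modify-write as a whole is not atomic, so one deposit can be lost under an unlucky schedule.

#include <mutex>
#include <thread>

int balance = 0;
std::mutex m;

void deposit(int amount) {
    int tmp;
    { std::lock_guard<std::mutex> g(m); tmp = balance; }          // read
    // ... unrelated work; the lock is not held across the update ...
    { std::lock_guard<std::mutex> g(m); balance = tmp + amount; } // write
}

int main() {
    std::thread t1(deposit, 10), t2(deposit, 20);   // two parallel "tasks"
    t1.join(); t2.join();
    // balance may be 10, 20, or 30 depending on the schedule.
    return 0;
}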
@InProceedings{CGO16p239,
author = {Adarsh Yoga and Santosh Nagarakatte},
title = {Atomicity Violation Checker for Task Parallel Programs},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {239--249},
doi = {},
year = {2016},
}
Binary/Virtualization
Flexible On-Stack Replacement in LLVM
Daniele Cono D'Elia and Camil Demetrescu
(Sapienza University of Rome, Italy)
On-Stack Replacement (OSR) is a technique for dynamically transferring execution between different versions of a function at run time. OSR is typically used in virtual machines to interrupt a long-running function and recompile it at a higher optimization level, or to replace it with a different one when a speculative assumption made during its compilation no longer holds.
In this paper we present a framework for OSR that introduces novel ideas and combines features of existing techniques that no previous solution provided simultaneously. New features include OSR with compensation code to adjust the program state during a transition and the ability to fire an OSR from arbitrary locations in the code. Our approach is platform-independent as the OSR machinery is entirely encoded at a compiler’s intermediate representation level.
We implement and evaluate our technique in the LLVM compiler infrastructure, which is gaining popularity as a Just-In-Time (JIT) compiler in virtual machines for dynamic languages such as JavaScript, MATLAB, Python, and Ruby. As a case study of our approach, we show how to improve the state of the art in the optimization of the feval instruction, a performance-critical construct of the MATLAB language.
@InProceedings{CGO16p250,
author = {Daniele Cono D'Elia and Camil Demetrescu},
title = {Flexible On-Stack Replacement in LLVM},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {250--260},
doi = {},
year = {2016},
}
BlackBox: Lightweight Security Monitoring for COTS Binaries
Byron Hawkins,
Brian Demsky, and Michael B. Taylor
(University of California at Irvine, USA; University of California at San Diego, USA)
After a software system is compromised, it can be difficult to understand what vulnerabilities attackers exploited. Any information residing on that machine cannot be trusted as attackers may have tampered with it to cover their tracks. Moreover, even after an exploit is known, it can be difficult to determine whether it has been used to compromise a given machine. Aviation has long used black boxes to better understand the causes of accidents, enabling improvements that reduce the likelihood of future accidents. Many attacks introduce abnormal control flows to compromise systems. In this paper, we present BlackBox, a monitoring system for COTS software. Our techniques enable BlackBox to efficiently monitor unexpected and potentially harmful control flow in COTS binaries. BlackBox constructs dynamic profiles of an application's typical control flows to filter the vast majority of expected control flow behavior, leaving us with a manageable amount of data that can be logged across the network to remote devices. Modern applications make extensive use of dynamically generated code, some of which varies greatly between executions. We introduce support for code generators that can detect security-sensitive behaviors while allowing BlackBox to avoid logging the majority of ordinary behaviors. We have implemented BlackBox in DynamoRIO. We evaluate the runtime overhead of BlackBox, and show that it can effectively monitor recent versions of Microsoft Office and Google Chrome. We show that in ROP, COOP, and state-of-the-art JIT injection attacks, BlackBox logs the pivotal actions by which the attacker takes control, and can also blacklist those actions to prevent repeated exploits.
@InProceedings{CGO16p261,
author = {Byron Hawkins and Brian Demsky and Michael B. Taylor},
title = {BlackBox: Lightweight Security Monitoring for COTS Binaries},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {261--272},
doi = {},
year = {2016},
}
Re-constructing High-Level Information for Language-Specific Binary Re-optimization
Toshihiko Koju, Reid Copeland, Motohiro Kawahito, and Moriyoshi Ohara
(IBM Research, Japan; IBM, Canada)
In this paper, we show that a binary optimizer can achieve performance competitive with a state-of-the-art source code compiler by re-constructing high-level information (HLI) from binaries. Recent advances in compiler technologies have resulted in a large performance gap between binaries compiled with old compilers and those compiled with the latest ones. This motivated us to develop a binary optimizer for old binaries using the compiler engine of a state-of-the-art source code compiler. However, a traditional approach that naively converts machine instructions into an intermediate representation (IR) of the compiler engine does not allow us to take full advantage of the optimization techniques available in the compiler. This is because the HLI, such as information about variables and their data types, is not available in such an IR. To address this issue, we have devised a technique to re-construct the HLI from binaries by using contextual information. This contextual information is a set of knowledge about specific compilation technologies, such as the conventions of data structures, the patterns of instruction sequences, and the semantics of runtime routines. With this technique, our binary optimizer has improved the performance of binaries generated by an older compiler by 40.1% on average in CPU time for a set of benchmarks, which is close to the 55.2% average improvement obtained by source-code recompilation with the same compiler engine.
@InProceedings{CGO16p273,
author = {Toshihiko Koju and Reid Copeland and Motohiro Kawahito and Moriyoshi Ohara},
title = {Re-constructing High-Level Information for Language-Specific Binary Re-optimization},
booktitle = {Proc.\ CGO},
publisher = {ACM},
pages = {273--283},
doi = {},
year = {2016},
}