CGO 2021 – Author Index |
Alappat, Christie L. |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time-consuming, expensive, and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further, we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
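For context, the kind of kernel such stencil tuning tools target can be as simple as the following plain-C++ sweep. This is a minimal sketch, not YASK/YaskSite API; grid size, coefficients, and iteration count are illustrative:

```cpp
#include <cstddef>
#include <vector>

// A 3-point Jacobi-style stencil sweep. Tools like YASK/YaskSite tune
// blocking, vectorization, and unrolling for kernels of this shape.
void stencil_sweep(const std::vector<double>& in, std::vector<double>& out) {
    const std::size_t n = in.size();
    for (std::size_t i = 1; i + 1 < n; ++i)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

int main() {
    std::vector<double> a(1 << 20, 1.0), b(a.size(), 0.0);
    for (int t = 0; t < 100; ++t) {  // iterative (time) dimension
        stencil_sweep(a, b);
        a.swap(b);                   // double buffering
    }
}
```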
|
Amaral, José Nelson |
CGO '21: "Vulkan Vision: Ray Tracing ..."
Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation
David Pankratz, Tyler Nowicki, Ahmed Eltantawy, and José Nelson Amaral (University of Alberta, Canada; Huawei Technologies, Canada) While there are mature performance monitoring, profiling and instrumentation tools to help understand the dynamic behaviour of general-purpose GPU applications, the abstract programming models of graphics applications have limited the development of such tools for graphics. This paper introduces Vulkan Vision (V-Vision), a framework for collecting detailed GPU execution data from Vulkan applications to guide hardware-informed improvements. A core contribution of V-Vision is providing out-of-the-box data collection for capturing complete dynamic warp and thread execution traces. V-Vision also provides analyses for the following purposes: identifying and visualizing application hotspots to guide optimization, characterizing application behaviour and estimating the effect of architectural modifications. This paper demonstrates the potential for these analyses in applications that utilize the recent ray tracing extension in Vulkan and describes new insights about the applications and the underlying hardware. @InProceedings{CGO21p137, author = {David Pankratz and Tyler Nowicki and Ahmed Eltantawy and José Nelson Amaral}, title = {Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {137--149}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Amarasinghe, Saman |
CGO '21: "BuildIt: A Type-Based Multi-stage ..."
BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++
Ajay Brahmakshatriya and Saman Amarasinghe (Massachusetts Institute of Technology, USA) The simplest implementation of a domain-specific language is to embed it in an existing language using operator overloading. This way, the DSL can inherit parsing, syntax and type checking, error handling, and the toolchain of debuggers and IDEs from the host language. A natural host language choice for most high-performance DSLs is the de-facto high-performance language, C++. However, DSL designers quickly run into the problem of not being able to extract control flows due to a lack of introspection in C++ and have to resort to special functions with lambdas to represent loops and conditionals. This approach introduces unnecessary syntax and does not capture the side effects of updates inside the lambdas in a safe way. We present BuildIt, a type-based multi-stage execution framework that solves this problem by extracting all control flow operators like if-then-else conditionals and for and while loops using a pure library approach. BuildIt achieves this by repeated execution of the program to explore all control flow paths and construct the AST piece by piece. We show that BuildIt can do this without exponential blow-up in terms of output size and execution time. We apply BuildIt’s staging capabilities to the state-of-the-art tensor compiler TACO to generate low-level IR for custom-level formats. Thus, BuildIt offers a way to get both generalization and programmability for the user while generating specialized and efficient code. We also demonstrate that BuildIt can generate rich control-flow from relatively simple code by using it to stage an interpreter for an esoteric language. BuildIt changes the way we think about multi-staging as a problem by reducing the PL complexity of a supposedly harder problem requiring features like introspection or specialized compiler support to a set of common features found in most languages. @InProceedings{CGO21p39, author = {Ajay Brahmakshatriya and Saman Amarasinghe}, title = {BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {39--51}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations as well as the hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework that achieves high performance on both CPUs and GPUs using the same algorithm specification.
G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
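The operator-overloading embedding that the BuildIt abstract refers to can be sketched in a few lines of C++. The toy `Expr` class below is hypothetical and is not BuildIt's API; it only shows how arithmetic on staged values can record an expression tree instead of computing a result:

```cpp
#include <iostream>
#include <string>

// Toy staged-expression type: operators build a representation of the
// computation (here, a string) rather than evaluating it.
struct Expr {
    std::string text;
    explicit Expr(std::string t) : text(std::move(t)) {}
};

Expr operator+(const Expr& a, const Expr& b) {
    return Expr("(" + a.text + " + " + b.text + ")");
}
Expr operator*(const Expr& a, const Expr& b) {
    return Expr("(" + a.text + " * " + b.text + ")");
}

int main() {
    Expr x("x"), y("y");
    Expr e = x + y * x;           // builds an AST instead of evaluating
    std::cout << e.text << "\n";  // prints: (x + (y * x))
}
```

What overloading alone cannot observe is control flow (`if`, `for`, `while`) over staged values; the abstract's key point is that BuildIt recovers those by repeatedly executing the program along each control-flow path, without exponential blow-up.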
|
Amini, Mehdi |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Arif, Mahwish |
CGO '21: "Cinnamon: A Domain-Specific ..."
Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring
Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, and Timothy M. Jones (University of Cambridge, UK; University of Sussex, UK) Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error. To address this, we introduce Cinnamon, a domain-specific language designed to write programs for binary profiling and monitoring. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex lower-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon to C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use-cases. @InProceedings{CGO21p103, author = {Mahwish Arif and Ruoyu Zhou and Hsi-Ming Ho and Timothy M. Jones}, title = {Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {103--114}, doi = {}, year = {2021}, } |
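As a rough illustration of what a binary-profiling DSL compiles down to, here is a hand-written sketch of a basic-block counting probe in C++. The callback names are invented for illustration and are not Cinnamon output; the host instrumentation framework would invoke the callback at each block entry:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// One counter per basic-block address, filled in by the framework's hook.
static std::unordered_map<uint64_t, uint64_t> block_counts;

extern "C" void on_basic_block(uint64_t address) {  // hypothetical hook name
    ++block_counts[address];
}

extern "C" void report() {                          // called at program exit
    for (const auto& [addr, count] : block_counts)
        std::printf("%#llx executed %llu times\n",
                    (unsigned long long)addr, (unsigned long long)count);
}
```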
|
Barbar, Mohamad |
CGO '21: "Object Versioning for Flow-Sensitive ..."
Object Versioning for Flow-Sensitive Pointer Analysis
Mohamad Barbar, Yulei Sui, and Shiping Chen (University of Technology Sydney, Australia; CSIRO's Data61, Australia) Flow-sensitive points-to analysis provides better precision than its flow-insensitive counterpart. Traditionally performed on the control-flow graph, it incurs heavy analysis overhead. For performance, staged flow-sensitive analysis (SFS) is conducted on a pre-computed def-use (value-flow) graph where points-to sets of variables are propagated across def-use chains sparsely rather than across control-flow in the control-flow graph. SFS makes the propagation of different objects’ points-to sets sparse (multiple-object sparsity); however, it suffers from redundant propagation of the same object’s points-to sets between instructions (single-object sparsity). The points-to set of an object is often duplicated, resulting in redundant propagation and storage, especially in real-world heap-intensive programs. We notice that a simple graph prelabelling extension can identify much of this redundancy in a pre-analysis. With this pre-analysis, multiple nodes (instructions) in the value-flow graph can share an individual memory object’s points-to set rather than each node maintaining its own points-to set for that single object. We present object versioning for flow-sensitive points-to analysis, a finer single-object sparsity technique which maintains the same precision while allowing us to avoid much of the redundancy present in propagating and storing points-to sets. Our experiments conducted on 15 open-source programs, when compared with SFS, show that our approach runs up to 26.22× faster (5.31× on average), and reduces memory usage by up to 5.46× (2.11× on average). @InProceedings{CGO21p222, author = {Mohamad Barbar and Yulei Sui and Shiping Chen}, title = {Object Versioning for Flow-Sensitive Pointer Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {222--235}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
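The sharing idea is easy to sketch outside any real pointer-analysis framework. In this hedged C++ sketch (not the paper's implementation; all names are invented), nodes that the pre-analysis assigns the same version of an object alias a single points-to set instead of each storing a copy:

```cpp
#include <map>
#include <memory>
#include <set>
#include <utility>

using PointsToSet = std::set<int>;               // abstract object IDs
using SharedPts   = std::shared_ptr<PointsToSet>;

struct Node {
    std::map<int, int> versionOf;  // object -> version, from the pre-analysis
};

// One points-to set per (object, version) pair, shared by every node
// the pre-analysis assigned that version.
static std::map<std::pair<int, int>, SharedPts> versionedPts;

SharedPts& ptsFor(const Node& n, int object) {
    int v = n.versionOf.at(object);              // throws if object unknown
    auto& slot = versionedPts[{object, v}];
    if (!slot) slot = std::make_shared<PointsToSet>();
    return slot;  // many nodes alias the same set -> no duplication
}
```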
|
Barik, Raj |
CGO '21: "An Experience with Code-Size ..."
An Experience with Code-Size Optimization for Production iOS Mobile Applications
Milind Chabbi, Jin Lin, and Raj Barik (Uber Technologies, USA) Modern mobile application binaries are bulky for many reasons: software and its dependencies, fast-paced addition of new features, high-level language constructs, and statically linked platform libraries. Reduced application size is critical not only for the end-user experience but also for meeting vendors’ download size limitations. Moreover, download size restrictions may impact revenues for critical businesses. In this paper, we highlight some of the key reasons for code-size bloat in iOS mobile applications, specifically apps written using a mix of Swift and Objective-C. Our observation reveals that machine code sequences systematically repeat throughout the app’s binary. We highlight source-code patterns and high-level language constructs that lead to an increase in the code size. We propose whole-program, fine-grained machine-code outlining as an effective optimization to constrain the code-size growth. We evaluate the effectiveness of our new optimization pipeline on the UberRider iOS app used by millions of customers daily. Our optimizations reduce the code size by 23%. The impact of our optimizations on the code size grows in magnitude over time as the code evolves. For a set of performance spans defined by the app developers, the optimizations do not statistically regress production performance. We applied the same optimizations to Uber’s UberDriver and UberEats apps and gained 17% and 19% size savings, respectively. @InProceedings{CGO21p363, author = {Milind Chabbi and Jin Lin and Raj Barik}, title = {An Experience with Code-Size Optimization for Production iOS Mobile Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {363--377}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable |
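Outlining itself is easiest to see at source level; the paper performs it at the machine-code level, across the whole program. A minimal conceptual sketch:

```cpp
// Before outlining, f and g would both end in the same multiply-add
// sequence; after outlining, that sequence is stored once and reached
// via a call. Code size shrinks at the cost of a call/return, which is
// why outliners weigh size savings against execution frequency.
static int outlined_tail(int v) { return v * 7 + 3; }  // shared sequence

int f(int x) { return outlined_tail(x); }  // was: return x * 7 + 3;
int g(int y) { return outlined_tail(y); }  // was: return y * 7 + 3;

int main() { return f(1) + g(2) > 0 ? 0 : 1; }
```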
|
Bello, Luciano |
CGO '21: "Relaxed Peephole Optimization: ..."
Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits
Ji Liu, Luciano Bello, and Huiyang Zhou (North Carolina State University, USA; IBM Research, USA) As in classical computing, compilers play an important role in quantum computing. Quantum processors typically support a limited set of primitive operations or quantum gates and have certain hardware-related limitations. A quantum compiler is responsible for adapting a quantum program to these constrained environments and decomposing quantum gates into a sequence of the primitive ones. During the compilation process, it is also critical for the compiler to optimize the quantum circuits in order to reduce the noise in the computation results. Since the noise is introduced by operations and decoherence, reducing the gate count is the key for improving performance. In this paper, we propose a novel quantum compiler optimization, named relaxed peephole optimization (RPO) for quantum computers. RPO leverages the single-qubit state information that can be determined statically by the compiler. We define that a qubit is in a basis state when, at a given point in time, its state is either in the X-, Y-, or Z-basis (|+>/|−>, |L>/|R> and |0>/|1>). When basis qubits are used as inputs to quantum gates, there exist opportunities for strength reduction, which replaces quantum operations with equivalent but less expensive ones. Compared to the existing peephole optimization for quantum programs, the difference is that our proposed optimization does not require an identical unitary matrix, hence the name ‘relaxed’ peephole optimization. We also extend our approach to optimize the quantum gates when some input qubits are in known pure states. Both optimizations, namely the Quantum Basis-state Optimization (QBO) and the Quantum Pure-state Optimization (QPO), are implemented in IBM’s Qiskit transpiler. Our experimental results show that our proposed optimization pass is fast and effective. The circuits optimized with our compiler optimizations obtain up to 18.0% (11.7% on average) fewer CNOT gates and up to 8.2% (7.1% on average) lower transpilation time than that of the most aggressive optimization level in the Qiskit compiler. When running on real quantum computers, the success rate of a 3-qubit quantum phase estimation algorithm improves by 2.30X due to the reduced gate counts. @InProceedings{CGO21p301, author = {Ji Liu and Luciano Bello and Huiyang Zhou}, title = {Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {301--314}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
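A toy version of the basis-state strength reduction, written here as a plain C++ rewrite over a hypothetical gate list rather than as a real Qiskit pass (real passes must also track how gates change the known states):

```cpp
#include <string>
#include <vector>

// If a CNOT's control qubit is statically known to be |0>, the gate is
// the identity and can be dropped; if it is known to be |1>, the CNOT
// degenerates to a cheaper single-qubit X on the target.
enum class Basis { Zero, One, Unknown };

struct Gate { std::string name; int control, target; };

std::vector<Gate> relaxedPeephole(const std::vector<Gate>& circuit,
                                  const std::vector<Basis>& qubitState) {
    std::vector<Gate> out;
    for (const Gate& g : circuit) {
        if (g.name == "cx" && qubitState[g.control] == Basis::Zero)
            continue;                            // identity: drop the gate
        else if (g.name == "cx" && qubitState[g.control] == Basis::One)
            out.push_back({"x", -1, g.target});  // strength-reduce to X
        else
            out.push_back(g);
    }
    return out;
}

int main() {
    std::vector<Gate> c{{"cx", 0, 1}, {"h", -1, 0}};
    auto opt = relaxedPeephole(c, {Basis::One, Basis::Unknown});
    return (int)opt.size();  // {"x" on q1, "h" on q0}
}
```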
|
Ben-Nun, Tal |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s for single-device and multi-device execution, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Bondhugula, Uday |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Brahmakshatriya, Ajay |
CGO '21: "BuildIt: A Type-Based Multi-stage ..."
BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++
Ajay Brahmakshatriya and Saman Amarasinghe (Massachusetts Institute of Technology, USA) The simplest implementation of a domain-specific language is to embed it in an existing language using operator overloading. This way, the DSL can inherit parsing, syntax and type checking, error handling, and the toolchain of debuggers and IDEs from the host language. A natural host language choice for most high-performance DSLs is the de-facto high-performance language, C++. However, DSL designers quickly run into the problem of not being able to extract control flows due to a lack of introspection in C++ and have to resort to special functions with lambdas to represent loops and conditionals. This approach introduces unnecessary syntax and does not capture the side effects of updates inside the lambdas in a safe way. We present BuildIt, a type-based multi-stage execution framework that solves this problem by extracting all control flow operators like if-then-else conditionals and for and while loops using a pure library approach. BuildIt achieves this by repeated execution of the program to explore all control flow paths and construct the AST piece by piece. We show that BuildIt can do this without exponential blow-up in terms of output size and execution time. We apply BuildIt’s staging capabilities to the state-of-the-art tensor compiler TACO to generate low-level IR for custom-level formats. Thus, BuildIt offers a way to get both generalization and programmability for the user while generating specialized and efficient code. We also demonstrate that BuildIt can generate rich control-flow from relatively simple code by using it to stage an interpreter for an esoteric language. BuildIt changes the way we think about multi-staging as a problem by reducing the PL complexity of a supposedly harder problem requiring features like introspection or specialized compiler support to a set of common features found in most languages. @InProceedings{CGO21p39, author = {Ajay Brahmakshatriya and Saman Amarasinghe}, title = {BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {39--51}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations as well as the hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework that achieves high performance on both CPUs and GPUs using the same algorithm specification.
G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Cao, Liqing |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memorizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, on a set of 19 apps that FlowDroid needs 10 to 128 GB of RAM to analyze, DiskDroid requires less than 10 GB of RAM, with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM with FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
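A minimal sketch of the swapping half of such a design, with invented names and deliberately simplified accounting (each partition is assumed to be touched once; DiskDroid's real scheduler is far more sophisticated and also recomputes some data rather than storing it):

```cpp
#include <cstddef>
#include <deque>
#include <string>

struct Partition {
    std::string id;
    std::size_t bytes = 0;
    bool onDisk = false;   // analysis facts serialized to disk when true
};

class SwapManager {
    std::deque<Partition*> lru_;   // front = least recently used
    std::size_t resident_ = 0;
    const std::size_t budget_;     // e.g., a 10 GB budget in DiskDroid
public:
    explicit SwapManager(std::size_t budgetBytes) : budget_(budgetBytes) {}

    void touched(Partition* p) {
        lru_.push_back(p);
        resident_ += p->bytes;
        while (resident_ > budget_ && !lru_.empty()) {
            Partition* victim = lru_.front();  // evict the coldest data
            lru_.pop_front();
            victim->onDisk = true;             // serialize, then free memory
            resident_ -= victim->bytes;
        }
    }
};

int main() {
    Partition a{"a", 6u << 20}, b{"b", 6u << 20};
    SwapManager m(10u << 20);  // 10 MB budget for the demo
    m.touched(&a);
    m.touched(&b);             // crosses the budget: a (the LRU) goes to disk
    return a.onDisk ? 0 : 1;
}
```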
|
Carlson, Trevor E. |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, no fast-forwarding to the region of interest is needed, nor is there any uncertainty about reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Chabbi, Milind |
CGO '21: "An Experience with Code-Size ..."
An Experience with Code-Size Optimization for Production iOS Mobile Applications
Milind Chabbi, Jin Lin, and Raj Barik (Uber Technologies, USA) Modern mobile application binaries are bulky for many reasons: software and its dependencies, fast-paced addition of new features, high-level language constructs, and statically linked platform libraries. Reduced application size is critical not only for the end-user experience but also for meeting vendors’ download size limitations. Moreover, download size restrictions may impact revenues for critical businesses. In this paper, we highlight some of the key reasons for code-size bloat in iOS mobile applications, specifically apps written using a mix of Swift and Objective-C. Our observation reveals that machine code sequences systematically repeat throughout the app’s binary. We highlight source-code patterns and high-level language constructs that lead to an increase in the code size. We propose whole-program, fine-grained machine-code outlining as an effective optimization to constrain the code-size growth. We evaluate the effectiveness of our new optimization pipeline on the UberRider iOS app used by millions of customers daily. Our optimizations reduce the code size by 23%. The impact of our optimizations on the code size grows in magnitude over time as the code evolves. For a set of performance spans defined by the app developers, the optimizations do not statistically regress production performance. We applied the same optimizations to Uber’s UberDriver and UberEats apps and gained 17% and 19% size savings, respectively. @InProceedings{CGO21p363, author = {Milind Chabbi and Jin Lin and Raj Barik}, title = {An Experience with Code-Size Optimization for Production iOS Mobile Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {363--377}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable |
|
Chelini, Lorenzo |
CGO '21: "Progressive Raising in Multi-level ..."
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. @InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Chen, Gang |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
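The explicit-SIMD style, as opposed to SIMT's one-scalar-per-thread style, can be illustrated with a portable C++ vector type. `Vec` below is only an analogy for CM's vector/matrix types, not their implementation; CM maps such operations directly onto the GPU's SIMD ISA and register file:

```cpp
#include <array>
#include <cstddef>

// A fixed-width vector whose operations are explicitly N-wide.
template <typename T, std::size_t N>
struct Vec {
    std::array<T, N> lane{};

    Vec operator+(const Vec& o) const {
        Vec r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = lane[i] + o.lane[i];
        return r;
    }
    // Cross-lane data sharing: broadcast one lane to all lanes.
    Vec broadcast(std::size_t i) const {
        Vec r;
        r.lane.fill(lane[i]);
        return r;
    }
};

int main() {
    Vec<float, 8> a{}, b{};
    Vec<float, 8> c = a + b;           // one 8-wide add, stated explicitly
    Vec<float, 8> d = c.broadcast(0);  // cross-lane movement, also explicit
    (void)d;
}
```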
|
Chen, Kaiyu |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Chen, Shiping |
CGO '21: "Object Versioning for Flow-Sensitive ..."
Object Versioning for Flow-Sensitive Pointer Analysis
Mohamad Barbar, Yulei Sui, and Shiping Chen (University of Technology Sydney, Australia; CSIRO's Data61, Australia) Flow-sensitive points-to analysis provides better precision than its flow-insensitive counterpart. Traditionally performed on the control-flow graph, it incurs heavy analysis overhead. For performance, staged flow-sensitive analysis (SFS) is conducted on a pre-computed def-use (value-flow) graph where points-to sets of variables are propagated across def-use chains sparsely rather than across control-flow in the control-flow graph. SFS makes the propagation of different objects’ points-to sets sparse (multiple-object sparsity); however, it suffers from redundant propagation of the same object’s points-to sets between instructions (single-object sparsity). The points-to set of an object is often duplicated, resulting in redundant propagation and storage, especially in real-world heap-intensive programs. We notice that a simple graph prelabelling extension can identify much of this redundancy in a pre-analysis. With this pre-analysis, multiple nodes (instructions) in the value-flow graph can share an individual memory object’s points-to set rather than each node maintaining its own points-to set for that single object. We present object versioning for flow-sensitive points-to analysis, a finer single-object sparsity technique which maintains the same precision while allowing us to avoid much of the redundancy present in propagating and storing points-to sets. Our experiments conducted on 15 open-source programs, when compared with SFS, show that our approach runs up to 26.22× faster (5.31× on average), and reduces memory usage by up to 5.46× (2.11× on average). @InProceedings{CGO21p222, author = {Mohamad Barbar and Yulei Sui and Shiping Chen}, title = {Object Versioning for Flow-Sensitive Pointer Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {222--235}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Chen, Wei-Yu |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Chen, Ying |
CGO '21: "Enhancing Atomic Instruction ..."
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-link/Store-conditional (LL/SC), mostly found in reduced instruction set (RISC) architectures, and Compare-and-Swap (CAS), mostly found in complex instruction set (CISC) architectures. However, the state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular, from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies, QEMU, a state-of-the-art DBT that runs multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve a min, max, and geomean speedup of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme. @InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
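The ABA hazard mentioned above is a classic lock-free pitfall. This hedged C++ sketch (not QEMU's scheme) shows why a plain CAS cannot substitute for LL/SC: CAS compares only values, so it cannot detect that a location was modified and then restored:

```cpp
#include <atomic>

// Treiber-stack pop emulating a load-link/store-conditional pair with CAS.
struct Node { Node* next; };
std::atomic<Node*> top{nullptr};

Node* pop_unsafe() {
    Node* old_top = top.load();  // the "load-link" substitute
    while (old_top &&
           !top.compare_exchange_weak(old_top, old_top->next)) {
        // retry: on failure, old_top is refreshed by compare_exchange_weak
    }
    // If another thread popped old_top, pushed other nodes, then pushed
    // old_top back (A -> B -> A), the CAS above still succeeds even though
    // old_top->next is stale, corrupting the stack. A true SC would have
    // failed, because the location was written in between.
    return old_top;
}

int main() { return pop_unsafe() == nullptr ? 0 : 1; }  // empty stack
```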
|
Choi, Heelim |
CGO '21: "Fine-Grained Pipeline Parallelization ..."
Fine-Grained Pipeline Parallelization for Network Function Programs
Seungbin Song, Heelim Choi, and Hanjun Kim (Yonsei University, South Korea) Network programming languages enable programmers to implement new network functions on various hardware and software stacks in the domain of Software Defined Networking (SDN). Although the languages extend the flexibility of network devices, existing compilers do not fully optimize the network programs due to their coarse-grained parallelization methods. The compilers consider each packet processing table that consists of match and action functions as a unit of tasks and parallelize the programs without decomposing match and action functions. This work proposes a new fine-grained pipeline parallelization compiler for network programming languages, named PSDN. First, the PSDN compiler decouples match and action functions from packet processing tables and analyzes dependencies among the matches and actions. While respecting the dependencies, the compiler efficiently schedules each match and action function into a pipeline with clock cycle estimation and fuses functions to reduce synchronization overheads. This work implements the PSDN compiler that translates a P4 network program to a Xilinx PX program, which is synthesizable to NetFPGA-SUME hardware. The proposed compiler reduces packet processing latency by 12.1% and utilization by 3.5% compared to previous work. @InProceedings{CGO21p162, author = {Seungbin Song and Heelim Choi and Hanjun Kim}, title = {Fine-Grained Pipeline Parallelization for Network Function Programs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {162--173}, doi = {}, year = {2021}, } |
|
Cho, Sungjun |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as an application specific integrated circuit (ASIC) and a field programmable gate array (FPGA) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to timely launch new devices. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers do not fit embedded devices well, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly-invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. The experiment results of six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by the state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurement results show 65.3% energy saving over Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Cohen, Albert |
CGO '21: "Seamless Compiler Integration ..."
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot (Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France) Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP. @InProceedings{CGO21p65, author = {Tiago Trevisan Jost and Yves Durand and Christian Fabre and Albert Cohen and Frédéric Pétrot}, title = {Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {65--76}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics.
@InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. @InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
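For readers unfamiliar with MPFR, the manual call-and-rounding-mode style that the paper's compilation flow hides behind ordinary typed C code looks like this. These are standard MPFR C API calls; the 200-bit precision is illustrative (link with -lmpfr -lgmp):

```cpp
#include <mpfr.h>

int main() {
    mpfr_t x, y, z;
    mpfr_inits2(200, x, y, z, (mpfr_ptr) 0);  // 200-bit significands

    mpfr_set_d(x, 1.0, MPFR_RNDN);            // every op names a rounding mode
    mpfr_set_ui(y, 3, MPFR_RNDN);
    mpfr_div(z, x, y, MPFR_RNDN);             // z = 1/3 at 200 bits
    mpfr_printf("%.50Rf\n", z);

    mpfr_clears(x, y, z, (mpfr_ptr) 0);       // explicit deallocation
    return 0;
}
```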
|
Cole, Murray |
CGO '21: "Loop Parallelization using ..."
Loop Parallelization using Dynamic Commutativity Analysis
Christos Vasiladiotis, Roberto Castañeda Lozano, Murray Cole, and Björn Franke (University of Edinburgh, UK) Automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems outside the numerical domain. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. Using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints limiting existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our novel parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks which contain almost 1400 loops, DCA is able to identify as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. We then apply DCA to complex pointer-based loops, where it can successfully detect parallelism, while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6× (and up to 55×) across the NAS benchmarks on a 72-core host, and up to 36.9× for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops. @InProceedings{CGO21p150, author = {Christos Vasiladiotis and Roberto Castañeda Lozano and Murray Cole and Björn Franke}, title = {Loop Parallelization using Dynamic Commutativity Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {150--161}, doi = {}, year = {2021}, } |
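The dynamic test at the heart of commutativity analysis can be sketched in a few lines. This is a simplification of DCA, which applies the idea to profiled executions of real loop iterations rather than to toy lambdas:

```cpp
#include <functional>
#include <vector>

using State = std::vector<int>;
using Iter  = std::function<void(State&)>;

// Two loop iterations commute on a given input if running them in either
// order leaves the observable state equal.
bool commute(const State& input, const Iter& it1, const Iter& it2) {
    State ab = input, ba = input;  // clone the observable state
    it1(ab); it2(ab);              // order: i then j
    it2(ba); it1(ba);              // order: j then i
    return ab == ba;               // same result => they commute here
}

int main() {
    State s{1, 2, 3};
    Iter addFront = [](State& st) { st[0] += 5; };
    Iter mulBack  = [](State& st) { st[2] *= 2; };
    return commute(s, addFront, mulBack) ? 0 : 1;  // these commute
}
```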
|
Corporaal, Henk |
CGO '21: "Progressive Raising in Multi-level ..."
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. @InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Da Silva, Anderson Faustino |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown codes. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult, due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It compresses code that is impervious even to the state-of-the-art Function Sequence Alignment technique published in 2019, as it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info |
|
Davis, Andy |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain-specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by this novel design in semantics, optimization specification, system design, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
De Fine Licht, Johannes |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
De Matteis, Tiziano |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Dong, Xiao |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
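To see what residual-based refinement means in miniature, consider a quantized dot product: a first int8 pass captures a coarse result, and a second int8 pass over the quantization residual recovers much of the lost precision. The C++ sketch below is an assumption-laden stand-in for the paper's Tensor Core GEMMs; plain integer loops replace the hardware instructions, and the quantizer is the textbook symmetric scheme rather than the paper's.

```cpp
// Sketch of residual-based precision refinement for a quantized dot
// product (illustrative; real hardware would run both passes as int8
// Tensor Core GEMMs).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Symmetric int8 quantization: returns quantized values, writes the scale.
static std::vector<int8_t> quantize(const std::vector<float>& v, float& scale) {
    float maxabs = 1e-30f;
    for (float x : v) maxabs = std::max(maxabs, std::fabs(x));
    scale = maxabs / 127.0f;
    std::vector<int8_t> q(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        q[i] = (int8_t)std::lround(v[i] / scale);
    return q;
}

int main() {
    std::vector<float> a = {0.913f, -0.402f, 0.777f, 0.105f};
    std::vector<float> b = {0.351f, 0.614f, -0.223f, 0.908f};

    float sa, sb, sr;
    auto qa = quantize(a, sa);
    auto qb = quantize(b, sb);

    // Residual of a: exactly what int8 quantization of a threw away.
    std::vector<float> ra(a.size());
    for (size_t i = 0; i < a.size(); ++i) ra[i] = a[i] - sa * qa[i];
    auto qr = quantize(ra, sr);

    int32_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        acc0 += (int32_t)qa[i] * qb[i];  // first low-precision pass
        acc1 += (int32_t)qr[i] * qb[i];  // residual refinement pass
    }
    float coarse  = sa * sb * acc0;
    float refined = coarse + sr * sb * acc1;

    float exact = 0;
    for (size_t i = 0; i < a.size(); ++i) exact += a[i] * b[i];
    std::printf("exact %.6f  coarse %.6f  refined %.6f\n", exact, coarse, refined);
}
```

The refinement pass costs one extra low-precision multiply-accumulate sweep, which is the performance-versus-precision tradeoff the abstract refers to.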
|
Drebes, Andi |
CGO '21: "Progressive Raising in Multi-level ..."
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. @InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Durand, Yves |
CGO '21: "Seamless Compiler Integration ..."
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot (Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France) Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types, and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP. @InProceedings{CGO21p65, author = {Tiago Trevisan Jost and Yves Durand and Christian Fabre and Albert Cohen and Frédéric Pétrot}, title = {Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {65--76}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
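For readers unfamiliar with the software-library target, the hand-written sketch below shows the kind of code such a compilation flow could plausibly lower a 200-bit computation to. The source-level variable-precision type is the paper's C extension and is not shown; the calls here are the standard MPFR C API, written manually for illustration.

```cpp
// What a 200-bit FP computation might look like after lowering to MPFR
// (hand-written illustration, not the paper's generated code).
// Build with: g++ demo.cpp -lmpfr -lgmp
#include <mpfr.h>

int main() {
    mpfr_t x, y, z;
    // Source-level declarations of 200-bit variables become init calls.
    mpfr_init2(x, 200);
    mpfr_init2(y, 200);
    mpfr_init2(z, 200);

    mpfr_set_d(x, 1.0, MPFR_RNDN);
    mpfr_set_d(y, 3.0, MPFR_RNDN);

    // z = x / y, correctly rounded to 200 bits (round-to-nearest).
    mpfr_div(z, x, y, MPFR_RNDN);
    mpfr_printf("1/3 at 200 bits: %.50Rf\n", z);

    mpfr_clears(x, y, z, (mpfr_ptr)0);
    return 0;
}
```

The boilerplate visible here (init/clear pairs, rounding-mode arguments threaded through every call) is precisely the "cumbersome memory management and calls to external libraries" that the type-system extension hides from the programmer.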
|
Eltantawy, Ahmed |
CGO '21: "Vulkan Vision: Ray Tracing ..."
Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation
David Pankratz, Tyler Nowicki, Ahmed Eltantawy, and José Nelson Amaral (University of Alberta, Canada; Huawei Technologies, Canada) While there are mature performance monitoring, profiling and instrumentation tools to help understand the dynamic behaviour of general-purpose GPU applications, the abstract programming models of graphics applications have limited the development of such tools for graphics. This paper introduces Vulkan Vision (V-Vision), a framework for collecting detailed GPU execution data from Vulkan applications to guide hardware-informed improvements. A core contribution of V-Vision is providing out-of-the-box data collection for capturing complete dynamic warp and thread execution traces. V-Vision also provides analyses for the following purposes: identifying and visualizing application hotspots to guide optimization, characterizing application behaviour, and estimating the effect of architectural modifications. This paper demonstrates the potential for these analyses in applications that utilize the recent ray tracing extension in Vulkan and describes new insights about the applications and the underlying hardware. @InProceedings{CGO21p137, author = {David Pankratz and Tyler Nowicki and Ahmed Eltantawy and José Nelson Amaral}, title = {Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {137--149}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Fabre, Christian |
CGO '21: "Seamless Compiler Integration ..."
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot (Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France) Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types, and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP. @InProceedings{CGO21p65, author = {Tiago Trevisan Jost and Yves Durand and Christian Fabre and Albert Cohen and Frédéric Pétrot}, title = {Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {65--76}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Feng, Xiaobing |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
|
Franchetti, Franz |
CGO '21: "An Interval Compiler for Sound ..."
An Interval Compiler for Sound Floating-Point Computations
Joao Rivera, Franz Franchetti, and Markus Püschel (ETH Zurich, Switzerland; Carnegie Mellon University, USA) Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, even though such guarantees can be imperative in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsics in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8× when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results. @InProceedings{CGO21p52, author = {Joao Rivera and Franz Franchetti and Markus Püschel}, title = {An Interval Compiler for Sound Floating-Point Computations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {52--64}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
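The underlying construction is classic interval arithmetic with directed rounding: every scalar becomes a pair [lo, hi], lower bounds are rounded toward -infinity and upper bounds toward +infinity, so the true real value is always enclosed. The following is a textbook C++ sketch of that construction, not IGen's SIMD-optimized output, and it glosses over the care a real implementation must take to keep the compiler from reordering or constant-folding FP operations.

```cpp
// Minimal sound interval addition via directed rounding (textbook
// construction; a production tool needs stricter control of the FP
// environment and of compiler optimizations).
#include <cfenv>
#include <cstdio>

// Some compilers honor this pragma and disable offending FP folding.
#pragma STDC FENV_ACCESS ON

struct Interval { double lo, hi; };

Interval add(Interval a, Interval b) {
    Interval r;
    std::fesetround(FE_DOWNWARD);  // lower bound rounds toward -inf
    r.lo = a.lo + b.lo;
    std::fesetround(FE_UPWARD);    // upper bound rounds toward +inf
    r.hi = a.hi + b.hi;
    std::fesetround(FE_TONEAREST);
    return r;
}

int main() {
    // 0.1 is not exactly representable in binary floating point; the
    // two rounding directions keep the exact value inside the interval.
    Interval x{0.1, 0.1}, acc{0.0, 0.0};
    for (int i = 0; i < 10; ++i) acc = add(acc, x);
    std::printf("10 * 0.1 is enclosed by [%.17g, %.17g]\n", acc.lo, acc.hi);
}
```

The width of the resulting interval is exactly the accumulated rounding error, which is why the paper pairs this scheme with double-double compilation to keep intervals tight.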
|
Franke, Björn |
CGO '21: "Loop Parallelization using ..."
Loop Parallelization using Dynamic Commutativity Analysis
Christos Vasiladiotis, Roberto Castañeda Lozano, Murray Cole, and Björn Franke (University of Edinburgh, UK) Automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems outside the numerical domain. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. Using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints limiting existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our novel parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks which contain almost 1400 loops, DCA is able to identify as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. We then apply DCA to complex pointer-based loops, where it can successfully detect parallelism, while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6× (and up to 55×) across the NAS benchmarks on a 72-core host, and up to 36.9× for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops. @InProceedings{CGO21p150, author = {Christos Vasiladiotis and Roberto Castañeda Lozano and Murray Cole and Björn Franke}, title = {Loop Parallelization using Dynamic Commutativity Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {150--161}, doi = {}, year = {2021}, } |
|
Fuentes, Joel |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Fu, Fangwen |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Gao, Lin |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memorizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps which require 10 to 128 GB of RAM under FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM, with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM under FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
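As a toy model of the disk-assisted idea, and emphatically not DiskDroid's scheduler, one can keep analysis facts in memory until a budget is hit and then spill them to a file; the class name, binary format, and threshold below are all invented for illustration.

```cpp
// Toy spill-to-disk container for analysis facts (illustrative only).
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

class SpillableEdgeSet {
    std::vector<std::pair<int, int>> mem_;  // in-memory path edges
    std::size_t threshold_;
    const char* path_;
public:
    SpillableEdgeSet(std::size_t threshold, const char* path)
        : threshold_(threshold), path_(path) {}

    void insert(int src, int dst) {
        mem_.emplace_back(src, dst);
        if (mem_.size() >= threshold_) spill();  // budget hit: go to disk
    }

    // Append the in-memory edges to the file and release the RAM.
    void spill() {
        std::FILE* f = std::fopen(path_, "ab");
        std::fwrite(mem_.data(), sizeof(mem_[0]), mem_.size(), f);
        std::fclose(f);
        mem_.clear();
        mem_.shrink_to_fit();
    }

    // Stream spilled edges back, e.g., when a summary is queried.
    void for_each_spilled(void (*fn)(int, int)) const {
        std::FILE* f = std::fopen(path_, "rb");
        if (!f) return;
        std::pair<int, int> e;
        while (std::fread(&e, sizeof(e), 1, f) == 1) fn(e.first, e.second);
        std::fclose(f);
    }
};

int main() {
    std::remove("edges.bin");                 // fresh demo file
    SpillableEdgeSet edges(4, "edges.bin");   // tiny threshold for demo
    for (int i = 0; i < 10; ++i) edges.insert(i, i + 1);
    edges.for_each_spilled([](int s, int d) { std::printf("%d->%d\n", s, d); });
}
```

The interesting engineering in the paper is everything this sketch omits: deciding which facts to recompute instead of store, and scheduling the memory-disk traffic so that the analysis actually gets faster rather than slower.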
|
Gong, Xiaoli |
CGO '21: "Enhancing Atomic Instruction ..."
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-Link/Store-Conditional (LL/SC), which is mostly found in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), which is mostly found in complex instruction set architectures (CISC). However, the state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT that runs multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve minimum, maximum, and geometric-mean speedups of 1.25×, 3.21×, and 2.03×, respectively, over the best existing software-based scheme. @InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
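The ABA hazard is easy to reproduce: a plain CAS compares only values, so a concurrent A-to-B-to-A update slips through between the emulated load-link and store-conditional. The C++ sketch below shows the hazard and one representative software remedy from the version-counter family: pair the value with a counter and update both with a double-width CAS. This is an illustration of the problem class, not the paper's specific schemes, and it assumes a platform where the 16-byte atomic is available (on x86-64 this typically needs cmpxchg16b support and possibly -latomic).

```cpp
// Why value-only CAS is an unsound LL/SC emulation, and the classic
// version-counter fix (illustrative; not the paper's implementation).
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Versioned {
    uint64_t value;
    uint64_t version;  // bumped on every store, so A->B->A stays visible
};

std::atomic<Versioned> cell{Versioned{42, 0}};

// Emulated store-conditional: succeeds only if neither the value nor
// the version changed since the emulated load-link read `expected`.
bool emulated_sc(Versioned expected, uint64_t desired) {
    Versioned next{desired, expected.version + 1};
    return cell.compare_exchange_strong(expected, next);
}

int main() {
    Versioned linked = cell.load();           // emulated load-link reads 42

    // Another thread's interference: 42 -> 7 -> 42 (an ABA pattern).
    cell.store(Versioned{7,  linked.version + 1});
    cell.store(Versioned{42, linked.version + 2});

    // A value-only CAS would wrongly succeed here and could corrupt a
    // lock-free stack; the version counter makes the emulated
    // store-conditional fail, as a real LL/SC would.
    bool ok = emulated_sc(linked, 99);
    std::printf("store-conditional after ABA: %s\n", ok ? "succeeded" : "failed");
}
```

A lock-free stack pop translated with a value-only CAS is exactly the scenario in which the abstract's QEMU experiment crashes within seconds.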
|
Grosser, Tobias |
CGO '21: "Progressive Raising in Multi-level ..."
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements. @InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Guimarães, Breno Campos Ferreira |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown codes. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult, due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It compresses code that is impervious even to the state-of-the-art Function Sequence Alignment technique published in 2019, as it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info |
|
Hager, Georg |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time consuming, expensive and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Hajiabadi, Ali |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, no fast-forwarding to the region of interest is needed, and there is no uncertainty about reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Hall, Mary |
CGO '21: "Data Layout and Data Representation ..."
Data Layout and Data Representation Optimizations to Reduce Data Movement (Keynote)
Mary Hall (University of Utah, USA) Code generation and optimization for the diversity of current and future architectures must focus on reducing data movement to achieve high performance. How data is laid out in memory, and representations that compress data (e.g., reduced floating point precision) have a profound impact on data movement. Moreover, the cost of data movement in a program is architecture-specific, and consequently, optimizing data layout and data representation must be performed by a compiler once the target architecture is known. With this context in mind, this talk will provide examples of data layout and data representation optimizations, and call for integrating these data properties into code generation and optimization systems. @InProceedings{CGO21p1, author = {Mary Hall}, title = {Data Layout and Data Representation Optimizations to Reduce Data Movement (Keynote)}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {1--1}, doi = {}, year = {2021}, } |
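A small example of the kind of layout decision the talk refers to: summing one field of a record streams far less memory in struct-of-arrays form than in array-of-structs form, because only the needed field occupies the fetched cache lines. The C++ comparison below is an illustration constructed for this index, not material from the talk.

```cpp
// AoS vs. SoA: same computation, very different data movement.
#include <cstdio>
#include <vector>

struct ParticleAoS { double x, y, z, mass; };  // 32 bytes per record

struct ParticlesSoA {                          // one array per field
    std::vector<double> x, y, z, mass;
};

double total_mass_aos(const std::vector<ParticleAoS>& p) {
    double m = 0;
    for (const auto& q : p) m += q.mass;  // drags x, y, z through cache too
    return m;
}

double total_mass_soa(const ParticlesSoA& p) {
    double m = 0;
    for (double q : p.mass) m += q;       // streams only the mass array
    return m;
}

int main() {
    std::vector<ParticleAoS> aos(1000, ParticleAoS{1, 2, 3, 0.5});
    ParticlesSoA soa{std::vector<double>(1000, 1), std::vector<double>(1000, 2),
                     std::vector<double>(1000, 3), std::vector<double>(1000, 0.5)};
    std::printf("%g %g\n", total_mass_aos(aos), total_mass_soa(soa));
}
```

Which layout wins depends on the access pattern and the target machine, which is the talk's argument for letting the compiler choose the layout once the architecture is known.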
|
Heirman, Wim |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, no fast-forwarding to the region of interest is needed, and there is no uncertainty about reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Hendren, Laurie |
CGO '21: "r3d3: Optimized Query Compilation ..."
r3d3: Optimized Query Compilation on GPUs
Alexander Krolik, Clark Verbrugge, and Laurie Hendren (McGill University, Canada) Query compilation is an effective approach to improve the performance of repeated database queries. GPU-based approaches have significant promise, but face difficulties in managing compilation time and data transfer costs, and in addressing a reasonably comprehensive range of SQL operations. In this work we describe a hybrid AoT/JIT approach to GPU-based query compilation. We use multiple optimizations to reduce execution, compile, and data transfer times, improving performance over both other GPU-based approaches and CPU-based query compilers. Our design addresses a wide range of SQL queries, sufficient to demonstrate the practicality of using GPUs for query optimization. @InProceedings{CGO21p277, author = {Alexander Krolik and Clark Verbrugge and Laurie Hendren}, title = {r3d3: Optimized Query Compilation on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {277--288}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Functional Results Reproduced |
|
Hoefler, Torsten |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Hofer, Dominic |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Ho, Hsi-Ming |
CGO '21: "Cinnamon: A Domain-Specific ..."
Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring
Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, and Timothy M. Jones (University of Cambridge, UK; University of Sussex, UK) Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error. To address this, we introduce Cinnamon, a domain-specific language designed to write programs for binary profiling and monitoring. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex lower-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon to C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use-cases. @InProceedings{CGO21p103, author = {Mahwish Arif and Ruoyu Zhou and Hsi-Ming Ho and Timothy M. Jones}, title = {Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {103--114}, doi = {}, year = {2021}, } |
|
Hong, Changwan |
CGO '21: "Compiling Graph Applications ..."
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations to apply as well as the hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
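Of the two optimizations, EdgeBlocking is the easier to picture: edges are bucketed by destination range so that each processing pass confines its random accesses to a cache-sized slice of the output array. The C++ sketch below mimics the idea on a toy graph; G2 itself performs this as a compiler transformation on GPU code, and the names and block size here are illustrative.

```cpp
// EdgeBlocking in miniature: bucket edges by destination range so each
// pass touches only one cache-sized slice of the output (illustrative).
#include <cstdio>
#include <vector>

struct Edge { int src, dst; };

// `block` is chosen so that `block` entries of `out` fit in L2;
// the value used in main() is purely for the demo.
void blocked_sum(const std::vector<Edge>& edges, const std::vector<int>& val,
                 std::vector<int>& out, int block) {
    int nblocks = ((int)out.size() + block - 1) / block;
    std::vector<std::vector<Edge>> bucket(nblocks);
    for (const Edge& e : edges) bucket[e.dst / block].push_back(e);
    for (const auto& b : bucket)        // each pass stays within one
        for (const Edge& e : b)         // block-sized window of `out`
            out[e.dst] += val[e.src];
}

int main() {
    std::vector<Edge> edges = {{0, 3}, {1, 0}, {2, 3}, {3, 1}};
    std::vector<int> val = {1, 2, 3, 4}, out(4, 0);
    blocked_sum(edges, val, out, 2);
    for (int v : out) std::printf("%d ", v);
    std::printf("\n");
}
```

The reordering trades one sequential pass over the edges for many cache-friendly passes, which pays off precisely when the destination array is much larger than L2.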
|
Isaev, Alexander |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, there is no fast-forwarding to the region of interest needed or the uncertainty of reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Jain, Animesh |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have started to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low-precision elements into one high-precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when it comes to many tensorized instructions. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
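The idiom in question has a simple scalar reading: a high-precision accumulator absorbs a short reduction of low-precision products, e.g., four int8 multiplies folding into one int32 accumulation for VNNI/DOT-style instructions (signedness details vary by ISA). The reference loop below shows the shape of code such a compiler would pattern-match and rewrite to a single hardware instruction per group; it is a semantic sketch, not UNIT's implementation.

```cpp
// Scalar semantics of the tensorized reduction idiom: int32 accumulator,
// groups of four int8 products (the shape a compiler pattern-matches).
#include <cstdint>
#include <cstdio>

int32_t dot_idiom(const int8_t* a, const int8_t* b, int n, int32_t acc) {
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; ++j)               // this 4-wide reduction
            acc += (int32_t)a[i + j] * b[i + j];  // maps to one instruction
    return acc;
}

int main() {
    int8_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int8_t b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    std::printf("%d\n", dot_idiom(a, b, 8, 0));  // prints 120
}
```

Recognizing this mixed-type reduce-into-wider-accumulator shape, rather than ordinary same-type vectorization, is what distinguishes tensorized-instruction compilation from classic loop vectorization.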
|
Jeong, Shinnung |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers are a poor fit for embedded devices, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. Experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by a state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurements show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Jiang, Hong |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications, as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Jiang, Zhang |
CGO '21: "Enhancing Atomic Instruction ..."
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-Link/Store-Conditional (LL/SC), found mostly in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), found mostly in complex instruction set architectures (CISC). However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may trigger the well-known ABA problem, which can lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT tool, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS) often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either incur heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them in QEMU to evaluate their performance overheads. The results show that all of the proposed schemes provide correct emulation and, for the best solution, achieve minimum, maximum, and geometric-mean speedups of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme. @InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
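The hazard is easy to reproduce in a few lines. The C11 sketch below shows the naive LL/SC-to-CAS mapping a DBT might emit for a lock-free stack pop; it is a minimal illustration of the ABA problem, not QEMU's actual generated code.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { struct Node *next; } Node;
static _Atomic(Node *) top;

/* Naive emulation of an ARM LL/SC pop using an x86-style CAS. The
 * CAS succeeds whenever the head pointer VALUE matches, even if the
 * node was popped and re-pushed in between (the ABA problem), so a
 * stale 'new' can be installed. A genuine store-conditional would
 * fail here, because it is invalidated by ANY intervening write to
 * the monitored location, not just a value change. */
Node *pop_naive(void) {
    Node *old, *new;
    do {
        old = atomic_load(&top);   /* plays the role of load-link */
        if (old == NULL)
            return NULL;
        new = old->next;           /* may read freed or reused memory */
    } while (!atomic_compare_exchange_weak(&top, &old, new)); /* "SC" */
    return old;
}
```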
|
Jones, Timothy M. |
CGO '21: "Cinnamon: A Domain-Specific ..."
Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring
Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, and Timothy M. Jones (University of Cambridge, UK; University of Sussex, UK) Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error. To address this, we introduce Cinnamon, a domain-specific language designed to write programs for binary profiling and monitoring. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex lower-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon to C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use-cases. @InProceedings{CGO21p103, author = {Mahwish Arif and Ruoyu Zhou and Hsi-Ming Ho and Timothy M. Jones}, title = {Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {103--114}, doi = {}, year = {2021}, } |
|
Kamil, Shoaib |
CGO '21: "Compiling Graph Applications ..."
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations as well as which hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
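To give a concrete sense of the second optimization, the C sketch below performs an EdgeBlocking-style preprocessing pass: a counting sort of the edge list by destination range, so that traversal touches destination-vertex data one cache-sized window at a time. The names and blocking granularity are illustrative, not G2's implementation.

```c
#include <stdlib.h>

/* EdgeBlocking-style preprocessing (illustrative): bucket edges by
 * destination range so random accesses to destination data stay
 * within a window sized to fit in L2 cache. block_start must hold
 * nblocks + 1 entries; out must hold m entries. */
typedef struct { int src, dst; } Edge;

void edge_block(const Edge *edges, size_t m, int block_verts,
                Edge *out, size_t *block_start, int nblocks) {
    size_t *cnt = calloc((size_t)nblocks + 1, sizeof *cnt);
    for (size_t e = 0; e < m; ++e)            /* count edges per block */
        cnt[edges[e].dst / block_verts + 1]++;
    for (int b = 0; b < nblocks; ++b)         /* prefix sum -> offsets */
        cnt[b + 1] += cnt[b];
    for (int b = 0; b <= nblocks; ++b)
        block_start[b] = cnt[b];
    for (size_t e = 0; e < m; ++e)            /* scatter into blocks  */
        out[cnt[edges[e].dst / block_verts]++] = edges[e];
    free(cnt);
}
```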
|
Kim, Changsu |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers are a poor fit for embedded devices, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. Experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by a state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurements show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Kim, Hanjun |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers are a poor fit for embedded devices, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. Experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by a state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurements show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } CGO '21: "Fine-Grained Pipeline Parallelization ..." Fine-Grained Pipeline Parallelization for Network Function Programs Seungbin Song, Heelim Choi, and Hanjun Kim (Yonsei University, South Korea) Network programming languages enable programmers to implement new network functions on various hardware and software stacks in the domain of Software Defined Networking (SDN). Although these languages extend the flexibility of network devices, existing compilers do not fully optimize network programs due to their coarse-grained parallelization methods. The compilers treat each packet processing table, which consists of match and action functions, as a unit of work, and parallelize the programs without decomposing match and action functions. This work proposes a new fine-grained pipeline parallelization compiler for network programming languages, named PSDN. First, the PSDN compiler decouples match and action functions from packet processing tables and analyzes dependencies among the matches and actions. While respecting these dependencies, the compiler efficiently schedules each match and action function into a pipeline with clock cycle estimation and fuses functions to reduce synchronization overheads. This work implements the PSDN compiler, which translates a P4 network program into a Xilinx PX program that is synthesizable to NetFPGA-SUME hardware. The proposed compiler reduces packet processing latency by 12.1% and utilization by 3.5% compared to previous work. 
@InProceedings{CGO21p162, author = {Seungbin Song and Heelim Choi and Hanjun Kim}, title = {Fine-Grained Pipeline Parallelization for Network Function Programs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {162--173}, doi = {}, year = {2021}, } |
|
Kim, Youngsok |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers are a poor fit for embedded devices, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. Experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by a state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurements show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Kind, Bruno Conde |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown codes. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code-size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite that are over 10% smaller than those produced by clang -Oz. It shrinks even code that is impervious to the state-of-the-art Function Sequence Alignment technique published in 2019, as it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info |
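To illustrate the kind of repair involved, consider the hypothetical snippet below: mined in isolation, it references a struct and a constant defined elsewhere in its original repository, so it does not compile on its own. Type reconstruction synthesizes plausible declarations from usage so the unit compiles to an object file. The reconstructed declarations are illustrative, not AnghaBench's verbatim output.

```c
/* As mined from a repository: 'struct task' and 'READY' are unknown,
 * so this function cannot be compiled in isolation. */
#if 0
int count_ready(struct task *t) {
    int n = 0;
    for (; t; t = t->next)
        if (t->state == READY) n++;
    return n;
}
#endif

/* After Hindley-Milner-style reconstruction: the missing declarations
 * are inferred from how the identifiers are used. The unit now
 * compiles to an object file, which is all code-size training needs;
 * the program does not have to run. */
#define READY 1
struct task { struct task *next; int state; };

int count_ready(struct task *t) {
    int n = 0;
    for (; t; t = t->next)
        if (t->state == READY) n++;
    return n;
}
```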
|
Koehler, Thomas |
CGO '21: "Towards a Domain-Extensible ..."
Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs
Thomas Koehler and Michel Steuwer (University of Glasgow, UK; University of Edinburgh, UK) Halide and many similar projects have demonstrated the great potential of domain-specific optimizing compilers. They enable programs to be expressed at a convenient high level, while generating high-performance code for parallel architectures. As domains of interest expand towards deep learning, probabilistic programming and beyond, it becomes increasingly clear that it is unsustainable to redesign domain-specific compilers for each new domain. In addition, the rapidly growing range of hardware architectures to optimize for poses great challenges for designing these compilers. In this paper, we show how to extend a unifying domain-extensible compiler with domain-specific as well as hardware-specific optimizations. The compiler operates on generic patterns that have proven flexible enough to express a wide range of computations. Optimizations are not hard-coded into the compiler but are expressed as user-defined rewrite rules that are composed into strategies controlling the optimization process. Crucially, both computational patterns and optimization strategies are extensible without modifying the core compiler implementation. We demonstrate that this domain-extensible compiler design is capable of expressing image processing pipelines and well-known image processing optimizations. Our results on four mobile ARM multi-core CPUs, often used for image processing tasks, show that the code generated for the Harris operator outperforms the image processing library OpenCV by up to 16× and achieves performance close to, or even up to 1.4× better than, the state-of-the-art image processing compiler Halide. @InProceedings{CGO21p27, author = {Thomas Koehler and Michel Steuwer}, title = {Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {27--38}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
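The key design point, rules composed by generic strategy combinators rather than hard-coded passes, can be sketched in a few lines of C. All names and the combinator set are illustrative; the paper's compiler expresses this far more richly in a functional setting.

```c
/* Sketch of strategy-based rewriting: a rule either rewrites an
 * expression or fails (returns NULL); combinators compose strategies
 * without the core compiler knowing anything about the rules. */
typedef struct Expr Expr;                  /* opaque expression tree */

typedef struct Strategy Strategy;
struct Strategy {
    enum { RULE, SEQ, LCHOICE, TRY } tag;
    Expr *(*rule)(Expr *);                 /* used when tag == RULE   */
    const Strategy *a, *b;                 /* children of combinators */
};

Expr *apply(const Strategy *s, Expr *e) {
    switch (s->tag) {
    case RULE:    return s->rule(e);                 /* NULL = failure */
    case SEQ:   { Expr *r = apply(s->a, e);          /* a, then b      */
                  return r ? apply(s->b, r) : NULL; }
    case LCHOICE: { Expr *r = apply(s->a, e);        /* a, else b      */
                  return r ? r : apply(s->b, e); }
    case TRY:   { Expr *r = apply(s->a, e);          /* never fails    */
                  return r ? r : e; }
    }
    return NULL;
}

/* Usage, given hypothetical rule functions fuse_rule and tile_rule:
 *   Strategy fuse = {RULE, fuse_rule, 0, 0};
 *   Strategy tile = {RULE, tile_rule, 0, 0};
 *   Strategy tryf = {TRY, 0, &fuse, 0};
 *   Strategy opt  = {SEQ, 0, &tile, &tryf};
 *   Expr *optimized = apply(&opt, program);
 */
```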
|
Korch, Matthias |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time consuming, expensive and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
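For readers unfamiliar with the model, the analytic core that makes the run-free parameter search possible is the Execution-Cache-Memory (ECM) machine model. One commonly published single-core formulation for Intel CPUs is sketched below as background; formulations vary by microarchitecture, and this is not necessarily the paper's exact variant.

```latex
% ECM prediction for one unit of work (e.g., one cache line of stencil
% updates), non-overlapping formulation:
%   T_OL   : in-core cycles that overlap with data transfers
%   T_nOL  : in-core cycles that do not overlap (e.g., load/store)
%   T_L1L2, T_L2L3, T_L3Mem : inter-level cache/memory transfer cycles
\[
  T_\mathrm{ECM} \;=\; \max\bigl(T_\mathrm{OL},\;
      T_\mathrm{nOL} + T_\mathrm{L1L2} + T_\mathrm{L2L3} + T_\mathrm{L3Mem}\bigr)
\]
```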
|
Krolik, Alexander |
CGO '21: "r3d3: Optimized Query Compilation ..."
r3d3: Optimized Query Compilation on GPUs
Alexander Krolik, Clark Verbrugge, and Laurie Hendren (McGill University, Canada) Query compilation is an effective approach to improve the performance of repeated database queries. GPU-based approaches have significant promise, but face difficulties in managing compilation time, data transfer costs, and in addressing a reasonably comprehensive range of SQL operations. In this work we describe a hybrid AoT/JIT approach to GPU-based query compilation. We use multiple optimizations to reduce execution, compile, and data transfer times, improving performance over both other GPU-based approaches and CPU-based query compilers. Our design addresses a wide range of SQL queries, sufficient to demonstrate the practicality of using GPUs for query optimization. @InProceedings{CGO21p277, author = {Alexander Krolik and Clark Verbrugge and Laurie Hendren}, title = {r3d3: Optimized Query Compilation on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {277--288}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Functional Results Reproduced |
|
Kuster, Andreas |
CGO '21: "StencilFlow: Mapping Large ..."
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler (ETH Zurich, Switzerland; MeteoSwiss, Switzerland) Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into the architecture characteristics required for their efficient execution in practice. @InProceedings{CGO21p315, author = {Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler}, title = {StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {315--326}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Lattner, Chris |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation and compilation for heterogeneous hardware, significantly reduces the cost of building domain-specific compilers, and connects existing compilers together. MLIR facilitates the design and implementation of code generators, translators, and optimizers at different levels of abstraction and across application domains, hardware targets, and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by its novel design, semantics, optimization specification, system, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use cases to show the research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures, and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Lee, Yongwoo |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to launch new devices in a timely manner. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers are a poor fit for embedded devices, as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators, and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. Experimental results on six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by a state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurements show 65.3% energy savings over a Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Li, Guangli |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge in accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss controlled. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss, at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
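The residual idea generalizes beyond Tensor Cores and can be shown on a plain dot product. The C sketch below quantizes each operand to int8, then quantizes the leftover error of one operand and adds a low-precision correction term; it illustrates the general residual-refinement technique, not QUANTENSOR's exact scheme.

```c
#include <math.h>
#include <stdint.h>

/* Symmetric int8 quantization: x ~= s * q. Returns the scale s. */
static float quantize(const float *x, int n, int8_t *q) {
    float m = 0.f;
    for (int i = 0; i < n; ++i) m = fmaxf(m, fabsf(x[i]));
    float s = (m > 0.f) ? m / 127.f : 1.f;   /* avoid divide-by-zero */
    for (int i = 0; i < n; ++i) q[i] = (int8_t)lrintf(x[i] / s);
    return s;
}

/* Quantized dot product with one residual refinement step:
 *   a.b ~= sa*sb*(qa.qb) + sr*sb*(ra.qb)
 * where ra quantizes the residual a - sa*qa. The correction term
 * cancels most of the error from quantizing a, at the cost of one
 * extra low-precision pass. Caller provides scratch buffers. */
float qdot_refined(const float *a, const float *b, int n,
                   int8_t *qa, int8_t *qb, int8_t *ra, float *resid) {
    float sa = quantize(a, n, qa);
    float sb = quantize(b, n, qb);
    for (int i = 0; i < n; ++i)
        resid[i] = a[i] - sa * (float)qa[i];          /* residual of a */
    float sr = quantize(resid, n, ra);
    int64_t main_acc = 0, res_acc = 0;     /* int8 x int8 -> int32 MACs */
    for (int i = 0; i < n; ++i) {
        main_acc += (int32_t)qa[i] * (int32_t)qb[i];
        res_acc  += (int32_t)ra[i] * (int32_t)qb[i];
    }
    return sa * sb * (float)main_acc + sr * sb * (float)res_acc;
}
```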
|
Li, Guangwei |
CGO '21: "GoBench: A Benchmark Suite ..."
GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs
Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, this rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open-source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 of these 82 bugs and from 36 additional bugs reported in a recent study, preserving their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. @InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Li, Haofeng |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memoizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disk efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps which take from 10 to 128 GB of RAM with FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM, with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM with FlowDroid, DiskDroid can analyze each app in 3 hours under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Li, Hongzheng |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications, as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Li, Jiansong |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge in accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss controlled. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss, at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
|
Li, Lian |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memoizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disk efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps which take from 10 to 128 GB of RAM with FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM, with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM with FlowDroid, DiskDroid can analyze each app in 3 hours under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional CGO '21: "GoBench: A Benchmark Suite ..." GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, this rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open-source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 of these 82 bugs and from 36 additional bugs reported in a recent study, preserving their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. 
@InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Lin, Jin |
CGO '21: "An Experience with Code-Size ..."
An Experience with Code-Size Optimization for Production iOS Mobile Applications
Milind Chabbi, Jin Lin, and Raj Barik (Uber Technologies, USA) Modern mobile application binaries are bulky for many reasons: software and its dependencies, fast-paced addition of new features, high-level language constructs, and statically linked platform libraries. Reduced application size is critical not only for the end-user experience but also for staying within vendors' download size limits. Moreover, download size restrictions may impact revenues for critical businesses. In this paper, we highlight some of the key reasons for code-size bloat in iOS mobile applications, specifically apps written using a mix of Swift and Objective-C. Our observations reveal that machine code sequences systematically repeat throughout the app's binary. We highlight source-code patterns and high-level language constructs that lead to an increase in code size. We propose whole-program, fine-grained machine-code outlining as an effective optimization to constrain code-size growth. We evaluate the effectiveness of our new optimization pipeline on the UberRider iOS app used by millions of customers daily. Our optimizations reduce the code size by 23%. The impact of our optimizations on code size grows in magnitude over time as the code evolves. For a set of performance spans defined by the app developers, the optimizations do not statistically regress production performance. We applied the same optimizations to Uber's UberDriver and UberEats apps and gained 17% and 19% size savings, respectively. @InProceedings{CGO21p363, author = {Milind Chabbi and Jin Lin and Raj Barik}, title = {An Experience with Code-Size Optimization for Production iOS Mobile Applications}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {363--377}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable |
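The core transformation can be previewed at the source level, though the paper applies it to machine code across the whole program. In the illustrative C below, a repeated statement sequence is replaced by a call to a single outlined copy; each former site shrinks to a call at the price of call/return overhead, which is why outlining trades a little speed for size.

```c
/* Before outlining: the same sequence is duplicated at two sites.
 * (Source-level analogue of repeats found in final machine code.) */
struct ctx { int a, b, c; };

void site1(struct ctx *x) { x->a = 0; x->b = x->c * 2; x->c += 7; }
void site2(struct ctx *y) { y->a = 0; y->b = y->c * 2; y->c += 7; }

/* After outlining: one shared copy, reached via a call. noinline
 * (a GCC/Clang extension) keeps the compiler from undoing the
 * size saving by inlining the helper back into its callers. */
__attribute__((noinline)) static void outlined(struct ctx *p) {
    p->a = 0;
    p->b = p->c * 2;
    p->c += 7;
}
void site1_outlined(struct ctx *x) { outlined(x); }
void site2_outlined(struct ctx *y) { outlined(y); }
```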
|
Liu, Bin |
CGO '21: "HHVM Jump-Start: Boosting ..."
HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale
Guilherme Ottoni and Bin Liu (Facebook, USA) Just-In-Time (JIT) compilation is often employed in Virtual Machines (VMs) to translate their virtual-machine languages into real-machine code. This approach not only brings portability, but also enables aggressive compiler optimizations based on runtime behavior observed via profiling. The downside of JIT compilation, compared to Ahead-Of-Time native compilation, is that the profiling and compilation overheads are incurred during execution. To mitigate these overheads, previous work has proposed sharing either profile data or final JIT-compiled code across VM executions. Unfortunately, these techniques have drawbacks, including steady-state performance degradation and difficulty of use. To address these issues, this paper presents the Jump-Start mechanism implemented inside the HipHop Virtual Machine (HHVM). Jump-Start is a practical approach to sharing VM profile data at a large scale, and it is used to power one of the largest websites in the world. In this paper, we argue for HHVM's Jump-Start approach, describe it in detail, and present steady-state optimizations built on top of it. Running the Facebook website, we demonstrate that Jump-Start effectively solves the warmup problem in HHVM, reducing the server capacity loss during warmup by 54.9% while also improving steady-state performance by 5.4%. @InProceedings{CGO21p340, author = {Guilherme Ottoni and Bin Liu}, title = {HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {340--350}, doi = {}, year = {2021}, } |
|
Liu, Chen |
CGO '21: "GoBench: A Benchmark Suite ..."
GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs
Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, this rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open-source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 of these 82 bugs and from 36 additional bugs reported in a recent study, preserving their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. @InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Liu, Ji |
CGO '21: "Relaxed Peephole Optimization: ..."
Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits
Ji Liu, Luciano Bello, and Huiyang Zhou (North Carolina State University, USA; IBM Research, USA) As in classical computing, compilers play an important role in quantum computing. Quantum processors typically support a limited set of primitive operations, or quantum gates, and have certain hardware-related limitations. A quantum compiler is responsible for adapting a quantum program to these constrained environments and decomposing quantum gates into sequences of the primitive ones. During the compilation process, it is also critical for the compiler to optimize the quantum circuits in order to reduce the noise in the computation results. Since the noise is introduced by operations and decoherence, reducing the gate count is key to improving performance. In this paper, we propose a novel quantum compiler optimization, named relaxed peephole optimization (RPO), for quantum computers. RPO leverages the single-qubit state information that can be determined statically by the compiler. We say that a qubit is in a basis state when, at a given point in time, its state is in the X-, Y-, or Z-basis (|+>/|−>, |L>/|R>, and |0>/|1>). When basis qubits are used as inputs to quantum gates, there exist opportunities for strength reduction, which replaces quantum operations with equivalent but less expensive ones. Compared to the existing peephole optimization for quantum programs, the difference is that our proposed optimization does not require an identical unitary matrix, hence the name 'relaxed' peephole optimization. We also extend our approach to optimize quantum gates when some input qubits are in known pure states. Both optimizations, namely Quantum Basis-state Optimization (QBO) and Quantum Pure-state Optimization (QPO), are implemented in IBM's Qiskit transpiler. Our experimental results show that our proposed optimization pass is fast and effective. The circuits optimized with our compiler optimizations obtain up to 18.0% (11.7% on average) fewer CNOT gates and up to 8.2% (7.1% on average) lower transpilation time than the most aggressive optimization level in the Qiskit compiler. When running on real quantum computers, the success rates of the 3-qubit quantum phase estimation algorithm improve by 2.30X due to the reduced gate counts. @InProceedings{CGO21p301, author = {Ji Liu and Luciano Bello and Huiyang Zhou}, title = {Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {301--314}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
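A one-line example of the basis-state strength reduction involved: when the compiler can prove a CNOT's control qubit is in a Z-basis state, the gate collapses to a cheaper equivalent on that input subspace, even though the full unitaries are not equal in general, which is exactly the "relaxed" part.

```latex
% CNOT on a known Z-basis control: identity for |0>, a single-qubit X
% for |1>, either of which is cheaper than the two-qubit gate.
\[
  \mathrm{CNOT}\,\bigl(\lvert 0\rangle \otimes \lvert\psi\rangle\bigr)
    = \lvert 0\rangle \otimes \lvert\psi\rangle,
  \qquad
  \mathrm{CNOT}\,\bigl(\lvert 1\rangle \otimes \lvert\psi\rangle\bigr)
    = \lvert 1\rangle \otimes \bigl(X\lvert\psi\rangle\bigr)
\]
```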
|
Liu, Lei |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge in accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss controlled. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss, at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
|
Lozano, Roberto Castañeda |
CGO '21: "Loop Parallelization using ..."
Loop Parallelization using Dynamic Commutativity Analysis
Christos Vasiladiotis, Roberto Castañeda Lozano, Murray Cole, and Björn Franke (University of Edinburgh, UK) Automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems outside the numerical domain. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. Using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints limiting existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our novel parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks which contain almost 1400 loops, DCA is able to identify as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. We then apply DCA to complex pointer-based loops, where it can successfully detect parallelism, while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6× (and up to 55×) across the NAS benchmarks on a 72-core host, and up to 36.9× for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops. @InProceedings{CGO21p150, author = {Christos Vasiladiotis and Roberto Castañeda Lozano and Murray Cole and Björn Franke}, title = {Loop Parallelization using Dynamic Commutativity Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {150--161}, doi = {}, year = {2021}, } |
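The commutativity criterion itself fits in a few lines. The C harness below (an illustrative sketch under our own simplifying assumptions, not the paper's implementation) declares two iterations of a loop body to commute on a given input when executing them in either order yields bit-identical state; a loop whose sampled iteration pairs all commute is a candidate for parallelization.

```c
#include <stdbool.h>
#include <string.h>

/* Dynamic commutativity test, in the spirit of DCA: unlike a
 * dependence test, it asks only whether the observable outcome is
 * order-independent, so loops with benign conflicts still qualify.
 * Assumes the state the loop body touches fits in one comparable
 * struct (an illustration-only simplification). */
typedef struct { char bytes[256]; } State;
typedef void (*LoopBody)(State *, int iter);

bool iterations_commute(LoopBody body, const State *in, int i, int j) {
    State s1 = *in, s2 = *in;        /* two copies of the input state */
    body(&s1, i); body(&s1, j);      /* execute in order i; j */
    body(&s2, j); body(&s2, i);      /* execute in order j; i */
    return memcmp(&s1, &s2, sizeof(State)) == 0;
}
```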
|
Lueh, Guei-Yuan |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
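The difference in programming style can be sketched with a toy fixed-width vector type (a hypothetical stand-in for CM's vector types; the real CM API differs in detail): the programmer picks the SIMD width and moves data across lanes explicitly, rather than writing scalar per-thread code.

    // A toy analogue of an explicit SIMD vector type of width N.
    template <typename T, int N>
    struct Vec {
        T lane[N];
        Vec operator+(const Vec& o) const {      // one whole-register add
            Vec r;
            for (int i = 0; i < N; ++i) r.lane[i] = lane[i] + o.lane[i];
            return r;
        }
    };

    // Explicit cross-lane data sharing: rotate lanes left by one position,
    // something implicit SIMT code cannot express as directly.
    template <typename T, int N>
    Vec<T, N> rotateLeft(const Vec<T, N>& v) {
        Vec<T, N> r;
        for (int i = 0; i < N; ++i) r.lane[i] = v.lane[(i + 1) % N];
        return r;
    }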
|
Lu, Jie |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memoizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps that take from 10 to 128 GB of RAM under FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM under FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional CGO '21: "GoBench: A Benchmark Suite ..." GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, the rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 out of these 82 bugs and 36 additional bugs reported in a recent study to preserve their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. 
@InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
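The spill policy described for DiskDroid can be pictured with a small sketch (the file naming, LRU policy, and threshold are assumptions for illustration): once in-memory analysis data exceeds a budget, the least recently used partition is written to disk and later reloaded, or recomputed if that is cheaper.

    #include <cstddef>
    #include <cstdio>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // A toy cache of analysis partitions with an LRU spill-to-disk policy.
    // Assumes each key is inserted once, for brevity.
    class SpillCache {
        std::size_t budgetBytes_, usedBytes_ = 0;
        std::list<std::string> lru_;                             // most recent first
        std::unordered_map<std::string, std::vector<char>> mem_; // resident partitions
    public:
        explicit SpillCache(std::size_t budgetBytes) : budgetBytes_(budgetBytes) {}

        void put(const std::string& key, std::vector<char> data) {
            usedBytes_ += data.size();
            mem_[key] = std::move(data);
            lru_.push_front(key);
            while (usedBytes_ > budgetBytes_ && !lru_.empty()) spillOldest();
        }
    private:
        void spillOldest() {
            std::string victim = lru_.back();
            lru_.pop_back();
            auto it = mem_.find(victim);
            if (it == mem_.end()) return;
            // Persist the partition so the solver can reload it on demand.
            std::FILE* f = std::fopen((victim + ".spill").c_str(), "wb");
            if (f) {
                std::fwrite(it->second.data(), 1, it->second.size(), f);
                std::fclose(f);
            }
            usedBytes_ -= it->second.size();
            mem_.erase(it);
        }
    };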
|
Magalhães, José Wesley de Souza |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown code. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult, due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It even compresses code that is impervious to the state-of-the-art Function Sequence Alignment technique published in 2019, since, unlike that technique, it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info |
|
Ma, Xiu |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the incurred precision loss at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
|
Mehta, Sanyam |
CGO '21: "Variable-Sized Blocks for ..."
Variable-Sized Blocks for Locality-Aware SpMV
Naveen Namashivayam, Sanyam Mehta, and Pen-Chung Yew (HPE, USA; University of Minnesota at Twin Cities, USA) Blocking is an important optimization option available to mitigate the data movement overhead and improve the temporal locality in SpMV, a sparse BLAS kernel with an irregular memory reference pattern. In this work, we propose an analytical model to determine the effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized, as opposed to constant-sized in most existing SpMV algorithms. We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on an Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows a speedup of up to 2.62X (with an average of 1.73X) and 2.02X (with an average of 1.18X) over the highly tuned vendor SpMV implementation in Intel’s Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively. @InProceedings{CGO21p211, author = {Naveen Namashivayam and Sanyam Mehta and Pen-Chung Yew}, title = {Variable-Sized Blocks for Locality-Aware SpMV}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {211--221}, doi = {}, year = {2021}, } |
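A simplified version of the blocking heuristic (a sketch under assumed parameters, not the paper's analytical model) grows each column block until the non-zeros it would cover exceed a cache-derived budget, so dense regions get narrow blocks and sparse regions get wide ones:

    #include <cstddef>
    #include <vector>

    // Cut columns into variable-sized blocks holding at most nnzPerBlock
    // non-zeros each; block i spans columns [cuts[i], cuts[i+1]).
    std::vector<std::size_t> cutBlocks(const std::vector<std::size_t>& nnzPerCol,
                                       std::size_t nnzPerBlock) {
        std::vector<std::size_t> cuts{0};
        std::size_t acc = 0;
        for (std::size_t c = 0; c < nnzPerCol.size(); ++c) {
            if (acc + nnzPerCol[c] > nnzPerBlock && acc > 0) {
                cuts.push_back(c);  // close the current block before this column
                acc = 0;
            }
            acc += nnzPerCol[c];
        }
        cuts.push_back(nnzPerCol.size());
        return cuts;
    }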
|
Mellor-Crummey, John |
CGO '21: "GPA: A GPU Performance Advisor ..."
GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey (Rice University, USA) Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.58×, with a geometric mean of 1.22×. @InProceedings{CGO21p115, author = {Keren Zhou and Xiaozhu Meng and Ryuichi Sai and John Mellor-Crummey}, title = {GPA: A GPU Performance Advisor Based on Instruction Sampling}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {115--125}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
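The attribution step can be caricatured in a few lines (hypothetical data structures; GPA's real analysis works on GPU binaries and stall reasons from PC sampling): a stall sampled at an instruction is charged to the instructions producing its operands, following def-use edges.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Inst { std::vector<int> defs; };  // ids of instructions defining our operands

    // samples[i] = stall count sampled at instruction i; charge each stall
    // to the producers of i's operands, approximating the root cause.
    std::unordered_map<int, std::uint64_t>
    attributeStalls(const std::vector<Inst>& prog,
                    const std::vector<std::uint64_t>& samples) {
        std::unordered_map<int, std::uint64_t> blame;
        for (int i = 0; i < static_cast<int>(prog.size()); ++i) {
            if (samples[i] == 0 || prog[i].defs.empty()) continue;
            std::uint64_t share = samples[i] / prog[i].defs.size();  // split evenly
            for (int d : prog[i].defs) blame[d] += share;
        }
        return blame;
    }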
|
Meng, Haining |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memoizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps that take from 10 to 128 GB of RAM under FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM under FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Meng, Xiaozhu |
CGO '21: "GPA: A GPU Performance Advisor ..."
GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey (Rice University, USA) Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.58×, with a geometric mean of 1.22×. @InProceedings{CGO21p115, author = {Keren Zhou and Xiaozhu Meng and Ryuichi Sai and John Mellor-Crummey}, title = {GPA: A GPU Performance Advisor Based on Instruction Sampling}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {115--125}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Namashivayam, Naveen |
CGO '21: "Variable-Sized Blocks for ..."
Variable-Sized Blocks for Locality-Aware SpMV
Naveen Namashivayam, Sanyam Mehta, and Pen-Chung Yew (HPE, USA; University of Minnesota at Twin Cities, USA) Blocking is an important optimization option available to mitigate the data movement overhead and improve the temporal locality in SpMV, a sparse BLAS kernel with an irregular memory reference pattern. In this work, we propose an analytical model to determine the effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized, as opposed to constant-sized in most existing SpMV algorithms. We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on an Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows a speedup of up to 2.62X (with an average of 1.73X) and 2.02X (with an average of 1.18X) over the highly tuned vendor SpMV implementation in Intel’s Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively. @InProceedings{CGO21p211, author = {Naveen Namashivayam and Sanyam Mehta and Pen-Chung Yew}, title = {Variable-Sized Blocks for Locality-Aware SpMV}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {211--221}, doi = {}, year = {2021}, } |
|
Nowatzki, Tony |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when it comes to many tensorized instructions. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction, which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
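In scalar form, the idiom these instructions implement is a multi-element low-precision reduction into one high-precision accumulator; a VNNI/DOT-style four-way int8 dot product looks like this (illustrative semantics, not UNIT's internal representation):

    #include <cstdint>

    // One tensorized lane: reduce four int8 products into an int32
    // accumulator, the idiom behind Intel VNNI, Tensor Cores, and ARM DOT.
    std::int32_t dot4Accumulate(std::int32_t acc,
                                const std::int8_t a[4], const std::int8_t b[4]) {
        for (int k = 0; k < 4; ++k)
            acc += std::int32_t(a[k]) * std::int32_t(b[k]);  // widen, multiply, reduce
        return acc;
    }

UNIT's task is to detect this pattern inside a tensor operation's loop nest and reorganize the surrounding loops so a single hardware instruction can replace the inner reduction.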
|
Nowicki, Tyler |
CGO '21: "Vulkan Vision: Ray Tracing ..."
Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation
David Pankratz, Tyler Nowicki, Ahmed Eltantawy, and José Nelson Amaral (University of Alberta, Canada; Huawei Technologies, Canada) While there are mature performance monitoring, profiling and instrumentation tools to help understand the dynamic behaviour of general-purpose GPU applications, the abstract programming models of graphics applications have limited the development of such tools for graphics. This paper introduces Vulkan Vision (V-Vision), a framework for collecting detailed GPU execution data from Vulkan applications to guide hardware-informed improvements. A core contribution of V-Vision is providing out-of-the-box data collection for capturing complete dynamic warp and thread execution traces. V-Vision also provides analyses for the following purposes: identifying and visualizing application hotspots to guide optimization, characterizing application behaviour and estimating the effect of architectural modifications. This paper demonstrates the potential for these analyses in applications that utilize the recent ray tracing extension in Vulkan and describes new insights about the applications and the underlying hardware. @InProceedings{CGO21p137, author = {David Pankratz and Tyler Nowicki and Ahmed Eltantawy and José Nelson Amaral}, title = {Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {137--149}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Ottoni, Guilherme |
CGO '21: "HHVM Jump-Start: Boosting ..."
HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale
Guilherme Ottoni and Bin Liu (Facebook, USA) Just-In-Time (JIT) compilation is often employed in Virtual Machines (VMs) to translate their virtual-machine languages into real-machine code. This approach not only brings portability, but it also enables aggressive compiler optimizations based on runtime behavior observed via profiling. The downside of JIT compilation, compared to Ahead-Of-Time native compilation, is that the profiling and compilation overheads are incurred during execution. To mitigate these overheads, previous work has proposed sharing either profile data or final JIT compiled code across VM executions. Unfortunately, these techniques have drawbacks, including steady-state performance degradation and difficulty of use. To address these issues, this paper presents the Jump-Start mechanism implemented inside the HipHop Virtual Machine (HHVM). Jump-Start is a practical approach to sharing VM profile data at large scale; it is used to power one of the largest websites in the world. In this paper, we argue for HHVM’s Jump-Start approach, describe it in detail, and present steady-state optimizations built on top of it. Running the Facebook website, we demonstrate that Jump-Start effectively solves the warmup problem in HHVM, reducing the server capacity loss during warmup by 54.9%, while also improving steady-state performance by 5.4%. @InProceedings{CGO21p340, author = {Guilherme Ottoni and Bin Liu}, title = {HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {340--350}, doi = {}, year = {2021}, } |
|
Pai, Sreepathi |
CGO '21: "Efficient Execution of Graph ..."
Efficient Execution of Graph Algorithms on CPU with SIMD Extensions
Ruohuang Zheng and Sreepathi Pai (University of Rochester, USA) Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare to the GPU. Evaluation results show that on an 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework, GraphIt, by 1.53x, averaged across 5 (common) benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU's virtual memory performs better when graphs are much bigger than available physical memory. @InProceedings{CGO21p262, author = {Ruohuang Zheng and Sreepathi Pai}, title = {Efficient Execution of Graph Algorithms on CPU with SIMD Extensions}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {262--276}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Pankratz, David |
CGO '21: "Vulkan Vision: Ray Tracing ..."
Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation
David Pankratz, Tyler Nowicki, Ahmed Eltantawy, and José Nelson Amaral (University of Alberta, Canada; Huawei Technologies, Canada) While there are mature performance monitoring, profiling and instrumentation tools to help understand the dynamic behaviour of general-purpose GPU applications, the abstract programming models of graphics applications have limited the development of such tools for graphics. This paper introduces Vulkan Vision (V-Vision), a framework for collecting detailed GPU execution data from Vulkan applications to guide hardware-informed improvements. A core contribution of V-Vision is providing out-of-the-box data collection for capturing complete dynamic warp and thread execution traces. V-Vision also provides analyses for the following purposes: identifying and visualizing application hotspots to guide optimization, characterizing application behaviour and estimating the effect of architectural modifications. This paper demonstrates the potential for these analyses in applications that utilize the recent ray tracing extension in Vulkan and describes new insights about the applications and the underlying hardware. @InProceedings{CGO21p137, author = {David Pankratz and Tyler Nowicki and Ahmed Eltantawy and José Nelson Amaral}, title = {Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {137--149}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Patil, Harish |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, there is no need to fast-forward to the region of interest, nor any uncertainty about reaching it. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Pereira, Fernando Magno Quintão |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown code. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult, due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It even compresses code that is impervious to the state-of-the-art Function Sequence Alignment technique published in 2019, since, unlike that technique, it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info CGO '21: "Memory-Safe Elimination of ..." Memory-Safe Elimination of Side Channels Luigi Soares and Fernando Magno Quintão Pereira (Federal University of Minas Gerais, Brazil) A program is said to be isochronous if its running time does not depend on classified information. The programming languages literature contains much work that transforms programs to ensure isochronicity. The current state-of-the-art approach is a code transformation technique due to Wu et al., published in 2018. That technique has an important virtue: it ensures that the transformed program runs exactly the same set of operations, regardless of inputs. However, in this paper we demonstrate that it also has a shortcoming: it might add out-of-bounds memory accesses into programs that were originally memory sound. From this observation, we show how to deliver the same runtime guarantees that Wu et al. provide, in a memory-safe way. In addition to being safer, our LLVM-based implementation is more efficient than its original inspiration, achieving shorter repairing times, and producing code that is smaller and faster. @InProceedings{CGO21p200, author = {Luigi Soares and Fernando Magno Quintão Pereira}, title = {Memory-Safe Elimination of Side Channels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {200--210}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Pétrot, Frédéric |
CGO '21: "Seamless Compiler Integration ..."
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot (Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France) Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP. @InProceedings{CGO21p65, author = {Tiago Trevisan Jost and Yves Durand and Christian Fabre and Albert Cohen and Frédéric Pétrot}, title = {Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {65--76}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
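For contrast with the proposed typed approach: exceeding IEEE formats in C today means driving a library such as MPFR by hand, roughly as below (standard MPFR calls; under the paper's extension, ordinary arithmetic on a variable-precision type would compile to such calls, or to coprocessor instructions, automatically).

    #include <mpfr.h>

    // Compute x*x + y at 200 bits of precision; this is the boilerplate the
    // extended C type system is designed to hide.
    double fma200(double xd, double yd) {
        mpfr_t x, y, z;
        mpfr_init2(x, 200); mpfr_init2(y, 200); mpfr_init2(z, 200);
        mpfr_set_d(x, xd, MPFR_RNDN);
        mpfr_set_d(y, yd, MPFR_RNDN);
        mpfr_mul(z, x, x, MPFR_RNDN);   // z = x*x
        mpfr_add(z, z, y, MPFR_RNDN);   // z = z + y
        double r = mpfr_get_d(z, MPFR_RNDN);
        mpfr_clears(x, y, z, (mpfr_ptr) 0);
        return r;
    }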
|
Pienaar, Jacques |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Püschel, Markus |
CGO '21: "An Interval Compiler for Sound ..."
An Interval Compiler for Sound Floating-Point Computations
Joao Rivera, Franz Franchetti, and Markus Püschel (ETH Zurich, Switzerland; Carnegie Mellon University, USA) Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, even though such guarantees can be imperative in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsics in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8× when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results. @InProceedings{CGO21p52, author = {Joao Rivera and Franz Franchetti and Markus Püschel}, title = {An Interval Compiler for Sound Floating-Point Computations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {52--64}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
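The core rewrite can be sketched for a single operation (a minimal scalar sketch; IGen additionally handles SIMD intrinsics and double-double): every double becomes an interval, and each bound is computed with directed rounding so the exact real result is always enclosed.

    #include <cfenv>

    struct Interval { double lo, hi; };

    // Sound addition: round the lower bound down and the upper bound up.
    // A real implementation must also stop the compiler from reordering
    // FP operations across the rounding-mode changes (FENV_ACCESS).
    Interval add(Interval a, Interval b) {
        Interval r;
        std::fesetround(FE_DOWNWARD);
        r.lo = a.lo + b.lo;
        std::fesetround(FE_UPWARD);
        r.hi = a.hi + b.hi;
        std::fesetround(FE_TONEAREST);  // restore the default mode
        return r;
    }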
|
Rauber, Thomas |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time-consuming, expensive, and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further, we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Rhee, Daniel |
CGO '21: "C-for-Metal: High Performance ..."
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, Joel Fuentes, Wei-Yu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, and Daniel Rhee (Intel Corporation, USA) The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU. @InProceedings{CGO21p289, author = {Guei-Yuan Lueh and Kaiyu Chen and Gang Chen and Joel Fuentes and Wei-Yu Chen and Fangwen Fu and Hong Jiang and Hongzheng Li and Daniel Rhee}, title = {C-for-Metal: High Performance SIMD Programming on Intel GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {289--300}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Riddle, River |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Rivera, Joao |
CGO '21: "An Interval Compiler for Sound ..."
An Interval Compiler for Sound Floating-Point Computations
Joao Rivera, Franz Franchetti, and Markus Püschel (ETH Zurich, Switzerland; Carnegie Mellon University, USA) Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, even though such guarantees can be imperative in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsics in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8× when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results. @InProceedings{CGO21p52, author = {Joao Rivera and Franz Franchetti and Markus Püschel}, title = {An Interval Compiler for Sound Floating-Point Computations}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {52--64}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Rocha, Jerônimo Nunes |
CGO '21: "AnghaBench: A Suite with One ..."
AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira (State University of Maringá, Brazil; Federal University of Minas Gerais, Brazil) A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model which determines its actions in the face of unknown code. One of the challenges of predictive compilation is how to find good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult, due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce AnghaBench, a virtually unlimited collection of real-world compilable C programs. Although AnghaBench programs are not executable, they can be transformed into object files by any C-compliant compiler. Therefore, they can be used to train compilers for code size reduction. We have used thousands of AnghaBench programs to train YaCoS, a predictive compiler based on LLVM. The version of YaCoS autotuned with AnghaBench generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It even compresses code that is impervious to the state-of-the-art Function Sequence Alignment technique published in 2019, since, unlike that technique, it does not require large binaries to work well. @InProceedings{CGO21p378, author = {Anderson Faustino da Silva and Bruno Conde Kind and José Wesley de Souza Magalhães and Jerônimo Nunes Rocha and Breno Campos Ferreira Guimarães and Fernando Magno Quintão Pereira}, title = {AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {378--390}, doi = {}, year = {2021}, } Info |
|
Sabu, Alen |
CGO '21: "ELFies: Executable Region ..."
ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi, and Trevor E. Carlson (Intel Corporation, USA; Intel Corporation, Belgium; National University of Singapore, Singapore) We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution, and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, there is no need to fast-forward to the region of interest, nor any uncertainty about reaching it. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, re-compilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single and multi-threaded benchmarks. @InProceedings{CGO21p126, author = {Harish Patil and Alexander Isaev and Wim Heirman and Alen Sabu and Ali Hajiabadi and Trevor E. Carlson}, title = {ELFies: Executable Region Checkpoints for Performance Analysis and Simulation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {126--136}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
|
Sai, Ryuichi |
CGO '21: "GPA: A GPU Performance Advisor ..."
GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey (Rice University, USA) Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.58×, with a geometric mean of 1.22×. @InProceedings{CGO21p115, author = {Keren Zhou and Xiaozhu Meng and Ryuichi Sai and John Mellor-Crummey}, title = {GPA: A GPU Performance Advisor Based on Instruction Sampling}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {115--125}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Seiferth, Johannes |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time-consuming, expensive, and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further, we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Shpeisman, Tatiana |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contribution of this work includes (1) discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering. (2) evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers---describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable |
|
Shun, Julian |
CGO '21: "Compiling Graph Applications ..."
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations to apply as well as which hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
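EdgeBlocking, as described, amounts to a preprocessing pass over the edge list (a sketch with assumed parameters, not G2's implementation): bucket edges by destination range so that each bucket's random accesses to per-vertex data stay within an L2-sized window.

    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;  // (source, destination)

    // Reorder edges into blocks by destination range; verticesPerBlock is
    // chosen so one block's vertex data fits in the L2 cache.
    std::vector<std::vector<Edge>> edgeBlock(const std::vector<Edge>& edges,
                                             int numVertices, int verticesPerBlock) {
        int numBlocks = (numVertices + verticesPerBlock - 1) / verticesPerBlock;
        std::vector<std::vector<Edge>> blocks(numBlocks);
        for (const Edge& e : edges)
            blocks[e.second / verticesPerBlock].push_back(e);
        return blocks;
    }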
|
Soares, Luigi |
CGO '21: "Memory-Safe Elimination of ..."
Memory-Safe Elimination of Side Channels
Luigi Soares and Fernando Magno Quintão Pereira (Federal University of Minas Gerais, Brazil) A program is said to be isochronous if its running time does not depend on classified information. The programming languages literature contains much work that transforms programs to ensure isochronicity. The current state-of-the-art approach is a code transformation technique due to Wu et al., published in 2018. That technique has an important virtue: it ensures that the transformed program runs exactly the same set of operations, regardless of inputs. However, in this paper we demonstrate that it also has a shortcoming: it might add out-of-bounds memory accesses into programs that were originally memory sound. From this observation, we show how to deliver the same runtime guarantees that Wu et al. provide, in a memory-safe way. In addition to being safer, our LLVM-based implementation is more efficient than its original inspiration, achieving shorter repair times, and producing code that is smaller and faster. @InProceedings{CGO21p200, author = {Luigi Soares and Fernando Magno Quintão Pereira}, title = {Memory-Safe Elimination of Side Channels}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {200--210}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
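For readers unfamiliar with isochronous code, a minimal example of the kind of transformation at stake (our sketch, not the paper's algorithm): a secret-dependent branch is replaced by straight-line arithmetic so that the executed instructions are identical for every secret value.

    #include <cstdint>

    // Leaky: timing and branch behaviour depend on `secret`.
    uint32_t select_leaky(bool secret, uint32_t a, uint32_t b) {
        if (secret) return a;
        return b;
    }

    // Isochronous: the same instructions run for either secret value.
    uint32_t select_ct(bool secret, uint32_t a, uint32_t b) {
        uint32_t mask = 0u - static_cast<uint32_t>(secret); // all ones or zero
        return (a & mask) | (b & ~mask);
    }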
|
Song, Seungbin |
CGO '21: "Fine-Grained Pipeline Parallelization ..."
Fine-Grained Pipeline Parallelization for Network Function Programs
Seungbin Song, Heelim Choi, and Hanjun Kim (Yonsei University, South Korea) Network programming languages enable programmers to implement new network functions on various hardware and software stacks in the domain of Software Defined Networking (SDN). Although the languages extend the flexibility of network devices, existing compilers do not fully optimize the network programs due to their coarse-grained parallelization methods. The compilers consider each packet processing table that consists of match and action functions as a unit of tasks and parallelize the programs without decomposing match and action functions. This work proposes a new fine-grained pipeline parallelization compiler for network programming languages, named PSDN. First, the PSDN compiler decouples match and action functions from packet processing tables and analyzes dependencies among the matches and actions. While respecting the dependencies, the compiler efficiently schedules each match and action function into a pipeline with clock cycle estimation and fuses functions to reduce synchronization overheads. This work implements the PSDN compiler that translates a P4 network program to a Xilinx PX program, which is synthesizable to NetFPGA-SUME hardware. The proposed compiler reduces packet processing latency by 12.1% and utilization by 3.5% compared to previous work. @InProceedings{CGO21p162, author = {Seungbin Song and Heelim Choi and Hanjun Kim}, title = {Fine-Grained Pipeline Parallelization for Network Function Programs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {162--173}, doi = {}, year = {2021}, } |
|
Song, William |
CGO '21: "Thread-Aware Area-Efficient ..."
Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices
Changsu Kim, Shinnung Jeong, Sungjun Cho, Yongwoo Lee, William Song, Youngsok Kim, and Hanjun Kim (POSTECH, South Korea; Yonsei University, South Korea) In the embedded device market, custom hardware platforms such as an application specific integrated circuit (ASIC) and a field programmable gate array (FPGA) are attractive thanks to their high performance and power efficiency. However, their huge design costs make it challenging for manufacturers to timely launch new devices. High-level synthesis (HLS) helps significantly reduce the design costs by automating the translation of service algorithms into hardware logic; however, current HLS compilers do not fit embedded devices well as they fail to produce area-efficient solutions while supporting concurrent events from diverse peripherals such as sensors, actuators and network modules. This paper proposes a new thread-aware HLS compiler named DURO that produces area-efficient embedded devices. DURO shares commonly-invoked functions and operators across different callers and threads with a new thread-aware area cost model, and thus effectively reduces the logic size. Moreover, DURO supports a variety of device peripherals by automatically integrating peripheral controllers and interfaces as peripheral drivers. The experimental results of six embedded devices with ten peripherals demonstrate that DURO reduces the area and energy dissipation of embedded devices by 28.5% and 25.3% compared with the designs generated by the state-of-the-art HLS compiler. This work also implements FPGA prototypes of the six devices using DURO, and the measurement results show 65.3% energy saving over Raspberry Pi Zero with slightly better computation performance. @InProceedings{CGO21p327, author = {Changsu Kim and Shinnung Jeong and Sungjun Cho and Yongwoo Lee and William Song and Youngsok Kim and Hanjun Kim}, title = {Thread-Aware Area-Efficient High-Level Synthesis Compiler for Embedded Devices}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {327--339}, doi = {}, year = {2021}, } |
|
Steuwer, Michel |
CGO '21: "Towards a Domain-Extensible ..."
Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs
Thomas Koehler and Michel Steuwer (University of Glasgow, UK; University of Edinburgh, UK) Halide and many similar projects have demonstrated the great potential of domain specific optimizing compilers. They enable programs to be expressed at a convenient high level, while generating high-performance code for parallel architectures. As domains of interest expand towards deep learning, probabilistic programming and beyond, it becomes increasingly clear that it is unsustainable to redesign domain specific compilers for each new domain. In addition, the rapidly growing range of hardware architectures to optimize for poses great challenges for designing these compilers. In this paper, we show how to extend a unifying domain-extensible compiler with domain-specific as well as hardware-specific optimizations. The compiler operates on generic patterns that have proven flexible enough to express a wide range of computations. Optimizations are not hard-coded into the compiler but are expressed as user-defined rewrite rules that are composed into strategies controlling the optimization process. Crucially, both computational patterns and optimization strategies are extensible without modifying the core compiler implementation. We demonstrate that this domain-extensible compiler design is capable of expressing image processing pipelines and well-known image processing optimizations. Our results on four mobile ARM multi-core CPUs, often used for image processing tasks, show that the code generated for the Harris operator outperforms the image processing library OpenCV by up to 16× and achieves performance close to, or even up to 1.4× better than, the state-of-the-art image processing compiler Halide. @InProceedings{CGO21p27, author = {Thomas Koehler and Michel Steuwer}, title = {Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {27--38}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
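The "rewrite rules composed into strategies" design can be pictured with two tiny combinators (a toy sketch of ours; the paper's actual strategy language is richer and operates on a real IR rather than strings):

    #include <functional>
    #include <optional>
    #include <string>

    using Expr = std::string;                       // stand-in for a real IR
    using Rule = std::function<std::optional<Expr>(const Expr&)>;

    // Try rule `a`; if it does not apply, try rule `b` instead.
    Rule orElse(Rule a, Rule b) {
        return [=](const Expr& e) {
            auto r = a(e);
            return r ? r : b(e);
        };
    }

    // Apply `a`, then apply `b` to its result; fail if either fails.
    Rule seq(Rule a, Rule b) {
        return [=](const Expr& e) -> std::optional<Expr> {
            auto r = a(e);
            return r ? b(*r) : std::nullopt;
        };
    }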
|
Sui, Yulei |
CGO '21: "Object Versioning for Flow-Sensitive ..."
Object Versioning for Flow-Sensitive Pointer Analysis
Mohamad Barbar, Yulei Sui, and Shiping Chen (University of Technology Sydney, Australia; CSIRO's Data61, Australia) Flow-sensitive points-to analysis provides better precision than its flow-insensitive counterpart. Traditionally performed on the control-flow graph, it incurs heavy analysis overhead. For performance, staged flow-sensitive analysis (SFS) is conducted on a pre-computed def-use (value-flow) graph where points-to sets of variables are propagated across def-use chains sparsely rather than across control-flow in the control-flow graph. SFS makes the propagation of different objects’ points-to sets sparse (multiple-object sparsity); however, it suffers from redundant propagation of the same object’s points-to sets between instructions (single-object sparsity). The points-to set of an object is often duplicated, resulting in redundant propagation and storage, especially in real-world heap-intensive programs. We notice that a simple graph prelabelling extension can identify much of this redundancy in a pre-analysis. With this pre-analysis, multiple nodes (instructions) in the value-flow graph can share an individual memory object’s points-to set rather than each node maintaining its own points-to set for that single object. We present object versioning for flow-sensitive points-to analysis, a finer single-object sparsity technique which maintains the same precision while allowing us to avoid much of the redundancy present in propagating and storing points-to sets. Our experiments conducted on 15 open-source programs, when compared with SFS, show that our approach runs up to 26.22× faster (5.31× on average), and reduces memory usage by up to 5.46× (2.11× on average). @InProceedings{CGO21p222, author = {Mohamad Barbar and Yulei Sui and Shiping Chen}, title = {Object Versioning for Flow-Sensitive Pointer Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {222--235}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
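A rough picture of the single-object sparsity mechanism (our sketch; the pre-analysis that assigns versions is the paper's contribution and is not reproduced here): value-flow nodes proven to observe the same version of an object alias one shared points-to set instead of each storing a copy.

    #include <map>
    #include <memory>
    #include <set>

    using PtsSet  = std::set<int>;   // ids of abstract objects pointed to
    using Version = int;             // assigned per node by a pre-analysis

    struct ObjectPts {
        // One set per version; every node labelled with that version shares
        // it, so no per-node duplication of storage or propagation is needed.
        std::map<Version, std::shared_ptr<PtsSet>> by_version;

        std::shared_ptr<PtsSet> at(Version v) {
            auto& s = by_version[v];
            if (!s) s = std::make_shared<PtsSet>();
            return s;
        }
    };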
|
Trevisan Jost, Tiago |
CGO '21: "Seamless Compiler Integration ..."
Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic
Tiago Trevisan Jost, Yves Durand, Christian Fabre, Albert Cohen, and Frédéric Pétrot (Université Grenoble Alpes, France; CEA LIST, France; Google, France; CNRS, France; Grenoble INP, France; TIMA, France) Floating-Point (FP) units in processors are generally limited to supporting a subset of formats defined by the IEEE 754 standard. As a result, high-efficiency languages and optimizing compilers for high-performance computing only support IEEE standard types, and applications needing higher precision involve cumbersome memory management and calls to external libraries, resulting in code bloat and making the intent of the program unclear. We present an extension of the C type system that can represent generic FP operations and formats, supporting both static precision and dynamically variable precision. We design and implement a compilation flow bridging the abstraction gap between this type system and low-level FP instructions or software libraries. The effectiveness of our solution is demonstrated through an LLVM-based implementation, leveraging aggressive optimizations in LLVM including the Polly loop nest optimizer, which targets two backend code generators: one for the ISA of a variable precision FP arithmetic coprocessor, and one for the MPFR multi-precision floating-point library. Our optimizing compilation flow targeting MPFR outperforms the Boost programming interface for the MPFR library by a factor of 1.80× and 1.67× in sequential execution of the PolyBench and RAJAPerf suites, respectively, and by a factor of 7.62× on an 8-core (and 16-thread) machine for RAJAPerf in OpenMP. @InProceedings{CGO21p65, author = {Tiago Trevisan Jost and Yves Durand and Christian Fabre and Albert Cohen and Frédéric Pétrot}, title = {Seamless Compiler Integration of Variable Precision Floating-Point Arithmetic}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {65--76}, doi = {}, year = {2021}, } Artifacts Functional Results Reproduced |
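To see what the "cumbersome calls to external libraries" look like, here is the status quo the paper improves on: a 200-bit accumulation written directly against the MPFR API (a sketch of the conventional approach; the paper's own C type-system extension is defined in the paper and not reproduced here).

    #include <mpfr.h>

    // Sum an array in 200-bit precision, returning a double.
    double sum_hp(const double* a, int n) {
        mpfr_t acc, t;
        mpfr_init2(acc, 200);                 // explicit precision management
        mpfr_init2(t, 200);
        mpfr_set_zero(acc, 1);
        for (int i = 0; i < n; i++) {
            mpfr_set_d(t, a[i], MPFR_RNDN);
            mpfr_add(acc, acc, t, MPFR_RNDN); // every operation is a call
        }
        double r = mpfr_get_d(acc, MPFR_RNDN);
        mpfr_clear(acc);                      // explicit deallocation
        mpfr_clear(t);
        return r;
    }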
|
Vasilache, Nicolas |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators and optimizers at different levels of abstraction and across application domains, hardware targets and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, while identifying the challenges and opportunities posed by this novel design, semantics, optimization specification, system, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable CGO '21: "Progressive Raising in Multi-level ..." Progressive Raising in Multi-level IR Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not support raising in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements.
@InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Vasiladiotis, Christos |
CGO '21: "Loop Parallelization using ..."
Loop Parallelization using Dynamic Commutativity Analysis
Christos Vasiladiotis, Roberto Castañeda Lozano, Murray Cole, and Björn Franke (University of Edinburgh, UK) Automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems outside the numerical domain. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. Using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints limiting existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our novel parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks which contain almost 1400 loops, DCA is able to identify as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. We then apply DCA to complex pointer-based loops, where it can successfully detect parallelism, while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6× (and up to 55×) across the NAS benchmarks on a 72-core host, and up to 36.9× for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops. @InProceedings{CGO21p150, author = {Christos Vasiladiotis and Roberto Castañeda Lozano and Murray Cole and Björn Franke}, title = {Loop Parallelization using Dynamic Commutativity Analysis}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {150--161}, doi = {}, year = {2021}, } |
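The commutativity criterion at the heart of DCA can be stated in a few lines (our simplified sketch; the actual analysis instruments real loop iterations and memory state rather than a copyable State value):

    #include <functional>

    // Two iterations commute on a given input if running them in either
    // order leaves the program in the same state; a loop whose iteration
    // pairs all commute is parallelizable for that input.
    template <typename State>
    bool commutes(const State& start,
                  const std::function<void(State&)>& iter_i,
                  const std::function<void(State&)>& iter_j) {
        State ij = start, ji = start;
        iter_i(ij); iter_j(ij);
        iter_j(ji); iter_i(ji);
        return ij == ji;   // State must provide operator==
    }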
|
Verbrugge, Clark |
CGO '21: "r3d3: Optimized Query Compilation ..."
r3d3: Optimized Query Compilation on GPUs
Alexander Krolik, Clark Verbrugge, and Laurie Hendren (McGill University, Canada) Query compilation is an effective approach to improve the performance of repeated database queries. GPU-based approaches have significant promise, but face difficulties in managing compilation time, data transfer costs, and in addressing a reasonably comprehensive range of SQL operations. In this work we describe a hybrid AoT/JIT approach to GPU-based query compilation. We use multiple optimizations to reduce execution, compile, and data transfer times, improving performance over both other GPU-based approaches and CPU-based query compilers as well. Our design addresses a wide range of SQL queries, sufficient to demonstrate the practicality of using GPUs for query optimization. @InProceedings{CGO21p277, author = {Alexander Krolik and Clark Verbrugge and Laurie Hendren}, title = {r3d3: Optimized Query Compilation on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {277--288}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Functional Results Reproduced |
|
Wang, Jie |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when many tensorized instructions are involved. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
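The "new computing idiom" is a grouped multiply-reduce, shown here as a scalar loop (our illustration of the shape of loop that instructions like Intel VNNI and ARM DOT execute per lane, not UNIT code):

    #include <cstdint>

    // Four int8 products reduced into one int32 accumulator element: the
    // loop shape that UNIT detects and maps onto a tensorized instruction.
    int32_t dot4_i8(const int8_t a[4], const int8_t b[4], int32_t acc) {
        for (int k = 0; k < 4; ++k)
            acc += int32_t(a[k]) * int32_t(b[k]);
        return acc;
    }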
|
Wang, Leyuan |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when many tensorized instructions are involved. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Wang, Wenwen |
CGO '21: "Enhancing Atomic Instruction ..."
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-link/Store-conditional (LL/SC), which are mostly found in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), which is mostly found in complex instruction set architectures (CISC). However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve minimum, maximum, and geometric-mean speedups of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme. @InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
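The ABA problem mentioned above is easy to picture with a classic illustration (ours, not one of the paper's QEMU schemes; memory reclamation is omitted for brevity): a CAS on the pointer alone cannot see that a stack top was popped and re-pushed in between, while pairing the pointer with a version counter makes the intervening update visible.

    #include <atomic>
    #include <cstdint>

    struct Node { Node* next; };

    struct VersionedTop {
        Node*    ptr;
        uint64_t ver;                       // bumped on every update
    };

    std::atomic<VersionedTop> top{{nullptr, 0}};

    Node* pop() {
        VersionedTop old = top.load();
        // A CAS comparing only `ptr` could succeed even if other threads
        // popped and re-pushed old.ptr meanwhile (ABA); the counter forbids it.
        while (old.ptr &&
               !top.compare_exchange_weak(
                   old, VersionedTop{old.ptr->next, old.ver + 1})) {
            // on failure, compare_exchange_weak reloads `old`
        }
        return old.ptr;
    }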
|
Wang, Xueying |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the precision loss incurred, at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } |
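Residual-based refinement can be sketched on a scalar dot product (our toy version under an assumed symmetric quantization scheme with values in the int8 range; the paper applies the idea to Tensor Core GEMMs): the value discarded when quantizing each element is itself quantized at a finer scale, and its contribution is added back.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    float dot_refined(const std::vector<float>& a,
                      const std::vector<float>& b,
                      float scale_a, float scale_b) {
        int32_t main_acc = 0, resid_acc = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            int8_t qa = (int8_t)std::lround(a[i] / scale_a);
            int8_t qb = (int8_t)std::lround(b[i] / scale_b);
            // Residual: what rounding a[i] discarded, at a 128x finer scale.
            int8_t ra = (int8_t)std::lround((a[i] - qa * scale_a)
                                            / (scale_a / 128));
            main_acc  += int32_t(qa) * qb;   // low-precision bulk work
            resid_acc += int32_t(ra) * qb;   // correction term
        }
        return main_acc  * scale_a * scale_b
             + resid_acc * (scale_a / 128) * scale_b;
    }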
|
Wang, Yida |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when many tensorized instructions are involved. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Wellein, Gerhard |
CGO '21: "YaskSite: Stencil Optimization ..."
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures
Christie L. Alappat, Johannes Seiferth, Georg Hager, Matthias Korch, Thomas Rauber, and Gerhard Wellein (University of Erlangen-Nuremberg, Germany; University of Bayreuth, Germany) The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time consuming, expensive and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel’s YASK framework. It combines YASK’s flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further we show that YaskSite’s features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs. @InProceedings{CGO21p174, author = {Christie L. Alappat and Johannes Seiferth and Georg Hager and Matthias Korch and Thomas Rauber and Gerhard Wellein}, title = {YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {174--186}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Weng, Jian |
CGO '21: "UNIT: Unifying Tensorized ..."
UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki (University of California at Los Angeles, USA; Amazon, USA) Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have begun to offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when many tensorized instructions are involved. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. @InProceedings{CGO21p77, author = {Jian Weng and Animesh Jain and Jie Wang and Leyuan Wang and Yida Wang and Tony Nowatzki}, title = {UNIT: Unifying Tensorized Instruction Compilation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {77--89}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Xue, Jingling |
CGO '21: "Unleashing the Low-Precision ..."
Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs
Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia; Jilin University, China) Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge faced when accelerating a tensor program on tensor-specialized hardware is how to achieve the best performance possible in reduced precision by fully utilizing its computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the precision loss incurred, at acceptable overheads. @InProceedings{CGO21p90, author = {Guangli Li and Jingling Xue and Lei Liu and Xueying Wang and Xiu Ma and Xiao Dong and Jiansong Li and Xiaobing Feng}, title = {Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {90--102}, doi = {}, year = {2021}, } CGO '21: "GoBench: A Benchmark Suite ..." GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, the rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 out of these 82 bugs and 36 additional bugs reported in a recent study to preserve their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. @InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Yew, Pen-Chung |
CGO '21: "Variable-Sized Blocks for ..."
Variable-Sized Blocks for Locality-Aware SpMV
Naveen Namashivayam, Sanyam Mehta, and Pen-Chung Yew (HPE, USA; University of Minnesota at Twin Cities, USA) Blocking is an important optimization option available to mitigate the data movement overhead and improve the temporal locality in SpMV, a sparse BLAS kernel with an irregular memory reference pattern. In this work, we propose an analytical model to determine the effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized as opposed to constant-sized in most existing SpMV algorithms. We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on the Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows a speedup of up to 2.62X (with an average of 1.73X) and 2.02X (with an average of 1.18X) over the highly vendor-tuned SpMV implementation in Intel’s Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively. @InProceedings{CGO21p211, author = {Naveen Namashivayam and Sanyam Mehta and Pen-Chung Yew}, title = {Variable-Sized Blocks for Locality-Aware SpMV}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {211--221}, doi = {}, year = {2021}, } CGO '21: "Enhancing Atomic Instruction ..." Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-link/Store-conditional (LL/SC), which are mostly found in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), which is mostly found in complex instruction set architectures (CISC). However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve minimum, maximum, and geometric-mean speedups of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme.
@InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Yuan, Ting |
CGO '21: "GoBench: A Benchmark Suite ..."
GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs
Ting Yuan, Guangwei Li, Jie Lu, Chen Liu, Lian Li, and Jingling Xue (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; UNSW, Australia) Go, a fast-growing programming language, is often considered "the programming language of the cloud". The language provides a rich set of synchronization primitives, making it easy to write concurrent programs with great parallelism. However, the rich set of primitives also introduces many bugs. We build GoBench, the first benchmark suite for Go concurrency bugs. Currently, GoBench consists of 82 real bugs from 9 popular open source applications and 103 bug kernels. The bug kernels are carefully extracted and simplified from 67 out of these 82 bugs and 36 additional bugs reported in a recent study to preserve their bug-inducing complexities as much as possible. These bugs cover a variety of concurrency issues, both traditional and Go-specific. We believe GoBench will be instrumental in helping researchers understand concurrency bugs in Go and develop effective tools for their detection. We have therefore evaluated a range of representative concurrency error detection tools using GoBench. Our evaluation has revealed their limitations and provided insights for making further improvements. @InProceedings{CGO21p187, author = {Ting Yuan and Guangwei Li and Jie Lu and Chen Liu and Lian Li and Jingling Xue}, title = {GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {187--199}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
|
Zhang, Yunming |
CGO '21: "Compiling Graph Applications ..."
Compiling Graph Applications for GPUs with GraphIt
Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, and Saman Amarasinghe (Massachusetts Institute of Technology, USA; Adobe, USA) The performance of graph programs depends highly on the algorithm, size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations as well as the hardware platform to use. The GraphIt programming language makes it easy for developers to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance CUDA code for GPUs. Developers have to resort to re-implementing the entire algorithm with an entirely different set of abstractions and optimizations in order to achieve high performance. We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertex set creation, active vertex set processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balance by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments. @InProceedings{CGO21p248, author = {Ajay Brahmakshatriya and Yunming Zhang and Changwan Hong and Shoaib Kamil and Julian Shun and Saman Amarasinghe}, title = {Compiling Graph Applications for GPUs with GraphIt}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {248--261}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhao, Ziyi |
CGO '21: "Enhancing Atomic Instruction ..."
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, and Pen-Chung Yew (Nankai University, China; University of Georgia, USA; University of Minnesota at Twin Cities, USA) Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT, it is important to provide equivalent semantics for atomic synchronization instructions such as Load-link/Store-conditional (LL/SC), which are mostly found in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), which is mostly found in complex instruction set architectures (CISC). However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may cause the well-known ABA problem, which could lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), it often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either result in heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address those issues and implement them on QEMU to evaluate their performance overheads. The results show that all of the proposed schemes can provide correct emulation and, for the best solution, can achieve minimum, maximum, and geometric-mean speedups of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme. @InProceedings{CGO21p351, author = {Ziyi Zhao and Zhang Jiang and Ying Chen and Xiaoli Gong and Wenwen Wang and Pen-Chung Yew}, title = {Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {351--362}, doi = {}, year = {2021}, } Info Artifacts Available Artifacts Reusable Results Reproduced |
|
Zheng, Hengjie |
CGO '21: "Scaling Up the IFDS Algorithm ..."
Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing
Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, and Lin Gao (Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TianqiSoft, China) The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practice. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memorizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usage reaches a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps which take from 10 to 128 GB of RAM with FlowDroid, DiskDroid can analyze them with less than 10 GB of RAM with a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128 GB of RAM with FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10 GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid. @InProceedings{CGO21p236, author = {Haofeng Li and Haining Meng and Hengjie Zheng and Liqing Cao and Jie Lu and Lian Li and Lin Gao}, title = {Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {236--247}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional |
|
Zheng, Ruohuang |
CGO '21: "Efficient Execution of Graph ..."
Efficient Execution of Graph Algorithms on CPU with SIMD Extensions
Ruohuang Zheng and Sreepathi Pai (University of Rochester, USA) Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare to the GPU. Evaluation results show that on an 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework, GraphIt, by 1.53x, averaged across 5 (common) benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU's virtual memory performs better when graphs are much bigger than available physical memory. @InProceedings{CGO21p262, author = {Ruohuang Zheng and Sreepathi Pai}, title = {Efficient Execution of Graph Algorithms on CPU with SIMD Extensions}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {262--276}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
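The kind of loop being vectorized can be illustrated as follows (our example, not the compiler's output; it assumes a CSR layout, OpenMP SIMD support, and distinct neighbors per vertex so the scattered update has no intra-vector conflicts):

    #include <cstdint>
    #include <vector>

    // Relax all out-edges of vertex u; the inner edge loop is the natural
    // SIMD target, with col[] driving gathers into dist[].
    void relax_edges(const std::vector<int32_t>& row_off,
                     const std::vector<int32_t>& col,
                     const std::vector<float>& w,
                     std::vector<float>& dist, int32_t u) {
        float du = dist[u];
        #pragma omp simd
        for (int32_t e = row_off[u]; e < row_off[u + 1]; ++e) {
            float nd = du + w[e];
            if (nd < dist[col[e]]) dist[col[e]] = nd;  // gather + scatter
        }
    }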
|
Zhou, Huiyang |
CGO '21: "Relaxed Peephole Optimization: ..."
Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits
Ji Liu, Luciano Bello, and Huiyang Zhou (North Carolina State University, USA; IBM Research, USA) As in classical computing, compilers play an important role in quantum computing. Quantum processors typically support a limited set of primitive operations or quantum gates and have certain hardware-related limitations. A quantum compiler is responsible for adapting a quantum program to these constraint environments and decomposing quantum gates into a sequence of the primitive ones. During the compilation process, it is also critical for the compiler to optimize the quantum circuits in order to reduce the noise in the computation results. Since the noise is introduced by operations and decoherence, reducing the gate count is the key for improving performance. In this paper, we propose a novel quantum compiler optimization, named relaxed peephole optimization (RPO) for quantum computers. RPO leverages the single-qubit state information that can be determined statically by the compiler. We define that a qubit is in a basis state when, at a given point in time, its state is either in the X-, Y-, or Z-basis (|+>/|−>, |L>/|R> and |0>/|1>). When basis qubits are used as inputs to quantum gates, there exist opportunities for strength reduction, which replaces quantum operations with equivalent but less expensive ones. Compared to the existing peephole optimization for quantum programs, our proposed optimization does not require an identical unitary matrix, hence the name ‘relaxed’ peephole optimization. We also extend our approach to optimize the quantum gates when some input qubits are in known pure states. Both optimizations, namely the Quantum Basis-state Optimization (QBO) and the Quantum Pure-state Optimization (QPO), are implemented in IBM’s Qiskit transpiler. Our experimental results show that our proposed optimization pass is fast and effective. The circuits optimized with our compiler optimizations obtain up to 18.0% (11.7% on average) fewer CNOT gates and up to 8.2% (7.1% on average) lower transpilation time than the most aggressive optimization level in the Qiskit compiler. When running on real quantum computers, the success rate of a 3-qubit quantum phase estimation algorithm improves by 2.30X due to the reduced gate counts. @InProceedings{CGO21p301, author = {Ji Liu and Luciano Bello and Huiyang Zhou}, title = {Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {301--314}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
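A toy model of the basis-state strength reduction (ours; QBO/QPO are implemented inside Qiskit's transpiler, which we do not reproduce): when the compiler statically knows a CNOT's control is in |0> the gate is an identity and can be dropped, and when the control is in |1> the gate reduces to an X on the target.

    #include <optional>
    #include <vector>

    enum class Basis { Zero, One, Unknown };

    struct Gate {
        const char* name;
        int control;    // -1 if the gate has no control qubit
        int target;
    };

    // Returns the (possibly cheaper) replacement gate, or nothing if the
    // gate can be removed outright.
    std::optional<Gate> reduce_cnot(const Gate& g,
                                    const std::vector<Basis>& qubit_state) {
        if (g.control >= 0) {
            if (qubit_state[g.control] == Basis::Zero)
                return std::nullopt;               // identity: drop it
            if (qubit_state[g.control] == Basis::One)
                return Gate{"x", -1, g.target};    // strength-reduce to X
        }
        return g;                                  // keep unchanged
    }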
|
Zhou, Keren |
CGO '21: "GPA: A GPU Performance Advisor ..."
GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey (Rice University, USA) Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program’s structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.58×, with a geometric mean of 1.22×. @InProceedings{CGO21p115, author = {Keren Zhou and Xiaozhu Meng and Ryuichi Sai and John Mellor-Crummey}, title = {GPA: A GPU Performance Advisor Based on Instruction Sampling}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {115--125}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable Results Reproduced |
|
Zhou, Ruoyu |
CGO '21: "Cinnamon: A Domain-Specific ..."
Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring
Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, and Timothy M. Jones (University of Cambridge, UK; University of Sussex, UK) Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error. To address this, we introduce Cinnamon, a domain-specific language designed for writing binary profiling and monitoring programs. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex low-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic, or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon-to-C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use cases. @InProceedings{CGO21p103, author = {Mahwish Arif and Ruoyu Zhou and Hsi-Ming Ho and Timothy M. Jones}, title = {Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {103--114}, doi = {}, year = {2021}, } |
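A hypothetical sketch of the kind of logic such a DSL abstracts over; this is not Cinnamon syntax nor any real framework's API. The analysis itself, counting executed instructions per routine, reduces to a single callback, while the framework-specific registration and plumbing around it is exactly the boilerplate Cinnamon aims to generate.

    # Hypothetical illustration: the portable core of an instruction-counting
    # tool, separated from Pin/DynamoRIO/etc.-specific plumbing. The callback
    # and its simulated trace are stand-ins, not a real framework interface.
    from collections import Counter

    counts = Counter()

    def on_instruction(routine_name: str) -> None:
        # Hook a framework would invoke for every executed instruction;
        # generating this hook and its registration code is the DSL's job.
        counts[routine_name] += 1

    # Simulated trace standing in for a real binary execution.
    for routine in ["main", "main", "memcpy", "main"]:
        on_instruction(routine)

    print(counts.most_common())   # [('main', 3), ('memcpy', 1)]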
|
Zinenko, Oleksandr |
CGO '21: "MLIR: Scaling Compiler Infrastructure ..."
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko (Google, USA; Indian Institute of Science, India; Google, France) This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, the significant cost of building domain-specific compilers, and the difficulty of connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators, and optimizers at different levels of abstraction and across application domains, hardware targets, and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by this novel design point in its semantics, optimization specification, system design, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures, and semantics. @InProceedings{CGO21p2, author = {Chris Lattner and Mehdi Amini and Uday Bondhugula and Albert Cohen and Andy Davis and Jacques Pienaar and River Riddle and Tatiana Shpeisman and Nicolas Vasilache and Oleksandr Zinenko}, title = {MLIR: Scaling Compiler Infrastructure for Domain Specific Computation}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {2--14}, doi = {}, year = {2021}, } Artifacts Available Artifacts Reusable
CGO '21: "Progressive Raising in Multi-level ..."
Progressive Raising in Multi-level IR
Lorenzo Chelini, Andi Drebes, Oleksandr Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, and Henk Corporaal (Eindhoven University of Technology, Netherlands; Inria, France; ENS Paris, France; Google, France; Google, Switzerland; University of Edinburgh, UK) Multi-level intermediate representations (IRs) show great promise for lowering the design costs of domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not support raising in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to progressive lowering in multi-level IRs that raises from lower- to higher-level abstractions in order to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach to progressive raising, implemented on top of the MLIR framework, and demonstrate progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths enable subsequent high-level domain-specific transformations that yield significant performance improvements.
@InProceedings{CGO21p15, author = {Lorenzo Chelini and Andi Drebes and Oleksandr Zinenko and Albert Cohen and Nicolas Vasilache and Tobias Grosser and Henk Corporaal}, title = {Progressive Raising in Multi-level IR}, booktitle = {Proc.\ CGO}, publisher = {IEEE}, pages = {15--26}, doi = {}, year = {2021}, } Artifacts Available Artifacts Functional Results Reproduced |
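A minimal sketch of what raising means in this setting, independent of Multi-Level Tactics' actual implementation: recognize that a low-level affine loop nest computes a matrix multiplication and replace it with the equivalent high-level operation, which then becomes eligible for domain-specific transformations (tiling, library calls, accelerator offload). NumPy stands in here for the high-level linear algebra abstraction; the function names are illustrative.

    # Hypothetical illustration of "raising": the two functions below have
    # identical semantics at different abstraction levels. A raising pass
    # would rewrite the first form into the second, not the reverse.
    import numpy as np

    def lowered_matmul(A, B, C):
        # Affine loop nest as it might look after lowering from a
        # general-purpose language: too low-level for matmul-specific
        # optimizations to apply.
        n, k = A.shape
        _, m = B.shape
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i, j] += A[i, p] * B[p, j]

    def raised_matmul(A, B, C):
        # The raised form: a single high-level linear algebra operation.
        C += A @ B

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 5)), rng.random((5, 3))
    C1, C2 = np.zeros((4, 3)), np.zeros((4, 3))
    lowered_matmul(A, B, C1)
    raised_matmul(A, B, C2)
    assert np.allclose(C1, C2)   # same semantics, different abstraction level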
137 authors