GPGPU 2015 – Proceedings

A Comparative Investigation of Device-Specific Mechanisms for Exploiting HPC Accelerators
Ayman Tarakji, Lukas Börger, and Rainer Leupers
(RWTH Aachen University, Germany)
A variety of computational accelerators have been greatly improved in recent years. Intel's MIC (Many Integrated Core) and both GPU architectures, NVIDIA's Kepler and AMD's Graphics Core Next, all represent real innovations in the field of HPC. Based on the single unified programing interface OpenCL, this paper reports a careful study of a well thought-out selection of such devices. A micro-benchmark suite is designed and implemented to investigate the capability of each accelerator to exploit parallelism in OpenCL. Our results expose the relationship between several programing aspects and their possible impact on performance. Instruction level parallelism, intra-kernel vector parallelism, multiple-issue, work-group size, instruction scheduling and a variety of other aspects are explored, highlighting their interaction that must be carefully considered when developing applications for heterogeneous architectures. Evidence-based findings related to microarchitectural features as well as performance characteristics are cross-checked with reference to the compiled code being executed. In conclusion, a case study involving a real application is presented as a part of the verification process of statements.

Cache and Shared Memory

GPU-SM: Shared Memory Multi-GPU Programming
Javier Cabezas, Marc Jordà, Isaac Gelado

, Nacho Navarro, and Wen-mei Hwu

(Barcelona Supercomputing Center, Spain; NVIDIA, USA; Universitat Politècnica de Catalunya, Spain; University of Illinois at Urbana-Champaign, USA)
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines to program multi-GPU systems like NUMA shared memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories and data that resides on a different GPU is accessed remotely through the PCI interconnect. The programmability benefits of the shared-memory model on GPUs are shown using a finite difference and an image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to the local GPU memory, we show that the highly-multithreaded GPU execution model can help reducing its costs. Evaluation of finite difference and image filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g. a 40% SLOC reduction in the host code of finite difference).

Adaptive GPU Cache Bypassing
Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jiménez
(Texas A&M University, USA; AMD Research, USA)
Modern graphics processing units (GPUs) include hardware- controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) comput- ing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches be- tween many threads. Caching these streaming data struc- tures needlessly burns power while evicting data that may otherwise fit into the cache. We propose a GPU cache management technique to im- prove the efficiency of small GPU caches while further re- ducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoid- ing needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16KB L1 data cache, dynamic bypassing achieves sim- ilar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.

Efficient Utilization of GPGPU Cache Hierarchy
Mahmoud Khairy, Mohamed Zahran, and Amr G. Wassal
(Cairo University, Egypt; New York University, USA)
Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing which may degrade the performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory accesses that show streaming behavior. In the second technique, we propose dynamic warp throttling via cores sampling (DWT-CS) to alleviate cache thrashing by throttling the number of active warps per core. DWT-CS monitors the MPKI at L1, when it exceeds a specific threshold, all GPU cores are sampled with different number of active warps to find the optimal number of warps that mitigates thrashing and achieves the highest performance. Our proposed third technique addresses the problem of GPU cache associativity since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls. In this work, instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), that is based on polynomial modulus mapping, in order to fairly and evenly distribute memory accesses over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively. Compared to prior work, it achieves 1.7X and 1.5X performance improvement over Cache-Conscious Wavefront Scheduler and Memory Request Prioritization Buffer respectively.

Optimization

Effects of Source-Code Optimizations on GPU Performance and Energy Consumption
Jared Coplin and Martin Burtscher
(Texas State University, USA)
This paper studies the effects of source-code optimizations on the performance, power draw, and energy consumption of a modern compute GPU. We evaluate 128 versions of two n-body codes: a compute-bound regular implementation and a memory-bound irregular implementation. Both programs include six optimizations that can be individually enabled or disabled. We measured the active runtime and the power consumption of each code version on three inputs, various GPU clock frequencies, two arithmetic precisions, and with and without ECC. This paper investigates which optimizations primarily improve energy efficiency, which ones mainly boost performance, and which ones help both aspects. Some optimizations also have the added benefit of reducing the power draw. Our analysis shows that individual and combinations of optimizations can alter the performance and energy consumption of a GPU kernel by up to a factor of five.

Optimization for Performance and Energy for Batched Matrix Computations on GPUs
Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, and Jack Dongarra
(University of Tennessee, USA; Oak Ridge National Laboratory, USA; University of Manchester, UK)
As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5 speedup on the K40 GPU.

Helium: A Transparent Inter-kernel Optimizer for OpenCL
Thibaut Lutz, Christian Fensch, and Murray Cole
(University of Edinburgh, UK; Heriot-Watt University, UK)
State of the art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad-hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach weak or even unviable for applications involving more than a small number of kernel invocations or a highly dynamic control flow, leaving substantial potential opportunities unexplored. It also leads to an over complex, hard to maintain code base.

We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter-and intra-kernel optimization. Helium is implemented as preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations, on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains at least the same, and frequently even better performance, than carefully handtuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernel cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand optimized code.

Applications

Stochastic Gradient Descent on GPUs
Rashid Kaleem, Sreepathi Pai, and Keshav Pingali
(University of Texas at Austin, USA)
Irregular algorithms such as Stochastic Gradient Descent (SGD) can benefit from the massive parallelism available on GPUs. However, unlike in data-parallel algorithms, synchronization patterns in SGD are quite complex. Furthermore, scheduling for scale-free graphs is challenging. This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling. We observe that static schedules do not yield better performance despite eliminating the need to perform conflict detection and resolution at runtime. We identify the source of the performance degradation to be the structure of certain parts of the graph (dense vs sparse). This classification can be used to devise hybrid scheduling strategies which exploit different schedules for different regions of the graph to obtain better performance. We found that the best schedule for some problems can be up to two orders of magnitude faster than the worst one. To evaluate the performance of our GPU implementation, we also compare against a CPU implementation of SGD. Dynamic schedules perform comparably to a 14-thread CPU implementation, while a static schedule performs comparably to a 6-thread CPU implementation.

High Performance Computing of Fiber Scattering Simulation
Leiming Yu, Yan Zhang, Xiang Gong, Nilay Roy, Lee Makowski, and David Kaeli

(Northeastern University, USA)
Cellulose is one of the most promising energy resources that is waiting to be tapped. Harvesting energy from cellulose requires decoding its atomic structure. Some structural information can be exposed by modeling data produced by X-ray scattering. Forward simulation can be used to explore structural parameters of cellulose, including the diameter, twist and coiling, but modeling fiber scattering is computationally challenging. In this paper, we explore how to accelerate a molecular scattering algorithm by leveraging a modern high-end Graphic Processing Unit (GPU). A step-wise optimization approach is described in this work that considers memory utilization, math intrinsics, concurrent kernel execution and workload partitioning. Different caching strategies to manage the state of the atom volume in memory are taken into account. We have developed optimized cluster solutions for both CPUs and GPUs. Different workload distribution schemes and con- current execution approaches for both CPUs and GPUs have been investigated. Leveraging accelerators hosted on a cluster, we have reduced days/weeks of intensive simulation to parallel execution of just a few minutes/seconds. Our GPU-integrated cluster solution can potentially support concurrent modeling of hundreds of cellulose fibril structures, opening up new avenues for energy research.

Rethinking the Parallelization of Random-Restart Hill Climbing: A Case Study in Optimizing a 2-Opt TSP Solver for GPU Execution
Molly A. O'Neil and Martin Burtscher
(Texas State University, USA)
Random-restart hill climbing is a common approach to combinatorial optimization problems such as the traveling salesman problem (TSP). We present and evaluate an implementation of random-restart hill climbing with 2-opt local search applied to TSP. Our implementation is capable of addressing large problem sizes at high throughput. It is based on the key insight that the GPU’s hierarchical hardware parallelism is best exploited with a hierarchical implementation strategy, where independent climbs are parallelized between blocks and the 2-opt evaluations are parallelized across the threads within a block. We analyze the performance impact of this and other optimizations on our heuristic TSP solver and compare its performance to existing GPU-based 2-opt TSP solvers as well as a parallel CPU implementation. Our code outperforms the existing implementations by up to 3X, evaluating up to 60 billion 2-opt moves per second on a single K40 GPU. It also outperforms an OpenMP implementation run on 20 CPU cores by up to 8X.

Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs
Mahesh Ravishankar, Justin Holewinski, and Vinod Grover

(NVIDIA, USA)
As architectures evolve, optimization techniques to obtain good performance evolve as well. Using low-level programming languages like C/C++ typically results in architecture-specific optimization techniques getting entangled with the application specification. In such situations, moving from one target architecture to another usually requires a reimplementation of the entire application. Further, several compiler transformations are rendered ineffective due to implementation choices. Domain-Specific Languages (DSL) tackle both these issues by allowing developers to specify the computation at a high level, allowing the compiler to handle many tedious and error-prone tasks, while generating efficient code for multiple target architectures at the same time. Here we present Forma, a DSL for image processing applications that targets both CPUs and GPUs. The language provides syntax to express several operations like stencils, sampling, etc. which are commonly used in this domain. These can be chained together to specify complex pipelines in a concise manner. The Forma compiler is in charge of tedious tasks like memory management, data transfers from host to device, handling boundary conditions, etc. The high-level description allows the compiler to generate efficient code through use of compile-time analysis and by taking advantage of hardware resources, like texture memory on GPUs. The ease with which complex pipelines can be specified in Forma is demonstrated through several examples. The efficiency of the generated code is evaluated through comparison with a state-of-the-art DSL that targets the same domain, Halide. Our experimental result show that using Forma allows developers to obtain comparable performance on both CPU and GPU with lesser programmer effort. We also show how Forma could be easily integrated with widely used productivity tools like Python and OpenCV. Such an integration would allow users of such tools to develop efficient implementations easily.

GPGPU 2015 – Proceedings

8th Workshop on General Purpose Processing using GPUs (GPGPU 8)

Preface

HPC

Cache and Shared Memory

Optimization

Applications