Workshop FHPC 2015 – Author Index |
Contents -
Abstracts -
Authors
|
Chakravarty, Manuel M. T. |
![]() Frederik M. Madsen, Robert Clifton-Everest, Manuel M. T. Chakravarty, and Gabriele Keller (University of Copenhagen, Denmark; UNSW, Australia) Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach. ![]() |
|
Clifton-Everest, Robert |
![]() Frederik M. Madsen, Robert Clifton-Everest, Manuel M. T. Chakravarty, and Gabriele Keller (University of Copenhagen, Denmark; UNSW, Australia) Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach. ![]() |
|
Duke, David J. |
![]() David J. Duke and Fouzhan Hosseini (University of Leeds, UK) Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads. ![]() |
|
Holk, Eric |
![]() Michael Vollmer, Bo Joel Svensson, Eric Holk, and Ryan R. Newton (Indiana University, USA) Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code. ![]() ![]() Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor L. McDonell, and Ryan R. Newton (Indiana University, USA) High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel. ![]() |
|
Hosseini, Fouzhan |
![]() David J. Duke and Fouzhan Hosseini (University of Leeds, UK) Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads. ![]() |
|
Kameyama, Yukiyoshi |
![]() Naoki Takashima, Hiroki Sakamoto, and Yukiyoshi Kameyama (University of Tsukuba, Japan) We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, that means a single code generator can generate code in C and LLVM, without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing. ![]() |
|
Keller, Gabriele |
![]() Frederik M. Madsen, Robert Clifton-Everest, Manuel M. T. Chakravarty, and Gabriele Keller (University of Copenhagen, Denmark; UNSW, Australia) Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach. ![]() |
|
Madsen, Frederik M. |
![]() Frederik M. Madsen, Robert Clifton-Everest, Manuel M. T. Chakravarty, and Gabriele Keller (University of Copenhagen, Denmark; UNSW, Australia) Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach. ![]() |
|
McDonell, Trevor L. |
![]() Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor L. McDonell, and Ryan R. Newton (Indiana University, USA) High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel. ![]() |
|
Newton, Ryan R. |
![]() Michael Vollmer, Bo Joel Svensson, Eric Holk, and Ryan R. Newton (Indiana University, USA) Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code. ![]() ![]() Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor L. McDonell, and Ryan R. Newton (Indiana University, USA) High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel. ![]() |
|
Romanov, Alexey |
![]() Alexander Slesarenko and Alexey Romanov (Huawei Technologies, Russia) While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high-performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging. We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features. Rewrite rules are represented as graph IR nodes with edges pointing to a pattern graph and a replacement graph; whenever new nodes are constructed, they are compared with the pattern graphs of all active rules and in case a match is found, the corresponding replacement graph is generated instead. Compilation stages are represented as graph transformers and together with the final output generation stage assembled into a compilation pipeline. This allows using multiple backends together, for example generating C/C++ code with JNI wrappers for the most performance-critical parts and Spark code which calls into it for the rest. We will show how object-oriented programming is supported by staging class constructors and method calls (including "factory" methods on companion objects) as part of the IR, thus exposing them to rewrite rules like all other operations. JVM mechanisms allow treating symbols as typed proxies for their corresponding nodes. Now it becomes necessary to eliminate such nodes at some compilation stage to avoid virtual dispatch in the output code (or at least minimize it for object-oriented target languages). In the simple case when the receiver node of a method is a class constructor, we can simply delegate the call to the subject at that stage. The more interesting case when the receiver node is the result of a calculation is handled by isomorphic specialization. This effectively enables virtual dispatch to be carried out at staging time, as described in our previous work. We will demonstrate how we use a Scala compiler plugin to further simplify development by avoiding the explicit use of the Rep type constructor and how our framework can handle effects using free monads. We will finish by discussing future plans for Scalan development. ![]() |
|
Sakamoto, Hiroki |
![]() Naoki Takashima, Hiroki Sakamoto, and Yukiyoshi Kameyama (University of Tsukuba, Japan) We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, that means a single code generator can generate code in C and LLVM, without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing. ![]() |
|
Slesarenko, Alexander |
![]() Alexander Slesarenko and Alexey Romanov (Huawei Technologies, Russia) While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high-performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging. We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features. Rewrite rules are represented as graph IR nodes with edges pointing to a pattern graph and a replacement graph; whenever new nodes are constructed, they are compared with the pattern graphs of all active rules and in case a match is found, the corresponding replacement graph is generated instead. Compilation stages are represented as graph transformers and together with the final output generation stage assembled into a compilation pipeline. This allows using multiple backends together, for example generating C/C++ code with JNI wrappers for the most performance-critical parts and Spark code which calls into it for the rest. We will show how object-oriented programming is supported by staging class constructors and method calls (including "factory" methods on companion objects) as part of the IR, thus exposing them to rewrite rules like all other operations. JVM mechanisms allow treating symbols as typed proxies for their corresponding nodes. Now it becomes necessary to eliminate such nodes at some compilation stage to avoid virtual dispatch in the output code (or at least minimize it for object-oriented target languages). In the simple case when the receiver node of a method is a class constructor, we can simply delegate the call to the subject at that stage. The more interesting case when the receiver node is the result of a calculation is handled by isomorphic specialization. This effectively enables virtual dispatch to be carried out at staging time, as described in our previous work. We will demonstrate how we use a Scala compiler plugin to further simplify development by avoiding the explicit use of the Rep type constructor and how our framework can handle effects using free monads. We will finish by discussing future plans for Scalan development. ![]() |
|
Svensson, Bo Joel |
![]() Michael Vollmer, Bo Joel Svensson, Eric Holk, and Ryan R. Newton (Indiana University, USA) Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code. ![]() ![]() Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor L. McDonell, and Ryan R. Newton (Indiana University, USA) High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel. ![]() |
|
Takashima, Naoki |
![]() Naoki Takashima, Hiroki Sakamoto, and Yukiyoshi Kameyama (University of Tsukuba, Japan) We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, that means a single code generator can generate code in C and LLVM, without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing. ![]() |
|
Vollmer, Michael |
![]() Michael Vollmer, Bo Joel Svensson, Eric Holk, and Ryan R. Newton (Indiana University, USA) Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code. ![]() ![]() Bo Joel Svensson, Michael Vollmer, Eric Holk, Trevor L. McDonell, and Ryan R. Newton (Indiana University, USA) High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel. ![]() |
20 authors
proc time: 0.72