
2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015), June 13, 2015, Portland, OR, USA

ARRAY 2015 – Proceedings



Title Page


Message from the ARRAY 2015 Organizing Committee
Welcome to the second ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, held in association with PLDI 2015, as part of the ACM FCRC in Portland, Oregon.

Loo.py: From Fortran to Performance via Transformation and Substitution Rules
Andreas Klöckner
(University of Illinois at Urbana-Champaign, USA)
A large amount of numerically oriented code has been written, and continues to be written, in legacy languages. Much of this code could, in principle, make good use of data-parallel, throughput-oriented computer architectures. Loo.py, a transformation-based programming system targeted at GPUs and general data-parallel architectures, provides a mechanism for user-controlled transformation of array programs. This transformation capability is designed to apply not just to programs written specifically for Loo.py, but also to those imported from other languages such as Fortran. It eases the trade-off between achieving high performance, portability, and programmability by allowing the user to apply a large and growing family of transformations to an input program. These transformations are expressed in and used from Python and may be applied in a variety of settings, including in a pragma-like manner from other languages.
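For readers unfamiliar with this transformation style, the following minimal sketch (written against the publicly available loopy Python package and PyOpenCL; it is illustrative only and not code from the paper) builds a kernel from a loop domain and then retargets it by splitting and tagging its loop:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array
    import loopy as lp

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Describe the computation as a loop domain plus an instruction.
    knl = lp.make_kernel(
        "{ [i]: 0 <= i < n }",
        "out[i] = 2 * a[i]")

    # A user-applied transformation: split the loop into blocks of 128 and
    # map the resulting loops onto OpenCL work-groups and work-items.
    knl = lp.split_iname(knl, "i", 128, outer_tag="g.0", inner_tag="l.0")

    a = cl.array.arange(queue, 1024, dtype=np.float32)
    evt, (out,) = knl(queue, a=a)   # n is inferred from the shape of a

The same splitting and tagging could equally be applied to a kernel imported from Fortran source, which is the scenario emphasized in the paper.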

Techniques for Efficient MATLAB-to-C Compilation
João Bispo, Luís Reis, and João M. P. Cardoso
(University of Porto, Portugal)
MATLAB-to-C translation is foreseen to raise the overall abstraction level when mapping computations to embedded systems (possibly consisting of software and hardware components), thus increasing productivity and providing an automated, model-driven design flow. This paper describes recent work developed in the context of MATISSE, a MATLAB-to-C compiler targeting embedded systems. We introduce several techniques that allow the efficient generation of C code, such as weak types, primitives, and matrix views. We evaluate the compiler with a set of 9 publicly available benchmarks, targeting both embedded systems and a desktop system. We compare the execution time of the generated C code with the original code running on MATLAB, achieving a geometric mean speedup of 8.1x, and qualitatively compare our results with the performance of related approaches. The use of the new techniques allowed the compiler to achieve performance improvements of 46% on average.
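The matrix-view idea mentioned above can be illustrated, purely as an analogy, with NumPy slices, which alias the underlying buffer instead of copying it (this is a conceptual sketch, not MATISSE's generated C code):

    import numpy as np

    a = np.zeros((4, 4), dtype=np.float64)

    view = a[1:3, 1:3]          # a view: no data is copied
    view[:] = 7.0               # writing through the view updates a
    assert a[1, 1] == 7.0

    dup = a[1:3, 1:3].copy()    # an explicit copy, by contrast, is independent
    dup[:] = 0.0
    assert a[1, 1] == 7.0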

Compiling APL to Accelerate through a Typed Array Intermediate Language
Michael Budde, Martin Dybdal, and Martin Elsman
(University of Copenhagen, Denmark)
We present an approach for compiling a rich subset of APL into data-parallel programs that can be executed on GPUs. The compiler is based on the AplTail compiler, which compiles APL programs into a typed array intermediate language, called TAIL. We translate TAIL programs into Haskell source code, employing Accelerate, a Haskell library for general-purpose GPU programming. We demonstrate the feasibility of the approach by presenting some encouraging results for a number of smaller benchmarks. We also outline some problems that we need to overcome in order for the approach to result in competitive code for larger benchmarks.

Velociraptor: A Compiler Toolkit for Array-Based Languages Targeting CPUs and GPUs
Rahul Garg, Sameer Jagdale, and Laurie Hendren
(McGill University, Canada)
We present a toolkit called Velociraptor that can be used by compiler writers to quickly build compilers and other tools for array-based languages. Velociraptor operates on its own unique intermediate representation (IR) designed to support a variety of array-based languages. The toolkit also provides some novel analyses and transformations, such as region detection and specialization, as well as a dynamic backend with CPU and GPU code generation. We discuss the components of the toolkit and present case studies illustrating its use.

Performance Search Engine Driven by Prior Knowledge of Optimization
Youngsung Kim, Pavol Černý, and John Dennis
(University of Colorado at Boulder, USA; National Center for Atmospheric Research, USA)
For scientific array-based programs, optimization for a particular target platform is a hard problem. There are many optimization techniques, such as (semantics-preserving) source code transformations, compiler directives, environment variables, and compiler flags, that influence performance. Moreover, the performance impact of (combinations of) these factors is unpredictable. This paper focuses on providing a platform for automatically searching through a search space consisting of such optimization techniques. We provide (i) a search-space description language, which enables the user to describe the optimization options to be used; (ii) a search engine that tests the performance impact of optimization options by executing the optimized programs and checking their results; and (iii) an interface for implementing various search algorithms. We evaluate our platform using two simple search algorithms: a random search, and a casetree search that heuristically learns from the already examined parts of the search space. We show that such algorithms are easily implementable in our platform, and we empirically find that the framework can be used to find useful optimizations.
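As a hypothetical illustration of the kind of search such a platform supports (the names, flags, and structure below are invented for the sketch and do not reflect the paper's description language), a plain random search over compiler options might look like this:

    import random
    import subprocess
    import time

    # Hypothetical search space: each dimension lists alternative options.
    SEARCH_SPACE = {
        "opt_level": ["-O1", "-O2", "-O3"],
        "vectorize": ["", "-ftree-vectorize"],
        "unroll":    ["", "-funroll-loops"],
    }

    def compile_and_time(flags, src="kernel.c", runs=3):
        """Compile src with the chosen flags and return the best wall-clock time."""
        subprocess.run(["gcc", *[f for f in flags if f], src, "-o", "kernel"],
                       check=True)
        best = float("inf")
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(["./kernel"], check=True)
            best = min(best, time.perf_counter() - start)
        return best

    def random_search(space, budget=20, seed=0):
        """Sample configurations at random and keep the fastest one."""
        rng = random.Random(seed)
        best_cfg, best_time = None, float("inf")
        for _ in range(budget):
            cfg = {k: rng.choice(v) for k, v in space.items()}
            t = compile_and_time(cfg.values())
            if t < best_time:
                best_cfg, best_time = cfg, t
        return best_cfg, best_time

A casetree-style search, as described in the abstract, would replace the uniform sampling with choices biased by the results already observed.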

High-Level Accelerated Array Programming in the Web Browser
Mathias Bourgoin and Emmanuel Chailloux
(University of Grenoble, France; VERIMAG, France; University of Paris 6, France; LIP6, France)
Client-side web programming currently means using technologies embedded in web browsers to run computations on the client computer. Most solutions rely on JavaScript, which allows describing computations and modifications of the DOM displayed by the browser. However, JavaScript limits static checking, as everything (types, names, etc.) is checked at runtime. Moreover, its concurrency model does not take advantage of multi-core or GPU architectures. In this paper we present WebSPOC, an adapted version of the SPOC library for web applications. SPOC is an OCaml GPGPU library focused on abstracting memory transfers and handling GPGPU computations in a strongly, statically typed context. SPOC provides a specific language, called Sarek, to express kernels, and different parallel skeletons to compose them. To run SPOC programs on the client side of the Web, the OCaml part is compiled to JavaScript code and the Sarek part to kernels running on GPUs or multi-core CPUs.

Accelerating Information Experts through Compiler Design
Aaron W. Hsu
(Indiana University, USA)
Dyalog APL is a tool of thought for information experts, enabling rapid development of domain-centric software without the costly software engineering feedback loop often required. The Dyalog APL interpreter introduces performance constraints that hinder the analysis of large data sets, especially on highly-parallel computing architectures. The Co-dfns compiler project aims to reduce the overheads involved in creating high-performance code in APL. It focuses on integrating with the APL environment and compiles a familiar subset of the language, delivering significant performance and platform independence to information experts without requiring code rewrites and conversion into other languages.
The design of the Co-dfns compiler, itself an APL program, possesses a unique architecture that permits implementation without branching, recursion, or other complex forms of control flow. With specific optimizations integrated, the generated code competes with hand-written C code in the domain of financial simulations, and exceeds it when integrated into the environment. Preliminary results demonstrate platform-independent performance across CPUs and GPUs without modification of the source. Work continues to improve the performance both of the architecture and of the generated code. Eventually, the project hopes to convincingly demonstrate a wider range of techniques that extend the suitable domain for effective array programming.

Fusing Convolution Kernels through Tiling
Mahesh Ravishankar, Paulius Micikevicius, and Vinod Grover
(NVIDIA, USA)
Image processing pipelines are continuously being developed to deduce more information about objects captured in images. To facilitate the development of such pipelines, several Domain-Specific Languages (DSLs) have been proposed that provide constructs for easy specification of such computations. It is then up to the DSL compiler to generate code that executes the pipeline efficiently on multiple hardware architectures. While such compilers are getting ever more sophisticated, to achieve large-scale adoption these DSLs have to beat, or at least match, the performance that can be achieved by a skilled programmer. Many of these pipelines use a sequence of convolution kernels that are memory-bandwidth bound. One way to address this bottleneck is through the use of tiling. In this paper we describe an approach to tiling within the context of a DSL called Forma. Using the high-level specification of the pipeline in this DSL, we describe a code generation algorithm that fuses multiple stages of the pipeline through tiling to reduce the memory bandwidth requirements on both GPUs and CPUs. Using this technique improves the performance of pipelines like Canny Edge Detection by 58% on NVIDIA GPUs, and of the Harris Corner Detection pipeline by 71% on CPUs.
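The bandwidth argument behind tiling-based fusion can be sketched in a few lines of NumPy (illustrative only; this is not Forma's generated code): instead of materializing the full intermediate image between two convolution stages, each output tile is computed from an input slab that includes a small halo, so the intermediate stays tile-sized.

    import numpy as np

    def blur3(x):
        """3-point horizontal box filter; shrinks the width by 2."""
        return (x[:, :-2] + x[:, 1:-1] + x[:, 2:]) / 3.0

    def fused_two_stage(img, tile=64):
        """Apply blur3 twice, tile by tile, never storing the full intermediate.

        Two fused 3-point stencils need a 2-column halo on each side of a tile.
        """
        h, w = img.shape
        out_w = w - 4                       # each stage shrinks the width by 2
        out = np.empty((h, out_w), dtype=img.dtype)
        for start in range(0, out_w, tile):
            stop = min(start + tile, out_w)
            slab = img[:, start:stop + 4]   # tile plus halo
            out[:, start:stop] = blur3(blur3(slab))
        return out

    img = np.random.rand(128, 512)
    assert np.allclose(fused_two_stage(img), blur3(blur3(img)))

The halo columns are recomputed by neighbouring tiles; fusion trades this small amount of redundant work for a reduction in memory traffic, which is the kind of trade-off the paper exploits.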

Array Programming in Pascal
Paul Cockshott, Ciaran Mcreesh, Susanne Oehler, and Youssef Gdura
(University of Glasgow, UK; University of Tripoli, Libya)
A review of previous array Pascals leads on to a description of the Glasgow Pascal compiler. The compiler implements an ISO-Pascal superset with semantic extensions for translating data-parallel statements to run on multiple SIMD cores. An appendix includes demonstrations of the tool.

Abstract Expressionism for Parallel Performance
Robert Bernecky and Sven-Bodo Scholz
(Snake Island Research, Canada; Heriot-Watt University, UK)
Programming with abstract, mathematical expressions offers benefits including terser programs, easier communication of algorithms, the ability to prove theorems about algorithms, increased parallelism, and improved programming productivity. A common belief is that higher levels of abstraction imply a larger semantic gap between the user and the computer and, therefore, typically slower execution, whether sequential or parallel. In recent years, domain-specific languages have been shown to close this gap through sophisticated optimizations benefitting from domain-specific knowledge. In this paper, we demonstrate that the semantic gap can also be closed for non-domain-specific functional array languages, without requiring the embedding of language-specific semantic knowledge into the compiler tool chain. We present a simple example of APL-style programs, compiled into C code, that outperform equivalent C programs in both sequential and parallel (OpenMP) environments. We offer insights into abstract expressionist programming by comparing the characteristics and performance of a numerical relaxation benchmark written in C99; in C99 with OpenMP directives, scheduling code, and pragmas; and in SaC, a functional array language. We compare three algorithmic styles: if/then/else, hand-optimized loop splitting, and an abstract, functional style whose roots lie in APL. We show that the algorithms match or outperform serial C, and that the hand-optimized and abstract styles generate identical code and so have identical performance. Furthermore, the parallel variants outperform the best OpenMP C variant by up to a third, with no source code modifications. Preserving an algorithm's abstract expression during optimization opens the door to generating radically different code for different architectures.
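The stylistic contrast drawn in the abstract can be conveyed with a toy 1-D relaxation step, written here in NumPy purely for intuition (the paper's actual comparison is between C99, C99 with OpenMP, and a functional array language): one version uses an explicit loop with an if/then/else guard, the other a single abstract array expression.

    import numpy as np

    def relax_loop(v):
        """Loop-and-branch style: average each interior point with its neighbours."""
        out = v.copy()
        for i in range(len(v)):
            if 0 < i < len(v) - 1:
                out[i] = (v[i - 1] + v[i] + v[i + 1]) / 3.0
        return out

    def relax_array(v):
        """Abstract array style: the same step as one shifted-slice expression."""
        out = v.copy()
        out[1:-1] = (v[:-2] + v[1:-1] + v[2:]) / 3.0
        return out

    v = np.random.rand(1000)
    assert np.allclose(relax_loop(v), relax_array(v))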
