ARRAY 2017 – Proceedings

Message from the Organizing Committee
Welcome to the proceedings of the fourth ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, held in conjunction with PLDI 2017 in Barcelona, Spain. These proceedings contain the eight papers selected by the Program Committee from the fourteen submissions received.

Papers

Quad Ropes: Immutable, Declarative Arrays with Parallelizable Operations
Florian Biermann and Peter Sestoft

(IT University of Copenhagen, Denmark)
We describe the quad rope data structure, a representation of immutable two-dimensional arrays that avoids many of the performance pitfalls of plain C-style two-dimensional arrays. Our motivation is that, for end-user development in high-level declarative programming languages, it is impractical to let users choose between different array-like data structures. Instead, one should use the same, somewhat performance-robust, representation for every programming task.
Quad ropes roughly retain array efficiency, as long as programmers express their programs using high-level constructs. Moreover, they allow for fast concatenation and dynamic task-based parallelism and are well suited to represent sparse arrays. We describe their operational semantics and evaluate the performance of individual functions on quad ropes as well as declarative algorithms that use our quad rope implementation.

An ELI-to-C Compiler: Design, Implementation, and Performance
Hanfeng Chen, Wai-Mee Ching, and Laurie Hendren

(McGill University, Canada)
ELI is a succinct array-based interactive programming language derived from APL. In this paper we present the overall design and implementation of a bootstrapped ELI-to-C compiler which is implemented in ELI. We provide a brief introduction to the ELI language, a high-level view of the code generation strategy, and a description of our bootstrapping process. We also provide a preliminary performance evaluation. Firstly, we use three existing C benchmarks to demonstrate the performance of the ELI-generated C code as compared with interpreted ELI and native C. Secondly, we use two benchmarks originally from APL to compare the ELI-generated C to interpreted ELI and a naive hand-generated C version. These preliminary results are encouraging, showing speedups over the interpreter and in many cases performance close to C. The results also show that some future optimizations, such as copy elimination/avoidance, would be beneficial.

Array Programming in Whiley
David J. Pearce
(Victoria University of Wellington, New Zealand)
Arrays are a fundamental mechanism for developing and reasoning about programs. Using them, one can easily encode a range of important algorithms from various domains, such as for sorting, graph traversal, heap manipulation and more. However, the encoding of such problems in traditional languages is relatively opaque. That is, such programming languages do not allow those properties important for the given problem to be encoded within the language itself and, instead, rely up on programmer-supplied comments.
This paper explores how array-based programming is enhanced by programming languages which support specifications and invariants over arrays. Examples of such systems include Dafny, Why3, Whiley, Spec# and more. For this paper, we choose Whiley as this language provides good support for array-based programming. Whiley is a programming language designed for verification and employs a verifying compiler to ensure that programs meet their specifications. A number of features make Whiley particularly suitable for array-based programming, such as type invariants and abstract properties. We explore this through a series of worked examples.

Info

Flexible Data Views: Design and Implementation
Leo Osvald and Tiark Rompf

(Purdue University, USA)
In this paper, we present a library-based framework of data views over chunks of memory segments. Such views not only enable a uniform treatment of references and arrays, but they provide a more general abstraction in the sense that parts of arrays, references, or even views, can be combined into hierarchies to form new logical data structures. To provide efficient implementations in widely used industrial languages such as C++ and Scala, we employ static and dynamic multi-staging techniques, respectively. Through staging and code specialization, the overhead of traversal and tracking of such view hierarchies is mostly eliminated. Thus, our data views can be used as building blocks for creating data structures for which programmers need not pick a specific representation but can rely on code generation and specialization to provide the right implementation that meets asymptotic running time and space guarantees. We apply our approach in case studies in which two-dimensional array views are used to efficiently encode real-world matrices, showing performance on par with specialized data structures such as sparse matrices from popular linear algebra libraries (Armadillo and Eigen), or hand-tuned dense representations. We also show the practicality of specializing data views at run-time on the JVM via Lightweight Modular Staging, a Scala framework for dynamic multi-stage programming, by designing a user-friendly API that hides the underlying compilation through lazy evaluation and a uniform access principle.

Portable Vectorization and Parallelization of C++ Multi-dimensional Array Computations
Laurent Plagne and Kavoos Bojnourdi
(EDF Lab, France)
This paper presents Legolas++ arrays, a C++ multi-dimensional array library. Parameterized type of Legolas++ arrays enable data layout adaptation for specific Single Instruction Multiple Data (SIMD) core architectures. The mapping of complex array-based kernels to regular collections of data is efficiently vectorized. In addition, Legolas++ arrays can combine multi-threaded parallelism with SIMD acceleration. For example, a direct tridiagonal solver applied to a collection of equally sized problems exhibits a speedup of more than × 22 on an 8-core SIMD processor.

Efficient Array Slicing on the Intel Xeon Phi Coprocessor
Benjamin Andreassen Bjørnseth, Jan Christian Meyer, and Lasse Natvig
(NTNU, Norway)
Array slicing is an operation which selects a subset of elements from a source array and copies them into a destination array. In this article we present an algorithm for generating code for a subset of Fortran slicing expressions, targeting the first generation Intel Xeon Phi coprocessor. The resulting code outperforms the code produced by Intel’s Fortran compiler by 2.40 × on average for a set of slicing expressions, and by 2.23 × and 1.13 × on average for two slicing expressions relevant for border exchange code.

Modular Array-Based GPU Computing in a Dynamically-Typed Language
Matthias Springer, Peter Wauligmann, and Hidehiko Masuhara

(Tokyo Institute of Technology, Japan)
Nowadays, GPU accelerators are widely used in areas with large data-parallel computations such as scientific computations or neural networks. Programmers can either write code in low-level CUDA/OpenCL code or use a GPU extension for a high-level programming language for better productivity. Most extensions focus on statically-typed languages, but many programmers prefer dynamically-typed languages due to their simplicity and flexibility.
This paper shows how programmers can write high-level modular code in Ikra, a Ruby extension for array-based GPU computing. Programmers can compose GPU programs of multiple reusable parallel sections, which are subsequently fused into a small number of GPU kernels. We propose a seamless syntax for separating code regions that extensively use dynamic language features from those that are compiled for efficient execution. Moreover, we propose symbolic execution and a program analysis for kernel fusion to achieve performance that is close to hand-written CUDA code.

Info

HPTT: A High-Performance Tensor Transposition C++ Library
Paul Springer, Tong Su, and Paolo Bientinesi
(RWTH Aachen University, Germany)
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., ‍a 4× 4 transpose). HPTT also offers an optional autotuning framework—guided by performance heuristics—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., ‍Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1×.

ARRAY 2017 – Proceedings

4th International Workshop on Libraries, Languages and Compilers for Programming (ARRAY 2017)

Frontmatter

Papers