ARRAY 2016 – Proceedings

Frontmatter
Welcome to the third ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, held in association with PLDI 2016 in Santa Barbara, California. This volume contains the accepted papers. All papers were reviewed by at least three members of the Program Committee. Array-oriented programming is a powerful abstraction for compactly implementing numerically intensive algorithms. Many modern languages now provide some support for collective array operations, which are used by an increasing number of programmers (and non-programmers) for data analysis and scientific computing. The ARRAY 2016 workshop brings together researchers from many different communities, including language designers, library developers and compiler researchers. There are 9 papers accepted, all of which are presented at the workshop. In addition, the workshop hosts two invited talks by Bradford Chamberlain, Principal Engineer at Cray Inc, Seattle, USA, chief designer of the Chapel high productivity language, and by Morten Kromberg, User Experience Director (CXO) at Dyalog Ltd, Bramley, UK.

Data-Race Detection: The Missing Piece for an End-to-End Semantic Equivalence Checker for Parallelizing Transformations of Array-Intensive Programs
Kunal Banerjee, Soumyadip Banerjee, and Santonu Sarkar
(IIT Kharagpur, India; BITS Pilani Goa, India)
The parallelizing transformation (hand-crafted or compiler-assisted) is error prone as it is often performed without verifying any semantic equivalence with the sequential counterpart. Even when the parallel program can be proved to be semantically equivalent with its corresponding sequential program, detecting data-race conditions in the parallel program remains a challenge. In this paper, we propose a formal verification approach that can detect data-race conditions while verifying the computational semantic equivalence of parallelizing loop transformations. We propose a coloured array data dependence graph (C-ADDG) based modeling of a program for verification of program equivalence as well as data-race condition detection across parallel loops. We have tested our tool on a set of Rodinia and PLuTo+ benchmarks and shown that our method is sound, whereby the method does not produce any false-positive program equivalence or data-race condition.

Array Program Transformation with Loo.py by Example: High-Order Finite Elements
Andreas Klöckner, Lucas C. Wilcox, and T. Warburton
(University of Illinois at Urbana-Champaign, USA; Naval Postgraduate School, USA; Virginia Tech, USA)
To concisely and effectively demonstrate the capabilities of our program transformation system Loo.py, we examine a transformation path from two real-world Fortran subroutines as found in a weather model to a single high-performance computational kernel suitable for execution on modern GPU hardware. Along the transformation path, we encounter kernel fusion, vectorization, prefetching, parallelization, and algorithmic changes achieved by mechanized conversion between imperative and functional/substitution-based code, among a number more. We conclude with performance results that demonstrate the effects and support the effectiveness of the applied transformations.

Design and GPGPU Performance of Futhark's Redomap Construct
Troels Henriksen

, Ken Friis Larsen, and Cosmin E. Oancea

(University of Copenhagen, Denmark)
This paper presents and evaluates a novel second-order operator, named 'redomap', that stems from 'map'-'reduce' compositions in the context of the purely-functional array language Futhark, which is aimed at efficient GPGPU execution. Main contributions are: First, we demonstrate an aggressive fusion technique that is centered on the 'redomap' operator. Second, we present a compilation technique for 'redomap' that efficiently sequentializes the excess parallelism and ensures coalesced access to global memory, even for non-commutative 'reduce' operators. Third, a detailed performance evaluation shows that Futhark's automatically generated code matches or exceeds performance of hand-tuned Thrust code.
Our evaluation infrastructure is publicly available and we encourage replication and verification of our results.

Object Support in an Array-Based GPGPU Extension for Ruby
Matthias Springer and Hidehiko Masuhara

(Tokyo Institute of Technology, Japan)
This paper presents implementation and optimization techniques to support objects in Ikra, an array-based parallel extension to Ruby with dynamic compilation. The high-level goal of Ikra is to allow developers to exploit GPU-based high-performance computing without paying much attention to intricate details of the underlying GPU infrastructure and CUDA. Ikra supports dynamically-typed object-oriented programming in Ruby and performs a number of optimizations. It supports parallel operations (e.g., map, each) on arrays of polymorphic objects, allowing polymorphic method calls inside a kernel by compiling them to conditional branches. To reduce branch divergence, Ikra shuffles thread assignments to base array elements based on runtime types of elements. To facilitate memory coalescing, Ikra stores objects in a structure-of-arrays (SoA) representation (columnar object layout). To eliminate intermediate data in global memory, Ikra merges cascaded parallel sections into one kernel using symbolic execution.

The Key to a Data Parallel Compiler
Aaron W. Hsu
(Indiana University, USA)
We present a language-driven strategy for the construction of compilers that are inherently data-parallel in their design and implementation. Using an encoding of the inter-node relationships between nodes in an AST called a Node Coordinate Matrix, we demonstrate how an operator called the Key operator, that applies a function over groupings of array cells grouped and ordered by their keys, when used in conjunction with a Node Coordinate Matrix, permits arbitrary computation over sub-trees of an AST using purely data-parallel array programming techniques. We discuss the use of these techniques when applied to an experimental commercial compiler called Co-dfns, which is a fully data-parallel compiler developed using the techniques discussed here.

TTC: A Tensor Transposition Compiler for Multiple Architectures
Paul Springer, Aravind Sankaran, and Paolo Bientinesi
(RWTH Aachen University, Germany)
We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8x and 32x, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

Automatic Generation of Parallel C Code for Stencil Applications Written in MATLAB
Johannes Spazier, Steffen Christgau, and Bettina Schnor
(University of Potsdam, Germany)
High-level programming languages such as MATLAB are widely used in scientific domains to implement prototypes based on mathematical models. These prototypes are finally often re-implemented in low-level languages to reach execution times required for the operational use. In order to exploit latest hardware architectures additional effort is necessary to add parallelism to the applications. This paper presents performance results of an automatic translation from a MATLAB subset into efficient parallelized C code for different architectures: multicores, compute clusters, and GPGPUs. We present the first compiler that generates native MPI code from MATLAB source and thereby showing significant performance improvements. The evaluation is done for two stencil applications which use different communication patterns, a Game-of-Life application and a Tsunami simulation. For the Game-of-Life application, the generated parallel code shows nearly optimal speedup. The generated parallel code of the Tsunami simulation reaches the performance of the available parallel reference implementations.

SSA-Based MATLAB-to-C Compilation and Optimization
Luís Reis, João Bispo

, and João M. P. Cardoso
(University of Porto, Portugal; INESC TEC, Portugal)
Many fields of engineering, science and finance use models that are developed and validated in high-level languages such as MATLAB. However, when moving to environments with resource constraints or portability challenges, these models often have to be rewritten in lower-level languages such as C. Doing so manually is costly and error-prone, but automated approaches tend to generate code that can be substantially less efficient than the handwritten equivalents. Additionally, it is usually difficult to read and improve code generated by these tools. In this paper, we describe how we improved our MATLAB-to-C compiler, based on the MATISSE framework, to be able to compete with handwritten C code. We describe our new IR and the most important optimizations that we use in order to obtain acceptable performance. We also analyze multiple C code versions to identify where the generated code is slower than the handwritten code and identify a few key improvements to generate code capable of outperforming handwritten C. We evaluate the new version of our compiler using a set of benchmarks, including the Disparity benchmark, from the San Diego Vision Benchmark Suite, on a desktop computer and on an embedded device. The achieved results clearly show the efficiency of the current version of the compiler.

Info

Extending C++ with Co-Array Semantics
Antoine Tran Tan and Hartmut Kaiser
(Louisiana State University, USA)
The current trend of large scientific computing problems is to align as much as possible to a Single Programming Multiple Data (or SPMD) scheme when the application algorithms are conducive to parallelization and vectorization. This reduces the complexity of code because the processors or (computational nodes) perform the same instructions which allows for better performance as algorithms work on local data sets instead of continuously transferring data from one locality to another. However, certain applications, such as stencil problems, demonstrate the need to move data to or from remote localities. This involves an additional degree of complexity, as one must know with which localities to exchange data. In order to solve this issue, Fortran has extended its scalar element indexing approach to distributed structures of elements. In this extension, a structure of scalar elements is attributed a ”co-index” and lives in a specific locality. A co-index provides the application with enough information to retrieve the corresponding data reference. In C++, containers present themselves as a ”smarter” alternative of Fortran arrays but there are still no corresponding standardized features similar to the Fortran co-indexing approach. In this paper, we present an implementation of such features in HPX, a general purpose C++ runtime system for applications of any scale. We describe how the combination of the HPX features and the actual C++ Standard makes it easy to define a high performance API similar to Co-Array Fortran.

ARRAY 2016 – Proceedings

3rd International Workshop on Libraries, Languages and Compilers for Programming (ARRAY 2016)