
9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2023),
June 18, 2023,
Orlando, FL, USA

## 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2023)

#### Frontmatter

Message from the Chairs
Welcome to the 2023 edition of the ACM SIGPLAN Workshop
on Libraries, Languages and Compilers for Array Programming,
co-located with PLDI 2023.

Array programming unites two uncommon properties. As an abstraction,
it directly mirrors mathematical concepts used in many fields, from
the natural sciences and engineering to financial modeling. As a
language feature, it exposes regular control flow, exhibits structured
data dependencies, and lends itself to many types of program
analysis. Furthermore, many modern parallel computer architectures are
well suited to executing array operations efficiently.

The ARRAY series of workshops explores all aspects of array
programming: languages, formal semantics, array theories,
productivity/performance tradeoffs, libraries, notation (including
axis- and index-based approaches), intermediate languages, and
efficient compilation.

#### Papers

A MultiGPU Performance-Portable Solution for Array Programming Based on Kokkos
Pedro Valero-Lara

and Jeffrey S. Vetter

*(Oak Ridge National Laboratory, USA)*
Today, multiGPU nodes are widely used in high-performance computing and data centers. However, current programming models do not provide simple, transparent, and portable support for automatically targeting multiple GPUs within a node in array-programming applications. In this paper, we describe a new application programming interface based on the Kokkos programming model that enables array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils (array boundaries) among different GPU memory spaces, and we provide autotuning, completely transparent to the programmer, that selects the proper number of GPUs depending on the computational cost of the operations to be performed on the arrays. We evaluate our multiGPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and on Crusher, which contains hardware/software identical to Frontier (#1 TOP500), with four AMD MI250X GPUs per node, each with two Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. We also compare the performance of this solution against MPI + Kokkos, the current de facto solution for multiple GPUs in Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs and is a faster and simpler solution (from a programming-productivity perspective) than MPI + Kokkos.
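The stencil (array-boundary) exchange described in the abstract can be sketched generically: each GPU owns a chunk of the array plus ghost ("halo") cells that mirror its neighbours' edges, and the ghosts are refreshed before each stencil step. Below is a minimal plain-Python illustration of that idea, with hypothetical function names; it is not the paper's Kokkos API.

```python
# Sketch of the stencil ("halo") exchange behind a multi-GPU array
# extension: each device owns a chunk of a 1-D array plus one ghost
# cell per side mirroring its neighbour's edge. Plain-Python
# illustration only; not the paper's Kokkos interface.

def partition(data, ndev):
    """Split data into ndev equal chunks, each padded with ghost cells."""
    n = len(data) // ndev  # assumes len(data) is divisible by ndev
    chunks = []
    for d in range(ndev):
        interior = data[d * n:(d + 1) * n]
        chunks.append([0] + interior + [0])  # [ghost, interior..., ghost]
    return chunks

def exchange_halos(chunks):
    """Copy each chunk's edge values into its neighbours' ghost cells."""
    for d in range(len(chunks) - 1):
        chunks[d][-1] = chunks[d + 1][1]   # right ghost <- next chunk's left edge
        chunks[d + 1][0] = chunks[d][-2]   # left ghost  <- this chunk's right edge

def stencil_step(chunks):
    """3-point average over each chunk's interior after a halo exchange."""
    exchange_halos(chunks)
    return [[(c[i - 1] + c[i] + c[i + 1]) / 3 for i in range(1, len(c) - 1)]
            for c in chunks]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result = stencil_step(partition(data, 2))
```

In a real multiGPU backend the two ghost copies become device-to-device transfers between memory spaces, which is exactly the part the paper's variations optimize.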

HERO-ML: A Very High-Level Array Language for Executable Modelling of Data Parallel Algorithms
Björn Lisper

and Linus Källberg

*(Mälardalen University, Sweden)*
HERO-ML is a very high-level array language intended for specifying
data parallel algorithms in a concise and platform-independent way in
which all the inherent data parallelism is easy to identify. The goal is to support software development for heterogeneous
systems with different kinds of parallel numerical accelerators, where
programs tend to be very platform-specific and difficult to develop.
In this paper we describe HERO-ML and a proof-of-concept implementation.

U-Net CNN in APL: Exploring Zero-Framework, Zero-Library Machine Learning
Aaron W. Hsu and Rodrigo Girão Serrão

*(Dyalog, USA; Dyalog, UK)*
The APL notation would appear to be a clear match for convolutional neural networks, but traditional implementations of APL have lagged behind the performance of highly tuned, specialized frameworks designed to execute CNNs on the GPU. Moreover, most demonstrations of APL for neural networks have involved relatively small examples. We explore a more complex example in the U-net architecture and utilize a modern APL compiler with GPU support, Co-dfns, to compare the state of the art of APL against the current crop of specialized neural network frameworks in the form of PyTorch. We compare performance as well as the language design of APL for neural network programming and the clarity and transparency of the resulting code.

We found that the complete “from scratch” APL source was on par with the complexity of the PyTorch reference implementation, albeit more foreign, while being more concise and complete. We also found that, when compiled with Co-dfns, despite the naïve implementation of both Co-dfns and our own code, performance on the GPU and the CPU was within a factor of 2.2-2.4 of the PyTorch implementation. We believe this suggests significant avenues of future exploration for machine learning language design, pedagogy, and implementation, both inside and outside of the APL community.
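A zero-framework CNN of the kind the abstract describes rests on operations such as convolution written out explicitly rather than called from a library. As a hedged illustration of what "from scratch" means here (in plain Python rather than the paper's APL), a naive valid-mode 2-D convolution looks like:

```python
# A "from scratch" 2-D convolution of the kind a zero-library CNN
# needs. Plain-Python illustration only; the paper's code is in APL.

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation (the CNN 'convolution')."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):          # slide the kernel over every
        row = []                          # position where it fits fully
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each 2x2 window's main diagonal
result = conv2d(image, kernel)
```

In APL the same operation collapses to a short expression over windowed reductions, which is the conciseness the paper's comparison measures.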

Polymorphic Types with Polynomial Sizes
Jean-Louis Colaço

, Baptiste Pauget

, and Marc Pouzet

*(ANSYS, France; Inria, France; ENS-PSL University, France)*
This article presents a compile-time analysis for tracking the size of
data-structures in a statically typed and strict functional language. This
information is valuable for static checking and code generation. Rather than
relying on dependent types, we propose a type-system close to that of ML:
polymorphism is used to define functions that are generic in types and
sizes; both can be inferred. This approach is convenient, in particular for
a language used to program critical embedded systems, where sizes are indeed
known at compile-time. By using sizes that are multivariate polynomials, we
obtain a good compromise between the expressiveness of the size language and
its properties (verification, inference).

The article defines a minimal functional language that is sufficient to
capture size constraints in types. It presents its dynamic semantics, the
type system and inference algorithm. Lastly, we sketch some practical
extensions that matter for a more realistic language.
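The flavour of multivariate-polynomial sizes can be conveyed with runtime checks (the paper's point is that these polynomials are inferred and verified at compile time). The function names below are hypothetical illustrations, not from the paper:

```python
# Size-polymorphic functions whose result sizes are polynomials in the
# argument sizes n and m. The paper's type system infers and checks
# these statically; here we only check them at runtime.

def concat(xs, ys):
    """Result size: the polynomial n + m."""
    return xs + ys

def outer(xs, ys):
    """Result size: the polynomial n * m (flattened outer product)."""
    return [x * y for x in xs for y in ys]

n, m = 3, 4
xs, ys = list(range(n)), list(range(m))
assert len(concat(xs, ys)) == n + m   # size polynomial n + m
assert len(outer(xs, ys)) == n * m    # size polynomial n * m
```

Restricting sizes to polynomials (rather than arbitrary dependent types) is what keeps inference and verification tractable in the paper's setting.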

Towards Structured Algebraic Programming
Daniele G. Spampinato

, Denis Jelovina

, Jiawei Zhuang

, and Albert-Jan N. Yzelman

*(Huawei Zurich Research Center, Switzerland; Huawei Technologies, China)*
Structured matrices and tensors exhibiting properties such as symmetry and fixed non-zero patterns are known for making algorithms and data storage more efficient. Due to emerging power and efficiency constraints required by the scale of modern scientific, machine learning, and edge computing applications, algorithm experts are revisiting traditional linear algebra algorithms with the goal of making new structures appear. Such structures often result from new numerical approximations that would greatly benefit from a more flexible linear algebra interface than standard BLAS and LAPACK, allowing for mixed precision and data types to appear in place of traditional floating-point operations. Algebraic programming interfaces, like GraphBLAS, while naturally abstracting the algebras of their operations, do not explicitly capture structured, densely stored arrays. In this paper, we present a preliminary design of a new algebraic programming interface for structured containers with template-generic, non-zero patterns. This interface offers backend implementations the possibility of integrating more compile-time pattern information into the loop-based implementation of primitive operations as well as into the formulation of array accesses. We demonstrate its ability to specify important dense matrix decomposition algorithms and argue for its ability to expose high-performance backends.
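The pairing of a user-chosen algebra with a fixed non-zero structure can be sketched as follows. This plain-Python analogue (a symmetric matrix stored densely as its upper triangle, with a pluggable (add, mul) semiring) is hypothetical and not the paper's interface:

```python
# Algebra + structure: a matrix-vector product generic in the
# (add, mul) semiring, over a symmetric matrix stored densely as its
# upper triangle only. Hypothetical sketch, not the paper's design.

def sym_matvec(upper, x, add=lambda a, b: a + b,
               mul=lambda a, b: a * b, zero=0):
    """upper[i][j - i] holds A[i][j] for j >= i; A is symmetric."""
    n = len(x)
    y = [zero] * n
    for i in range(n):
        for j in range(i, n):
            a = upper[i][j - i]
            y[i] = add(y[i], mul(a, x[j]))
            if i != j:                    # mirror the strictly upper part
                y[j] = add(y[j], mul(a, x[i]))
    return y

# A = [[1, 2], [2, 3]] stored as its upper triangle:
upper = [[1, 2], [3]]
y = sym_matvec(upper, [1, 1])
```

Knowing the symmetry pattern at "compile time" is what lets a backend halve both storage and the iteration space, which is the kind of information the proposed interface makes explicit.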

Faster APL with Lazy Extensions
Andrew Sengul

*(Independent Researcher, USA)*
April is a compiler from a subset of the APL language to Common Lisp. To realize a more performant and elegant APL implementation, April now defers the evaluation of certain types of input. This means that the compiler produces code building a tree of "virtual arrays" – objects that represent arrays not yet computed. The object tree's methods are called to write an output array to memory, which may involve performing a second stage of compilation to build an efficient array-generating kernel with the option to emit assembly code for high speed. This evaluation model enables high performance, and its component functions can be elegantly specified thanks to the object-oriented programming facilities of Common Lisp.
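The virtual-array model can be illustrated with a toy deferred-array class: operations build a tree of unevaluated nodes, and elements are computed only when the output array is finally written. A plain-Python sketch (not April's Common Lisp implementation):

```python
# Toy version of the "virtual array" idea: operations return
# unevaluated nodes; elements are computed only on realize().
# Illustration only; April implements this in Common Lisp.

class Virtual:
    def __init__(self, length, elem):
        self.length = length
        self.elem = elem               # index -> value, evaluated lazily

    @staticmethod
    def iota(n):
        return Virtual(n, lambda i: i)

    def map(self, f):
        e = self.elem
        return Virtual(self.length, lambda i: f(e(i)))

    def reverse(self):
        n, e = self.length, self.elem
        return Virtual(n, lambda i: e(n - 1 - i))

    def realize(self):
        """Walk the node tree and write the output array to memory."""
        return [self.elem(i) for i in range(self.length)]

expr = Virtual.iota(5).map(lambda x: x * x).reverse()  # no work done yet
result = expr.realize()                                # [16, 9, 4, 1, 0]
```

Deferring evaluation this way is what gives a compiler like April the chance to fuse the whole expression tree into a single array-generating kernel instead of materializing each intermediate array.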
