Workshop ARRAY 2023 – Author Index 
Contents 
Abstracts 
Authors

Colaço, JeanLouis 
ARRAY '23: "Polymorphic Types with Polynomial ..."
Polymorphic Types with Polynomial Sizes
JeanLouis Colaço , Baptiste Pauget , and Marc Pouzet (ANSYS, France; Inria, France; ENSPSL University, France) This article presents a compiletime analysis for tracking the size of datastructures in a statically typed and strict functional language. This information is valuable for static checking and code generation. Rather than relying on dependent types, we propose a typesystem close to that of ML: polymorphism is used to define functions that are generic in types and sizes; both can be inferred. This approach is convenient, in particular for a language used to program critical embedded systems, where sizes are indeed known at compiletime. By using sizes that are multivariate polynomials, we obtain a good compromise between the expressiveness of the size language and its properties (verification, inference). The article defines a minimal functional language that is sufficient to capture size constraints in types. It presents its dynamic semantics, the type system and inference algorithm. Last, we sketch some practical extensions that matter for a more realistic language. @InProceedings{ARRAY23p36, author = {JeanLouis Colaço and Baptiste Pauget and Marc Pouzet}, title = {Polymorphic Types with Polynomial Sizes}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {3649}, doi = {10.1145/3589246.3595372}, year = {2023}, } Publisher's Version Archive submitted (560 kB) 

Hsu, Aaron W. 
ARRAY '23: "UNet CNN in APL: Exploring ..."
UNet CNN in APL: Exploring ZeroFramework, ZeroLibrary Machine Learning
Aaron W. Hsu and Rodrigo Girão Serrão (Dyalog, USA; Dyalog, UK) The APL notation would appear to be a clear match for convolutional neural networks, but traditional implementations of APL have lagged behind the performance of highly tuned, specialized frameworks designed to execute CNNs on the GPU. Moreover, most demonstrations of APL for neural networking have involved relatively small examples. We explore a more complex example in the Unet architecture and utilize a modern APL compiler with GPU support, Codfns, to compare the state of the art of APL against the current crop of specialized neural network frameworks in the form of PyTorch. We compare performance as well as the language design of APL for neural network programming and the clarity and transparency of the resulting code. We found that the complete “from scratch” APL source was on par with the complexity of the PyTorch reference implementation, albeit more foreign, while being more concise and complete. We also found that when compiled with Codfns, despite the naïve implementation both of Codfns and our own code, performance on the GPU and the CPU were within a factor of 2.2  2.4 times that of the PyTorch implementation. We believe this suggests significant avenues of future exploration for machine learning language design, pedagogy, and implementation, both inside and outside of the APL community. @InProceedings{ARRAY23p22, author = {Aaron W. Hsu and Rodrigo Girão Serrão}, title = {UNet CNN in APL: Exploring ZeroFramework, ZeroLibrary Machine Learning}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {2235}, doi = {10.1145/3589246.3595371}, year = {2023}, } Publisher's Version 

Jelovina, Denis 
ARRAY '23: "Towards Structured Algebraic ..."
Towards Structured Algebraic Programming
Daniele G. Spampinato , Denis Jelovina , Jiawei Zhuang , and AlbertJan N. Yzelman (Huawei Zurich Research Center, Switzerland; Huawei Technologies, China) Structured matrices and tensors exhibiting properties such as symmetry and fixed nonzero patterns are known for making algorithms and data storage more efficient. Due to emerging power and efficiency constraints required by the scale of modern scientific, machine learning, and edge computing applications, algorithm experts are revisiting traditional linear algebra algorithms with the goal of making new structures appear. Such structures often result from new numerical approximations that would greatly benefit from a more flexible linear algebra interface than standard BLAS and LAPACK, allowing for mixed precision and data types to appear in place of traditional floatingpoint operations. Algebraic programming interfaces, like GraphBLAS, while naturally abstracting the algebras of their operations, do not explicitly capture structured, densely stored arrays. In this paper, we present a preliminary design of a new algebraic programming interface for structured containers with templategeneric, nonzero patterns. This interface offers to backend implementations the possibility of integrating more compiletime pattern information in the loopbased implementation of primitive operations as well as in the formulation of array accesses. We demonstrate its ability to specify important dense matrix decomposition algorithms and argue its ability to expose highperformance backends. @InProceedings{ARRAY23p50, author = {Daniele G. Spampinato and Denis Jelovina and Jiawei Zhuang and AlbertJan N. Yzelman}, title = {Towards Structured Algebraic Programming}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {5061}, doi = {10.1145/3589246.3595373}, year = {2023}, } Publisher's Version 

Källberg, Linus 
ARRAY '23: "HEROML: A Very HighLevel ..."
HEROML: A Very HighLevel Array Language for Executable Modelling of Data Parallel Algorithms
Björn Lisper and Linus Källberg (Mälardalen University, Sweden) HEROML is an array language, on very high level, which is intended for specifying data parallel algorithms in a concise and platformindependent way where all the inherent data parallelism is easy to identify. The goal is to support the software development for heterogeneous systems with different kinds of parallel numerical accelerators, where programs tend to be very platformspecific and difficult to develop. In this paper we describe HEROML, and a proofofconcept implementation. @InProceedings{ARRAY23p13, author = {Björn Lisper and Linus Källberg}, title = {HEROML: A Very HighLevel Array Language for Executable Modelling of Data Parallel Algorithms}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {1321}, doi = {10.1145/3589246.3595370}, year = {2023}, } Publisher's Version 

Lisper, Björn 
ARRAY '23: "HEROML: A Very HighLevel ..."
HEROML: A Very HighLevel Array Language for Executable Modelling of Data Parallel Algorithms
Björn Lisper and Linus Källberg (Mälardalen University, Sweden) HEROML is an array language, on very high level, which is intended for specifying data parallel algorithms in a concise and platformindependent way where all the inherent data parallelism is easy to identify. The goal is to support the software development for heterogeneous systems with different kinds of parallel numerical accelerators, where programs tend to be very platformspecific and difficult to develop. In this paper we describe HEROML, and a proofofconcept implementation. @InProceedings{ARRAY23p13, author = {Björn Lisper and Linus Källberg}, title = {HEROML: A Very HighLevel Array Language for Executable Modelling of Data Parallel Algorithms}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {1321}, doi = {10.1145/3589246.3595370}, year = {2023}, } Publisher's Version 

Pauget, Baptiste 
ARRAY '23: "Polymorphic Types with Polynomial ..."
Polymorphic Types with Polynomial Sizes
JeanLouis Colaço , Baptiste Pauget , and Marc Pouzet (ANSYS, France; Inria, France; ENSPSL University, France) This article presents a compiletime analysis for tracking the size of datastructures in a statically typed and strict functional language. This information is valuable for static checking and code generation. Rather than relying on dependent types, we propose a typesystem close to that of ML: polymorphism is used to define functions that are generic in types and sizes; both can be inferred. This approach is convenient, in particular for a language used to program critical embedded systems, where sizes are indeed known at compiletime. By using sizes that are multivariate polynomials, we obtain a good compromise between the expressiveness of the size language and its properties (verification, inference). The article defines a minimal functional language that is sufficient to capture size constraints in types. It presents its dynamic semantics, the type system and inference algorithm. Last, we sketch some practical extensions that matter for a more realistic language. @InProceedings{ARRAY23p36, author = {JeanLouis Colaço and Baptiste Pauget and Marc Pouzet}, title = {Polymorphic Types with Polynomial Sizes}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {3649}, doi = {10.1145/3589246.3595372}, year = {2023}, } Publisher's Version Archive submitted (560 kB) 

Pouzet, Marc 
ARRAY '23: "Polymorphic Types with Polynomial ..."
Polymorphic Types with Polynomial Sizes
JeanLouis Colaço , Baptiste Pauget , and Marc Pouzet (ANSYS, France; Inria, France; ENSPSL University, France) This article presents a compiletime analysis for tracking the size of datastructures in a statically typed and strict functional language. This information is valuable for static checking and code generation. Rather than relying on dependent types, we propose a typesystem close to that of ML: polymorphism is used to define functions that are generic in types and sizes; both can be inferred. This approach is convenient, in particular for a language used to program critical embedded systems, where sizes are indeed known at compiletime. By using sizes that are multivariate polynomials, we obtain a good compromise between the expressiveness of the size language and its properties (verification, inference). The article defines a minimal functional language that is sufficient to capture size constraints in types. It presents its dynamic semantics, the type system and inference algorithm. Last, we sketch some practical extensions that matter for a more realistic language. @InProceedings{ARRAY23p36, author = {JeanLouis Colaço and Baptiste Pauget and Marc Pouzet}, title = {Polymorphic Types with Polynomial Sizes}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {3649}, doi = {10.1145/3589246.3595372}, year = {2023}, } Publisher's Version Archive submitted (560 kB) 

Sengul, Andrew 
ARRAY '23: "Faster APL with Lazy Extensions ..."
Faster APL with Lazy Extensions
Andrew Sengul (Independent Researcher, USA) April is a compiler from a subset of the APL language to Common Lisp. To realize a more performant and elegant APL implementation, April now defers the evaluation of certain types of input. This means that the compiler produces code building a tree of "virtual arrays" – objects that represent arrays not yet computed. The object tree's methods are called to write an output array to memory, which may involve performing a second stage of compilation to build an efficient arraygenerating kernel with the option to emit assembly code for high speed. This evaluation model enables high performance and its component functions can be elegantly specified thanks to the objectoriented programming faculties of Common Lisp. @InProceedings{ARRAY23p62, author = {Andrew Sengul}, title = {Faster APL with Lazy Extensions}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {6274}, doi = {10.1145/3589246.3595374}, year = {2023}, } Publisher's Version 

Serrão, Rodrigo Girão 
ARRAY '23: "UNet CNN in APL: Exploring ..."
UNet CNN in APL: Exploring ZeroFramework, ZeroLibrary Machine Learning
Aaron W. Hsu and Rodrigo Girão Serrão (Dyalog, USA; Dyalog, UK) The APL notation would appear to be a clear match for convolutional neural networks, but traditional implementations of APL have lagged behind the performance of highly tuned, specialized frameworks designed to execute CNNs on the GPU. Moreover, most demonstrations of APL for neural networking have involved relatively small examples. We explore a more complex example in the Unet architecture and utilize a modern APL compiler with GPU support, Codfns, to compare the state of the art of APL against the current crop of specialized neural network frameworks in the form of PyTorch. We compare performance as well as the language design of APL for neural network programming and the clarity and transparency of the resulting code. We found that the complete “from scratch” APL source was on par with the complexity of the PyTorch reference implementation, albeit more foreign, while being more concise and complete. We also found that when compiled with Codfns, despite the naïve implementation both of Codfns and our own code, performance on the GPU and the CPU were within a factor of 2.2  2.4 times that of the PyTorch implementation. We believe this suggests significant avenues of future exploration for machine learning language design, pedagogy, and implementation, both inside and outside of the APL community. @InProceedings{ARRAY23p22, author = {Aaron W. Hsu and Rodrigo Girão Serrão}, title = {UNet CNN in APL: Exploring ZeroFramework, ZeroLibrary Machine Learning}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {2235}, doi = {10.1145/3589246.3595371}, year = {2023}, } Publisher's Version 

Spampinato, Daniele G. 
ARRAY '23: "Towards Structured Algebraic ..."
Towards Structured Algebraic Programming
Daniele G. Spampinato , Denis Jelovina , Jiawei Zhuang , and AlbertJan N. Yzelman (Huawei Zurich Research Center, Switzerland; Huawei Technologies, China) Structured matrices and tensors exhibiting properties such as symmetry and fixed nonzero patterns are known for making algorithms and data storage more efficient. Due to emerging power and efficiency constraints required by the scale of modern scientific, machine learning, and edge computing applications, algorithm experts are revisiting traditional linear algebra algorithms with the goal of making new structures appear. Such structures often result from new numerical approximations that would greatly benefit from a more flexible linear algebra interface than standard BLAS and LAPACK, allowing for mixed precision and data types to appear in place of traditional floatingpoint operations. Algebraic programming interfaces, like GraphBLAS, while naturally abstracting the algebras of their operations, do not explicitly capture structured, densely stored arrays. In this paper, we present a preliminary design of a new algebraic programming interface for structured containers with templategeneric, nonzero patterns. This interface offers to backend implementations the possibility of integrating more compiletime pattern information in the loopbased implementation of primitive operations as well as in the formulation of array accesses. We demonstrate its ability to specify important dense matrix decomposition algorithms and argue its ability to expose highperformance backends. @InProceedings{ARRAY23p50, author = {Daniele G. Spampinato and Denis Jelovina and Jiawei Zhuang and AlbertJan N. Yzelman}, title = {Towards Structured Algebraic Programming}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {5061}, doi = {10.1145/3589246.3595373}, year = {2023}, } Publisher's Version 

ValeroLara, Pedro 
ARRAY '23: "A MultiGPU PerformancePortable ..."
A MultiGPU PerformancePortable Solution for Array Programming Based on Kokkos
Pedro ValeroLara and Jeffrey S. Vetter (Oak Ridge National Laboratory, USA) Today, multiGPU nodes are widely used in highperformance computing and data centers. However, current programming models do not provide simple, transparent, and portable support for automatically targeting multiple GPUs within a node on application areas of array programming. In this paper, we describe a new application programming interface based on the Kokkos programming model to enable array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils (array boundaries) among different GPU memory spaces, and we provide autotuning to select the proper number of GPUs, depending on the computational cost of the operations to be computed on arrays, that is completely transparent to the programmer. We evaluate our multiGPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and Crusher that contains identical hardware/software as Frontier (#1 TOP500), with four AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs)for a total of 8 GCDs per node. We also compare the performance of this solution against the use of MPI + Kokkos, which is the current de facto solution for multiple GPUs in Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs and a faster and simpler solution (from a programming productivity perspective) than MPI + Kokkos. @InProceedings{ARRAY23p1, author = {Pedro ValeroLara and Jeffrey S. Vetter}, title = {A MultiGPU PerformancePortable Solution for Array Programming Based on Kokkos}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {112}, doi = {10.1145/3589246.3595369}, year = {2023}, } Publisher's Version 

Vetter, Jeffrey S. 
ARRAY '23: "A MultiGPU PerformancePortable ..."
A MultiGPU PerformancePortable Solution for Array Programming Based on Kokkos
Pedro ValeroLara and Jeffrey S. Vetter (Oak Ridge National Laboratory, USA) Today, multiGPU nodes are widely used in highperformance computing and data centers. However, current programming models do not provide simple, transparent, and portable support for automatically targeting multiple GPUs within a node on application areas of array programming. In this paper, we describe a new application programming interface based on the Kokkos programming model to enable array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils (array boundaries) among different GPU memory spaces, and we provide autotuning to select the proper number of GPUs, depending on the computational cost of the operations to be computed on arrays, that is completely transparent to the programmer. We evaluate our multiGPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and Crusher that contains identical hardware/software as Frontier (#1 TOP500), with four AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs)for a total of 8 GCDs per node. We also compare the performance of this solution against the use of MPI + Kokkos, which is the current de facto solution for multiple GPUs in Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs and a faster and simpler solution (from a programming productivity perspective) than MPI + Kokkos. @InProceedings{ARRAY23p1, author = {Pedro ValeroLara and Jeffrey S. Vetter}, title = {A MultiGPU PerformancePortable Solution for Array Programming Based on Kokkos}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {112}, doi = {10.1145/3589246.3595369}, year = {2023}, } Publisher's Version 

Yzelman, AlbertJan N. 
ARRAY '23: "Towards Structured Algebraic ..."
Towards Structured Algebraic Programming
Daniele G. Spampinato , Denis Jelovina , Jiawei Zhuang , and AlbertJan N. Yzelman (Huawei Zurich Research Center, Switzerland; Huawei Technologies, China) Structured matrices and tensors exhibiting properties such as symmetry and fixed nonzero patterns are known for making algorithms and data storage more efficient. Due to emerging power and efficiency constraints required by the scale of modern scientific, machine learning, and edge computing applications, algorithm experts are revisiting traditional linear algebra algorithms with the goal of making new structures appear. Such structures often result from new numerical approximations that would greatly benefit from a more flexible linear algebra interface than standard BLAS and LAPACK, allowing for mixed precision and data types to appear in place of traditional floatingpoint operations. Algebraic programming interfaces, like GraphBLAS, while naturally abstracting the algebras of their operations, do not explicitly capture structured, densely stored arrays. In this paper, we present a preliminary design of a new algebraic programming interface for structured containers with templategeneric, nonzero patterns. This interface offers to backend implementations the possibility of integrating more compiletime pattern information in the loopbased implementation of primitive operations as well as in the formulation of array accesses. We demonstrate its ability to specify important dense matrix decomposition algorithms and argue its ability to expose highperformance backends. @InProceedings{ARRAY23p50, author = {Daniele G. Spampinato and Denis Jelovina and Jiawei Zhuang and AlbertJan N. Yzelman}, title = {Towards Structured Algebraic Programming}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {5061}, doi = {10.1145/3589246.3595373}, year = {2023}, } Publisher's Version 

Zhuang, Jiawei 
ARRAY '23: "Towards Structured Algebraic ..."
Towards Structured Algebraic Programming
Daniele G. Spampinato , Denis Jelovina , Jiawei Zhuang , and AlbertJan N. Yzelman (Huawei Zurich Research Center, Switzerland; Huawei Technologies, China) Structured matrices and tensors exhibiting properties such as symmetry and fixed nonzero patterns are known for making algorithms and data storage more efficient. Due to emerging power and efficiency constraints required by the scale of modern scientific, machine learning, and edge computing applications, algorithm experts are revisiting traditional linear algebra algorithms with the goal of making new structures appear. Such structures often result from new numerical approximations that would greatly benefit from a more flexible linear algebra interface than standard BLAS and LAPACK, allowing for mixed precision and data types to appear in place of traditional floatingpoint operations. Algebraic programming interfaces, like GraphBLAS, while naturally abstracting the algebras of their operations, do not explicitly capture structured, densely stored arrays. In this paper, we present a preliminary design of a new algebraic programming interface for structured containers with templategeneric, nonzero patterns. This interface offers to backend implementations the possibility of integrating more compiletime pattern information in the loopbased implementation of primitive operations as well as in the formulation of array accesses. We demonstrate its ability to specify important dense matrix decomposition algorithms and argue its ability to expose highperformance backends. @InProceedings{ARRAY23p50, author = {Daniele G. Spampinato and Denis Jelovina and Jiawei Zhuang and AlbertJan N. Yzelman}, title = {Towards Structured Algebraic Programming}, booktitle = {Proc.\ ARRAY}, publisher = {ACM}, pages = {5061}, doi = {10.1145/3589246.3595373}, year = {2023}, } Publisher's Version 
14 authors
proc time: 2.91