Array-oriented programming offers a unique blend of programmer productivity and high-performance parallel execution. As an abstraction, it directly mirrors high-level mathematical constructions commonly used in many fields, from the natural sciences through engineering to financial modelling. As a language feature, it exposes regular control flow, exhibits structured data dependencies, and lends itself to many types of program analysis. Furthermore, many modern computer architectures, particularly highly parallel architectures such as GPUs and FPGAs, are well suited to executing array operations efficiently.
The ARRAY workshop series is intended to bring together researchers from many different communities, including language designers, library developers, compiler researchers, and practitioners, who are using or working on numeric, array-centric aspects of programming languages, libraries and methodologies from all domains: imperative or declarative; object-oriented or functional; interpreted or compiled; strongly typed, weakly typed, or untyped.
Numerous scientific-computing domains make use of array data. The core
computations of the numerical methods and algorithms involved centre on
multi-dimensional array manipulation, and the memory layout and access
patterns of that data are crucial to the performance of array-based
computations. As we move towards exascale computing, writing portable code
for efficient data-parallel computation increasingly requires an abstract,
productive working environment. To that end, we present the design of a
framework for optimizing scientific array-based computations, using a
Partial Differential Equation solver as a case study.
By embedding the Mathematics of Arrays formalism in the Magnolia programming
language, we assemble a software stack capable of abstracting the continuous
high-level application layer from the discrete formulation of the collective
array-based numerical methods and algorithms and the final detailed low-level
code. The case study lays the groundwork for achieving optimized memory
layout and efficient computations while preserving a stable abstraction layer
independent of underlying algorithms and changes in the architecture.
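To make the target concrete, here is a minimal NumPy sketch of the kind of array-based numerical method such a framework abstracts over: one explicit finite-difference step of the 1D heat equation. This is an illustration only, not the paper's Magnolia/Mathematics-of-Arrays code; the function name and parameters are ours.

```python
import numpy as np

def heat_step(u, alpha, dx, dt):
    """One explicit finite-difference step of the 1D heat equation,
    with fixed (Dirichlet) boundary values."""
    # Second spatial derivative via a three-point stencil on the interior.
    lap = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2
    u_next = u.copy()
    u_next[1:-1] = u[1:-1] + alpha * dt * lap
    return u_next

u = np.zeros(101)
u[50] = 1.0                      # initial heat spike in the middle
for _ in range(100):
    u = heat_step(u, alpha=1.0, dx=0.01, dt=0.000025)
```

The stencil is expressed purely with array slices, so its memory layout and access pattern are exactly the kind of detail a framework like the one described can optimize beneath a stable abstraction layer.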
We present a higher-order programmer-level technique for compiling particular kinds of irregular data-parallel problems to parallel hardware. The technique, which we have named "flattening-by-expansion", builds on a number of segmented data-parallel operations but is itself implemented as a higher-order generic function, which makes it applicable to many irregular problems. Concretely, the implementation is given in Futhark; we demonstrate its usefulness on a number of irregular problems and show that, in practice, they are compiled to efficient parallel code that can be executed on GPUs. The technique is useful in any data-parallel language that provides a key set of primitives.
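The semantics of such an expansion can be sketched in NumPy. The names `expand`, `sz`, and `get` below are illustrative (echoing, but not reproducing, the paper's Futhark API), and the final comprehension is sequential; a real implementation would use segmented parallel operations throughout.

```python
import numpy as np

def expand(sz, get, xs):
    """Expand each element xs[i] into sz(xs[i]) results get(xs[i], j),
    for j = 0 .. sz(xs[i]) - 1, flattened into one array."""
    sizes = np.array([sz(x) for x in xs])
    # Segment descriptor: which source element each output position belongs to.
    seg = np.repeat(np.arange(len(xs)), sizes)
    # Index of each output position within its own segment.
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    j = np.arange(sizes.sum()) - offsets[seg]
    return np.array([get(xs[i], jj) for i, jj in zip(seg, j)])

# Irregular example: element n expands into the n multiples n*0 .. n*(n-1).
out = expand(lambda n: n, lambda n, j: n * j, [1, 2, 3])
# segments: [0 | 0, 2 | 0, 3, 6]
```

The segment descriptor and within-segment indices are exactly the data that segmented scans and replicates produce on parallel hardware, which is why the construction maps well to GPUs.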
We present ALPyNA, an automatic loop parallelization framework
for Python, which analyzes data dependences within nested loops
and dynamically generates CUDA kernels for GPU execution. The
ALPyNA system applies classical dependence analysis techniques
to discover and exploit potential parallelism. The skeletal
structure of the dependence graph is determined statically
(if possible) or at runtime; this is combined with type and bounds
information discovered at runtime to auto-generate high-performance
kernels for offloading to the GPU.
We demonstrate speedups of up to 1000x relative to the native CPython
interpreter across four array-intensive numerical Python benchmarks.
The performance improvement varies with both the iteration-domain size and
the complexity of the dependence graph. Nevertheless, this approach promises
to bring the benefits of manycore parallelism to application developers.
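The distinction such dependence analysis draws can be illustrated with two small nested loops (our own toy examples, not ALPyNA benchmarks): one whose iterations are fully independent, and one with a loop-carried dependence that limits parallelism to the outer loop.

```python
def saxpy2d(a, x, y, out):
    # No loop-carried dependences: each (i, j) iteration writes a distinct
    # element of out, so all iterations may run in parallel, e.g. as a
    # CUDA kernel with one thread per element.
    for i in range(len(x)):
        for j in range(len(x[0])):
            out[i][j] = a * x[i][j] + y[i][j]

def prefix_rows(x):
    # Loop-carried dependence along j: iteration (i, j) reads the value
    # written by (i, j-1), so only the outer i loop is parallelizable.
    for i in range(len(x)):
        for j in range(1, len(x[0])):
            x[i][j] += x[i][j - 1]

out = [[0] * 2 for _ in range(2)]
saxpy2d(2, [[1, 2], [3, 4]], [[10, 20], [30, 40]], out)

rows = [[1, 1, 1]]
prefix_rows(rows)
```

Classical dependence analysis classifies exactly these patterns: the first loop nest has an empty dependence graph, while the second carries a distance-1 dependence on the inner index.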
High-level languages are commonly seen as a good fit to tackle the problem of performance portability across parallel architectures. The Lift framework is a recent approach which combines high-level, array-based programming abstractions, with a system of rewrite-rules that express algorithmic as well as low-level hardware optimizations. Lift has successfully demonstrated its ability to address the challenge of performance portability across multiple types of CPU and GPU devices by automatically generating code that is on par with highly optimized hand-written code.
This paper demonstrates the potential of Lift for targeting FPGA-based platforms. It presents the design of new Lift parallel patterns operating on data streams, and describes the implementation of a Lift VHDL backend. This approach is evaluated on a Xilinx XC7Z010 FPGA using matrix multiplication, leading to a 10x speed-up over highly optimized CPU code and a commercial HLS tool. Furthermore, by considering the potential of design space exploration enabled by Lift, this work is a stepping stone towards automatically generated competitive code for FPGAs.
The widespread use of tensor operations in describing electronic structure calculations has motivated the design of software frameworks for productive development of scalable optimized tensor-based electronic structure methods. Whereas prior work focused on Cartesian abstractions for dense tensors, we present an algebra to specify and perform tensor operations on a larger class of block-sparse tensors. We illustrate the use of this framework in expressing real-world computational chemistry calculations beyond the reach of existing frameworks.
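The core idea of block sparsity can be sketched with a toy Python representation (a dict from block coordinates to dense blocks; this is our illustration, not the paper's algebra or API): a contraction only touches pairs of blocks whose inner block indices match, skipping absent blocks entirely.

```python
import numpy as np

def bs_matmul(A, B):
    """Multiply two block-sparse matrices stored as
    {(block_row, block_col): dense_block}; only block pairs with a
    matching inner block index contribute to the result."""
    C = {}
    for (i, k), a in A.items():
        for (kk, j), b in B.items():
            if k == kk:
                C[(i, j)] = C.get((i, j), np.zeros((a.shape[0], b.shape[1]))) + a @ b
    return C

I = np.eye(2)
A = {(0, 0): 2 * I, (1, 1): 3 * I}   # block-diagonal operand
B = {(0, 1): I}
C = bs_matmul(A, B)                  # only block (0, 1) is produced
```

In electronic structure calculations the blocks typically arise from symmetry, so exploiting this structure avoids both the storage and the arithmetic of the zero blocks.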
Each of the popular tensor frameworks from the machine learning domain comes with its own language for expressing tensor kernels.
Since these tensor languages lack precise specifications, it is impossible to understand and reason about tensor kernels that exhibit unexpected behaviour.
In this paper, we give examples of such kernels.
The tensor languages are superficially similar to the well-known functional array languages, for which formal definitions often exist.
However, the tensor languages are inherently imperative.
In this paper we present TeIL, an imperative tensor intermediate language with precise formal semantics.
For the popular tensor languages, TeIL can serve as a common ground on the basis of which precise reasoning about kernels becomes possible.
Based on TeIL's formal semantics we develop a type-safety result in the Coq proof assistant.
Convolutional Neural Networks in APL
Artjoms Šinkarovs, Robert Bernecky, and Sven-Bodo Scholz (Heriot-Watt University, UK; Snake Island Research, Canada)
This paper shows how a Convolutional Neural Network (CNN) can be implemented in APL. Its first-class array support ideally fits that domain, and the operations of APL facilitate rapid and concise creation of generically reusable building blocks. For our example, only ten blocks are needed, and they can be expressed as ten lines of native APL. All these blocks are purely functional, and they are built out of a small number of builtin operators, resulting in a highly portable specification that is immediately runnable and should be suitable for high-performance optimizations and parallel execution. This implies that APL can be seen as a framework to define shallowly-embedded machine learning DSLs without any external dependencies, making them useful at least for experiments and prototyping. We explain the construction of each CNN building block, and briefly discuss the performance of the resulting specification.
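One such building block, a "valid" 2-D convolution, can be expressed in the same array style in NumPy (our analogue for illustration, not the paper's APL code): the result is a sum of shifted, weighted slices of the image, with no explicit per-pixel indexing.

```python
import numpy as np

def conv2d(img, w):
    """'Valid' 2-D convolution (cross-correlation) of img with kernel w,
    built from whole-array slice operations."""
    kh, kw = w.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):              # sum of shifted, weighted slices
        for j in range(kw):
            out += w[i, j] * img[i:i + oh, j:j + ow]
    return out

img = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2)) / 4.0            # 2x2 averaging kernel
r = conv2d(img, k)                   # 3x3 array of local means
```

The only loops range over the (small) kernel, so the bulk of the work is whole-array arithmetic, which is what makes such specifications amenable to high-performance optimization and parallel execution.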
In a rank-polymorphic programming language, all functions automatically lift to operate on arbitrarily high-dimensional aggregate data. By adding records to such a language, we can support computation on data frames, a tabular data structure containing heterogeneous data but in which individual columns are homogeneous. In such a setting, a data frame is a vector of records, subject to both ordinary array operations (e.g., filtering, reducing, sorting) and lifted record operations—projecting a field lifts to projecting a column. Data frames have become a popular tool for exploratory data analysis, but the fluidity of interacting with data frames via lifted record operations depends on how the language’s records are designed.
We investigate three languages with different notions of record data: Racket, Standard ML, and Python. For each, we examine several common tasks for working with data frames and how the language’s records make these tasks easy or hard. Based on their advantages and disadvantages, we synthesize their ideas to produce a design for record types which is flexible for both scalar and lifted computation.
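The vector-of-records view can be made concrete in Python, one of the three surveyed languages, using plain dicts as records (a deliberately minimal sketch; the data is invented for illustration): field projection lifts to column projection, while ordinary sequence operations act on the rows.

```python
# A data frame viewed as a vector of records.
rows = [
    {"name": "ada", "score": 92},
    {"name": "bob", "score": 67},
    {"name": "cid", "score": 81},
]

# Lifted record operation: projecting the "score" field over the
# whole vector yields the "score" column.
scores = [r["score"] for r in rows]

# Ordinary array operations on the same vector of records.
passed = [r for r in rows if r["score"] >= 70]
ranked = sorted(rows, key=lambda r: -r["score"])
```

Note that in Python the lifting must be spelled out with comprehensions; in a rank-polymorphic language the projection would lift automatically, which is precisely the ergonomic difference the paper examines.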
There is a recent push by a segment of the graph community to
implement graph algorithms in the language of linear algebra.
However, graph algorithms that depend on depth-first search (DFS) techniques
are often highlighted as limitations of the linear algebraic approach,
as linear algebraic formulations of DFS algorithms are
few, if any. This paper provides a linear algebraic approach for
developing DFS graph algorithms and demonstrates its
use for defining three classical DFS-based computations: binary tree
traversal, topological sort, and biconnected components.
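To show what "graph algorithms in the language of linear algebra" looks like, here is a NumPy sketch of topological sort phrased as matrix-vector operations. Note this uses in-degree elimination (Kahn's algorithm) rather than the paper's DFS-based formulation; it only illustrates the style of replacing pointer chasing with products over an adjacency matrix.

```python
import numpy as np

def topo_sort(A):
    """Topological sort of a DAG with adjacency matrix A (A[i, j] = 1
    iff there is an edge i -> j), via vector-matrix products."""
    n = A.shape[0]
    active = np.ones(n, dtype=bool)
    order = []
    while active.any():
        # In-degrees within the remaining subgraph: a vector-matrix
        # product restricted to active rows.
        indeg = active.astype(int) @ A
        sources = active & (indeg == 0)
        if not sources.any():
            raise ValueError("graph has a cycle")
        order.extend(np.flatnonzero(sources).tolist())
        active &= ~sources
    return order

# DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3
A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[0, 2] = A[1, 3] = A[2, 3] = 1
result = topo_sort(A)
```

All per-vertex work happens inside the product and the boolean masks, which is the property that lets such formulations run on linear-algebra hardware and libraries.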