FHPC 2016 – Proceedings

Message from the Chairs
Welcome to the 5th ACM Workshop on Functional High Performance Computing (FHPC’16) in Nara, Japan, co-located with ICFP 2016. The FHPC workshop series brings together researchers who seek to exploit the high-level abstractions, rigour, elegance, and expressive power of functional programming to leverage the performance of diverse hardware architectures, from multi-core CPUs through GPUs to FPGAs and ASICs, to find new solutions to computationally challenging applications. The scope of the workshop is deliberately broad: papers this year address code generation, domain-specific languages, design of parallel programs in a functional setting, GPU programming, and applications. We welcome papers on new theory or on experience gained from applications; on pure functional programming, or functional ideas embedded in broad-spectrum languages; on challenges in the small such as micro-kernels, or in the large such as computational science over massive data volumes.

Keynote

From Identification of Parallelizability to Derivation of Parallelizable Codes
Akimasa Morihata

(University of Tokyo, Japan)
Although now parallel computing is very common, current parallel programming methods tend to be domain-specific (specializing in certain program patterns such as nested loops) and/or manual (programmers need to specify independent tasks). This situation poses a serious difficulty in developing efficient parallel programs. We often need to manually transform codes written in usual programming patterns to ones in a parallelizable form. We hope to have a solid foundation to streamline this transformation. This talk first reviews necessity of a method of systematically deriving parallelizable codes and then introduces an ongoing work on extending lambda calculus for the purpose. The distinguished feature of the new calculus is a special construct that enable evaluation with incomplete information, which is useful to express important parallel computation patterns such as reductions (aggregations). We then investigate derivations of parallelizable codes as transformations on the calculus.

Domain-Specific Languages

Icicle: Write Once, Run Once
Amos Robinson and Ben Lippmeier
(UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia)
We present Icicle, a pure streaming query language which statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed in an incremental fashion, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedup over existing queries written in R, and on par with the widely used Unix tools grep and wc.

Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and Koen Claessen

(Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations originally designed for programming software defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map computations onto the details of a specific parallel architecture, where the fusion helps us to generate efficient code. This flexible design method enables late design decisions cheaply, which in turn can lead to more efficient produced code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany many-core coprocessor and on other back-ends.

Code Generation

Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance.
We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program.
We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.

JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder

(University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination.
The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism.
We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture.
We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that (i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and (ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.

GPUs

Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman

, Bo Joel Svensson, and Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged meta-programming approach for kernel generation using Haskell as meta-language, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL.
We present a type-system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware.

APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen

, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman

, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.

Streaming and Dataflow

Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and Andrzej Filinski
(University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of semi-regular computation tasks to be mapped onto SIMD-style hardware, including GPUs and vector units. However, some care is needed to keep down space consumption in situations where the available parallelism may vastly exceed the available computation resources. To allow for an accurate space-cost model in such cases, we have previously proposed the Streaming NESL language, a refinement of NESL with a high-level notion of streamable sequences.
In this paper, we report on experience with a prototype implementation of Streaming NESL on a 2-level parallel platform, namely a multicore system in which we also aggressively utilize vector instructions on each core. We show that for several examples of simple, but not trivially parallelizable, text-processing tasks, we obtain single-core performance on par with off-the-shelf GNU Coreutils code, and near-linear speedups for multiple cores.

Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson
(Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.

Graph Processing

s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato
(Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm.
In this paper we introduce ``s6raph'', a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.