ICFP Workshops 2017
22nd ACM SIGPLAN International Conference on Functional Programming (ICFP 2017)
Powered by
Conference Publishing Consulting

6th ACM SIGPLAN International Workshop on Functional High-Performance Computing (FHPC 2017), September 7, 2017, Oxford, UK

FHPC 2017 – Proceedings

Contents - Abstracts - Authors

6th ACM SIGPLAN International Workshop on Functional High-Performance Computing (FHPC 2017)

Frontmatter

Title Page


Message from the Chairs
Welcome to the 6th ACM Workshop on Functional High Performance Computing (FHPC'17), to be held on 7th of September 2017 in Oxford, UK, co-located with ICFP 2017. The FHPC workshop series brings together researchers who seek to explore uses of functional (or more generally, declarative or high-level) programming technology in application domains where high performance is essential. The aim of the meeting is to enable sharing of results, experiences and ideas about how the high-level abstractions, rigour and expressive power of functional/declarative languages can harness the power of various modern hardware architectures, ranging from multi-core CPUs to vector processors to GPUs to FPGAs and ASICs.

Compilation

From High-Level Radio Protocol Specifications to Efficient Low-Level Implementations via Partial Evaluation
Geoffrey Mainland and Siddhanathan Shanmugam
(Drexel University, USA)
Software-defined radio (SDR) is a challenging domain for language designers. To be useful in the real world, radio protocol implementations must operate at high data rates with low latency, yet to be useful to implementers, a language should allow programmers to express algorithms at a high level of abstraction without having to worry about the very low-level details that are necessary for meeting performance requirements. Ziria demonstrated that a high-level language for writing wireless physical layer (PHY) protocols could be competitive with hand-written C++, but only in a context where performance-critical computations, such as FFT and Viterbi, were still written in C++ and accessed via a foreign function interface.
We demonstrate that a new implementation of Ziria, embodied in the kzc compiler, allows even performance-critical blocks such as FFT and Viterbi to be written in a high-level language without sacrificing performance. Because kzc performs whole-program optimization, a radio protocol pipeline using an implementation of Viterbi written in Ziria can outperform an implementation that calls out to . The contributions of this paper fall into two categories. First, we describe two new optimizations in kzc, both of which are critical for wringing performance out of high-level code: an aggressive partial evaluator for Ziria programs, and an automatic lookup-table (LUT) generator. Second, we show how these optimizations allow the efficient implementation of three performance-critical blocks in Ziria: Viterbi decoding, the Fast Fourier Transform (FFT), and the inverse Fast Fourier Transform (IFFT).

Destination-Passing Style for Efficient Memory Management
Amir Shaikhha, Andrew Fitzgibbon, Simon Peyton Jones, and Dimitrios Vytiniotis
(EPFL, Switzerland; Microsoft, UK; Microsoft Research, UK)
We show how to compile high-level functional array-processing programs, drawn from image processing and machine learning, into C code that runs as fast as hand-written C. The key idea is to transform the program to destination-passing style, which in turn enables a highly-efficient stack-like memory allocation discipline.

Tool

VisPar: Visualising Dataflow Graphs from the Par Monad
Maximilian Algehed and Patrik Jansson
(Chalmers University of Technology, Sweden)
We present a work in progress tool (VisPar) for visualising computations in the Par monad in Haskell. Our contribution is not a revolutionary new idea but rather a modest addition to the set of tools available for making sense of parallel programs. We hope to show that VisPar can be useful as a teaching tool by providing visualisations of a few examples from a course on parallel functional programming.

Parallel Programming

In Search of a Map: Using Program Slicing to Discover Potential Parallelism in Recursive Functions
Adam D. Barwell and Kevin Hammond
(University of St. Andrews, UK)
Recursion schemes, such as the well-known map, can be used as loci of potential parallelism, where schemes are replaced with an equivalent parallel implementation. This paper formalises a novel technique, using program slicing, that automatically and statically identifies computations in recursive functions that can be lifted out of the function and then potentially performed in parallel. We define a new program slicing algorithm, build a prototype implementation, and demonstrate its use on 12 Haskell examples, including benchmarks from the NoFib suite and functions from the standard Haskell Prelude. In all cases, we obtain the expected results in terms of finding potential parallelism. Moreover, we have tested our prototype against synthetic benchmarks, and found that our prototype has quadratic time complexity. For the NoFib benchmark examples we demonstrate that relative parallel speedups can be obtained (up to 32.93× the sequential performance on 56 hyperthreaded cores).

Strategies for Regular Segmented Reductions on GPU
Rasmus Wriedt Larsen and Troels Henriksen
(University of Copenhagen, Denmark)
We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads and thereby exhibiting bad performance for certain input. We propose three different strategies for segmented reduction of regular arrays, each optimised for a particular workload. We demonstrate an implementation in the Futhark compiler that is able to employ all three strategies and automatically select the appropriate one at runtime. While our evaluation is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions.
We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two application benchmarks from the Rodinia suite. On the latter, we obtain speedups ranging from 1.3× to 1.7× over a previous implementation based on scans.

proc time: 1.13