5th International Workshop on Functional High-Performance Computing (FHPC 2016)
September 22, 2016, Nara, Japan
Frontmatter
Message from the Chairs
Welcome to the 5th ACM Workshop on Functional High Performance Computing (FHPC’16) in Nara, Japan, co-located with ICFP 2016. The FHPC workshop series brings together researchers who seek to exploit the high-level abstractions, rigour, elegance, and expressive power of functional programming to harness the performance of diverse hardware architectures, from multi-core CPUs through GPUs to FPGAs and ASICs, and to find new solutions to computationally challenging applications. The scope of the workshop is deliberately broad: papers this year address code generation, domain-specific languages, design of parallel programs in a functional setting, GPU programming, and applications. We welcome papers on new theory or on experience gained from applications; on pure functional programming or functional ideas embedded in broad-spectrum languages; and on challenges in the small, such as micro-kernels, or in the large, such as computational science over massive data volumes.
Keynote
From Identification of Parallelizability to Derivation of Parallelizable Codes
Akimasa Morihata
(University of Tokyo, Japan)
Although parallel computing is now very common, current parallel programming methods tend to be domain-specific (specialised to certain program patterns, such as nested loops) and/or manual (programmers need to specify independent tasks). This situation poses a serious difficulty in developing efficient parallel programs: we often need to manually transform code written in the usual programming patterns into a parallelizable form, and we would like a solid foundation to streamline this transformation. This talk first reviews the need for a method of systematically deriving parallelizable code and then introduces ongoing work on extending the lambda calculus for this purpose. The distinguishing feature of the new calculus is a special construct that enables evaluation with incomplete information, which is useful for expressing important parallel computation patterns such as reductions (aggregations). We then investigate derivations of parallelizable code as transformations in the calculus.
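The observation that makes reductions the canonical parallelizable pattern can be sketched in a few lines of Haskell; the sketch below illustrates only that observation, not the extended lambda calculus presented in the talk.

-- A minimal sketch, not the talk's calculus: a reduction written as a
-- sequential fold can be re-associated into a balanced tree once the
-- operator is known to be associative, and the two halves can then be
-- evaluated independently (e.g. with Control.Parallel's `par`).
module TreeReduce where

reduceSeq :: (a -> a -> a) -> a -> [a] -> a
reduceSeq = foldl

reduceTree :: (a -> a -> a) -> a -> [a] -> a
reduceTree _     z []  = z
reduceTree _     _ [x] = x
reduceTree (<+>) z xs  = reduceTree (<+>) z l <+> reduceTree (<+>) z r
  where (l, r) = splitAt (length xs `div` 2) xs

main :: IO ()
main = let xs = [1 .. 100] :: [Int]
       in  print (reduceSeq (+) 0 xs, reduceTree (+) 0 xs)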
@InProceedings{FHPC16p1,
author = {Akimasa Morihata},
title = {From Identification of Parallelizability to Derivation of Parallelizable Codes},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {1--1},
doi = {},
year = {2016},
}
Domain-Specific Languages
Icicle: Write Once, Run Once
Amos Robinson and
Ben Lippmeier
(UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia)
We present Icicle, a pure streaming query language that statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed incrementally, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedups over existing queries written in R, and performance on par with the widely used Unix tools grep and wc.
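The fold-based style of intermediate language can be illustrated with a small self-contained Haskell sketch; this shows only the general fold-fusion idea, not Icicle's actual intermediate language.

{-# LANGUAGE ExistentialQuantification #-}
-- Illustration of the general idea behind a fold-based intermediate language
-- (not Icicle's actual IL): each query is a left fold with its own
-- accumulator, and the Applicative instance pairs accumulators so that
-- several queries over the same stream run in a single pass.
module FoldFusion where

data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

instance Functor (Fold a) where
  fmap f (Fold step z out) = Fold step z (f . out)

instance Applicative (Fold a) where
  pure x = Fold const () (const x)
  Fold stepL zL outL <*> Fold stepR zR outR =
    Fold (\(l, r) a -> (stepL l a, stepR r a)) (zL, zR)
         (\(l, r) -> outL l (outR r))

runFold :: Fold a b -> [a] -> b
runFold (Fold step z out) = out . foldl step z

sumF, countF :: Fold Double Double
sumF   = Fold (+) 0 id
countF = Fold (\n _ -> n + 1) 0 id

-- Two queries, one traversal of the input.
meanF :: Fold Double Double
meanF = (/) <$> sumF <*> countF

main :: IO ()
main = print (runFold meanF [1, 2, 3, 4])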
@InProceedings{FHPC16p2,
author = {Amos Robinson and Ben Lippmeier},
title = {Icicle: Write Once, Run Once},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {2--8},
doi = {},
year = {2016},
}
Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and
Koen Claessen
(Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations, originally designed for programming software-defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map the computations onto a specific parallel architecture, where fusion helps us generate efficient code. This flexible design method makes late design decisions cheap, which in turn can lead to more efficient generated code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany many-core coprocessor and on other back-ends.
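Process fusion of this kind can be sketched with a toy one-input/one-output model; the sketch below is plain Haskell, not the actual Feldspar/Ziria embedding (which also handles processes that consume and emit at different rates). Composing two stages yields a single step function, so the message passing between them disappears.

{-# LANGUAGE ExistentialQuantification #-}
-- A toy sketch of pipeline fusion between stream processes: each stage is a
-- Mealy-style step function, and composing two stages yields one step
-- function, so no message queue is needed between them.
module PipeFusion where

-- A stage consumes one input, updates its state, and emits one output.
data Stage a b = forall s. Stage s (s -> a -> (s, b))

fuse :: Stage a b -> Stage b c -> Stage a c
fuse (Stage s0 f) (Stage t0 g) =
  Stage (s0, t0) (\(s, t) a -> let (s', b) = f s a
                                   (t', c) = g t b
                               in  ((s', t'), c))

runStage :: Stage a b -> [a] -> [b]
runStage (Stage s0 f) = go s0
  where go _ []       = []
        go s (a : as) = let (s', b) = f s a in b : go s' as

-- Example: scale each sample, then keep a running sum.
scale :: Double -> Stage Double Double
scale k = Stage () (\_ x -> ((), k * x))

runningSum :: Stage Double Double
runningSum = Stage 0 (\acc x -> (acc + x, acc + x))

main :: IO ()
main = print (runStage (scale 2 `fuse` runningSum) [1, 2, 3])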
@InProceedings{FHPC16p9,
author = {Máté Karácsony and Koen Claessen},
title = {Using Fusion to Enable Late Design Decisions for Pipelined Computations},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {9--16},
doi = {},
year = {2016},
}
Code Generation
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi,
Seiya Nishizawa,
Hirofumi Tomita,
Keigo Nitadori,
Masaki Iwasawa,
Yutaka Maruyama,
Hisashi Yashiro,
Yoshifumi Nakamura,
Hideyuki Hotta,
Junichiro Makino,
Natsuki Hosono, and
Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is tedious work, so functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of the learning barrier and the lack of demonstrated application performance.
We have designed Formura, which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts, such as operations over derivative operators, using functional programming. Formura allows intuitive expressions over array elements while ensuring that the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking are always applied to Formura-generated programs.
We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C code is generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, achieving 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
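The restriction to stencil computation can be illustrated with a minimal example in plain Haskell lists (not Formura syntax or its generated C): each output element is a pure function of a fixed window of neighbouring inputs, which is the property that makes optimizations such as temporal blocking applicable.

-- A 1D three-point diffusion stencil with periodic boundaries, as an
-- illustration only: every output element depends on a fixed neighbourhood.
module Stencil1D where

diffuseStep :: Double -> [Double] -> [Double]
diffuseStep nu u =
  [ c + nu * (l - 2 * c + r)
  | (l, c, r) <- zip3 (rotate (-1) u) u (rotate 1 u) ]
  where rotate n xs = let k = n `mod` length xs in drop k xs ++ take k xs

-- Iterate the stencil for a given number of time steps.
diffuse :: Int -> Double -> [Double] -> [Double]
diffuse steps nu = (!! steps) . iterate (diffuseStep nu)

main :: IO ()
main = print (diffuse 10 0.1 [0, 0, 1, 0, 0])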
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {},
year = {2016},
}
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier,
John Magnus Morton, and
Phil Trinder
(University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks are unpredictable, are particularly challenging and require dynamic coordination.
The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism.
We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture.
We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet perform dynamic rewriting automatically, we present results based on manual offline rewriting, demonstrating that
(i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and
(ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.
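A hand-written sketch of one such rewrite equation, in Haskell with the parallel package rather than the paper's Racket/Pycket setting: parMap f and its chunked variant compute the same result, so a cost model is free to pick the chunk size, and hence the task granularity, that fits the architecture.

-- Sketch of a granularity-changing skeleton rewrite (not the paper's system):
-- fineMap f and chunkedMap n f are semantically equal, so a cost model may
-- rewrite one into the other and tune n.
module Skeletons where

import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (unfoldr)

-- Fine-grained skeleton: one task per element.
fineMap :: (a -> Int) -> [a] -> [Int]
fineMap = parMap rdeepseq

-- Coarser skeleton: one task per chunk of n elements.
chunkedMap :: Int -> (a -> Int) -> [a] -> [Int]
chunkedMap n f = concat . parMap rdeepseq (map f) . chunksOf n
  where chunksOf k = unfoldr (\xs -> if null xs then Nothing
                                     else Just (splitAt k xs))

main :: IO ()
main = print (fineMap (* 2) [1 .. 10] == chunkedMap 4 (* 2) [1 .. 10])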
@InProceedings{FHPC16p23,
author = {Patrick Maier and John Magnus Morton and Phil Trinder},
title = {JIT Costing Adaptive Skeletons for Performance Portability},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {23--30},
doi = {},
year = {2016},
}
GPUs
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal,
Martin Elsman,
Bo Joel Svensson, and
Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie, and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged meta-programming approach for kernel generation using Haskell as the meta-language, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages such as R, Matlab, and APL.
We present a type system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target language for various code generators targeting GPU hardware.
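One way to picture the general idea of exposing the GPU hierarchy in the types is the following toy sketch, using hypothetical Haskell types rather than FCL's actual type system: arrays are tagged with the level at which they are produced, and reductions move results up one level at a time.

{-# LANGUAGE DataKinds, KindSignatures #-}
-- Toy illustration only, not FCL: a phantom type records whether an array
-- lives at thread, block, or grid level, so the hierarchy is visible to the
-- type checker.
module LevelArrays where

data Level = Thread | Block | Grid

newtype Arr (l :: Level) a = Arr { elems :: [a] }

-- Per-thread work: map a scalar function, staying at the same level.
mapA :: (a -> b) -> Arr l a -> Arr l b
mapA f (Arr xs) = Arr (map f xs)

-- Combine per-thread results within a block (on a real GPU this would use
-- shared memory); here it is just a pure reduction.
reduceBlock :: (a -> a -> a) -> a -> Arr 'Thread a -> Arr 'Block a
reduceBlock f z (Arr xs) = Arr [foldl f z xs]

-- Combine per-block results into the grid-level result.
reduceGrid :: (a -> a -> a) -> a -> Arr 'Block a -> Arr 'Grid a
reduceGrid f z (Arr xs) = Arr [foldl f z xs]

main :: IO ()
main = print (elems (reduceGrid (+) 0 (reduceBlock (+) 0 (mapA (* 2) (Arr [1 .. 8 :: Int])))))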
@InProceedings{FHPC16p31,
author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran},
title = {Low-Level Functional GPU Programming for Parallel Algorithms},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {31--37},
doi = {},
year = {2016},
}
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen,
Martin Dybdal,
Henrik Urms,
Anna Sofie Kiehn,
Daniel Gavin,
Hjalte Abelskov,
Martin Elsman, and
Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that runs efficiently on general-purpose graphics processing units (GPGPUs). Furthermore, the generated programs can straightforwardly interoperate with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups of up to hundreds of times over compiled sequential C code.
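As a small illustration of the kind of translation involved (written here in Haskell rather than TAIL or Futhark), the APL expression +/ x × y, the sum of an elementwise product, corresponds to a composition of typed map/zip/reduce combinators, which is the form that maps well onto GPU code.

-- Illustration only: the APL dot product  +/ x × y  expressed with ordinary
-- functional combinators instead of APL primitives.
module AplStyle where

dotProduct :: [Double] -> [Double] -> Double
dotProduct x y = foldl (+) 0 (zipWith (*) x y)

main :: IO ()
main = print (dotProduct [1, 2, 3] [4, 5, 6])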
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {},
year = {2016},
}
Streaming and Dataflow
Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and
Andrzej Filinski
(University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of semi-regular computation tasks to be mapped onto SIMD-style hardware, including GPUs and vector units. However, some care is needed to keep down space consumption in situations where the available parallelism may vastly exceed the available computation resources. To allow for an accurate space-cost model in such cases, we have previously proposed the Streaming NESL language, a refinement of NESL with a high-level notion of streamable sequences.
In this paper, we report on experience with a prototype implementation of Streaming NESL on a 2-level parallel platform, namely a multicore system in which we also aggressively utilize vector instructions on each core. We show that for several examples of simple, but not trivially parallelizable, text-processing tasks, we obtain single-core performance on par with off-the-shelf GNU Coreutils code, and near-linear speedups for multiple cores.
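The space benefit of streamable sequences can be illustrated with a rough sketch in plain Haskell (not Streaming NESL): the computation is carried out chunk by chunk, so only a bounded number of elements are live at any time, while each chunk is still wide enough to feed vector units and multiple cores.

{-# LANGUAGE BangPatterns #-}
-- Illustration only: sum f(i) over a huge index range without materialising
-- the whole sequence; each inner fold is the part one would hand to a
-- vectorised/multicore backend.
module ChunkedSum where

import Data.List (foldl')

chunkSize :: Int
chunkSize = 4096

streamedSum :: (Int -> Double) -> Int -> Int -> Double
streamedSum f lo hi = go lo 0
  where
    go i !acc
      | i > hi    = acc
      | otherwise =
          let j     = min hi (i + chunkSize - 1)
              chunk = foldl' (+) 0 (map f [i .. j])
          in  go (j + 1) (acc + chunk)

main :: IO ()
main = print (streamedSum fromIntegral 1 1000000)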
@InProceedings{FHPC16p44,
author = {Frederik M. Madsen and Andrzej Filinski},
title = {Streaming Nested Data Parallelism on Multicores},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {44--51},
doi = {},
year = {2016},
}
Polarized Data Parallel Data Flow
Ben Lippmeier,
Fil Mackay, and
Amos Robinson
(Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
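A much-simplified sketch of the polarity idea, using hypothetical types rather than the Repa Flow API: a Source is pulled from, a Sink is pushed into, and draining a Source into a Sink buffers only one element at a time, so the pipeline runs in constant space.

-- Illustration only, not Repa Flow: pull-polarity Sources, push-polarity
-- Sinks, and a constant-space drain loop connecting them.
module Polarity where

import Data.IORef

newtype Source a = Source (IO (Maybe a))   -- pull: ask for the next element
newtype Sink   a = Sink   (a -> IO ())     -- push: hand over one element

-- Transforming a Source or a Sink preserves its polarity.
mapSource :: (a -> b) -> Source a -> Source b
mapSource f (Source pull) = Source (fmap (fmap f) pull)

mapSink :: (a -> b) -> Sink b -> Sink a
mapSink f (Sink push) = Sink (push . f)

-- Connect a Source to a Sink: one element of buffering, constant space.
drain :: Source a -> Sink a -> IO ()
drain (Source pull) (Sink push) = loop
  where loop = pull >>= maybe (return ()) (\x -> push x >> loop)

sourceList :: [a] -> IO (Source a)
sourceList xs0 = do
  ref <- newIORef xs0
  return . Source $ do
    xs <- readIORef ref
    case xs of
      []     -> return Nothing
      y : ys -> writeIORef ref ys >> return (Just y)

sinkPrint :: Show a => Sink a
sinkPrint = Sink print

main :: IO ()
main = do
  src <- sourceList [1 .. 5 :: Int]
  drain (mapSource (* 10) src) sinkPrint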
@InProceedings{FHPC16p52,
author = {Ben Lippmeier and Fil Mackay and Amos Robinson},
title = {Polarized Data Parallel Data Flow},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {52--57},
doi = {},
year = {2016},
}
Graph Processing
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz,
Kiminori Matsuzaki, and
Shigeyuki Sato
(Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend has emerged that views graph computation as iterative vertex-centric computation. Several systems follow the vertex-centric approach, for example Pregel, Giraph, GraphLab, and PowerGraph. Although programs developed in these systems run efficiently in parallel, writing vertex programs usually results in code with poor readability that is full of side effects and control statements unrelated to the algorithm.
In this paper we introduce "s6raph", a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for the most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.
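The shape of such a functional vertex-centric interface can be sketched generically, here in Haskell rather than the framework's Erlang and not s6raph's actual API: the user supplies one pure function from a vertex's current value and incoming messages to its new value and outgoing message, and a driver performs the message routing for each superstep.

-- Generic illustration of a vertex-centric superstep with a purely
-- functional vertex program (not s6raph's interface).
module VertexCentric where

import qualified Data.Map.Strict as M

type VertexId = Int
type Graph v  = M.Map VertexId (v, [VertexId])   -- value and out-neighbours

-- One pure vertex program: old value and inbox -> new value and message to send.
type Program v m = v -> [m] -> (v, m)

superstep :: Program v m -> M.Map VertexId [m] -> Graph v -> (Graph v, M.Map VertexId [m])
superstep prog inbox g = (g', outbox)
  where
    stepped = M.mapWithKey
                (\k (v, ns) -> let (v', msg) = prog v (M.findWithDefault [] k inbox)
                               in  ((v', ns), [(n, [msg]) | n <- ns]))
                g
    g'      = M.map fst stepped
    outbox  = M.fromListWith (++) (concatMap snd (M.elems stepped))

-- Example: propagate the minimum label seen so far (connected-components style).
minLabel :: Program Int Int
minLabel v msgs = let v' = minimum (v : msgs) in (v', v')

main :: IO ()
main = do
  let g0 = M.fromList [(1, (1, [2])), (2, (2, [1, 3])), (3, (3, [2]))]
      step (g, inbox) = superstep minLabel inbox g
      (g3, _) = iterate step (g0, M.empty) !! 3
  print (M.map fst g3)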
@InProceedings{FHPC16p58,
author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato},
title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {58--64},
doi = {},
year = {2016},
}