Workshop FHPC 2016 – Author Index 

Abelskov, Hjalte 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Claessen, Koen 
FHPC '16: "Using Fusion to Enable Late ..."
Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and Koen Claessen (Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations originally designed for programming software defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map computations onto the details of a specific parallel architecture, where the fusion helps us to generate efficient code. This flexible design method enables late design decisions cheaply, which in turn can lead to more efficient produced code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany manycore coprocessor and on other backends.
@InProceedings{FHPC16p9, author = {Máté Karácsony and Koen Claessen}, title = {Using Fusion to Enable Late Design Decisions for Pipelined Computations}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {9--16}, doi = {10.1145/2975991.2975993}, year = {2016}, }

Coll Ruiz, Onofre 
FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato (Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm. In this paper we introduce "s6raph", a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.
@InProceedings{FHPC16p58, author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato}, title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {58--64}, doi = {10.1145/2975991.2976000}, year = {2016}, }

Dybdal, Martin 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran (University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged metaprogramming approach for kernel generation using Haskell as metalanguage, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type-system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31, author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran}, title = {Low-Level Functional GPU Programming for Parallel Algorithms}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {31--37}, doi = {10.1145/2975991.2975996}, year = {2016}, }

Elsman, Martin 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran (University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged metaprogramming approach for kernel generation using Haskell as metalanguage, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type-system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31, author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran}, title = {Low-Level Functional GPU Programming for Parallel Algorithms}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {31--37}, doi = {10.1145/2975991.2975996}, year = {2016}, }

Filinski, Andrzej 
FHPC '16: "Streaming Nested Data Parallelism ..."
Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and Andrzej Filinski (University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of semi-regular computation tasks to be mapped onto SIMD-style hardware, including GPUs and vector units. However, some care is needed to keep down space consumption in situations where the available parallelism may vastly exceed the available computation resources. To allow for an accurate space-cost model in such cases, we have previously proposed the Streaming NESL language, a refinement of NESL with a high-level notion of streamable sequences. In this paper, we report on experience with a prototype implementation of Streaming NESL on a 2-level parallel platform, namely a multicore system in which we also aggressively utilize vector instructions on each core. We show that for several examples of simple, but not trivially parallelizable, text-processing tasks, we obtain single-core performance on par with off-the-shelf GNU Coreutils code, and near-linear speedups for multiple cores.
@InProceedings{FHPC16p44, author = {Frederik M. Madsen and Andrzej Filinski}, title = {Streaming Nested Data Parallelism on Multicores}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {44--51}, doi = {10.1145/2975991.2975998}, year = {2016}, }

Gavin, Daniel 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Henriksen, Troels 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Hosono, Natsuki 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Hotta, Hideyuki 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Inoue, Hikaru 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Iwasawa, Masaki 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Karácsony, Máté 
FHPC '16: "Using Fusion to Enable Late ..."
Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and Koen Claessen (Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations originally designed for programming software defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map computations onto the details of a specific parallel architecture, where the fusion helps us to generate efficient code. This flexible design method enables late design decisions cheaply, which in turn can lead to more efficient produced code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany manycore coprocessor and on other backends.
@InProceedings{FHPC16p9, author = {Máté Karácsony and Koen Claessen}, title = {Using Fusion to Enable Late Design Decisions for Pipelined Computations}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {9--16}, doi = {10.1145/2975991.2975993}, year = {2016}, }

Kiehn, Anna Sofie 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.
@InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Lippmeier, Ben 
FHPC '16: "Icicle: Write Once, Run Once ..."
Icicle: Write Once, Run Once
Amos Robinson and Ben Lippmeier (UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia)
We present Icicle, a pure streaming query language which statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed in an incremental fashion, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedup over existing queries written in R, and on par with the widely used Unix tools grep and wc.
@InProceedings{FHPC16p2, author = {Amos Robinson and Ben Lippmeier}, title = {Icicle: Write Once, Run Once}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {2--8}, doi = {10.1145/2975991.2975992}, year = {2016}, }
FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson (Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
@InProceedings{FHPC16p52, author = {Ben Lippmeier and Fil Mackay and Amos Robinson}, title = {Polarized Data Parallel Data Flow}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {52--57}, doi = {10.1145/2975991.2975999}, year = {2016}, }

Mackay, Fil 
FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson (Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
@InProceedings{FHPC16p52, author = {Ben Lippmeier and Fil Mackay and Amos Robinson}, title = {Polarized Data Parallel Data Flow}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {52--57}, doi = {10.1145/2975991.2975999}, year = {2016}, }

Madsen, Frederik M. 
FHPC '16: "Streaming Nested Data Parallelism ..."
Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and Andrzej Filinski (University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of semi-regular computation tasks to be mapped onto SIMD-style hardware, including GPUs and vector units. However, some care is needed to keep down space consumption in situations where the available parallelism may vastly exceed the available computation resources. To allow for an accurate space-cost model in such cases, we have previously proposed the Streaming NESL language, a refinement of NESL with a high-level notion of streamable sequences. In this paper, we report on experience with a prototype implementation of Streaming NESL on a 2-level parallel platform, namely a multicore system in which we also aggressively utilize vector instructions on each core. We show that for several examples of simple, but not trivially parallelizable, text-processing tasks, we obtain single-core performance on par with off-the-shelf GNU Coreutils code, and near-linear speedups for multiple cores.
@InProceedings{FHPC16p44, author = {Frederik M. Madsen and Andrzej Filinski}, title = {Streaming Nested Data Parallelism on Multicores}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {44--51}, doi = {10.1145/2975991.2975998}, year = {2016}, }

Maier, Patrick 
FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder (University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism. We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture. We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that (i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and (ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.
@InProceedings{FHPC16p23, author = {Patrick Maier and John Magnus Morton and Phil Trinder}, title = {JIT Costing Adaptive Skeletons for Performance Portability}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {23--30}, doi = {10.1145/2975991.2975995}, year = {2016}, }

Makino, Junichiro 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Maruyama, Yutaka 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Matsuzaki, Kiminori 
FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato (Kochi University of Technology, Japan) Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm. In this paper we introduce "s6raph", a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang. @InProceedings{FHPC16p58, author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato}, title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {58--64}, doi = {10.1145/2975991.2976000}, year = {2016}, }

Morihata, Akimasa 
FHPC '16: "From Identification of Parallelizability ..."
From Identification of Parallelizability to Derivation of Parallelizable Codes
Akimasa Morihata (University of Tokyo, Japan) Although now parallel computing is very common, current parallel programming methods tend to be domain-specific (specializing in certain program patterns such as nested loops) and/or manual (programmers need to specify independent tasks). This situation poses a serious difficulty in developing efficient parallel programs. We often need to manually transform codes written in usual programming patterns to ones in a parallelizable form. We hope to have a solid foundation to streamline this transformation. This talk first reviews necessity of a method of systematically deriving parallelizable codes and then introduces an ongoing work on extending lambda calculus for the purpose. The distinguished feature of the new calculus is a special construct that enable evaluation with incomplete information, which is useful to express important parallel computation patterns such as reductions (aggregations). We then investigate derivations of parallelizable codes as transformations on the calculus. @InProceedings{FHPC16p1, author = {Akimasa Morihata}, title = {From Identification of Parallelizability to Derivation of Parallelizable Codes}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {1--1}, doi = {10.1145/2975991.2984053}, year = {2016}, }

Morton, John Magnus 
FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder (University of Glasgow, UK) The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism. We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture. We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that (i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and (ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation. @InProceedings{FHPC16p23, author = {Patrick Maier and John Magnus Morton and Phil Trinder}, title = {JIT Costing Adaptive Skeletons for Performance Portability}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {23--30}, doi = {10.1145/2975991.2975995}, year = {2016}, }

Muranushi, Takayuki 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Nakamura, Yoshifumi 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Nishizawa, Seiya 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Nitadori, Keigo 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Oancea, Cosmin 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark) This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code. @InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Robinson, Amos 
FHPC '16: "Icicle: Write Once, Run Once ..."
Icicle: Write Once, Run Once
Amos Robinson and Ben Lippmeier (UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia) We present Icicle, a pure streaming query language which statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed in an incremental fashion, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedup over existing queries written in R, and on par with the widely used Unix tools grep and wc. @InProceedings{FHPC16p2, author = {Amos Robinson and Ben Lippmeier}, title = {Icicle: Write Once, Run Once}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {2--8}, doi = {10.1145/2975991.2975992}, year = {2016}, }
FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson (Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia) We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo. @InProceedings{FHPC16p52, author = {Ben Lippmeier and Fil Mackay and Amos Robinson}, title = {Polarized Data Parallel Data Flow}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {52--57}, doi = {10.1145/2975991.2975999}, year = {2016}, }

Sato, Shigeyuki 
FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato (Kochi University of Technology, Japan) Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm. In this paper we introduce "s6raph", a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang. @InProceedings{FHPC16p58, author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato}, title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {58--64}, doi = {10.1145/2975991.2976000}, year = {2016}, }

Sheeran, Mary 
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran (University of Copenhagen, Denmark; Chalmers University of Technology, Sweden) We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly. FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged metaprogramming approach for kernel generation using Haskell as metalanguage, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type-system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware. @InProceedings{FHPC16p31, author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran}, title = {Low-Level Functional GPU Programming for Parallel Algorithms}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {31--37}, doi = {10.1145/2975991.2975996}, year = {2016}, }

Svensson, Bo Joel 
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran (University of Copenhagen, Denmark; Chalmers University of Technology, Sweden) We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly. FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged metaprogramming approach for kernel generation using Haskell as metalanguage, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type-system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware. @InProceedings{FHPC16p31, author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran}, title = {Low-Level Functional GPU Programming for Parallel Algorithms}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {31--37}, doi = {10.1145/2975991.2975996}, year = {2016}, }

Tomita, Hirofumi 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }

Trinder, Phil 
FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder (University of Glasgow, UK) The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism. We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture. We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that (i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and (ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation. @InProceedings{FHPC16p23, author = {Patrick Maier and John Magnus Morton and Phil Trinder}, title = {JIT Costing Adaptive Skeletons for Performance Portability}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {23--30}, doi = {10.1145/2975991.2975995}, year = {2016}, }

Urms, Henrik 
FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea (University of Copenhagen, Denmark) This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code. @InProceedings{FHPC16p38, author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea}, title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {38--43}, doi = {10.1145/2975991.2975997}, year = {2016}, }

Yashiro, Hisashi 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue (RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan) Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency. @InProceedings{FHPC16p17, author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue}, title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation}, booktitle = {Proc.\ FHPC}, publisher = {ACM}, pages = {17--22}, doi = {10.1145/2975991.2975994}, year = {2016}, }
36 authors