
Abelskov, Hjalte

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Claessen, Koen

FHPC '16: "Using Fusion to Enable Late ..."
Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and Koen Claessen
(Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations originally designed for programming software defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map computations onto the details of a specific parallel architecture, where the fusion helps us to generate efficient code. This flexible design method enables late design decisions cheaply, which in turn can lead to more efficient produced code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany manycore coprocessor and on other backends.
@InProceedings{FHPC16p9,
author = {Máté Karácsony and Koen Claessen},
title = {Using Fusion to Enable Late Design Decisions for Pipelined Computations},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {9--16},
doi = {10.1145/2975991.2975993},
year = {2016},
}


Coll Ruiz, Onofre

FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato
(Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm.
In this paper we introduce ``s6raph'', a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.
@InProceedings{FHPC16p58,
author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato},
title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {58--64},
doi = {10.1145/2975991.2976000},
year = {2016},
}


Dybdal, Martin

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU
programming. FCL is functional in style, which allows for easy
composition of program fragments and thus easy prototyping and a
high degree of code reuse. In contrast with projects such as
Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not
to develop a language providing fully automatic optimizations, but
instead to provide a platform that supports absolute control of the
GPU computation and memory hierarchies. The developer is thus
required to have an intimate knowledge of the target platform, as is
also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on
a multi-staged metaprogramming approach for kernel generation using
Haskell as metalanguage, FCL is completely self-contained, and we
intend it to be suitable as an intermediate language for
data-parallel languages, including data-parallel parts of high-level
array languages, such as R, Matlab, and APL.
We present a type-system and a dynamic semantics suitable for
understanding the performance characteristics of both FCL and
Obsidian-style programs. Our aim is that FCL will be useful as a
platform for developing new parallel algorithms, as well as a
target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31,
author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran},
title = {Low-Level Functional GPU Programming for Parallel Algorithms},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {31--37},
doi = {10.1145/2975991.2975996},
year = {2016},
}


Elsman, Martin

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU
programming. FCL is functional in style, which allows for easy
composition of program fragments and thus easy prototyping and a
high degree of code reuse. In contrast with projects such as
Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not
to develop a language providing fully automatic optimizations, but
instead to provide a platform that supports absolute control of the
GPU computation and memory hierarchies. The developer is thus
required to have an intimate knowledge of the target platform, as is
also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on
a multi-staged metaprogramming approach for kernel generation using
Haskell as metalanguage, FCL is completely self-contained, and we
intend it to be suitable as an intermediate language for
data-parallel languages, including data-parallel parts of high-level
array languages, such as R, Matlab, and APL.
We present a type-system and a dynamic semantics suitable for
understanding the performance characteristics of both FCL and
Obsidian-style programs. Our aim is that FCL will be useful as a
platform for developing new parallel algorithms, as well as a
target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31,
author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran},
title = {Low-Level Functional GPU Programming for Parallel Algorithms},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {31--37},
doi = {10.1145/2975991.2975996},
year = {2016},
}


Filinski, Andrzej

FHPC '16: "Streaming Nested Data Parallelism ..."
Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and Andrzej Filinski
(University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of
semi-regular computation tasks to be mapped onto SIMD-style hardware,
including GPUs and vector units. However, some care is needed to keep
down space consumption in situations where the available parallelism may
vastly exceed the available computation resources. To allow for an
accurate space-cost model in such cases, we have previously proposed
the Streaming NESL language, a refinement of NESL with a high-level
notion of streamable sequences.
In this paper, we report on experience with a prototype implementation
of Streaming NESL on a 2-level parallel platform, namely a multicore
system in which we also aggressively utilize vector instructions on
each core. We show that for several examples of simple, but not
trivially parallelizable, text-processing tasks, we obtain single-core
performance on par with off-the-shelf GNU Coreutils code, and
near-linear speedups for multiple cores.
@InProceedings{FHPC16p44,
author = {Frederik M. Madsen and Andrzej Filinski},
title = {Streaming Nested Data Parallelism on Multicores},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {44--51},
doi = {10.1145/2975991.2975998},
year = {2016},
}


Gavin, Daniel

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Henriksen, Troels

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Hosono, Natsuki

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Hotta, Hideyuki

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Inoue, Hikaru

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Iwasawa, Masaki

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Karácsony, Máté

FHPC '16: "Using Fusion to Enable Late ..."
Using Fusion to Enable Late Design Decisions for Pipelined Computations
Máté Karácsony and Koen Claessen
(Eötvös Loránd University, Hungary; Chalmers University of Technology, Sweden)
We present an embedded language in Haskell for programming pipelined computations. The language is a combination of Feldspar (a functional language for array computations) and a new implementation of Ziria (a language for describing streaming computations originally designed for programming software defined radio). The resulting language makes heavy use of fusion: as in Feldspar, computations over arrays are fused to eliminate intermediate arrays, but Ziria processes can also be fused, eliminating the message passing between them, which in turn can give rise to more fusion at the Feldspar level. The result is a language in which we can first describe pipelined computations at a very fine-grained level, and only afterwards map computations onto the details of a specific parallel architecture, where the fusion helps us to generate efficient code. This flexible design method enables late design decisions cheaply, which in turn can lead to more efficient produced code. In the paper, we present two examples of pipelined computations in our language that can be run on Adapteva’s Epiphany manycore coprocessor and on other backends.
@InProceedings{FHPC16p9,
author = {Máté Karácsony and Koen Claessen},
title = {Using Fusion to Enable Late Design Decisions for Pipelined Computations},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {9--16},
doi = {10.1145/2975991.2975993},
year = {2016},
}


Kiehn, Anna Sofie

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Lippmeier, Ben

FHPC '16: "Icicle: Write Once, Run Once ..."
Icicle: Write Once, Run Once
Amos Robinson and Ben Lippmeier
(UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia)
We present Icicle, a pure streaming query language which statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed in an incremental fashion, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedup over existing queries written in R, and on par with the widely used Unix tools grep and wc.
@InProceedings{FHPC16p2,
author = {Amos Robinson and Ben Lippmeier},
title = {Icicle: Write Once, Run Once},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {2--8},
doi = {10.1145/2975991.2975992},
year = {2016},
}
FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson
(Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
@InProceedings{FHPC16p52,
author = {Ben Lippmeier and Fil Mackay and Amos Robinson},
title = {Polarized Data Parallel Data Flow},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {52--57},
doi = {10.1145/2975991.2975999},
year = {2016},
}


Mackay, Fil

FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson
(Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
@InProceedings{FHPC16p52,
author = {Ben Lippmeier and Fil Mackay and Amos Robinson},
title = {Polarized Data Parallel Data Flow},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {52--57},
doi = {10.1145/2975991.2975999},
year = {2016},
}


Madsen, Frederik M.

FHPC '16: "Streaming Nested Data Parallelism ..."
Streaming Nested Data Parallelism on Multicores
Frederik M. Madsen and Andrzej Filinski
(University of Copenhagen, Denmark)
The paradigm of nested data parallelism (NDP) allows a variety of
semi-regular computation tasks to be mapped onto SIMD-style hardware,
including GPUs and vector units. However, some care is needed to keep
down space consumption in situations where the available parallelism may
vastly exceed the available computation resources. To allow for an
accurate space-cost model in such cases, we have previously proposed
the Streaming NESL language, a refinement of NESL with a high-level
notion of streamable sequences.
In this paper, we report on experience with a prototype implementation
of Streaming NESL on a 2-level parallel platform, namely a multicore
system in which we also aggressively utilize vector instructions on
each core. We show that for several examples of simple, but not
trivially parallelizable, text-processing tasks, we obtain single-core
performance on par with off-the-shelf GNU Coreutils code, and
near-linear speedups for multiple cores.
@InProceedings{FHPC16p44,
author = {Frederik M. Madsen and Andrzej Filinski},
title = {Streaming Nested Data Parallelism on Multicores},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {44--51},
doi = {10.1145/2975991.2975998},
year = {2016},
}


Maier, Patrick

FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder
(University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination.
The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism.
We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture.
We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that
(i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and
(ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.
@InProceedings{FHPC16p23,
author = {Patrick Maier and John Magnus Morton and Phil Trinder},
title = {JIT Costing Adaptive Skeletons for Performance Portability},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {23--30},
doi = {10.1145/2975991.2975995},
year = {2016},
}


Makino, Junichiro 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Maruyama, Yutaka 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Matsuzaki, Kiminori 
FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato
(Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm.
In this paper we introduce ``s6raph'', a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.
@InProceedings{FHPC16p58,
author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato},
title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {58--64},
doi = {10.1145/2975991.2976000},
year = {2016},
}


Morihata, Akimasa 
FHPC '16: "From Identification of Parallelizability ..."
From Identification of Parallelizability to Derivation of Parallelizable Codes
Akimasa Morihata
(University of Tokyo, Japan)
Although now parallel computing is very common, current parallel programming methods tend to be domain-specific (specializing in certain program patterns such as nested loops) and/or manual (programmers need to specify independent tasks). This situation poses a serious difficulty in developing efficient parallel programs. We often need to manually transform codes written in usual programming patterns to ones in a parallelizable form. We hope to have a solid foundation to streamline this transformation. This talk first reviews necessity of a method of systematically deriving parallelizable codes and then introduces an ongoing work on extending lambda calculus for the purpose. The distinguished feature of the new calculus is a special construct that enable evaluation with incomplete information, which is useful to express important parallel computation patterns such as reductions (aggregations). We then investigate derivations of parallelizable codes as transformations on the calculus.
@InProceedings{FHPC16p1,
author = {Akimasa Morihata},
title = {From Identification of Parallelizability to Derivation of Parallelizable Codes},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {1--1},
doi = {10.1145/2975991.2984053},
year = {2016},
}


Morton, John Magnus 
FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder
(University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination.
The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism.
We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture.
We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that
(i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and
(ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.
@InProceedings{FHPC16p23,
author = {Patrick Maier and John Magnus Morton and Phil Trinder},
title = {JIT Costing Adaptive Skeletons for Performance Portability},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {23--30},
doi = {10.1145/2975991.2975995},
year = {2016},
}


Muranushi, Takayuki 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Nakamura, Yoshifumi

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Nishizawa, Seiya 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Nitadori, Keigo 
FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Oancea, Cosmin

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Robinson, Amos

FHPC '16: "Icicle: Write Once, Run Once ..."
Icicle: Write Once, Run Once
Amos Robinson and Ben Lippmeier
(UNSW, Australia; Ambiata, Australia; Vertigo Technology, Australia)
We present Icicle, a pure streaming query language which statically guarantees that multiple queries over the same input stream will be fused. We use a modal type system to ensure that fused queries can be computed in an incremental fashion, and a fold-based intermediate language to compile down to efficient C code. We present production benchmarks demonstrating significant speedup over existing queries written in R, and on par with the widely used Unix tools grep and wc.
@InProceedings{FHPC16p2,
author = {Amos Robinson and Ben Lippmeier},
title = {Icicle: Write Once, Run Once},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {2--8},
doi = {10.1145/2975991.2975992},
year = {2016},
}
FHPC '16: "Polarized Data Parallel Data ..."
Polarized Data Parallel Data Flow
Ben Lippmeier, Fil Mackay, and Amos Robinson
(Vertigo Technology, Australia; UNSW, Australia; Ambiata, Australia)
We present an approach to writing fused data parallel data flow programs where the library API guarantees that the client programs run in constant space. Our constant space guarantee is achieved by observing that binary stream operators can be provided in several polarity versions. Each polarity version uses a different combination of stream sources and sinks, and some versions allow constant space execution while others do not. Our approach is embodied in the Repa Flow Haskell library, which we are currently using for production workloads at Vertigo.
@InProceedings{FHPC16p52,
author = {Ben Lippmeier and Fil Mackay and Amos Robinson},
title = {Polarized Data Parallel Data Flow},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {52--57},
doi = {10.1145/2975991.2975999},
year = {2016},
}


Sato, Shigeyuki

FHPC '16: "s6raph: Vertex-Centric Graph ..."
s6raph: Vertex-Centric Graph Processing Framework with Functional Interface
Onofre Coll Ruiz, Kiminori Matsuzaki, and Shigeyuki Sato
(Kochi University of Technology, Japan)
Parallel processing of big graph-shaped data still presents many challenges. Several approaches have appeared recently, and a strong trend focusing on understanding graph computation as iterative vertex-centric computations has emerged. There have been several systems in the vertex-centric approach, for example Pregel, Giraph, GraphLab and PowerGraph. Though programs developed in these systems run efficiently in parallel, writing vertex-programs usually results in code with poor readability, that is full of side effects and control statements unrelated to the algorithm.
In this paper we introduce ``s6raph'', a new vertex-centric graph processing framework with a functional interface that allows the user to write clear and concise functions. The user can choose one of several default behaviours provided for most common graph algorithms. We discuss the design of the functional interface and introduce our prototype implementation in Erlang.
@InProceedings{FHPC16p58,
author = {Onofre Coll Ruiz and Kiminori Matsuzaki and Shigeyuki Sato},
title = {s6raph: Vertex-Centric Graph Processing Framework with Functional Interface},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {58--64},
doi = {10.1145/2975991.2976000},
year = {2016},
}


Sheeran, Mary 
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU
programming. FCL is functional in style, which allows for easy
composition of program fragments and thus easy prototyping and a
high degree of code reuse. In contrast with projects such as
Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not
to develop a language providing fully automatic optimizations, but
instead to provide a platform that supports absolute control of the
GPU computation and memory hierarchies. The developer is thus
required to have an intimate knowledge of the target platform, as is
also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on
a multi-staged metaprogramming approach for kernel generation using
Haskell as metalanguage, FCL is completely self-contained, and we
intend it to be suitable as an intermediate language for
data-parallel languages, including data-parallel parts of high-level
array languages, such as R, Matlab, and APL.
We present a type-system and a dynamic semantics suitable for
understanding the performance characteristics of both FCL and
Obsidian-style programs. Our aim is that FCL will be useful as a
platform for developing new parallel algorithms, as well as a
target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31,
author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran},
title = {Low-Level Functional GPU Programming for Parallel Algorithms},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {31--37},
doi = {10.1145/2975991.2975996},
year = {2016},
}


Svensson, Bo Joel 
FHPC '16: "Low-Level Functional GPU Programming ..."
Low-Level Functional GPU Programming for Parallel Algorithms
Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran
(University of Copenhagen, Denmark; Chalmers University of Technology, Sweden)
We present a Functional Compute Language (FCL) for low-level GPU
programming. FCL is functional in style, which allows for easy
composition of program fragments and thus easy prototyping and a
high degree of code reuse. In contrast with projects such as
Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not
to develop a language providing fully automatic optimizations, but
instead to provide a platform that supports absolute control of the
GPU computation and memory hierarchies. The developer is thus
required to have an intimate knowledge of the target platform, as is
also required when using CUDA/OpenCL directly.
FCL is heavily inspired by Obsidian. However, instead of relying on
a multi-staged metaprogramming approach for kernel generation using
Haskell as metalanguage, FCL is completely self-contained, and we
intend it to be suitable as an intermediate language for
data-parallel languages, including data-parallel parts of high-level
array languages, such as R, Matlab, and APL.
We present a type-system and a dynamic semantics suitable for
understanding the performance characteristics of both FCL and
Obsidian-style programs. Our aim is that FCL will be useful as a
platform for developing new parallel algorithms, as well as a
target-language for various code-generators targeting GPU hardware.
@InProceedings{FHPC16p31,
author = {Martin Dybdal and Martin Elsman and Bo Joel Svensson and Mary Sheeran},
title = {Low-Level Functional GPU Programming for Parallel Algorithms},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {31--37},
doi = {10.1145/2975991.2975996},
year = {2016},
}


Tomita, Hirofumi

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}


Trinder, Phil 
FHPC '16: "JIT Costing Adaptive Skeletons ..."
JIT Costing Adaptive Skeletons for Performance Portability
Patrick Maier, John Magnus Morton, and Phil Trinder
(University of Glasgow, UK)
The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregular parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination.
The paper outlines a novel approach to delivering portable parallel performance for irregular parallel programs. The approach combines JIT compiler technology with dynamic scheduling and dynamic transformation of declarative parallelism.
We specify families of algorithmic skeletons plus equations for rewriting skeleton expressions. We present the design of a framework that unfolds skeletons into task graphs, dynamically schedules tasks, and dynamically rewrites skeletons, guided by a lightweight JIT trace-based cost model, to adapt the number and granularity of tasks for the architecture.
We outline the system architecture and prototype implementation in Racket/Pycket. As the current prototype does not yet automatically perform dynamic rewriting we present results based on manual offline rewriting, demonstrating that
(i) the system scales to hundreds of cores given enough parallelism of suitable granularity, and
(ii) the JIT trace cost model predicts granularity accurately enough to guide rewriting towards a good adaptive transformation.
@InProceedings{FHPC16p23,
author = {Patrick Maier and John Magnus Morton and Phil Trinder},
title = {JIT Costing Adaptive Skeletons for Performance Portability},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {23--30},
doi = {10.1145/2975991.2975995},
year = {2016},
}


Urms, Henrik

FHPC '16: "APL on GPUs: A TAIL from the ..."
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea
(University of Copenhagen, Denmark)
This paper demonstrates translation schemes by which programs
written in a functional subset of APL can be compiled to code that
is run efficiently on general purpose graphical processing units (GPGPUs).
Furthermore, the generated programs can be straightforwardly interoperated
with mainstream programming environments, such as Python, for example for
purposes of visualization and user interaction.
Finally, empirical evaluation shows that the GPGPU translation
achieves speedups up to hundreds of times faster than sequential
C compiled code.
@InProceedings{FHPC16p38,
author = {Troels Henriksen and Martin Dybdal and Henrik Urms and Anna Sofie Kiehn and Daniel Gavin and Hjalte Abelskov and Martin Elsman and Cosmin Oancea},
title = {APL on GPUs: A TAIL from the Past, Scribbled in Futhark},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {38--43},
doi = {10.1145/2975991.2975997},
year = {2016},
}


Yashiro, Hisashi

FHPC '16: "Automatic Generation of Efficient ..."
Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation
Takayuki Muranushi, Seiya Nishizawa, Hirofumi Tomita, Keigo Nitadori, Masaki Iwasawa, Yutaka Maruyama, Hisashi Yashiro, Yoshifumi Nakamura, Hideyuki Hotta, Junichiro Makino, Natsuki Hosono, and Hikaru Inoue
(RIKEN AICS, Japan; Chiba University, Japan; Kobe University, Japan; Kyoto University, Japan; Fujitsu, Japan)
Programming in HPC is a tedious work. Therefore functional programming languages that generate HPC programs have been proposed. However, they are not widely used by application scientists, because of learning barrier, and lack of demonstrated application performance. We have designed Formura which adopts application-friendly features such as typed rational array indices. Formura users can describe mathematical concepts such as operation over derivative operators using functional programming. Formura allows intuitive expression over array elements while ensuring the program is a stencil computation, so that state-of-the-art stencil optimization techniques such as temporal blocking is always applied to Formura-generated program. We demonstrate the usefulness of Formura by implementing a preliminary below-ground biology simulation. Optimized C-code are generated from 672 bytes of Formura program. The simulation was executed on the full nodes of the K computer, with 1.184 Pflops, 11.62% floating-point-instruction efficiency, and 31.26% memory throughput efficiency.
@InProceedings{FHPC16p17,
author = {Takayuki Muranushi and Seiya Nishizawa and Hirofumi Tomita and Keigo Nitadori and Masaki Iwasawa and Yutaka Maruyama and Hisashi Yashiro and Yoshifumi Nakamura and Hideyuki Hotta and Junichiro Makino and Natsuki Hosono and Hikaru Inoue},
title = {Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation},
booktitle = {Proc.\ FHPC},
publisher = {ACM},
pages = {17--22},
doi = {10.1145/2975991.2975994},
year = {2016},
}
