PLDI 2019 Workshops
40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019)

ACM SIGPLAN 6th Chapel Implementers and Users Workshop (CHIUW 2019), June 22, 2019, Phoenix, AZ, USA

CHIUW 2019 – Preliminary Table of Contents




Title Page

Welcome from the Chairs

Welcome to the ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW 2019). This year, CHIUW is being held in Phoenix, Arizona on June 22nd in conjunction with PLDI 2019 and FCRC 2019. We would like to thank these conferences and ACM SIGPLAN for sponsoring this workshop.



Programming Abstractions for Orchestration of HPC Scientific Computing (Keynote)
Anshu Dubey
(Argonne National Laboratory, USA)
Application developers are confronted with three axes of increasing complexity going forward: increasing heterogeneity in computing platforms at all levels, increasing heterogeneity in solvers and data management, and moving existing code bases to future programming models. While the first two will dictate which future programming models may deliver the needed performance, the third will determine their adoption. However, it is clear that the infrastructure backbone of large-scale multiphysics software has to orchestrate data and task movement between devices. The lifecycle of scientific software is several times that of platforms; therefore, any orchestration mechanism must be flexible and configurable enough to remain usable on future platforms. In this presentation, I will outline a model of an orchestration framework and the demands that it will place on programming models and languages.

Chapel Implementation Improvements

GPUIterator: Bridging the Gap between Chapel and GPU Platforms
Akihiro Hayashi, Sri Raj Paul, and Vivek Sarkar
(Rice University, USA; Georgia Institute of Technology, USA)
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Calling Chapel Code: Interoperability Improvements
Lydia Duncan and David Iten
(Cray, USA)
Since CHIUW last year, the Chapel team has undertaken an effort to improve the ability to call Chapel code from other languages. This talk will cover a few areas of improvement: using Chapel code as a library from C, Python, and Fortran; and in addition, improvements to array interoperation.

Chapel Performance and Optimization

Towards Radix Sorting in the Chapel Standard Library
Michael Ferguson
(Cray, USA)
This talk will discuss recent work improving the Sort module of the Chapel programming language. It will discuss an interface design to support radix sort, describe the implementation of radix sort, compare the performance of this implementation to sort libraries in other languages, and finally discuss distributed sorting.
Implementing Stencil Problems in Chapel: An Experience Report
Per Fuchs, Pieter Hijma, and Clemens Grelck
(Vrije Universiteit Amsterdam, Netherlands; University of Amsterdam, Netherlands)
Stencil operations represent a fundamental class of algorithms in high-performance computing. We are interested in what level of performance can be expected from a high-productivity language such as Chapel. To this end, we discuss four different implementations of a generic stencil operation with a convergence check after each iteration. We start with a sequential implementation followed by a global-view implementation that we experiment with both on a 16-core multi-core system as well as on a cluster with up to 16 such nodes using domain maps. We finish with a local-view implementation that explicitly encodes all design decisions with respect to parallel execution. This paper is set up as a two-stage experience report: we mainly report our findings from the users' perspective without any feedback from the Chapel implementers. We then report additional analysis performed under guidance of the Chapel team. Our experimental findings show that Chapel performs as expected on a single node. However, it does not achieve the expected levels of performance on our multi-node system, neither with the data-parallel global-view approach, nor with the task-parallel local-view code. We discuss the root causes of our reduced performance in detail and report possible solutions.
Artifacts Available
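As an illustration of what "a generic stencil operation with a convergence check after each iteration" can look like, the following is a minimal sequential sketch in Python. It is not the authors' Chapel code: the 4-point averaging stencil, the function names (`jacobi_step`, `max_delta`, `solve`), and the tolerance are all assumptions chosen for illustration.

```python
def jacobi_step(grid):
    """Return a new grid in which every interior cell is the mean of its
    four neighbours; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def max_delta(a, b):
    """Largest absolute cell-wise difference between two grids."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def solve(grid, tol=1e-6, max_iters=10_000):
    """Apply the stencil repeatedly, checking convergence after each sweep.

    Returns the final grid and the number of sweeps performed.
    """
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(grid)
        if max_delta(grid, new) < tol:  # convergence check per iteration
            return new, iteration
        grid = new
    return grid, max_iters


# Hypothetical example problem: a 5x5 plate whose top edge is held at 1.0.
start = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
result, sweeps = solve(start)
```

The convergence check is a global reduction over the whole grid, which is the step that requires communication once the grid is distributed across nodes.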
Chapel Unblocked: Recent Communication Optimizations in Chapel
Elliot Ronaghan, Ben Harshbarger, Gregory Titus, and Michael Ferguson
(Cray, USA)
This talk will highlight communication optimizations made to the Chapel compiler and runtime over the past year. It will focus on improvements to core benchmarks that have benefited from fine-grained and bulk communication optimizations as well as remote task-spawning improvements. Several benchmarks including HPC Challenge (HPCC) RandomAccess, HPCC Stream Triad, and an integer sort code ISx will be briefly introduced, and a relevant performance optimization will be showcased. These benchmarks represent core idioms that are common in many HPC applications. Performance results on up to 1,024 nodes (25,000 cores) will demonstrate that with each release Chapel is becoming more competitive against hand-tuned MPI+OpenMP, SHMEM, and UPC.

Applications of Chapel

Arkouda: Interactive Data Exploration Backed by Chapel
Michael Merrill, William Reus, and Timothy Neumann

Exploratory data analysis (EDA) is the prerequisite for all data science. EDA is non-negotiably interactive—by far the most popular environment for EDA is a Jupyter notebook—and, as datasets grow, increasingly computationally intensive. Several existing projects attempt to combine interactivity and distributed computation using programming paradigms and tools from cloud computing, but none of these projects have come close to meeting our needs for high-performance EDA. To fill this gap, we have developed a prototype, called arkouda, that allows a user to interactively issue massively parallel computations on distributed data. We designed the API of arkouda to closely mimic NumPy, the underlying computational library used in approximately 80% of EDA workflows (based on a sample of Jupyter notebooks). Our vision is that users will import arkouda as a Python module in place of NumPy (e.g. “import arkouda as np”) and use familiar NumPy functions and syntax to interact with arrays of data residing on an HPC. The computational heart of arkouda is a Chapel interpreter that accepts a pre-defined set of commands from the Python frontend and uses Chapel’s built-in machinery for multi-locale and multithreaded execution. While arkouda, in our experience, comes closer than anything else to enabling high-performance EDA, the process of developing arkouda has also helped identify ways Chapel must improve in order to become a truly productive language for data science.

Chapel Graph Library (CGL)
Louis Jenkins and Marcin Zalewski
(University of Rochester, USA; Pacific Northwest National Laboratory, USA; Northwest Institute for Advanced Computing, USA)
In this talk, I summarize prior work on the Chapel HyperGraph Library (CHGL) and the Chapel Aggregation Library (CAL), and introduce the more general Chapel Graph Library (CGL). CGL is being designed to enable global-view programming, such that locality is abstracted from the user. CGL is also being designed in a way that is similar to Chapel's multiresolution design philosophy, where graphs are implemented in terms of hypergraphs, and where both the underlying hypergraph and overlying graphs are available for use. Some of the kinds of graphs being designed are bipartite graphs, directed and undirected graphs, and even trees.
Chapel in Cray HPO
Benjamin Albrecht, Alex Heye, and Benjamin Robbins
(Cray, USA)

Cray HPO is a module of the data science workflow framework known as CrayAI. This module was released on Urika XC 1.2 and Urika CS 1.1, making Cray HPO the first Cray product built with Chapel. This talk will cover the importance, the technical aspects, and the broader vision for Cray HPO including Chapel’s role in the project.

Machine learning models can be described as a set of mathematical functions that define a relationship between different aspects of data. For example, a linear regression model uses a linear function to define the relationship between features and target data. The weight vector that contains weight values for each feature is the set of model parameters that are tuned to improve the model’s ability to predict the target data.

There is a separate set of parameters known as hyperparameters which determine how to tune the model parameters. For example, the learning rate schedule in an artificial neural network model is a hyperparameter. Hyperparameters play an important role in training models. They can dramatically impact the accuracy, time-to-accuracy, and over- or under-fitting of the model to the data. Because of this, it is considered good practice to tune the hyperparameters for each model and data set.

Optimizing hyperparameters is generally tackled with a different set of tools than those used when optimizing model parameters. The quality of a set of hyperparameters is determined by training and evaluating the model, a black box approach with no closed-form formula available like the loss functions defined in model parameter optimization. Because evaluation of a set of hyperparameters requires fully training a model, HPO can be computationally demanding, making HPO a good candidate for high performance computing.

Cray HPO provides a Python interface for various distributed hyperparameter optimization techniques, which are implemented in Chapel under the hood. These approaches include grid search, random search, and genetic search. Additionally, Cray HPO supports a training schedule hyperparameter optimization technique known as population-based training (PBT). This technique evaluates a set of hyperparameters per epoch instead of per full training run, which effectively trains the model while optimizing the hyperparameters for each iteration. The Cray HPO PBT implementation includes several extensions to the original PBT published by DeepMind, including sexual reproduction with crossover as well as decoupling hyperparameters and parameters when choosing parents.

Implementing Cray HPO in Chapel has made it easier to achieve the necessary performance from the HPO code such that the model evaluation remains the bottleneck. This is not difficult to achieve for traditional HPOs, but does become challenging for single-epoch evaluations in PBT. The library interface relies on Chapel’s recently developed Python interoperability features and helped test and guide the development of these features. Cray HPO being implemented in Chapel will lower the barrier for future components of CrayAI to be developed in Chapel. One such example is feature selection, an area where distributed capabilities and performance are crucial.
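
The black-box nature of grid search and random search, as named in the abstract above, can be sketched in a few lines of Python. This is not the Cray HPO interface: the `evaluate` function is a hypothetical stand-in for "train the model and return its loss", and the hyperparameter names `lr` and `reg`, along with the functions `grid_search` and `random_search`, are assumptions made for illustration.

```python
import itertools
import random


def evaluate(params):
    """Hypothetical stand-in for training a model and returning its loss.
    A real evaluation would run a full training job."""
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 0.01) ** 2


def grid_search(space):
    """Evaluate every combination of the listed candidate values."""
    names = list(space)
    best = None
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best


def random_search(bounds, n_trials, seed=0):
    """Evaluate n_trials points drawn uniformly from the given ranges."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {n: rng.uniform(lo, hi) for n, (lo, hi) in bounds.items()}
        loss = evaluate(params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best


grid_best = grid_search({"lr": [0.01, 0.1, 1.0], "reg": [0.0, 0.01, 0.1]})
rand_best = random_search({"lr": (0.0, 1.0), "reg": (0.0, 0.1)}, n_trials=50)
```

Because each call to `evaluate` is independent of the others, the trials in both loops can be run in parallel, which is what makes these methods natural candidates for distributed execution.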

