ACM SIGPLAN X10 Workshop (X10 2015)
June 14, 2015, Portland, OR, USA
Frontmatter
Foreword
Welcome to X10 2015, the 5th ACM SIGPLAN X10 Workshop, held in Portland, Oregon, USA, on June 14, 2015. The X10 Workshop provides a forum for X10 researchers, educators, and developers to interact with the larger X10 community by sharing their insights, experiences, and plans. Since its inception in 2011, the X10 Workshop has been co-located with the annual ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI).
Parallel Loops
Revisiting Loop Transformations with X10 Clocks
Tomofumi Yuki
(INRIA, France)
Loop transformations are known to be important for performance of compute-intensive programs, and are often used to expose parallelism. However, loop transformations often obfuscate the code and are cumbersome to apply by hand. The goal of this paper is to explore alternative methods for expressing parallelism that are more friendly to the programmer. In particular, we seek to expose parallelism without significantly changing the original loop structure. We illustrate how clocks in X10 can be used to express some of the traditional loop transformations, in the presence of parallelism, in a manner that we believe to be less invasive. Specifically, expressing parallelism corresponding to one-dimensional affine schedules can be achieved without modifying the original loop structure and/or statements.
@InProceedings{X1015p1,
author = {Tomofumi Yuki},
title = {Revisiting Loop Transformations with X10 Clocks},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {1--6},
doi = {},
year = {2015},
}
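X10's clocks have no direct counterpart in plain Java; `java.util.concurrent.Phaser` is the closest analogue. The following is a minimal sketch (class name, array sizes, and update rule are illustrative, not from the paper) of clock-style phased parallelism: one activity per array element, with the barrier playing the role of advancing the clock so that each phase reads only values written in the previous phase.

```java
import java.util.Arrays;
import java.util.concurrent.Phaser;

// Illustrative sketch: X10 clocks modeled with java.util.concurrent.Phaser.
// Each worker updates one array slot per time step; arriveAndAwaitAdvance()
// plays the role of advancing the clock, keeping neighbours in lock-step.
public class ClockSketch {
    static final int STEPS = 4, WIDTH = 5;

    public static int[] run() {
        int[][] grid = new int[2][WIDTH];          // double buffer
        for (int i = 0; i < WIDTH; i++) grid[0][i] = i;
        Phaser clock = new Phaser(WIDTH);          // one party per activity
        Thread[] workers = new Thread[WIDTH];
        for (int i = 0; i < WIDTH; i++) {
            final int idx = i;
            workers[i] = new Thread(() -> {
                for (int t = 0; t < STEPS; t++) {
                    int[] src = grid[t % 2], dst = grid[(t + 1) % 2];
                    int left = idx == 0 ? 0 : src[idx - 1];
                    dst[idx] = src[idx] + left;    // reads last phase only
                    clock.arriveAndAwaitAdvance(); // "advance the clock"
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return grid[STEPS % 2];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [0, 1, 6, 17, 32]
    }
}
```

Because every worker hits the barrier once per time step, the loop structure of the original stencil is preserved; only the synchronization is added.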
Local Parallel Iteration in X10
Josh Milthorpe
(IBM Research, USA)
X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism - foreach - for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10's work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads. The foreach mechanism is compared with leading programming technologies for shared-memory parallelism using kernel codes from high performance scientific applications. Experiments on a typical Intel multicore architecture show that X10 with foreach achieves parallel speedup comparable with OpenMP and TBB for several important patterns of iteration. foreach is composable with X10's asynchronous partitioned global address space model, and therefore represents a step towards a parallel programming model that can express the full range of parallelism in modern high performance computing systems.
@InProceedings{X1015p7,
author = {Josh Milthorpe},
title = {Local Parallel Iteration in X10},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {7--12},
doi = {},
year = {2015},
}
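The analogous local-parallel-iteration pattern can be sketched in plain Java (an illustration of the idea, not X10's actual foreach API): a parallel stream splits the iteration range across worker threads and merges worker-local partial results in the reduction, so the loop body itself needs no locking.

```java
import java.util.stream.IntStream;

// Illustrative Java analogue of foreach-style parallel iteration with a
// reduction: the range is split across workers, each accumulates a local
// partial sum, and sum() merges the partials without explicit locking.
public class ForeachSketch {
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                        .parallel()                  // split across workers
                        .mapToLong(i -> (long) i * i)
                        .sum();                      // merge local partials
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000)); // 332833500
    }
}
```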
Compilers and Runtimes
Cutting Out the Middleman: OS-Level Support for X10 Activities
Manuel Mohr, Sebastian Buchwald, Andreas Zwinkau, Christoph Erhardt, Benjamin Oechslein, Jens Schedel, and Daniel Lohmann
(KIT, Germany; University of Erlangen-Nuremberg, Germany)
In the X10 language, computations are modeled as lightweight threads called activities. Since most operating systems only offer relatively heavyweight kernel-level threads, the X10 runtime system implements a user-space scheduler to map activities to operating-system threads in a many-to-one fashion. This approach can lead to suboptimal scheduling decisions or synchronization overhead. In this paper, we present an alternative X10 runtime system that targets OctoPOS, an operating system designed from the ground up for highly parallel workloads on PGAS architectures. OctoPOS offers an unconventional execution model based on i-lets, lightweight self-contained units of computation with (mostly) run-to-completion semantics that can be dispatched very efficiently. We are able to do a 1-to-1 mapping of X10 activities to i-lets, which results in a slim runtime system, avoiding the need for user-level scheduling and its costs. We perform microbenchmarks on a prototype many-core hardware architecture and show that our system needs fewer than 2000 clock cycles to spawn local and remote activities.
@InProceedings{X1015p13,
author = {Manuel Mohr and Sebastian Buchwald and Andreas Zwinkau and Christoph Erhardt and Benjamin Oechslein and Jens Schedel and Daniel Lohmann},
title = {Cutting Out the Middleman: OS-Level Support for X10 Activities},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {13--18},
doi = {},
year = {2015},
}
Optimization of X10 Programs with ROSE Compiler Infrastructure
Michihiro Horie, Mikio Takeuchi, Kiyokuni Kawachiya, and David Grove
(IBM Research, Japan; IBM Research, USA)
X10 is a Java-like programming language that introduces new constructs to significantly simplify scale-out programming based on the Asynchronous Partitioned Global Address Space (APGAS) programming model. The fundamental goal of X10 is to enable scalable, high-performance, high-productivity programming of large scale computer systems for both conventional numerically intensive HPC workloads and for emerging "Big Data" workloads. X10 is implemented via source-to-source compilation; the X10 compiler takes as input X10 programs, applies high-level transformations primarily targeting X10's APGAS constructs, and outputs either C++ or Java source code that is further compiled to yield an executable program. ROSE is a multi-lingual compiler infrastructure for optimizing HPC applications using source-to-source transformations. It supports widely used programming models for parallel and distributed computing and provides a rich set of optimizations for serial programming models. In this paper, we report our early experiences connecting the X10 and ROSE compilers to enable X10 programs to benefit from ROSE's suite of optimizations. To demonstrate the applicability of our approach, we compiled the LULESH proxy application with the combined toolchain and obtained a 10% performance improvement.
@InProceedings{X1015p19,
author = {Michihiro Horie and Mikio Takeuchi and Kiyokuni Kawachiya and David Grove},
title = {Optimization of X10 Programs with ROSE Compiler Infrastructure},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {19--24},
doi = {},
year = {2015},
}
The APGAS Library: Resilient Parallel and Distributed Programming in Java 8
Olivier Tardieu
(IBM Research, USA)
We propose the APGAS library for Java 8. Inspired by the core constructs and semantics of the Resilient X10 programming language, APGAS brings many benefits of the X10 programming model to the Java programmer as a pure, idiomatic Java library. APGAS supports the development of resilient distributed applications running on elastic clusters of JVMs. It provides asynchronous lightweight tasks (local and remote), resilient distributed termination detection, and global heap references. We compare and contrast the X10 and APGAS programming styles, review key design choices, and demonstrate that APGAS achieves performance comparable with X10.
@InProceedings{X1015p25,
author = {Olivier Tardieu},
title = {The APGAS Library: Resilient Parallel and Distributed Programming in Java 8},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {25--26},
doi = {},
year = {2015},
}
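The finish/async semantics underlying this style can be modeled in plain Java (a simplified single-JVM sketch, not the APGAS library's actual API; all names here are hypothetical): finish blocks until every task spawned via async inside it has terminated, which is termination detection reduced to one process.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of finish/async semantics (illustrative, not the APGAS library API).
// Each async registers with the enclosing finish scope and arrives when done;
// finish blocks until every registered activity has arrived.
public class FinishSketch {
    static void finish(Phaser scope, Runnable body) {
        scope.register();                  // the finish itself is a party
        body.run();
        scope.arriveAndAwaitAdvance();     // wait for all spawned asyncs
    }

    static void async(ExecutorService pool, Phaser scope, Runnable task) {
        scope.register();                  // one party per spawned activity
        pool.submit(() -> {
            try { task.run(); } finally { scope.arrive(); }
        });
    }

    public static long countTo(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicLong sum = new AtomicLong();
        Phaser scope = new Phaser();
        finish(scope, () -> {
            for (int i = 1; i <= n; i++) {
                final int v = i;
                async(pool, scope, () -> sum.addAndGet(v));
            }
        });
        pool.shutdown();
        return sum.get();                  // safe: all asyncs have finished
    }

    public static void main(String[] args) {
        System.out.println(countTo(100)); // 5050
    }
}
```

The real library additionally handles remote tasks and failures; this sketch only captures the termination-detection contract of finish.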
Global Load Balancing
Towards an Efficient Fault-Tolerance Scheme for GLB
Claudia Fohry, Marco Bungart, and Jonas Posner
(University of Kassel, Germany)
X10's Global Load Balancing framework GLB implements a user-level task pool for inter-place load balancing. It is based on work stealing and deploys the lifeline algorithm. A single worker per place alternates between processing tasks and answering steal requests. We have devised an efficient fault-tolerance scheme for this algorithm, improving on a simpler resilience scheme from our own previous work. Among the base ideas of the new scheme are incremental backups of "stable" tasks and an actor-like communication structure. The paper reports on our ongoing work to extend the GLB framework accordingly. While details of the scheme are left out, we discuss implementation issues and preliminary experimental results.
@InProceedings{X1015p27,
author = {Claudia Fohry and Marco Bungart and Jonas Posner},
title = {Towards an Efficient Fault-Tolerance Scheme for GLB},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {27--32},
doi = {},
year = {2015},
}
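The baseline task-pool pattern that GLB-style schemes build on can be sketched as follows (a simplified, single-threaded illustration of the pattern; class and method names are hypothetical, and this is not the framework's code): the worker consumes tasks from one end of its pool, while a steal request removes half the tasks from the other end, a natural unit of granularity for checkpointing as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative single-worker task pool: the worker pops from one end,
// a thief's steal request takes half the pool from the other end.
public class TaskPoolSketch {
    final Deque<Integer> pool = new ArrayDeque<>();

    void add(int task) { pool.addLast(task); }

    Integer process() { return pool.pollLast(); }  // worker end

    Deque<Integer> steal() {                       // thief end: take half
        Deque<Integer> loot = new ArrayDeque<>();
        int half = pool.size() / 2;
        for (int i = 0; i < half; i++) loot.addLast(pool.pollFirst());
        return loot;
    }

    public static void main(String[] args) {
        TaskPoolSketch victim = new TaskPoolSketch();
        for (int t = 1; t <= 6; t++) victim.add(t);
        Deque<Integer> loot = victim.steal();
        System.out.println(loot.size() + " stolen, " + victim.pool.size() + " left");
    }
}
```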
Scalable Parallel Numerical Constraint Solver using Global Load Balancing
Daisuke Ishii, Kazuki Yoshizoe, and Toyotaro Suzumura
(Tokyo Institute of Technology, Japan; University of Tokyo, Japan; IBM Research, USA; University College Dublin, Ireland; JST, Japan)
We present a scalable parallel solver for numerical constraint satisfaction problems (NCSPs). Our parallelization scheme consists of homogeneous worker solvers, each of which runs on an available core and communicates with others via the global load balancing (GLB) method. The search tree of the branch and prune algorithm is split and distributed through the two phases of GLB: a random workload stealing phase and a workload distribution and termination phase based on a hyper-cube-shaped graph called lifeline. The parallel solver is simply implemented with X10 that provides an implementation of GLB as a library. In experiments, several NCSPs from the literature were solved and attained up to 516-fold speedup using 600 cores of the TSUBAME2.5 supercomputer. Optimal GLB configurations are analyzed.
@InProceedings{X1015p33,
author = {Daisuke Ishii and Kazuki Yoshizoe and Toyotaro Suzumura},
title = {Scalable Parallel Numerical Constraint Solver using Global Load Balancing},
booktitle = {Proc.\ X10},
publisher = {ACM},
pages = {33--38},
doi = {},
year = {2015},
}
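The lifeline topology mentioned in the abstract can be illustrated with a plain binary hypercube (an assumption for this sketch: a power-of-two place count; GLB's actual lifeline graph is configurable): each place's lifeline buddies are the places whose ids differ from it in exactly one bit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative binary-hypercube lifeline graph: place p is connected to
// every place whose id differs from p in exactly one bit.
public class LifelineSketch {
    static List<Integer> lifelines(int place, int places) {
        List<Integer> buddies = new ArrayList<>();
        for (int bit = 1; bit < places; bit <<= 1) {
            buddies.add(place ^ bit);      // flip one bit of the place id
        }
        return buddies;
    }

    public static void main(String[] args) {
        System.out.println(lifelines(5, 8)); // 5 = 0b101 -> [4, 7, 1]
    }
}
```

When a place runs out of work and random stealing fails, it asks its lifeline buddies; the low diameter of the hypercube keeps workload distribution and termination detection fast.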