
ACM SIGPLAN X10 Workshop (X10 2015), June 14, 2015, Portland, OR, USA

X10 2015 – Proceedings



Frontmatter

Title Page


Foreword
Welcome to X10 2015, the 5th ACM SIGPLAN X10 Workshop, held in Portland, Oregon, USA, on June 14, 2015. The X10 Workshop provides a forum for X10 researchers, educators, and developers to interact with the larger X10 community by sharing their insights, experiences, and plans. Since its inception in 2011, the X10 Workshop has been co-located with the annual ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI).

Parallel Loops

Revisiting Loop Transformations with X10 Clocks
Tomofumi Yuki
(INRIA, France)
Loop transformations are known to be important for the performance of compute-intensive programs and are often used to expose parallelism. However, many loop transformations obfuscate the code and are cumbersome to apply by hand. The goal of this paper is to explore alternative methods for expressing parallelism that are friendlier to the programmer. In particular, we seek to expose parallelism without significantly changing the original loop structure. We illustrate how clocks in X10 can be used to express some of the traditional loop transformations, in the presence of parallelism, in a manner that we believe to be less invasive. Specifically, parallelism corresponding to one-dimensional affine schedules can be expressed without modifying the original loop structure or statements.
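
As a rough illustration (our sketch, assuming a simple one-dimensional stencil; this is not code from the paper), the iterations of the outer loop below run as clocked activities, and the two advance() calls separate the read and write phases of each time step, so the original loop structure and statement survive unchanged:

    // Hedged sketch, not the paper's code: N, T, and the stencil body are
    // illustrative. Each outer iteration becomes one clocked activity.
    public class ClockedStencil {
        public static def main(args:Rail[String]) {
            val N = 16;
            val T = 8;
            val a = new Rail[Double](N, (i:Long) => i as Double);
            finish {
                val c = Clock.make();
                for (i in 1..(N-2)) async clocked(c) {
                    for (t in 1..T) {
                        val v = (a(i-1) + a(i) + a(i+1)) / 3.0; // statement kept as-is
                        c.advance(); // barrier: all phase-t reads are done
                        a(i) = v;
                        c.advance(); // barrier: all phase-t writes are visible
                    }
                }
                c.drop(); // the spawning activity deregisters from the clock
            }
            Console.OUT.println(a(1));
        }
    }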

Local Parallel Iteration in X10
Josh Milthorpe
(IBM Research, USA)
X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism, foreach, for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10's work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads. The foreach mechanism is compared with leading programming technologies for shared-memory parallelism using kernel codes from high-performance scientific applications. Experiments on a typical Intel multicore architecture show that X10 with foreach achieves parallel speedup comparable with OpenMP and TBB for several important patterns of iteration. foreach is composable with X10's asynchronous partitioned global address space model, and therefore represents a step towards a parallel programming model that can express the full range of parallelism in modern high-performance computing systems.
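
For flavor, here is a hedged sketch (our illustration; the exact surface syntax in the paper may differ) contrasting the naive one-activity-per-iteration idiom with the proposed construct, for a vector addition within a single place:

    val N = 1000000;
    val a = new Rail[Double](N, 1.0);
    val b = new Rail[Double](N, 2.0);
    val c = new Rail[Double](N);

    // Naive idiom foreach replaces: one activity per iteration, so activity
    // creation and termination dominate for fine-grained bodies.
    finish for (i in 0..(N-1)) async {
        c(i) = a(i) + b(i);
    }

    // Proposed mechanism: library code splits the range (e.g., by recursive
    // bisection) into a handful of activities sized for the work-stealing
    // runtime, with companion forms for reductions and worker-local data.
    foreach (i in 0..(N-1)) {
        c(i) = a(i) + b(i);
    }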

Compilers and Runtimes

Cutting Out the Middleman: OS-Level Support for X10 Activities
Manuel Mohr, Sebastian Buchwald, Andreas Zwinkau, Christoph Erhardt, Benjamin Oechslein, Jens Schedel, and Daniel Lohmann
(KIT, Germany; University of Erlangen-Nuremberg, Germany)
In the X10 language, computations are modeled as lightweight threads called activities. Since most operating systems offer only relatively heavyweight kernel-level threads, the X10 runtime system implements a user-space scheduler to map activities to operating-system threads in a many-to-one fashion. This approach can lead to suboptimal scheduling decisions or synchronization overhead. In this paper, we present an alternative X10 runtime system that targets OctoPOS, an operating system designed from the ground up for highly parallel workloads on PGAS architectures. OctoPOS offers an unconventional execution model based on i-lets, lightweight self-contained units of computation with (mostly) run-to-completion semantics that can be dispatched very efficiently. We map X10 activities one-to-one onto i-lets, which results in a slim runtime system that avoids the need for user-level scheduling and its costs. We perform microbenchmarks on a prototype many-core hardware architecture and show that our system needs fewer than 2000 clock cycles to spawn local and remote activities.
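
For context, the operations being timed are ordinary X10 activity spawns (a minimal sketch of ours, not the paper's benchmark code; doWork() is a placeholder body); on OctoPOS each spawn dispatches exactly one i-let:

    finish async doWork();                  // local activity: one i-let at this place
    finish at (here.next()) async doWork(); // remote activity: one i-let dispatched
                                            // at the next place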

Optimization of X10 Programs with ROSE Compiler Infrastructure
Michihiro Horie, Mikio Takeuchi, Kiyokuni Kawachiya, and David Grove
(IBM Research, Japan; IBM Research, USA)
X10 is a Java-like programming language that introduces new constructs to significantly simplify scale-out programming based on the Asynchronous Partitioned Global Address Space (APGAS) programming model. The fundamental goal of X10 is to enable scalable, high-performance, high-productivity programming of large-scale computer systems for both conventional numerically intensive HPC workloads and for emerging “Big Data” workloads. X10 is implemented via source-to-source compilation: the X10 compiler takes X10 programs as input, applies high-level transformations primarily targeting X10's APGAS constructs, and outputs either C++ or Java source code that is further compiled to yield an executable program. ROSE is a multi-lingual compiler infrastructure for optimizing HPC applications using source-to-source transformations. It supports widely used programming models for parallel and distributed computing and provides a rich set of optimizations for serial programming models. In this paper, we report our early experiences connecting the X10 and ROSE compilers to enable X10 programs to benefit from ROSE's suite of optimizations. To demonstrate the applicability of our approach, we compiled the LULESH proxy application with the combined toolchain and obtained a 10% performance improvement.

The APGAS Library: Resilient Parallel and Distributed Programming in Java 8
Olivier Tardieu
(IBM Research, USA)
We propose the APGAS library for Java 8. Inspired by the core constructs and semantics of the Resilient X10 programming language, APGAS brings many benefits of the X10 programming model to the Java programmer as a pure, idiomatic Java library.
APGAS supports the development of resilient distributed applications running on elastic clusters of JVMs. It provides asynchronous lightweight tasks (local and remote), resilient distributed termination detection, and global heap references.
We compare and contrast the X10 and APGAS programming styles, review key design choices, and demonstrate that APGAS achieves performance comparable with X10.
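
As a hedged sketch of that comparison (our illustration; the Java forms follow the library's documented style but are not verbatim from the paper, and doLocal()/doRemote() are placeholders), each X10 construct below has a direct APGAS counterpart in Java 8:

    finish {                            // APGAS: finish(() -> { ... })
        async doLocal();                // APGAS: async(() -> ...), a local task
        at (Place(1)) async doRemote(); // APGAS: asyncAt(place(1), () -> ...),
                                        // a remote task
    }
    // X10's GlobalRef[T] corresponds to APGAS global heap references, and under
    // the resilient semantics a failed place surfaces as an exception at the
    // enclosing finish in both models.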

Global Load Balancing

Towards an Efficient Fault-Tolerance Scheme for GLB
Claudia Fohry, Marco Bungart, and Jonas Posner
(University of Kassel, Germany)
X10's Global Load Balancing framework GLB implements a user-level task pool for inter-place load balancing. It is based on work stealing and deploys the lifeline algorithm. A single worker per place alternates between processing tasks and answering steal requests. We have devised an efficient fault-tolerance scheme for this algorithm, improving on a simpler resilience scheme from our own previous work. Among the base ideas of the new scheme are incremental backups of "stable" tasks and an actor-like communication structure. The paper reports on our ongoing work to extend the GLB framework accordingly. While details of the scheme are left out, we discuss implementation issues and preliminary experimental results.

Scalable Parallel Numerical Constraint Solver using Global Load Balancing
Daisuke Ishii, Kazuki Yoshizoe, and Toyotaro Suzumura
(Tokyo Institute of Technology, Japan; University of Tokyo, Japan; IBM Research, USA; University College Dublin, Ireland; JST, Japan)
We present a scalable parallel solver for numerical constraint satisfaction problems (NCSPs). Our parallelization scheme consists of homogeneous worker solvers, each of which runs on an available core and communicates with the others via the global load balancing (GLB) method. The search tree of the branch-and-prune algorithm is split and distributed through the two phases of GLB: a random workload-stealing phase, and a workload distribution and termination phase based on a hypercube-shaped graph called the lifeline. The parallel solver is implemented simply in X10, which provides an implementation of GLB as a library. In experiments, several NCSPs from the literature were solved, attaining up to a 516-fold speedup using 600 cores of the TSUBAME2.5 supercomputer. Optimal GLB configurations are analyzed.
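
To sketch how branch and prune maps onto GLB (a hypothetical outline of ours; Box, contract, width, lowerHalf, upperHalf, and report are illustrative names, not the paper's API), each unit of work narrows one search box and either discards it, reports it, or splits it into subboxes that become new stealable tasks:

    // Hypothetical sketch: sequential branch and prune over one box. GLB
    // load-balances the set of pending boxes across workers and places.
    def branchAndPrune(box:Box, eps:Double) {
        val b = contract(box);       // prune: narrow the box w.r.t. the constraints
        if (b.isEmpty()) return;     // inconsistent: discard this branch
        if (b.width() < eps) {       // small enough: report a solution box
            report(b);
            return;
        }
        branchAndPrune(b.lowerHalf(), eps); // branch: split the widest interval
        branchAndPrune(b.upperHalf(), eps); // and search both halves
    }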
