3rd International Workshop on Software Engineering for Parallel Systems (SEPS 2016),
November 1, 2016,
Amsterdam, Netherlands
Frontmatter
Message from the Chairs
Welcome to the third International Workshop on Software Engineering for Parallel Systems (SEPS), held in Amsterdam, The Netherlands, on November 1, 2016, and co-located with the ACM SIGPLAN conference on Systems, Programming, Languages and Applications: Software for Humanity (SPLASH 2016). The purpose of this workshop is to provide a stable forum for researchers and practitioners dealing with the compelling challenges of the software development life cycle on modern parallel platforms.
Papers
Reducing Parallelizing Compilation Time by Removing Redundant Analysis
Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, Moriyuki Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, Keiji Kimura, and Hironori Kasahara
(Waseda University, Japan; OSCAR TECHNOLOGY, Japan)
Parallelizing compilers equipped with powerful compiler optimizations are essential tools to fully exploit the performance of today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time of such powerful compilers grows ever larger for real commercial applications because of these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea of the proposed technique is based on the observation that parallelizing compilers apply multiple program analysis passes and restructuring passes to a source program, but not every analysis pass has to be applied to the whole source program. There is therefore an opportunity to reduce compilation time by removing redundant program analysis. In this paper we describe a technique for removing redundant program analyses that accounts for the inter-procedural propagation of analysis update information. We implement the proposed technique in the OSCAR automatic multigrain parallelizing compiler and evaluate it using three proprietary large-scale programs. The proposed technique removes 37.7% of program analysis time on average for basic analyses, including def-use analysis and dependence calculation, and 51.7% for pointer analysis.
@InProceedings{SEPS16p1,
author = {Jixin Han and Rina Fujino and Ryota Tamura and Mamoru Shimaoka and Hiroki Mikami and Moriyuki Takamura and Sachio Kamiya and Kazuhiko Suzuki and Takahiro Miyajima and Keiji Kimura and Hironori Kasahara},
title = {Reducing Parallelizing Compilation Time by Removing Redundant Analysis},
booktitle = {Proc.\ SEPS},
publisher = {ACM},
pages = {1--9},
doi = {},
year = {2016},
}
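To make the idea concrete, the following is a minimal sketch (not the OSCAR compiler's actual implementation) of caching analysis results per procedure and re-running an analysis only where a restructuring pass has invalidated it, with invalidations propagated inter-procedurally to callers. The types Procedure, AnalysisKind, and AnalysisCache are hypothetical illustrations.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class AnalysisKind { DefUse, Dependence, Pointer };

struct Procedure {
    std::string name;
    std::uint64_t irVersion = 0;        // bumped by every restructuring pass
    std::vector<Procedure*> callers;    // reverse call-graph edges
};

class AnalysisCache {
    // (procedure, analysis) -> IR version the cached result was computed at
    std::map<std::pair<const Procedure*, AnalysisKind>, std::uint64_t> computedAt;

public:
    // A restructuring pass reports each procedure it modified; the update
    // is propagated inter-procedurally to its callers (an acyclic call
    // graph is assumed for brevity).
    void invalidate(Procedure& p) {
        ++p.irVersion;
        for (Procedure* c : p.callers) invalidate(*c);
    }

    bool isStale(const Procedure& p, AnalysisKind a) const {
        auto it = computedAt.find({&p, a});
        return it == computedAt.end() || it->second != p.irVersion;
    }

    // Re-run an analysis only if its cached result has been invalidated.
    void analyze(Procedure& p, AnalysisKind a) {
        if (!isStale(p, a)) return;     // redundant analysis skipped
        // ... run the real def-use / dependence / pointer analysis here ...
        computedAt[{&p, a}] = p.irVersion;
    }
};

int main() {
    Procedure kernel, caller;
    kernel.name = "kernel";
    caller.name = "main";
    kernel.callers.push_back(&caller);

    AnalysisCache cache;
    cache.analyze(kernel, AnalysisKind::DefUse);
    cache.analyze(caller, AnalysisKind::DefUse);

    cache.invalidate(kernel);           // a restructuring pass changed kernel
    std::cout << std::boolalpha << "caller stale: "
              << cache.isStale(caller, AnalysisKind::DefUse)
              << "\n";                  // true: def-use info must be recomputed
}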
A Divide-and-Conquer Parallel Pattern Implementation for Multicores
Marco Danelutto, Tiziano De Matteis, Gabriele Mencagli, and Massimo Torquati
(University of Pisa, Italy)
Divide-and-Conquer (DaC) is a sequential programming paradigm that models a large class of algorithms used in real-life applications. Although DaC algorithms expose parallelism in a straightforward way, their parallel implementation still requires some expertise in parallel programming tools from the programmer.
In this paper we aim to provide non-expert programmers with a high-level solution for quickly prototyping parallel DaC programs on multicores with minimal programming effort.
Following the rationale of the parallel design pattern methodology, we design a C++11-compliant template interface for developing parallel DaC programs. The interface is implemented on top of different back-end frameworks (namely OpenMP, Intel TBB, and FastFlow), supporting source code reuse and a certain amount of performance portability.
Experiments on a 24-core Intel server show the effectiveness of our approach: with reduced programming effort, the programmer can easily prototype parallel versions whose performance is comparable to that of hand-written parallelizations.
@InProceedings{SEPS16p10,
author = {Marco Danelutto and Tiziano De Matteis and Gabriele Mencagli and Massimo Torquati},
title = {A Divide-and-Conquer Parallel Pattern Implementation for Multicores},
booktitle = {Proc.\ SEPS},
publisher = {ACM},
pages = {10--19},
doi = {},
year = {2016},
}
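The abstract describes a C++11 template interface parameterized over the ingredients of a DaC computation. The following is a minimal sketch of what such an interface can look like, using std::async in place of the OpenMP/TBB/FastFlow back-ends; the dac function and its parameter names are our illustration, not the authors' actual API.

#include <algorithm>
#include <cstddef>
#include <future>
#include <iostream>
#include <iterator>
#include <vector>

// Generic divide-and-conquer skeleton: the recursion spawns asynchronous
// tasks near the root of the tree and falls back to sequential recursion
// below a fixed depth.
template <typename Problem, typename Result,
          typename IsBase, typename SolveBase, typename Divide, typename Combine>
Result dac(const Problem& p, IsBase is_base, SolveBase solve_base,
           Divide divide, Combine combine, int depth = 0) {
    if (is_base(p))
        return solve_base(p);
    auto subs = divide(p);                       // split into sub-problems
    std::vector<Result> partial(subs.size());
    if (depth < 4) {                             // parallel branch
        std::vector<std::future<Result>> futs;
        futs.reserve(subs.size());
        for (std::size_t i = 0; i < subs.size(); ++i)
            futs.push_back(std::async(std::launch::async,
                [&subs, i, &is_base, &solve_base, &divide, &combine, depth] {
                    return dac<Problem, Result>(subs[i], is_base, solve_base,
                                                divide, combine, depth + 1);
                }));
        for (std::size_t i = 0; i < subs.size(); ++i)
            partial[i] = futs[i].get();
    } else {                                     // sequential branch
        for (std::size_t i = 0; i < subs.size(); ++i)
            partial[i] = dac<Problem, Result>(subs[i], is_base, solve_base,
                                              divide, combine, depth + 1);
    }
    return combine(partial);                     // merge partial results
}

int main() {
    // Parallel mergesort expressed with the skeleton.
    std::vector<int> v(1 << 20);
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(v.size() - i);

    auto sorted = dac<std::vector<int>, std::vector<int>>(
        v,
        [](const std::vector<int>& p) { return p.size() <= 1024; },  // base case?
        [](std::vector<int> p) { std::sort(p.begin(), p.end()); return p; },
        [](const std::vector<int>& p) {                              // divide
            std::size_t mid = p.size() / 2;
            return std::vector<std::vector<int>>{
                std::vector<int>(p.begin(), p.begin() + mid),
                std::vector<int>(p.begin() + mid, p.end())};
        },
        [](const std::vector<std::vector<int>>& parts) {             // combine
            std::vector<int> out;
            std::merge(parts[0].begin(), parts[0].end(),
                       parts[1].begin(), parts[1].end(),
                       std::back_inserter(out));
            return out;
        });

    std::cout << "is_sorted: " << std::boolalpha
              << std::is_sorted(sorted.begin(), sorted.end()) << "\n";
}

The fixed depth cut-off is one simple way to bound task-creation overhead; skeleton frameworks typically expose this as a tunable parameter.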
Parallel Evaluation of a DSP Algorithm using Julia
Peter Kourzanov
(NXP, Netherlands; Delft University of Technology, Netherlands)
The rapid pace of innovation in industrial research labs requires fast algorithm evaluation cycles. The use of multi-core hardware and distributed clusters is essential to achieve reasonable turnaround times for high-load simulations. Julia's support for both, as well as its pervasive multiple dispatch, makes it very attractive for high-performance technical computing.
Our experiments in speeding up the simulation of a Digital Signal Processing (DSP) Intellectual Property (IP) model for a Wireless LAN (WLAN) product confirm this. We augment the standard SystemC High-Level Synthesis (HLS) tool-flow with an interactive worksheet supporting performance visualization and rapid design-space exploration cycles.
@InProceedings{SEPS16p20,
author = {Peter Kourzanov},
title = {Parallel Evaluation of a DSP Algorithm using Julia},
booktitle = {Proc.\ SEPS},
publisher = {ACM},
pages = {20--24},
doi = {},
year = {2016},
}
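The paper's code is written in Julia; purely to illustrate the execution pattern it relies on (independent simulation runs, one design point each, fanned out across cores), here is a generic C++ sketch. The Params structure and simulate function are hypothetical stand-ins for the real WLAN IP model.

#include <cmath>
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

// One point in the design space being explored.
struct Params { double snrDb; int nSymbols; };

// Hypothetical stand-in for one run of the DSP IP model; here just a
// closed-form BPSK bit-error-rate curve so the example is self-contained.
double simulate(const Params& p) {
    double ebn0 = std::pow(10.0, p.snrDb / 10.0);
    return 0.5 * std::erfc(std::sqrt(ebn0));
}

int main() {
    std::vector<Params> sweep;
    for (double snr = 0.0; snr <= 20.0; snr += 2.0)
        sweep.push_back({snr, 10000});

    // One asynchronous task per design point; a production harness would
    // use a thread pool or spread the points over a cluster.
    std::vector<std::future<double>> runs;
    for (const Params& p : sweep)
        runs.push_back(std::async(std::launch::async, simulate, p));

    for (std::size_t i = 0; i < sweep.size(); ++i)
        std::cout << "SNR " << sweep[i].snrDb << " dB -> BER "
                  << runs[i].get() << "\n";
}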
Exhaustive Analysis of Thread-Level Speculation
Clark Verbrugge, Christopher J. F. Pickett, Alexander Krolik, and Allan Kielstra
(McGill University, Canada; IBM, Canada)
Thread-Level Speculation (TLS) is a technique for automatic parallelization. The complexity of even prototype implementations, however, limits the ability to explore and compare the wide variety of possible design choices, and also makes performance characteristics difficult to understand. In this work we build a general analytical model of the method-level variant of TLS, which we can use to determine program speedup under a wide range of TLS designs. Our approach is exhaustive: using either simple brute force or a more efficient dynamic programming implementation, we are able to show how performance is strongly limited by program structure as well as by core choices in speculation design, irrespective of and complementary to the impact of data dependencies. These results provide new, high-level insight into where and how thread-level speculation can and should be applied in order to produce practical speedup.
@InProceedings{SEPS16p25,
author = {Clark Verbrugge and Christopher J. F. Pickett and Alexander Krolik and Allan Kielstra},
title = {Exhaustive Analysis of Thread-Level Speculation},
booktitle = {Proc.\ SEPS},
publisher = {ACM},
pages = {25--34},
doi = {},
year = {2016},
}
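To give a flavour of what an analytical model of method-level TLS can look like, the following toy dynamic-programming sketch chooses, at every call site, the cheaper of sequential execution and speculating the continuation in parallel with the callee, paying a fork overhead and, on misspeculation, re-executing the continuation. The cost model and the constants forkOverhead and probSuccess are illustrative assumptions, not the paper's actual model.

#include <algorithm>
#include <iostream>
#include <memory>
#include <vector>

// Each node is a method invocation: some local work plus a sequence of
// calls to child methods, in program order.
struct CallNode {
    double localWork = 0;
    std::vector<std::unique_ptr<CallNode>> calls;
};

constexpr double forkOverhead = 5.0;  // assumed cost of spawning a thread
constexpr double probSuccess  = 0.9;  // assumed commit probability

// Plain sequential execution time of the subtree.
double seqCost(const CallNode& n) {
    double t = n.localWork;
    for (const auto& c : n.calls) t += seqCost(*c);
    return t;
}

// Expected execution time when, at every call site, we pick the cheaper of
// (a) running callee then continuation sequentially and (b) speculating the
// continuation in parallel with the callee; computed right-to-left over the
// call sequence so each suffix cost is reused (dynamic programming).
double tlsCost(const CallNode& n) {
    std::vector<double> childCost;
    for (const auto& c : n.calls) childCost.push_back(tlsCost(*c));
    double rest = 0.0;  // cost of the (empty) continuation after the last call
    for (int i = static_cast<int>(n.calls.size()) - 1; i >= 0; --i) {
        double seq  = childCost[i] + rest;
        double spec = forkOverhead + std::max(childCost[i], rest)
                    + (1.0 - probSuccess) * rest;   // abort: redo continuation
        rest = std::min(seq, spec);
    }
    return n.localWork + rest;
}

std::unique_ptr<CallNode> method(double w) {
    auto n = std::make_unique<CallNode>();
    n->localWork = w;
    return n;
}

int main() {
    // A small call tree: main calls two methods, each calling two leaves.
    auto root = method(10);
    for (int i = 0; i < 2; ++i) {
        auto m = method(40);
        m->calls.push_back(method(30));
        m->calls.push_back(method(30));
        root->calls.push_back(std::move(m));
    }
    double s = seqCost(*root), t = tlsCost(*root);
    std::cout << "sequential " << s << ", speculative " << t
              << ", predicted speedup " << s / t << "\n";
}

On this toy call tree the model predicts a speedup of about 2.1, illustrating how program structure alone, before any data dependencies are considered, bounds the attainable speedup.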