IWMSE 2011 – Proceedings

Mesa: Automatic Generation of Lookup Table Optimizations
Chris Wilcox, Michelle Mills Strout, and James M. Bieman
(Colorado State University, USA)
Scientific programmers strive constantly to meet performance demands. Tuning is often done manually, despite the significant development time and effort required. One example is lookup table (LUT) optimization, a technique that is generally applied by hand due to a lack of methodology and tools. LUT methods reduce execution time by replacing computations with memory accesses to precomputed tables of results. LUT optimizations improve performance when the memory access is faster than the original computation, and the level of reuse is sufficient to amortize LUT initialization. Current practice requires programmers to inspect program source to identify candidate expressions, then develop specific LUT code for each optimization. Measurement of LUT accuracy is usually ad hoc, and the interaction with multicore parallelization has not been explored.
In this paper we present Mesa, a standalone tool that implements error analysis and code generation to improve the process of LUT optimization. We evaluate Mesa on a multicore system using a molecular biology application and other scientific expressions. Our LUT optimizations realize a performance improvement of 5X for the application and up to 45X for the expressions, while tightly controlling error. We also show that the serial optimization is just as effective on a parallel version of the application. Our research provides a methodology and tool for incorporating LUT optimizations into existing scientific code.
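The LUT technique described in this abstract can be illustrated with a minimal sketch. It is not Mesa's generated code: the function names, the table size, the nearest-sample lookup, and the choice of `math.sin` as the candidate expression are all assumptions made for illustration. It shows the three ingredients the abstract names: table initialization, replacement of a computation by a memory access, and a measurement of the error introduced. In interpreted Python the lookup is not expected to be faster than `math.sin`; the sketch shows structure only.

```python
import math

def build_lut(func, lo, hi, entries):
    """Precompute func at `entries` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [func(lo + i * step) for i in range(entries)]
    return table, step

def lut_eval(table, step, lo, x):
    """Replace the computation with a lookup of the nearest precomputed sample."""
    index = int(round((x - lo) / step))
    index = max(0, min(len(table) - 1, index))  # clamp to the table bounds
    return table[index]

def max_abs_error(func, table, step, lo, hi, probes=10_000):
    """Empirically measure the worst absolute error over evenly spaced probes."""
    worst = 0.0
    for i in range(probes):
        x = lo + (hi - lo) * i / (probes - 1)
        worst = max(worst, abs(func(x) - lut_eval(table, step, lo, x)))
    return worst

# Hypothetical candidate expression: sin(x) on [0, pi] with a 4096-entry table.
LO, HI = 0.0, math.pi
table, step = build_lut(math.sin, LO, HI, 4096)
error = max_abs_error(math.sin, table, step, LO, HI)
```

For a nearest-sample table, the error is bounded by half the step size times the largest slope of the function, so for `sin` it stays below `step / 2`. Enlarging the table trades memory and initialization time for accuracy.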

Model-based Generation of Static Schedules for Safety Critical Multi-Core Systems in the Avionics Domain
Robert Hilbrich and Hans-Joachim Goltz
(Fraunhofer FIRST, Germany)
Static schedules are used in safety critical systems to achieve predictable, real-time behavior. While it was possible to construct static schedules manually for simple, single-core systems, the increase in complexity introduced by multi-core processors and the demand for flexible and dynamic engineering processes in the avionics domain require a novel approach for their automatic generation.
This paper describes ongoing trends in the avionics domain to further underline the engineering constraints encountered when introducing multi-core processors in a safety critical area. Focussing on the requirement of predictable behavior, it presents a model-based approach for the generation of static schedules for complex multi-core systems. The approach incorporates the usage of external resources, which is essential to achieve deterministic resource access and real-time behavior on hardware architectures with multiple execution units.

Improving Programmability of Heterogeneous Many-Core Systems via Explicit Platform Descriptions
Martin Sandrieser, Siegfried Benkner, and Sabri Pllana
(University of Vienna, Austria)
In this paper we present ongoing work towards a programming framework for heterogeneous hardware and software environments. Our framework aims at improving programmability and portability for heterogeneous many-core systems via a Platform Description Language (PDL) for expressing architectural patterns and platform information. We developed a prototypical code generator that takes an annotated serial task-based program as input and outputs code for a specific target heterogeneous computing system, parametrized via PDL descriptors. By varying the target PDL descriptor, code for different target configurations can be generated without the need to modify the input program. We utilize a simple task-based programming model to demonstrate our approach and present preliminary results indicating its applicability on a state-of-the-art heterogeneous system.

Auto-tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems
Usman Dastgeer, Johan Enmyren, and Christoph W. Kessler
(Linköping University, Sweden)
SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, map-array, and scan. The performance of SkePU-generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving.
In this paper, we discuss initial results from auto-tuning SkePU using an off-line, machine learning approach where we adapt skeletons to a given platform using training data. The prediction mechanism at execution time uses off-line pre-calculated estimates to construct an execution plan for any desired configuration with minimal overhead. The prediction mechanism accurately predicts execution time for repetitive executions and includes a mechanism to predict execution time for user functions of different complexity. The tuning framework covers selection between different backends as well as choosing optimal parameter values for the selected backend. We will discuss our approach and initial results obtained for different skeletons (map, mapreduce, reduce).
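The idea of building an execution plan from off-line estimates can be sketched as follows. This is not SkePU's tuning framework or its API: the linear cost models, their numbers, and the names `MODELS`, `build_plan`, and `choose_backend` are hypothetical, standing in for whatever predictor is fitted from training data. The sketch precomputes, for a set of sampled input sizes, which backend has the lowest predicted time, so that the run-time decision reduces to a cheap table lookup.

```python
from bisect import bisect_right

# Hypothetical cost models fitted off-line, one per backend:
# (fixed overhead, cost per element), both in microseconds.
MODELS = {
    "cpu": (1.0, 0.010),
    "openmp": (20.0, 0.003),
    "cuda": (150.0, 0.0005),
}

def predict(backend, n):
    """Predicted execution time for n elements on the given backend."""
    overhead, per_element = MODELS[backend]
    return overhead + per_element * n

def build_plan(sizes):
    """Off-line step: record the size at which the best backend changes."""
    plan = []
    for n in sorted(sizes):
        best = min(MODELS, key=lambda b: predict(b, n))
        if not plan or plan[-1][1] != best:
            plan.append((n, best))
    return plan

def choose_backend(plan, n):
    """Run-time step: find the last plan entry whose start size is <= n."""
    starts = [start for start, _ in plan]
    index = max(bisect_right(starts, n) - 1, 0)
    return plan[index][1]

plan = build_plan([10 ** k for k in range(1, 8)])
small = choose_backend(plan, 500)
medium = choose_backend(plan, 50_000)
large = choose_backend(plan, 2_000_000)
```

Because the plan is built only at the sampled sizes, its switch points approximate the true crossover points of the models. A denser sampling gives a more precise plan at the cost of more off-line work.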

Lightweight Parallel Accumulators Using C++ Templates
Yossi Lev and Mark Moir
(Oracle Labs, USA)
The Boost.Accumulators framework provides C++ template-based support for incremental computation of many important statistical functions, such as maximum, minimum, mean, count, and variance. Basic accumulators can be combined to build more sophisticated ones. We explore how this framework can be extended to implement lightweight parallel accumulators that allow multiple threads to Store sample data, and support concurrent GetResult operations that incrementally compute desired functions over the data. Our evaluation shows that our parallel accumulators are scalable and can effectively exploit programmer-supplied knowledge to achieve significant optimizations for some important cases.
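One common way to let several threads store samples without contending on shared state is to keep a partial result per thread and merge the partials on demand. The sketch below illustrates that general technique in Python. It is neither the Boost.Accumulators API nor the authors' design: the class `ParallelAccumulator` and its methods `store` and `get_result` are hypothetical names that loosely mirror the Store and GetResult operations mentioned in the abstract.

```python
import threading

class ParallelAccumulator:
    """Each thread updates its own slot; get_result merges all slots."""

    def __init__(self):
        self._local = threading.local()
        self._slots = []
        self._lock = threading.Lock()  # taken only to register a slot or to merge

    def store(self, sample):
        slot = getattr(self._local, "slot", None)
        if slot is None:
            # First sample from this thread: create and register its slot.
            slot = {"count": 0, "sum": 0.0, "max": float("-inf")}
            self._local.slot = slot
            with self._lock:
                self._slots.append(slot)
        slot["count"] += 1
        slot["sum"] += sample
        if sample > slot["max"]:
            slot["max"] = sample

    def get_result(self):
        with self._lock:
            slots = list(self._slots)
        count = sum(s["count"] for s in slots)
        total = sum(s["sum"] for s in slots)
        maximum = max((s["max"] for s in slots), default=float("-inf"))
        mean = total / count if count else float("nan")
        return {"count": count, "mean": mean, "max": maximum}

def worker(acc, base):
    for i in range(1000):
        acc.store(base + i)

acc = ParallelAccumulator()
threads = [threading.Thread(target=worker, args=(acc, k * 1000)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = acc.get_result()
```

In this sketch a `get_result` call that overlaps with ongoing `store` calls may observe a slot between its count and sum updates, so the result is exact only once the storing threads have finished, as in the usage above.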

Open Language Implementation
Mandana Vaziri, Robert Fuhrer, and Evelyn Duesterwald
(IBM Research, USA)
Writing multi-threaded shared variable code is notoriously difficult because a given program can result in different possible executions due to non-determinism. Observed executions may also vary depending on the specific compiler and runtime system at hand. Many aspects of behavior are beyond the control of the programmer at the language's level of abstraction (e.g., compiler transformations), and as a result, debugging and performance tuning are very difficult tasks. In this position paper, we propose that the language implementation stack (compiler, runtime system) should be open, i.e., transparent, accountable, and interactive, so that the programmer may have direct access to invaluable information, and regain control over the behavior of the program in a given environment.

How Do Programs Become More Concurrent? A Story of Program Transformations
Danny Dig, John Marrero, and Michael D. Ernst
(University of Illinois, USA; Massachusetts Institute of Technology, USA; University of Washington, USA)
In the multi-core era, programmers need to resort to parallelism if they want to improve program performance. Thus, a major maintenance task will be to make sequential programs more concurrent. Must concurrency be designed into a program, or can it be retrofitted later? What are the most common transformations to retrofit concurrency into sequential programs? Are these transformations random, or do they belong to certain categories? How can we automate these transformations?
To answer these questions, we analyzed the source code of 5 open-source Java projects and looked at a total of 14 versions. We analyzed the concurrency-related transformations both qualitatively and quantitatively. We found that these transformations belong to four categories: transformations that improve the responsiveness, throughput, scalability, or correctness of the applications. In 73.9% of these transformations, concurrency was retrofitted on existing program elements. In 20.5% of the transformations, concurrency was designed into new program elements. Our findings educate software developers on how to parallelize sequential programs, and provide hints for tool vendors about what transformations are worth automating.

Fourth International Workshop on Multicore Software Engineering (IWMSE 2011)