LCTES 2016 – Proceedings

Efficient Asynchronous Interrupt Handling in a Full-System Instruction Set Simulator
Tom Spink, Harry Wagstaff, and Björn Franke

(University of Edinburgh, UK)
Instruction set simulators (ISS) have many uses in embedded software and hardware development and are typically based on dynamic binary translation (DBT), where frequently executed regions of guest instructions are compiled into host instructions using a just-in-time (JIT) compiler. Full-system simulation, which necessitates handling of asynchronous interrupts from e.g. timers and I/O devices, complicates matters as control flow is interrupted unpredictably and diverted from the current region of code. In this paper we present a novel scheme for handling of asynchronous interrupts, which integrates seamlessly into a region-based dynamic binary translator. We first show that our scheme is correct, i.e. interrupt handling is not deferred indefinitely, even in the presence of code regions comprising control flow loops. We demonstrate that our new interrupt handling scheme is efficient as we minimise the number of inserted checks. Interrupt handlers are also presented to the JIT compiler and compiled to native code, further enhancing the performance of our system. We have evaluated our scheme in an ARM simulator using a region-based JIT compilation strategy. We demonstrate that our solution reduces the number of dynamic interrupt checks by 73%, reduces interrupt service latency by 26% and improves throughput of an I/O bound workload by 7%, over traditional per-block schemes.

Code Cache Management in Managed Language VMs to Reduce Memory Consumption for Embedded Systems
Forrest J. Robinson, Michael R. Jantz

, and Prasad A. Kulkarni
(University of Kansas, USA; University of Tennessee, USA)
The compiled native code generated by a just-in-time (JIT) compiler in managed language virtual machines (VM) is placed in a region of memory called the code cache. Code cache management (CCM) in a VM is responsible to find and evict methods from the code cache to maintain execution correctness and manage program performance for a given code cache size or memory budget. Effective CCM can also boost program speed by enabling more aggressive JIT compilation, powerful optimizations, and improved hardware instruction cache and I-TLB performance. Though important, CCM is an overlooked component in VMs. We find that the default CCM policies in Oracle’s production-grade HotSpot VM perform poorly even at modest memory pressure. We develop a detailed simulation-based framework to model and evaluate the potential efficiency of many different CCM policies in a controlled and realistic, but VM-independent environment. We make the encouraging discovery that effective CCM policies can sustain high program performance even for very small cache sizes. Our simulation study provides the rationale and motivation to improve CCM strategies in existing VMs. We implement and study the properties of several CCM policies in HotSpot. We find that in spite of working within the bounds of the HotSpot VM’s current CCM sub-system, our best CCM policy implementation in HotSpot improves program performance over the default CCM algorithm by 39%, 41%, 55%, and 50% with code cache sizes that are 90%, 75%, 50%, and 25% of the desired cache size, on average.

A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach
Ricardo Nobre, Luiz G. A. Martins, and João M. P. Cardoso
(University of Porto, Portugal; INESC TEC, Portugal; Federal University of Uberlândia, Brazil)
Nowadays compilers include tens or hundreds of optimization passes, which makes it difficult to find sequences of optimizations that achieve compiled code more optimized than the one obtained using typical compiler options such as -O2 and -O3. The problem involves both the selection of the compiler passes to use and their ordering in the compilation pipeline. The improvement achieved by the use of custom phase orders for each function can be significant, and thus important to satisfy strict requirements such as the ones present in high-performance embedded computing systems. In this paper we present a new and fast iterative approach to the phase selection and ordering challenges resulting in compiled code with higher performance than the one achieved with the standard optimization levels of the LLVM compiler. The obtained performance improvements are comparable with the ones achieved by other iterative approaches while requiring considerably less time and resources. Our approach is based on sampling over a graph representing transitions between compiler passes. We performed a number of experiments targeting the LEON3 microarchitecture using the Clang/LLVM 3.7 compiler, considering 140 LLVM passes and a set of 42 representative signal and image processing C functions. An exhaustive cross-validation shows our new exploration method is able to achieve a geometric mean performance speedup of 1.28x over the best individually selected -OX flag when considering 100,000 iterations; versus geometric mean speedups from 1.16x to 1.25x obtained with state-of-the-art iterative methods not using the graph. From the set of exploration methods tested, our new method is the only one consistently finding compiler sequences that result in performance improvements when considering 100 or less exploration iterations. Specifically, it achieved geometric mean speedups of 1.08x and 1.16x for 10 and 100 iterations, respectively.

Loop and Dataflow Analysis

Translation Validation of Loop and Arithmetic Transformations in the Presence of Recurrences
Kunal Banerjee, Chittaranjan Mandal, and Dipankar Sarkar
(IIT Kharagpur, India)
Compiler optimization of array-intensive programs involves extensive application of loop transformations and arithmetic transformations. Hence, translation validation of array-intensive programs requires manipulation of intervals of integers (representing domains of array indices) and relations over such intervals to account for loop transformations and simplification of arithmetic expressions to handle arithmetic transformations. A major obstacle for verification of such programs is posed by the presence of recurrences, whereby an element of an array gets defined in a statement S inside a loop in terms of some other element(s) of the same array which have been previously defined through the same statement S. Recurrences lead to cycles in the data-dependence graph of a program which make dependence analyses and simplifications (through closed-form representations) of the data transformations difficult. Another technique which works better for recurrences does not handle arithmetic transformations. In this work, array data-dependence graphs (ADDGs) are used to represent both the original and the optimized versions of the program and a validation scheme is proposed where the cycles due to recurrences in the ADDGs are suitably abstracted as acyclic subgraphs. Thus, this work provides a unified equivalence checking framework to handle loop and arithmetic transformations along with most of the recurrences -- this combination of features had not been achieved by a single verification technique earlier.

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Yulei Sui, XIaokang Fan, Hao Zhou, and Jingling Xue

(UNSW, Australia)
Compiler-based auto-vectorization is a promising solution to automatically generate code that makes efficient use of SIMD processors in high performance platforms and embedded systems. Two main auto-vectorization techniques, superword-level parallelism vectorization (SLP) and loop-level vectorization (LLV), re- quire precise dependence analysis on arrays and structs in order to vectorize isomorphic scalar instructions and/or reduce dynamic dependence checks incurred at runtime. The alias analyses used in modern vectorizing compilers are either intra-procedural (without tracking inter-procedural data-flows) or inter-procedural (by using field-insensitive models, which are too imprecise in handling arrays and structs). This paper pro- poses an inter-procedural Loop-oriented Pointer Analysis, called LPA, for analyzing arrays and structs to support aggressive SLP and LLV optimizations. Unlike field-insensitive solutions that pre- allocate objects for each memory allocation site, our approach uses a fine-grained memory model to generate location sets based on how structs and arrays are accessed. LPA can precisely analyze ar- rays and nested aggregate structures to enable SIMD optimizations for large programs. By separating the location set generation as an independent concern from the rest of the pointer analysis, LPA is designed to reuse easily existing points-to resolution algorithms. We evaluate LPA using SLP and LLV, the two classic vectorization techniques on a set of 20 CPU2000/2006 benchmarks. For SLP, LPA enables it to vectorize a total of 133 more basic blocks, with an average of 12.09 per benchmark, resulting in the best speedup of 2.95% for 173.applu. For LLV, LPA has reduced a total of 319 static bound checks, with an average of 22.79 per benchmark, resulting in the best speedup of 7.18% for 177.mesa.

Generalized Cache Tiling for Dataflow Programs
Łukasz Domagała, Duco van Amstel, and Fabrice Rastello
(Inria, France)
The dataflow programming paradigm has facilitated the expression of a great number of algorithmic applications on embedded platforms in a wide variety of applicative domains. Whether it is a Domain Specific Language (DSL) or a more generalistic one, the dataflow paradigm allows to intuitively state the successive steps of an algorithm and link them through data communications. The optimization of cache-memory in this context has been a subject of interest since the early '90s as the reuse and communication of data between the agents of a dataflow program is a key factor in achieving a high-performance implementation within the reduced limits of embedded architectures. In order to improve data reuse among the dataflow agents we propose a modelisation of the communications and data usage within a dataflow program. Aside from providing an estimate of the amount of cache-misses that a given scheduling generates, this model allows us to specify the associated optimization problem in a manner that is identical to loop-nest tiling. Improving on the existing state-of-the-art methods we extend our tiling technique to include non-uniform dependencies on one of the dimensions of the iteration space. When applying the proposed technique to dataflow programs expressed within the StreamIt framework we are able to showcase significant reductions in the number of cache-misses for a majority of test-cases when compared to existing optimizations.

Worst-Case Analysis and Error Handling

Symbolic Execution for Memory Consumption Analysis
Duc-Hiep Chu, Joxan Jaffar

, and Rasool Maghareh
(National University of Singapore, Singapore)
With the advances in both hardware and software of embedded systems in the past few years, dynamic memory allocation can now be safely used in embedded software. As a result, the need to develop methods to avoid heap overflow errors in safety-critical embedded systems has increased. Resource analysis of imperative programs with non-regular loop patterns and signed integers, to support both memory allocation and deallocation, has long been an open problem. Existing methods can generate symbolic bounds that are parametric w.r.t. the program inputs; such bounds, however, are imprecise in the presence of non-regular loop patterns. In this paper, we present a worst-case memory consumption analysis, based upon the framework of symbolic execution. Our assumption is that loops (and recursions) of to-be-analyzed programs are indeed bounded. We then can exhaustively unroll loops and the memory consumption of each iteration can be precisely computed and summarized for aggregation. Because of path-sensitivity, our algorithm generates more precise bounds. Importantly, we demonstrate that by introducing a new concept of reuse, symbolic execution scales to a set of realistic benchmark programs.

TIC: A Scalable Model Checking Based Approach to WCET Estimation
Ravindra Metta, Martin Becker, Prasad Bokil, Samarjit Chakraborty, and R Venkatesh
(Tata Consultancy Services, India; TU Munich, Germany)
The application of Model Checking to compute WCET has not been explored as much as Integer Linear Programming (ILP), primarily because model checkers fail to scale for complex programs. These programs have loops with large or unknown bounds, leading to a state space explosion that model checkers cannot handle. To overcome this, we have developed a technique, TIC, that employs slicing, loop acceleration and over-approximation on time-annotated source code, enabling Model Checking to scale better for WCET computation. Further, our approach is parametric, so that the user can make a trade-off between the tightness of WCET estimate and the analysis time. We conducted experiments on the Mälardalen benchmarks to evaluate the effect of various abstractions on the WCET estimate and analysis time. Additionally, we compared our estimates to those made by an ILP-based analyzer and found that our estimates were tighter for more than 30% of the examples and were equal for the rest.

Compensate or Ignore? Meeting Control Robustness Requirements through Adaptive Soft-Error Handling
Kuan-Hsun Chen, Björn Bönninghoff, Jian-Jia Chen, and Peter Marwedel
(TU Dortmund, Germany)
To avoid catastrophic events like unrecoverable system failures on mobile and embedded systems caused by soft-errors, software-based error detection and compensation techniques have been proposed. Methods like error-correction codes or redundant execution can offer high flexibility and allow for application-specific fault-tolerance selection without the needs of special hardware supports. However, such software-based approaches may lead to system overload due to the execution time overhead. An adaptive deployment of such techniques to meet both application requirements and system constraints is desired. From our case study, we observe that a control task can tolerate limited errors with acceptable performance loss. Such tolerance can be modeled as a (m,k) constraint which requires at least m correct runs out of any k consecutive runs to be correct. In this paper, we discuss how a given (m,k) constraint can be satisfied by adopting patterns of task instances with individual error detection and compensation capabilities. We introduce static strategies and provide a formal feasibility analysis for validation. Furthermore, we develop an adaptive scheme that extends our initial approach with online awareness that increases efficiency while preserving analysis results. The effectiveness of our method is shown in a real-world case study as well as for synthesized task sets.

Computation Partitioning

Opportunity for Compute Partitioning in Pursuit of Energy-Efficient Systems
Prasenjit Chakraborty, Gautam Doshi, Shashank Shekhar, and Vikrant Kumar
(Intel, India)
Performance of computing systems, from handhelds to supercomputers, is increasingly constrained by the energy consumed. A significant and increasing fraction of the energy is consumed in the movement of data. In a compute node, caches have been very effective in reducing data movement by exploiting the available data locality in programs. Program regions with poor data locality, then effect most of the data movement, and consequently consume an ever larger fraction of energy. In this paper we explore the energy efficiency opportunity of minimizing the data movement in precisely such program regions, by first imagining the possibility of compute near memory, and then partitioning the program’s execution between a compute core and the compute near memory (CnM). Due to the emergence of 3D stacked memory, a CnM implementation appears more realistic. Our focus is on evaluating the partitioning opportunity in applications and to do a limit study of systems enabled with CnM capabilities to understand and guide their architectural embodiment. We describe an automated method of analyzing the data access pattern of optimized workload binaries, via a binary-instrumentation tool called SnapCnM, to identify the beneficial program regions (loops) for CnM execution.We also perform a limit study to evaluate the impact of such partitioning over a range of parameters affecting CnM design choices. Our results show that compute partitioning a small (<10%) fraction of a workload can improve its energy efficiency from 3% (for compute-bound applications) to 27% (for memory-bound applications). From the study in this work we discuss the important aspects that help to shape the future CnM design space.

Compiling a Gesture Recognition Application for a Low-Power Spatial Architecture
Phitchaya Mangpo Phothilimthana, Michael Schuldt, and Rastislav Bodik
(University of California at Berkeley, USA; University of Washington, USA)
Energy efficiency is one of the main performance goals when designing processors for embedded systems. Typically, the simpler the processor, the less energy it consumes. Thus, an ultra-low power multicore processor will, likely have very small distributed memory with a simple interconnect. To compile for such an architecture, a partitioning strategy that can tune between space and communication minimization is crucial to fit a program in its limited resources and achieve good performance. A careful program layout design is also critical. Aside fulfilling the space constraint, a compiler needs to be able to optimize for program latency to satisfy a certain timing requirement as well. To satisfy all aforementioned constraints, we present a flexible code partitioning strategy and light-weight mechanisms to express parallelism and program layout. First, we compare two strategies for partitioning program structures and introduce a language construct to let programmers choose which strategies to use and when. The compiler then partitions program structures with a mix of both strategies. Second, we add supports for programmer-specified parallelism and program layout through imposing additional spatial constraints to the compiler. We evaluate our compiler by implementing an accelerometer-based gesture recognition application on GA144, a recent low-power minimalistic multicore architecture. When compared to MSP430, GA144 is overall 19x more energy-efficient and 23x faster when running this application. Without these inventions, this application would not be able to fit on GA144.

A Machine Learning Approach to Mapping Streaming Workloads to Dynamic Multicore Processors
Paul-Jules Micolet, Aaron Smith, and Christophe Dubach
(University of Edinburgh, UK; Microsoft Research, USA)
Dataflow programming languages facilitate the design of data intensive programs such as streaming applications commonly found in embedded systems. They also expose parallelism that can be exploited using multicore processors which are now part of the mobile landscape. In recent years a shift has occurred towards heterogeneity ( ARM big.LITTLE) and reconfigurability. Dynamic Multicore Processors (DMPs) bridge the gap between fully reconfigurable processors and homogeneous multicore systems. They can re-allocate their resources at runtime to create larger more powerful logical processors fine-tuned to the workload. Unfortunately, there exists no accurate method to determine how to partition the cores in a DMP among application threads. Often programmers rely on analyzing the application manually and using a set of hand picked heuristics. This leads to sub-optimal performance, reducing the potential of DMPs. What is needed is a way to determine the optimal partitioning and grouping of resources to maximize performance. As a first step, this paper studies the effect of thread partitioning and hardware resource allocation on a set of StreamIt applications. We show that the resulting space is not trivial and exhibits a large performance variation depending on the combination of parameters. We introduce a machine-learning based methodology to tackle the space complexity. Our machine-learning model is able to directly predict the best combination of parameters using static code features. The predicted set of parameters leads to performance on-par with the best performance found in a space of more than 32,000 configurations per application.

LCTES 2016 – Proceedings

Frontmatter

Dynamic Translation and Iterative Compilation

Loop and Dataflow Analysis

Worst-Case Analysis and Error Handling

Computation Partitioning