LCTES 2017
18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2017)

18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2017), June 21–22, 2017, Barcelona, Spain

LCTES 2017 – Proceedings


Frontmatter

Title Page


Message from the Chairs
Welcome to LCTES 2017, the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. This year’s LCTES is being held in Barcelona, Spain, on June 21–22, 2017, and is co-located with PLDI 2017.

LCTES 2017 Organization
Committee listings

Compiler Optimization for Embedded Systems

AOT vs. JIT: Impact of Profile Data on Code Quality
April W. Wade, Prasad A. Kulkarni, and Michael R. Jantz
(University of Kansas, USA; University of Tennessee, USA)
Just-in-time (JIT) compilation during program execution and ahead-of-time (AOT) compilation during software installation are alternative techniques used by managed language virtual machines (VMs) to generate optimized native code while simultaneously achieving binary code portability and high execution performance. Profile data collected by JIT compilers at run-time can enable profile-guided optimizations (PGO) to customize the generated native code to different program inputs. AOT compilation removes the speed and energy overhead of online profile collection and dynamic compilation, but may not be able to achieve the quality and performance of customized native code. The goal of this work is to investigate and quantify the implications of the AOT compilation model on the quality of the generated native code for current VMs.
First, we quantify the quality of native code generated by the two compilation models for a state-of-the-art (HotSpot) Java VM. Second, we determine how the amount of profile data collected affects the quality of generated code. Third, we develop a mechanism to determine the accuracy or similarity for different profile data for a given program run, and investigate how the accuracy of profile data affects its ability to effectively guide PGOs. Finally, we categorize the profile data types in our VM and explore the contribution of each such category to performance.
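As an illustration of the third contribution, the sketch below shows one simple way to score the similarity of two branch profiles by comparing per-branch taken ratios; the metric and data structures are assumptions for this sketch, not the paper's actual mechanism.

    # Hypothetical sketch: compare two branch profiles collected on different
    # inputs by measuring how closely their per-branch taken ratios agree.

    def taken_ratios(profile):
        """profile maps branch-site id -> (taken_count, total_count)."""
        return {site: taken / total
                for site, (taken, total) in profile.items() if total > 0}

    def profile_similarity(profile_a, profile_b):
        """Return a value in [0, 1]; 1 means identical branch behaviour."""
        ratios_a = taken_ratios(profile_a)
        ratios_b = taken_ratios(profile_b)
        common = ratios_a.keys() & ratios_b.keys()
        if not common:
            return 0.0
        # Average agreement: 1 minus the absolute per-branch difference.
        return sum(1.0 - abs(ratios_a[s] - ratios_b[s]) for s in common) / len(common)

    # Example: two runs with mostly consistent branch behaviour.
    run1 = {"b1": (90, 100), "b2": (10, 100)}
    run2 = {"b1": (85, 100), "b2": (20, 100)}
    print(profile_similarity(run1, run2))  # ~0.925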

Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems
Ben Taylor, Vicent Sanz Marco, and Zheng Wang
(Lancaster University, UK)
Heterogeneous multi-core architectures consisting of CPUs and GPUs are commonplace in today’s embedded systems. These architectures offer potential for energy-efficient computing if the application task is mapped to the right core. Realizing such potential is challenging due to the complex and evolving nature of hardware and applications. This paper presents an automatic approach to map OpenCL kernels onto heterogeneous multi-cores for a given optimization criterion – whether it is faster runtime, lower energy consumption or a trade-off between them. This is achieved by developing a machine learning based approach to predict which processor to use to run the OpenCL kernel and the host program, and at what frequency the processor should operate. Instead of hand-tuning a model for each optimization metric, we use machine learning to develop a unified framework that first automatically learns the optimization heuristic for each metric off-line, then uses the learned knowledge to schedule OpenCL kernels at runtime based on code and runtime information of the program. We apply our approach to a set of representative OpenCL benchmarks and evaluate it on an ARM big.LITTLE mobile platform. Our approach achieves over 93% of the performance delivered by a perfect predictor. We obtain, on average, 1.2x, 1.6x, and 1.8x improvements for runtime, energy consumption, and the energy delay product, respectively, when compared to a comparable heterogeneous-aware OpenCL task mapping scheme.
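The prediction step can be pictured with the minimal sketch below, which maps static kernel features to a (processor, frequency) configuration via a 1-nearest-neighbour lookup; the feature names, training points, and model choice are illustrative assumptions, not the paper's learned heuristic.

    # Hypothetical sketch of the prediction step: given static code features of
    # an OpenCL kernel, pick a (processor, frequency) setting learned off-line.

    TRAINING = [
        # (features: [compute-to-memory ratio, data size in MB], best config)
        ([8.0, 2.0],  ("GPU",        600)),   # compute-bound -> GPU
        ([0.5, 64.0], ("big CPU",   1800)),   # memory-bound, large data
        ([0.4, 1.0],  ("LITTLE CPU", 1000)),  # small, memory-bound kernel
    ]

    def predict_config(features):
        """Return the configuration of the closest training point."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(TRAINING, key=lambda t: dist(t[0], features))[1]

    print(predict_config([6.5, 4.0]))   # ('GPU', 600)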

Auto-vectorization for Image Processing DSLs
Oliver Reiche, Christof Kobylko, Frank Hannig, and Jürgen Teich
(University of Erlangen-Nuremberg, Germany)
The parallelization of programs and the distribution of their workloads across multiple threads can be a challenging task. In addition to multi-threading, harnessing the vector units in CPUs proves highly desirable. However, employing vector units to speed up programs can be quite tedious: either developers rely solely on the auto-vectorization capabilities of the compiler, or they manually apply vector intrinsics, which is extremely error-prone, difficult to maintain, and not portable at all.
Based on whole-function vectorization, a method to replace control flow with data flow, we propose auto-vectorization techniques for image processing DSLs in the context of source-to-source compilation. The approach does not require the input to be available in SSA form. Moreover, we formulate constraints under which the vectorization analysis and code transformations may be greatly simplified in the context of image processing DSLs. As part of our methodology, we present control flow to data flow transformation as a source-to-source translation. Moreover, we propose a method to efficiently analyze algorithms with mixed bit-width data types to determine the optimal SIMD width, independently of the target instruction set. The techniques are integrated into an open source DSL framework. Subsequently, the vectorization capabilities are compared to a variety of existing state-of-the-art C/C++ compilers. A geometric mean speedup of up to 3.14 is observed for benchmarks taken from ISPC and image processing, compared to non-vectorized executions.
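The control-flow-to-data-flow idea at the heart of whole-function vectorization can be illustrated with the small sketch below, where a per-pixel branch becomes a per-lane mask and a select; this is a conceptual model only, not code generated by the DSL framework.

    # Illustrative sketch of control-flow-to-data-flow conversion: a per-pixel
    # branch becomes a per-lane predicate and a select over a SIMD-style
    # vector of lanes.

    def threshold_scalar(pixels, limit):
        out = []
        for p in pixels:
            if p > limit:          # control flow: one branch per pixel
                out.append(255)
            else:
                out.append(0)
        return out

    def threshold_vectorized(pixels, limit):
        mask = [p > limit for p in pixels]       # data flow: per-lane predicate
        return [255 if m else 0 for m in mask]   # per-lane select, no branching logic

    pixels = [12, 200, 90, 255]
    assert threshold_scalar(pixels, 128) == threshold_vectorized(pixels, 128)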

Dynamic Translation of Structured Loads/Stores and Register Mapping for Architectures with SIMD Extensions
Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, and Wei-Chung Hsu
(National Taiwan University, Taiwan; Academia Sinica, Taiwan)
More and more modern processors support non-contiguous SIMD data accesses. However, translating such instructions has been overlooked in the Dynamic Binary Translation (DBT) area. For example, in the popular QEMU dynamic binary translator, guest memory instructions with strides are emulated by a sequence of scalar instructions, leaving significant room for performance improvement when the host machines have SIMD instructions available. Structured loads/stores, such as VLDn/VSTn in ARM NEON, are one type of strided SIMD data access instructions. They are widely used in signal processing, multimedia, mathematical, and 2D matrix transposition applications. Efficient translation of such structured loads/stores is a critical issue when migrating ARM executables to other ISAs. However, it is quite challenging, since not only is the translation of structured loads/stores non-trivial, but the difference between guest and host register configurations must also be taken into consideration. In this work, we present the design and implementation of translating structured loads/stores in DBT, including target code generation as well as efficient SIMD register mapping. Our proposed register mapping mechanisms are not limited to handling structured loads/stores; they can be extended to deal with normal SIMD instructions. On a set of OpenCV benchmarks, our QEMU-based system achieves a maximum speedup of 5.41x, with an average improvement of 2.93x. On a set of BLAS benchmarks, our system also obtains a maximum speedup of 2.19x and an average improvement of 1.63x.
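A minimal sketch of the semantics involved: a structured load such as ARM NEON VLD3 de-interleaves packed data into separate registers, which a translator can emulate element by element (as stock QEMU does) or map to strided/shuffle operations on the host. The Python below only models these semantics and is not the translator's generated code.

    # Interleaved guest memory: R,G,B,R,G,B,...
    rgb = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]

    def vld3_scalar(mem, base, lanes):
        """Scalar emulation: one guest element per host load (QEMU-style)."""
        d0, d1, d2 = [], [], []
        for i in range(lanes):
            d0.append(mem[base + 3 * i])
            d1.append(mem[base + 3 * i + 1])
            d2.append(mem[base + 3 * i + 2])
        return d0, d1, d2

    def vld3_deinterleave(mem, base, lanes):
        """Vector-style emulation: one strided slice per destination register,
        standing in for the host shuffle/permute sequence."""
        block = mem[base: base + 3 * lanes]
        return block[0::3], block[1::3], block[2::3]

    assert vld3_scalar(rgb, 0, 4) == vld3_deinterleave(rgb, 0, 4)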

Abstraction, Modelling, and Scheduling for IoT and Embedded Systems

Optimal Functional Unit Assignment and Voltage Selection for Pipelined MPSoC with Guaranteed Probability on Time Performance
Weiwen Jiang, Edwin H.-M. Sha, Qingfeng Zhuge, Hailiang Dong, and Xianzhang Chen
(Chongqing University, China)
Pipelined heterogeneous multiprocessor system-on-chip (MPSoC) can provide high throughput for streaming applications. In the design of such systems, time performance and system cost are the primary concerns. By analyzing the runtime behavior of benchmarks on real-world platforms, we find that the execution times of tasks are not fixed but spread over a range with different probabilities. Based on this observation, we model the execution times of tasks as random variables. In this paper, we study how to design high-performance and low-cost MPSoC systems to execute a set of such tasks with data dependencies in a pipelined fashion. Our objective is to obtain the optimal functional unit assignment and voltage selection for the pipelined MPSoC systems, such that the system cost is minimized while timing constraints are met with a given guaranteed probability. For each required probability, our proposed algorithm efficiently obtains the optimal solution. Experiments show that existing algorithms cannot find feasible solutions in most cases, whereas ours can. Even for the solutions that other algorithms can obtain, ours achieves up to 30% reduction in total cost compared with theirs.
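A worked sketch of the probabilistic timing model: with task execution times treated as discrete random variables, a candidate assignment is acceptable only if the resulting latency meets the deadline with at least the guaranteed probability. The distributions, deadline, and guarantee below are made-up example values, not data from the paper.

    def add_distributions(d1, d2):
        """Distribution of the sum of two independent discrete random variables."""
        out = {}
        for t1, p1 in d1.items():
            for t2, p2 in d2.items():
                out[t1 + t2] = out.get(t1 + t2, 0.0) + p1 * p2
        return out

    def meets_deadline(dist, deadline):
        """P(total time <= deadline)."""
        return sum(p for t, p in dist.items() if t <= deadline)

    task_a = {3: 0.8, 5: 0.2}     # execution time -> probability
    task_b = {4: 0.9, 7: 0.1}
    total = add_distributions(task_a, task_b)

    # Accept the assignment only if the 0.9 guarantee is met for a deadline of 10.
    # Here P(total <= 10) = 0.72 + 0.08 + 0.18 = 0.98, so the check passes.
    print(meets_deadline(total, 10) >= 0.9)   # True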

Integrated IoT Programming with Selective Abstraction
Gyeongmin Lee, Seonyeong Heo, Bongjun Kim, Jong Kim, and Hanjun Kim
(POSTECH, South Korea)
The explosion of networked devices has driven a new computing environment called the Internet of Things (IoT), enabling various services such as home automation and health monitoring. Despite the promising applicability of the IoT, developing an IoT service is challenging for programmers, because they must integrate multiple programmable devices and heterogeneous third-party devices. Recent works have proposed integrated programming platforms, but they either require device-specific implementations for third-party devices without any device abstraction, or abstract all devices to standard interfaces, which unnecessarily abstracts the programmable devices as well. To integrate IoT devices with selective abstraction, this work revisits the object-oriented programming (OOP) model and proposes a new language extension and its compiler-runtime framework, called Esperanto. With three annotations that map each object to its corresponding IoT device, the Esperanto language allows programmers to integrate multiple programmable devices into one OOP program and to abstract similar third-party devices into their common ancestor classes. Given the annotations, the Esperanto compiler automatically partitions the integrated program into multiple sub-programs, one for each programmable IoT device, and inserts communication and synchronization code. Moreover, for the ancestor classes, the Esperanto runtime dynamically identifies connected third-party devices and links their corresponding descendant objects. Compared to an existing integrated IoT programming approach, Esperanto requires 33.3% fewer lines of code to implement five IoT services and reduces their response time by 44.8% on average.
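The abstract does not show Esperanto's concrete annotation syntax; the Python decorators below are purely hypothetical stand-ins used only to illustrate the idea of mapping objects to programmable devices while abstracting third-party devices behind an ancestor class.

    # Purely hypothetical stand-in for device-mapping annotations; the names
    # are invented for this sketch and are not Esperanto's actual syntax.

    def on_device(name):
        """Mark a class as running on a specific programmable device."""
        def wrap(cls):
            cls.device = name
            return cls
        return wrap

    class Light:                      # ancestor class abstracting third-party bulbs
        def turn_on(self):
            raise NotImplementedError

    @on_device("hallway-sensor-node")
    class MotionSensor:
        def detect(self):
            return True               # placeholder sensor reading

    @on_device("home-gateway")
    class HallwayService:
        def __init__(self, sensor, light):
            self.sensor, self.light = sensor, light
        def tick(self):
            if self.sensor.detect():  # would become a remote call after partitioning
                self.light.turn_on()  # dispatched to whichever Light is discovered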

Towards SMT-Based LTL Model Checking of Clock Constraint Specification Language for Real-Time and Embedded Systems
Min Zhang and Yunhui Ying
(East China Normal University, China)
The Clock Constraint Specification Language (CCSL) is a formal companion language to MARTE (shorthand for Modeling and Analysis of Real-Time and Embedded systems), a UML profile that facilitates the design and analysis of real-time and embedded systems. CCSL is used to specify constraints on the occurrences of events in systems. However, the language lacks efficient verification support for formally analyzing temporal properties, which are important for real-time and embedded systems. In this paper, we propose an SMT-based approach to model checking temporal properties specified in Linear Temporal Logic (LTL) against CCSL specifications, by transforming CCSL constraints and LTL formulas into SMT formulas. We implement a prototype tool for the proposed approach and use the state-of-the-art solver Z3 as its underlying SMT solver. We model two practical real-time and embedded systems, a traffic light controller and a power window system, in CCSL, and model check LTL properties of them using the proposed approach. Experimental results demonstrate the effectiveness and efficiency of our approach.
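A minimal sketch of the kind of encoding involved, using the Z3 Python API: a CCSL-style precedence constraint is unrolled over a bounded number of steps, and a safety property is checked by asking whether a violating step is satisfiable. This simplification covers a single constraint and is not the tool's full CCSL/LTL translation.

    from z3 import Bool, If, Sum, Solver, Implies, Or, And, IntVal, unsat

    K = 8  # bound on the number of steps
    a = [Bool(f"a_{i}") for i in range(K)]   # clock a ticks at step i
    b = [Bool(f"b_{i}") for i in range(K)]   # clock b ticks at step i

    def ticks_before(clock, i):
        """Number of ticks of `clock` strictly before step i."""
        return Sum([If(t, 1, 0) for t in clock[:i]]) if i > 0 else IntVal(0)

    s = Solver()
    # CCSL-style strict precedence "a precedes b": b may tick at step i only
    # if a has ticked strictly more often than b before step i.
    for i in range(K):
        s.add(Implies(b[i], ticks_before(a, i) > ticks_before(b, i)))

    # Safety property "b never ticks before a has ever ticked", unrolled to K:
    # ask for a violating step; `unsat` means the property holds up to the bound.
    violation = Or([And(b[i], ticks_before(a, i) == 0) for i in range(K)])
    s.add(violation)
    print("property holds up to bound K:", s.check() == unsat)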

Integrating Task Scheduling and Cache Locking for Multicore Real-Time Embedded Systems
Wenguang Zheng, Hui Wu, and Chuanyao Nie
(Tianjin University of Technology, China; UNSW, Australia)
Modern embedded processors provide hardware support for cache locking, a mechanism used to facilitate the WCET (Worst-Case Execution Time) calculation of a task. We investigate the problem of integrating task scheduling and cache locking for a set of preemptible tasks with individual release times and deadlines on a multi-core processor with two-level caches. We propose a novel integrated approach that schedules the task set and allocates the locked cache contents of each task to the local caches (L1 caches) and the level-two cache (L2 cache). Our approach consists of three major components, the task scheduler, the L1 cache allocator, and the L2 cache allocator. The task scheduler aims at minimizing the number of task preemptions. The L1 cache allocator converts the interference graph of all the tasks scheduled on each core into a DAG by considering the preemptions between tasks and allocates the L1 cache space to each task. The L2 cache allocator converts the interference graph of all the tasks into a DAG by using a k-longest-path-based graph orientation algorithm and allocates the L2 cache space to each task. Both cache allocators significantly improve the cache utilization for all the caches due to the efficient use of the interference graphs of tasks. We have implemented our approach and compared it with the extended version of the preemption tree-based approach and the static analysis approach without cache locking by using a set of benchmarks from the MRTC WCET benchmark suite and SNU real-time benchmarks. Compared to the extended version of the preemption tree-based approach, the maximum WCRT (Worst Case Response Time) improvement of our approach is 15%. Compared to the static analysis approach, the maximum WCRT improvement of our approach is 37%.
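A simplified sketch of why cache locking helps here: once a task's hottest memory blocks are locked into its allocated cache partition, accesses to them are guaranteed hits, so the worst case is easy to bound. The greedy selection and latencies below are assumptions for illustration, not the paper's L1/L2 allocators.

    # Simplified sketch of cache locking for WCET analysis.
    HIT, MISS = 1, 20   # assumed access latencies in cycles

    def lock_and_bound(block_accesses, lockable_lines):
        """block_accesses: memory block id -> worst-case access count."""
        locked = set(sorted(block_accesses, key=block_accesses.get,
                            reverse=True)[:lockable_lines])
        wcet_mem = sum(n * (HIT if b in locked else MISS)
                       for b, n in block_accesses.items())
        return locked, wcet_mem

    accesses = {"B0": 120, "B1": 90, "B2": 15, "B3": 5}
    # Locks the two hottest blocks (B0, B1); memory-access bound = 610 cycles.
    print(lock_and_bound(accesses, 2))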

Non-volatile Memory/Processor and RTOS

Towards Memory-Efficient Processing-in-Memory Architecture for Convolutional Neural Networks
Yi Wang, Mingxu Zhang, and Jing Yang
(Shenzhen University, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Harbin Institute of Technology, China)
Convolutional neural networks (CNNs) are widely adopted in artificial intelligence systems. In contrast to conventional compute-centric applications, the computational and memory resources of CNN applications are mixed together in the network weights. This incurs a significant amount of data movement, especially for high-dimensional convolutions. Although recent embedded 3D-stacked Processing-in-Memory (PIM) architectures alleviate this memory bottleneck by providing fast near-data processing, memory is still a limiting factor of the entire system. An unsolved key challenge is how to efficiently allocate convolutions to 3D-stacked PIM to combine the advantages of both neural and computational processing.
This paper presents Memolution, a compiler-based, memory-efficient data allocation strategy for convolutional neural networks on PIM architectures. Memolution offers thread-level parallelism that can fully exploit the computational power of the PIM architecture. The objective is to capture the characteristics of neural network applications and present a hardware-independent design that transparently allocates CNN applications onto the underlying hardware resources provided by PIM. We demonstrate the viability of the proposed technique using a variety of realistic convolutional neural network applications. Our extensive evaluations show that Memolution significantly improves performance and cache utilization compared to the baseline scheme.
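The allocation problem can be pictured with the generic sketch below, which splits a convolution's output rows across PIM compute units in contiguous blocks so each unit works mostly on data in its own vault; this baseline partitioning is an assumption for illustration, not Memolution's strategy.

    def partition_rows(output_rows, num_pim_units):
        """Assign contiguous blocks of output rows to PIM compute units."""
        per_unit = (output_rows + num_pim_units - 1) // num_pim_units
        plan = {}
        for unit in range(num_pim_units):
            start = unit * per_unit
            plan[unit] = list(range(start, min(start + per_unit, output_rows)))
        return plan

    # A 56x56 output feature map spread over 8 PIM compute units:
    plan = partition_rows(56, 8)
    print({u: (rows[0], rows[-1]) for u, rows in plan.items()})
    # unit 0 gets rows 0..6, unit 1 gets rows 7..13, ..., unit 7 gets rows 49..55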

Unified nvTCAM and sTCAM Architecture for Improving Packet Matching Performance
Xianzhong Ding, Zhiyong Zhang, Zhiping Jia, Lei Ju, Mengying Zhao, and Huawei Huang
(Shandong University, China; University of Aizu, Japan)
Software-Defined Networking (SDN) allows control applications to install fine-grained forwarding policies in the underlying switches. Ternary Content Addressable Memory (TCAM) enables fast lookups in hardware switches with flexible wildcard rule patterns. However, the performance of packet processing is severely constrained by the capacity of TCAM, which aggravates the processing burden and latency issues. In this paper, we propose a hybrid TCAM architecture consisting of NVM-based TCAM (nvTCAM) and SRAM-based TCAM (sTCAM): the nvTCAM caches the most popular rules to improve the cache hit ratio, while a very small sTCAM handles cache-miss traffic to effectively decrease update latency. Taking rule dependencies into account, we present an efficient Rule Migration Replacement (RMR) policy that fully utilizes both the nvTCAM and the sTCAM to obtain better performance. Experimental results show that the proposed architecture outperforms current TCAM architectures.
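The dependency issue can be illustrated with the sketch below: a popular rule may be cached in the nvTCAM only together with the higher-priority rules whose match patterns overlap it, otherwise lookups could hit the wrong entry. The greedy policy shown is an illustration of that constraint, not the paper's RMR policy.

    def cache_rules(rules, deps, capacity):
        """rules: id -> hit count; deps: id -> set of higher-priority
        overlapping rule ids that must be cached together with it."""
        cached = set()
        for rule in sorted(rules, key=rules.get, reverse=True):
            group = {rule} | deps.get(rule, set())
            if len(cached | group) <= capacity:
                cached |= group
        return cached

    rules = {"r1": 900, "r2": 500, "r3": 40, "r4": 10}
    deps = {"r1": {"r4"}, "r2": set(), "r3": {"r2"}, "r4": set()}
    print(cache_rules(rules, deps, 3))   # caches r1, r4 (its dependency), and r2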

A Lightweight Progress Maximization Scheduler for Non-volatile Processor under Unstable Energy Harvesting
Chen Pan, Mimi Xie, Yongpan Liu, Yanzhi Wang, Chun Jason Xue, Yuangang Wang, Yiran Chen, and Jingtong Hu
(Oklahoma State University, USA; Tsinghua University, China; Syracuse University, USA; City University of Hong Kong, China; Huawei Technologies, China; Duke University, USA)
Energy harvesting techniques have become increasingly popular as power supplies for embedded systems. However, the harvested energy is intrinsically unstable, so program execution may be interrupted frequently. Although the development of non-volatile processors (NVP) makes it possible to save and restore execution states, both hardware and software challenges remain for energy-harvesting-powered embedded systems. On the hardware side, existing power detectors only signal that the harvested power is of poor quality based on a preset threshold voltage. An inappropriate setting of this threshold makes an NVP-based embedded system suffer from either unnecessary checkpointing or checkpointing failures. On the software side, not all tasks can be checkpointed; once the power is off, these tasks have to restart from the beginning. In this paper, a task scheduler is proposed to maximize task progress by prioritizing tasks that cannot be checkpointed when power is weak, so that they can finish before a power outage. To assist task scheduling, three additional modules are proposed: a voltage monitor, a checkpointing handler, and a routine handler. Experimental results show increased overall task progress and reduced energy consumption.
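The scheduling idea can be sketched as follows: when the voltage monitor reports weak power, non-checkpointable tasks run first so their progress is not lost at the next outage. The threshold, task fields, and tie-breaking rule below are assumptions for this sketch, not the paper's exact policy.

    WEAK_POWER_VOLTAGE = 2.8   # volts; assumed monitor threshold

    def pick_next_task(ready_tasks, supply_voltage):
        """ready_tasks: list of dicts with 'name', 'checkpointable', 'remaining'."""
        if supply_voltage < WEAK_POWER_VOLTAGE:
            # Weak power: prioritize non-checkpointable tasks, shortest first,
            # so their progress is not lost at the next outage.
            urgent = [t for t in ready_tasks if not t["checkpointable"]]
            if urgent:
                return min(urgent, key=lambda t: t["remaining"])
        # Stable power (or nothing urgent): plain shortest-remaining-time-first.
        return min(ready_tasks, key=lambda t: t["remaining"])

    tasks = [{"name": "sense", "checkpointable": True,  "remaining": 2},
             {"name": "radio", "checkpointable": False, "remaining": 5}]
    print(pick_next_task(tasks, 2.5)["name"])   # 'radio' while power is weak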

OSEK-V: Application-Specific RTOS Instantiation in Hardware
Christian Dietrich and Daniel Lohmann
(Leibniz Universität Hannover, Germany)
The employment of a real-time operating system (RTOS) in embedded control systems is often an all-or-nothing decision: while RTOS abstractions provide for easier software composition and development, the price in terms of event latencies and memory costs is high. Especially in HW/SW co-design settings, system developers try to avoid employing a full-blown RTOS as far as possible. In OSEK-V, we mitigate this trade-off by aggressively tailoring the concrete RTOS instance into the hardware. Instead of implementing generic OS components as custom hardware devices, we capture the actually possible application-kernel interactions as a finite-state machine and integrate the tailored RTOS semantics directly into the processor pipeline. In our experiments with an OSEK-based quadrotor flight controller implemented on the Rocket RISC-V softcore, we can thereby significantly reduce event latencies, interrupt lock times, and memory footprint at moderate cost in terms of FPGA resources.
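The core idea can be pictured with the invented two-task example below: because the application's possible kernel interactions are known at compile time, the scheduler reduces to a finite-state machine; OSEK-V realizes such a machine in the processor pipeline rather than in a software table as sketched here.

    # (current state, system call) -> (next state, task to dispatch)
    TRANSITIONS = {
        ("idle",       "ActivateTask(Flight)"): ("flight",     "Flight"),
        ("flight",     "ActivateTask(Log)"):    ("flight+log", "Flight"),  # Log has lower priority
        ("flight",     "TerminateTask()"):      ("idle",       None),
        ("flight+log", "TerminateTask()"):      ("log",        "Log"),
        ("log",        "TerminateTask()"):      ("idle",       None),
    }

    def kernel_step(state, syscall):
        """Look up the pre-computed scheduling decision for this interaction."""
        return TRANSITIONS[(state, syscall)]

    state = "idle"
    for call in ["ActivateTask(Flight)", "ActivateTask(Log)",
                 "TerminateTask()", "TerminateTask()"]:
        state, running = kernel_step(state, call)
        print(call, "->", state, "running:", running)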

