26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2025), June 16–17, 2025, Seoul, Republic of Korea
26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2025)
Frontmatter
26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2025)
Papers
rtesbench: A Multi-core Benchmark Framework for Real-Time Embedded Systems
Yixiao Xing,
Yixiao Li, and
Hiroaki Takada
(Nagoya University, Japan)
With the growing demand for performance, multi-core hardware and RTOSs are being increasingly adopted in embedded systems. These modern embedded systems exhibit differentiated features in many aspects, such as hardware architecture and RTOS design. However, traditional benchmark tools struggle to evaluate the performance of multi-core systems effectively. To enable the evaluation of such highly diversified multi-core embedded systems, we designed a benchmark framework specialized for multi-core targets that assesses multi-core performance characteristics, including multi-core contention and RTOS design factors, across different target systems. Using this framework, we implemented multi-core benchmarks on three distinct yet representative embedded systems, each with unique characteristics. The evaluation not only allowed us to capture their multi-core performance characteristics but also helped us identify performance bottlenecks caused by specific design features.
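The abstract leaves the benchmark internals to the paper; as a rough illustration of what a multi-core contention benchmark measures, the Python sketch below times lock-protected increments with one versus several workers and reports the slowdown. The workload and parameters are illustrative assumptions, not rtesbench's API.

```python
# Illustrative multi-core contention microbenchmark (not rtesbench's API):
# time a fixed number of lock-protected increments with 1 vs. N workers.
import multiprocessing as mp
import time

ITERS = 100_000

def worker(counter, lock):
    for _ in range(ITERS):
        with lock:
            counter.value += 1

def run(n_workers):
    counter = mp.Value("i", 0)
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(counter, lock))
             for _ in range(n_workers)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    elapsed = time.perf_counter() - start
    return elapsed / (n_workers * ITERS)   # cost per increment

if __name__ == "__main__":
    solo, contended = run(1), run(4)
    print(f"lock contention slowdown: {contended / solo:.2f}x")
```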
R-Visor: An Extensible Dynamic Binary Instrumentation and Analysis Framework for Open Instruction Set Architectures
Edwin Kayang,
Mishel Jyothis Paul,
Eric Jahns,
Muslum Ozgur Ozmen,
Milan Stojkov,
Kevin Rudd, and
Michel A. Kinsy
(Arizona State University, USA; University of Novi Sad, Serbia)
Binary instrumentation tools are widely used to facilitate the development of hardware and software systems. Traditionally, these tools are designed around a fixed Instruction Set Architecture (ISA) specification. However, there is a shift in the architectural community towards open ISAs, whose key feature is the ability to add custom ISA extensions. The lack of extensibility in traditional binary instrumentation tools limits their capacity to adapt to these evolving ISAs, thus hindering their ability to analyze and modify binaries built for open ISAs.
To address this challenge, we present R-Visor, a modular and extensible Dynamic Binary Instrumentation (DBI) framework designed for open ISAs, allowing seamless integration of new extensions for instrumentation. R-Visor uses a cache-based just-in-time execution model to run application binaries while supporting advanced instrumentation routines at multiple granularities. R-Visor leverages ArchVisor, a new Domain-Specific Language (DSL) that allows users to write specifications for ISAs and extensions, enabling seamless extensibility. Our implementation of R-Visor on the RISC-V architecture shows that on average R-Visor incurs 1.81× less overhead while utilizing 2.64× less memory than DynamoRIO, an industry-standard DBI framework. Through ArchVisor, R-Visor requires 9.30× less code than DynamoRIO to support the F (floating-point) and C (compressed) extensions.
Artifacts Available
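ArchVisor's concrete syntax is not shown in the abstract; the sketch below only illustrates the underlying idea of extensible, table-driven instruction decoding, where supporting an ISA extension means adding data rather than engine code. The field layout follows the standard RISC-V R-type format; the table structure is an assumption, not ArchVisor.

```python
# Generic table-driven RISC-V decoder sketch (not ArchVisor syntax):
# instruction encodings are data, so a new extension is a new table entry.
RTYPE_FIELDS = {"rd": (7, 5), "rs1": (15, 5), "rs2": (20, 5)}

ISA_TABLE = {
    # (opcode, funct3, funct7) -> mnemonic
    (0b0110011, 0b000, 0b0000000): "add",
    (0b0110011, 0b000, 0b0100000): "sub",
    # a custom extension would simply add entries here
}

def bits(word, lo, width):
    return (word >> lo) & ((1 << width) - 1)

def decode(word):
    key = (bits(word, 0, 7), bits(word, 12, 3), bits(word, 25, 7))
    mnemonic = ISA_TABLE.get(key, "unknown")
    operands = {name: bits(word, lo, w) for name, (lo, w) in RTYPE_FIELDS.items()}
    return mnemonic, operands

print(decode(0x002081B3))  # ('add', {'rd': 3, 'rs1': 1, 'rs2': 2})
```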
SPARQ: An Accelerator Architecture for Large Language Models with Joint Sparsity and Quantization Techniques
Seonggyu Choi and
Hyungmin Cho
(Sungkyunkwan University, Republic of Korea)
Large Language Models (LLMs) have demonstrated unprecedented capabilities in text generation, translation, and summarization tasks. However, their deployment on resource-constrained systems remains challenging due to their large parameter sizes and high computational demands. To address this, we propose SPARQ, a specialized accelerator architecture that leverages both sparsity and quantization to optimize LLM inference. By integrating multiply-accumulate units tailored for quantized operations and a systolic array architecture supporting N:M semi-structured sparsity, SPARQ significantly enhances area and energy efficiency with minimal impact on model quality, as demonstrated in prior work on GPTQ and SparseGPT.
Our evaluations demonstrate that SPARQ achieves up to 1.53× greater area efficiency and 1.58× better energy efficiency compared to the baseline, particularly for larger models.
Artifacts Available
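The N:M semi-structured sparsity SPARQ exploits can be illustrated independently of the hardware: in every group of M consecutive weights, at most N are nonzero. Below is a minimal numpy sketch of 2:4 pruning — the general technique, not SPARQ's datapath.

```python
# 2:4 semi-structured pruning sketch (the general N:M idea, not SPARQ's datapath):
# in every group of M=4 consecutive weights, keep the N=2 largest magnitudes.
import numpy as np

def prune_n_m(w, n=2, m=4):
    groups = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8)
print(prune_n_m(w))  # exactly 2 nonzeros per group of 4
```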
ASC-Hook: Efficient System Call Interception for ARM
Yang Shen,
Min Xie,
Tao Wu,
Wenzhe Zhang,
Ruibo Wang, and
Gen Zhang
(National University of Defense Technology, China; Changsha University of Science and Technology, China)
System call interception is essential for tools that modify or monitor application behavior. However, current system call interception solutions on ARM platforms still face challenges related to performance and completeness. This paper introduces ASC-Hook, an efficient and comprehensive binary rewriting framework specifically designed for intercepting system calls on ARM architectures. ASC-Hook tackles two critical challenges: the misalignment of the target address caused by directly replacing the SVC instruction with BR x8, and the return to the original control flow after system call interception. To achieve this, we propose a hybrid replacement strategy combined with a customized trampoline mechanism. Additionally, multiple completeness strategies tailored to system call interception are implemented to guarantee thorough coverage. Experimental evaluations demonstrate that ASC-Hook reduces overhead to as little as 1/29 that of existing solutions, while incurring an average performance loss of only 3.8% in system-call-intensive applications.
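Some context on why trampolines come up at all here: an AArch64 direct branch encodes a signed 26-bit word offset, so a hook beyond roughly ±128 MB of the patch site cannot be reached by rewriting a single 4-byte instruction. The reachability check below is that standard arithmetic, not ASC-Hook code.

```python
# AArch64 B-instruction reachability check (standard arithmetic, not ASC-Hook code):
# B encodes a signed 26-bit word offset, i.e. roughly +/-128 MB from the patch site.
def reachable_by_direct_branch(patch_addr, hook_addr):
    offset = hook_addr - patch_addr
    return offset % 4 == 0 and -(1 << 27) <= offset < (1 << 27)

print(reachable_by_direct_branch(0x40_0000, 0x50_0000))         # True
print(reachable_by_direct_branch(0x40_0000, 0x7F00_0000_0000))  # False: needs a trampoline
```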
Modeling and Verification of Sigma Delta Neural Networks using Satisfiability Modulo Theory
Sirshendu Das,
Ansuman Banerjee, and
Swarup Kumar Mohalik
(Indian Statistical Institute, Kolkata, India; Ericsson Research, India)
In the context of modern-day embedded safety-critical systems, and low-resource edge devices in particular, Sigma-Delta Neural Networks (SDNNs) offer a promising alternative to traditional Artificial Neural Networks (ANNs) by leveraging event-driven, sparse computations inspired by biological neural processing. This energy-efficient paradigm makes SDNNs well suited for neuromorphic hardware and real-time applications, particularly in scenarios with temporal redundancy, such as video processing. However, as neural networks become integral to safety-critical systems, ensuring their robustness against adversarial perturbations is an absolute necessity. In this work, we propose an end-to-end framework for formal modeling and verification of SDNNs using Satisfiability Modulo Theory (SMT). Unlike empirical robustness evaluations, SMT-based verification provides formal guarantees by encoding SDNN behavior and adversarial robustness properties as mathematical constraints. We introduce an SMT-based formulation for encoding SDNNs with SMT constraints and define a robustness property motivated by video stream processing. Our approach systematically examines how well SDNNs can handle adversarial attacks, ensuring they work correctly in safety-critical applications. We validate our framework through experiments on a temporal version of the MNIST dataset. To the best of our knowledge, this is the first formal verification framework for SDNNs, bridging the gap between neuromorphic computing and rigorous verification.
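As a flavor of what an SMT encoding of a robustness property looks like, the toy z3 sketch below models one heavily simplified sigma-delta step (emit a spike iff the input change since the last step exceeds a threshold) and asks whether an ε-bounded perturbation can flip the spike decision. The neuron model and constants are assumptions, not the paper's encoding.

```python
# Toy SMT robustness query for one sigma-delta step (simplified model, z3py):
# the unit spikes iff the change in its input since the last step exceeds a
# threshold; we ask whether an eps-bounded perturbation can flip that decision.
from z3 import Real, Solver, And, sat

x, x_adv, prev = Real("x"), Real("x_adv"), Real("prev")
THETA, EPS = 0.5, 0.1

def spikes(v):
    return v - prev > THETA

s = Solver()
s.add(And(x_adv - x <= EPS, x - x_adv <= EPS))  # |x_adv - x| <= EPS
s.add(x == 1.0, prev == 0.3)                    # a concrete nominal input
s.add(spikes(x) != spikes(x_adv))               # search for a decision flip

if s.check() == sat:
    print("counterexample:", s.model())
else:
    print("spike decision is robust within EPS")  # unsat here: 0.6..0.8 > THETA
```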
Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP)
Siyi Xu,
Limin Jiang,
Yintao Liu,
Yihao Shen,
Yi Shi,
Shan Cao, and
Zhiyuan Jiang
(Shanghai University, China)
Vector processing is crucial for boosting processor performance and efficiency, particularly with data-parallel tasks. The RISC-V "V" Vector Extension (RVV) enhances algorithm efficiency by supporting vector registers of dynamic sizes and their grouping. Nevertheless, for very long vectors, the static number of RVV vector registers and its power-of-two grouping can lead to performance restrictions. To counteract this limitation, this work introduces Zoozve, a RISC-V vector instruction extension that eliminates the need for strip-mining. Zoozve allows for flexible vector register length and count configurations to boost data computation parallelism. With a data-adaptive register allocation approach, Zoozve permits any register groupings and accurately aligns vector lengths, cutting down register overhead and alleviating performance declines from strip-mining. Additionally, the paper details Zoozve's compiler and hardware implementations using LLVM and SystemVerilog. Initial results indicate Zoozve yields a minimum 10.10× reduction in dynamic instruction count for fast Fourier transform (FFT), with a mere 5.2% increase in overall silicon area.
Artifacts Available
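For readers unfamiliar with strip-mining: when the data is longer than the vector register group, RVV software loops over VLMAX-sized strips, as mimicked below in Python. Zoozve's premise is that arbitrary register grouping can make the whole vector fit one group, collapsing such loops; the VLMAX value here is arbitrary.

```python
# Strip-mining as RVV software does it today: chop a long vector into
# VLMAX-sized strips and loop. (Conceptual sketch; VLMAX=8 is arbitrary.)
VLMAX = 8

def vector_add_strip_mined(a, b):
    out, i, n, iterations = [], 0, len(a), 0
    while n > 0:
        vl = min(VLMAX, n)                 # vsetvli-style length negotiation
        out += [x + y for x, y in zip(a[i:i+vl], b[i:i+vl])]
        i, n = i + vl, n - vl
        iterations += 1
    return out, iterations

_, iters = vector_add_strip_mined(list(range(100)), list(range(100)))
print(iters)  # 13 strips; with arbitrary register grouping this could be 1
```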
DSP-MLIR: A Domain-Specific Language and MLIR Dialect for Digital Signal Processing
Abhinav Kumar,
Atharva Khedkar,
Hwisoo So,
Megan Kuo,
Ameya Gurjar,
Partha Biswas, and
Aviral Shrivastava
(Arizona State University, USA; Yonsei University, Republic of Korea; MathWorks, USA)
High-quality compilation of Digital Signal Processing (DSP) algorithms is crucial for achieving real-time performance and optimizing resource utilization. Traditional compilers often struggle to effectively optimize DSP applications since their optimization passes mainly deal with low-level intermediate representations. This paper introduces DSP-MLIR, a comprehensive framework for DSP application development and optimization. DSP-MLIR comprises i) a Python-like domain-specific language (DSL) (named DSP-DSL) for intuitive and easier programming of DSP applications, ii) a dedicated MLIR dialect (named DSP-dialect) with 90+ operations and 16 optimizations at the level of DSP operations, and iii) lowerings to the Affine and standard MLIR dialects for a high-quality compilation flow for DSP applications. The effectiveness of the proposed DSP-MLIR is evaluated by comparing the runtimes of the binaries generated by various compilation flows, including GCC, Clang, Hexagon-Clang, and existing MLIR passes. Experiments on 20 DSP applications collected from various sources demonstrate an average performance improvement of 12% over state-of-the-art compilation flows, with a 10% reduction in the generated binary size and no significant variation in compilation time. Further, expressing DSP applications in the proposed DSP-DSL reduces the code complexity and development time of DSP applications (as measured in lines of code (LOC)) by an average of 5× over their specification in C.
The DSP-MLIR framework is open-source and available at: https://github.com/MPSLab-ASU/DSP_MLIR
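The abstract does not enumerate the 16 DSP-level optimizations; one representative rewrite that is only visible at the DSP-operation level is fusing two cascaded FIR filters into one by convolving their tap vectors — a standard LTI identity, sketched below in numpy. Whether DSP-dialect includes this exact pass is an assumption.

```python
# One DSP-level rewrite a low-level IR cannot see: two cascaded FIR filters
# equal a single FIR whose taps are the convolution of the two tap vectors.
import numpy as np

x = np.random.randn(256)
h1 = np.array([0.25, 0.5, 0.25])   # smoothing taps
h2 = np.array([1.0, -1.0])         # differencing taps

cascaded = np.convolve(np.convolve(x, h1), h2)   # filter(filter(x, h1), h2)
fused    = np.convolve(x, np.convolve(h1, h2))   # filter(x, h1 conv h2)

print(np.allclose(cascaded, fused))  # True: one pass over x instead of two
```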
SSFFT: Energy-Efficient Selective Scaling for Fast Fourier Transform in Embedded GPUs
Dongwon Yang,
Jaebeom Jeon,
Minseong Gil,
Junsu Kim,
Seondeok Kim,
Gunjae Koo,
Myung Kuk Yoon, and
Yunho Oh
(Korea University, Republic of Korea; Ewha Womans University, Republic of Korea)
Fast Fourier Transform (FFT) is critical in applications such as signal processing, communications, and AI. Embedded GPUs are often used to accelerate FFT due to their computational efficiency, but energy efficiency remains a key challenge due to power constraints. Existing solutions, such as the cuFFT library provided by NVIDIA, employ static configurations for the number of thread blocks and threads per block. This static approach often results in ineffective threads that consume power without contributing to performance, particularly if the FFT length or batch size varies. Furthermore, for large FFT lengths, cuFFT internally splits the computation into multiple kernel invocations. This decomposition can lead to L2 cache thrashing, resulting in redundant global memory accesses and degraded efficiency. To address these challenges, this paper proposes SSFFT, a software technique for embedded GPUs. The key idea of SSFFT is to maximize the number of useful threads that contribute to performance while minimizing ineffective threads. SSFFT is implemented based on a novel theoretical model that determines how many thread blocks and threads per block are effective for a given FFT length, batch size, and hardware resource availability. SSFFT statically determines these configurations and adaptively launches either a GPU kernel for regular FFT operations or a newly implemented kernel that integrates multiple FFT steps. By tailoring thread allocation to workload characteristics and minimizing inter-kernel memory interference, SSFFT improves energy efficiency without compromising performance. In our evaluation, SSFFT achieves a 1.29× speedup and a 1.26× improvement in throughput per watt compared to cuFFT.
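SSFFT's analytical model is not given in the abstract; the function below is a hypothetical illustration of the general idea — derive the launch configuration from the workload (FFT length × batch) instead of using a static setting. All parameter names and values are assumptions.

```python
# Hypothetical thread-sizing sketch in the spirit of SSFFT: derive the launch
# configuration from the amount of useful work rather than a static default.
import math

def launch_config(fft_len, batch, threads_per_block=256,
                  max_blocks=1024, points_per_thread=4):
    useful_threads = (fft_len * batch) // points_per_thread
    blocks = min(max_blocks, math.ceil(useful_threads / threads_per_block))
    return blocks, threads_per_block

print(launch_config(fft_len=1 << 14, batch=2))   # scales with the workload
print(launch_config(fft_len=1 << 8,  batch=1))   # small FFT -> few blocks
```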
GroupTuner: Efficient Group-Aware Compiler Auto-tuning
Bingyu Gao,
Mengyu Yao,
Ziming Wang,
Dong Liu,
Ding Li,
Xiangqun Chen, and
Yao Guo
(Peking University, China; ZTE Corporation, China)
Modern compilers typically provide hundreds of options to optimize program performance, but users often cannot fully leverage them due to the huge number of options. While standard optimization combinations (e.g., -O3) provide reasonable defaults, they often fail to deliver near-peak performance across diverse programs and architectures. To address this challenge, compiler auto-tuning techniques have emerged to automate the discovery of improved option combinations. Existing techniques typically focus on identifying critical options and prioritizing them during the search to improve efficiency. However, due to limited tuning iterations, the resulting data is often sparse and noisy, making it highly challenging to accurately identify critical options. As a result, these algorithms are prone to being trapped in local optima.
To address this limitation, we propose GroupTuner, a group-aware auto-tuning technique that directly applies localized mutation to coherent option groups based on historically best-performing combinations, thus avoiding the need to explicitly identify critical options. By forgoing the need to know precisely which options are most important, GroupTuner maximizes the use of existing performance data, ensuring more targeted exploration. Extensive experiments demonstrate that GroupTuner can efficiently discover competitive option combinations, achieving an average performance improvement of 12.39% over -O3 while requiring only 77.21% of the time needed by random search, significantly outperforming state-of-the-art methods.
Artifacts Available
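A minimal sketch of the search loop the abstract describes: mutate one coherent option group of the best-known configuration per iteration, never asking which individual option is critical. The group definitions and the measure() stub are placeholders, not the paper's actual groupings.

```python
# Minimal group-aware search loop in the spirit of GroupTuner: mutate one
# coherent flag group of the best-known configuration per iteration.
# GROUPS and measure() are placeholders, not the paper's actual groupings.
import random

GROUPS = {
    "vectorize": ["-ftree-vectorize", "-fvect-cost-model=dynamic"],
    "inline":    ["-finline-functions", "-finline-limit=400"],
    "loops":     ["-funroll-loops", "-floop-interchange"],
}

def measure(flags):               # placeholder: compile, run, return runtime
    return random.random()

best_flags, best_time = set(), float("inf")
for _ in range(50):
    group = random.choice(list(GROUPS))
    candidate = set(best_flags)
    for flag in GROUPS[group]:    # localized mutation: toggle flags in one group
        if random.random() < 0.5:
            candidate.symmetric_difference_update({flag})
    t = measure(candidate)
    if t < best_time:
        best_flags, best_time = candidate, t

print(sorted(best_flags))
```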
SetMP: Set Associative Mapping Management for Multi-plane Optimization in SSDs
Aobo Yang,
Huanhuan Tian,
Yuyang He,
Jiaojiao Wu,
Jiaxu Wu,
Zhibing Sha,
Zhigang Cai, and
Jianwei Liao
(Southwest University, China)
Modern solid state drives (SSDs) employ a four-level parallel structure of channels, chips, dies, and planes to enhance SSD performance through maximum access parallelism. Because the planes within the same die share the same set of control units and peripheral circuits, the SSD generally has to open multiple aligned blocks to enable access parallelism through multi-plane (MP) operations. Such a passive method, however, cannot effectively exploit plane-level parallelism, since MP operations can only be triggered when the accessed data pages have the same offset address across the planes. In addition, it worsens the block open-time issue, as multiple aligned blocks are kept open to enable MP operations for simultaneous data writing; this, in turn, increases the error rate when reading data from blocks with long open times. This paper introduces SetMP, a novel approach that proactively aggregates requests to exploit plane-level parallelism through set-associative management. By increasing the frequency of MP operations, SetMP enhances I/O responsiveness while reducing the block open time associated with maintaining multiple open blocks for MP operations. Evaluation results demonstrate that SetMP achieves an average I/O latency reduction of 16.9% without significantly increasing block open time, outperforming existing optimization schemes.
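The abstract does not spell out the exact mapping; the sketch below shows the general set-associative idea SetMP builds on — aggregate buffered pages so they land at one shared offset across the planes of a die and can be written with a single MP operation. The geometry and policy here are assumptions, not the paper's exact scheme.

```python
# Set-associative page placement sketch (the general idea behind SetMP; the
# geometry and policy below are assumptions, not the paper's exact scheme).
NUM_PLANES = 4

class Die:
    def __init__(self):
        self.write_ptr = [0] * NUM_PLANES   # next free offset per plane

    def place(self, pages):
        """Place up to NUM_PLANES buffered pages at one shared offset so a
        single multi-plane program covers all of them."""
        offset = max(self.write_ptr)        # align all planes at one offset
        batch = []
        for plane, lpn in enumerate(pages[:NUM_PLANES]):
            batch.append((lpn, plane, offset))
            self.write_ptr[plane] = offset + 1
        return batch                        # one MP operation, not len(pages)

die = Die()
print(die.place([100, 205, 311, 47]))  # 4 pages, same offset, 1 MP program
```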
ADaPS: Adaptive Data Partitioning to Parallelize CNN Inference on Resource-Constrained Hardware
Jaume Mateu Cuadrat and
Bernhard Egger
(Seoul National University, Republic of Korea)
The growing adoption of AI applications has led to an increased demand for deploying neural networks on diverse device platforms. However, even modest networks now require specialized hardware for efficient execution due to their rising computational cost. To address this, distributed execution across connected, resource-constrained devices is gaining importance. While prior work relies on empirical models or supports limited partitioning, we present ADaPS, a novel framework for distributing Convolutional Neural Network (CNN) inference workloads across heterogeneous embedded devices. Our analytical model partitions the height and width dimensions of 4D tensors and explores layer-fusion opportunities, accounting for compute, memory, and communication constraints. ADaPS efficiently explores the vast partitioning space using a tree-based hybrid optimization algorithm combining alpha-beta pruning and dynamic programming. Evaluations on multiple CNNs and device configurations show that ADaPS improves inference latency by up to 1.2× on average while significantly reducing data transfers compared to state-of-the-art methods.
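As a simplified picture of the optimization problem, the toy dynamic program below picks a partitioning per layer while paying a transfer penalty whenever consecutive layers use different splits. The cost numbers are made up, and ADaPS's real model additionally handles fusion, memory limits, and alpha-beta pruning of the search tree.

```python
# Toy dynamic program in the spirit of ADaPS: pick a partitioning per layer,
# paying a re-partitioning transfer cost when consecutive layers differ.
# The cost numbers are made up; the real model also handles fusion and memory.
SPLITS = ["height", "width"]                 # candidate tensor partitionings
compute = [{"height": 4.0, "width": 5.0},    # per-layer cost under each split
           {"height": 6.0, "width": 3.0},
           {"height": 2.0, "width": 2.5}]
TRANSFER = 1.5                               # cost to switch split between layers

best = {s: compute[0][s] for s in SPLITS}
for layer in compute[1:]:
    best = {s: layer[s] + min(best[p] + (0 if p == s else TRANSFER)
                              for p in SPLITS)
            for s in SPLITS}

print(min(best.values()))  # 10.5: stay on "width" throughout (5 + 3 + 2.5)
```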
Graphitron: A Domain Specific Language for FPGA-Based Graph Processing Accelerator Generation
Xinmiao Zhang,
Zheng Feng,
Shengwen Liang,
Xinyu Chen,
Lei Zhang, and
Cheng Liu
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Hong Kong University of Science and Technology, Guangzhou, China)
Due to their hardware customization capabilities, FPGA-based graph processing accelerators achieve significantly higher energy efficiency than many general-purpose computing engines. However, designing these accelerators remains a substantial challenge for high-level users. To overcome the programming barrier, FPGA-based accelerator design frameworks built on generic graph processing programming models have been developed to automate accelerator generation through pre-built templates. However, they often tightly couple graph processing algorithms, programming models, processing paradigms, and accelerator architectures, which severely limits the range of expressible algorithms and may also restrict performance when the generated accelerators fail to suit the dynamic processing patterns of graph processing algorithms.
In this work, we propose Graphitron, a domain-specific language (DSL) that enables the automatic generation of FPGA-based graph processing accelerators without engaging with the complexities of low-level FPGA design. Graphitron defines vertices and edges as primitive data types and enables users to implement graph processing algorithms by performing various functionalities on top of these primitive data, which greatly eases algorithm description for high-level users. During compilation, the graph processing functions are naturally classified into either a vertex-centric or an edge-centric processing paradigm according to the target data types, enabling the generation of accelerator kernels with different characteristics. In addition, because of the explicit binding between graph processing functions and data types, the Graphitron compiler can automatically infer the computing and memory access patterns of each processing function within graph processing algorithms and apply corresponding hardware optimizations such as pipelining, data shuffling, and caching. In essence, graph semantic information can be used to guide algorithm-specific customization of the resulting accelerators for higher performance. Our experiments show that Graphitron can generate accelerators for a broader range of graph processing algorithms than prior template-based generation frameworks. Moreover, the accelerators produced by Graphitron achieve performance comparable to, and in some cases exceeding, that of existing frameworks when the combined programming paradigms are beneficial from an algorithmic perspective.
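Graphitron's concrete syntax is not reproduced in the abstract; the sketch below only mimics its classification rule — a function whose operand is vertex data compiles to a vertex-centric kernel, one over edge data to an edge-centric kernel — with Python classes standing in for the DSL's primitive types.

```python
# Sketch of Graphitron's classification rule with Python types standing in
# for the DSL's primitives: the operand type picks the processing paradigm.
class VertexSet(list): pass
class EdgeSet(list): pass

def compile_kernel(func, operands):
    if isinstance(operands, VertexSet):
        paradigm = "vertex-centric"     # e.g. per-vertex update kernel
    elif isinstance(operands, EdgeSet):
        paradigm = "edge-centric"       # e.g. streaming over the edge list
    else:
        raise TypeError("graph functions operate on vertex or edge sets")
    print(f"generating {paradigm} kernel for {func.__name__}")

def update_rank(v): ...
def relax(e): ...

compile_kernel(update_rank, VertexSet([0, 1, 2]))
compile_kernel(relax, EdgeSet([(0, 1), (1, 2)]))
```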
vNV-Heap: An Ownership-Based Virtually Non-volatile Heap for Embedded Systems
Markus Elias Gerber,
Luis Gerhorst,
Ishwar Mudraje,
Kai Vogelgesang,
Thorsten Herfet, and
Peter Wägemann
(Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; Saarland University, Germany)
The Internet of Batteryless Things might revolutionize our understanding of connected devices by harvesting the required operational energy from the environment. These systems come with the system-software challenge that intermittently powered IoT devices have to checkpoint their state in non-volatile memory and later resume from this state when sufficient energy is available. The scarce energy resources demand that only modified data be persisted before a power failure, which requires precise modification tracking.
We present vNV-Heap, the first ownership-based virtually Non-Volatile Heap for intermittently powered systems with guaranteed power-failure resilience. The heap exploits ownership systems, a zero-cost (i.e., compile-time) abstraction implemented, for example, by Rust, to track modifications and virtualize object persistence. To achieve power-failure resilience, our heap is designed and implemented to guarantee bounded operations through static program-code analysis: for example, the heap allows for determining a worst-case energy consumption for persisting modified, currently volatile objects. The evaluation of our open-source implementation on an embedded hardware platform (ESP32-C3) shows that our heap abstraction is more energy efficient than existing approaches while also providing runtime guarantees through static worst-case bounds.
Artifacts Available
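Python cannot reproduce the zero-cost, compile-time part of Rust's ownership, but the runtime behavior being tracked can be sketched: taking a mutable handle marks an object dirty, and only dirty objects are persisted on a power-failure warning. Everything below is a conceptual analogue, not the vNV-Heap API.

```python
# Conceptual analogue of vNV-Heap's modification tracking (Python cannot give
# Rust's compile-time guarantees; this only mirrors the runtime behavior):
# mutable access marks an object dirty, and only dirty objects are persisted
# when a power failure is imminent.
class VNVObject:
    def __init__(self, value):
        self._value, self.dirty = value, True

    def get(self):                 # shared, read-only access
        return self._value

    def get_mut(self, value):      # exclusive, mutating access marks dirty
        self._value, self.dirty = value, True

class VNVHeap:
    def __init__(self):
        self.objects = []

    def allocate(self, value):
        obj = VNVObject(value)
        self.objects.append(obj)
        return obj

    def on_power_warning(self):    # persist only modified objects
        dirty = [o for o in self.objects if o.dirty]
        for o in dirty:
            o.dirty = False        # stand-in for writing the object to flash
        return len(dirty)

heap = VNVHeap()
a, b = heap.allocate(1), heap.allocate(2)
print(heap.on_power_warning())    # 2: both fresh objects are dirty
a.get_mut(10)
print(heap.on_power_warning())    # 1: only `a` changed since the checkpoint
```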
Towards Macro-Aware C-to-Rust Transpilation (WIP)
Robbe De Greef,
Attilio Discepoli,
Esteban Aguililla Klein,
Théo Engels,
Ken Hasselmann, and
Antonio Paolillo
(Vrije Universiteit Brussel, Belgium; Université Libre de Bruxelles, Belgium; Royal Military Academy of Belgium, Belgium)
The automatic translation of legacy C code to Rust presents significant challenges, particularly in handling preprocessor macros. C macros introduce metaprogramming constructs that operate at the text level, outside of C's syntax tree, making their direct translation to Rust non-trivial. Existing transpilers (source-to-source compilers) expand macros before translation, sacrificing their abstraction and reducing code maintainability. In this work, we introduce Oxidize, a macro-aware C-to-Rust transpilation framework that preserves macro semantics by translating C macros into Rust-compatible constructs while selectively expanding only those that interfere with Rust's stricter semantics. We evaluate our techniques on a small-scale study of real-world macros and find that the majority can be safely and idiomatically transpiled without full expansion.
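Oxidize's heuristics are not detailed in the abstract; the classifier below illustrates the kind of triage a macro-aware transpiler must perform, under assumed rules: object-like constants map to Rust const items, simple function-like macros to functions or macro_rules!, and token-pasting or stringizing macros have no structured Rust counterpart and must be expanded away.

```python
# Illustrative triage of C macros for Rust translation (assumed heuristics,
# not Oxidize's actual rules): constants map to `const`, simple function-like
# macros to functions or macro_rules!, token pasting must be expanded away.
import re

def classify_macro(definition):
    m = re.match(r"#define\s+(\w+)(\([^)]*\))?\s*(.*)", definition)
    name, params, body = m.group(1), m.group(2), m.group(3)
    if "##" in body or "#" in body.replace("##", ""):
        return name, "expand (token pasting / stringizing)"
    if params is None:
        return name, "translate to `const`"
    return name, "translate to fn or macro_rules!"

for d in ["#define BUF_SIZE 256",
          "#define MAX(a, b) ((a) > (b) ? (a) : (b))",
          "#define GETTER(f) get_##f"]:
    print(classify_macro(d))
```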
LUCI: Lightweight UI Command Interface
Guna Lagudu,
Vinayak Sharma, and
Aviral Shrivastava
(Arizona State University, USA)
Modern embedded systems are powered by increasingly capable hardware and rely ever more on Artificial Intelligence (AI) technologies for advanced capabilities. Large Language Models (LLMs) are now widely used to enable the next generation of human-computer interaction. While LLMs have shown impressive task-orchestration capabilities, their computational complexity has limited them to running in the cloud, which introduces internet dependency and additional latency. While smaller LLMs (<5B parameters) can run on modern embedded systems such as smartwatches and phones, their performance in UI interaction and task orchestration remains poor. In this paper, we introduce LUCI: Lightweight UI Command Interface. LUCI follows a separation-of-tasks structure, using a combination of LLM agents and algorithmic procedures to accomplish sub-tasks while a high-level LLM agent with rule-based checks orchestrates the pipeline. LUCI addresses the limitations of previous in-context learning approaches by incorporating a novel semantic information extraction mechanism that compresses the frontend code into a structured intermediate Information-Action-Field (IAF) representation. These IAF representations are then used by an action-selection LLM. This compression gives LUCI a much larger effective context window along with better grounding due to the context information in the IAF. Pairing our multi-agent pipeline with our IAF representations allows LUCI to achieve task success rates similar to GPT-4V on the Mind2Web benchmark while using the 2.7B-parameter text-only Phi-2 model. When tested with GPT-3.5, LUCI shows a 20% improvement in task success rates over the state of the art (SOTA) on the same benchmarks.
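The exact IAF schema is not given in the abstract; the field names below are inferred from "Information-Action-Field", and the point is the compression itself — each UI element reduces to a few structured slots for the action-selection LLM.

```python
# Assumed shape of LUCI's Information-Action-Field (IAF) records (field names
# inferred from the name, not the paper's exact schema): each UI element is
# compressed to the few slots an action-selection LLM actually needs.
from dataclasses import dataclass

@dataclass
class IAF:
    information: str   # visible text / label
    action: str        # interaction type: click, type, select, ...
    field: str         # stable element handle for execution

def compress(dom_elements):
    return [IAF(e["text"], e["action"], e["id"]) for e in dom_elements]

page = [
    {"id": "input#email",  "action": "type",  "text": "Email address"},
    {"id": "button#login", "action": "click", "text": "Sign in"},
]
for iaf in compress(page):
    print(iaf)   # far fewer tokens than the raw frontend code
```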
Kubism: Disassembling and Reassembling K-Means Clustering for Mobile Heterogeneous Platforms
Seondeok Kim,
Sangun Choi,
Jaebeom Jeon,
Junsu Kim,
Minseong Gil,
Jaehyeok Ryu, and
Yunho Oh
(Korea University, Republic of Korea)
K-means clustering is widely used in applications such as classification, recommendation, and image processing for its simplicity and efficiency. While often deployed on servers, it is also used on mobile platforms for tasks like sensor data analysis. However, mobile devices face tight hardware and energy constraints, making efficient execution challenging. Prior parallel K-means approaches still suffer from GPU underutilization due to warp divergence and leave CPUs idle. This paper proposes Kubism, a novel software technique that disassembles and reassembles the K-means clustering algorithm to maximize CPU and GPU resource utilization on mobile platforms. Kubism incorporates several key strategies, including reordering operations to minimize unnecessary work, balancing workloads across processing units to avoid idle time, dynamically adjusting task execution based on real-time performance metrics, and distributing computation efficiently between the CPU and GPU. These methods synergistically improve performance by reducing idle periods and optimizing the use of hardware resources. In our evaluation on the NVIDIA Jetson AGX Orin platform, Kubism achieves up to a 2.65× speedup in individual clustering iterations and an average 1.23× improvement in overall end-to-end execution time compared to prior work.
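A toy version of the rebalancing idea: split the K-means assignment step between two workers by a ratio, then update the ratio from measured throughput. numpy stands in for both devices, and the update rule is an assumption rather than Kubism's actual scheduler.

```python
# Toy version of Kubism's rebalancing idea: split the k-means assignment step
# across two workers by a ratio and adapt the ratio from measured throughput.
# (numpy stands in for both devices; the update rule is an assumption.)
import time
import numpy as np

def assign(points, centroids):
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

points = np.random.randn(50_000, 8)
centroids = np.random.randn(16, 8)
ratio = 0.5                                   # fraction given to "device A"

for it in range(3):
    cut = int(len(points) * ratio)
    t0 = time.perf_counter()
    assign(points[:cut], centroids)           # "device A" share
    t1 = time.perf_counter()
    assign(points[cut:], centroids)           # "device B" share
    t2 = time.perf_counter()
    thr_a = cut / (t1 - t0)                   # points per second on A
    thr_b = (len(points) - cut) / (t2 - t1)   # points per second on B
    ratio = thr_a / (thr_a + thr_b)           # shift work toward the faster side
    print(f"iter {it}: ratio -> {ratio:.2f}")
```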
Multi-level Machine Learning-Guided Autotuning for Efficient Code Generation on a Deep Learning Accelerator
JooHyoung Cha,
Munyoung Lee,
Jinse Kwon,
Jemin Lee, and
Yongin Kwon
(UST, Republic of Korea; ETRI, Republic of Korea)
The growing complexity of deep learning models necessitates specialized hardware and software optimizations, particularly for deep learning accelerators. While machine learning-based autotuning methods have emerged as a promising solution to reduce manual effort, both template-based and template-free approaches suffer from prolonged tuning times due to the profiling of invalid configurations, which may result in runtime errors. To address this issue, we propose ML2Tuner, a multi-level machine learning-guided autotuning technique designed to improve efficiency and robustness. ML2Tuner introduces two key ideas: (1) a validity prediction model to filter out invalid configurations prior to profiling, and (2) an advanced performance prediction model that leverages hidden features extracted during the compilation process. Experimental results on an extended VTA accelerator demonstrate that ML2Tuner achieves equivalent performance improvements using only 12.3% of the samples required by a TVM-like approach and reduces invalid profiling attempts by an average of 60.8%, highlighting its potential to enhance autotuning performance by filtering out invalid configurations.
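A minimal sketch of the first idea, validity filtering: learn from past builds which configurations are even valid, and spend profiling time only on candidates predicted to be valid. The features, the stand-in validity rule, and the sklearn model are assumptions, not ML2Tuner's implementation.

```python
# Minimal sketch of ML2Tuner's validity filtering (sklearn and the feature
# encoding are assumptions): learn which configurations compile/run at all,
# then profile only candidates predicted to be valid.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# toy history: (tile_size, unroll, vector_width) -> did the build succeed?
history = rng.integers(1, 65, size=(200, 3))
valid = (history[:, 0] * history[:, 2] <= 512).astype(int)  # stand-in rule

clf = RandomForestClassifier(n_estimators=50).fit(history, valid)

candidates = rng.integers(1, 65, size=(1000, 3))
to_profile = candidates[clf.predict(candidates) == 1]
print(f"profiling {len(to_profile)} of {len(candidates)} candidates")
```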