27th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2026), June 15–16, 2026,
Boulder, CO, USA
LoopHint: A Compiler-Assisted Loop Branch Predictor for Embedded DSPs
Yuanyang Xiang,
Chen Xu,
Ruozhou Xiao, and
Zhiwei Zhang
(Institute of Automation at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Loop-intensive computations dominate embedded DSP applications, making loop branch prediction critical for maintaining pipeline efficiency. However, a gap exists between resource-heavy dynamic predictors and inflexible Zero Overhead Looping (ZOL) hardware. Conventional predictors are often too costly for embedded silicon, while traditional ZOL is limited by strict constraints on loop size, nesting, and data-dependent exit conditions. To address these limitations, we introduce LoopHint, a hardware-software co-design approach that provides a flexible, learning-free loop prediction mechanism. LoopHint utilizes compile-time static analysis to identify deterministic loop patterns and dependency chains, communicating this metadata to a simplified hardware loop table via specialized instructions. This co-design approach allows the processor to bypass the warm-up and aliasing issues of dynamic predictors while supporting a broader range of loop structures than ZOL. Experimental results on DSP kernels and the Embench-IoT benchmark suite show that LoopHint achieves an average 67% reduction in branch mispredictions. With minimal impact on area and power, LoopHint provides a scalable solution for high-efficiency control flow in modern embedded DSPs.
Article: pldiws26lctesmain-p1-p
Scheduled Partial-Credit RL for Reliable Code Generation with Small Language Models (WIP)
Suryansh Singh Sijwali and
Suman Saha
(Pennsylvania State University, USA)
Small language models (SLMs, ≤1.5B parameters) are attractive for embedded and resource-limited development workflows because they can run under single-GPU or CPU budgets and be adapted without distributed training. SLM-based code generation is brittle under strict sandboxed evaluation, and reinforcement learning (RL) with binary test rewards is often too sparse to train SLMs reliably. This WIP paper presents a reliability-first RL framework for SLM code generation built around a joint reward. The functional term assigns intermediate credit to near-miss outcomes for syntax validity, crash-free execution, and output production, while a static-analysis term discourages unsafe shortcuts during training. On DeepSeek-Coder-1.3B evaluated on 100 stdin-style APPS+ prompts, a binary-to-partial-credit curriculum improves syntax validity to 63% and produces solutions that pass at least one test in 9% of prompts in a single generated attempt. In contrast, binary-reward PPO regresses below a supervised fine-tuning baseline and partial-credit training from scratch reaches only 27% syntax validity.
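The joint reward described in this abstract can be illustrated with a minimal sketch. The abstract gives no formula, so everything concrete here is an assumption: the `Outcome` record, the tier values (0.1, 0.2, 0.3), the per-finding penalty, and the step-based switch from binary to partial credit are all hypothetical choices, not the authors' settings.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical record of one generated program run in a sandbox."""
    parses: bool          # source is syntactically valid
    runs_clean: bool      # executed without crashing
    has_output: bool      # produced some output
    tests_passed: int
    tests_total: int
    unsafe_findings: int  # number of static-analysis warnings


def binary_reward(o: Outcome) -> float:
    """Sparse reward: 1.0 only when every test passes."""
    return 1.0 if o.tests_total > 0 and o.tests_passed == o.tests_total else 0.0


def partial_credit_reward(o: Outcome) -> float:
    """Tiered reward: near-miss outcomes earn intermediate credit (assumed tiers)."""
    if not o.parses:
        return 0.0
    if not o.runs_clean:
        return 0.1
    if not o.has_output:
        return 0.2
    reward = 0.3
    if o.tests_total > 0:
        # The remaining 0.7 scales with the fraction of tests passed.
        reward += 0.7 * o.tests_passed / o.tests_total
    return reward


def joint_reward(o: Outcome, step: int, switch_step: int = 1000,
                 penalty: float = 0.05) -> float:
    """Functional term minus a static-analysis penalty.

    The curriculum is modeled as a hard switch: binary reward before
    `switch_step`, partial credit afterwards (an assumed schedule).
    """
    functional = binary_reward(o) if step < switch_step else partial_credit_reward(o)
    return functional - penalty * o.unsafe_findings
```

Under these assumptions, a program that parses and runs but prints nothing earns no reward early in training and a small positive reward after the switch, which is the kind of near-miss signal the abstract describes.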
Article: pldiws26lctesmain-p11-p
SymFlow: Event-Chain-Aware Symbolic Execution for Serverless Sensitive Data Flow Detection
Yuanpeng Wang,
Zhineng Zhong,
Zhenkai Liang,
Ding Li,
Yao Guo, and
Xiangqun Chen
(Peking University, China; National University of Singapore, Singapore)
Serverless applications are widely adopted for their scalability, cost-efficiency, and elastic resource management. However, their event-driven nature introduces complex event chains whose trigger-handler relationships are often determined dynamically by conditional logic, asynchronous callbacks, and resource-state dependencies. Existing security analysis tools, such as CloudFlow, mainly rely on static analysis, making it difficult to capture these dynamic event-chain interactions and the semantics of coarse-grained cloud APIs. As a result, they often fail to bridge the gap between architectural reachability and semantic feasibility, leading to both false positives and false negatives.
To address this limitation, we propose SymFlow, an event-chain-aware symbolic execution framework for sensitive data flow detection in serverless applications. SymFlow combines static architectural analysis with symbolic reasoning to identify feasible event chains and validate their concrete code semantics across service boundaries. By constraining exploration with architectural event dependencies while semantically analyzing inter-function and inter-service behaviors along each event chain, SymFlow can more precisely recover real sensitive data flows and substantially reduce spurious results from purely static reasoning. Evaluated on CloudBench and 104 real-world AWSomePy applications, SymFlow reports 36.6% more sensitive data flows than CloudFlow, improves detection precision by 14.4% and increases event-chain coverage by 73.6%. It also discovered two previously unknown zero-day vulnerabilities in real-world applications.
Article: pldiws26lctesmain-p21-p
DeduBB: Binary Code Size Reduction via Post-Link Basic Block Deduplication
Chaitanya Mamatha Ananda,
Mahbod Afarin,
Rajiv Gupta,
Sriraman Tallam,
Han Shen, and
Xinliang David Li
(University of California at Riverside, USA; Google, USA)
Binary sizes of upgraded versions of software applications tend to be larger, primarily due to feature bloat. This poses various challenges, particularly for mobile applications. It lowers upgrade rates, which directly impacts revenues, increases the maintenance costs of supporting multiple versions, and prevents some users from getting critical security fixes. Code bloat also poses a problem for large warehouse-scale applications. Such applications experience performance degradation when their code size exceeds what smaller and more efficient code models can handle.
In this paper, we introduce a post-link optimization technique called DeduBB, which deduplicates basic blocks of an application across procedure boundaries. As the prior techniques used function outlining to deduplicate identical code sequences, they missed out on many opportunities such as duplicate code patterns that manipulate the program stack. In addition, previous techniques were either limited to the scope of a module or lacked scalable implementations required to handle large warehouse-scale applications. Our technique, DeduBB, exploits inter-module opportunities and de-duplicates more code patterns than prior techniques as it uses a novel save-and-jump code sequence to execute deduplicated code blocks. In addition, DeduBB has been designed to work on scalable post-link optimizers and can even be applied to large warehouse-scale data center applications. Finally, DeduBB is profile-guided and can be applied selectively to infrequently executed cold basic blocks to not affect application performance. In fact, in several cases, the performance of the smaller application binary improves slightly due to reductions in its hot working set size. We have designed our technique for the state-of-the-art post-link optimizers, BOLT and Propeller. Experiments show that we can significantly reduce the code size of several benchmarks by 1.55% to 18.63%, on both Arm and x86 platforms, even on binaries that have already been heavily optimized for size using existing code size reduction features. For warehouse-scale binaries, DeduBB reduces code size by up to 25.8%. Finally, aided by profiles, our technique can retain over 82% of the maximal code size savings without affecting performance.
Artifacts Available
Article: pldiws26lctesmain-p31-p
Towards Verifiable System Code using a DSL Compiled to Efficient and Readable C Code
Clément Chavanon,
Henrik Karlsson,
Frédéric Besson,
Sandrine Blazy, and
Roberto Guanciale
(Inria - Univ Rennes - CNRS - IRISA, France; KTH Royal Institute of Technology, Sweden; Univ Rennes - Inria - CNRS - IRISA, France)
Critical embedded systems deserve the highest level of assurance to guarantee that their implementation satisfies their specification. Verification techniques such as proof by deduction operate at source level, but the verification effort often requires designing higher-level abstractions that facilitate the reasoning. However, this approach comes at the cost of assuming the correctness of the abstraction with respect to the source code.
To circumvent this problem, we have defined a verification-aware DSL designed to program and prove critical embedded software. Our DSL, named Barocq, is minimal and purely functional. It makes programs amenable to verification inside the Rocq proof assistant. Barocq comes with a compiler to C that generates efficient and human-readable programs. Barocq bridges the gap between the program that is reasoned about and the program that is compiled. A key feature of Barocq is that its semantics and data representation are zero-cost abstractions at runtime. This is achieved by enforcing, using static analysis, a strict alias-control programming discipline, which makes it possible to compile functional updates as imperative in-place mutations and to minimise pointer dereferences using unboxed data structures.
Finally, we demonstrate that Barocq has adequate features to write an interesting class of critical embedded code, including most components of the S3K microkernel, which can be used as a drop-in replacement for the original C code.
Artifacts Available
Article: pldiws26lctesmain-p40-p
Hikami: A Lightweight Hypervisor for Emulating RISC-V Extension Semantics with Sail-Driven Auto-generation
Norimasa Takana and
Yoshihiro Oyama
(University of Tsukuba, Japan)
The rapid expansion of RISC-V extension specifications frequently outpaces hardware availability. This lag severely bottlenecks the development and validation of software stacks, which ultimately hinders the feedback loop necessary for actual hardware implementations. To address this challenge, we propose a lightweight Type-1 hypervisor that leverages the RISC-V Hypervisor extension to emulate the semantics of unimplemented RISC-V extensions. By trapping and emulating only the target instructions and control and status register (CSR) accesses, our system allows guest software to run natively for all supported instructions. Furthermore, to guarantee the correctness of the emulation toolchain and minimize manual effort, we introduce a novel auto-generation framework. We automatically derive the instruction decoders and hypervisor module templates directly from Sail, the formal semantics specification of the RISC-V ISA, thereby eliminating manual implementation errors. We evaluated our approach on a real RISC-V hardware platform (Milk-V Megrez). Experimental results demonstrate that our system outperforms a widely-used full-system emulator (QEMU) in realistic workloads, achieving faster execution times when emulated instructions account for 0.1% or less of the total execution. Additionally, our hypervisor maintains 99.8% of native performance for non-target workloads and restricts interrupt latency to under 1 microsecond. This formally-supported virtualization approach provides a practical, high-performance foundation for validating software against emerging RISC-V extensions prior to silicon availability.
Artifacts Available
Article: pldiws26lctesmain-p45-p
RAPO: Retrieval-Augmented Phase Ordering
Jinwook Yang,
Junghyun Lee,
Yeonsun Hong, and
Hyojin Sung
(Seoul National University, Republic of Korea)
The phase-ordering problem (finding an optimal pass sequence) remains NP-hard. While recent RL approaches have shown improved results, they impose heavy per-input search costs. We introduce RAPO, a retrieval-augmented phase-ordering framework that replaces online RL exploration with similarity-based retrieval. Offline, RAPO embeds LLVM IR with IR-BERT, clusters programs via k-means, and stores each cluster’s representative RL-discovered pass sequences in a sequence cache. During compilation, a new program is embedded, mapped to its nearest cluster, and optimized by retrieving the cached sequence, thereby transforming “per-program search” into “similarity-based retrieval.” RAPO is model-agnostic (compatible with PPO, DQN, ε-greedy, etc.) and includes lightweight fallbacks for corner cases. RAPO achieves up to ~18.6% IR instruction count reduction over -Oz, matching or outperforming RL baselines, while reducing phase-ordering search overhead by up to ~177x. These results suggest that RAPO delivers near-per-input quality with deployment-grade efficiency by transforming online phase ordering into fast, similarity-driven retrieval.
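The compile-time retrieval step in this abstract (embed, map to the nearest cluster, fetch the cached sequence) can be sketched as a nearest-centroid lookup. This is only an illustration under stated assumptions: the function names, the Euclidean distance, the `max_distance` threshold, and the fallback to a default sequence are hypothetical, and the embedding itself (IR-BERT in the abstract) is taken as a given input vector.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return (cluster_id, distance) of the centroid closest to the embedding."""
    best_id, best_dist = None, math.inf
    for cluster_id, centroid in centroids.items():
        dist = math.dist(embedding, centroid)  # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id, best_dist


def retrieve_pass_sequence(embedding, centroids, sequence_cache,
                           fallback, max_distance):
    """Look up the cached pass sequence for the nearest cluster.

    Falls back to a default sequence when the program is far from every
    cluster or the cluster has no cached entry (an assumed fallback policy).
    """
    cluster_id, dist = nearest_cluster(embedding, centroids)
    if cluster_id is None or dist > max_distance:
        return fallback
    return sequence_cache.get(cluster_id, fallback)


# Illustrative data only: two-dimensional embeddings and made-up cache contents.
centroids = {0: [0.0, 0.0], 1: [1.0, 1.0]}
sequence_cache = {
    0: ["mem2reg", "instcombine"],
    1: ["simplifycfg", "gvn"],
}
default_sequence = ["default-size-pipeline"]  # placeholder name, not a real pass

chosen = retrieve_pass_sequence([0.9, 1.1], centroids, sequence_cache,
                                default_sequence, max_distance=0.5)
```

The lookup costs one distance computation per cluster, which is where the reduction in per-program search overhead would come from in a design of this shape.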
Article: pldiws26lctesmain-p50-p
On the Origins of Indirect Jumps in Embedded Software
Ariane Nicolas,
Ronan Lashermes,
Isabelle Puaut, and
Erven Rohou
(Univ Rennes - Inria - CNRS - IRISA, France; Rambus, Netherlands)
Indirect control-flow transfers complicate control-flow graph (CFG) construction, thereby reducing the precision of static analyses and control-flow integrity mechanisms in embedded systems.
While previous work has primarily focused on resolving indirect jump targets, comparatively little attention has been devoted to understanding the reasons behind their generation.
This paper presents a systematic empirical study of the origins of indirect jumps in compiled binaries.
We introduce a taxonomy that characterizes the programming constructs and compiler transformations responsible for their generation.
Our analysis encompasses C, C++, Fortran, and Rust programs compiled with GCC and LLVM at multiple optimization levels, targeting the 32-bit RISC-V instruction set.
We then quantify the prevalence of each identified category over representative benchmarks and analyze differences across programming languages and compilation configurations.
By clarifying the origins of indirect control transfers, this work provides insight into their impact on CFG precision and the static analysis of embedded software.
Artifacts Available
Article: pldiws26lctesmain-p55-p
MemSpec: Memory-Aware Runtime for Adaptive Draft Scheduling in Speculative Decoding on Edge Devices
Eunjeong Kim,
Yeong Jun Jeon, and
Myeonggyun Han
(Kyungpook National University, Republic of Korea)
Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to speculate multiple tokens, reducing expensive target model decoding steps. Its effectiveness depends heavily on draft selection, motivating adaptive methods that exploit variation across inputs and generation stages. On memory-constrained edge devices, however, these methods often fail to improve end-to-end throughput due to the overhead of switching between draft models. We identify a key limitation in this setting: the mismatch between draft selection and draft availability under tight memory budgets.
To address this challenge, we present MemSpec, a prediction-guided, memory-aware runtime for adaptive speculative decoding on edge devices. MemSpec decouples draft selection from execution through proactive resident working-set management. A lightweight predictor estimates draft effectiveness from prompt and generation context, while a memory-aware scheduler reduces reactive model loading overhead. Experiments on a Jetson Orin Nano show that MemSpec improves steady-state generation throughput by 40.7% on average over state-of-the-art bandit-based adaptive methods while closely approaching the oracle upper bound.
Article: pldiws26lctesmain-p56-p
Sirop: A Small IR for HLS with Parallel Patterns
Louis Hildebrand and
Christophe Dubach
(McGill University, Canada)
Designers of custom streaming accelerators traditionally use HDLs (Hardware Description Languages), but this is time-consuming and requires advanced hardware expertise. C-based HLS (High-Level Synthesis) offers a higher level of abstraction and faster design time, but still requires some hardware expertise and performance is often left on the table. A promising direction is to use HLS with high-level functional parallel patterns such as map and reduce. Prior works have shown that high performance is achievable this way. However, designing such compiler systems is challenging because the optimizer must handle a large number of language primitives and interactions between them.
This paper introduces a minimal functional IR (Intermediate Representation) for hardware design, Sirop, which can express both pipelining and spatial parallelism with just five primitives. High-level operators from prior works are represented as syntactic sugar and lowered to the core language. This simplifies hardware generation and optimization.
Sirop is compared with existing compilers on a set of image processing and linear algebra benchmarks. The Sirop designs use 61% fewer ALMs (Adaptive Logic Modules) than Aetherling, 68% fewer ALMs than Shir, and 76% fewer ALMs than the Intel HLS compiler, all for the same throughput.
Artifacts Available
Article: pldiws26lctesmain-p61-p
Bridging the Memory Hotness Gap in Edge Systems with Hotness-Segregated Object Allocation
Ruizhe Huang,
Jiahua Wang,
Qihang Xu,
Peng Jiang,
Zhida An,
Ding Li,
Yao Guo,
Xiangqun Chen,
Yuxin Ren, and
Ning Jia
(Peking University, China; Southeast University, China; Huawei Technologies, China)
Kernel operations in resource-constrained edge systems, such as memory swapping and deduplication, use the access frequency (hotness) of memory pages to guide page placement and reclamation. However, these operations suffer from page-hotness skew: a page may contain a mix of highly accessed and infrequently accessed objects, which causes inaccurate page-level classification, wasted DRAM capacity, and expensive I/O. We attribute this skewness to a cross-layer mismatch: the kernel manages memory at page granularity, whereas user-level allocators place objects without considering access hotness.
To bridge this gap, we present HotMalloc, a memory allocator that reduces this skewness through object-granularity hotness-segregated allocation. HotMalloc uses profile-guided optimization to analyze object access patterns offline and synthesizes an application-specific allocator. At runtime, HotMalloc identifies object hotness from offset-encoded call-site contexts and co-locates objects with similar hotness on the same pages without adding per-access overhead. Additionally, HotMalloc exposes simple interfaces to inform the kernel of page hotness. Evaluation on memory swapping and deduplication shows that HotMalloc significantly reduces skewness and improves hotness-aware kernel operations by 4.6% to 42.1%.
Article: pldiws26lctesmain-p64-p
FLUX: Frequency Scaling with Layer-wise Utilization for Energy-Efficient NPU Execution (WIP)
Inho Lee,
Ky Yeop Lim,
Hyejun Kim,
Beomseok Kim,
Dongsuk Jeon,
Hunjun Lee, and
Yongjun Park
(Hanyang University, Republic of Korea; Samsung Electronics, Republic of Korea; Yonsei University, Republic of Korea; Seoul National University, Republic of Korea)
With the widespread adoption of Deep Neural Networks (DNNs), Neural Processing Units (NPUs) are emerging as energy-efficient alternatives to GPUs through parallel processing and high data reuse.
However, since diverse deep learning kernels have different memory and computation resource requirements, a utilization imbalance between memory and computation resources often occurs.
To address this challenge, we propose FLUX, a frequency-scalable NPU system that applies Dynamic Frequency Scaling (DFS) to the core's frequency based on layer-wise arithmetic intensity.
FLUX splits the clocking into an adjustable core domain and a fixed system domain.
The core-domain frequency is first estimated via roofline cycle analysis and then refined at runtime by an Energy-Delay Product (EDP)-driven calibration.
Evaluation on a Gemmini NPU using a 28nm process technology shows that FLUX achieves 5.7%, 16.5%, and 27.9% EDP improvements on 8×8, 16×16, and 32×32 systolic arrays for ResNet50, respectively, reaching within 1.7% of the oracle optimal frequency assignment.
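The first step of this abstract, estimating a core-domain frequency from a layer's arithmetic intensity, can be illustrated with a simple roofline-style calculation. This sketch is one possible reading, not the paper's method: it assumes memory time is fixed by the system domain, assumes compute time scales inversely with core frequency, and picks the lowest candidate frequency at which compute no longer limits the layer. The function name and all numbers are hypothetical, and the runtime EDP-driven calibration is not modeled.

```python
def estimate_core_frequency(ops, bytes_moved, ops_per_cycle,
                            mem_bandwidth, candidate_freqs):
    """Pick the lowest core frequency (Hz) that keeps a layer memory-bound.

    ops            -- arithmetic operations in the layer
    bytes_moved    -- bytes transferred to and from memory
    ops_per_cycle  -- peak operations per core cycle
    mem_bandwidth  -- bytes per second, fixed by the system clock domain
    """
    memory_time = bytes_moved / mem_bandwidth
    for freq in sorted(candidate_freqs):
        compute_time = ops / (ops_per_cycle * freq)
        if compute_time <= memory_time:
            # Running the core faster would not shorten the layer.
            return freq
    # Compute-bound at every candidate: use the fastest clock.
    return max(candidate_freqs)


# Illustrative numbers only.
freqs = [250e6, 500e6, 1e9]
low_intensity = estimate_core_frequency(
    ops=1e6, bytes_moved=1e6, ops_per_cycle=256,
    mem_bandwidth=8e9, candidate_freqs=freqs)
high_intensity = estimate_core_frequency(
    ops=1e9, bytes_moved=1e6, ops_per_cycle=256,
    mem_bandwidth=8e9, candidate_freqs=freqs)
```

With these inputs the low-intensity layer is assigned the slowest clock and the high-intensity layer the fastest, which matches the intuition that memory-bound layers leave the compute array underutilized.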
Article: pldiws26lctesmain-p69-p
CVS: A Metric for Security-Aware Compilation against Side-Channel Attacks in Edge SoCs (WIP)
Yi Han,
Puhong Lei,
Yang Shi,
Zhe Li,
Xing Mou,
Jianjun Chen, and
Yaohua Wang
(National University of Defense Technology, Changsha, China; Key Laboratory of Advanced Microprocessor Chips and Systems, Changsha, China; Hunan Greatwall Galaxy Science and Technology, Changsha, China)
Deep learning compilers (DLCs) have become the standard approach for optimizing edge inference performance, employing techniques such as operator fusion, loop tiling, and scheduling to meet stringent resource constraints. Yet, the security implications of these optimizations remain largely unexplored. In this work, we investigate shared-memory side-channel attacks on edge SoCs and analyze how compiler optimizations reshape the leakage surface. Our study reveals that identical operators can exhibit distinct shared-resource access patterns under different compilation strategies, resulting in divergent attack outcomes. To address this, we introduce the Confusion Variance Score (CVS), a metric that quantifies compilation-induced security by measuring confusion in time-series resource traces (e.g., DRAM bandwidth). CVS integrates multidimensional dynamic time warping with statistical morphological features to ensure temporal robustness, and shows a strong negative correlation (Spearman r ≈ −0.9394) with practical attack error rates. Finally, we demonstrate the feasibility of CVS-guided compilation in TVM and TensorRT, achieving a 24% increase in attack error rate compared to default strategies, while limiting inference latency overhead to under 5%.
Article: pldiws26lctesmain-p71-p
Can Fine-Grain Multi-threading Subsume VLIW?
Scott Pomerville,
Soner Önder,
Gang-Ryung Uh, and
David Whalley
(Northern Michigan University, USA; Michigan Technological University, USA; Florida State University, USA)
We explore the question: "Can a fine-grain multi-threaded architecture form the basis for an efficient, VLIW style, statically scheduled architecture?" We illustrate that operations comprising a VLIW instruction can indeed be viewed as belonging to separate threads, such that the number of such operations is equivalent to the number of threads representing the program's semantics. On the other hand, a more efficient synchronization mechanism than data synchronization is needed to realize the lock-step execution model of VLIW processors. This synchronization is accomplished through the instruction space, by using a small number of bits in each instruction under the compiler control. We call the resulting architecture a "Synchronized Lane Architecture (SLA)". The SLA approach makes embedding of no-operations in the code unnecessary in the majority of the cases, and the architecture can dynamically adapt to changing levels of ILP, while permitting code compiled for a narrow-width processor to run unmodified on a larger width processor. We present this novel architecture paradigm, as well as the mechanism of transforming traditional VLIW code so that it can be executed by our SLA processor. We provide an evaluation of the new paradigm with respect to a conventional VLIW architecture, and demonstrate that the SLA approach delivers similar levels of performance to that of VLIW processors while providing significant energy and code size savings.
Article: pldiws26lctesmain-p73-p
Empirical Observations about Profile-Guided Optimizations for Mainstream C/C++ Compilers
Soma Pal and
Prasad Anil Kulkarni
(University of Kansas, USA)
The idea behind profile-guided optimizations (PGO) is to monitor different aspects of a program’s run-time behavior, and then employ this information to guide decisions during individual compiler optimizations to improve program performance. PGO is a mature technology that is available in most mainstream compilers and is widely regarded to benefit performance. In this work, we conduct the first comprehensive empirical study of the behavior and properties of PGOs in two state-of-the-art mainstream C/C++ compilers, GCC and Clang, evaluated using MiBench embedded system benchmarks on an x86-64 platform. Our study reveals many interesting observations about PGOs in mainstream C/C++ compilers, some expected and some counter-intuitive. We believe our intellectually intriguing observations will help compiler designers and software developers further develop and usefully deploy this technology.
Article: pldiws26lctesmain-p88-p
CausalTuner: Feature-Aware Causal Guidance for Compiler Auto-tuning
Jiaqing Zhong,
Juan Chen,
Yichang Zhou, and
Kuan Li
(National University of Defense Technology, China; Dongguan University of Technology, China)
Modern compilers like LLVM and GCC provide hundreds of optimization options (e.g., flags or passes), yet their fixed, predefined sequences (e.g., -O3) often fail to exploit the full performance potential of specific programs. Search-based auto-tuning has emerged to address this pass selection and ordering problem — known as the phase ordering problem. Existing approaches typically reduce search complexity by identifying critical flags or employing localized group-based mutations. However, these methods often remain feature-agnostic or rely on manually designed static mappings during the online search. Consequently, they fail to adaptively link program features to optimization logic. Furthermore, they are prone to being misled by spurious correlations and noise inherent in sparse performance data.
To address this limitation, we propose CausalTuner, a feature-aware framework that injects causal rules into a Two-Phase Rule-Injected Search Engine to guide the optimization search. By identifying performance-critical pass subsequences and mining the causal dependencies between program features and optimization effectiveness, CausalTuner effectively transforms the search process from heuristic-driven to causal-guided. We evaluate CausalTuner on a diverse range of benchmarks, including cBench, Polybench, SPEC CPU 2017, and llama.cpp. CausalTuner consistently outperforms existing autotuning methods, discovering superior optimization configurations more efficiently with fewer search iterations.
Article: pldiws26lctesmain-p93-p
A Pointer-Ownership Model for C Inspired by Rust
David Svoboda,
William Klieber,
Lori Flynn,
Ruben Martins, and
Jeffrey Hoskinson
(SEI at Carnegie Mellon University, USA; Carnegie Mellon University, USA)
Memory-safety bugs are a major source of vulnerabilities in C code. Much work has focused on spatial memory safety (e.g., buffer overflows), while temporal memory safety (e.g., use-after-free) has received less attention. One solution for achieving temporal memory safety is to apply an ownership model to an existing program and enforce it. In this paper, we describe the design and implementation of a new temporal memory safety model for C source code. Our design improves on CERT's Pointer Ownership Model with enhancements including use of a SAT solver to enforce constraint satisfaction, LLMs to complete a per-program model, and an improved mechanism to prevent use-after-free errors inspired by Rust's borrow checker and object lifetimes. Our implementation performed well on a large test suite of memory-safe and memory-unsafe code examples. We tested all 4,604 C code examples for the 5 CWEs associated with temporal memory safety (CWEs 401, 415, 416, 590, 761) from the Juliet C/C++ test suite. In our tests, all of the memory-unsafe examples were correctly recognized as unsafe, and 81% of the 2,302 memory-safe examples were correctly recognized as memory-safe.
Artifacts Available
Article: pldiws26lctesmain-p95-p
A Functional Approach to Synthesizing Routable Programmable Accelerators for Neural Networks
Tzung-Han Juang,
Paul Teng, and
Christophe Dubach
(McGill University, Canada; MILA, Canada)
Producing optimized accelerators is tedious, as even modern HDLs (Hardware Description Languages), such as Chisel, require reasoning about low-level concepts. Recent functional approaches, such as Aetherling and SHIR, treat hardware as a composition of pure operators. This raises the abstraction level, allowing for systematic optimizations through rewrite rules for FPGAs (Field Programmable Gate Arrays).
These approaches have so far been limited to small, fixed-function accelerators. Recent work maps neural networks to FPGAs by sharing coarse-grained functions via the Let construct. However, as the number of call sites or parallelism increases, synthesis fails due to increased routing congestion.
These limitations are addressed with a new way to express sharing in a functional IR (Intermediate Representation). By combining the Reduce and SwitchApply primitives over an instruction stream, functions become programmable, with shared control logic and a datapath, reducing routing pressure. Upper-bounded streams further enable sharing across varying input sizes. Across networks from LeNet-5 to ResNet, the resulting FPGA designs remain routable, delivering high performance with speedups of 1.1× to 3.4× compared to prior work.
Artifacts Available
Article: pldiws26lctesmain-p99-p
A Programming Model for Efficient Inter-Kernel Control-Flow on Memory-Mapped Near-Data Processing Architecture (WIP)
Seungheon Lee,
Wonhyuk Yang,
Seonyeong Heo, and
Gwangsun Kim
(POSTECH, Republic of Korea; Kyung Hee University, Republic of Korea)
As the memory wall problem worsens, Near-Data Processing (NDP) has emerged to reduce data movement by computing close to data. Recently proposed memory-mapped NDP (M2NDP) enables general-purpose NDP with low hardware overhead by extending the RISC-V ISA and maximizing data parallelism with lightweight µthreads. However, a high-level programming model for this architecture, particularly one that naturally expresses control flow across kernels, has not yet been established.
In this work, we propose a novel programming model for M2NDP, called Arachne, which explicitly expresses the mapping between µthreads and the memory regions that hold the data they process. To avoid the overhead of host-side control flow decisions among kernels executed on the device, Arachne supports flexible and efficient device-side control flow. Furthermore, it improves programmability for expressing device-side control flow compared to existing approaches such as CUDA graphs. Our preliminary results show that the proposed programming model reduces control-flow overhead and improves hardware resource utilization compared to conventional programming models.
Article: pldiws26lctesmain-p112-p