Proceedings of the ACM on Programming Languages, Volume 10, Number OOPSLA1
Article: oopslaa26foreword-fm003-p
Learning Symmetric Invariants from Symmetric Samples
Zhijie Xu and
Fei He
(Tsinghua University, China; Key Laboratory for Information System Security, China)
Invariant synthesis is a fundamental problem in program verification, yet existing learning-based approaches rarely exploit the inherent symmetry present in many programs, particularly parameterized and concurrent systems.
Such symmetry induces a symmetric reachable state space, naturally yielding symmetric samples and admitting symmetric invariants, motivating the task of learning symmetric invariants from symmetric samples.
To this end, we introduce symmetric decision trees (SDTs), a novel hypothesis class that enforces symmetry structurally, guaranteeing symmetric invariants by construction.
Furthermore, we develop a learning algorithm to construct SDTs and integrate it as the learner within the Horn-ICE framework, yielding our approach, Horn-SDT.
Empirical evaluation on parameterized programs demonstrates that Horn-SDT achieves faster convergence and constructs more compact trees compared to non-symmetric baselines.
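The symmetry requirement can be illustrated with a small sketch (ours, not the paper's SDT learner): a candidate invariant over the local states of n interchangeable processes is symmetric exactly when it gives the same verdict under every permutation of process indices.

```python
from itertools import permutations

def is_symmetric(inv, states, n):
    """Check that invariant `inv` gives the same verdict on every
    index-permuted view of each sampled state (a tuple of n local states)."""
    for s in states:
        verdict = inv(s)
        for perm in permutations(range(n)):
            if inv(tuple(s[i] for i in perm)) != verdict:
                return False
    return True

# Symmetric candidate: at most one process is in the critical section.
mutex = lambda s: s.count("crit") <= 1
# Asymmetric candidate: process 0 specifically never enters it.
asym = lambda s: s[0] != "crit"

samples = [("idle", "crit", "idle"), ("crit", "idle", "idle"), ("idle",) * 3]
```

An SDT enforces this property by construction rather than checking it after the fact, as the brute-force permutation test above does.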
Article: oopslaa26main-p9-p
Decompiling for Constant-Time Analysis
Santiago Arranz-Olmos,
Gilles Barthe,
Lionel Blatter,
Youcef Bouzid,
Sören van der Wall, and
Zhiyuan Zhang
(MPI-SP, Germany; IMDEA Software Institute, Spain; ENS Paris-Saclay, France; TU Braunschweig, Germany)
The constant-time programming discipline is commonly used to protect cryptographic libraries against side-channel attacks. However, it is hard to write constant-time code; moreover, compilers can introduce constant-time violations. Therefore, it is important to ensure that assembly code is constant-time. One approach is to show that source programs are constant-time, and that constant-timeness is preserved by compilation. In this paper, we explore the methodological soundness and scalability of the Decompile-then-Analyze approach, a less conventional alternative that has been suggested in the broader setting of static analysis. Informally, the Decompile-then-Analyze approach uses decompilers as a front-end for static analysis tools. As a motivation for our study, we show that current decompilers eliminate CT vulnerabilities before CT analysis, leading to non-CT programs being accepted as constant-time. Independently, we provide constructed examples of non-CT, exploitable programs that are accepted by two popular CT analysis tools; in both cases the culprits are program transformations that are used internally prior to CT analysis and eliminate CT violations. While our examples do not invalidate the general approach of these tools, they emphasize the need for studying the Decompile-then-Analyze approach.
On the methodological side, we define the notion of CT transparency. Informally, a program transformation is CT transparent if it neither eliminates nor introduces CT violations. We also provide general methods for proving that a transformation is CT transparent, and show that several transformations of interest are transparent. We also sketch an extension of CT transparency to speculative constant-time, which is used by cryptographic software as a protection against Spectre attacks.
On the practical side, we build a CT-transparent version of the popular LLVM-based decompiler RetDec, and combine it with CT-LLVM, an existing CT verification tool for LLVM. We evaluate the resulting tool, called CT-RetDec, on a benchmark set of real-world vulnerabilities in binaries, and show that the modifications have a significant impact on how well CT-RetDec performs.
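As a toy illustration of the property at stake (our sketch, not the paper's tooling; Python gives no timing guarantees, so this only shows the code pattern): a secret-dependent branch is a CT violation, while the branch-free masked select is the shape a compiler or decompiler may rewrite it into, silently changing a program's CT status.

```python
def select_leaky(secret_bit, a, b):
    # Constant-time violation: control flow depends on the secret.
    if secret_bit:
        return a
    return b

def select_ct(secret_bit, a, b):
    # Branch-free 32-bit select: the secret bit becomes an all-ones or
    # all-zeros mask, so the same instructions run for either value.
    mask = -secret_bit & 0xFFFFFFFF
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)
```

A transformation turning `select_leaky` into `select_ct` (or back) is exactly the kind of rewrite that CT transparency forbids.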
Artifacts Available
Article: oopslaa26main-p13-p
Taming the Hydra: Targeted Control-Flow Transformations for Dynamic Symbolic Execution
Charitha Saumya,
Muhammad Hassan,
Rohan Gangaraju,
Milind Kulkarni, and
Kirshanthan Sundararajah
(Intel, USA; Virginia Tech, USA; University of Texas at Austin, USA; Purdue University, USA)
Dynamic Symbolic Execution (DSE) suffers from the path explosion problem when the target program has many conditional branches. The classical approach for managing the path explosion problem is dynamic state merging. Dynamic state merging combines similar symbolic program states to avoid the exponential growth in the number of states during DSE. However, state merging still requires solver invocations at each program branch, even when both paths of the branch are feasible. Moreover, the best path search strategy for DSE may not create the best state merging opportunities. Some drawbacks of state merging can be mitigated by compile-time state merging (i.e., branch elimination by converting control flow into data flow). In this paper, we propose a non-semantics-preserving but failure-preserving compiler transformation that removes expensive symbolic branches in a program to improve the scalability of DSE. We have developed a framework for detecting spurious bugs that our transformation can introduce. Finally, we show that our transformation can significantly improve the performance of DSE on various benchmark programs and improve coverage and bug discovery on large real-world programs.
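The compile-time state-merging idea can be sketched as simple if-conversion (a toy of ours, not the paper's transformation): the branch becomes an arithmetic select, so DSE no longer forks a state there.

```python
def branchy(x):
    # DSE forks a symbolic state at this conditional.
    if x > 10:
        return x * 2
    return x + 1

def merged(x):
    # If-converted: the condition becomes data, both arms are computed,
    # and a single merged state flows onward.
    c = int(x > 10)
    return c * (x * 2) + (1 - c) * (x + 1)
```

The rewrite is not semantics-preserving in general (both arms now execute, so an arm that would fault can introduce a spurious failure), which is why the paper pairs it with spurious-bug detection.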
Article: oopslaa26main-p20-p
Floating-Point Usage on GitHub: A Large-Scale Study of Statically Typed Languages
Andrea Gilot,
Tobias Wrigstad, and
Eva Darulova
(Uppsala University, Sweden)
Reasoning about floating-point arithmetic is notoriously hard. While static and dynamic analysis techniques or program repair have made significant progress, more work is still needed to make them relevant to real-world code. On the critical path to that goal is understanding what real-world floating-point code looks like.
To close that knowledge gap, this paper presents the first large-scale empirical study of floating-point arithmetic usage across public GitHub repositories. We focus on statically typed languages to allow our study to scale to millions of repositories. We follow state-of-the-art mining practices, including random sampling and filtering based only on intrinsic properties to avoid bias; we identify floating-point usage by searching for keywords in the source code, and programming-language constructs (e.g., loops) by parsing the code. Our evaluation supports the claim often made in papers that floating-point arithmetic is widely used. Comparing statistics such as size and usage of certain constructs and functions, we find that benchmarks used in the literature to evaluate automated reasoning techniques for floating-point arithmetic are representative of 'real-world' code in certain aspects, but not in all.
We publish a dataset of 10 million real-world floating-point functions extracted from our study. We demonstrate in a case study how it may be used to identify new floating-point benchmarks and help future techniques for floating-point arithmetic to be designed and evaluated to match actual users' expectations.
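A minimal sketch of the kind of keyword-based detection described (hypothetical patterns of ours, not the study's actual pipeline, which also parses the code):

```python
import re

FP_DECL = re.compile(r'\b(?:float|double)\b')  # floating-point usage by keyword
LOOP = re.compile(r'\b(?:for|while)\b')        # loop constructs (the study
                                               # parses; a regex approximates)

def classify(source: str) -> dict:
    """Flag floating-point usage and loop constructs in a source snippet."""
    return {"uses_fp": bool(FP_DECL.search(source)),
            "has_loop": bool(LOOP.search(source))}
```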
Preprint
Artifacts Available
Article: oopslaa26main-p22-p
OBsmith: LLM-Powered JavaScript Obfuscator Testing
Shan Jiang,
Chenguang Zhu, and
Sarfraz Khurshid
(University of Texas at Austin, USA)
JavaScript obfuscators are widely deployed to protect intellectual property and resist reverse engineering, yet their correctness has been largely overlooked compared to performance and resilience. Existing evaluations typically measure resistance to deobfuscation, leaving the critical question of whether obfuscators preserve program semantics unanswered. Incorrect transformations can silently alter functionality, compromise reliability, and erode security—undermining the very purpose of obfuscation. To address this gap, we present OBsmith, a novel framework to systematically test JavaScript obfuscators using large language models (LLMs). OBsmith leverages LLMs to generate program sketches—abstract templates capturing diverse language constructs, idioms, and corner cases—which are instantiated into executable programs and subjected to obfuscation under different configurations. Besides LLM-powered sketching, OBsmith also employs a second source: automatic extraction of skeletons from real programs. This extraction path enables more focused testing of project-specific features and lets developers inject domain knowledge into the resulting test cases. OBsmith uses two techniques to derive test oracles: (i) reference-oriented equivalence testing, which takes the original program as reference oracle (ground truth) and checks whether the obfuscated version preserves equivalent functionality, and (ii) metamorphic testing, which applies semantics-preserving transformations to the original program and checks if obfuscation violates expected behavior.
We evaluate OBsmith on two widely used obfuscators, Obfuscator.IO and JS-Confuser, generating 600 sketches using six popular LLMs. OBsmith fills these sketches and generates over 3,000 candidate programs and obfuscates them across seven obfuscation configurations. OBsmith uncovers 11 previously unknown correctness bugs. Under an equal program budget, five general-purpose state-of-the-art JavaScript fuzzers (FuzzJIT, Jsfunfuzz, Superion, DIE, Fuzzilli) failed to detect these issues, highlighting OBsmith’s complementary focus on obfuscation-induced misbehavior. An ablation shows that all components except our generic MRs contribute to at least one bug class; the negative MR result suggests the need for obfuscator-specific metamorphic relations. Our results also seed a discussion on how to balance obfuscation presets and performance cost. We envision OBsmith as an important step towards automated testing and quality assurance of obfuscators and other semantics-preserving toolchains.
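The reference-oriented oracle reduces to a differential check (a simplified sketch; the function names and variants are ours): run the original and the obfuscated program on the same inputs and treat any divergence as a correctness bug.

```python
def equivalent_on(f_orig, f_obf, inputs):
    """Reference-oriented oracle: the original program is ground truth;
    return the first input on which the obfuscated version diverges."""
    for x in inputs:
        if f_orig(x) != f_obf(x):
            return False, x
    return True, None

# Stand-ins for an original program and two "obfuscated" variants.
orig = abs
ok_variant = lambda x: (x ^ (x >> 63)) - (x >> 63)  # branchless abs rewrite
bad_variant = lambda x: x                            # drops the negation
```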
Article: oopslaa26main-p25-p
Peeling Off the Cocoon: Unveiling Suppressed Golden Seeds for Mutational Greybox Fuzzing
Ruixiang Qian,
Chunrong Fang,
Zengxu Chen,
Youxin Fu, and
Zhenyu Chen
(Nanjing University, China)
Mutational greybox fuzzing (MGF) is a powerful software testing technique. Initial seeds are critical for MGF since they define the space of possible inputs and fundamentally shape the effectiveness of MGF. Nevertheless, having more initial seeds is not always better. A bloated initial seed set can inhibit throughput, thereby degrading the effectiveness of MGF. To avoid bloating, modern fuzzing practices recommend performing seed selection to maintain golden seeds (i.e., those identified as beneficial for MGF) while minimizing the size of the set. Typically, seed selection favors seeds that execute unique code regions and discards those that contribute stale coverage. This coverage-based strategy is straightforward and useful, and is widely adopted by the fuzzing community. However, coverage-based seed selection (CSS) is not flawless and has a notable blind spot: it fails to identify golden seeds suppressed by unpassed coverage guards, even if these seeds contain valuable payload that can benefit MGF. This blind spot prevents suppressed golden seeds from realizing their true values, which may ultimately degrade the effectiveness of downstream MGF.
In this paper, we propose a novel technique named PoCo to address the blind spot of traditional CSS. The basic idea behind PoCo is to manifest the true strengths of the suppressed golden seeds by gradually disabling obstacle conditional guards. To this end, we develop a lightweight program transformation to enable flexible disabling of guards and devise a novel guard hierarchy analysis to identify obstacle ones. An iterative seed selection algorithm is constructed to stepwise select suppressed golden seeds. We prototype PoCo on top of the AFL++ utilities (version 4.10c) and compare it with seven baselines, including two state-of-the-art tools afl-cmin and OptiMin. Compared with afl-cmin, PoCo selects 3–40 additional seeds within a practical time budget of two hours. To evaluate how effective the studied techniques are in seeding MGF, we further conduct extensive fuzzing (over 17,280 CPU hours) with eight different targets from a mature benchmark named Magma, adopting the most representative fuzzer AFL++ for MGF. The results show that the additional seeds selected by PoCo yield modest improvements in both code coverage and bug discovery. Although our evaluation reveals some limitations of PoCo, it also demonstrates the presence and value of suppressed golden seeds. Based on the evaluation results, we distill lessons and insights that may inspire the fuzzing community.
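The coverage-based selection that PoCo improves on can be sketched as greedy set cover (afl-cmin-style; toy corpus of ours): a seed whose edges are already covered is dropped even if its payload is valuable, which is exactly the blind spot of CSS.

```python
def select_seeds(coverage):
    """Greedy coverage-based seed selection: repeatedly keep the seed
    that covers the most not-yet-covered edges, and drop the rest."""
    remaining = set().union(*coverage.values())
    kept = []
    while remaining:
        best = max(coverage, key=lambda s: len(coverage[s] & remaining))
        kept.append(best)
        remaining -= coverage[best]
    return kept

# Seed "c" is discarded: its edges are subsumed, though its payload
# might still be golden once an obstacle guard is disabled.
corpus = {"a": {1, 2, 3}, "b": {3, 4}, "c": {2, 3}}
```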
Artifacts Available
Article: oopslaa26main-p37-p
Diatom: Polylithic Binary Lifting with Data-Flow Summaries and Type-Aware IR Linking
Anshunkang Zhou and
Charles Zhang
(Hong Kong University of Science and Technology, China)
Binary lifting, which translates binary code into LLVM intermediate representations (IRs) through iterative IR transformations to recover high-level constructs from low-level machine features, is the cornerstone of many binary analysis systems. The scalability and precision of the upper-layer analysis can therefore be greatly affected by the underlying binary lifting. However, existing binary lifters suffer from severe performance problems: they require substantial time to handle extremely large binaries, which becomes a barrier to achieving the expected performance gains in various analyses and hinders them from meeting the quick-response requirements of modern continuous integration pipelines. We found that the root cause of the scalability issue is the inherent "monolithic" design that performs all lifting stages on a single LLVM module, which entails a global environment enforcing sequential dependences between any two transformations on IRs, thus limiting parallelism.
This paper presents Diatom, a novel parallel binary lifter powered by a new "polylithic" design, which decomposes the monolithic LLVM module into partitions to perform fully parallelized binary lifting. In the meantime, it leverages light-weight data-flow summaries and type-aware IR linking to avoid soundness loss caused by separating dependent code fragments. Large-scale experiments on 16 real-world benchmarks whose sizes range from dozens of megabytes (MBs) to several gigabytes (GBs) show that Diatom achieves an average speedup of 7.45× and a maximum speedup of 16.8× over a traditional monolithic binary lifter, while still maintaining the lifting soundness. Diatom can complete the translation for the Linux Kernel binary within only 10 minutes, which significantly accelerates the overall binary code analysis process.
Preprint
Article: oopslaa26main-p41-p
Handling Exceptions and Effects with Automatic Resource Analysis
Ethan Chu,
Yiyang Guo, and
Jan Hoffmann
(Carnegie Mellon University, USA)
There exist many techniques for automatically deriving parametric resource (or cost) bounds by analyzing the source code of a program. These techniques work effectively for a large class of programs and language features. However, non-local transfer of control as needed for exception or effect handlers has remained a challenge.
This paper presents the first automatic resource bound analysis that supports non-local control transfer between exceptions or effects and their handlers. The analysis is an extension of type-based automatic amortized resource analysis (AARA), which automates the potential method of amortized analysis. It is presented for a simple functional language with lists and linear potential functions. However, the ideas are directly applicable to richer settings and implemented for Standard ML and polynomial potential functions.
Apart from the new type system for exceptions and effects, a main contribution is a novel syntactic type-soundness theorem that establishes the correctness of the derived bounds with respect to a stack-based abstract machine. An experimental evaluation shows that the new analysis is capable of analyzing programs that cannot be analyzed by existing methods and that the efficiency overhead of supporting exception and effect handlers is low.
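For intuition, the textbook linear-potential derivation for list append (a standard AARA example; the paper's system extends this style of reasoning to exceptions and effects) reads:

```latex
% append carries potential 1 per element of its first argument:
\mathit{append} : L^{1}(A) \times L^{0}(A) \to L^{0}(A)
\qquad \Phi(xs,\, ys) = 1 \cdot |xs|
% each recursive step pays its unit cost from one released unit of
% potential, so the derived bound is:
\mathrm{cost}(\mathit{append}\; xs\; ys) \le |xs|
```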
Artifacts Available
Article: oopslaa26main-p42-p
RAT-CAT-SAT: Model Checking Memory Consistency Models
Jan Grünke,
Thomas Haas, and
Roland Meyer
(TU Braunschweig, Germany)
We present a model checking approach for axiomatic memory consistency models written in the language CAT.
It can prove properties of single memory models like the monotonicity of barriers, and also compare models like TSO and ARM.
To achieve this expressiveness, our approach supports the full rational fragment of CAT.
We not only support the Kleene star operation, composition, and union to define relations in a memory model, but also, for the first time, intersection, converse, and the construction of relations from sets.
Our model checking approach for memory consistency models is logical in nature: we formulate the problem as satisfiability in a logical theory of relations.
Our technical contribution is then a new theory solver that is sound, complete, and optimal from a complexity point of view.
At the heart of our solver is a cyclic proof system that can detect invariants on-the-fly.
This allows us to terminate early not only in the case that a model has been found, but also in the case of unsatisfiability.
Our solver is easy to implement and easy to combine with heuristics.
We have implemented a combination with counterexample-guided abstraction refinement, and exercised the prototype on a number of benchmarks, with promising results.
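A typical CAT axiom such as `acyclic (po | rf)` amounts to a cycle check over a relation on events; a minimal sketch (ours, far simpler than the paper's solver for the full rational fragment):

```python
def acyclic(nodes, edges):
    """Decide the CAT-style axiom `acyclic r` by DFS cycle detection,
    where the relation r is given as a set of (a, b) event pairs."""
    succs = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GREY
        for m in succs[n]:
            if color[m] == GREY:
                return False            # back edge: a cycle in r
            if color[m] == WHITE and not dfs(m):
                return False
        color[n] = BLACK
        return True

    return all(color[n] != WHITE or dfs(n) for n in nodes)
```

The hard part the paper addresses is that in model checking the relation is not a concrete graph but a symbolic object built with union, composition, Kleene star, intersection, and converse.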
Artifacts Available
Article: oopslaa26main-p45-p
Specy: Learning Specifications for Distributed Systems from Event Traces
Mike He,
Ankush Desai,
Aishwarya Jagarapu,
Doug Terry,
Sharad Malik, and
Aarti Gupta
(Princeton University, USA; Snowflake, USA; Amazon Web Services, USA; LinkedIn, USA)
Reasoning about the correctness of distributed systems is a significant challenge, with precise correctness specifications serving as an essential prerequisite to verification. However, identifying and formulating specifications remains a major hurdle for developers in practice. Specy addresses this challenge by automatically learning specifications from observable event traces generated by message exchanges in distributed systems. The system employs a specialized grammar tailored for event-based specifications, incorporating support for quantifiers over events – a capability essential for capturing the complex behavioral patterns inherent in distributed protocols. Specy utilizes a novel learning procedure that combines grammar-based enumerative search with dynamic learning from event traces, providing effective control over the specification search. We evaluated Specy on established distributed protocols and industrial case studies, demonstrating its ability to successfully learn important protocol specifications. Specy can discover previously unidentified specifications overlooked by developers, automatically derive inductive invariants that were previously constructed manually for verification purposes, and, through run-time monitoring in production systems, reveal gaps in testing coverage – highlighting opportunities to leverage specifications in practice.
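The grammar-based enumeration can be caricatured in a few lines (our toy, nowhere near Specy's quantified grammar): enumerate candidate "every p is eventually answered by q" specifications and keep those consistent with all observed traces.

```python
def holds(trace, p, q):
    """True iff in `trace` every occurrence of p is later followed by a q."""
    pending = False
    for ev in trace:
        if ev == p:
            pending = True
        elif ev == q:
            pending = False
    return not pending

def learn(traces, events):
    """Enumerate (p, q) response candidates; keep trace-consistent ones."""
    return [(p, q) for p in events for q in events
            if p != q and all(holds(t, p, q) for t in traces)]
```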
Artifacts Available
Article: oopslaa26main-p48-p
Commuting Conversions and Join Points for Call-by-Push-Value
Jonathan Chan,
Madi Gudin,
Annabel Levy, and
Stephanie Weirich
(University of Pennsylvania, USA; Amherst College, USA; University of Maryland, Baltimore County, USA)
Levy’s call-by-push-value (CBPV) is a language that subsumes both call-by-name and call-by-value lambda calculi by syntactically distinguishing values from computations and explicitly specifying execution order. This low-level handling of computation suspension and resumption makes CBPV suitable as a compiler intermediate representation (IR), while its substitution evaluation semantics affords compositional reasoning about programs. In particular, βη-equivalences in CBPV have been used to justify compiler optimizations in low-level IRs. However, these equivalences do not validate commuting conversions, which are key transformations in compiler passes such as A-normalization. Such transformations syntactically rearrange computations without affecting evaluation order, and can reveal new opportunities for inlining.
In this work, we identify the commuting conversions of CBPV, define a commuting conversion normal form (CCNF) for CBPV, present a single-pass transformation into CCNF based on A-normalization, and prove that well-typed, translated programs evaluate to the same result. To avoid the usual code duplication issues that also arise with A-normal form, we adapt the explicit join point constructs by Maurer et al. Our results are all mechanized in Lean 4.
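A commuting conversion can be shown on a toy expression AST (our sketch in Python rather than CBPV; the tuple tags are hypothetical): an application is pushed into the branches of an `if` without changing the result.

```python
def commute(expr):
    """f (if b then t else e)  ~>  if b then (f t) else (f e)."""
    if expr[0] == "app" and expr[2][0] == "if":
        f, (_, b, t, e) = expr[1], expr[2]
        return ("if", b, ("app", f, t), ("app", f, e))
    return expr

def evaluate(expr):
    tag = expr[0]
    if tag == "lit":
        return expr[1]
    if tag == "if":
        return evaluate(expr[2] if evaluate(expr[1]) else expr[3])
    if tag == "app":
        return evaluate(expr[1])(evaluate(expr[2]))

inc = ("lit", lambda x: x + 1)
prog = ("app", inc, ("if", ("lit", False), ("lit", 1), ("lit", 2)))
```

Note that the conversion duplicates `f` into both branches; the join-point constructs the paper adapts exist precisely to bind such a continuation once instead of copying it.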
Artifacts Available
Article: oopslaa26main-p54-p
Hermes: Making Path-Sensitive Pointer Analysis Scalable for Sparse Value-Flow Analysis
Yuxuan He,
Ruilin Jiang,
He Zhang,
Qingkai Shi,
Huaxun Huang, and
Rongxin Wu
(Xiamen University, China; Nanjing University, China)
Sparse Value-Flow Analysis (SVFA) is essential for detecting software bugs such as null pointer dereference and memory leak. However, SVFA heavily relies on path-sensitive pointer analysis, which faces significant scalability challenges when analyzing industrial-scale projects, notably the summary-explosion problem. To address this issue, we propose Hermes, which symbolizes memory side effects and constructs an incomplete Sparse Value-Flow Graph (SVFG) called Lazy Symbolic Expression Graph (LSEG). Leveraging this structure, Hermes builds inter-procedural value flows relevant to bug detection only when necessary, significantly reducing the overhead of pointer analysis and streamlining the bug-search paths. Evaluations on large-scale real-world projects demonstrate that, compared to the state-of-the-art, Hermes achieves average speedups of at least 9.84× and 4.79× for pointer analysis and bug search, respectively, without sacrificing the effectiveness of bug detection.
Article: oopslaa26main-p61-p
MetaSpace: Metamorphic Testing for Spatial Cognition in Embodied Agents
Gengyang Xu,
Dongwei Xiao,
Yiteng Peng, and
Shuai Wang
(Hong Kong University of Science and Technology, China)
An embodied agent is an intelligent entity that interacts with its environment through a physical body. Currently, the evaluation of embodied agents primarily relies on two paradigms: (1) manually annotated Visual Question Answering (VQA) pairs and (2) high-level task completion metrics, such as success in navigation or manipulation. The former is labor-intensive and subject to variability in annotation quality. The latter may obscure critical vulnerabilities, allowing agents to complete tasks through suboptimal means or safety violations, thereby concealing safety risks and inefficiencies. Given that spatial cognition is the cornerstone for executing embodied tasks, there is a pressing need to assess whether embodied agents possess robust spatial cognition during task execution.
Inspired by metamorphic testing principles in software engineering, we propose MetaSpace, a novel framework designed to evaluate the spatial cognition of agents. By leveraging spatiotemporal multimodal states derived from real execution trajectories, MetaSpace automatically generates test cases based on predefined metamorphic relations (MRs) grounded in logical rules and physical laws. Crucially, we encode these MRs as executable rules in a logic programming language (Prolog). Violations of these relations indicate failures in spatial cognition. Our empirical evaluation across three embodied scenarios demonstrates that MetaSpace successfully detects 90,422 spatial cognition errors in state-of-the-art (SOTA) MLLM-driven agents. We introduce the Spatial Cognition (SC) score to quantify performance. Results indicate that all SOTA agents achieve average scores between 0.44 and 0.52, significantly lower than the human benchmark of 0.96. Additionally, these agents struggle with directional tasks, with SC scores consistently below 0.38. In contrast, their performance in magnitude-related tasks is relatively better, with most SC scores exceeding 0.5. To mitigate the identified spatial cognition errors, we explore potential improvement strategies. Preliminary results suggest that traditional prompting techniques (e.g., Chain of Thought) are limited, while spatially-aware prompting (e.g., cognitive maps) shows promise. Our findings underscore the importance of ongoing community efforts to enhance embodied agent performance by prioritizing the improvement of spatial cognition, a fundamental requirement for executing embodied tasks.
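One MR of the kind described, transcribed from its Prolog form into a Python check (illustrative relation names of ours, not MetaSpace's actual rules): relative direction must be symmetric, so a reported `left_of(A, B)` without the matching `right_of(B, A)` flags a spatial-cognition error.

```python
def mr_direction_symmetry(facts):
    """facts: set of (relation, a, b) triples reported by the agent.
    Return violations of: left_of(A, B) implies right_of(B, A)."""
    return sorted(f for f in facts
                  if f[0] == "left_of"
                  and ("right_of", f[2], f[1]) not in facts)

reported = {("left_of", "cup", "plate"),
            ("right_of", "plate", "cup"),
            ("left_of", "fork", "knife")}   # missing the symmetric fact
```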
Article: oopslaa26main-p63-p
Debugging Debugging Information using Dynamic Call Trees
J. Ryan Stinnett and
Stephen Kell
(King's College London, UK)
Debugging tools rely on compiler-generated metadata to present a source-language view, but current compilers often throw away or corrupt debugging information in optimised programs. Attempts to test debugging information are confounded by ad-hoc limitations of the debug info formats and a lack of clarity on whether the compiler or the format is to blame for any given loss. Adopting the "residual program" conceptual view of debug info, we conduct a study of the quality of debugging information with respect to the source-level dynamic call trees it can recover. We compare the trees recovered from optimised and unoptimised versions of the same program, producing a classification of the observed divergences. For each class, we analyse whether the format or the compiler is to blame and identify specific ways to address these defects. We also validate our classification across a larger collection of well-known codebases.
Artifacts Available
Article: oopslaa26main-p64-p
Beer: Interactive Alarm Resolution in Bayesian Program Analysis via Exploration-Exploitation
Haoran Lin,
Zhenyu Yan, and
Xin Zhang
(Peking University, China)
Interactive Bayesian program analysis enhances static analysis by modeling derivations as probabilistic dependencies, enabling ranking alarms by calculated confidences, proposing highly likely alarms for user inspection, and updating confidences with inspection results. Existing interactive approaches adopt a purely greedy, exploitation-only selection strategy that always inspects the highest-confidence alarm. However, such strategies are prone to local optima, leading to redundant inspections and delayed identification of true alarms. We propose Beer (Bayesian Exploration-Exploitation Ranker), a framework that systematically integrates the Exploration-Exploitation trade-off into Bayesian program analysis. Beer leverages structural correlations between alarms—derived from shared root causes in the Bayesian model—to estimate expected information gain and guide exploration. When repeated false alarms indicate model stagnation, Beer selects alarms from minimally explored, highly correlated clusters to accelerate learning. Implemented atop the Bingo framework, Beer improves ranking efficiency by up to 32% over the greedy baseline on datarace, thread-escape, and taint analyses, demonstrating the efficacy of exploration-guided alarm resolution.
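The exploration-exploitation idea can be caricatured with an epsilon-greedy stand-in (Beer's actual selection is cluster- and information-gain-based; everything below, including the data, is a simplification of ours):

```python
def pick_alarm(confidence, cluster, inspections, eps, rng):
    """Exploit: take the highest-confidence alarm. Explore (w.p. eps):
    restrict to the least-inspected cluster, then take its best alarm.
    `rng` is a callable returning a float in [0, 1)."""
    pool = list(confidence)
    if rng() < eps:
        cold = min(set(cluster.values()),
                   key=lambda c: inspections.get(c, 0))
        pool = [a for a in pool if cluster[a] == cold]
    return max(pool, key=lambda a: confidence[a])

alarms = {"a1": 0.9, "a2": 0.8, "a3": 0.4}
clusters = {"a1": "c1", "a2": "c1", "a3": "c2"}
inspections = {"c1": 5, "c2": 0}
```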
Artifacts Available
Article: oopslaa26main-p91-p
noDice: Inference for Discrete Probabilistic Programs with Nondeterminism and Conditioning
Tobias Gürtler and
Benjamin Lucien Kaminski
(Saarland University, Germany; Saarland Informatics Campus, Germany; University College London, UK)
Probabilistic programming languages (PPLs) are an expressive and intuitive means of representing complex probability distributions. In that realm, languages like Dice target an important class of probabilistic programs: those whose probability distributions are discrete. Discrete distributions are common in many fields, including text analysis, network verification, artificial intelligence, and graph analysis. Another important feature in the world of probabilistic modeling are nondeterministic choices as found in Markov Decision Processes (MDPs) which play a major role in reinforcement learning. Modern PPLs usually lack support for nondeterminism. We address this gap with the introduction of noDice, which extends the discrete probabilistic inference engine Dice. noDice performs inference on loop-free programs by constructing an MDP so that the distributions modeled by the program correspond to schedulers in the MDP. Furthermore, decision diagrams are used as an intermediate step to exploit the program structure and drastically reduce the state space of the MDP.
Artifacts Available
Article: oopslaa26main-p98-p
Mechanically Translating Iterative Dataflow Analysis to Algebraic Program Analysis
Chenyu Zhou,
Jingbo Wang, and
Chao Wang
(University of Southern California, USA; Purdue University, USA)
We propose a method for mechanically translating iterative dataflow analysis (IDA) algorithms to algebraic program analysis (APA) algorithms capable of computing exactly the same set of dataflow facts. The method is useful because while most of the dataflow analysis algorithms used in practice are expressed as iterative procedures, APA provides an alternative and inherently compositional approach to solving dataflow problems, thus making it well suited for certain applications (e.g., incremental analysis of a program that goes through frequent code changes, or amortizing the cost of answering a large number of dataflow queries for the same program). However, manually crafting an APA algorithm not only is labor intensive and error prone but also can lead to suboptimal performance. Our method overcomes this limitation by providing a mechanical translation that is correct by construction. Our method handles a broad class of dataflow analysis problems — the only requirements are that (1) the set of dataflow facts is finite and (2) the dataflow functions distribute over the confluence operation (e.g., set union). They include classical dataflow problems whose IDA algorithms can be expressed using Gen/Kill sets, such as reaching definitions, live variables, and available expressions. They also include non-Gen/Kill problems such as copy constant propagation, truly-live variables, and possibly-initialized variables. Our experimental evaluation shows that the translated APA algorithms are not only simpler and easier to understand, but also significantly faster than manually-designed APA algorithms, especially for incremental program analysis.
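The IDA side being translated is the classical worklist iteration; a compact sketch for reaching definitions (the standard textbook Gen/Kill algorithm, not the paper's APA translation):

```python
def reaching_definitions(cfg, gen, kill):
    """Iterate IN[b] = union of OUT[p] over predecessors p, and
    OUT[b] = gen[b] | (IN[b] - kill[b]), to a fixed point.
    `cfg` maps each block to its list of successors."""
    preds = {b: [p for p in cfg if b in cfg[p]] for b in cfg}
    IN = {b: set() for b in cfg}
    OUT = {b: set(gen[b]) for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new = gen[b] | (IN[b] - kill[b])
            if new != OUT[b]:
                OUT[b], changed = new, True
    return IN, OUT

# B2 is a self-loop; d1 reaches it from B1, d2 is generated inside it.
cfg = {"B1": ["B2"], "B2": ["B2", "B3"], "B3": []}
gen = {"B1": {"d1"}, "B2": {"d2"}, "B3": set()}
kill = {"B1": {"d2"}, "B2": {"d1"}, "B3": set()}
```

Both requirements above hold here: the fact set is finite and the transfer functions distribute over set union, so this analysis is in the translatable class.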
Article: oopslaa26main-p99-p
SymGPT: Auditing Smart Contracts via Combining Symbolic Execution with Large Language Models
Shihao Xia,
Mengting He,
Shuai Shao,
Tingting Yu,
Yiying Zhang,
Nobuko Yoshida, and
Linhai Song
(Pennsylvania State University, USA; University of Connecticut, USA; University of California at San Diego, USA; University of Oxford, UK; Institute of Computing Technology at Chinese Academy of Sciences, China)
To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each defining a set of rules governing contract behavior. Violating these rules can cause serious security issues and financial losses, signifying the importance of verifying ERC compliance. Today’s practices of such verification include manual audits, expert-developed program-analysis tools, and large language models (LLMs), all of which remain ineffective at detecting ERC rule violations.
This paper introduces SymGPT, a tool that combines LLMs with symbolic execution to automatically verify smart contracts’ compliance with ERC rules. We begin by empirically analyzing 132 ERC rules from three major ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, SymGPT instructs an LLM to translate ERC rules into a domain-specific language, synthesizes constraints from the translated rules to model potential rule violations, and performs symbolic execution for violation detection. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for financial theft. Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.
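A flavor of the rules being checked (toy Python model of one ERC-20 clause; not SymGPT's DSL or symbolic engine): "transfer MUST throw if the caller's balance is insufficient".

```python
def transfer(balances, sender, to, amount):
    # Compliant: reverts when the sender's balance is insufficient.
    if balances.get(sender, 0) < amount:
        raise ValueError("insufficient balance")
    balances[sender] -= amount
    balances[to] = balances.get(to, 0) + amount

def transfer_buggy(balances, sender, to, amount):
    # Violates the rule: no check, so balances silently go negative.
    balances[sender] = balances.get(sender, 0) - amount
    balances[to] = balances.get(to, 0) + amount

def violates_must_throw(impl):
    """Report whether `impl` breaks the rule on an under-funded transfer."""
    balances = {"alice": 5}
    try:
        impl(balances, "alice", "bob", 10)
    except ValueError:
        return False
    return True
```

Where this sketch tests one concrete scenario, symbolic execution explores the constraint `balance < amount` over all inputs.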
Article Search
Article: oopslaa26main-p107-p
CMakeSonar: A Static Approach to Detecting CMake Bugs with a Fine-Grained Type System
Haotian Han,
Zihang Zhong,
Qingan Li,
Jingling Xue, and
Mengting Yuan
(Wuhan University, China; UNSW Sydney, Australia)
As build systems and their scripts grow in size and complexity, detecting bugs in build configurations becomes increasingly challenging due to the rich functionality and weak typing of build scripting languages. This paper introduces CMakeSonar, the first static approach to precisely identifying semantic bugs in CMake scripts. CMakeSonar addresses this challenge by (1) designing a fine-grained type system that captures the runtime semantics of CMake values, and (2) performing a flow-sensitive analysis that detects inconsistent and ill-typed value usages by solving type constraints. Our approach identifies configuration and usage errors that can silently affect build correctness, portability, and deployment safety. In our evaluation, CMakeSonar identifies 155 bugs across 36 real-world CMake projects on GitHub, of which 23 have been accepted and fixed by developers. With a false positive rate of 4.32% and a recall of 97.48%, CMakeSonar demonstrates that precise static analysis can effectively uncover high-impact bugs in untyped build systems.
Article Search
Artifacts Available
Article: oopslaa26main-p114-p
When Specifications Meet Reality: Uncovering API Inconsistencies in Ethereum Infrastructure
Jie Ma,
Ningyu He,
Jinwen Xi,
Mingzhe Xing,
Liangxin Liu,
Jiushenzi Luo,
Xiaopeng Fu,
Chiachih Wu,
Haoyu Wang,
Ying Gao, and
Yinliang Yue
(Beihang University, China; Zhongguancun Laboratory, China; Hong Kong Polytechnic University, Hong Kong; Beijing Institute of Technology, China; Amber Group, Hong Kong; Huazhong University of Science and Technology, China)
The Ethereum ecosystem, which secures over $381 billion in assets, fundamentally relies on client APIs as the sole interface between users and the blockchain. However, these critical APIs suffer from widespread implementation inconsistencies, which can lead to financial discrepancies, degraded user experiences, and threats to network reliability. Despite this criticality, existing testing approaches remain manual and incomplete: they require extensive domain expertise, struggle to keep pace with Ethereum's rapid evolution, and fail to distinguish genuine bugs from acceptable implementation variations. We present APIDIFfer, the first specification-guided differential testing framework designed to automatically detect API inconsistencies across Ethereum's diverse client ecosystem. APIDIFfer transforms API specifications into comprehensive test suites through two key innovations: (1) specification-guided test input generation that creates both syntactically valid and invalid requests enriched with real-time blockchain data, and (2) specification-aware false positive filtering that leverages large language models to distinguish genuine bugs from acceptable variations. Our evaluation across all 11 major Ethereum clients reveals the pervasiveness of API bugs in production systems. APIDIFfer uncovered 72 bugs, with 90.28% already confirmed or fixed by developers, including one critical error in the official specifications themselves. Beyond these raw numbers, APIDIFfer achieves up to 89.67% higher code coverage than existing tools and reduces false positive rates by 37.38%. The Ethereum community's response validates our impact: developers have integrated our test cases, expressed interest in adopting our methodology, and escalated one bug to the official Ethereum Project Management meeting. 
By making APIDIFfer open-source, we enable continuous validation of Ethereum client API implementations, thereby strengthening the foundational integrity of the entire Ethereum ecosystem.
Article Search
Artifacts Available
Article: oopslaa26main-p122-p
Type Inference for Functional and Imperative Dynamic Languages
Mickaël Laurent and
Jan Vitek
(Charles University, Czech Republic; Czech Technical University, Czech Republic)
In this paper, we formalize a type system based on set-theoretic types for dynamic languages that support both functional and imperative programming paradigms. We adapt prior work in the typing of overloaded and generic functions to support an impure lambda-calculus, focusing on imperative features commonly found in dynamic languages such as JavaScript, Python, and Julia. We introduce a general notion of parametric opaque data types using set-theoretic types, enabling precise modeling of mutable data structures while promoting modularity, clarity, and readability. Finally, we compare our approach to existing work and evaluate our prototype implementation on a range of examples.
Article Search
Artifacts Available
Article: oopslaa26main-p124-p
Fully-Automatic Type Inference for Borrows with Lifetimes
William Brandon,
Benjamin Driscoll,
Frank Dai,
Jonathan Ragan-Kelley,
Mae Milano, and
Alex Aiken
(Massachusetts Institute of Technology, USA; Stanford University, USA; Unaffiliated, USA; Princeton University, USA)
We present a new pure functional language and type system with borrows with lifetimes, and a corresponding fully-automatic type inference procedure. Inference provides users the performance benefits of borrows with lifetimes without requiring user annotation. If the user's program cannot be typed, inference inserts a handful of reference count operations so that it can be. We provide a heap semantics for our borrowing language and prove a soundness theorem, guaranteeing that well-typed programs do not violate memory safety. We implement our memory management strategy as part of the Morphic language stack and compare it to Perceus, a state-of-the-art reference-counting technique based on linear type inference. We find that our system eliminates almost all reference count operations across a range of programs, reducing reference count increments by 75-100% on every benchmark where the baseline performs them. As a result, we achieve a 1.48x geomean speedup across all benchmarks.
Article Search
Artifacts Available
Article: oopslaa26main-p128-p
(Dis)Proving Spectre Security with Speculation-Passing Style
Santiago Arranz-Olmos,
Gilles Barthe,
Lionel Blatter,
Xingyu Xie, and
Zhiyuan Zhang
(MPI-SP, Germany; IMDEA Software Institute, Spain)
Constant-time (CT) verification tools are commonly used for detecting potential side-channel vulnerabilities in cryptographic libraries. Recently, a new class of tools, called speculative constant-time (SCT) tools, has also been used for detecting potential Spectre vulnerabilities. In many cases, these SCT tools have emerged as liftings of CT tools. However, these liftings are seldom defined precisely and are almost never analyzed formally. The goal of this paper is to address this gap by developing formal foundations for these liftings and demonstrating that these foundations can yield practical benefits.
Concretely, we introduce a program transformation, coined Speculation-Passing Style (SPS), for reducing SCT verification to CT verification. Essentially, the transformation instruments the program with a new input that corresponds to attacker-controlled predictions and modifies the program to follow them. This approach is sound and complete, in the sense that a program is SCT if and only if its SPS transform is CT. Thus, we can leverage existing CT verification tools to prove SCT; we illustrate this by combining SPS with three standard methodologies for CT verification, namely reducing it to noninterference, assertion safety, and dynamic taint analysis. We realize these combinations with three existing tools, EasyCrypt, Binsec/Rel, and CTGrind, and we evaluate them on Kocher’s benchmarks for Spectre-v1. Our results focus on Spectre-v1 in the standard CT leakage model; however, we also discuss applications of our method to other variants of Spectre and other leakage models.
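The flavor of the transformation can be caricatured in a few lines of Python. This is our simplified model (explicit leakage traces, a modulo hack to stand in for a misspeculated wild read), not the paper's formal definition:

```python
# Caricature of Speculation-Passing Style (our simplification, not the
# paper's transformation): the branch prediction becomes an explicit,
# attacker-controlled input, control flow follows it, and the leakage
# trace a CT analysis observes now covers misspeculated executions.

def original(i, table):
    leak = []                          # trace of leaked observations
    if i < len(table):                 # bounds check
        leak.append(("read", i))       # in-bounds load leaks index i
        return table[i], leak
    return 0, leak

def sps(i, table, predict_taken):
    leak = []
    if predict_taken:                  # branch outcome is now an input
        leak.append(("read", i))       # an out-of-bounds i leaks too
        return table[i % len(table)], leak   # crude model of a wild read
    return 0, leak

# Analyzing `sps` over all predictions exposes speculative leakage that
# the sequential semantics of `original` never exhibits:
_, trace = sps(7, [1, 2, 3], predict_taken=True)
assert ("read", 7) in trace             # leaked under misprediction
assert original(7, [1, 2, 3])[1] == []  # invisible without speculation
```

In this toy setting, checking the transformed program for constant-timeness over all prediction inputs plays the role of SCT verification of the original, mirroring the sound-and-complete reduction claimed above.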
Article Search
Artifacts Available
Article: oopslaa26main-p130-p
Static Factorisation of Probabilistic Programs with User-Labelled Sample Statements and While Loops
Markus Böck and
Jürgen Cito
(TU Wien, Austria)
It is commonly known that any Bayesian network can be implemented as a probabilistic program, but the reverse direction is not so clear. In this work, we address the open question of to what extent a probabilistic program with user-labelled sample statements and while loops (features found in languages like Gen, Turing, and Pyro) can be represented graphically. To this end, we extend existing operational semantics to support these language features. By translating a program to its control-flow graph, we define a sound static analysis that approximates the dependency structure of the random variables in the program. As a result, we obtain a static factorisation of the implicitly defined program density, which is equivalent to the known Bayesian network factorisation for programs without loops and constant labels, but constitutes a novel graphical representation for programs that define an unbounded number of random variables via loops or dynamic labels. We further develop a sound program slicing technique that leverages this structure to statically enable three well-known optimisations for the considered program class: we reduce the variance of gradient estimates in variational inference, and we speed up both single-site Metropolis-Hastings and sequential Monte Carlo. These optimisations are proven correct and empirically shown to match or outperform existing techniques.
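For the loop-free, constant-label case, the factorisation recovered is the standard Bayesian-network one, p(x1, ..., xn) = Π p(xi | parents(xi)). A toy Python sketch of that correspondence (our illustration; recovering the parent sets for real programs requires the paper's control-flow-graph analysis):

```python
# Toy sketch (our illustration): for a straight-line program whose sample
# statements and parent sets are known, the implicitly defined density
# factorises exactly as in a Bayesian network.

program = [                 # (sampled variable, its parents) per statement
    ("z", []),
    ("x", ["z"]),
    ("y", ["z", "x"]),
]

def factorisation(prog):
    # Emit one conditional factor per sample statement.
    return " * ".join(
        f"p({v})" if not ps else f"p({v} | {', '.join(ps)})"
        for v, ps in prog)

assert factorisation(program) == "p(z) * p(x | z) * p(y | z, x)"
```

Loops and dynamic labels break this simple picture because the set of sampled variables is no longer fixed, which is precisely the setting the paper's static analysis targets.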
Article Search
Artifacts Available
Article: oopslaa26main-p132-p
Mixtris: Mechanised Higher-Order Separation Logic for Mixed Choice Multiparty Message Passing
Jonas Kastberg Hinrichsen,
Iwan Quémerais, and
Lars Birkedal
(Aalborg University, Denmark; ENS-Lyon, France; Aarhus University, Denmark)
Mixed choice multiparty message passing is an expressive concurrency programming paradigm where components use non-determinism to choose between concurrent options for sending and receiving messages. This flexibility makes it possible to program advanced algorithms, such as leader election protocols, succinctly. We present Mixtris, a mechanised higher-order separation logic for reasoning about functional correctness of higher-order imperative programs with mixed choice multiparty message passing in a shared memory setting. Mixtris builds upon recent work on separation logic for (non-mixed choice) multiparty message-passing programs, by drawing inspiration from session type systems for mixed choice multiparty message-passing programs. Mixtris is the first program logic for mixed choice multiparty message passing. We prove soundness of Mixtris using a novel model of our mixed choice multiparty protocols. We demonstrate how Mixtris can be used to formally reason about challenging examples, including leader election protocols such as Chang and Roberts’s ring leader election protocol. All the results in the paper (both meta-theory and examples) have been formalised in the Rocq proof assistant on top of the Iris program logic framework.
Article Search
Artifacts Available
Article: oopslaa26main-p134-p
Phaedrus: Predicting Dynamic Application Behavior with Lightweight Generative Models and LLMs
Bodhisatwa Chatterjee,
Neeraj Jadhav, and
Santosh Pande
(Georgia Institute of Technology, USA)
Application profiling is an indispensable technique for many software development tasks, such as code and memory layout optimizations, where optimization decisions are tailored to specific program profiles. Unfortunately, modern application codebases exhibit highly variable behavior across different inputs, creating challenges for conventional profiling approaches that rely on a single representative execution instance. In this paper, we propose Phaedrus, a new compiler-assisted deep learning framework designed to predict dynamic program behavior across varied execution instances, specifically focusing on dynamic function call prediction. These predicted call sequences are subsequently used to guide input-specific compiler optimizations, producing code specialized for each execution instance.
Traditional profile-guided optimization methods struggle with the input-dependent variability of modern applications, where profiling on different inputs yields divergent application behaviors. To address this, Phaedrus proposes two new approaches: Application Profile Synthesis (Dynamis), a profile-less approach where Large Language Models (LLMs) directly infer dynamic functions based on source code and static compiler analysis, bypassing the need for traditional profiling, and Application Profile Generalization (Morpheus), which uses generative models trained on compressed and augmented Whole Program Path (WPP) based function profiles to predict application behavior under unseen inputs. Our experiments show that Phaedrus accurately identifies the most frequently executed and runtime-dominated hotspot functions, accounting for up to 85–99% of total execution time. Leveraging these predictions, Phaedrus enables superior profile-guided optimizations, delivering an average speedup of 6% (up to 25%) and a binary size reduction of 5.19% (up to 19%), without any program execution. In addition, Phaedrus reduces WPP function profile sizes by up to 107×.
Preprint
Artifacts Available
Article: oopslaa26main-p136-p
Metamorphic Testing for Infrastructure-as-Code Engines
David Spielmann,
George Zakhour,
Dominik Arnold,
Matteo Biagiola,
Roland Meier, and
Guido Salvaneschi
(University of St. Gallen, Switzerland; University of Zurich, Switzerland; USI Lugano, Switzerland; armasuisse, Switzerland)
Infrastructure-as-Code (IaC) engines, such as Terraform, OpenTofu, and Pulumi, automate the provisioning and management of cloud resources. They parse IaC specifications and orchestrate the required actions, making them the backbone of modern clouds, and critical to the reliability of both the underlying infrastructure and the software that depends on it. Despite this importance, this class of systems has received little attention: prior work largely targets the correctness of IaC programs rather than the IaC engines themselves. Existing test suites rely on manually written oracles and struggle to expose faults that manifest across multiple executions, leaving a significant reliability gap.
We present EMIaC, a metamorphic testing framework for IaC engines. EMIaC defines metamorphic relations as graph-based transformations of IaC programs and checks invariants across executions of the original and transformed programs. A central novelty is our use of e-graphs in software testing, as both a test-input generator and an equivalence oracle. E-graphs compactly represent program equivalences, enabling the systematic generation of large spaces of equivalent IaC programs. To ground these relations, we analyze 43,593 real-world Terraform programs and show that IaC dependency graphs are typically small and sparse, making e-graphs a natural fit.
Evaluating EMIaC on Pulumi, Terraform, and OpenTofu, we show that it complements existing test suites by exercising engine-critical code paths and covering 98 previously untested statements in Terraform and 1,313 in Pulumi. EMIaC also uncovers previously unknown issues in all three test suites, improving their adequacy. Three test cases have been merged into Terraform's main branch, and Pulumi has merged a specification fix.
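To make the metamorphic idea concrete, the sketch below applies one equivalence-preserving transformation (permuting independent resource declarations) to a toy dependency-resolving "engine" and checks that the final states agree. This is our illustration only; EMIaC's relations are graph-based and e-graph-driven, and real IaC engines are vastly more complex:

```python
# Toy metamorphic relation in the spirit of EMIaC (our illustration, not
# the tool's machinery): permuting *independent* resource declarations is
# equivalence-preserving, so a correct engine must reach the same final
# state for both programs; divergence would signal an engine fault.

def provision(program):
    # Minimal "engine": repeatedly create any resource whose deps exist.
    state, pending = {}, list(program)
    while pending:
        for res in pending:
            if all(dep in state for dep in res["deps"]):
                state[res["name"]] = {"deps": sorted(res["deps"])}
                pending.remove(res)
                break
        else:
            raise ValueError("unsatisfiable dependency")
    return state

prog = [
    {"name": "vpc",    "deps": []},
    {"name": "subnet", "deps": ["vpc"]},
    {"name": "bucket", "deps": []},   # independent of the other two
]
transformed = [prog[2], prog[0], prog[1]]   # the metamorphic transform

assert provision(prog) == provision(transformed)   # invariant holds
```

The finding that real-world dependency graphs are small and sparse is what makes exhaustively enumerating such equivalent variants (via e-graphs) tractable.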
Article Search
Artifacts Available
Article: oopslaa26main-p139-p
Class-Dictionary Specialization with Rank-2 Polymorphic Functions
Yong Qi Foo and
Michael D. Adams
(National University of Singapore, Singapore)
Rank-2 polymorphism, when combined with type-class-constrained arguments, enables powerful abstractions and code reuse by allowing functions to accept arguments that are themselves ad-hoc polymorphic. Optimizing compilers like the Glasgow Haskell Compiler (GHC) use techniques like class-dictionary specialization and inlining for optimization. However, these techniques can falter when the rank-2 polymorphic functions are recursive. Specializing the polymorphic arguments of recursive rank-2 polymorphic function applications requires inlining the applied function to expose type and dictionary applications to the arguments, but the compiler’s heuristic-driven inliner is reluctant to inline recursive functions, risking non-termination during compilation. This stalemate causes recursive rank-2 polymorphic functions to remain un-optimized and be left with a runtime penalty. Within the Haskell ecosystem, we identify this stalemate within three widely used libraries: the Scrap Your Boilerplate (SYB) and SYB With Class (SYB3) generic-programming libraries, and a fragment of the lens optics library. In these libraries, the combination of rank-2 polymorphism, type-class constraints and recursion is central to their implementation.
In this paper, we present a new optimization technique that breaks this stalemate and enables class-dictionary specialization with recursive rank-2 polymorphic functions. We introduce a partial evaluator that strategically applies the standard transformations of inlining, β-reduction and memoization to applications of rank-2 polymorphic functions that are partially static, i.e., whose arguments have some information known statically. This process exposes applications of its polymorphic arguments onto concrete types and dictionaries, which can be specialized by the standard compiler optimization pipeline. Additionally, we introduce type-constant folding to evaluate run-time type-equality tests statically, further leveraging static type information gained from partial evaluation. We implement our technique as a GHC plugin and demonstrate its effectiveness by resolving the performance bottlenecks in the aforementioned Haskell libraries. On SYB and SYB3 traversals, our technique achieves speedups of 43× on average (up to 155×) and 6.1× on average (up to 9.5×), respectively, matching the performance of their hand-written counterparts. On the fragment of the lens library containing the identified slowdowns, our technique achieves speedups of 1.6× on average (up to 2.1×).
Article Search
Artifacts Available
Article: oopslaa26main-p141-p
Sound and Complete Invariant-Based Heap Encodings
Zafer Esen,
Philipp Rümmer, and
Tjark Weber
(Uppsala University, Sweden; University of Regensburg, Germany)
Verification of programs operating on heap-allocated data structures, for instance lists or trees, poses significant challenges due to the potentially unbounded size of such data structures. We present time-indexed heap invariants, a novel invariant-based heap encoding leveraging uninterpreted predicates and prophecy variables to reduce verification of heap-manipulating programs to verification of programs over integers only. Our encoding of the heap is general and agnostic to specific data structures. To the best of our knowledge, our approach is the first heap-invariant-based method that achieves both soundness and completeness. We provide formal proofs establishing the correctness of our encodings. Through an experimental evaluation, we demonstrate that time-indexed heap invariants significantly extend the capability of existing verification tools, allowing automatic verification of heap-manipulating programs that were previously out of reach for state-of-the-art tools.
Article Search
Artifacts Available
Article: oopslaa26main-p144-p
Scylla: Translating an Applicative Subset of C to Safe Rust
Aymeric Fromherz and
Jonathan Protzenko
(Inria, France; Google, USA)
The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead advocate for a different approach, where the programmer iterates on the original C, gradually making the code more structured until it becomes eligible for compilation to safe Rust. This means that redesigns and rewrites can be evaluated incrementally for performance and correctness against existing test suites and production environments.
Compiling structured C to safe Rust relies on the following contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on “split trees” which allows expressing C’s pointer arithmetic using Rust’s slices and splitting operations; an analysis that infers which borrows need to be mutable; and a compilation strategy for C pointer types that is compatible with Rust’s distinction between non-owned and owned allocations. We evaluate our approach on real-world cryptographic libraries, binary parsers and serializers, and a file compression library. We show that these can be rewritten to Rust with small refactors of the original C code, and that the resulting Rust code exhibits similar performance characteristics as the original C code. As part of our translation process, we also identify and report undefined behaviors in the bzip2 compression library and in Microsoft’s implementation of the FrodoKEM cryptographic primitive.
Article Search
Artifacts Available
Article: oopslaa26main-p146-p
Hybrid Game Control Envelope Synthesis
Aditi Kabra,
Jonathan Laurent,
Stefan Mitsch, and
André Platzer
(Carnegie Mellon University, USA; KIT, Germany; DePaul University, USA)
Control problems for embedded systems like cars and trains can be modeled by two-player hybrid games. Control envelopes, which are families of safe control solutions, correspond to nondeterministic policies that ensure a player following them will not lose. Each deterministic, finite specialization of the nondeterministic policy is a control solution. This paper synthesizes control envelopes for hybrid games that are as permissive as possible. It introduces subvalue maps, a compositional representation of such policies that enables verification and synthesis along the structure of the game. An inductive logical characterization in differential game logic (dGL) checks whether a subvalue map induces a sound control envelope which ensures that the player never loses, no matter what actions the opponent plays. The maximal subvalue map, which allows the most action options while still winning, is shown to exist and satisfy a logical characterization. An inductive subvalue map synthesis framework is obtained from the soundness characterization. An evaluation of the framework uses the significant expressivity of dGL to model and solve a broad range of control challenges.
Article Search
Article: oopslaa26main-p158-p
Hunting CUDA Bugs at Scale with cuFuzz
Mohamed Tarek Ibn Ziad and
Christos Kozyrakis
(NVIDIA, USA; Stanford University, USA)
GPUs play an increasingly important role in modern software. However, the heterogeneous host-device execution model and expanding software stacks make GPU programs prone to memory-safety and concurrency bugs that evade static analysis. While fuzz-testing, combined with dynamic error checking tools, offers a plausible solution, it remains underutilized for GPUs. In this work, we identify three main obstacles limiting prior GPU fuzzing efforts: (1) kernel-level fuzzing leading to false positives, (2) lack of device-side coverage-guided feedback, and (3) incompatibility between coverage and sanitization tools. We present cuFuzz, the first CUDA-oriented fuzzer that makes GPU fuzzing practical by addressing these obstacles.
cuFuzz uses whole program fuzzing to avoid false positives from independently fuzzing device-side kernels. It leverages NVBit to instrument device-side instructions and merges the resultant coverage with compiler-based host coverage. Finally, cuFuzz decouples sanitization from coverage collection by executing host- and device-side sanitizers in separate processes. cuFuzz uncovers 43 previously unknown bugs (19 in commercial libraries) across 14 CUDA programs, including illegal memory accesses, uninitialized reads, and data races. cuFuzz achieves significantly more discovered edges and unique inputs compared to baseline approaches, especially on closed-source targets. Moreover, we quantify the execution time overheads of the different cuFuzz components and add persistent-mode support to improve the overall fuzzing throughput. Our results demonstrate that cuFuzz is an effective and deployable addition to the GPU testing toolbox. cuFuzz is publicly available at https://github.com/NVlabs/cuFuzz/.
Article Search
Artifacts Available
Article: oopslaa26main-p167-p
Beacon: Detecting Broken Access Control Vulnerabilities in DBMSs via System Catalog Consistency Validation
Zongrui Peng,
Jingzhou Fu,
Zhiyong Wu,
Jie Liang,
Xiangdong Huang,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; Beihang University, China; Aviation Industry Corporation of China, China)
Access control in DBMSs is critical for ensuring data security and integrity. However, the increasing complexity of its implementation often introduces broken access control (BAC) vulnerabilities. These vulnerabilities can lead to severe consequences, including privilege escalation, unauthorized data access, or even full compromise of the DBMS. Existing manual testing for BAC vulnerabilities is time-consuming and incomplete. Automated methods like static analysis also struggle in DBMSs, as static rules are difficult to apply to multi-level and dynamically changing privileges.
In this paper, we propose Beacon, which detects BAC vulnerabilities by validating the consistency between SQL operations and system catalogs. Our key insight is that the visibility of objects in the system catalogs is consistent with the user's access control: if an object is invisible to a user in the system catalogs, the user should not have any access privileges on it. Any inconsistency suggests that a user is exceeding their privileges, indicating a potential BAC vulnerability. We used Beacon to test eight popular DBMSs (e.g., MySQL and MariaDB), uncovering 39 previously unknown BAC vulnerabilities. Among them, 19 result in privilege escalation, and 20 lead to unauthorized information disclosure. Moreover, 7 of them have existed in DBMSs for more than 6 years, with the longest-persisting one lasting 13 years. DBMS vendors took these issues seriously and have already confirmed all of these vulnerabilities. Many vendors provided positive feedback, recognizing the importance of addressing these vulnerabilities. For instance, OceanBase awarded bounties for reported vulnerabilities, underscoring Beacon's role in improving DBMS access control.
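The key insight can be sketched as a small consistency oracle. The catalog model and operation log below are hypothetical simplifications of what such a tool would extract from a real DBMS:

```python
# Toy version of Beacon's consistency oracle (our simplification; the
# catalog and operation log are hypothetical): if an object is invisible
# to a user in the system catalogs, no operation by that user on it
# should succeed, so any success is a potential BAC finding.

def visible_objects(catalog, user):
    return {obj for obj, users in catalog.items() if user in users}

def find_bac(catalog, observed_ops):
    # observed_ops: (user, object, succeeded) triples from executed SQL.
    return [(u, o) for u, o, ok in observed_ops
            if ok and o not in visible_objects(catalog, u)]

catalog = {"t_sales": {"alice"}, "t_hr": {"bob"}}
ops = [
    ("alice", "t_sales", True),   # consistent: visible and allowed
    ("alice", "t_hr",    True),   # inconsistent: invisible yet succeeded
    ("bob",   "t_sales", False),  # consistent: invisible and denied
]
assert find_bac(catalog, ops) == [("alice", "t_hr")]
```

The oracle needs no static privilege rules; it only compares runtime outcomes against catalog visibility, which is what lets the approach cope with multi-level, dynamically changing privileges.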
Article Search
Artifacts Available
Article: oopslaa26main-p172-p
Efficient Directed Hybrid Fuzzing via Target-Centric Seed Selection and Generation
Zhen Li,
Shenghan Liu,
Qiuping Yi,
Pengbo Du, and
Hongliang Liang
(Beijing University of Posts and Telecommunications, China)
Software vulnerabilities pose severe security threats, highlighting the need for effective automated detection. Directed hybrid fuzzing, which combines the rapid exploration of fuzz testing with the precise constraint solving of symbolic execution, has made notable advancements in vulnerability discovery. However, existing directed hybrid fuzzing approaches still face two key challenges: (1) inefficient seed selection, leading to inadequate prioritization of optimal inputs for symbolic execution, and (2) inefficient seed generation, resulting in low-quality seeds. To address these issues, we propose TACO-Fuzz, TArget-Centric cOncolic Fuzzing, which introduces a two-phase target-centric seed selection strategy to prioritize under-explored paths and a target-centric seed generation approach based on constructing extended path conditions, thereby improving seed quality. Our evaluation on a selected set of public benchmarks shows that TACO-Fuzz outperforms several representative state-of-the-art directed fuzzing tools, achieving an average speedup of nearly 10x in reaching target locations, along with comparable improvements in reproducing real-world vulnerabilities. Moreover, TACO-Fuzz contributed to the discovery of 17 previously unknown vulnerabilities, each assigned a CVE, and demonstrated faster vulnerability discovery and reproduction in most cases.
Article Search
Artifacts Available
Article: oopslaa26main-p176-p
Efficient Incremental GR(1) Synthesis via Monotonic Fixed-Point Reuse
Sirui Liu,
Wei Dong,
Yijie Zheng, and
Haonan Guo
(National University of Defense Technology, China)
Although reactive synthesis guarantees correct-by-construction implementations, its practical adoption is limited by a performance bottleneck in the iterative design-and-refinement cycle of formal specifications. GR(1) synthesizers perform redundant, from-scratch computations for each specification modification, severely slowing the development workflow. To address this inefficiency, we propose an incremental synthesis method that exploits the monotonicity of the underlying fixed-point computations. By reusing the system winning region from the preceding check, our method significantly accelerates realizability checking.
Our work applies an incremental method to iterative GR(1) specification development covering the full GR(1) scope, including system guarantees and environment assumptions, to accelerate realizability checking. Furthermore, we introduce two heuristics for early fixed-point detection during incremental realizability checking. Finally, we analyze why prior work failed to integrate an incremental technique limited to system guarantees into the unrealizable core minimization algorithm DDMin, further establishing iterative development as a suitable application setting for incremental methods.
Evaluated on a large-scale benchmark of 8,282 specifications, our method exhibits a strong positive correlation between performance gain and specification complexity. For the most challenging specifications, which constitute the primary bottlenecks in development, our approach achieves speedups of several orders of magnitude, reducing computation times in some cases from nearly an hour to just seconds.
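The monotonicity argument can be illustrated on a toy greatest-fixed-point computation: seeding the iteration with the previous winning region reaches the same answer in fewer steps than recomputing from the full state set. The successor relation below is a hypothetical toy graph, not GR(1) game semantics:

```python
# Toy illustration of the monotonicity behind incremental realizability
# checking (our sketch, not the paper's algorithm): for a monotone
# shrinking operator, iterating from the previous fixed point after a
# strengthening change converges to the same result as iterating from the
# top element, in fewer steps.

def gfp(F, start):
    x, iters = start, 0
    while True:
        nxt, iters = F(x), iters + 1
        if nxt == x:
            return x, iters
        x = nxt

STATES = frozenset(range(6))
SUCC = {0: {0}, 1: {0}, 2: {1}, 3: {7}, 4: {3}, 5: {4}}  # toy graph

def F(x):
    # Keep exactly the states that have a successor still inside x.
    return frozenset(s for s in x if SUCC[s] & x)

win, _ = gfp(F, STATES)          # winning region of the original spec
assert win == {0, 1, 2}

SUCC[2] = {9}                    # a strengthening specification change
fresh, n_fresh = gfp(F, STATES)  # from-scratch recomputation
warm, n_warm = gfp(F, win)       # seeded with the previous region
assert fresh == warm == {0, 1}
assert n_warm < n_fresh          # reuse saves iterations
```

On specifications where each fixed-point iteration is expensive, this saved work is what compounds into the orders-of-magnitude speedups reported above.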
Article Search
Article: oopslaa26main-p187-p
VeriEQ: Finding Verilog Simulators and Synthesizers Bugs with Equivalence Circuit Transformation
Zhen Yan,
Yuanliang Chen,
Fuchen Ma,
Zehong Yu,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; Aviation Industry Corporation of China, China)
Verilog simulators and synthesizers play a critical role in chip design and verification. However, due to the complexity of simulation and synthesis processes, they are prone to introducing various types of bugs. Among these, Behavioral Deviation Bugs (BDBs) are particularly severe, as they can cause incorrect results by introducing subtle semantic deviations. Such deviations make the chip behave differently from its intended design and may even enable hardware backdoors.
In this work, we propose VeriEQ, an automated framework based on the concept of metamorphic testing, to detect BDBs by generating semantically equivalent Verilog programs. First, to increase the likelihood of triggering BDBs, we analyze the structural patterns of historical BDB cases and design a specialized Verilog code template. Second, we generate semantically equivalent variants by applying equivalence-preserving circuit transformation rules. These rules incorporate constraints on bit-width and signedness to ensure logical consistency before and after transformation. Finally, we design an inlined deviation-checking mechanism that embeds multiple equivalent modules within a single testbench, thereby improving testing efficiency. We implement and evaluate VeriEQ on four mainstream Verilog simulators and synthesizers. Experimental results demonstrate that VeriEQ achieves a 138.1% to 4161.9% speedup compared to state-of-the-art tools. In total, VeriEQ successfully detects 33 previously unknown bugs, including 29 BDBs and 4 hang bugs as additional findings. All discovered bugs have been confirmed, with 27 already fixed. In contrast, existing tools are able to detect only 1 to 7 bugs.
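The testing scheme can be miniaturized as follows: evaluate an original circuit expression and an equivalence-transformed variant under the same "simulator" and flag any divergence. The expression encoding, rewrite rule, and seeded bug below are our illustration, not VeriEQ's templates or transformation rules:

```python
# Toy metamorphic check in the spirit of VeriEQ (our illustration): run
# two semantically equivalent circuit expressions through one simulator
# and flag divergence, which indicates a behavioral deviation bug.

import itertools

# Rewrite used as the equivalence transformation: (a & b) | (a ^ b) == a | b
original_expr    = ("or", ("and", "a", "b"), ("xor", "a", "b"))
transformed_expr = ("or", "a", "b")

def make_sim(xor_impl):
    # A tiny expression "simulator"; xor_impl lets us seed a bug.
    def ev(node, env):
        if isinstance(node, str):
            return env[node]
        op, lhs, rhs = node
        l, r = ev(lhs, env), ev(rhs, env)
        return l & r if op == "and" else l | r if op == "or" else xor_impl(l, r)
    return ev

good_sim  = make_sim(lambda l, r: l ^ r)
buggy_sim = make_sim(lambda l, r: l & r)   # seeded bug: xor acts like and

def deviates(sim, width=2):
    # Exhaustively compare both variants over all inputs of the given width.
    return any(sim(original_expr, {"a": a, "b": b})
               != sim(transformed_expr, {"a": a, "b": b})
               for a, b in itertools.product(range(1 << width), repeat=2))

assert not deviates(good_sim)   # equivalent variants agree
assert deviates(buggy_sim)      # the seeded deviation is caught
```

Note that the oracle is the equivalence itself; no golden reference simulator is needed, which is what makes the metamorphic approach practical for closed-source tools.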
Article Search
Artifacts Available
Article: oopslaa26main-p188-p
Experimental Evaluation Methodology for the Era of No Steady Performance
Jaromír Antoch,
Walter Binder,
Lubomír Bulej,
François Farquet,
Vojtěch Horký,
Aleksandar Prokopec,
Andrea Rosà, and
Petr Tůma
(Charles University, Czech Republic; USI Lugano, Switzerland; Oracle Labs, Switzerland)
Recent studies of virtual machine warm-up have pointed out that even small deterministic microbenchmarks executed in tightly controlled circumstances often do not reach a steady state of peak performance. This impacts performance evaluation methodologies that focus on performance after warm-up, because the lack of a steady state may violate common assumptions made when computing metrics such as the average performance or the confidence interval for that average.
Our work examines the reported lack of steady state in the context of comparatively larger virtual machine workloads. We document and analyze a similar lack of steady state and argue that it should be considered an inherent property of these workloads rather than a fault. We introduce an updated performance evaluation methodology for workloads whose execution exhibits segments of steady state performance separated by sudden performance changes. Using the Renaissance benchmark suite for the Java Virtual Machine, we show that the methodology can produce confidence intervals that miss the true performance over 20% less often than the existing methodologies.
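The segmented view can be illustrated with a toy sketch: rather than one global confidence interval over all measurements, split the run at sudden performance changes and report an interval per steady segment. The change-detection rule here (a jump larger than a threshold between consecutive samples) is a naive stand-in for the paper's method, and the numbers are made up.

```python
# Hypothetical sketch: per-segment confidence intervals for a run whose
# execution exhibits steady segments separated by sudden performance changes.
import statistics

def segment(samples, threshold):
    # start a new segment whenever consecutive samples jump by more than `threshold`
    segments, current = [], [samples[0]]
    for x in samples[1:]:
        if abs(x - current[-1]) > threshold:
            segments.append(current)
            current = []
        current.append(x)
    segments.append(current)
    return segments

def interval(seg, z=1.96):
    # normal-approximation confidence interval for the segment mean
    m = statistics.mean(seg)
    half = z * statistics.stdev(seg) / len(seg) ** 0.5 if len(seg) > 1 else 0.0
    return (m - half, m + half)

# two steady phases separated by a sudden performance change
run = [10.1, 10.0, 10.2, 10.1, 15.0, 15.1, 14.9, 15.0]
for seg in segment(run, threshold=2.0):
    print(interval(seg))
```

A single global interval over `run` would straddle the two phases and describe neither; the per-segment intervals do not.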
Article Search
Article: oopslaa26main-p193-p
SART: Sign-Absolute Reformulation Theory for Binary Variable Reduction in Neural Network Verification
Jin Xu,
Miaomiao Zhang, and
Bowen Du
(Tongji University, China)
Complete formal verification of neural networks is crucial for their deployment in safety-critical domains. A key bottleneck stems from encoding complexity: traditional methods assign one binary variable per unstable ReLU neuron. We propose the Sign-Absolute Reformulation Theory (SART), which fundamentally breaks the conventional one-to-one mapping between unstable neurons and binary variables by establishing formal reducibility criteria. This allows for finer-grained modeling, where each unstable neuron corresponds on average to fewer than one binary variable, thereby reducing verification complexity at its source. Based on SART, we derive a theoretical lower bound on the number of binary variables required for complete verification and, under the assumption that P ≠ NP, prove that variables in the final layer can be compressed by 50%, while the number of variables in intermediate layers cannot be further reduced. To overcome the apparent “last-layer-only” limitation, we recast verification as a sequential process and, crucially, show that the gain lifts to the entire network: LayerABS, a SART-based progressive tightening verifier, iteratively treats intermediate layers as temporary final layers and propagates tight bounds that shrink the global search space and binary-variable counts. Furthermore, we reveal a structural law influencing verification complexity: when the signs of weights of unstable neurons satisfy numerical symmetry, with positive and negative weights equal or differing by at most one, the worst-case verification complexity achieves the theoretical optimum, offering theoretical guidance for the design of verification-friendly architectures. As a general-purpose underlying encoding, the value of SART is independent of specific algorithms. 
To comprehensively evaluate its effectiveness, we first evaluate the abstraction-free SART encoding, and then integrate it with abstraction techniques to construct the complete verifier LayerABS and its incomplete variant Incomplete-LayerABS. Across benchmarks, our methods surpass state-of-the-art baselines, validating SART’s practical impact.
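For background, the one-binary-variable-per-unstable-neuron encoding that SART improves on is the standard big-M MILP formulation (textbook material, not taken from the paper). For a ReLU neuron $y = \max(x, 0)$ whose pre-activation bounds satisfy $l < 0 < u$ (i.e., the neuron is unstable), one binary variable $\delta$ suffices:

```latex
% Standard big-M encoding of one unstable ReLU neuron y = max(x, 0),
% with pre-activation bounds l < 0 < u and one binary variable \delta:
y \ge 0, \qquad
y \ge x, \qquad
y \le u\,\delta, \qquad
y \le x - l\,(1 - \delta), \qquad
\delta \in \{0, 1\}.
```

Setting $\delta = 1$ forces the active branch $y = x$, while $\delta = 0$ forces $y = 0$; SART's contribution is to break this one-to-one correspondence so that, on average, fewer than one binary variable is needed per unstable neuron.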
Article Search
Article: oopslaa26main-p200-p
InspectCoder: Dynamic Analysis-Driven Self Repair through Interactive LLM-Debugger Collaboration
Yunkun Wang,
Yue Zhang,
Guochang Li,
Chen Zhi,
Binhua Li,
Fei Huang,
Yongbin Li, and
Shuiguang Deng
(Zhejiang University, China; Alibaba, China)
Complex logic errors in LLM-generated code are challenging to diagnose and repair. While existing LLM-based self-repair approaches conduct intensive static semantic analysis or rely on superficial execution logs, they miss the in-depth runtime behaviors that often expose bug root causes—lacking the interactive dynamic analysis capabilities that make human debugging effective. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control. Our dual-agent framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions. Unlike existing methods that follow fixed log collection procedures, InspectCoder adaptively inspects and perturbs relevant intermediate states at runtime, and leverages immediate process rewards from debugger feedback to guide multi-step reasoning, transforming the LLM debugging paradigm from blind trial-and-error into systematic root-cause diagnosis. We conduct comprehensive experiments on two challenging self-repair benchmarks: BigCodeBench-R and LiveCodeBench-R. InspectCoder achieves 5.10%–60.37% relative improvements in repair accuracy over the strongest baseline, while delivering 1.67x-2.24x better bug-fix efficiency on the two benchmarks, respectively. We also contribute InspectWare, an open-source middleware that abstracts debugger complexities and maintains stateful debugging sessions across mainstream Python testing frameworks. Our work provides actionable insights into interactive LLM-debugger systems, demonstrating the significant potential of LLM-driven dynamic analysis for automated software engineering.
Article Search
Article: oopslaa26main-p202-p
Geo: A Query Rewrite Framework for Graph Pattern Mining
Nazanin Yousefian,
Kasra Jamshidi,
Keval Vora, and
Anders Miltner
(Simon Fraser University, Canada)
Graph pattern mining is important for analyzing graph data. Graph mining systems typically require answering pattern matching queries, which involve solving the NP-complete subgraph isomorphism problem. To address this, domain experts often develop custom pattern matching query optimization strategies based on exploiting substructural similarities across different patterns. While these optimizers can be effective, their development is challenging due to the complex structural properties of the patterns (e.g., subsymmetries), which are difficult to address. This complexity limits the exploration of interactions between different optimization strategies and restricts experts from continuously improving the optimizers—such as by incorporating additional custom or general pattern-based equivalences over time.
In this paper, we present a programmable pattern matching query optimizer called Geo, which automatically manages the interactions between various equivalences, ensures the optimizations maintain correctness of results, and simplifies the management of substructure equivalences. Geo exposes a simple but flexible language for expressing pattern equivalences as rewrite rules. By maintaining canonical representations of generated patterns during equality saturation, Geo avoids issues arising from syntactic differences in isomorphic patterns. Additionally, we develop embedded reconstructability (EmRec), which tracks provenance across equivalences to meet the various reconstructability needs of desired outputs. Our evaluation demonstrates that Geo can discover novel query equivalences through complex composition of various rewrite rules, enabling our optimized queries to achieve a cost reduction of up to 99% compared to the queries in prior work. We further test Geo’s effectiveness at speeding up practical graph mining problems by using it in two representative case studies – approximate pattern matching and quasi-clique mining – and find it is highly effective at optimizing these tasks, enabling cost reductions of up to 71%.
Article Search
Artifacts Available
Article: oopslaa26main-p206-p
Lawyer: Modular Obligations-Based Liveness Reasoning in Higher-Order Impredicative Concurrent Separation Logic
Egor Namakonov,
Justus Fasse,
Bart Jacobs,
Lars Birkedal, and
Amin Timany
(Aarhus University, Denmark; KU Leuven, Belgium)
Higher-order concurrent separation logics, such as Iris, have been tremendously successful in verifying safety properties of concurrent programs. However, state-of-the-art attempts to verify liveness properties in such logics have so far either lacked modularity (the ability to compose specifications of independent modules), or they have been far too complex to mechanize in a proof assistant. In this work, we introduce Lawyer --- a mechanized program logic for modular verification of (fair) termination of concurrent programs. Lawyer draws inspiration from state-of-the-art approaches that use obligations for specifying and proving termination. However, unlike these approaches, which incorporate obligations by instrumenting the source code with erasable auxiliary code and state, Lawyer avoids such instrumentations. Instead, Lawyer incorporates obligations into the logic by embedding them into a purely logical labeled transition system that the program is shown to refine --- this makes Lawyer far more amenable to mechanization. We demonstrate the expressivity of Lawyer by verifying termination of a range of examples, including modular verification of a client program whose termination relies on correctness of a fair lock library, and (separately) proving that a ticket lock implementation implements that library's interface. To the best of our knowledge, Lawyer is the first mechanized program logic that supports modular higher-order impredicative liveness specifications of program modules. All the results that appear in the paper have been mechanized in the Rocq proof assistant on top of the Iris separation logic framework.
Article Search
Artifacts Available
Article: oopslaa26main-p211-p
LARTS: Language Abstractions for Real-Time and Secure Systems
Yanqi Li,
Hongliang Liang,
Rui Yao,
Yang Zhang,
Dong Liu,
Lei Wang, and
Qiuping Yi
(Beijing University of Posts and Telecommunications, China; Beijing Institute of Computer Technology and Application, China)
Real-time systems must simultaneously deliver predictable timing, fault isolation, and memory safety, yet current operating systems expose only low-level primitives that force developers to manually balance concurrency, isolation, and performance. This paper presents LARTS, a language-aided runtime system that elevates these requirements into language abstractions with enforceable semantics. LARTS introduces execution domain, a unified process–thread abstraction that combines thread-level responsiveness with process-level isolation. Memory is managed through deterministic memory contracts, which bind allocation at load time to eliminate runtime failures and unpredictable latencies. Domain interactions are expressed via deterministic communication channels that integrate efficient transfer, type safety, and priority inheritance, ensuring analyzable end-to-end bounds. Moreover, LARTS enforces secure-by-construction semantics, making classes of bugs such as double fetch and use-after-free unrepresentable in the programming model. We formalize the core semantics of LARTS and show how they guarantee determinism and safety by design. A prototype built on RTEMS demonstrates that LARTS preserves competitive real-time performance while substantially reducing programming complexity and eliminating vulnerabilities in realistic case studies. Our results suggest that high-assurance real-time programming can be treated not as an ad-hoc engineering problem, but as a first-class abstraction with verifiable semantics.
Article Search
Artifacts Available
Article: oopslaa26main-p216-p
Grammar Repair with Examples and Tree Automata
Yunjeong Lee,
Gokul Rajiv, and
Ilya Sergey
(National University of Singapore, Singapore)
Context-free grammars (CFGs) are the de-facto formalism for declaratively describing concrete syntax for programming languages and generating parsers. One of the major challenges in defining a desired syntax is ruling out all possible ambiguities in the CFG productions that determine scoping rules as well as operator precedence and associativity. Practical tools for parser generation typically apply ad-hoc approaches for resolving such ambiguities, which might result in a parser's behaviour that contradicts the intents of the language designer. In this work, we present a user-friendly approach to soundly repair grammars with ambiguities, which is inspired by the programming by example line of research in automated program synthesis. At the heart of our approach is the interpretation of both the initial CFG and additional examples that define the desired restrictions in precedence and associativity, as tree automata (TAs). The technical novelties of our approach are (1) a new TA learning algorithm that constructs an automaton based on the original grammar and examples that encode the user's preferred ways of resolving ambiguities all in a single TA, and (2) an efficient algorithm for TA intersection that utilises reachability analysis and optimizations that significantly reduce the size of the resulting automaton, which results in idiomatic CFGs amenable to parser generators. We have proven the soundness of the algorithms, and implemented our approach in a tool called Greta, demonstrating its effectiveness on a series of case studies.
Preprint
Artifacts Available
Article: oopslaa26main-p219-p
Online Input Grammar Synthesis Aided Symbolic Execution
Ke Ma,
Yunlai Luo,
Zhenbang Chen,
Weijiang Hong,
Yufeng Zhang, and
Ji Wang
(National University of Defense Technology, Changsha, China; Hunan University, Changsha, China)
Symbolic execution faces the challenge of generating valid inputs when analyzing programs with complex input formats. Token-based symbolic execution can partially tackle this challenge, but it still struggles to pass input checking and therefore fails to analyze the code beyond it. We propose Lase, an online input grammar synthesis aided symbolic execution method, to generate valid inputs for improving the effectiveness of symbolic execution. Inside Lase, we propose an input grammar-oriented search strategy and a token-level grammar synthesis method. The search strategy prioritizes paths that cover more syntax rules. The token-level grammar synthesis improves the synthesized grammar's precision and completeness while ensuring efficiency. The experimental results on real-world parsing programs with complex input grammars demonstrate that Lase can improve the coverage of parsing code and generate more valid inputs that significantly improve the coverage of functionality code. Furthermore, compared with the state-of-the-art grammar synthesis methods, the grammars learned by Lase have better precision and recall on most benchmark programs.
Article Search
Article: oopslaa26main-p232-p
Designing GPU Data Structures for Efficient Memory Oversubscription
Vipin Patel,
Srinjoy Sarkar,
Swarnendu Biswas, and
Mainak Chaudhuri
(IIT Kanpur, India)
Efficient concurrent data structures are important building blocks for accelerating applications on GPUs. With the ever-increasing memory footprint of GPU workloads, data structures used by kernels can exceed global memory capacity. Using the unified virtual memory (UVM) model is a popular approach for kernels to oversubscribe GPU memory without the need for explicit memory management by a programmer. However, we show that data structures executing with UVM can suffer from performance degradation due to the high overheads associated with data migration and thrashing for irregular access patterns.
In this paper, we propose two-level hierarchical designs for hash table and skip list data structures that aim to maximize access locality and handle use cases where the data structure oversubscribes GPU memory. The outer-level container enables efficient jumps to desired regions of the data structure, while the inner container allows operating on the data. The inner container is sized to facilitate efficient data transfers between the CPU and the GPU. Experimental results on a diverse set of input operation sequences show that our data structure designs substantially improve performance over optimized UVM baselines while supporting high degrees of GPU memory oversubscription. Importantly, our proposed design, when used to implement key-value stores in metagenomics classification and k-mer counting applications, achieves a geomean speedup of 2.06× for hash table and 2.37× for skip list over baseline UVM implementations.
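The two-level idea can be sketched in Python (a deliberately simplified, hypothetical model; the paper's containers are GPU-resident and concurrent): an outer directory provides the "jump" to a region of the data structure, and fixed-capacity inner containers are the unit of data transfer between CPU and GPU memory.

```python
# Illustrative two-level hash table layout (not the paper's implementation).
INNER_CAPACITY = 4  # inner containers sized to match an efficient CPU<->GPU transfer granularity

class TwoLevelHashTable:
    def __init__(self, outer_size=8):
        # outer level: a directory slot per hash bucket, each holding inner containers
        self.outer = [[] for _ in range(outer_size)]

    def insert(self, key, value):
        chain = self.outer[hash(key) % len(self.outer)]
        for inner in chain:
            if len(inner) < INNER_CAPACITY:
                inner.append((key, value))
                return
        chain.append([(key, value)])  # allocate a fresh inner container

    def lookup(self, key):
        chain = self.outer[hash(key) % len(self.outer)]
        for inner in chain:          # one "jump", then a spatially local scan
            for k, v in inner:
                if k == key:
                    return v
        return None

t = TwoLevelHashTable()
t.insert("kmer", 42)
print(t.lookup("kmer"))  # -> 42
```

The point of the layout is locality: under oversubscription, a lookup touches one directory slot and one (or a few) inner containers, so migrations between host and device move compact, useful units rather than thrashing across the whole table.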
Article Search
Article: oopslaa26main-p244-p
Understanding and Finding JIT Compiler Performance Bugs
Zijian Yi,
Cheng Ding,
August Shi, and
Milos Gligoric
(University of Texas at Austin, USA)
Just-in-time (JIT) compilers are key components for many popular programming languages with managed runtimes (e.g., Java and JavaScript). JIT compilers perform optimizations and generate native code at runtime based on dynamic profiling data, to improve the execution performance of the running application. Like other software systems, JIT compilers might have software bugs, and prior work has developed a number of automated techniques for detecting functional bugs (i.e., generated native code does not semantically match that of the original code). However, no prior work has targeted JIT compiler performance bugs, which can cause significant performance degradation while an application is running. These performance bugs are challenging to detect due to the complexity and dynamic nature of JIT compilers. In this paper, we present the first work on demystifying JIT performance bugs. First, we perform an empirical study across four popular JIT compilers for Java and JavaScript. Our manual analysis of 191 bug reports uncovers common triggers of performance bugs, patterns in which these bugs manifest, and their root causes. Second, informed by these insights, we propose layered differential performance testing, a lightweight technique to automatically detect JIT compiler performance bugs, and implement it in a tool called Jittery. We incorporate practical optimizations into Jittery such as test prioritization, which reduces testing time by 92.40% without compromising bug-detection capability, and automatic filtering of false positives and duplicates, which substantially reduces manual inspection effort. Using Jittery, we discovered 12 previously unknown performance bugs in the Oracle HotSpot and Graal JIT compilers, with 11 confirmed and 6 fixed by developers.
Article Search
Article: oopslaa26main-p246-p
Deegen: A JIT-Capable VM Generator for Dynamic Languages
Haoran Xu and
Fredrik Kjolstad
(Stanford University, USA)
Building a high-performance JIT-capable VM for a dynamic language has traditionally required a tremendous amount of time, money, and expertise. We present Deegen, a meta-compiler that allows users to generate a high-performance JIT-capable VM for their own language at an engineering cost similar to writing a simple interpreter. Deegen takes in the execution semantics of the bytecodes implemented as C++ functions, and automatically generates a two-tier VM execution engine with a state-of-the-art interpreter, a state-of-the-art baseline JIT, and the tier-switching logic that connects them into a self-adaptive system.
We are the first to show how to automatically generate a baseline JIT compiler that rivals the current state of the art, and an interpreter that outperforms the current state of the art. Our performance comes from Deegen's ability to automatically apply many state-of-the-art optimizations that previously had to be hand-implemented. These optimizations include bytecode specialization and quickening, register pinning, tag register optimization, call inline caching, generic inline caching, JIT polymorphic IC, JIT IC inline slab, type-check removal and strength reduction, type-based slow-path extraction and outlining, JIT hot-cold code splitting, and JIT OSR-entry. As a result, the performance of the Deegen-generated interpreter and baseline JITs matches or surpasses state-of-the-art interpreters and baseline JITs.
To evaluate Deegen, we use it to implement two languages: a Lua 5.1 VM called LuaJIT Remake (LJR) and a SOM VM called DSOM. Across 44 benchmarks, LJR's interpreter is on average 2.79x faster than the official PUC Lua interpreter, and 1.31x faster than LuaJIT's interpreter. LJR's baseline JIT has negligible compilation cost, and its execution performance is on average 4.60x faster than PUC Lua and only 33% slower (but faster on 13/44 benchmarks) than LuaJIT's optimizing JIT. Across 13 benchmarks, DSOM's interpreter is 4.28x--5.82x faster than the five existing SOM interpreters, and DSOM's baseline JIT compiles 25.84x faster than 2SOM's baseline JIT, while also generating code that runs 15.46x faster.
Article Search
Article: oopslaa26main-p258-p
Automatic Propagation of Profile Information through the Optimization Pipeline
Elisa Frohlich,
Angelica Moreira, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil; Microsoft Research, Brazil)
Profile-guided optimization (PGO) is a well-established technique for improving program performance, being integrated into major compilers such as GCC, LLVM/Clang, and Microsoft Visual C++. PGO collects information about a program's execution and uses it to guide optimizations such as inlining and code layout. However, these very transformations alter the program's control flow, rendering the collected profiles stale or inaccurate. To deal with this problem, this paper investigates how to reuse profile data after optimization without re-executing the program. We study two complementary strategies: prediction, which estimates likely hot code paths in the optimized program, and projection, which transfers profile information from the original control-flow graph to its transformed version. We evaluate several techniques for reconstructing profile data, including a large language model (LLM)–based approach using GPT-4o, and a lightweight method that compares opcode histograms of code regions recursively to identify structural similarities. Our results show that the histogram-based method is not only simpler but also consistently more accurate than both the LLM-based approach and prior prediction and projection techniques, including those implemented in LLVM and the BOLT binary optimizer.
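A minimal sketch of the histogram idea, with made-up regions and weights (the paper's matching works recursively over code regions; this flattens it to a single comparison): match each optimized region to the original region with the most similar opcode distribution, then carry over that region's profile weight.

```python
# Hypothetical sketch of opcode-histogram-based profile projection.
from collections import Counter

def similarity(h1, h2):
    # negative L1 distance between opcode histograms (Counter returns 0 for missing keys)
    keys = set(h1) | set(h2)
    return -sum(abs(h1[k] - h2[k]) for k in keys)

# original regions: (opcode histogram, profile count) -- illustrative values only
original_regions = {
    "loop.header": (Counter({"cmp": 2, "br": 1, "add": 1}), 1000),
    "loop.exit":   (Counter({"ret": 1, "mov": 2}), 1),
}

def project_profile(optimized_histogram):
    # transfer the profile weight of the structurally most similar original region
    best = max(original_regions,
               key=lambda name: similarity(original_regions[name][0], optimized_histogram))
    return original_regions[best][1]

# an optimized block that still "looks like" the loop header after transformation
print(project_profile(Counter({"cmp": 2, "br": 1, "add": 2})))  # -> 1000
```

The appeal of the approach is exactly this robustness: the optimized block need not be syntactically identical to its source region, only distributionally similar.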
Article Search
Article: oopslaa26main-p259-p
PLEX: Normalization for Refinement Types
Alessio Ferrarini,
Niki Vazou, and
Wouter Swierstra
(IMDEA Software Institute, Spain; Universidad Politécnica de Madrid, Spain; Utrecht University, Netherlands)
Refinement types often use SMT solvers to automate program verification. However, since SMT solvers are first-order, verification of properties that require higher-order reasoning is not possible. Proof by Logical Evaluation (PLE) is an algorithm that provides a layer between refinement types and SMT solvers that permits symbolic evaluation of functions, but it lacks support for higher-order reasoning. We introduce PLEX, an extension to PLE, that supports η-expansions, β-reductions, and dependent pattern matching. We prove that PLEX is sound and terminating, describe its implementation in Liquid Haskell, and evaluate it on examples that make essential use of higher-order data and as such cannot be handled by PLE. The new PLEX algorithm bridges the gap between higher-order languages and first-order SMT solvers via refinement types.
Article Search
Article: oopslaa26main-p262-p
EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows
Chenyan Liu,
Yun Lin,
Jiaxin Chang,
Jiawei Liu,
Binhang Qi,
Bo Jiang,
Zhiyong Huang, and
Jin Song Dong
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Bytedance Network Technology, China)
Large language models (LLMs) for code editing have achieved remarkable progress, yet recent empirical studies reveal a fundamental disconnect between technical accuracy and developer productivity. Despite their strong benchmark performance, developers complete tasks 19% slower when using AI assistance, with over 68.81% of recommendations disrupting their mental flow. This misalignment stems from the use of static commit snapshots that lack temporal information, causing models to optimize for end results rather than the incremental, context-sensitive steps that align with developers’ natural reasoning process.
To bridge this gap, we present EditFlow, which benchmarks and optimizes subsequent code edit recommendation systems through the reconstruction of developer editing flows. EditFlow addresses three key challenges. First, collecting edit-order data that reflects developers’ flow is inherently difficult: manual annotation introduces prohibitive overhead, while development logs capture only single trajectories instead of all plausible editing flows. Second, benchmarking recommendation performance against developers’ ongoing editing flow requires a digital-twin-like simulation that can faithfully simulate the editing process. Third, existing heterogeneous systems vary drastically in scale and architecture, posing challenges for developing a unified optimization strategy that endows all models with mental-flow awareness regardless of design or capability.
To overcome these challenges, we propose three tightly coupled components: (1) a prompt auto-tuning mechanism that learns an optimized prompt for inferring the relative order between two edits, (2) a digital twin that replays reconstructed edit sequences to simulate developers’ editing process, and (3) EditFlow, a unified optimization strategy that optimizes the flow continuity of subsequent edit suggestions based on developers’ ongoing flow. Evaluations across diverse benchmarks, including manually annotated commits, real-world industrial code, and open-source repositories, show that EditFlow improves order reconstruction accuracy by 63.81%, reduces flow violations by over 75%, and boosts recommendation precision by 66.99%. A user study with 32 developers further demonstrates 25.11% faster task completion and significantly higher perceived recommendation quality. To the best of our knowledge, EditFlow is the first to evaluate and optimize code edit recommendation systems from the perspective of developers’ mental flow, establishing flow-awareness as a new dimension for advancing human-AI code collaboration.
Article Search
Info
Artifacts Available
Article: oopslaa26main-p268-p
CLower: Detecting Compiler Pessimization Bugs through Redundant Memory Accesses
Jianhao Xu,
Kunbo Zhang,
Mathias Payer,
Kangjie Lu, and
Bing Mao
(Southeast University, China; Nanjing University, China; EPFL, Switzerland; University of Minnesota, USA)
Compilers are expected to generate optimized code, but they sometimes introduce pessimizations, quality-degrading redundant instructions. These bugs not only incur performance overhead but also, critically, expand the attack surface by introducing unexpected side effects (e.g., redundant memory accesses) without breaking compilation correctness. Existing bug-finding methods are neither designed for nor effective at identifying such security-sensitive pessimizations. This paper presents CLower, a novel, black-box approach for automatically detecting compiler pessimizations via redundant memory accesses. CLower's core insight is that any extra global memory accesses in a fully optimized binary, compared to the source, indicate a pessimization. To reliably distinguish compiler-introduced redundancy from source-level redundancy, we generate random C programs in which each global variable has a predetermined, controlled number of memory accesses. CLower then executes the instrumented binary and verifies whether superfluous accesses have been introduced during compilation. We applied CLower to GCC and LLVM, reporting 23 unique bugs (21 in GCC, 2 in Clang), with 16 confirmed as new pessimization bugs. Our evaluation shows that CLower accurately detects diverse, impactful pessimization bugs, the majority of which (75%) also manifest for heap-allocated objects, demonstrating that the underlying compiler flaws are general and not limited to global memory. Furthermore, we identify a systematic conflict between compiler optimizations and pessimization bugs, which causes many such bugs to remain hidden in compiler versions. This study sheds light on the under-explored area of compiler pessimization and provides a practical tool for improving compiler quality.
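The oracle behind this design can be sketched abstractly (simulated counts, not a real harness): the generator plans a fixed number of accesses to each global, and any surplus observed in the compiled binary flags a pessimization. Names and counts below are made up for illustration.

```python
# Hypothetical sketch of CLower's oracle. In the real tool, `EXPECTED` comes
# from the random C program generator and `observed` from executing the
# instrumented, fully optimized binary; here both are plain dictionaries.
EXPECTED = {"g1": 3, "g2": 1}  # accesses to each global, planned by the generator

def check(observed):
    # report globals that the compiled binary touched more often than planned,
    # together with the number of surplus (compiler-introduced) accesses
    return {g: observed[g] - n for g, n in EXPECTED.items() if observed.get(g, 0) > n}

print(check({"g1": 3, "g2": 1}))  # faithful compilation -> {}
print(check({"g1": 5, "g2": 1}))  # redundant reloads of g1 -> {'g1': 2}
```

Because the planned counts are controlled by construction, any surplus can only come from the compiler, which is what lets the approach stay black-box.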
Preprint
Artifacts Available
Article: oopslaa26main-p272-p
Beyond Coverage: Automatic Test Suite Augmentation for Enhanced Effectiveness using Large Language Models
Zeyu Lu,
Peng Zhang,
Yuge Nie,
Yibiao Yang,
Yutian Tang,
Chun Yong Chong, and
Yuming Zhou
(Nanjing University, China; University of Glasgow, UK; Monash University Malaysia, Subang Jaya, Malaysia)
Large Language Models (LLMs) have gained significant traction in software engineering for automating tasks such as unit test generation. Most existing studies prioritize code coverage as the primary metric for enhancing test suite effectiveness. However, prior research has shown that although code coverage can reach approximately 80%, the mutation score, which generally exhibits a stronger correlation with defect detection effectiveness, attains only about 35%. This gap highlights the need to enhance test suite effectiveness guided by mutation score rather than code coverage. Recent studies, including MuTAP and MutGen, explored the use of survived mutants to enhance test suite effectiveness. However, their evaluations were limited to simple standalone methods that rely on built-in functions and standard libraries. Non-standalone methods, which depend on other classes and involve complex user-defined types, are more intricate and commonly found in real-world projects. The limited contextual information and basic repair mechanisms in their prompt designs make it unclear whether their performance can generalize to non-standalone methods. Moreover, the two studies rely on existing language-specific, rule-based mutation techniques, which require specific configurations and incur additional costs when adapting to other programming languages.
To bridge this gap, we propose a novel, fully automatic LLM-based approach to enhance test suite effectiveness, guided by survived mutants. The approach augments initial test suites by integrating mutation testing with test case generation. It takes focal method information as input and generates test cases targeting survived mutants identified from applying the initial test suites. Our approach incorporates multiple prompt techniques, rich contextual information, and an advanced repair mechanism to effectively generate test cases for non-standalone methods. The evaluation covers 1,035 focal methods, categorized as standalone or non-standalone. On average, the mutation score increases by 16.04% for standalone methods and 8.11% for non-standalone methods. We validate the practical impact of augmented test suites in LLM-based code generation. After test suite augmentation, pass@1 decreased by 0.3152 and 0.1772 on average for standalone and non-standalone methods, respectively, indicating the effectiveness of our approach in reducing false positives caused by insufficient test cases in code generation evaluation.
Article Search
Article: oopslaa26main-p275-p
Type-Safe Monotonic Object Evolution
Alexandra Mirrlees-Black,
Haoyu Wu,
Gregor Richards, and
Fabian Muehlboeck
(Australian National University, Australia; University of Waterloo, Canada)
Object evolution is a monotonic approach to typestate and object reclassification: objects may gain, but not lose, properties, which permits aliasing. We present a formalization and prototype implementation of our new language May, featuring inheritance-based evolution that changes the run-time class of an object to a subclass. To statically guarantee that evolution succeeds, we introduce a simple affine permission system for ensuring evolvable references match the run-time type of an object. Furthermore, we demonstrate that our system provides an effective and type-safe way of expressing staged operations and complex initialization procedures.
Article Search
Artifacts Available
Article: oopslaa26main-p276-p
Localizing Type Errors for Syntactic Sugar by Lifting
Zhichao Guan,
Tailai Yu,
Di Wang, and
Zhenjiang Hu
(Peking University, China)
Syntactic sugar enhances the usability of a core language by providing intuitive syntax in a surface language; however, its interaction with the core-language type checker often results in error messages that are unclear to surface programmers. Existing techniques, such as type lifting, can automatically infer typing rules for syntactic sugar, but they do not consider localizing type errors directly in the surface syntax. This paper studies the problem of localizing and reporting type errors for syntactic sugar, addressing two key challenges: precisely localizing errors and ensuring that they are fixable. Inspired by the recently proposed marked lambda calculus (MLC), we develop ℓMLC as our core language which tracks error provenance and locations via type annotations. Building on this, we propose the Stellar framework, which automatically lifts the core language's typing rules to the surface language while enabling error localization in the surface syntax. Stellar also ensures that the reported errors are fixable by incorporating extra premises into the lifted typing rules. We implement provenance and evaluate it across various surface languages with different type structures, demonstrating that our approach precisely localizes errors and avoids unhelpful references to core-language constructs. Our evaluation suggests that provenance can help surface programmers address type errors more effectively, enhancing the practicality of syntactic sugar in language engineering.
Article Search
Article: oopslaa26main-p284-p
When Lifetimes Liberate: A Type System for Arenas with Higher-Order Reachability Tracking
Siyuan He,
Songlin Jia,
Yuyan Bao, and
Tiark Rompf
(Purdue University, USA; Augusta University, USA)
Statically enforcing safe resource management is challenging due to tensions between flexible lifetime disciplines and expressive sharing patterns. Region-based systems offer lexically scoped regions under a stack discipline, wherein resources are managed in bulk. In many such systems, however, resources are second-class and can neither escape their scope nor be freely returned from functions. Ownership and linear type systems, such as Rust, offer first-class, non-lexical lifetimes with robust static guarantees, but rely on invariants that limit higher-order patterns and expressive sharing.
In this work, we propose a type system that uniformly treats all heap-allocated resources under diverse lifetime, granularity, and sharing settings. Our system provides programmers with three allocation modes: (1) fresh allocation for first-class, non-lexical resources; (2) fresh allocation for second-class resources with lexically bounded lifetimes; and (3) coallocation that groups resources by shadow arenas for bulk tracking and deallocation. Regardless of mode, resources are represented uniformly at the type level, supporting generic abstraction and preserving the higher-order parametric nature of the language.
Obtaining static safety in higher-order languages with flexible sharing is nontrivial. To address this, our solution builds on reachability types, and our extension adds the capability to track both individual and grouped resources, enables the expression of cyclic store structures, and allows the selective enforcing of stack lifetime discipline. These mechanisms are formalized in the A<: and {A}<: type systems, which are proven type safe and memory safe in Rocq.
Article Search
Artifacts Available
Article: oopslaa26main-p305-p
Determining the Unreachable: Constraint-Guided Reachability Analysis for Dependency Vulnerabilities
Wenbu Feng,
Xiaohong Li,
Ruitao Feng,
Yao Zhang,
Yuekang Li,
Zhiping Zhou,
Yunqian Wang, and
Yuqing Li
(Tianjin University, China; Southern Cross University, Australia; UNSW Sydney, Australia)
In software development, determining the reachability of dependency vulnerabilities is of great importance, as third-party libraries often contain known vulnerabilities that could be exploited through the application's business logic. Existing reachability analysis methods encounter challenges such as undecidability, abstraction loss, and path explosion in large-scale programs, resulting in an inaccurate distinction between reachable and unreachable vulnerabilities. This paper introduces ConVReach, an approach for analyzing the reachability of vulnerabilities in the dependencies of C/C++ programs. ConVReach overcomes the high abstraction loss and potential path explosion of current methods by combining static and dynamic approaches, in particular a constraint-guided analysis. It extracts and decomposes the path constraints that trigger vulnerabilities, independently verifies the satisfiability of each constraint, and then aggregates the feasible paths. This effectively reduces unnecessary path exploration and avoids the path explosion common in traditional methods. Experimental results show that ConVReach outperforms existing tools in both accuracy and efficiency, effectively distinguishing between reachable and unreachable vulnerabilities and significantly reducing false positives and false negatives.
We constructed a benchmark dataset to evaluate ConVReach, which includes 53 CVEs and 347 flags artificially inserted into various open-source projects. This dataset was designed to simulate both real-world vulnerabilities and complex scenarios. Through testing on this dataset, ConVReach demonstrated exceptional performance. It successfully identified 59 out of 61 reachable vulnerabilities and all 23 unreachable ones in the CVE dataset. Within a 24-hour time budget, ConVReach detected over 50% more reachable vulnerabilities than the baseline tools in the first 6 hours and nearly completed the detection of reachable vulnerabilities by the 12-hour mark. These results highlight ConVReach's superior ability to handle both real-world vulnerabilities and challenging cases.
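The decompose-then-aggregate constraint check described above can be illustrated with a minimal Python sketch (hypothetical code, not ConVReach itself; a brute-force satisfiability check over finite domains stands in for a real constraint solver):

```python
from itertools import product

def path_feasible(constraints, domains):
    """Decompose a path condition into conjuncts, check each independently,
    then confirm the conjunction for surviving paths.

    constraints: list of (vars, predicate) pairs; domains: var -> iterable.
    Checking conjuncts independently first prunes infeasible paths cheaply
    before the more expensive joint check.
    """
    # Stage 1: each conjunct must be satisfiable on its own variables.
    for vars_, pred in constraints:
        doms = [list(domains[v]) for v in vars_]
        if not any(pred(*vals) for vals in product(*doms)):
            return False  # one unsatisfiable conjunct kills the whole path
    # Stage 2: joint check over all variables, only for surviving paths.
    all_vars = sorted({v for vars_, _ in constraints for v in vars_})
    doms = [list(domains[v]) for v in all_vars]
    for vals in product(*doms):
        env = dict(zip(all_vars, vals))
        if all(pred(*(env[v] for v in vars_)) for vars_, pred in constraints):
            return True
    return False
```

Here a path whose conjuncts are individually satisfiable but jointly contradictory survives stage 1 and is only rejected during aggregation, mirroring the two-phase structure sketched in the abstract.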
Article Search
Article: oopslaa26main-p311-p
Mixed Choice in Asynchronous Multiparty Session Types
Laura Bocchi,
Raymond Hu,
Adriana Laura Voinea, and
Simon Thompson
(University of Kent, UK; Queen Mary University of London, UK; University of Glasgow, UK)
We present a multiparty session type (MST) framework with asynchronous mixed choice (MC). We propose a core construct for MC that allows transient inconsistencies in protocol state between distributed participants, but ensures all participants can always eventually reach a mutually consistent state. We prove the correctness of our system by establishing a progress property and an operational correspondence between global types and distributed local type projections. Based on our theory, we implement a practical toolchain for specifying and validating asynchronous MST protocols featuring MC, and programming compliant gen_statem processes in Erlang/OTP. We test our framework by using our toolchain to specify and reimplement part of the amqp_client of the RabbitMQ broker for Erlang.
Article Search
Artifacts Available
Article: oopslaa26main-p321-p
RandSet: Randomized Corpus Reduction for Fuzzing Seed Scheduling
Yuchong Xie,
Kaikai Zhang,
Yu Liu,
Rundong Yang,
Ping Chen,
Shuai Wang, and
Dongdong She
(Fudan University, China; Hong Kong University of Science and Technology, China)
Seed explosion is a fundamental problem in fuzzing seed scheduling. It occurs when a fuzzer maintains a corpus with a huge number of seeds and fails to choose a promising one. Existing seed scheduling works focus on seed prioritization but still suffer from seed explosion, since the corpus size remains huge. We tackle seed explosion from a new perspective, corpus reduction, i.e., computing a subset of the seed corpus. Corpus reduction can eliminate redundant seeds in the corpus and significantly reduce corpus size. However, it could lead to poor diversity in seed selection and severely impact fuzzing performance. Meanwhile, effective corpus reduction incurs large runtime overhead. In practice, it is challenging to adopt corpus reduction in fuzzing seed scheduling. Prior techniques like cull_queue, afl-cmin, and MinSet all suffer from poor seed diversity. afl-cmin and MinSet incur prohibitive runtime overhead and are hence only applicable to one-time initial seed selection rather than high-frequency seed scheduling.
We propose a novel randomized corpus reduction technique, RandSet, that can reduce the corpus size and yield diverse seed selection simultaneously. Meanwhile, the runtime overhead of RandSet is minimal, suiting a high-frequency seed scheduling task. Our key insight is to introduce randomness into corpus reduction so as to enjoy the two benefits of a randomized algorithm: randomized output (i.e., diverse seed selection) and low runtime overhead. Specifically, we formulate the corpus reduction in seed scheduling as a classic set cover problem and compute a randomized subset of seed corpus as a set cover to cover all features of the entire corpus. We then develop a novel seed scheduling approach using the randomized corpus subset. Our technique can effectively mitigate seed explosion by scheduling a small and randomized subset of the corpus rather than the entire corpus.
We implement RandSet on three popular fuzzers, AFL++, LibAFL, and Centipede, to showcase its general algorithmic design. We perform a comprehensive evaluation of RandSet on three benchmarks: standalone programs, FuzzBench, and Magma. Our evaluation results show that RandSet achieves significantly more diverse seed selection than other corpus reduction techniques. RandSet also yields a high reduction ratio, achieving average subset ratios of 4.03% and 5.99% on standalone programs and FuzzBench programs, respectively. In terms of fuzzing performance gained from our randomized corpus reduction, RandSet achieves a 16.58% gain on standalone programs and up to a 3.57% gain on FuzzBench programs in AFL++. RandSet triggers up to 7 more ground-truth bugs than the state-of-the-art fuzzer on Magma, while introducing only 3.93% overhead on standalone programs and as low as 1.17% overhead on FuzzBench.
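The set-cover formulation can be sketched in a few lines of Python (an illustrative greedy cover with random tie-breaking, not the actual RandSet implementation; the `corpus` representation is assumed):

```python
import random

def randomized_set_cover(corpus, rng=random):
    """Pick a random subset of seeds covering every feature in the corpus.

    corpus: dict mapping seed id -> set of covered features.
    Greedy set cover with random tie-breaking: each invocation can return a
    different valid cover, which is one way randomness yields diverse seed
    selection while still covering all features of the entire corpus.
    """
    universe = set().union(*corpus.values())
    uncovered, cover = set(universe), []
    seeds = list(corpus)
    while uncovered:
        rng.shuffle(seeds)  # randomized order -> random tie-breaking in max()
        best = max(seeds, key=lambda s: len(corpus[s] & uncovered))
        cover.append(best)
        uncovered -= corpus[best]
    return cover
```

Scheduling would then draw seeds from the (small, randomized) cover rather than the full corpus; recomputing the cover periodically re-randomizes the selection.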
Preprint
Article: oopslaa26main-p340-p
LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer
Kunpeng Zhang,
Dongwei Xiao,
Daoyuan Wu,
Shuai Wang,
Jiali Zhao,
Yuanyi Lin,
Tongtong Xu, and
Shaohua Wang
(Hong Kong University of Science and Technology, Hong Kong; Lingnan University, Hong Kong; Huawei, China; Central University of Finance and Economics, China)
Deep learning (DL) libraries are widely used in critical applications, where even subtle silent bugs can lead to serious consequences. While existing DL fuzzing techniques have made progress in detecting crashes, they inherently struggle to detect silent bugs due to the lack of effective test programs and corresponding oracles.
Building on the observation that historical bug reports contain rich, underutilized information about silent bugs, we leverage large language models (LLMs) to perform versatile yet controlled bug transfer for silent bug fuzzing. Specifically, our approach uses LLMs to extract context-aware bug patterns from historical issues, match semantically related Application Programming Interfaces (APIs) using functionality-based embeddings, and synthesize test cases with customized oracles. This enables proactive detection of silent bugs by transferring high-risk contexts and oracle designs from known buggy APIs to functionally similar target APIs. To ensure the reliability of our context-aware bug transfer, we introduce an LLM-powered self-validation module that systematically evaluates the validity of each transferred bug instance. We implement this methodology in a tool named TransFuzz and evaluate it on three mainstream DL libraries: PyTorch, TensorFlow, and MindSpore. TransFuzz successfully discovers 79 previously unknown bugs (12 confirmed as Common Vulnerabilities and Exposures (CVEs)) in 10 bug types, demonstrating its effectiveness and generalizability in migrating DL library bug discovery capabilities.
Article Search
Article: oopslaa26main-p341-p
Effectively Propositional Higher-Order Functional Programming
Nicholas V. Lewchenko,
Kunha Kim,
Bor-Yuh Evan Chang, and
Gowtham Kaki
(University of Colorado Boulder, USA; Amazon, USA)
Decidable automation is a key feature of program verification tools, which makes them easier for non-expert developers to use and understand. Unfortunately, decidable fragments of logic are very restrictive, and not ideal for the expression of idiomatic programs. The decidable Extended EPR fragment, combined with an encoding technique called relational abstraction, has seen wide use for its ability to handle both quantifiers and uninterpreted functions. However, the complexity of the relational abstraction encoding, combined with its inherent incompleteness, still poses a significant obstacle to non-experts.
In this paper, we show that Extended EPR and relational abstraction can be deployed to achieve decidable, quantified verification within a familiar, principled domain: a higher-order, purely-functional programming language. We observe that, by defining the semantics of our language in terms of partial functions, we obtain a well-behaved, three-valued logic that matches the behavior of the relational abstraction encoding while hiding its complexity. We demonstrate that our prototype implementation can ergonomically replicate EPR-based program verification benchmarks and verify recursive programs on inductive datatypes.
Article Search
Artifacts Available
Article: oopslaa26main-p348-p
Mechanised Semantics of Multi-stage Programming
Ka Wing Li,
Maite Kramarz,
Ningning Xie, and
Jeremy Yallop
(University of Cambridge, UK; University of Toronto, Canada)
Multi-stage programming (MSP) languages such as MetaML have subtle semantics, in which familiar properties often fail to hold and hazardous interactions with other language features such as state or polymorphism abound. The ongoing incorporation of MSP features into general purpose languages makes the need to establish confidence in their design increasingly pressing.
Taking inspiration from existing MSP systems, we present a Rocq mechanisation of a core calculus for compile-time and run-time MSP with effects, λrun, formally establishing key properties such as type and elaboration soundness and phase distinction. We hope that our mechanised semantics will be a useful basis for formal study of other designs, easing the extension of existing languages with support for MSP.
Article Search
Article: oopslaa26main-p354-p
Differential Execution with Lexical Tracing
Sebastian Erdweg,
Runqing Xu, and
Mo Bitar
(KIT, Germany)
Incremental computing promises large speed-ups after small input edits. Yet, most incrementality approaches merely skip unchanged work and recompute the remaining sub-computations, even when the inputs change only slightly. Differential execution avoids this by propagating data changes (i.e., deltas), and prior work has shown how to develop a provably correct differential big-step semantics. Unfortunately, that semantics must still replay the original computation at every step, squandering much of the potential gain of incrementalization. While the semantics clearly needs caching to avoid recomputations, a sound and efficient caching discipline is challenging. First, each execution step must be uniquely identified; second, the identifier must remain stable even when the preceding control flow changes. To this end, we develop lexical tracing, which identifies execution steps through their path in the derivation tree of the big-step semantics. We then extend differential execution with lexical tracing and caching to deliver, for the first time, a formally verified, asymptotically efficient account of differential execution for imperative languages. In particular, we developed a novel mechanized theory of cache stability for lexical traces and their semantic rules, which was essential in proving the differential caching semantics correct and complete in Rocq.
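The lexical-trace idea can be sketched for a toy expression evaluator (hypothetical Python, not the paper's mechanized semantics): each evaluation step is cached under its path in the derivation tree, so cache keys stay stable even when earlier control flow changes.

```python
def eval_traced(expr, env, cache, path=()):
    """Evaluate `expr` while keying every step by its lexical path.

    The key is the position of the step in the derivation tree (the path of
    premise indices from the root), not a global step counter, so it remains
    stable when the preceding control flow changes.
    expr: nested tuples ("lit", n) | ("var", name) | ("add", e1, e2).
    """
    key = (path, expr, tuple(sorted(env.items())))
    if key in cache:
        return cache[key]
    op = expr[0]
    if op == "lit":
        val = expr[1]
    elif op == "var":
        val = env[expr[1]]
    else:  # ("add", e1, e2): extend the lexical path per premise
        val = eval_traced(expr[1], env, cache, path + (0,)) + \
              eval_traced(expr[2], env, cache, path + (1,))
    cache[key] = val
    return val
```

A second run over an unchanged subtree hits the cache instead of replaying the computation, which is the replay cost the paper's caching discipline is designed to avoid.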
Article Search
Article: oopslaa26main-p381-p
Prunario: Testing Autonomous Driving Systems by Pruning Likely Redundant Scenarios
Minsu Kim,
Sunbeom So, and
Hakjoo Oh
(Korea University, Republic of Korea)
We present Prunario, a novel technique for effectively testing autonomous driving systems (ADS). Ensuring the safety of ADS is critical, as their failures can lead to severe casualties. While ADS testing methods have advanced in recent years, they remain unsatisfactory in generating diverse test scenarios that induce distinct driving behaviors, a key requirement for thoroughly evaluating ADS across different situations. To address this, Prunario employs a novel simulation prediction technique to estimate ADS runtime behavior and prune redundant test scenarios that yield similar driving records. Experimental results demonstrate Prunario's effectiveness: it uncovered 23 previously undetected bugs in an industrial-strength ADS and outperformed three state-of-the-art testing techniques.
Article Search
Artifacts Available
Article: oopslaa26main-p383-p
Fail Faster: Staging and Fast Randomness for High-Performance PBT
Cynthia Richey,
Joseph W. Cutler,
Harrison Goldstein, and
Benjamin C. Pierce
(University of Pennsylvania, USA; University at Buffalo, USA)
Property-based testing (PBT) relies on generators for random test cases, often constructed using embedded domain specific languages, which provide expressive combinators for building and composing generators. The effectiveness of PBT depends critically on the speed of these generators. However, careful measurements show that the generator performance of widely used PBT libraries falls well short of what is possible, due principally to (1) the abstraction overhead of their combinator-heavy style and (2) suboptimal sources of randomness. We characterize, quantify, and address these bottlenecks.
To eliminate abstraction overheads, we propose a technique based on multi-stage programming, dubbed Allegro. We apply this technique to leading generator libraries in OCaml and Scala 3, significantly improving performance. To quantify the performance impact of the randomness source, we carry out a controlled experiment, replacing the randomness in the OCaml PBT library with an optimized version. Both interventions exactly preserve the semantics of generators, enabling precise, pointwise comparisons. Together, these improvements find bugs up to 13× faster.
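The staging idea can be illustrated with a toy generator language in Python (hypothetical, not the Allegro implementation): interpreting a combinator tree pays dispatch overhead on every draw, while staging partially evaluates the tree once into a flat closure with the same sampling semantics.

```python
import random

def interp(gen, rng):
    """Interpret a generator combinator tree on every draw (slow path)."""
    tag = gen[0]
    if tag == "int":                     # ("int", lo, hi)
        return rng.randint(gen[1], gen[2])
    if tag == "pair":                    # ("pair", g1, g2)
        return (interp(gen[1], rng), interp(gen[2], rng))
    if tag == "map":                     # ("map", fn, g)
        return gen[1](interp(gen[2], rng))

def stage(gen):
    """Partially evaluate the tree once into a closure with no per-draw
    dispatch: the staging idea behind removing combinator overhead."""
    tag = gen[0]
    if tag == "int":
        lo, hi = gen[1], gen[2]
        return lambda rng: rng.randint(lo, hi)
    if tag == "pair":
        f, g = stage(gen[1]), stage(gen[2])
        return lambda rng: (f(rng), g(rng))
    if tag == "map":
        fn, g = gen[1], stage(gen[2])
        return lambda rng: fn(g(rng))
```

Because the staged closure draws randomness in the same order as the interpreter, the two are pointwise equal for the same seed, which is the "exactly preserve the semantics" property the evaluation relies on.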
Article Search
Artifacts Available
Article: oopslaa26main-p387-p
DeCo: A Core Calculus for Incremental Functional Programming with Generic Data Types
Timon Böhler,
Tobias Reinhard,
David Richter, and
Mira Mezini
(TU Darmstadt, Germany; hessian.AI, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany)
Incrementalization speeds up computations by avoiding unnecessary recomputations and by efficiently reusing previous results.
While domain-specific techniques achieve impressive speedups, e.g., in the context of database queries, they are difficult to generalize.
Meanwhile, general approaches offer little support for incrementalizing domain-specific operations.
In this work, we present DeCo, a novel core calculus for incremental functional programming with support for a wide range of user-defined data types.
Despite its generic nature, our approach statically incrementalizes domain-specific operations on user-defined data types.
It is, hence, more fine-grained than other generic techniques which resort to treating domain-specific operations as black boxes.
We mechanized our work in Lean and proved it sound, meaning incrementalized execution computes the same result as full reevaluation.
We also provide an executable implementation with case studies featuring examples from linear algebra, relational algebra, dictionaries, trees, and conflict-free replicated data types,
plus a brief performance evaluation on linear and relational algebra and on trees.
Article Search
Artifacts Available
Article: oopslaa26main-p391-p
Detecting Flaky Tests by Controlling Nondeterministic API Behavior
Hengchen Yuan,
Jiefang Lin, and
August Shi
(University of Texas at Austin, USA)
Regression testing is an essential part of software development to ensure high-quality software; but the presence of flaky tests makes the testing outcomes unreliable. It is essential to proactively detect flaky tests, so developers are aware of them early on and can react appropriately in case they fail. While there has been prior work in detecting flaky tests, they either are developed to focus on specific flaky test types or can be imprecise in their analyses, such as by relying on AI models for prediction.
We present ChaosAPI, a framework that supports detecting a variety of different types of flaky tests. Our insight is that the most effective approach to detecting flaky tests is to target the nondeterministic components that lead tests to exhibit flaky behavior in the first place. In particular, we target specific APIs within the Java Standard Library that all Java code relies on and that are known to exhibit nondeterministic behavior, such as those related to system time, concurrency, and environmental factors. During test execution, ChaosAPI modifies the behavior of these API calls, perturbing their inputs and return values in a systematic manner while still remaining compliant with the API specification. We can detect flaky tests by observing whether tests that previously passed now fail when run through ChaosAPI.
Our evaluation on a prior dataset of known flaky tests, as well as running on test suites of other popular open-source projects, demonstrates that ChaosAPI not only detects more flaky tests than simple rerunning across a wide range of projects but also detects them more efficiently, making ChaosAPI a practical addition to the toolbox of flaky test detection techniques.
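The perturbation idea can be sketched in Python (a hypothetical analogue of the Java-level interception; all names here are illustrative): a replacement clock that stays spec-compliant, i.e., monotonically nondecreasing, while injecting bounded jitter exposes tests that implicitly assume adjacent clock reads are close together.

```python
import itertools
import time
from unittest import mock

def perturbed_time(jitter_seq, base=time.time):
    """Build a time.time replacement that adds jitter drawn from jitter_seq
    while staying monotonically nondecreasing (spec-compliant)."""
    jitters = itertools.cycle(jitter_seq)
    last = [float("-inf")]
    def fake_time():
        t = base() + next(jitters)
        last[0] = max(last[0], t)  # never go backwards
        return last[0]
    return fake_time

def timing_sensitive_test():
    # A test that implicitly assumes two adjacent clock reads are close:
    start = time.time()
    assert time.time() - start < 1.0, "flaky under clock perturbation"

def detect_flaky(test_fn, schedules):
    """Rerun the test under different jitter schedules; divergent verdicts
    across schedules reveal flakiness."""
    verdicts = []
    for sched in schedules:
        with mock.patch("time.time", perturbed_time(sched)):
            try:
                test_fn()
                verdicts.append("pass")
            except AssertionError:
                verdicts.append("fail")
    return verdicts
```

Under a zero-jitter schedule the test passes; under a schedule that delays the second read by five seconds it fails, so the divergence flags the test as flaky without any rerunning on real nondeterminism.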
Article Search
Article: oopslaa26main-p396-p
From Raw Pointers to Memory Safety: A Modular Demand-Driven Typestate Analysis for Rust
Wei Li,
Wenyao Chen, and
Jingling Xue
(UNSW Sydney, Australia)
Rust combines high performance with strong memory safety through strict ownership and borrowing rules. However, its unsafe mode reintroduces vulnerabilities by allowing raw-pointer manipulation, a major source of memory-safety bugs.
Existing whole-program analyses for Rust often suffer from low recall and high false positives. Since unsafe code is typically small and isolated, we propose a demand-driven alternative. We present Pincer, a flow-, field-, and context-sensitive dataflow analysis framework built on IFDS. Pincer performs mutually bidirectional analysis—backward to trace raw-pointer origins and forward to explore aliases—adapting this strategy to Rust’s ownership model and low-level semantics.
On this foundation, Pincer performs a modular, bottom-up vulnerability-oriented typestate analysis to detect use-after-free and double-free bugs. It tracks raw-pointer aliasing and nullness, exploits strong updates at container-manipulating returns, and leverages Rust’s safety invariants to prune provably safe regions via AXM checking. The modular design enables controlled exploration, optionally under a budget, improving scalability. Controlled unsoundness further boosts efficiency while maintaining high recall and precision.
We evaluate Pincer on vulnerable programs and large Rust projects. The results show that Pincer detects memory-safety errors more accurately than state-of-the-art analyses while maintaining practical efficiency.
Article Search
Article: oopslaa26main-p409-p
Speak Now: Safe Actor Programming with Multiparty Session Types
Simon Fowler and
Raymond Hu
(University of Glasgow, UK; Queen Mary University of London, UK)
Actor languages such as Erlang and Elixir are widely used for implementing scalable and reliable distributed applications, but the informally-specified nature of actor communication patterns leaves systems vulnerable to costly errors such as communication mismatches and deadlocks. Multiparty session types (MPSTs) rule out communication errors early in the development process, but until now, the many-sender, single-receiver nature of actor communication has made it difficult for actor languages to benefit from session types.
This paper introduces Maty, the first actor language design supporting both static multiparty session typing and the full power of actors taking part in multiple sessions. Maty therefore combines the error prevention mechanism of session types with the scalability and fault tolerance of actor languages. Our main insight is to enforce session typing through a flow-sensitive effect system, combined with an event-driven programming style and first-class message handlers. Using MPSTs allows us to guarantee communication safety: a process will never send or receive an unexpected message, nor will a session get stuck because an actor is waiting for a message that will never be sent. We extend Maty to support Erlang-style supervision and cascading failure, and show that this preserves Maty’s strong metatheory. We implement Maty in Scala using an API generation approach, and demonstrate the expressiveness of our model by implementing a representative sample of the widely-used Savina actor benchmark suite; an industry-supplied factory scenario; and a chat server.
Preprint
Artifacts Available
Article: oopslaa26main-p508-p
A Tale of 1001 LoC: Potential Runtime Error-Guided Specification Synthesis for Verifying Large-Scale Programs
Zhongyi Wang,
Tengjie Lin,
Mingshuai Chen,
Haokun Li,
Mingqi Yang,
Xiao Yi,
Shengchao Qin,
Yixing Luo,
Xiaofeng Li,
Bin Gu,
Liqiang Lu, and
Jianwei Yin
(Zhejiang University, China; Peking University, China; Chinese University of Hong Kong, China; Xidian University, China; Beijing Institute of Control Engineering, China)
Fully automated verification of large-scale software and hardware systems is arguably the holy grail of formal methods. Large language models (LLMs) have recently demonstrated their potential for enhancing the degree of automation in formal verification by, e.g., generating the formal specifications essential to deductive verification, yet they exhibit poor scalability due to long-context reasoning limitations and, more importantly, the difficulty of inferring complex, interprocedural specifications. This paper presents Preguss, a modular, fine-grained framework for automating the generation and refinement of formal specifications. Preguss synergizes static analysis and deductive verification by steering two components in a divide-and-conquer fashion: (i) potential runtime error-guided construction and prioritization of verification units, and (ii) LLM-aided synthesis of interprocedural specifications at the unit level. We show that Preguss substantially outperforms state-of-the-art LLM-based approaches and, in particular, enables highly automated RTE-freeness verification for real-world programs with over a thousand LoC, reducing human verification effort by 80.6% to 88.9%.
Preprint
Artifacts Available
Article: oopslaa26main-p510-p
Frashokereti: Non-aborting Optimistically Replicated Objects
Eric Man Chan,
Javad Saberlatibari, and
Mohsen Lesani
(University of California at Riverside, USA; University of California at Santa Cruz, USA)
Optimistic replication of objects avoids coordination and brings higher responsiveness and availability. However, when clients issue concurrent operations, conflicts naturally arise which can lead the replicated states to diverge or lose integrity. When conflicts occur, existing approaches resort to pessimism or to aborting. This paper characterizes ORDTs (Optimistically Replicated Data Types), objects that can be optimistically replicated with convergence and integrity, and without aborting calls. It shows that ORDTs subsume CRDTs and transformed relational schemas, and presents techniques to convert objects to ORDTs. It further proves that optimistic replication for objects that fall outside ORDTs is aborting and NP-complete. Further, it presents an optimistic replication protocol for ORDTs called Frashokereti. It uses a statically decided order to efficiently order calls. The paper proves that Frashokereti is sound for every ORDT, i.e., Frashokereti is optimistic and non-aborting, and preserves convergence, integrity, and liveness properties. Experimental results show that Frashokereti significantly outperforms previous optimistic protocols.
Article Search
Article: oopslaa26main-p517-p
Context-Free Language Reachability via Efficient Relation Chaining
Chenghang Shi,
Haofeng Li,
Jie Lu, and
Lian Li
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China)
Context-free language (CFL) reachability is a fundamental framework widely used to model a variety of program analysis tasks, though it often suffers from inherent inefficiency due to its (sub)cubic time complexity.
In this paper, we propose a novel perspective, relation chaining, which interprets CFL-reachability solving as the process of chaining labeled edges representing binary relations.
This formulation exposes substantial derivation redundancy (in terms of frequent and repetitive chaining operations) arising from inefficient chaining strategies employed by existing approaches.
To address this, we introduce SQUID, a new algorithm that incorporates two simple yet effective chaining techniques—adaptive chaining and differential chaining—built upon an enhanced graph representation.
We have implemented SQUID as a standalone tool and evaluated it against two state-of-the-art CFL-reachability solvers and a leading Datalog solver across three key program analyses: field-sensitive alias analysis and context-sensitive value-flow analysis for C/C++, and field-sensitive points-to analysis for Java.
Experimental results show that SQUID substantially improves the scalability of CFL-reachability solving by effectively reducing a large portion of redundant chaining operations.
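The relation-chaining view of CFL-reachability can be sketched as a worklist algorithm in Python (illustrative only, not SQUID itself; the grammar encoding is an assumption): each label denotes a binary relation over nodes, and a production A → B C is solved by chaining the B-relation with the C-relation.

```python
from collections import defaultdict

def cfl_reach(edges, productions):
    """Solve CFL-reachability by chaining labeled edges as binary relations.

    edges: set of (u, label, v); productions:
      (head, (body,))        # unary:  A -> B
      (head, (left, right))  # binary: A -> B C, chaining B then C
    A derived edge (u, A, v) means u reaches v along a path spelling a word
    in L(A). The worklist processes only newly derived edges, the repeated
    re-chaining of which is what efficient chaining strategies try to avoid.
    """
    out = defaultdict(set)  # label -> set of (u, v)
    for u, l, v in edges:
        out[l].add((u, v))
    worklist = list(edges)
    def add(u, l, v):
        if (u, v) not in out[l]:
            out[l].add((u, v))
            worklist.append((u, l, v))
    while worklist:
        u, l, v = worklist.pop()
        for head, body in productions:
            if body == (l,):
                add(u, head, v)
            elif len(body) == 2:
                b, c = body
                if l == b:  # chain (u,B,v) with some (v,C,w)
                    for x, w in list(out[c]):
                        if x == v:
                            add(u, head, w)
                if l == c:  # chain some (w,B,u) with (u,C,v)
                    for w, x in list(out[b]):
                        if x == u:
                            add(w, head, v)
    return out
```

For example, the grammar P → e | P P computes transitive closure over e-edges; each derived P-edge is chained exactly once against the current relations, and the redundancy SQUID targets shows up as repeated chaining of the same pairs in naive strategies.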
Article Search
Artifacts Available
Article: oopslaa26main-p533-p
Process-Centric Analysis of Agentic Software Systems
Shuyang Liu,
Yang Chen,
Rahul Krishna,
Saurabh Sinha,
Jatin Ganhotra, and
Reyhaneh Jabbarvand
(University of Illinois at Urbana-Champaign, USA; IBM Research, USA)
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution, i.e., trajectories, is inherently stochastic and adaptive to the problems they are solving. Evaluation of such systems is often outcome-centric, i.e., judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows.
Using Graphectory, we automatically analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses (completed within four minutes) reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents’ problem-solving strategies vary with both problem difficulty and the underlying LLM—for resolved issues, the strategies often follow coherent localization–patching–validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; and (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories.
We also implement a novel technique for real-time construction and analysis of Graphectory and Langutory during the agent’s execution to flag trajectory issues. Upon detecting such issues in the trajectory, the proposed technique notifies the agent with a diagnostic message and, when applicable, rolls back the trajectory. The experimental results show that online monitoring and process-centric analysis, when accompanied by appropriate interventions, can improve resolution rates by 6.9%-23.5% across models for problematic instances, while significantly shortening trajectories with near-zero overhead.
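The abstract does not spell out Graphectory's schema; as a hedged sketch of the general idea it names (encoding a trajectory's steps as graph nodes connected by temporal order and by semantic kind), one might write something like the following, where the trajectory, action names, and grouping are illustrative assumptions rather than the paper's actual representation:

```python
from collections import defaultdict

# Hypothetical agent trajectory: (step index, action kind) pairs.
# The action names are illustrative, not the paper's actual vocabulary.
trajectory = [
    (0, "localize"), (1, "read_file"), (2, "edit"),
    (3, "run_tests"), (4, "edit"), (5, "run_tests"), (6, "submit"),
]

def build_graph(traj):
    """Encode a trajectory as temporal edges plus semantic groupings."""
    # Temporal edges: each step points to its successor in execution order.
    temporal = [(a[0], b[0]) for a, b in zip(traj, traj[1:])]
    # Semantic relation: group step indices by the kind of action taken.
    semantic = defaultdict(list)
    for step, kind in traj:
        semantic[kind].append(step)
    return temporal, dict(semantic)

temporal, semantic = build_graph(trajectory)
# Repeated edit -> run_tests steps surface as structure in the graph,
# which a process-centric metric could flag as a retry or backtracking loop.
assert semantic["edit"] == [2, 4]
```

On such a structure, process-centric metrics (path length, repetition, branching) become ordinary graph queries rather than manual trajectory inspection.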
Article Search
Article: oopslaa26main-p534-p
Reframing Paths as Logic: Semantic Segmentation for Vulnerability Detection
Zong Cao,
Yuqiang Sun,
Zhengzi Xu,
Kaixuan Li,
Yeqi Fu,
Yiran Zhang,
Ziqiao Kong, and
Yang Liu
(Imperial Global Singapore, Singapore; Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Path-sensitive vulnerabilities, such as use-after-free, integer overflows, and command injection, pose significant challenges for traditional static analysis tools, which often face trade-offs between precision, scalability, and interpretability. To address these challenges, we present SEVDF (Semantic-Enhanced Vulnerability Detection Framework), a novel methodology that integrates may-analysis taint propagation with large language models (LLMs) to detect path-related vulnerabilities in large C/C++ codebases. SEVDF begins by constructing a program dependency graph and performing a sound but incomplete taint analysis to extract all potentially vulnerable paths. After segmentation, deduplication, feasibility checking, and semantic summarization by LLMs, the vulnerable paths are reassembled and confirmed by LLMs for inter-procedural feasibility and semantic consistency.
We evaluate SEVDF on the Juliet Test Suite (thirteen representative CWE categories) and a curated real-world dataset of 71 vulnerabilities across 9 projects. SEVDF consistently outperforms the default CodeQL rules, CodeQL rules with all unnecessary constraints removed, and three open-source detectors: Infer, Cppcheck, and CodeChecker. SEVDF achieves 100% precision on several CWEs while maintaining or improving recall on the Juliet benchmark. Moreover, our segment-based design reduces the analysis workload for LLMs by 90.6% compared to direct-path prompting through Logic Unit deduplication, making SEVDF cost-effective for large-scale deployment. Finally, SEVDF uncovered and reported 29 zero-day vulnerabilities (12 confirmed to date), including 3 CVEs in VirtualBox, demonstrating practical value.
Article Search
Article: oopslaa26main-p549-p
IRIDIUM: A Framework for Statically Optimizing JavaScript Programs
Meetesh Kalpesh Mehta,
Anirudh Garg,
Aneeket Yadav, and
Manas Thakur
(IIT Bombay, India; IIT Delhi, India)
Static analysis of JavaScript remains notoriously difficult due to the language’s dynamically typed nature, unconventional scoping rules, and pervasive side effects. Unlike mature infrastructures such as LLVM for C/C++ or Soot for Java, comparable frameworks for JavaScript are fragmented and limited in scope. In this paper, we introduce IRIDIUM, a first-of-its-kind framework to statically optimize JavaScript programs. IRIDIUM systematically lowers JavaScript into a structured intermediate representation (called IRI) that models bindings, environments, and control flow explicitly. The resultant expressiveness enables more predictable analyses and transformations, ranging from dataflow tracking to optimization passes to executable code generation for existing runtimes, that are otherwise hindered by the language’s complexity. By bridging the gap between JavaScript’s surface syntax and the requirements of static analysis, IRIDIUM, thus, lays the foundation for a new generation of tools that can reason effectively about modern JavaScript applications.
Preprint
Artifacts Available
Article: oopslaa26main-p554-p
Block Tests
Kevin Guan,
Pengyue Jiang,
Milos Gligoric, and
Owolabi Legunsen
(Cornell University, USA; University of Texas at Austin, USA)
Inline tests validate single program statements and were shown to find single-statement bugs or kill mutants that unit tests miss. Inline tests complement unit tests by enabling testing at a finer granularity than methods. So, inline tests can more easily find faults in target statements that unit tests do not reach, or where errors do not propagate to unit tests’ oracles. But, the limitation to single statements means inline tests cannot validate data or control flow across code fragments—sequences of multiple statements in a method.
We motivate the need for testing arbitrary fragments and propose block tests, which generalize inline tests and validate code fragments. To motivate, we discuss six software testing needs (e.g., due to increasing usage of lambdas in imperative code) for which unit tests are too coarse grained and inline tests are too fine grained. To bridge this gap, we propose syntax and semantics for specifying inputs, expected outputs, and scope of block tests. We also implement a block-test development kit (BDK) for writing and running block tests in Java.
We evaluate block tests and BDK in two ways. First, we write 1,012 block tests for 346 fragments in 146 open-source projects. Developer-written unit tests do not cover 58.7% of these fragments, and automated unit-test generation does not reach 46% of them even after 30.8 CPU days. But, each block test takes us 2.2 minutes to write and 0.9 seconds to run on average. Second, we use mutation testing to evaluate the fault-finding effectiveness of block tests in fragments that unit tests cover. Block tests kill 4,418 of 9,554 mutants that survived unit tests. These results provide initial but strong evidence of block tests’ feasibility and utility. We outline an agenda for future research on block testing.
Article Search
Article: oopslaa26main-p561-p
A Minimalist Proof Language for Neural Theorem Proving over Isabelle/HOL
Qiyuan Xu,
Renxi Wang,
Peixin Wang,
Haonan Li, and
Conrad Watt
(Nanyang Technological University, Singapore; MBZUAI, United Arab Emirates; East China Normal University, China)
Neural Theorem Proving (NTP) employs Large Language Models (LLMs) to automate formal proofs in proof assistants.
While LLMs have achieved remarkable success in informal reasoning tasks using natural languages, the transition to mechanized formal theorem proving presents persistent challenges.
Mechanized proof languages often contain many syntactic constructs and diverse, specialized proof tactics, which facilitate expert use but have no direct counterpart in informal mathematical proofs.
These prover-specific idioms represent an additional burden for LLM-based NTPs that might be otherwise successful in generating informal proofs.
Seeking to bridge this gap between formal proof construction and informal reasoning, in order to better facilitate NTP, this work approaches these challenges from a language design perspective.
We look at common reasoning patterns in informal proofs and in existing mechanized proofs, and design Minilang (formally named Isabelle/Minilang), a minimalist proof language that captures these reasoning patterns.
In contrast to proof languages (informal and formal) that often feature a large collection of operations with unclear semantic boundaries, Minilang is deliberately kept minimalist --- its core design comprises only 10 proof operations, each with clear semantic distinctions.
We further develop a rule-based translator from Isabelle's proof language (Isar) to Minilang, translating ~340,000 existing Isabelle proofs with an ~85% success rate.
Using this translated corpus, we finetune two LLMs to compare machine learning performance on Minilang versus the original Isar language.
Experiments show Minilang benefits the two LLMs, improving the pass@1 success rate on the PISA benchmark by up to 20 and 29 percentage points compared to Isar-based LLMs with and without Sledgehammer, respectively.
The pass@1 rate reaches 69.1%, exceeding the prior work Baldur's pass@64 (65.7%); the pass@8 rate reaches 79.2%, exceeding the state-of-the-art on PISA (71.0%) achieved by Magnushammer.
Preprint
Info
Artifacts Available
Article: oopslaa26main-p581-p