Proceedings of the ACM on Programming Languages, Volume 10, Number OOPSLA1
Article: oopslaa26foreword-fm003-p
Learning Symmetric Invariants from Symmetric Samples
Zhijie Xu and
Fei He
(Tsinghua University, China; Key Laboratory for Information System Security, China)
Invariant synthesis is a fundamental problem in program verification, yet existing learning-based approaches rarely exploit the inherent symmetry present in many programs, particularly parameterized and concurrent systems.
Such symmetry induces a symmetric reachable state space, naturally yielding symmetric samples and admitting symmetric invariants, motivating the task of learning symmetric invariants from symmetric samples.
To this end, we introduce symmetric decision trees (SDTs), a novel hypothesis class that enforces symmetry structurally, guaranteeing symmetric invariants by construction.
Furthermore, we develop a learning algorithm to construct SDTs and integrate it as the learner within the Horn-ICE framework, yielding our approach, Horn-SDT.
Empirical evaluation on parameterized programs demonstrates that Horn-SDT achieves faster convergence and constructs more compact trees compared to non-symmetric baselines.
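The symmetry requirement can be illustrated with a small sketch (ours, not the paper's SDT learner): a candidate invariant over the local states of n interchangeable processes is symmetric exactly when it gives the same verdict under every permutation of process indices.

```python
from itertools import permutations

def is_symmetric(inv, states, n):
    """Check that invariant `inv` gives the same verdict on every
    index-permuted view of each sampled state (a tuple of n local states)."""
    for s in states:
        verdict = inv(s)
        for perm in permutations(range(n)):
            if inv(tuple(s[i] for i in perm)) != verdict:
                return False
    return True

# Symmetric candidate: at most one process is in the critical section.
mutex = lambda s: s.count("crit") <= 1
# Asymmetric candidate: process 0 specifically never enters it.
asym = lambda s: s[0] != "crit"

samples = [("idle", "crit", "idle"), ("crit", "idle", "idle"), ("idle",) * 3]
```

An SDT enforces this property by construction rather than checking it after the fact, as the brute-force permutation test above does.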
Article: oopslaa26main-p9-p
Decompiling for Constant-Time Analysis
Santiago Arranz-Olmos,
Gilles Barthe,
Lionel Blatter,
Youcef Bouzid,
Sören van der Wall, and
Zhiyuan Zhang
(MPI-SP, Germany; IMDEA Software Institute, Spain; ENS Paris-Saclay, France; TU Braunschweig, Germany)
The constant-time programming discipline is commonly used to protect cryptographic libraries against side-channel attacks. However, it is hard to write constant-time code; moreover, compilers can introduce constant-time violations. Therefore, it is important to ensure that assembly code is constant-time. One approach is to show that source programs are constant-time, and that constant-timeness is preserved by compilation. In this paper, we explore the methodological soundness and scalability of the Decompile-then-Analyze approach, a less conventional alternative that has been suggested in the broader setting of static analysis. Informally, the Decompile-then-Analyze approach uses decompilers as a front-end for static analysis tools. As a motivation for our study, we show that current decompilers eliminate CT vulnerabilities before CT analysis, leading to non-CT programs being accepted as constant-time. Independently, we provide constructed examples of non-CT, exploitable programs that are accepted by two popular CT analysis tools; in both cases the culprits are program transformations that are used internally prior to CT analysis and eliminate CT violations. While our examples do not invalidate the general approach of these tools, they emphasize the need for studying the Decompile-then-Analyze approach.
On the methodological side, we define the notion of CT transparency. Informally, a program transformation is CT transparent if it neither eliminates nor introduces CT violations. We also provide general methods for proving that a transformation is CT transparent, and show that several transformations of interest are transparent. We also sketch an extension of CT transparency to speculative constant-time, which is used by cryptographic software as a protection against Spectre attacks.
On the practical side, we build a CT-transparent version of the popular LLVM-based decompiler RetDec, and combine it with CT-LLVM, an existing CT verification tool for LLVM. We evaluate the resulting tool, called CT-RetDec, on a benchmark set of real-world vulnerabilities in binaries, and show that the modifications have a significant impact on how well CT-RetDec performs.
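As a toy illustration of the property at stake (our sketch, not the paper's tooling; Python gives no timing guarantees, so this only shows the code pattern): a secret-dependent branch is a CT violation, while the branch-free masked select is the shape a compiler or decompiler may rewrite it into, silently changing a program's CT status.

```python
def select_leaky(secret_bit, a, b):
    # Constant-time violation: control flow depends on the secret.
    if secret_bit:
        return a
    return b

def select_ct(secret_bit, a, b):
    # Branch-free 32-bit select: the secret bit becomes an all-ones or
    # all-zeros mask, so the same instructions run for either value.
    mask = -secret_bit & 0xFFFFFFFF
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)
```

A transformation turning `select_leaky` into `select_ct` (or back) is exactly the kind of rewrite that CT transparency forbids.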
Artifacts Available
Article: oopslaa26main-p13-p
Taming the Hydra: Targeted Control-Flow Transformations for Dynamic Symbolic Execution
Charitha Saumya,
Muhammad Hassan,
Rohan Gangaraju,
Milind Kulkarni, and
Kirshanthan Sundararajah
(Intel, USA; Virginia Tech, USA; University of Texas at Austin, USA; Purdue University, USA)
Dynamic Symbolic Execution (DSE) suffers from the path explosion problem when the target program has many conditional branches. The classical approach for managing the path explosion problem is dynamic state merging. Dynamic state merging combines similar symbolic program states to avoid the exponential growth in the number of states during DSE. However, state merging still requires solver invocations at each program branch, even when both paths of the branch are feasible. Moreover, the best path search strategy for DSE may not create the best state merging opportunities. Some drawbacks of state merging can be mitigated by compile-time state merging (i.e., branch elimination by converting control flow into data flow). In this paper, we propose a non-semantics-preserving but failure-preserving compiler transformation that removes expensive symbolic branches in a program to improve the scalability of DSE. We have developed a framework for detecting spurious bugs that our transformation can introduce. Finally, we show that our transformation can significantly improve the performance of DSE on various benchmark programs and improve coverage and bug discovery on large real-world programs.
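The compile-time state-merging idea can be sketched as simple if-conversion (a toy of ours, not the paper's transformation): the branch becomes an arithmetic select, so DSE no longer forks a state there.

```python
def branchy(x):
    # DSE forks a symbolic state at this conditional.
    if x > 10:
        return x * 2
    return x + 1

def merged(x):
    # If-converted: the condition becomes data, both arms are computed,
    # and a single merged state flows onward.
    c = int(x > 10)
    return c * (x * 2) + (1 - c) * (x + 1)
```

The rewrite is not semantics-preserving in general (both arms now execute, so an arm that would fault can introduce a spurious failure), which is why the paper pairs it with spurious-bug detection.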
Article: oopslaa26main-p20-p
Floating-Point Usage on GitHub: A Large-Scale Study of Statically Typed Languages
Andrea Gilot,
Tobias Wrigstad, and
Eva Darulova
(Uppsala University, Sweden)
Reasoning about floating-point arithmetic is notoriously hard. While static and dynamic analysis techniques or program repair have made significant progress, more work is still needed to make them relevant to real-world code. On the critical path to that goal is understanding what real-world floating-point code looks like.
To close that knowledge gap, this paper presents the first large-scale empirical study of floating-point arithmetic usage across public GitHub repositories. We focus on statically typed languages to allow our study to scale to millions of repositories. We follow state-of-the-art mining practices, including random sampling and filtering based only on intrinsic properties to avoid bias; we identify floating-point usage by searching for keywords in the source code, and programming-language constructs (e.g., loops) by parsing the code. Our evaluation supports the claim often made in papers that floating-point arithmetic is widely used. Comparing statistics such as size and usage of certain constructs and functions, we find that benchmarks used in the literature to evaluate automated reasoning techniques for floating-point arithmetic are representative of 'real-world' code in certain aspects, but not in all.
We publish a dataset of 10 million real-world floating-point functions extracted from our study. We demonstrate in a case study how it may be used to identify new floating-point benchmarks and help future techniques for floating-point arithmetic to be designed and evaluated to match actual users' expectations.
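A minimal sketch of the kind of keyword-based detection described (hypothetical patterns of ours, not the study's actual pipeline, which also parses the code):

```python
import re

FP_DECL = re.compile(r'\b(?:float|double)\b')  # floating-point usage by keyword
LOOP = re.compile(r'\b(?:for|while)\b')        # loop constructs (the study
                                               # parses; a regex approximates)

def classify(source: str) -> dict:
    """Flag floating-point usage and loop constructs in a source snippet."""
    return {"uses_fp": bool(FP_DECL.search(source)),
            "has_loop": bool(LOOP.search(source))}
```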
Preprint
Artifacts Available
Article: oopslaa26main-p22-p
OBsmith: LLM-Powered JavaScript Obfuscator Testing
Shan Jiang,
Chenguang Zhu, and
Sarfraz Khurshid
(University of Texas at Austin, USA)
JavaScript obfuscators are widely deployed to protect intellectual property and resist reverse engineering, yet their correctness has been largely overlooked compared to performance and resilience. Existing evaluations typically measure resistance to deobfuscation, leaving the critical question of whether obfuscators preserve program semantics unanswered. Incorrect transformations can silently alter functionality, compromise reliability, and erode security—undermining the very purpose of obfuscation. To address this gap, we present OBsmith, a novel framework to systematically test JavaScript obfuscators using large language models (LLMs). OBsmith leverages LLMs to generate program sketches—abstract templates capturing diverse language constructs, idioms, and corner cases—which are instantiated into executable programs and subjected to obfuscation under different configurations. Besides LLM-powered sketching, OBsmith also employs a second source: automatic extraction of skeletons from real programs. This extraction path enables more focused testing of project-specific features and lets developers inject domain knowledge into the resulting test cases. OBsmith uses two techniques to derive test oracles: (i) reference-oriented equivalence testing, which takes the original program as reference oracle (ground truth) and checks whether the obfuscated version preserves equivalent functionality, and (ii) metamorphic testing, which applies semantics-preserving transformations to the original program and checks if obfuscation violates expected behavior.
We evaluate OBsmith on two widely used obfuscators, Obfuscator.IO and JS-Confuser, generating 600 sketches using six popular LLMs. OBsmith fills these sketches and generates over 3,000 candidate programs and obfuscates them across seven obfuscation configurations. OBsmith uncovers 11 previously unknown correctness bugs. Under an equal program budget, five general-purpose state-of-the-art JavaScript fuzzers (FuzzJIT, Jsfunfuzz, Superion, DIE, Fuzzilli) failed to detect these issues, highlighting OBsmith’s complementary focus on obfuscation-induced misbehavior. An ablation shows that all components except our generic MRs contribute to at least one bug class; the negative MR result suggests the need for obfuscator-specific metamorphic relations. Our results also seed a discussion on how to balance obfuscation presets and performance cost. We envision OBsmith as an important step towards automated testing and quality assurance of obfuscators and other semantics-preserving toolchains.
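The reference-oriented oracle reduces to a differential check (a simplified sketch; the function names and variants are ours): run the original and the obfuscated program on the same inputs and treat any divergence as a correctness bug.

```python
def equivalent_on(f_orig, f_obf, inputs):
    """Reference-oriented oracle: the original program is ground truth;
    return the first input on which the obfuscated version diverges."""
    for x in inputs:
        if f_orig(x) != f_obf(x):
            return False, x
    return True, None

# Stand-ins for an original program and two "obfuscated" variants.
orig = abs
ok_variant = lambda x: (x ^ (x >> 63)) - (x >> 63)  # branchless abs rewrite
bad_variant = lambda x: x                            # drops the negation
```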
Article: oopslaa26main-p25-p
Peeling Off the Cocoon: Unveiling Suppressed Golden Seeds for Mutational Greybox Fuzzing
Ruixiang Qian,
Chunrong Fang,
Zengxu Chen,
Youxin Fu, and
Zhenyu Chen
(Nanjing University, China)
Mutational greybox fuzzing (MGF) is a powerful software testing technique. Initial seeds are critical for MGF since they define the space of possible inputs and fundamentally shape the effectiveness of MGF. Nevertheless, having more initial seeds is not always better. A bloated initial seed set can inhibit throughput, thereby degrading the effectiveness of MGF. To avoid bloating, modern fuzzing practices recommend performing seed selection to maintain golden seeds (i.e., those identified as beneficial for MGF) while minimizing the size of the set. Typically, seed selection favors seeds that execute unique code regions and discards those that contribute stale coverage. This coverage-based strategy is straightforward and useful, and is widely adopted by the fuzzing community. However, coverage-based seed selection (CSS) is not flawless and has a notable blind spot: it fails to identify golden seeds suppressed by unpassed coverage guards, even if these seeds contain valuable payload that can benefit MGF. This blind spot prevents suppressed golden seeds from realizing their true values, which may ultimately degrade the effectiveness of downstream MGF.
In this paper, we propose a novel technique named PoCo to address the blind spot of traditional CSS. The basic idea behind PoCo is to manifest the true strengths of the suppressed golden seeds by gradually disabling obstacle conditional guards. To this end, we develop a lightweight program transformation to enable flexible disabling of guards and devise a novel guard hierarchy analysis to identify obstacle ones. An iterative seed selection algorithm is constructed to stepwise select suppressed golden seeds. We prototype PoCo on top of the AFL++ utilities (version 4.10c) and compare it with seven baselines, including two state-of-the-art tools afl-cmin and OptiMin. Compared with afl-cmin, PoCo selects 3–40 additional seeds within a practical time budget of two hours. To evaluate how effective the studied techniques are in seeding MGF, we further conduct extensive fuzzing (over 17,280 CPU hours) with eight different targets from a mature benchmark named Magma, adopting the most representative fuzzer AFL++ for MGF. The results show that the additional seeds selected by PoCo yield modest improvements in both code coverage and bug discovery. Although our evaluation reveals some limitations of PoCo, it also demonstrates the presence and value of suppressed golden seeds. Based on the evaluation results, we distill lessons and insights that may inspire the fuzzing community.
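The coverage-based selection that PoCo improves on can be sketched as greedy set cover (afl-cmin-style; toy corpus of ours): a seed whose edges are already covered is dropped even if its payload is valuable, which is exactly the blind spot of CSS.

```python
def select_seeds(coverage):
    """Greedy coverage-based seed selection: repeatedly keep the seed
    that covers the most not-yet-covered edges, and drop the rest."""
    remaining = set().union(*coverage.values())
    kept = []
    while remaining:
        best = max(coverage, key=lambda s: len(coverage[s] & remaining))
        kept.append(best)
        remaining -= coverage[best]
    return kept

# Seed "c" is discarded: its edges are subsumed, though its payload
# might still be golden once an obstacle guard is disabled.
corpus = {"a": {1, 2, 3}, "b": {3, 4}, "c": {2, 3}}
```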
Artifacts Available
Article: oopslaa26main-p37-p
Diatom: Polylithic Binary Lifting with Data-Flow Summaries and Type-Aware IR Linking
Anshunkang Zhou and
Charles Zhang
(Hong Kong University of Science and Technology, China)
Binary lifting, which translates binary code into LLVM intermediate representations (IRs) through iterative IR transformations to recover high-level constructs from low-level machine features, is the cornerstone of many binary analysis systems. The scalability and precision of the upper-layer analysis can therefore be greatly affected by the underlying binary lifting. However, existing binary lifters suffer from severe performance problems: they require substantial time to handle extremely large binaries, which becomes a barrier to achieving the expected performance gains in various analyses and hinders them from meeting the quick-response requirements of modern continuous integration pipelines. We found that the root cause of the scalability issue is the inherent "monolithic" design that performs all lifting stages on a single LLVM module, which entails a global environment enforcing sequential dependences between any two transformations on IRs, thus limiting parallelism.
This paper presents Diatom, a novel parallel binary lifter powered by a new "polylithic" design, which decomposes the monolithic LLVM module into partitions to perform fully parallelized binary lifting. In the meantime, it leverages light-weight data-flow summaries and type-aware IR linking to avoid soundness loss caused by separating dependent code fragments. Large-scale experiments on 16 real-world benchmarks whose sizes range from dozens of megabytes (MBs) to several gigabytes (GBs) show that Diatom achieves an average speedup of 7.45× and a maximum speedup of 16.8× over a traditional monolithic binary lifter, while still maintaining the lifting soundness. Diatom can complete the translation for the Linux Kernel binary within only 10 minutes, which significantly accelerates the overall binary code analysis process.
Preprint
Article: oopslaa26main-p41-p
Handling Exceptions and Effects with Automatic Resource Analysis
Ethan Chu,
Yiyang Guo, and
Jan Hoffmann
(Carnegie Mellon University, USA)
There exist many techniques for automatically deriving parametric resource (or cost) bounds by analyzing the source code of a program. These techniques work effectively for a large class of programs and language features. However, non-local transfer of control as needed for exception or effect handlers has remained a challenge.
This paper presents the first automatic resource bound analysis that supports non-local control transfer between exceptions or effects and their handlers. The analysis is an extension of type-based automatic amortized resource analysis (AARA), which automates the potential method of amortized analysis. It is presented for a simple functional language with lists and linear potential functions. However, the ideas are directly applicable to richer settings and implemented for Standard ML and polynomial potential functions.
Apart from the new type system for exceptions and effects, a main contribution is a novel syntactic type-soundness theorem that establishes the correctness of the derived bounds with respect to a stack-based abstract machine. An experimental evaluation shows that the new analysis is capable of analyzing programs that cannot be analyzed by existing methods and that the efficiency overhead of supporting exception and effect handlers is low.
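For intuition, the textbook linear-potential derivation for list append (a standard AARA example; the paper's system extends this style of reasoning to exceptions and effects) reads:

```latex
% append carries potential 1 per element of its first argument:
\mathit{append} : L^{1}(A) \times L^{0}(A) \to L^{0}(A)
\qquad \Phi(xs,\, ys) = 1 \cdot |xs|
% each recursive step pays its unit cost from one released unit of
% potential, so the derived bound is:
\mathrm{cost}(\mathit{append}\; xs\; ys) \le |xs|
```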
Artifacts Available
Article: oopslaa26main-p42-p
RAT-CAT-SAT: Model Checking Memory Consistency Models
Jan Grünke,
Thomas Haas, and
Roland Meyer
(TU Braunschweig, Germany)
We present a model checking approach for axiomatic memory consistency models written in the language CAT.
It can prove properties of single memory models like the monotonicity of barriers, and also compare models like TSO and ARM.
To achieve this expressiveness, our approach supports the full rational fragment of CAT.
We not only support the Kleene star operation, composition, and union to define relations in a memory model, but also, for the first time, intersection, converse, and the construction of relations from sets.
Our model checking approach for memory consistency models is logical in nature: we formulate the problem as satisfiability in a logical theory of relations.
Our technical contribution is then a new theory solver that is sound, complete, and optimal from a complexity point of view.
At the heart of our solver is a cyclic proof system that can detect invariants on-the-fly.
This allows us to terminate early not only in the case that a model has been found, but also in the case of unsatisfiability.
Our solver is easy to implement and easy to combine with heuristics.
We have implemented a combination with counterexample-guided abstraction refinement, and exercised the prototype on a number of benchmarks, with promising results.
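A typical CAT axiom such as `acyclic (po | rf)` amounts to a cycle check over a relation on events; a minimal sketch (ours, far simpler than the paper's solver for the full rational fragment):

```python
def acyclic(nodes, edges):
    """Decide the CAT-style axiom `acyclic r` by DFS cycle detection,
    where the relation r is given as a set of (a, b) event pairs."""
    succs = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GREY
        for m in succs[n]:
            if color[m] == GREY:
                return False            # back edge: a cycle in r
            if color[m] == WHITE and not dfs(m):
                return False
        color[n] = BLACK
        return True

    return all(color[n] != WHITE or dfs(n) for n in nodes)
```

The hard part the paper addresses is that in model checking the relation is not a concrete graph but a symbolic object built with union, composition, Kleene star, intersection, and converse.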
Artifacts Available
Article: oopslaa26main-p45-p
Specy: Learning Specifications for Distributed Systems from Event Traces
Mike He,
Ankush Desai,
Aishwarya Jagarapu,
Doug Terry,
Sharad Malik, and
Aarti Gupta
(Princeton University, USA; Snowflake, USA; Amazon Web Services, USA; LinkedIn, USA)
Reasoning about the correctness of distributed systems is a significant challenge, with precise correctness specifications serving as an essential prerequisite to verification. However, identifying and formulating specifications remains a major hurdle for developers in practice. Specy addresses this challenge by automatically learning specifications from observable event traces generated by message exchanges in distributed systems. The system employs a specialized grammar tailored for event-based specifications, incorporating support for quantifiers over events – a capability essential for capturing the complex behavioral patterns inherent in distributed protocols. Specy utilizes a novel learning procedure that combines grammar-based enumerative search with dynamic learning from event traces, providing effective control over the specification search. We evaluated Specy on established distributed protocols and industrial case studies, demonstrating its ability to successfully learn important protocol specifications. Specy can discover previously unidentified specifications overlooked by developers, automatically derive inductive invariants that were previously constructed manually for verification purposes, and, through run-time monitoring in production systems, reveal gaps in testing coverage – highlighting opportunities to leverage specifications in practice.
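The grammar-based enumeration can be caricatured in a few lines (our toy, nowhere near Specy's quantified grammar): enumerate candidate "every p is eventually answered by q" specifications and keep those consistent with all observed traces.

```python
def holds(trace, p, q):
    """True iff in `trace` every occurrence of p is later followed by a q."""
    pending = False
    for ev in trace:
        if ev == p:
            pending = True
        elif ev == q:
            pending = False
    return not pending

def learn(traces, events):
    """Enumerate (p, q) response candidates; keep trace-consistent ones."""
    return [(p, q) for p in events for q in events
            if p != q and all(holds(t, p, q) for t in traces)]
```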
Artifacts Available
Article: oopslaa26main-p48-p
Commuting Conversions and Join Points for Call-by-Push-Value
Jonathan Chan,
Madi Gudin,
Annabel Levy, and
Stephanie Weirich
(University of Pennsylvania, USA; Amherst College, USA; University of Maryland, Baltimore County, USA)
Levy’s call-by-push-value (CBPV) is a language that subsumes both call-by-name and call-by-value lambda calculi by syntactically distinguishing values from computations and explicitly specifying execution order. This low-level handling of computation suspension and resumption makes CBPV suitable as a compiler intermediate representation (IR), while its substitution evaluation semantics affords compositional reasoning about programs. In particular, βη-equivalences in CBPV have been used to justify compiler optimizations in low-level IRs. However, these equivalences do not validate commuting conversions, which are key transformations in compiler passes such as A-normalization. Such transformations syntactically rearrange computations without affecting evaluation order, and can reveal new opportunities for inlining.
In this work, we identify the commuting conversions of CBPV, define a commuting conversion normal form (CCNF) for CBPV, present a single-pass transformation into CCNF based on A-normalization, and prove that well-typed, translated programs evaluate to the same result. To avoid the usual code duplication issues that also arise with A-normal form, we adapt the explicit join point constructs by Maurer et al. Our results are all mechanized in Lean 4.
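A commuting conversion can be shown on a toy expression AST (our sketch in Python rather than CBPV; the tuple tags are hypothetical): an application is pushed into the branches of an `if` without changing the result.

```python
def commute(expr):
    """f (if b then t else e)  ~>  if b then (f t) else (f e)."""
    if expr[0] == "app" and expr[2][0] == "if":
        f, (_, b, t, e) = expr[1], expr[2]
        return ("if", b, ("app", f, t), ("app", f, e))
    return expr

def evaluate(expr):
    tag = expr[0]
    if tag == "lit":
        return expr[1]
    if tag == "if":
        return evaluate(expr[2] if evaluate(expr[1]) else expr[3])
    if tag == "app":
        return evaluate(expr[1])(evaluate(expr[2]))

inc = ("lit", lambda x: x + 1)
prog = ("app", inc, ("if", ("lit", False), ("lit", 1), ("lit", 2)))
```

Note that the conversion duplicates `f` into both branches; the join-point constructs the paper adapts exist precisely to bind such a continuation once instead of copying it.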
Artifacts Available
Article: oopslaa26main-p54-p
Hermes: Making Path-Sensitive Pointer Analysis Scalable for Sparse Value-Flow Analysis
Yuxuan He,
Ruilin Jiang,
He Zhang,
Qingkai Shi,
Huaxun Huang, and
Rongxin Wu
(Xiamen University, China; Nanjing University, China)
Sparse Value-Flow Analysis (SVFA) is essential for detecting software bugs such as null pointer dereference and memory leak. However, SVFA heavily relies on path-sensitive pointer analysis, which faces significant scalability challenges when analyzing industrial-scale projects, notably the summary-explosion problem. To address this issue, we propose Hermes, which symbolizes memory side effects and constructs an incomplete Sparse Value-Flow Graph (SVFG) called Lazy Symbolic Expression Graph (LSEG). Leveraging this structure, Hermes builds inter-procedural value flows relevant to bug detection only when necessary, significantly reducing the overhead of pointer analysis and streamlining the bug-search paths. Evaluations on large-scale real-world projects demonstrate that, compared to the state-of-the-art, Hermes achieves average speedups of at least 9.84× and 4.79× for pointer analysis and bug search, respectively, without sacrificing the effectiveness of bug detection.
Article: oopslaa26main-p61-p
MetaSpace: Metamorphic Testing for Spatial Cognition in Embodied Agents
Gengyang Xu,
Dongwei Xiao,
Yiteng Peng, and
Shuai Wang
(Hong Kong University of Science and Technology, China)
An embodied agent is an intelligent entity that interacts with its environment through a physical body. Currently, the evaluation of embodied agents primarily relies on two paradigms: (1) manually annotated Visual Question Answering (VQA) pairs and (2) high-level task completion metrics, such as success in navigation or manipulation. The former is labor-intensive and subject to variability in annotation quality. The latter may obscure critical vulnerabilities, allowing agents to complete tasks through suboptimal means or safety violations, thereby concealing safety risks and inefficiencies. Given that spatial cognition is the cornerstone for executing embodied tasks, there is a pressing need to assess whether embodied agents possess robust spatial cognition during task execution.
Inspired by metamorphic testing principles in software engineering, we propose MetaSpace, a novel framework designed to evaluate the spatial cognition of agents. By leveraging spatiotemporal multimodal states derived from real execution trajectories, MetaSpace automatically generates test cases based on predefined metamorphic relations (MRs) grounded in logical rules and physical laws. Crucially, we encode these MRs as executable rules in a logic programming language (Prolog). Violations of these relations indicate failures in spatial cognition. Our empirical evaluation across three embodied scenarios demonstrates that MetaSpace successfully detects 90,422 spatial cognition errors in state-of-the-art (SOTA) MLLM-driven agents. We introduce the Spatial Cognition (SC) score to quantify performance. Results indicate that all SOTA agents achieve average scores between 0.44 and 0.52, significantly lower than the human benchmark of 0.96. Additionally, these agents struggle with directional tasks, with SC scores consistently below 0.38. In contrast, their performance in magnitude-related tasks is relatively better, with most SC scores exceeding 0.5. To mitigate the identified spatial cognition errors, we explore potential improvement strategies. Preliminary results suggest that traditional prompting techniques (e.g., Chain of Thought) are limited, while spatially-aware prompting (e.g., cognitive maps) shows promise. Our findings underscore the importance of ongoing community efforts to enhance embodied agent performance by prioritizing the improvement of spatial cognition, a fundamental requirement for executing embodied tasks.
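One MR of the kind described, transcribed from its Prolog form into a Python check (illustrative relation names of ours, not MetaSpace's actual rules): relative direction must be symmetric, so a reported `left_of(A, B)` without the matching `right_of(B, A)` flags a spatial-cognition error.

```python
def mr_direction_symmetry(facts):
    """facts: set of (relation, a, b) triples reported by the agent.
    Return violations of: left_of(A, B) implies right_of(B, A)."""
    return sorted(f for f in facts
                  if f[0] == "left_of"
                  and ("right_of", f[2], f[1]) not in facts)

reported = {("left_of", "cup", "plate"),
            ("right_of", "plate", "cup"),
            ("left_of", "fork", "knife")}   # missing the symmetric fact
```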
Article: oopslaa26main-p63-p
Debugging Debugging Information using Dynamic Call Trees
J. Ryan Stinnett and
Stephen Kell
(King's College London, UK)
Debugging tools rely on compiler-generated metadata to present a source-language view, but current compilers often throw away or corrupt debugging information in optimised programs. Attempts to test debugging information are confounded by ad-hoc limitations of the debug info formats and a lack of clarity on whether the compiler or the format is to blame for any given loss. Adopting the "residual program" conceptual view of debug info, we conduct a study of the quality of debugging information with respect to the source-level dynamic call trees it can recover. We compare the trees recovered from optimised and unoptimised versions of the same program, producing a classification of the observed divergences. For each class, we analyse whether the format or the compiler is to blame and identify specific ways to address these defects. We also validate our classification across a larger collection of well-known codebases.
Artifacts Available
Article: oopslaa26main-p64-p
Beer: Interactive Alarm Resolution in Bayesian Program Analysis via Exploration-Exploitation
Haoran Lin,
Zhenyu Yan, and
Xin Zhang
(Peking University, China)
Interactive Bayesian program analysis enhances static analysis by modeling derivations as probabilistic dependencies, enabling ranking alarms by calculated confidences, proposing highly likely alarms for user inspection, and updating confidences with inspection results. Existing interactive approaches adopt a purely greedy, exploitation-only selection strategy that always inspects the highest-confidence alarm. However, such strategies are prone to local optima, leading to redundant inspections and delayed identification of true alarms. We propose Beer (Bayesian Exploration-Exploitation Ranker), a framework that systematically integrates the Exploration-Exploitation trade-off into Bayesian program analysis. Beer leverages structural correlations between alarms—derived from shared root causes in the Bayesian model—to estimate expected information gain and guide exploration. When repeated false alarms indicate model stagnation, Beer selects alarms from minimally explored, highly correlated clusters to accelerate learning. Implemented atop the Bingo framework, Beer improves ranking efficiency by up to 32% over the greedy baseline on datarace, thread-escape, and taint analyses, demonstrating the efficacy of exploration-guided alarm resolution.
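The exploration-exploitation idea can be caricatured with an epsilon-greedy stand-in (Beer's actual selection is cluster- and information-gain-based; everything below, including the data, is a simplification of ours):

```python
def pick_alarm(confidence, cluster, inspections, eps, rng):
    """Exploit: take the highest-confidence alarm. Explore (w.p. eps):
    restrict to the least-inspected cluster, then take its best alarm.
    `rng` is a callable returning a float in [0, 1)."""
    pool = list(confidence)
    if rng() < eps:
        cold = min(set(cluster.values()),
                   key=lambda c: inspections.get(c, 0))
        pool = [a for a in pool if cluster[a] == cold]
    return max(pool, key=lambda a: confidence[a])

alarms = {"a1": 0.9, "a2": 0.8, "a3": 0.4}
clusters = {"a1": "c1", "a2": "c1", "a3": "c2"}
inspections = {"c1": 5, "c2": 0}
```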
Artifacts Available
Article: oopslaa26main-p91-p
noDice: Inference for Discrete Probabilistic Programs with Nondeterminism and Conditioning
Tobias Gürtler and
Benjamin Lucien Kaminski
(Saarland University, Germany; Saarland Informatics Campus, Germany; University College London, UK)
Probabilistic programming languages (PPLs) are an expressive and intuitive means of representing complex probability distributions. In that realm, languages like Dice target an important class of probabilistic programs: those whose probability distributions are discrete. Discrete distributions are common in many fields, including text analysis, network verification, artificial intelligence, and graph analysis. Another important feature in the world of probabilistic modeling are nondeterministic choices as found in Markov Decision Processes (MDPs) which play a major role in reinforcement learning. Modern PPLs usually lack support for nondeterminism. We address this gap with the introduction of noDice, which extends the discrete probabilistic inference engine Dice. noDice performs inference on loop-free programs by constructing an MDP so that the distributions modeled by the program correspond to schedulers in the MDP. Furthermore, decision diagrams are used as an intermediate step to exploit the program structure and drastically reduce the state space of the MDP.
Artifacts Available
Article: oopslaa26main-p98-p
Mechanically Translating Iterative Dataflow Analysis to Algebraic Program Analysis
Chenyu Zhou,
Jingbo Wang, and
Chao Wang
(University of Southern California, USA; Purdue University, USA)
We propose a method for mechanically translating iterative dataflow analysis (IDA) algorithms to algebraic program analysis (APA) algorithms capable of computing exactly the same set of dataflow facts. The method is useful because while most of the dataflow analysis algorithms used in practice are expressed as iterative procedures, APA provides an alternative and inherently compositional approach to solving dataflow problems, thus making it well suited for certain applications (e.g., incremental analysis of a program that goes through frequent code changes, or amortizing the cost of answering a large number of dataflow queries for the same program). However, manually crafting an APA algorithm not only is labor intensive and error prone but also can lead to suboptimal performance. Our method overcomes this limitation by providing a mechanical translation that is correct by construction. Our method handles a broad class of dataflow analysis problems — the only requirements are that (1) the set of dataflow facts is finite and (2) the dataflow functions distribute over the confluence operation (e.g., set union). They include classical dataflow problems whose IDA algorithms can be expressed using Gen/Kill sets, such as reaching definitions, live variables, and available expressions. They also include non-Gen/Kill problems such as copy constant propagation, truly-live variables, and possibly-initialized variables. Our experimental evaluation shows that the translated APA algorithms are not only simpler and easier to understand, but also significantly faster than manually-designed APA algorithms, especially for incremental program analysis.
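The IDA side being translated is the classical worklist iteration; a compact sketch for reaching definitions (the standard textbook Gen/Kill algorithm, not the paper's APA translation):

```python
def reaching_definitions(cfg, gen, kill):
    """Iterate IN[b] = union of OUT[p] over predecessors p, and
    OUT[b] = gen[b] | (IN[b] - kill[b]), to a fixed point.
    `cfg` maps each block to its list of successors."""
    preds = {b: [p for p in cfg if b in cfg[p]] for b in cfg}
    IN = {b: set() for b in cfg}
    OUT = {b: set(gen[b]) for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new = gen[b] | (IN[b] - kill[b])
            if new != OUT[b]:
                OUT[b], changed = new, True
    return IN, OUT

# B2 is a self-loop; d1 reaches it from B1, d2 is generated inside it.
cfg = {"B1": ["B2"], "B2": ["B2", "B3"], "B3": []}
gen = {"B1": {"d1"}, "B2": {"d2"}, "B3": set()}
kill = {"B1": {"d2"}, "B2": {"d1"}, "B3": set()}
```

Both requirements above hold here: the fact set is finite and the transfer functions distribute over set union, so this analysis is in the translatable class.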
Article: oopslaa26main-p99-p
SymGPT: Auditing Smart Contracts via Combining Symbolic Execution with Large Language Models
Shihao Xia,
Mengting He,
Shuai Shao,
Tingting Yu,
Yiying Zhang,
Nobuko Yoshida, and
Linhai Song
(Pennsylvania State University, USA; University of Connecticut, USA; University of California at San Diego, USA; University of Oxford, UK; Institute of Computing Technology at Chinese Academy of Sciences, China)
To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each defining a set of rules governing contract behavior. Violating these rules can cause serious security issues and financial losses, signifying the importance of verifying ERC compliance. Today’s practices of such verification include manual audits, expert-developed program-analysis tools, and large language models (LLMs), all of which remain ineffective at detecting ERC rule violations.
This paper introduces SymGPT, a tool that combines LLMs with symbolic execution to automatically verify smart contracts’ compliance with ERC rules. We begin by empirically analyzing 132 ERC rules from three major ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, SymGPT instructs an LLM to translate ERC rules into a domain-specific language, synthesizes constraints from the translated rules to model potential rule violations, and performs symbolic execution for violation detection. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for financial theft. Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.
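A flavor of the rules being checked (toy Python model of one ERC-20 clause; not SymGPT's DSL or symbolic engine): "transfer MUST throw if the caller's balance is insufficient".

```python
def transfer(balances, sender, to, amount):
    # Compliant: reverts when the sender's balance is insufficient.
    if balances.get(sender, 0) < amount:
        raise ValueError("insufficient balance")
    balances[sender] -= amount
    balances[to] = balances.get(to, 0) + amount

def transfer_buggy(balances, sender, to, amount):
    # Violates the rule: no check, so balances silently go negative.
    balances[sender] = balances.get(sender, 0) - amount
    balances[to] = balances.get(to, 0) + amount

def violates_must_throw(impl):
    """Report whether `impl` breaks the rule on an under-funded transfer."""
    balances = {"alice": 5}
    try:
        impl(balances, "alice", "bob", 10)
    except ValueError:
        return False
    return True
```

Where this sketch tests one concrete scenario, symbolic execution explores the constraint `balance < amount` over all inputs.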
Article Search
Article: oopslaa26main-p107-p
CMakeSonar: A Static Approach to Detecting CMake Bugs with a Fine-Grained Type System
Haotian Han,
Zihang Zhong,
Qingan Li,
Jingling Xue, and
Mengting Yuan
(Wuhan University, China; UNSW Sydney, Australia)
As build systems and their scripts grow in size and complexity, detecting bugs in build configurations becomes increasingly challenging due to the rich functionality and weak typing of build scripting languages. This paper introduces CMakeSonar, the first static approach to precisely identifying semantic bugs in CMake scripts. CMakeSonar addresses this challenge by (1) designing a fine-grained type system that captures the runtime semantics of CMake values, and (2) performing a flow-sensitive analysis that detects inconsistent and ill-typed value usages by solving type constraints. Our approach identifies configuration and usage errors that can silently affect build correctness, portability, and deployment safety. In our evaluation, CMakeSonar identifies 155 bugs across 36 real-world CMake projects on GitHub, of which 23 have been accepted and fixed by developers. With a false positive rate of 4.32% and a recall of 97.48%, CMakeSonar demonstrates that precise static analysis can effectively uncover high-impact bugs in untyped build systems.
Article Search
Artifacts Available
Article: oopslaa26main-p114-p
When Specifications Meet Reality: Uncovering API Inconsistencies in Ethereum Infrastructure
Jie Ma,
Ningyu He,
Jinwen Xi,
Mingzhe Xing,
Liangxin Liu,
Jiushenzi Luo,
Xiaopeng Fu,
Chiachih Wu,
Haoyu Wang,
Ying Gao, and
Yinliang Yue
(Beihang University, China; Zhongguancun Laboratory, China; Hong Kong Polytechnic University, Hong Kong; Beijing Institute of Technology, China; Amber Group, Hong Kong; Huazhong University of Science and Technology, China)
The Ethereum ecosystem, which secures over $381 billion in assets, fundamentally relies on client APIs as the sole interface between users and the blockchain. However, these critical APIs suffer from widespread implementation inconsistencies, which can lead to financial discrepancies, degraded user experiences, and threats to network reliability. Despite this criticality, existing testing approaches remain manual and incomplete: they require extensive domain expertise, struggle to keep pace with Ethereum's rapid evolution, and fail to distinguish genuine bugs from acceptable implementation variations. We present APIDIFfer, the first specification-guided differential testing framework designed to automatically detect API inconsistencies across Ethereum's diverse client ecosystem. APIDIFfer transforms API specifications into comprehensive test suites through two key innovations: (1) specification-guided test input generation that creates both syntactically valid and invalid requests enriched with real-time blockchain data, and (2) specification-aware false positive filtering that leverages large language models to distinguish genuine bugs from acceptable variations. Our evaluation across all 11 major Ethereum clients reveals the pervasiveness of API bugs in production systems. APIDIFfer uncovered 72 bugs, with 90.28% already confirmed or fixed by developers, including one critical error in the official specifications themselves. Beyond these raw numbers, APIDIFfer achieves up to 89.67% higher code coverage than existing tools and reduces false positive rates by 37.38%. The Ethereum community's response validates our impact: developers have integrated our test cases, expressed interest in adopting our methodology, and escalated one bug to the official Ethereum Project Management meeting. 
By making APIDIFfer open-source, we enable continuous validation of Ethereum client API implementations, thereby strengthening the foundational integrity of the entire Ethereum ecosystem.
Article Search
Artifacts Available
Article: oopslaa26main-p122-p
Type Inference for Functional and Imperative Dynamic Languages
Mickaël Laurent and
Jan Vitek
(Charles University, Czech Republic; Czech Technical University, Czech Republic)
In this paper, we formalize a type system based on set-theoretic types for dynamic languages that support both functional and imperative programming paradigms. We adapt prior work in the typing of overloaded and generic functions to support an impure lambda-calculus, focusing on imperative features commonly found in dynamic languages such as JavaScript, Python, and Julia. We introduce a general notion of parametric opaque data types using set-theoretic types, enabling precise modeling of mutable data structures while promoting modularity, clarity, and readability. Finally, we compare our approach to existing work and evaluate our prototype implementation on a range of examples.
Article Search
Artifacts Available
Article: oopslaa26main-p124-p
Fully-Automatic Type Inference for Borrows with Lifetimes
William Brandon,
Benjamin Driscoll,
Frank Dai,
Jonathan Ragan-Kelley,
Mae Milano, and
Alex Aiken
(Massachusetts Institute of Technology, USA; Stanford University, USA; Unaffiliated, USA; Princeton University, USA)
We present a new pure functional language and type system with borrows with lifetimes, and a corresponding fully-automatic type inference procedure. Inference provides users the performance benefits of borrows with lifetimes without requiring user annotation. If the user's program cannot be typed, inference inserts a handful of reference count operations so that it can be. We provide a heap semantics for our borrowing language and prove a soundness theorem, guaranteeing that well-typed programs do not violate memory safety. We implement our memory management strategy as part of the Morphic language stack and compare it to Perceus, a state-of-the-art reference-counting technique based on linear type inference. We find that our system eliminates almost all reference count operations across a range of programs, reducing reference count increments by 75-100% on every benchmark where the baseline performs them. As a result, we achieve a 1.48x geomean speedup across all benchmarks.
Article Search
Artifacts Available
Article: oopslaa26main-p128-p
(Dis)Proving Spectre Security with Speculation-Passing Style
Santiago Arranz-Olmos,
Gilles Barthe,
Lionel Blatter,
Xingyu Xie, and
Zhiyuan Zhang
(MPI-SP, Germany; IMDEA Software Institute, Spain)
Constant-time (CT) verification tools are commonly used for detecting potential side-channel vulnerabilities in cryptographic libraries. Recently, a new class of tools, called speculative constant-time (SCT) tools, has also been used for detecting potential Spectre vulnerabilities. In many cases, these SCT tools have emerged as liftings of CT tools. However, these liftings are seldom defined precisely and are almost never analyzed formally. The goal of this paper is to address this gap by developing formal foundations for these liftings and demonstrating that these foundations can yield practical benefits.
Concretely, we introduce a program transformation, coined Speculation-Passing Style (SPS), for reducing SCT verification to CT verification. Essentially, the transformation instruments the program with a new input that corresponds to attacker-controlled predictions and modifies the program to follow them. This approach is sound and complete, in the sense that a program is SCT if and only if its SPS transform is CT. Thus, we can leverage existing CT verification tools to prove SCT; we illustrate this by combining SPS with three standard methodologies for CT verification, namely reducing it to noninterference, assertion safety, and dynamic taint analysis. We realize these combinations with three existing tools, EasyCrypt, Binsec/Rel, and CTGrind, and we evaluate them on Kocher’s benchmarks for Spectre-v1. Our results focus on Spectre-v1 in the standard CT leakage model; however, we also discuss applications of our method to other variants of Spectre and other leakage models.
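The flavor of the transformation can be caricatured in a few lines of Python. This is our simplified model (explicit leakage traces, a modulo hack to stand in for a misspeculated wild read), not the paper's formal definition:

```python
# Caricature of Speculation-Passing Style (our simplification, not the
# paper's transformation): the branch prediction becomes an explicit,
# attacker-controlled input, control flow follows it, and the leakage
# trace a CT analysis observes now covers misspeculated executions.

def original(i, table):
    leak = []                          # trace of leaked observations
    if i < len(table):                 # bounds check
        leak.append(("read", i))       # in-bounds load leaks index i
        return table[i], leak
    return 0, leak

def sps(i, table, predict_taken):
    leak = []
    if predict_taken:                  # branch outcome is now an input
        leak.append(("read", i))       # an out-of-bounds i leaks too
        return table[i % len(table)], leak   # crude model of a wild read
    return 0, leak

# Analyzing `sps` over all predictions exposes speculative leakage that
# the sequential semantics of `original` never exhibits:
_, trace = sps(7, [1, 2, 3], predict_taken=True)
assert ("read", 7) in trace             # leaked under misprediction
assert original(7, [1, 2, 3])[1] == []  # invisible without speculation
```

In this toy setting, checking the transformed program for constant-timeness over all prediction inputs plays the role of SCT verification of the original, mirroring the sound-and-complete reduction claimed above.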
Article Search
Artifacts Available
Article: oopslaa26main-p130-p
Static Factorisation of Probabilistic Programs with User-Labelled Sample Statements and While Loops
Markus Böck and
Jürgen Cito
(TU Wien, Austria)
It is commonly known that any Bayesian network can be implemented as a probabilistic program, but the reverse direction is not so clear. In this work, we address the open question of to what extent a probabilistic program with user-labelled sample statements and while loops (features found in languages like Gen, Turing, and Pyro) can be represented graphically. To this end, we extend existing operational semantics to support these language features. By translating a program to its control-flow graph, we define a sound static analysis that approximates the dependency structure of the random variables in the program. As a result, we obtain a static factorisation of the implicitly defined program density, which is equivalent to the known Bayesian network factorisation for programs without loops and constant labels, but constitutes a novel graphical representation for programs that define an unbounded number of random variables via loops or dynamic labels. We further develop a sound program slicing technique that leverages this structure to statically enable three well-known optimisations for the considered program class: we reduce the variance of gradient estimates in variational inference, and we speed up both single-site Metropolis-Hastings and sequential Monte Carlo. These optimisations are proven correct and empirically shown to match or outperform existing techniques.
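For the loop-free, constant-label case, the factorisation recovered is the standard Bayesian-network one, p(x1, ..., xn) = Π p(xi | parents(xi)). A toy Python sketch of that correspondence (our illustration; recovering the parent sets for real programs requires the paper's control-flow-graph analysis):

```python
# Toy sketch (our illustration): for a straight-line program whose sample
# statements and parent sets are known, the implicitly defined density
# factorises exactly as in a Bayesian network.

program = [                 # (sampled variable, its parents) per statement
    ("z", []),
    ("x", ["z"]),
    ("y", ["z", "x"]),
]

def factorisation(prog):
    # Emit one conditional factor per sample statement.
    return " * ".join(
        f"p({v})" if not ps else f"p({v} | {', '.join(ps)})"
        for v, ps in prog)

assert factorisation(program) == "p(z) * p(x | z) * p(y | z, x)"
```

Loops and dynamic labels break this simple picture because the set of sampled variables is no longer fixed, which is precisely the setting the paper's static analysis targets.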
Article Search
Artifacts Available
Article: oopslaa26main-p132-p
Mixtris: Mechanised Higher-Order Separation Logic for Mixed Choice Multiparty Message Passing
Jonas Kastberg Hinrichsen,
Iwan Quémerais, and
Lars Birkedal
(Aalborg University, Denmark; ENS-Lyon, France; Aarhus University, Denmark)
Mixed choice multiparty message passing is an expressive concurrency programming paradigm where components use non-determinism to choose between concurrent options for sending and receiving messages. This flexibility makes it possible to program advanced algorithms, such as leader election protocols, succinctly. We present Mixtris, a mechanised higher-order separation logic for reasoning about functional correctness of higher-order imperative programs with mixed choice multiparty message passing in a shared memory setting. Mixtris builds upon recent work on separation logic for (non-mixed choice) multiparty message-passing programs, by drawing inspiration from session type systems for mixed choice multiparty message-passing programs. Mixtris is the first program logic for mixed choice multiparty message passing. We prove soundness of Mixtris using a novel model of our mixed choice multiparty protocols. We demonstrate how Mixtris can be used to formally reason about challenging examples, including leader election protocols such as Chang and Roberts’s ring leader election protocol. All the results in the paper (both meta-theory and examples) have been formalised in the Rocq proof assistant on top of the Iris program logic framework.
Article Search
Artifacts Available
Article: oopslaa26main-p134-p
Phaedrus: Predicting Dynamic Application Behavior with Lightweight Generative Models and LLMs
Bodhisatwa Chatterjee,
Neeraj Jadhav, and
Santosh Pande
(Georgia Institute of Technology, USA)
Application profiling is an indispensable technique for many software development tasks, such as code and memory layout optimizations, where optimization decisions are tailored to specific program profiles. Unfortunately, modern application codebases exhibit highly variable behavior across different inputs, creating challenges for conventional profiling approaches that rely on a single representative execution instance. In this paper, we propose Phaedrus, a new compiler-assisted deep learning framework designed to predict dynamic program behavior across varied execution instances, specifically focusing on dynamic function call prediction. These predicted call sequences are subsequently used to guide input-specific compiler optimizations, producing code specialized for each execution instance.
Traditional profile-guided optimization methods struggle with the input-dependent variability of modern applications, where profiling on different inputs yields divergent application behaviors. To address this, Phaedrus proposes two new approaches: Application Profile Synthesis (Dynamis), a profile-less approach where Large Language Models (LLMs) directly infer dynamic functions based on source code and static compiler analysis, bypassing the need for traditional profiling, and Application Profile Generalization (Morpheus), which uses generative models trained on compressed and augmented Whole Program Path (WPP) based function profiles to predict application behavior under unseen inputs. Our experiments show that Phaedrus accurately identifies the most frequently executed and runtime-dominated hotspot functions, accounting for up to 85–99% of total execution time. Leveraging these predictions, Phaedrus enables superior profile-guided optimizations, delivering an average speedup of 6% (up to 25%) and a binary size reduction of 5.19% (up to 19%), without any program execution. In addition, Phaedrus reduces WPP function profile sizes by up to 107×.
Preprint
Artifacts Available
Article: oopslaa26main-p136-p
Metamorphic Testing for Infrastructure-as-Code Engines
David Spielmann,
George Zakhour,
Dominik Arnold,
Matteo Biagiola,
Roland Meier, and
Guido Salvaneschi
(University of St. Gallen, Switzerland; University of Zurich, Switzerland; USI Lugano, Switzerland; armasuisse, Switzerland)
Infrastructure-as-Code (IaC) engines, such as Terraform, OpenTofu, and Pulumi, automate the provisioning and management of cloud resources. They parse IaC specifications and orchestrate the required actions, making them the backbone of modern clouds, and critical to the reliability of both the underlying infrastructure and the software that depends on it. Despite this importance, this class of systems has received little attention: prior work largely targets the correctness of IaC programs rather than the IaC engines themselves. Existing test suites rely on manually written oracles and struggle to expose faults that manifest across multiple executions, leaving a significant reliability gap.
We present EMIaC, a metamorphic testing framework for IaC engines. EMIaC defines metamorphic relations as graph-based transformations of IaC programs and checks invariants across executions of the original and transformed programs. A central novelty is our use of e-graphs in software testing, as both a test-input generator and an equivalence oracle. E-graphs compactly represent program equivalences, enabling the systematic generation of large spaces of equivalent IaC programs. To ground these relations, we analyze 43,593 real-world Terraform programs and show that IaC dependency graphs are typically small and sparse, making e-graphs a natural fit.
Evaluating EMIaC on Pulumi, Terraform, and OpenTofu, we show that it complements existing test suites by exercising engine-critical code paths and covering 98 previously untested statements in Terraform and 1,313 in Pulumi. EMIaC also uncovers previously unknown issues in all three test suites, improving their adequacy. Three test cases have been merged into Terraform's main branch, and Pulumi has merged a specification fix.
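To make the metamorphic idea concrete, the sketch below applies one equivalence-preserving transformation (permuting independent resource declarations) to a toy dependency-resolving "engine" and checks that the final states agree. This is our illustration only; EMIaC's relations are graph-based and e-graph-driven, and real IaC engines are vastly more complex:

```python
# Toy metamorphic relation in the spirit of EMIaC (our illustration, not
# the tool's machinery): permuting *independent* resource declarations is
# equivalence-preserving, so a correct engine must reach the same final
# state for both programs; divergence would signal an engine fault.

def provision(program):
    # Minimal "engine": repeatedly create any resource whose deps exist.
    state, pending = {}, list(program)
    while pending:
        for res in pending:
            if all(dep in state for dep in res["deps"]):
                state[res["name"]] = {"deps": sorted(res["deps"])}
                pending.remove(res)
                break
        else:
            raise ValueError("unsatisfiable dependency")
    return state

prog = [
    {"name": "vpc",    "deps": []},
    {"name": "subnet", "deps": ["vpc"]},
    {"name": "bucket", "deps": []},   # independent of the other two
]
transformed = [prog[2], prog[0], prog[1]]   # the metamorphic transform

assert provision(prog) == provision(transformed)   # invariant holds
```

The finding that real-world dependency graphs are small and sparse is what makes exhaustively enumerating such equivalent variants (via e-graphs) tractable.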
Article Search
Artifacts Available
Article: oopslaa26main-p139-p
Class-Dictionary Specialization with Rank-2 Polymorphic Functions
Yong Qi Foo and
Michael D. Adams
(National University of Singapore, Singapore)
Rank-2 polymorphism, when combined with type-class-constrained arguments, enables powerful abstractions and code reuse by allowing functions to accept arguments that are themselves ad-hoc polymorphic. Optimizing compilers like the Glasgow Haskell Compiler (GHC) use techniques like class-dictionary specialization and inlining for optimization. However, these techniques can falter when the rank-2 polymorphic functions are recursive. Specializing the polymorphic arguments of recursive rank-2 polymorphic function applications requires inlining the applied function to expose type and dictionary applications to the arguments, but the compiler’s heuristic-driven inliner is reluctant to inline recursive functions, risking non-termination during compilation. This stalemate causes recursive rank-2 polymorphic functions to remain un-optimized and be left with a runtime penalty. Within the Haskell ecosystem, we identify this stalemate within three widely used libraries: the Scrap Your Boilerplate (SYB) and SYB With Class (SYB3) generic-programming libraries, and a fragment of the lens optics library. In these libraries, the combination of rank-2 polymorphism, type-class constraints and recursion is central to their implementation.
In this paper, we present a new optimization technique that breaks this stalemate and enables class-dictionary specialization with recursive rank-2 polymorphic functions. We introduce a partial evaluator that strategically applies the standard transformations of inlining, β-reduction and memoization to applications of rank-2 polymorphic functions that are partially static, i.e., whose arguments have some information known statically. This process exposes applications of its polymorphic arguments onto concrete types and dictionaries, which can be specialized by the standard compiler optimization pipeline. Additionally, we introduce type-constant folding to evaluate run-time type-equality tests statically, further leveraging static type information gained from partial evaluation. We implement our technique as a GHC plugin and demonstrate its effectiveness by resolving the performance bottlenecks in the aforementioned Haskell libraries. On SYB and SYB3 traversals, our technique achieves speedups of 43× on average (up to 155×) and 6.1× on average (up to 9.5×), respectively, matching the performance of their hand-written counterparts. On the fragment of the lens library containing the identified slowdowns, our technique achieves speedups of 1.6× on average (up to 2.1×).
Article Search
Artifacts Available
Article: oopslaa26main-p141-p
Sound and Complete Invariant-Based Heap Encodings
Zafer Esen,
Philipp Rümmer, and
Tjark Weber
(Uppsala University, Sweden; University of Regensburg, Germany)
Verification of programs operating on heap-allocated data structures, for instance lists or trees, poses significant challenges due to the potentially unbounded size of such data structures. We present time-indexed heap invariants, a novel invariant-based heap encoding leveraging uninterpreted predicates and prophecy variables to reduce verification of heap-manipulating programs to verification of programs over integers only. Our encoding of the heap is general and agnostic to specific data structures. To the best of our knowledge, our approach is the first heap-invariant-based method that achieves both soundness and completeness. We provide formal proofs establishing the correctness of our encodings. Through an experimental evaluation, we demonstrate that time-indexed heap invariants significantly extend the capability of existing verification tools, allowing automatic verification of heap-manipulating programs that were previously out of reach for state-of-the-art tools.
Article Search
Artifacts Available
Article: oopslaa26main-p144-p
Scylla: Translating an Applicative Subset of C to Safe Rust
Aymeric Fromherz and
Jonathan Protzenko
(Inria, France; Google, USA)
The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead advocate for a different approach, where the programmer iterates on the original C, gradually making the code more structured until it becomes eligible for compilation to safe Rust. This means that redesigns and rewrites can be evaluated incrementally for performance and correctness against existing test suites and production environments.
Compiling structured C to safe Rust relies on the following contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on “split trees” which allows expressing C’s pointer arithmetic using Rust’s slices and splitting operations; an analysis that infers which borrows need to be mutable; and a compilation strategy for C pointer types that is compatible with Rust’s distinction between non-owned and owned allocations. We evaluate our approach on real-world cryptographic libraries, binary parsers and serializers, and a file compression library. We show that these can be rewritten to Rust with small refactors of the original C code, and that the resulting Rust code exhibits similar performance characteristics as the original C code. As part of our translation process, we also identify and report undefined behaviors in the bzip2 compression library and in Microsoft’s implementation of the FrodoKEM cryptographic primitive.
Article Search
Artifacts Available
Article: oopslaa26main-p146-p
Hybrid Game Control Envelope Synthesis
Aditi Kabra,
Jonathan Laurent,
Stefan Mitsch, and
André Platzer
(Carnegie Mellon University, USA; KIT, Germany; DePaul University, USA)
Control problems for embedded systems like cars and trains can be modeled by two-player hybrid games. Control envelopes, which are families of safe control solutions, correspond to nondeterministic policies that ensure a player following them will not lose. Each deterministic, finite specialization of the nondeterministic policy is a control solution. This paper synthesizes control envelopes for hybrid games that are as permissive as possible. It introduces subvalue maps, a compositional representation of such policies that enables verification and synthesis along the structure of the game. An inductive logical characterization in differential game logic (dGL) checks whether a subvalue map induces a sound control envelope which ensures that the player never loses, no matter what actions the opponent plays. The maximal subvalue map, which allows the most action options while still winning, is shown to exist and satisfy a logical characterization. An inductive subvalue map synthesis framework is obtained from the soundness characterization. An evaluation of the framework uses the significant expressivity of dGL to model and solve a broad range of control challenges.
Article Search
Article: oopslaa26main-p158-p
Hunting CUDA Bugs at Scale with cuFuzz
Mohamed Tarek Ibn Ziad and
Christos Kozyrakis
(NVIDIA, USA; Stanford University, USA)
GPUs play an increasingly important role in modern software. However, the heterogeneous host-device execution model and expanding software stacks make GPU programs prone to memory-safety and concurrency bugs that evade static analysis. While fuzz-testing, combined with dynamic error checking tools, offers a plausible solution, it remains underutilized for GPUs. In this work, we identify three main obstacles limiting prior GPU fuzzing efforts: (1) kernel-level fuzzing leading to false positives, (2) lack of device-side coverage-guided feedback, and (3) incompatibility between coverage and sanitization tools. We present cuFuzz, the first CUDA-oriented fuzzer that makes GPU fuzzing practical by addressing these obstacles.
cuFuzz uses whole program fuzzing to avoid false positives from independently fuzzing device-side kernels. It leverages NVBit to instrument device-side instructions and merges the resultant coverage with compiler-based host coverage. Finally, cuFuzz decouples sanitization from coverage collection by executing host- and device-side sanitizers in separate processes. cuFuzz uncovers 43 previously unknown bugs (19 in commercial libraries) across 14 CUDA programs, including illegal memory accesses, uninitialized reads, and data races. cuFuzz achieves significantly more discovered edges and unique inputs compared to baseline approaches, especially on closed-source targets. Moreover, we quantify the execution time overheads of the different cuFuzz components and add persistent-mode support to improve the overall fuzzing throughput. Our results demonstrate that cuFuzz is an effective and deployable addition to the GPU testing toolbox. cuFuzz is publicly available at https://github.com/NVlabs/cuFuzz/.
Article Search
Artifacts Available
Article: oopslaa26main-p167-p
Beacon: Detecting Broken Access Control Vulnerabilities in DBMSs via System Catalog Consistency Validation
Zongrui Peng,
Jingzhou Fu,
Zhiyong Wu,
Jie Liang,
Xiangdong Huang,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; Beihang University, China; Aviation Industry Corporation of China, China)
Access control in DBMSs is critical for ensuring data security and integrity. However, the increasing complexity of its implementation often introduces broken access control (BAC) vulnerabilities. These vulnerabilities can lead to severe consequences, including privilege escalation, unauthorized data access, or even full compromise of the DBMS. Existing manual testing for BAC vulnerabilities is time-consuming and incomplete. Automated methods like static analysis also struggle in DBMSs, as static rules are difficult to apply to multi-level and dynamically changing privileges.
In this paper, we propose Beacon, which detects BAC vulnerabilities by validating the consistency between SQL operations and system catalogs. Our key insight is that the visibility of objects in the system catalogs is consistent with the user's access control: if an object is invisible to a user in the system catalogs, the user should not have any access privileges on it. Any inconsistency suggests that a user is exceeding their privileges, indicating a potential BAC vulnerability. We used Beacon to test eight popular DBMSs (e.g., MySQL and MariaDB), uncovering 39 previously unknown BAC vulnerabilities. Among them, 19 result in privilege escalation, and 20 lead to unauthorized information disclosure. Moreover, 7 of them have existed in DBMSs for more than 6 years, with the longest-persisting one lasting 13 years. DBMS vendors took these issues seriously and have already confirmed all of these vulnerabilities. Many vendors provided positive feedback, recognizing the importance of addressing these vulnerabilities. For instance, OceanBase awarded bounties for reported vulnerabilities, underscoring Beacon's role in improving DBMS access control.
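The key insight can be sketched as a small consistency oracle. The catalog model and operation log below are hypothetical simplifications of what such a tool would extract from a real DBMS:

```python
# Toy version of Beacon's consistency oracle (our simplification; the
# catalog and operation log are hypothetical): if an object is invisible
# to a user in the system catalogs, no operation by that user on it
# should succeed, so any success is a potential BAC finding.

def visible_objects(catalog, user):
    return {obj for obj, users in catalog.items() if user in users}

def find_bac(catalog, observed_ops):
    # observed_ops: (user, object, succeeded) triples from executed SQL.
    return [(u, o) for u, o, ok in observed_ops
            if ok and o not in visible_objects(catalog, u)]

catalog = {"t_sales": {"alice"}, "t_hr": {"bob"}}
ops = [
    ("alice", "t_sales", True),   # consistent: visible and allowed
    ("alice", "t_hr",    True),   # inconsistent: invisible yet succeeded
    ("bob",   "t_sales", False),  # consistent: invisible and denied
]
assert find_bac(catalog, ops) == [("alice", "t_hr")]
```

The oracle needs no static privilege rules; it only compares runtime outcomes against catalog visibility, which is what lets the approach cope with multi-level, dynamically changing privileges.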
Article Search
Artifacts Available
Article: oopslaa26main-p172-p
Efficient Directed Hybrid Fuzzing via Target-Centric Seed Selection and Generation
Zhen Li,
Shenghan Liu,
Qiuping Yi,
Pengbo Du, and
Hongliang Liang
(Beijing University of Posts and Telecommunications, China)
Software vulnerabilities pose severe security threats, highlighting the need for effective automated detection. Directed hybrid fuzzing, which combines the rapid exploration of fuzz testing with the precise constraint solving of symbolic execution, has made notable advancements in vulnerability discovery. However, existing directed hybrid fuzzing approaches still face two key challenges: (1) inefficient seed selection, leading to inadequate prioritization of optimal inputs for symbolic execution, and (2) inefficient seed generation, resulting in low-quality seeds. To address these issues, we propose TACO-Fuzz, TArget-Centric cOncolic Fuzzing, which introduces a two-phase target-centric seed selection strategy to prioritize under-explored paths and a target-centric seed generation approach based on constructing extended path conditions, thereby improving seed quality. Our evaluation on a selected set of public benchmarks shows that TACO-Fuzz outperforms several representative state-of-the-art directed fuzzing tools, achieving an average speedup of nearly 10x in reaching target locations, along with comparable improvements in reproducing real-world vulnerabilities. Moreover, TACO-Fuzz contributed to the discovery of 17 previously unknown vulnerabilities, each assigned a CVE, and demonstrated faster vulnerability discovery and reproduction in most cases.
Article Search
Artifacts Available
Article: oopslaa26main-p176-p
Efficient Incremental GR(1) Synthesis via Monotonic Fixed-Point Reuse
Sirui Liu,
Wei Dong,
Yijie Zheng, and
Haonan Guo
(National University of Defense Technology, China)
Although reactive synthesis guarantees correct-by-construction implementations, its practical adoption is limited by a performance bottleneck in the iterative design-and-refinement cycle of formal specifications. GR(1) synthesizers perform redundant, from-scratch computations for each specification modification, severely slowing the development workflow. To address this inefficiency, we propose an incremental synthesis method that exploits the monotonicity of the underlying fixed-point computations. By reusing the system winning region from the preceding check, our method significantly accelerates realizability checking.
Our work applies an incremental method to iterative GR(1) specification development covering the full GR(1) scope, including system guarantees and environment assumptions, to accelerate realizability checking. Furthermore, we introduce two heuristics for early fixed-point detection during incremental realizability checking. Finally, we analyze why prior work failed to integrate an incremental technique limited to system guarantees into the unrealizable core minimization algorithm DDMin, further establishing iterative development as a suitable application setting for incremental methods.
Evaluated on a large-scale benchmark of 8,282 specifications, our method exhibits a strong positive correlation between performance gain and specification complexity. For the most challenging specifications, which constitute the primary bottlenecks in development, our approach achieves speedups of several orders of magnitude, reducing computation times in some cases from nearly an hour to just seconds.
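The monotonicity argument can be illustrated on a toy greatest-fixed-point computation: seeding the iteration with the previous winning region reaches the same answer in fewer steps than recomputing from the full state set. The successor relation below is a hypothetical toy graph, not GR(1) game semantics:

```python
# Toy illustration of the monotonicity behind incremental realizability
# checking (our sketch, not the paper's algorithm): for a monotone
# shrinking operator, iterating from the previous fixed point after a
# strengthening change converges to the same result as iterating from the
# top element, in fewer steps.

def gfp(F, start):
    x, iters = start, 0
    while True:
        nxt, iters = F(x), iters + 1
        if nxt == x:
            return x, iters
        x = nxt

STATES = frozenset(range(6))
SUCC = {0: {0}, 1: {0}, 2: {1}, 3: {7}, 4: {3}, 5: {4}}  # toy graph

def F(x):
    # Keep exactly the states that have a successor still inside x.
    return frozenset(s for s in x if SUCC[s] & x)

win, _ = gfp(F, STATES)          # winning region of the original spec
assert win == {0, 1, 2}

SUCC[2] = {9}                    # a strengthening specification change
fresh, n_fresh = gfp(F, STATES)  # from-scratch recomputation
warm, n_warm = gfp(F, win)       # seeded with the previous region
assert fresh == warm == {0, 1}
assert n_warm < n_fresh          # reuse saves iterations
```

On specifications where each fixed-point iteration is expensive, this saved work is what compounds into the orders-of-magnitude speedups reported above.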
Article Search
Article: oopslaa26main-p187-p
VeriEQ: Finding Verilog Simulators and Synthesizers Bugs with Equivalence Circuit Transformation
Zhen Yan,
Yuanliang Chen,
Fuchen Ma,
Zehong Yu,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; Aviation Industry Corporation of China, China)
Verilog simulators and synthesizers play a critical role in chip design and verification. However, due to the complexity of simulation and synthesis processes, they are prone to introducing various types of bugs. Among these, Behavioral Deviation Bugs (BDBs) are particularly severe, as they can cause incorrect results by introducing subtle semantic deviations. Such deviations make the chip behave differently from its intended design and may even enable hardware backdoors.
In this work, we propose VeriEQ, an automated framework based on the concept of metamorphic testing, to detect BDBs by generating semantically equivalent Verilog programs. First, to increase the likelihood of triggering BDBs, we analyze the structural patterns of historical BDB cases and design a specialized Verilog code template. Second, we generate semantically equivalent variants by applying equivalence-preserving circuit transformation rules. These rules incorporate constraints on bit-width and signedness to ensure logical consistency before and after transformation. Finally, we design an inlined deviation-checking mechanism that embeds multiple equivalent modules within a single testbench, thereby improving testing efficiency. We implement and evaluate VeriEQ on four mainstream Verilog simulators and synthesizers. Experimental results demonstrate that VeriEQ achieves a 138.1% to 4161.9% speedup compared to state-of-the-art tools. In total, VeriEQ successfully detects 33 previously unknown bugs, including 29 BDBs and 4 hang bugs as additional findings. All discovered bugs have been confirmed, with 27 already fixed. In contrast, existing tools are able to detect only 1 to 7 bugs.
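The testing scheme can be miniaturized as follows: evaluate an original circuit expression and an equivalence-transformed variant under the same "simulator" and flag any divergence. The expression encoding, rewrite rule, and seeded bug below are our illustration, not VeriEQ's templates or transformation rules:

```python
# Toy metamorphic check in the spirit of VeriEQ (our illustration): run
# two semantically equivalent circuit expressions through one simulator
# and flag divergence, which indicates a behavioral deviation bug.

import itertools

# Rewrite used as the equivalence transformation: (a & b) | (a ^ b) == a | b
original_expr    = ("or", ("and", "a", "b"), ("xor", "a", "b"))
transformed_expr = ("or", "a", "b")

def make_sim(xor_impl):
    # A tiny expression "simulator"; xor_impl lets us seed a bug.
    def ev(node, env):
        if isinstance(node, str):
            return env[node]
        op, lhs, rhs = node
        l, r = ev(lhs, env), ev(rhs, env)
        return l & r if op == "and" else l | r if op == "or" else xor_impl(l, r)
    return ev

good_sim  = make_sim(lambda l, r: l ^ r)
buggy_sim = make_sim(lambda l, r: l & r)   # seeded bug: xor acts like and

def deviates(sim, width=2):
    # Exhaustively compare both variants over all inputs of the given width.
    return any(sim(original_expr, {"a": a, "b": b})
               != sim(transformed_expr, {"a": a, "b": b})
               for a, b in itertools.product(range(1 << width), repeat=2))

assert not deviates(good_sim)   # equivalent variants agree
assert deviates(buggy_sim)      # the seeded deviation is caught
```

Note that the oracle is the equivalence itself; no golden reference simulator is needed, which is what makes the metamorphic approach practical for closed-source tools.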
Article Search
Artifacts Available
Article: oopslaa26main-p188-p
Experimental Evaluation Methodology for the Era of No Steady Performance
Jaromír Antoch,
Walter Binder,
Lubomír Bulej,
François Farquet,
Vojtěch Horký,
Aleksandar Prokopec,
Andrea Rosà, and
Petr Tůma
(Charles University, Czech Republic; USI Lugano, Switzerland; Oracle Labs, Switzerland)
Recent studies of virtual machine warm-up have pointed out that even small deterministic microbenchmarks executed in tightly controlled circumstances often do not reach a steady state of peak performance. This impacts performance evaluation methodologies that focus on performance after warm-up, because the lack of a steady state may violate common assumptions made when computing metrics such as the average performance or the confidence interval for that average.
Our work examines the reported lack of steady state in the context of comparatively larger virtual machine workloads. We document and analyze a similar lack of steady state and argue that it should be considered an inherent property of these workloads rather than a fault. We introduce an updated performance evaluation methodology for workloads whose execution exhibits segments of steady state performance separated by sudden performance changes. Using the Renaissance benchmark suite for the Java Virtual Machine, we show that the methodology can produce confidence intervals that miss the true performance over 20% less often than the existing methodologies.
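The segmented view can be illustrated with a toy sketch: rather than one global confidence interval over all measurements, split the run at sudden performance changes and report an interval per steady segment. The change-detection rule here (a jump larger than a threshold between consecutive samples) is a naive stand-in for the paper's method, and the numbers are made up.

```python
# Hypothetical sketch: per-segment confidence intervals for a run whose
# execution exhibits steady segments separated by sudden performance changes.
import statistics

def segment(samples, threshold):
    # start a new segment whenever consecutive samples jump by more than `threshold`
    segments, current = [], [samples[0]]
    for x in samples[1:]:
        if abs(x - current[-1]) > threshold:
            segments.append(current)
            current = []
        current.append(x)
    segments.append(current)
    return segments

def interval(seg, z=1.96):
    # normal-approximation confidence interval for the segment mean
    m = statistics.mean(seg)
    half = z * statistics.stdev(seg) / len(seg) ** 0.5 if len(seg) > 1 else 0.0
    return (m - half, m + half)

# two steady phases separated by a sudden performance change
run = [10.1, 10.0, 10.2, 10.1, 15.0, 15.1, 14.9, 15.0]
for seg in segment(run, threshold=2.0):
    print(interval(seg))
```

A single global interval over `run` would straddle the two phases and describe neither; the per-segment intervals do not.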
Article Search
Article: oopslaa26main-p193-p
SART: Sign-Absolute Reformulation Theory for Binary Variable Reduction in Neural Network Verification
Jin Xu,
Miaomiao Zhang, and
Bowen Du
(Tongji University, China)
Complete formal verification of neural networks is crucial for their deployment in safety-critical domains. A key bottleneck stems from encoding complexity: traditional methods assign one binary variable per unstable ReLU neuron. We propose the Sign-Absolute Reformulation Theory (SART), which fundamentally breaks the conventional one-to-one mapping between unstable neurons and binary variables by establishing formal reducibility criteria. This allows for finer-grained modeling, where each unstable neuron corresponds on average to fewer than one binary variable, thereby reducing verification complexity at its source. Based on SART, we derive a theoretical lower bound on the number of binary variables required for complete verification and, under the assumption that P ≠ NP, prove that variables in the final layer can be compressed by 50%, while the number of variables in intermediate layers cannot be further reduced. To overcome the apparent “last-layer-only” limitation, we recast verification as a sequential process and, crucially, show that the gain lifts to the entire network: LayerABS, a SART-based progressive tightening verifier, iteratively treats intermediate layers as temporary final layers and propagates tight bounds that shrink the global search space and binary-variable counts. Furthermore, we reveal a structural law influencing verification complexity: when the signs of weights of unstable neurons satisfy numerical symmetry, with positive and negative weights equal or differing by at most one, the worst-case verification complexity achieves the theoretical optimum, offering theoretical guidance for the design of verification-friendly architectures. As a general-purpose underlying encoding, the value of SART is independent of specific algorithms. 
To comprehensively evaluate its effectiveness, we first evaluate the abstraction-free SART encoding, and then integrate it with abstraction techniques to construct the complete verifier LayerABS and its incomplete variant Incomplete-LayerABS. Across benchmarks, our methods surpass state-of-the-art baselines, validating SART’s practical impact.
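For background, the one-binary-variable-per-unstable-neuron encoding that SART improves on is the standard big-M MILP formulation (textbook material, not taken from the paper). For a ReLU neuron $y = \max(x, 0)$ whose pre-activation bounds satisfy $l < 0 < u$ (i.e., the neuron is unstable), one binary variable $\delta$ suffices:

```latex
% Standard big-M encoding of one unstable ReLU neuron y = max(x, 0),
% with pre-activation bounds l < 0 < u and one binary variable \delta:
y \ge 0, \qquad
y \ge x, \qquad
y \le u\,\delta, \qquad
y \le x - l\,(1 - \delta), \qquad
\delta \in \{0, 1\}.
```

Setting $\delta = 1$ forces the active branch $y = x$, while $\delta = 0$ forces $y = 0$; SART's contribution is to break this one-to-one correspondence so that, on average, fewer than one binary variable is needed per unstable neuron.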
Article Search
Article: oopslaa26main-p200-p
InspectCoder: Dynamic Analysis-Driven Self Repair through Interactive LLM-Debugger Collaboration
Yunkun Wang,
Yue Zhang,
Guochang Li,
Chen Zhi,
Binhua Li,
Fei Huang,
Yongbin Li, and
Shuiguang Deng
(Zhejiang University, China; Alibaba, China)
Complex logic errors in LLM-generated code are challenging to diagnose and repair. While existing LLM-based self-repair approaches conduct intensive static semantic analysis or rely on superficial execution logs, they miss the in-depth runtime behaviors that often expose bug root causes—lacking the interactive dynamic analysis capabilities that make human debugging effective. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control. Our dual-agent framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions. Unlike existing methods that follow fixed log collection procedures, InspectCoder adaptively inspects and perturbs relevant intermediate states at runtime, and leverages immediate process rewards from debugger feedback to guide multi-step reasoning, transforming the LLM debugging paradigm from blind trial-and-error into systematic root-cause diagnosis. We conduct comprehensive experiments on two challenging self-repair benchmarks: BigCodeBench-R and LiveCodeBench-R. InspectCoder achieves 5.10%–60.37% relative improvements in repair accuracy over the strongest baseline, while delivering 1.67x-2.24x better bug-fix efficiency on the two benchmarks, respectively. We also contribute InspectWare, an open-source middleware that abstracts debugger complexities and maintains stateful debugging sessions across mainstream Python testing frameworks. Our work provides actionable insights into interactive LLM-debugger systems, demonstrating the significant potential of LLM-driven dynamic analysis for automated software engineering.
Article Search
Article: oopslaa26main-p202-p
Geo: A Query Rewrite Framework for Graph Pattern Mining
Nazanin Yousefian,
Kasra Jamshidi,
Keval Vora, and
Anders Miltner
(Simon Fraser University, Canada)
Graph pattern mining is important for analyzing graph data. Graph mining systems typically require answering pattern matching queries, which involve solving the NP-complete subgraph isomorphism problem. To address this, domain experts often develop custom pattern matching query optimization strategies based on exploiting substructural similarities across different patterns. While these optimizers can be effective, their development is challenging due to the complex structural properties of the patterns (e.g., subsymmetries), which are difficult to address. This complexity limits the exploration of interactions between different optimization strategies and restricts experts from continuously improving the optimizers—such as by incorporating additional custom or general pattern-based equivalences over time.
In this paper, we present a programmable pattern matching query optimizer called Geo, which automatically manages the interactions between various equivalences, ensures the optimizations maintain correctness of results, and simplifies the management of substructure equivalences. Geo exposes a simple but flexible language for expressing pattern equivalences as rewrite rules. By maintaining canonical representations of generated patterns during equality saturation, Geo avoids issues arising from syntactic differences in isomorphic patterns. Additionally, we develop embedded reconstructability (EmRec), which tracks provenance across equivalences to meet the various reconstructability needs of desired outputs. Our evaluation demonstrates that Geo can discover novel query equivalences through complex composition of various rewrite rules, enabling our optimized queries to achieve a cost reduction of up to 99% compared to the queries in prior work. We further test Geo’s effectiveness at speeding up practical graph mining problems by using it in two representative case studies – approximate pattern matching and quasi-clique mining – and find it is highly effective at optimizing these tasks, enabling cost reductions of up to 71%.
Article Search
Artifacts Available
Article: oopslaa26main-p206-p
Lawyer: Modular Obligations-Based Liveness Reasoning in Higher-Order Impredicative Concurrent Separation Logic
Egor Namakonov,
Justus Fasse,
Bart Jacobs,
Lars Birkedal, and
Amin Timany
(Aarhus University, Denmark; KU Leuven, Belgium)
Higher-order concurrent separation logics, such as Iris, have been tremendously successful in verifying safety properties of concurrent programs. However, state-of-the-art attempts to verify liveness properties in such logics have so far either lacked modularity (the ability to compose specifications of independent modules), or they have been far too complex to mechanize in a proof assistant. In this work, we introduce Lawyer --- a mechanized program logic for modular verification of (fair) termination of concurrent programs. Lawyer draws inspiration from state-of-the-art approaches that use obligations for specifying and proving termination. However, unlike these approaches, which incorporate obligations by instrumenting the source code with erasable auxiliary code and state, Lawyer avoids such instrumentations. Instead, Lawyer incorporates obligations into the logic by embedding them into a purely logical labeled transition system that the program is shown to refine --- this makes Lawyer far more amenable to mechanization. We demonstrate the expressivity of Lawyer by verifying termination of a range of examples, including modular verification of a client program whose termination relies on correctness of a fair lock library, and (separately) proving that a ticket lock implementation implements that library's interface. To the best of our knowledge, Lawyer is the first mechanized program logic that supports modular higher-order impredicative liveness specifications of program modules. All the results that appear in the paper have been mechanized in the Rocq proof assistant on top of the Iris separation logic framework.
Article Search
Artifacts Available
Article: oopslaa26main-p211-p
LARTS: Language Abstractions for Real-Time and Secure Systems
Yanqi Li,
Hongliang Liang,
Rui Yao,
Yang Zhang,
Dong Liu,
Lei Wang, and
Qiuping Yi
(Beijing University of Posts and Telecommunications, China; Beijing Institute of Computer Technology and Application, China)
Real-time systems must simultaneously deliver predictable timing, fault isolation, and memory safety, yet current operating systems expose only low-level primitives that force developers to manually balance concurrency, isolation, and performance. This paper presents LARTS, a language-aided runtime system that elevates these requirements into language abstractions with enforceable semantics. LARTS introduces execution domain, a unified process–thread abstraction that combines thread-level responsiveness with process-level isolation. Memory is managed through deterministic memory contracts, which bind allocation at load time to eliminate runtime failures and unpredictable latencies. Domain interactions are expressed via deterministic communication channels that integrate efficient transfer, type safety, and priority inheritance, ensuring analyzable end-to-end bounds. Moreover, LARTS enforces secure-by-construction semantics, making classes of bugs such as double fetch and use-after-free unrepresentable in the programming model. We formalize the core semantics of LARTS and show how they guarantee determinism and safety by design. A prototype built on RTEMS demonstrates that LARTS preserves competitive real-time performance while substantially reducing programming complexity and eliminating vulnerabilities in realistic case studies. Our results suggest that high-assurance real-time programming can be treated not as an ad-hoc engineering problem, but as a first-class abstraction with verifiable semantics.
Article Search
Artifacts Available
Article: oopslaa26main-p216-p
Grammar Repair with Examples and Tree Automata
Yunjeong Lee,
Gokul Rajiv, and
Ilya Sergey
(National University of Singapore, Singapore)
Context-free grammars (CFGs) are the de-facto formalism for declaratively describing concrete syntax for programming languages and generating parsers. One of the major challenges in defining a desired syntax is ruling out all possible ambiguities in the CFG productions that determine scoping rules as well as operator precedence and associativity. Practical tools for parser generation typically apply ad-hoc approaches for resolving such ambiguities, which might result in a parser's behaviour that contradicts the intents of the language designer. In this work, we present a user-friendly approach to soundly repair grammars with ambiguities, which is inspired by the programming by example line of research in automated program synthesis. At the heart of our approach is the interpretation of both the initial CFG and additional examples that define the desired restrictions in precedence and associativity, as tree automata (TAs). The technical novelties of our approach are (1) a new TA learning algorithm that constructs an automaton based on the original grammar and examples that encode the user's preferred ways of resolving ambiguities all in a single TA, and (2) an efficient algorithm for TA intersection that utilises reachability analysis and optimizations that significantly reduce the size of the resulting automaton, which results in idiomatic CFGs amenable to parser generators. We have proven the soundness of the algorithms, and implemented our approach in a tool called Greta, demonstrating its effectiveness on a series of case studies.
Preprint
Artifacts Available
Article: oopslaa26main-p219-p
Online Input Grammar Synthesis Aided Symbolic Execution
Ke Ma,
Yunlai Luo,
Zhenbang Chen,
Weijiang Hong,
Yufeng Zhang, and
Ji Wang
(National University of Defense Technology, Changsha, China; Hunan University, Changsha, China)
Symbolic execution faces the challenge of generating valid inputs when analyzing programs with complex input formats. Token-based symbolic execution can partially tackle this challenge, but it still struggles to pass input checking and therefore fails to analyze the code beyond it. We propose Lase, an online input grammar synthesis aided symbolic execution method, to generate valid inputs for improving the effectiveness of symbolic execution. Inside Lase, we propose an input grammar-oriented search strategy and a token-level grammar synthesis method. The search strategy prioritizes paths that cover more syntax rules. The token-level grammar synthesis improves the synthesized grammar's precision and completeness while ensuring efficiency. The experimental results on real-world parsing programs with complex input grammars demonstrate that Lase can improve the coverage of parsing code and generate more valid inputs that significantly improve the coverage of functionality code. Furthermore, compared with the state-of-the-art grammar synthesis methods, the grammars learned by Lase have better precision and recall on most benchmark programs.
Article Search
Article: oopslaa26main-p232-p
Designing GPU Data Structures for Efficient Memory Oversubscription
Vipin Patel,
Srinjoy Sarkar,
Swarnendu Biswas, and
Mainak Chaudhuri
(IIT Kanpur, India)
Efficient concurrent data structures are important building blocks for accelerating applications on GPUs. With the ever-increasing memory footprint of GPU workloads, data structures used by kernels can exceed global memory capacity. Using the unified virtual memory (UVM) model is a popular approach for kernels to oversubscribe GPU memory without the need for explicit memory management by a programmer. However, we show that data structures executing with UVM can suffer from performance degradation due to the high overheads associated with data migration and thrashing for irregular access patterns.
In this paper, we propose two-level hierarchical designs for hash table and skip list data structures that aim to maximize access locality and handle use cases where the data structure oversubscribes GPU memory. The outer-level container enables efficient jumps to desired regions of the data structure, while the inner container allows operating on the data. The inner container is sized to facilitate efficient data transfers between the CPU and the GPU. Experimental results on a diverse set of input operation sequences show that our data structure designs substantially improve performance over optimized UVM baselines while supporting high degrees of GPU memory oversubscription. Importantly, our proposed design, when used to implement key-value stores in metagenomics classification and k-mer counting applications, achieves a geomean speedup of 2.06× for hash table and 2.37× for skip list over baseline UVM implementations.
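The two-level idea can be sketched in Python (a deliberately simplified, hypothetical model; the paper's containers are GPU-resident and concurrent): an outer directory provides the "jump" to a region of the data structure, and fixed-capacity inner containers are the unit of data transfer between CPU and GPU memory.

```python
# Illustrative two-level hash table layout (not the paper's implementation).
INNER_CAPACITY = 4  # inner containers sized to match an efficient CPU<->GPU transfer granularity

class TwoLevelHashTable:
    def __init__(self, outer_size=8):
        # outer level: a directory slot per hash bucket, each holding inner containers
        self.outer = [[] for _ in range(outer_size)]

    def insert(self, key, value):
        chain = self.outer[hash(key) % len(self.outer)]
        for inner in chain:
            if len(inner) < INNER_CAPACITY:
                inner.append((key, value))
                return
        chain.append([(key, value)])  # allocate a fresh inner container

    def lookup(self, key):
        chain = self.outer[hash(key) % len(self.outer)]
        for inner in chain:          # one "jump", then a spatially local scan
            for k, v in inner:
                if k == key:
                    return v
        return None

t = TwoLevelHashTable()
t.insert("kmer", 42)
print(t.lookup("kmer"))  # -> 42
```

The point of the layout is locality: under oversubscription, a lookup touches one directory slot and one (or a few) inner containers, so migrations between host and device move compact, useful units rather than thrashing across the whole table.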
Article Search
Article: oopslaa26main-p244-p
Understanding and Finding JIT Compiler Performance Bugs
Zijian Yi,
Cheng Ding,
August Shi, and
Milos Gligoric
(University of Texas at Austin, USA)
Just-in-time (JIT) compilers are key components for many popular programming languages with managed runtimes (e.g., Java and JavaScript). JIT compilers perform optimizations and generate native code at runtime based on dynamic profiling data, to improve the execution performance of the running application. Like other software systems, JIT compilers might have software bugs, and prior work has developed a number of automated techniques for detecting functional bugs (i.e., generated native code does not semantically match that of the original code). However, no prior work has targeted JIT compiler performance bugs, which can cause significant performance degradation while an application is running. These performance bugs are challenging to detect due to the complexity and dynamic nature of JIT compilers. In this paper, we present the first work on demystifying JIT performance bugs. First, we perform an empirical study across four popular JIT compilers for Java and JavaScript. Our manual analysis of 191 bug reports uncovers common triggers of performance bugs, patterns in which these bugs manifest, and their root causes. Second, informed by these insights, we propose layered differential performance testing, a lightweight technique to automatically detect JIT compiler performance bugs, and implement it in a tool called Jittery. We incorporate practical optimizations into Jittery such as test prioritization, which reduces testing time by 92.40% without compromising bug-detection capability, and automatic filtering of false positives and duplicates, which substantially reduces manual inspection effort. Using Jittery, we discovered 12 previously unknown performance bugs in the Oracle HotSpot and Graal JIT compilers, with 11 confirmed and 6 fixed by developers.
Article Search
Article: oopslaa26main-p246-p
Deegen: A JIT-Capable VM Generator for Dynamic Languages
Haoran Xu and
Fredrik Kjolstad
(Stanford University, USA)
Building a high-performance JIT-capable VM for a dynamic language has traditionally required a tremendous amount of time, money, and expertise. We present Deegen, a meta-compiler that allows users to generate a high-performance JIT-capable VM for their own language at an engineering cost similar to writing a simple interpreter. Deegen takes in the execution semantics of the bytecodes implemented as C++ functions, and automatically generates a two-tier VM execution engine with a state-of-the-art interpreter, a state-of-the-art baseline JIT, and the tier-switching logic that connects them into a self-adaptive system.
We are the first to show how to automatically generate a baseline JIT compiler that rivals the current state of the art, and an interpreter that outperforms the current state of the art. Our performance comes from Deegen's ability to automatically apply many state-of-the-art optimizations that previously had to be hand-implemented. These optimizations include bytecode specialization and quickening, register pinning, tag register optimization, call inline caching, generic inline caching, JIT polymorphic IC, JIT IC inline slab, type-check removal and strength reduction, type-based slow-path extraction and outlining, JIT hot-cold code splitting, and JIT OSR-entry. As a result, the performance of the Deegen-generated interpreter and baseline JITs matches or surpasses state-of-the-art interpreters and baseline JITs.
To evaluate Deegen, we use it to implement two languages: a Lua 5.1 VM called LuaJIT Remake (LJR) and a SOM VM called DSOM. Across 44 benchmarks, LJR's interpreter is on average 2.79x faster than the official PUC Lua interpreter, and 1.31x faster than LuaJIT's interpreter. LJR's baseline JIT has negligible compilation cost, and its execution performance is on average 4.60x faster than PUC Lua and only 33% slower (but faster on 13/44 benchmarks) than LuaJIT's optimizing JIT. Across 13 benchmarks, DSOM's interpreter is 4.28x--5.82x faster than the five existing SOM interpreters, and DSOM's baseline JIT compiles 25.84x faster than 2SOM's baseline JIT, while also generating code that runs 15.46x faster.
Article Search
Article: oopslaa26main-p258-p
Automatic Propagation of Profile Information through the Optimization Pipeline
Elisa Frohlich,
Angelica Moreira, and
Fernando Magno Quintão Pereira
(Federal University of Minas Gerais, Brazil; Microsoft Research, Brazil)
Profile-guided optimization (PGO) is a well-established technique for improving program performance, being integrated into major compilers such as GCC, LLVM/Clang, and Microsoft Visual C++. PGO collects information about a program's execution and uses it to guide optimizations such as inlining and code layout. However, these very transformations alter the program's control flow, rendering the collected profiles stale or inaccurate. To deal with this problem, this paper investigates how to reuse profile data after optimization without re-executing the program. We study two complementary strategies: prediction, which estimates likely hot code paths in the optimized program, and projection, which transfers profile information from the original control-flow graph to its transformed version. We evaluate several techniques for reconstructing profile data, including a large language model (LLM)–based approach using GPT-4o, and a lightweight method that compares opcode histograms of code regions recursively to identify structural similarities. Our results show that the histogram-based method is not only simpler but also consistently more accurate than both the LLM-based approach and prior prediction and projection techniques, including those implemented in LLVM and the BOLT binary optimizer.
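A minimal sketch of the histogram idea, with made-up regions and weights (the paper's matching works recursively over code regions; this flattens it to a single comparison): match each optimized region to the original region with the most similar opcode distribution, then carry over that region's profile weight.

```python
# Hypothetical sketch of opcode-histogram-based profile projection.
from collections import Counter

def similarity(h1, h2):
    # negative L1 distance between opcode histograms (Counter returns 0 for missing keys)
    keys = set(h1) | set(h2)
    return -sum(abs(h1[k] - h2[k]) for k in keys)

# original regions: (opcode histogram, profile count) -- illustrative values only
original_regions = {
    "loop.header": (Counter({"cmp": 2, "br": 1, "add": 1}), 1000),
    "loop.exit":   (Counter({"ret": 1, "mov": 2}), 1),
}

def project_profile(optimized_histogram):
    # transfer the profile weight of the structurally most similar original region
    best = max(original_regions,
               key=lambda name: similarity(original_regions[name][0], optimized_histogram))
    return original_regions[best][1]

# an optimized block that still "looks like" the loop header after transformation
print(project_profile(Counter({"cmp": 2, "br": 1, "add": 2})))  # -> 1000
```

The appeal of the approach is exactly this robustness: the optimized block need not be syntactically identical to its source region, only distributionally similar.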
Article Search
Article: oopslaa26main-p259-p
PLEX: Normalization for Refinement Types
Alessio Ferrarini,
Niki Vazou, and
Wouter Swierstra
(IMDEA Software Institute, Spain; Universidad Politécnica de Madrid, Spain; Utrecht University, Netherlands)
Refinement types often use SMT solvers to automate program verification. However, since SMT solvers are first-order, verification of properties that require higher-order reasoning is not possible. Proof by Logical Evaluation (PLE) is an algorithm that provides a layer between refinement types and SMT solvers that permits symbolic evaluation of functions, but it lacks support for higher-order reasoning. We introduce PLEX, an extension to PLE, that supports η-expansions, β-reductions, and dependent pattern matching. We prove that PLEX is sound and terminating, describe its implementation in Liquid Haskell, and evaluate it on examples that make essential use of higher-order data and as such cannot be handled by PLE. The new PLEX algorithm bridges the gap between higher-order languages and first-order SMT solvers via refinement types.
Article Search
Article: oopslaa26main-p262-p
EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows
Chenyan Liu,
Yun Lin,
Jiaxin Chang,
Jiawei Liu,
Binhang Qi,
Bo Jiang,
Zhiyong Huang, and
Jin Song Dong
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Bytedance Network Technology, China)
Large language models (LLMs) for code editing have achieved remarkable progress, yet recent empirical studies reveal a fundamental disconnect between technical accuracy and developer productivity. Despite their strong benchmark performance, developers complete tasks 19% slower when using AI assistance, with over 68.81% of recommendations disrupting their mental flow. This misalignment stems from the use of static commit snapshots that lack temporal information, causing models to optimize for end results rather than the incremental, context-sensitive steps that align with developers’ natural reasoning process.
To bridge this gap, we present EditFlow, which benchmarks and optimizes subsequent code edit recommendation systems through the reconstruction of developer editing flows. EditFlow addresses three key challenges. First, collecting edit-order data that reflects developers’ flow is inherently difficult: manual annotation introduces prohibitive overhead, while development logs capture only single trajectories instead of all plausible editing flows. Second, benchmarking recommendation performance against developers’ ongoing editing flow requires a digital-twin-like simulation that can faithfully simulate the editing process. Third, existing heterogeneous systems vary drastically in scale and architecture, posing challenges for developing a unified optimization strategy that endows all models with mental-flow awareness regardless of design or capability.
To overcome these challenges, we propose three tightly coupled components: (1) a prompt auto-tuning mechanism that learns an optimized prompt for inferring the relative order between two edits, (2) a digital twin that replays reconstructed edit sequences to simulate developers’ editing process, and (3) EditFlow, a unified optimization strategy that optimizes the flow continuity of subsequent edit suggestions based on developers’ ongoing flow. Evaluations across diverse benchmarks, including manually annotated commits, real-world industrial code, and open-source repositories, show that EditFlow improves order reconstruction accuracy by 63.81%, reduces flow violations by over 75%, and boosts recommendation precision by 66.99%. A user study with 32 developers further demonstrates 25.11% faster task completion and significantly higher perceived recommendation quality. To the best of our knowledge, EditFlow is the first to evaluate and optimize code edit recommendation systems from the perspective of developers’ mental flow, establishing flow-awareness as a new dimension for advancing human-AI code collaboration.
Article Search
Info
Artifacts Available
Article: oopslaa26main-p268-p
CLower: Detecting Compiler Pessimization Bugs through Redundant Memory Accesses
Jianhao Xu,
Kunbo Zhang,
Mathias Payer,
Kangjie Lu, and
Bing Mao
(Southeast University, China; Nanjing University, China; EPFL, Switzerland; University of Minnesota, USA)
Compilers are expected to generate optimized code, but they sometimes introduce pessimizations, quality-degrading redundant instructions. These bugs not only incur performance overhead but also, critically, expand the attack surface by introducing unexpected side effects (e.g., redundant memory accesses) without breaking compilation correctness. Existing bug-finding methods are neither designed for nor effective at identifying such security-sensitive pessimizations. This paper presents CLower, a novel, black-box approach for automatically detecting compiler pessimizations via redundant memory accesses. CLower's core insight is that any extra global memory accesses in a fully optimized binary, compared to the source, indicate a pessimization. To reliably distinguish compiler-introduced redundancy from source-level redundancy, we generate random C programs in which each global variable has a predetermined, controlled number of memory accesses. CLower then executes the instrumented binary and verifies whether superfluous accesses have been introduced during compilation. We applied CLower to GCC and LLVM, reporting 23 unique bugs (21 in GCC, 2 in Clang), with 16 confirmed as new pessimization bugs. Our evaluation shows that CLower accurately detects diverse, impactful pessimization bugs, the majority of which (75%) also manifest for heap-allocated objects, demonstrating that the underlying compiler flaws are general and not limited to global memory. Furthermore, we identify a systematic conflict between compiler optimizations and pessimization bugs, which causes many such bugs to remain hidden in compiler versions. This study sheds light on the under-explored area of compiler pessimization and provides a practical tool for improving compiler quality.
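The oracle behind this design can be sketched abstractly (simulated counts, not a real harness): the generator plans a fixed number of accesses to each global, and any surplus observed in the compiled binary flags a pessimization. Names and counts below are made up for illustration.

```python
# Hypothetical sketch of CLower's oracle. In the real tool, `EXPECTED` comes
# from the random C program generator and `observed` from executing the
# instrumented, fully optimized binary; here both are plain dictionaries.
EXPECTED = {"g1": 3, "g2": 1}  # accesses to each global, planned by the generator

def check(observed):
    # report globals that the compiled binary touched more often than planned,
    # together with the number of surplus (compiler-introduced) accesses
    return {g: observed[g] - n for g, n in EXPECTED.items() if observed.get(g, 0) > n}

print(check({"g1": 3, "g2": 1}))  # faithful compilation -> {}
print(check({"g1": 5, "g2": 1}))  # redundant reloads of g1 -> {'g1': 2}
```

Because the planned counts are controlled by construction, any surplus can only come from the compiler, which is what lets the approach stay black-box.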
Preprint
Artifacts Available
Article: oopslaa26main-p272-p
Beyond Coverage: Automatic Test Suite Augmentation for Enhanced Effectiveness using Large Language Models
Zeyu Lu,
Peng Zhang,
Yuge Nie,
Yibiao Yang,
Yutian Tang,
Chun Yong Chong, and
Yuming Zhou
(Nanjing University, China; University of Glasgow, UK; Monash University Malaysia, Subang Jaya, Malaysia)
Large Language Models (LLMs) have gained significant traction in software engineering for automating tasks such as unit test generation. Most existing studies prioritize code coverage as the primary metric for enhancing test suite effectiveness. However, prior research has shown that although code coverage can reach approximately 80%, the mutation score, which generally exhibits a stronger correlation with defect detection effectiveness, attains only about 35%. This gap highlights the need to enhance test suite effectiveness guided by mutation score rather than code coverage. Recent studies, including MuTAP and MutGen, explored the use of survived mutants to enhance test suite effectiveness. However, their evaluations were limited to simple standalone methods that rely on built-in functions and standard libraries. Non-standalone methods, which depend on other classes and involve complex user-defined types, are more intricate and commonly found in real-world projects. The limited contextual information and basic repair mechanisms in their prompt designs make it unclear whether their performance can generalize to non-standalone methods. Moreover, the two studies rely on existing language-specific, rule-based mutation techniques, which require specific configurations and incur additional costs when adapting to other programming languages.
To bridge this gap, we propose a novel, fully automatic LLM-based approach to enhance test suite effectiveness, guided by survived mutants. The approach augments initial test suites by integrating mutation testing with test case generation. It takes focal method information as input and generates test cases targeting survived mutants identified from applying the initial test suites. Our approach incorporates multiple prompt techniques, rich contextual information, and an advanced repair mechanism to effectively generate test cases for non-standalone methods. The evaluation covers 1,035 focal methods, categorized as standalone or non-standalone. On average, the mutation score increases by 16.04% for standalone methods and 8.11% for non-standalone methods. We validate the practical impact of augmented test suites in LLM-based code generation. After test suite augmentation, pass@1 decreased by 0.3152 and 0.1772 on average for standalone and non-standalone methods, respectively, indicating the effectiveness of our approach in reducing false positives caused by insufficient test cases in code generation evaluation.
Article Search
Article: oopslaa26main-p275-p
Type-Safe Monotonic Object Evolution
Alexandra Mirrlees-Black,
Haoyu Wu,
Gregor Richards, and
Fabian Muehlboeck
(Australian National University, Australia; University of Waterloo, Canada)
Object evolution is a monotonic approach to typestate and object reclassification: objects may gain, but not lose, properties, which permits aliasing. We present a formalization and prototype implementation of our new language May, featuring inheritance-based evolution that changes the run-time class of an object to a subclass. To statically guarantee that evolution succeeds, we introduce a simple affine permission system for ensuring evolvable references match the run-time type of an object. Furthermore, we demonstrate that our system provides an effective and type-safe way of expressing staged operations and complex initialization procedures.
Article Search
Artifacts Available
Article: oopslaa26main-p276-p
Localizing Type Errors for Syntactic Sugar by Lifting
Zhichao Guan,
Tailai Yu,
Di Wang, and
Zhenjiang Hu
(Peking University, China)
Syntactic sugar enhances the usability of a core language by providing intuitive syntax in a surface language; however, its interaction with the core-language type checker often results in error messages that are unclear to surface programmers. Existing techniques, such as type lifting, can automatically infer typing rules for syntactic sugar, but they do not consider localizing type errors directly in the surface syntax. This paper studies the problem of localizing and reporting type errors for syntactic sugar, addressing two key challenges: precisely localizing errors and ensuring that they are fixable. Inspired by the recently proposed marked lambda calculus (MLC), we develop ℓMLC as our core language which tracks error provenance and locations via type annotations. Building on this, we propose the Stellar framework, which automatically lifts the core language's typing rules to the surface language while enabling error localization in the surface syntax. Stellar also ensures that the reported errors are fixable by incorporating extra premises into the lifted typing rules. We implement provenance and evaluate it across various surface languages with different type structures, demonstrating that our approach precisely localizes errors and avoids unhelpful references to core-language constructs. Our evaluation suggests that provenance can help surface programmers address type errors more effectively, enhancing the practicality of syntactic sugar in language engineering.
Article Search
Article: oopslaa26main-p284-p
When Lifetimes Liberate: A Type System for Arenas with Higher-Order Reachability Tracking
Siyuan He,
Songlin Jia,
Yuyan Bao, and
Tiark Rompf
(Purdue University, USA; Augusta University, USA)
Statically enforcing safe resource management is challenging due to tensions between flexible lifetime disciplines and expressive sharing patterns. Region-based systems offer lexically scoped regions under a stack discipline, wherein resources are managed in bulk. In many such systems, however, resources are second-class and can neither escape their scope nor be freely returned from functions. Ownership and linear type systems, such as Rust, offer first-class, non-lexical lifetimes with robust static guarantees, but rely on invariants that limit higher-order patterns and expressive sharing.
In this work, we propose a type system that uniformly treats all heap-allocated resources under diverse lifetime, granularity, and sharing settings. Our system provides programmers with three allocation modes: (1) fresh allocation for first-class, non-lexical resources; (2) fresh allocation for second-class resources with lexically bounded lifetimes; and (3) coallocation that groups resources by shadow arenas for bulk tracking and deallocation. Regardless of mode, resources are represented uniformly at the type level, supporting generic abstraction and preserving the higher-order parametric nature of the language.
Obtaining static safety in higher-order languages with flexible sharing is nontrivial. To address this, our solution builds on reachability types, and our extension adds the capability to track both individual and grouped resources, enables the expression of cyclic store structures, and allows the selective enforcing of stack lifetime discipline. These mechanisms are formalized in the A<: and {A}<: type systems, which are proven type safe and memory safe in Rocq.
Article Search
Artifacts Available
Article: oopslaa26main-p305-p
Determining the Unreachable: Constraint-Guided Reachability Analysis for Dependency Vulnerabilities
Wenbu Feng,
Xiaohong Li,
Ruitao Feng,
Yao Zhang,
Yuekang Li,
Zhiping Zhou,
Yunqian Wang, and
Yuqing Li
(Tianjin University, China; Southern Cross University, Australia; UNSW Sydney, Australia)
In software development, determining the reachability of dependency vulnerabilities is of great importance, as third-party libraries often contain known vulnerabilities that could be exploited through the application's business logic. Existing reachability analysis methods encounter challenges such as undecidability, abstraction loss, and path explosion in large-scale programs, resulting in an inaccurate distinction between reachable and unreachable vulnerabilities. This paper introduces ConVReach, an approach for analyzing the reachability of vulnerabilities in the dependencies of C/C++ programs. ConVReach overcomes the high abstraction loss and potential path explosion of current methods by combining static and dynamic approaches, in particular a constraint-guided analysis. It extracts and decomposes the path constraints that trigger vulnerabilities, independently verifies the satisfiability of each constraint, and then aggregates the feasible paths. This effectively reduces unnecessary path exploration and avoids the path explosion common in traditional methods. Experimental results show that ConVReach outperforms existing tools in both accuracy and efficiency, effectively distinguishing between reachable and unreachable vulnerabilities and significantly reducing false positives and false negatives.
We constructed a benchmark dataset to evaluate ConVReach, which includes 53 CVEs and 347 flags artificially inserted into various open-source projects. This dataset was designed to simulate both real-world vulnerabilities and complex scenarios. Through testing on this dataset, ConVReach demonstrated exceptional performance. It successfully identified 59 out of 61 reachable vulnerabilities and all 23 unreachable ones in the CVE dataset. Within a 24-hour time budget, ConVReach detected over 50% more reachable vulnerabilities than the baseline tools in the first 6 hours and nearly completed the detection of reachable vulnerabilities by the 12-hour mark. These results highlight ConVReach's superior ability to handle both real-world vulnerabilities and challenging cases.
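The decompose-then-aggregate constraint check described above can be illustrated with a minimal Python sketch (hypothetical code, not ConVReach itself; a brute-force satisfiability check over finite domains stands in for a real constraint solver):

```python
from itertools import product

def path_feasible(constraints, domains):
    """Decompose a path condition into conjuncts, check each independently,
    then confirm the conjunction for surviving paths.

    constraints: list of (vars, predicate) pairs; domains: var -> iterable.
    Checking conjuncts independently first prunes infeasible paths cheaply
    before the more expensive joint check.
    """
    # Stage 1: each conjunct must be satisfiable on its own variables.
    for vars_, pred in constraints:
        doms = [list(domains[v]) for v in vars_]
        if not any(pred(*vals) for vals in product(*doms)):
            return False  # one unsatisfiable conjunct kills the whole path
    # Stage 2: joint check over all variables, only for surviving paths.
    all_vars = sorted({v for vars_, _ in constraints for v in vars_})
    doms = [list(domains[v]) for v in all_vars]
    for vals in product(*doms):
        env = dict(zip(all_vars, vals))
        if all(pred(*(env[v] for v in vars_)) for vars_, pred in constraints):
            return True
    return False
```

Here a path whose conjuncts are individually satisfiable but jointly contradictory survives stage 1 and is only rejected during aggregation, mirroring the two-phase structure sketched in the abstract.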
Article Search
Article: oopslaa26main-p311-p
Mixed Choice in Asynchronous Multiparty Session Types
Laura Bocchi,
Raymond Hu,
Adriana Laura Voinea, and
Simon Thompson
(University of Kent, UK; Queen Mary University of London, UK; University of Glasgow, UK)
We present a multiparty session type (MST) framework with asynchronous mixed choice (MC). We propose a core construct for MC that allows transient inconsistencies in protocol state between distributed participants, but ensures all participants can always eventually reach a mutually consistent state. We prove the correctness of our system by establishing a progress property and an operational correspondence between global types and distributed local type projections. Based on our theory, we implement a practical toolchain for specifying and validating asynchronous MST protocols featuring MC, and programming compliant gen_statem processes in Erlang/OTP. We test our framework by using our toolchain to specify and reimplement part of the amqp_client of the RabbitMQ broker for Erlang.
Article Search
Artifacts Available
Article: oopslaa26main-p321-p
RandSet: Randomized Corpus Reduction for Fuzzing Seed Scheduling
Yuchong Xie,
Kaikai Zhang,
Yu Liu,
Rundong Yang,
Ping Chen,
Shuai Wang, and
Dongdong She
(Fudan University, China; Hong Kong University of Science and Technology, China)
Seed explosion is a fundamental problem in fuzzing seed scheduling. It occurs when a fuzzer maintains a corpus with a huge number of seeds and fails to choose a promising one. Existing seed scheduling works focus on seed prioritization but still suffer from seed explosion, since the corpus size remains huge. We tackle seed explosion from a new perspective, corpus reduction, i.e., computing a subset of the seed corpus. Corpus reduction can eliminate redundant seeds in the corpus and significantly reduce corpus size. However, it could lead to poor diversity in seed selection and severely impact fuzzing performance. Meanwhile, effective corpus reduction incurs large runtime overhead. In practice, it is challenging to adopt corpus reduction in fuzzing seed scheduling. Prior techniques like cull_queue, afl-cmin, and MinSet all suffer from poor seed diversity. afl-cmin and MinSet incur prohibitive runtime overhead and are hence only applicable to one-time initial seed selection rather than high-frequency seed scheduling.
We propose a novel randomized corpus reduction technique, RandSet, that can reduce the corpus size and yield diverse seed selection simultaneously. Meanwhile, the runtime overhead of RandSet is minimal, suiting a high-frequency seed scheduling task. Our key insight is to introduce randomness into corpus reduction so as to enjoy the two benefits of a randomized algorithm: randomized output (i.e., diverse seed selection) and low runtime overhead. Specifically, we formulate the corpus reduction in seed scheduling as a classic set cover problem and compute a randomized subset of seed corpus as a set cover to cover all features of the entire corpus. We then develop a novel seed scheduling approach using the randomized corpus subset. Our technique can effectively mitigate seed explosion by scheduling a small and randomized subset of the corpus rather than the entire corpus.
We implement RandSet on three popular fuzzers, AFL++, LibAFL, and Centipede, to showcase its general algorithmic design. We perform a comprehensive evaluation of RandSet on three benchmarks: standalone programs, FuzzBench, and Magma. Our evaluation results show that RandSet achieves significantly more diverse seed selection than other corpus reduction techniques. RandSet also yields a high reduction ratio, achieving average subset ratios of 4.03% and 5.99% on standalone programs and FuzzBench programs, respectively. In terms of fuzzing performance gained from our randomized corpus reduction, RandSet achieves a 16.58% gain on standalone programs and up to a 3.57% gain on FuzzBench programs in AFL++. RandSet triggers up to 7 more ground-truth bugs than the state-of-the-art fuzzer on Magma, while introducing only 3.93% overhead on standalone programs and as low as 1.17% overhead on FuzzBench.
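The set-cover formulation can be sketched in a few lines of Python (an illustrative greedy cover with random tie-breaking, not the actual RandSet implementation; the `corpus` representation is assumed):

```python
import random

def randomized_set_cover(corpus, rng=random):
    """Pick a random subset of seeds covering every feature in the corpus.

    corpus: dict mapping seed id -> set of covered features.
    Greedy set cover with random tie-breaking: each invocation can return a
    different valid cover, which is one way randomness yields diverse seed
    selection while still covering all features of the entire corpus.
    """
    universe = set().union(*corpus.values())
    uncovered, cover = set(universe), []
    seeds = list(corpus)
    while uncovered:
        rng.shuffle(seeds)  # randomized order -> random tie-breaking in max()
        best = max(seeds, key=lambda s: len(corpus[s] & uncovered))
        cover.append(best)
        uncovered -= corpus[best]
    return cover
```

Scheduling would then draw seeds from the (small, randomized) cover rather than the full corpus; recomputing the cover periodically re-randomizes the selection.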
Preprint
Article: oopslaa26main-p340-p
LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer
Kunpeng Zhang,
Dongwei Xiao,
Daoyuan Wu,
Shuai Wang,
Jiali Zhao,
Yuanyi Lin,
Tongtong Xu, and
Shaohua Wang
(Hong Kong University of Science and Technology, Hong Kong; Lingnan University, Hong Kong; Huawei, China; Central University of Finance and Economics, China)
Deep learning (DL) libraries are widely used in critical applications, where even subtle silent bugs can lead to serious consequences. While existing DL fuzzing techniques have made progress in detecting crashes, they inherently struggle to detect silent bugs due to the lack of effective test programs and corresponding oracles.
Building on the observation that historical bug reports contain rich, underutilized information about silent bugs, we leverage large language models (LLMs) to perform versatile yet controlled bug transfer for silent bug fuzzing. Specifically, our approach uses LLMs to extract context-aware bug patterns from historical issues, match semantically related Application Programming Interfaces (APIs) using functionality-based embeddings, and synthesize test cases with customized oracles. This enables proactive detection of silent bugs by transferring high-risk contexts and oracle designs from known buggy APIs to functionally similar target APIs. To ensure the reliability of our context-aware bug transfer, we introduce an LLM-powered self-validation module that systematically evaluates the validity of each transferred bug instance. We implement this methodology in a tool named TransFuzz and evaluate it on three mainstream DL libraries: PyTorch, TensorFlow, and MindSpore. TransFuzz successfully discovers 79 previously unknown bugs (12 confirmed as Common Vulnerabilities and Exposures (CVEs)) in 10 bug types, demonstrating its effectiveness and generalizability in migrating DL library bug discovery capabilities.
Article Search
Article: oopslaa26main-p341-p
Effectively Propositional Higher-Order Functional Programming
Nicholas V. Lewchenko,
Kunha Kim,
Bor-Yuh Evan Chang, and
Gowtham Kaki
(University of Colorado Boulder, USA; Amazon, USA)
Decidable automation is a key feature of program verification tools, which makes them easier for non-expert developers to use and understand. Unfortunately, decidable fragments of logic are very restrictive, and not ideal for the expression of idiomatic programs. The decidable Extended EPR fragment, combined with an encoding technique called relational abstraction, has seen wide use for its ability to handle both quantifiers and uninterpreted functions. However, the complexity of the relational abstraction encoding, combined with its inherent incompleteness, still poses a significant obstacle to non-experts.
In this paper, we show that Extended EPR and relational abstraction can be deployed to achieve decidable, quantified verification within a familiar, principled domain: a higher-order, purely-functional programming language. We observe that, by defining the semantics of our language in terms of partial functions, we obtain a well-behaved, three-valued logic that matches the behavior of the relational abstraction encoding while hiding its complexity. We demonstrate that our prototype implementation can ergonomically replicate EPR-based program verification benchmarks and verify recursive programs on inductive datatypes.
Article Search
Artifacts Available
Article: oopslaa26main-p348-p
Mechanised Semantics of Multi-stage Programming
Ka Wing Li,
Maite Kramarz,
Ningning Xie, and
Jeremy Yallop
(University of Cambridge, UK; University of Toronto, Canada)
Multi-stage programming (MSP) languages such as MetaML have subtle semantics, in which familiar properties often fail to hold and hazardous interactions with other language features such as state or polymorphism abound. The ongoing incorporation of MSP features into general purpose languages makes the need to establish confidence in their design increasingly pressing.
Taking inspiration from existing MSP systems, we present a Rocq mechanisation of a core calculus for compile-time and run-time MSP with effects, λrun, formally establishing key properties such as type and elaboration soundness and phase distinction. We hope that our mechanised semantics will be a useful basis for formal study of other designs, easing the extension of existing languages with support for MSP.
Article Search
Article: oopslaa26main-p354-p
Differential Execution with Lexical Tracing
Sebastian Erdweg,
Runqing Xu, and
Mo Bitar
(KIT, Germany)
Incremental computing promises large speed-ups after small input edits. Yet, most incrementality approaches merely skip unchanged work and recompute the remaining sub-computations, even when the inputs change only slightly. Differential execution avoids this by propagating data changes (i.e., deltas), and prior work has shown how to develop a provably correct differential big-step semantics. Unfortunately, that semantics must still replay the original computation at every step, squandering much of the potential gain of incrementalization. While the semantics clearly needs caching to avoid recomputations, a sound and efficient caching discipline is challenging. First, each execution step must be uniquely identified; second, the identifier must remain stable even when the preceding control flow changes. To this end, we develop lexical tracing, which identifies execution steps through their path in the derivation tree of the big-step semantics. We then extend differential execution with lexical tracing and caching to deliver, for the first time, a formally verified, asymptotically efficient account of differential execution for imperative languages. In particular, we developed a novel mechanized theory of cache stability for lexical traces and their semantic rules, which was essential in proving the differential caching semantics correct and complete in Rocq.
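The lexical-trace idea can be sketched for a toy expression evaluator (hypothetical Python, not the paper's mechanized semantics): each evaluation step is cached under its path in the derivation tree, so cache keys stay stable even when earlier control flow changes.

```python
def eval_traced(expr, env, cache, path=()):
    """Evaluate `expr` while keying every step by its lexical path.

    The key is the position of the step in the derivation tree (the path of
    premise indices from the root), not a global step counter, so it remains
    stable when the preceding control flow changes.
    expr: nested tuples ("lit", n) | ("var", name) | ("add", e1, e2).
    """
    key = (path, expr, tuple(sorted(env.items())))
    if key in cache:
        return cache[key]
    op = expr[0]
    if op == "lit":
        val = expr[1]
    elif op == "var":
        val = env[expr[1]]
    else:  # ("add", e1, e2): extend the lexical path per premise
        val = eval_traced(expr[1], env, cache, path + (0,)) + \
              eval_traced(expr[2], env, cache, path + (1,))
    cache[key] = val
    return val
```

A second run over an unchanged subtree hits the cache instead of replaying the computation, which is the replay cost the paper's caching discipline is designed to avoid.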
Article Search
Article: oopslaa26main-p381-p
Prunario: Testing Autonomous Driving Systems by Pruning Likely Redundant Scenarios
Minsu Kim,
Sunbeom So, and
Hakjoo Oh
(Korea University, Republic of Korea)
We present Prunario, a novel technique for effectively testing autonomous driving systems (ADS). Ensuring the safety of ADS is critical, as their failures can lead to severe casualties. While ADS testing methods have advanced in recent years, they remain unsatisfactory in generating diverse test scenarios that induce distinct driving behaviors, a key requirement for thoroughly evaluating ADS across different situations. To address this, Prunario employs a novel simulation prediction technique to estimate ADS runtime behavior and prune redundant test scenarios that yield similar driving records. Experimental results demonstrate Prunario's effectiveness: it uncovered 23 previously undetected bugs in an industrial-strength ADS and outperformed three state-of-the-art testing techniques.
Article Search
Artifacts Available
Article: oopslaa26main-p383-p
Fail Faster: Staging and Fast Randomness for High-Performance PBT
Cynthia Richey,
Joseph W. Cutler,
Harrison Goldstein, and
Benjamin C. Pierce
(University of Pennsylvania, USA; University at Buffalo, USA)
Property-based testing (PBT) relies on generators for random test cases, often constructed using embedded domain specific languages, which provide expressive combinators for building and composing generators. The effectiveness of PBT depends critically on the speed of these generators. However, careful measurements show that the generator performance of widely used PBT libraries falls well short of what is possible, due principally to (1) the abstraction overhead of their combinator-heavy style and (2) suboptimal sources of randomness. We characterize, quantify, and address these bottlenecks.
To eliminate abstraction overheads, we propose a technique based on multi-stage programming, dubbed Allegro. We apply this technique to leading generator libraries in OCaml and Scala 3, significantly improving performance. To quantify the performance impact of the randomness source, we carry out a controlled experiment, replacing the randomness in the OCaml PBT library with an optimized version. Both interventions exactly preserve the semantics of generators, enabling precise, pointwise comparisons. Together, these improvements find bugs up to 13× faster.
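The staging idea can be illustrated with a toy generator language in Python (hypothetical, not the Allegro implementation): interpreting a combinator tree pays dispatch overhead on every draw, while staging partially evaluates the tree once into a flat closure with the same sampling semantics.

```python
import random

def interp(gen, rng):
    """Interpret a generator combinator tree on every draw (slow path)."""
    tag = gen[0]
    if tag == "int":                     # ("int", lo, hi)
        return rng.randint(gen[1], gen[2])
    if tag == "pair":                    # ("pair", g1, g2)
        return (interp(gen[1], rng), interp(gen[2], rng))
    if tag == "map":                     # ("map", fn, g)
        return gen[1](interp(gen[2], rng))

def stage(gen):
    """Partially evaluate the tree once into a closure with no per-draw
    dispatch: the staging idea behind removing combinator overhead."""
    tag = gen[0]
    if tag == "int":
        lo, hi = gen[1], gen[2]
        return lambda rng: rng.randint(lo, hi)
    if tag == "pair":
        f, g = stage(gen[1]), stage(gen[2])
        return lambda rng: (f(rng), g(rng))
    if tag == "map":
        fn, g = gen[1], stage(gen[2])
        return lambda rng: fn(g(rng))
```

Because the staged closure draws randomness in the same order as the interpreter, the two are pointwise equal for the same seed, which is the "exactly preserve the semantics" property the evaluation relies on.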
Article Search
Artifacts Available
Article: oopslaa26main-p387-p
DeCo: A Core Calculus for Incremental Functional Programming with Generic Data Types
Timon Böhler,
Tobias Reinhard,
David Richter, and
Mira Mezini
(TU Darmstadt, Germany; hessian.AI, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany)
Incrementalization speeds up computations by avoiding unnecessary recomputations and by efficiently reusing previous results.
While domain-specific techniques achieve impressive speedups, e.g., in the context of database queries, they are difficult to generalize.
Meanwhile, general approaches offer little support for incrementalizing domain-specific operations.
In this work, we present DeCo, a novel core calculus for incremental functional programming with support for a wide range of user-defined data types.
Despite its generic nature, our approach statically incrementalizes domain-specific operations on user-defined data types.
It is, hence, more fine-grained than other generic techniques which resort to treating domain-specific operations as black boxes.
We mechanized our work in Lean and proved it sound, meaning incrementalized execution computes the same result as full reevaluation.
We also provide an executable implementation with case studies featuring examples from linear algebra, relational algebra, dictionaries, trees, and conflict-free replicated data types,
plus a brief performance evaluation on linear and relational algebra and on trees.
Article Search
Artifacts Available
Article: oopslaa26main-p391-p
Detecting Flaky Tests by Controlling Nondeterministic API Behavior
Hengchen Yuan,
Jiefang Lin, and
August Shi
(University of Texas at Austin, USA)
Regression testing is an essential part of software development to ensure high-quality software; but the presence of flaky tests makes the testing outcomes unreliable. It is essential to proactively detect flaky tests, so developers are aware of them early on and can react appropriately in case they fail. While there has been prior work in detecting flaky tests, they either are developed to focus on specific flaky test types or can be imprecise in their analyses, such as by relying on AI models for prediction.
We present ChaosAPI, a framework that supports detecting a variety of different types of flaky tests. Our insight is that the most effective approach to detecting flaky tests is to target the nondeterministic components that lead tests to exhibit flaky behavior in the first place. In particular, we target specific APIs within the Java Standard Library that all Java code relies on and that are known to exhibit nondeterministic behavior, such as those related to system time, concurrency, and environmental factors. During test execution, ChaosAPI modifies the behavior of these API calls, perturbing their inputs and return values in a systematic manner while still remaining compliant with the API specification. We can detect flaky tests by observing whether tests that previously passed now fail when run through ChaosAPI.
Our evaluation on a prior dataset of known flaky tests, as well as running on test suites of other popular open-source projects, demonstrates that ChaosAPI not only detects more flaky tests than simple rerunning across a wide range of projects but also detects them more efficiently, making ChaosAPI a practical addition to the toolbox of flaky test detection techniques.
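The perturbation idea can be sketched in Python (a hypothetical analogue of the Java-level interception; all names here are illustrative): a replacement clock that stays spec-compliant, i.e., monotonically nondecreasing, while injecting bounded jitter exposes tests that implicitly assume adjacent clock reads are close together.

```python
import itertools
import time
from unittest import mock

def perturbed_time(jitter_seq, base=time.time):
    """Build a time.time replacement that adds jitter drawn from jitter_seq
    while staying monotonically nondecreasing (spec-compliant)."""
    jitters = itertools.cycle(jitter_seq)
    last = [float("-inf")]
    def fake_time():
        t = base() + next(jitters)
        last[0] = max(last[0], t)  # never go backwards
        return last[0]
    return fake_time

def timing_sensitive_test():
    # A test that implicitly assumes two adjacent clock reads are close:
    start = time.time()
    assert time.time() - start < 1.0, "flaky under clock perturbation"

def detect_flaky(test_fn, schedules):
    """Rerun the test under different jitter schedules; divergent verdicts
    across schedules reveal flakiness."""
    verdicts = []
    for sched in schedules:
        with mock.patch("time.time", perturbed_time(sched)):
            try:
                test_fn()
                verdicts.append("pass")
            except AssertionError:
                verdicts.append("fail")
    return verdicts
```

Under a zero-jitter schedule the test passes; under a schedule that delays the second read by five seconds it fails, so the divergence flags the test as flaky without any rerunning on real nondeterminism.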
Article Search
Article: oopslaa26main-p396-p
From Raw Pointers to Memory Safety: A Modular Demand-Driven Typestate Analysis for Rust
Wei Li,
Wenyao Chen, and
Jingling Xue
(UNSW Sydney, Australia)
Rust combines high performance with strong memory safety through strict ownership and borrowing rules. However, its unsafe mode reintroduces vulnerabilities by allowing raw-pointer manipulation, a major source of memory-safety bugs.
Existing whole-program analyses for Rust often suffer from low recall and high false positives. Since unsafe code is typically small and isolated, we propose a demand-driven alternative. We present Pincer, a flow-, field-, and context-sensitive dataflow analysis framework built on IFDS. Pincer performs mutually bidirectional analysis—backward to trace raw-pointer origins and forward to explore aliases—adapting this strategy to Rust’s ownership model and low-level semantics.
On this foundation, Pincer performs a modular, bottom-up vulnerability-oriented typestate analysis to detect use-after-free and double-free bugs. It tracks raw-pointer aliasing and nullness, exploits strong updates at container-manipulating returns, and leverages Rust’s safety invariants to prune provably safe regions via AXM checking. The modular design enables controlled exploration, optionally under a budget, improving scalability. Controlled unsoundness further boosts efficiency while maintaining high recall and precision.
We evaluate Pincer on vulnerable programs and large Rust projects. The results show that Pincer detects memory-safety errors more accurately than state-of-the-art analyses while maintaining practical efficiency.
Article Search
Article: oopslaa26main-p409-p
Speak Now: Safe Actor Programming with Multiparty Session Types
Simon Fowler and
Raymond Hu
(University of Glasgow, UK; Queen Mary University of London, UK)
Actor languages such as Erlang and Elixir are widely used for implementing scalable and reliable distributed applications, but the informally-specified nature of actor communication patterns leaves systems vulnerable to costly errors such as communication mismatches and deadlocks. Multiparty session types (MPSTs) rule out communication errors early in the development process, but until now, the many-sender, single-receiver nature of actor communication has made it difficult for actor languages to benefit from session types.
This paper introduces Maty, the first actor language design supporting both static multiparty session typing and the full power of actors taking part in multiple sessions. Maty therefore combines the error prevention mechanism of session types with the scalability and fault tolerance of actor languages. Our main insight is to enforce session typing through a flow-sensitive effect system, combined with an event-driven programming style and first-class message handlers. Using MPSTs allows us to guarantee communication safety: a process will never send or receive an unexpected message, nor will a session get stuck because an actor is waiting for a message that will never be sent. We extend Maty to support Erlang-style supervision and cascading failure, and show that this preserves Maty’s strong metatheory. We implement Maty in Scala using an API generation approach, and demonstrate the expressiveness of our model by implementing a representative sample of the widely-used Savina actor benchmark suite; an industry-supplied factory scenario; and a chat server.
Preprint
Artifacts Available
Article: oopslaa26main-p508-p
A Tale of 1001 LoC: Potential Runtime Error-Guided Specification Synthesis for Verifying Large-Scale Programs
Zhongyi Wang,
Tengjie Lin,
Mingshuai Chen,
Haokun Li,
Mingqi Yang,
Xiao Yi,
Shengchao Qin,
Yixing Luo,
Xiaofeng Li,
Bin Gu,
Liqiang Lu, and
Jianwei Yin
(Zhejiang University, China; Peking University, China; Chinese University of Hong Kong, China; Xidian University, China; Beijing Institute of Control Engineering, China)
Fully automated verification of large-scale software and hardware systems is arguably the holy grail of formal methods. Large language models (LLMs) have recently demonstrated their potential for enhancing the degree of automation in formal verification by, e.g., generating the formal specifications essential to deductive verification, yet they exhibit poor scalability due to long-context reasoning limitations and, more importantly, the difficulty of inferring complex, interprocedural specifications. This paper presents Preguss, a modular, fine-grained framework for automating the generation and refinement of formal specifications. Preguss synergizes static analysis and deductive verification by steering two components in a divide-and-conquer fashion: (i) potential runtime error-guided construction and prioritization of verification units, and (ii) LLM-aided synthesis of interprocedural specifications at the unit level. We show that Preguss substantially outperforms state-of-the-art LLM-based approaches and, in particular, enables highly automated RTE-freeness verification for real-world programs with over a thousand LoC, reducing human verification effort by 80.6% to 88.9%.
Preprint
Artifacts Available
Article: oopslaa26main-p510-p
Frashokereti: Non-aborting Optimistically Replicated Objects
Eric Man Chan,
Javad Saberlatibari, and
Mohsen Lesani
(University of California at Riverside, USA; University of California at Santa Cruz, USA)
Optimistic replication of objects avoids coordination and brings higher responsiveness and availability. However, when clients issue concurrent operations, conflicts naturally arise which can lead the replicated states to diverge or lose integrity. When conflicts occur, existing approaches resort to pessimism or to aborting. This paper characterizes ORDTs (Optimistically Replicated Data Types), objects that can be optimistically replicated with convergence and integrity, and without aborting calls. It shows that ORDTs subsume CRDTs and transformed relational schemas, and presents techniques to convert objects to ORDTs. It further proves that optimistic replication for objects that fall outside ORDTs is aborting and NP-complete. Further, it presents an optimistic replication protocol for ORDTs called Frashokereti. It uses a statically decided order to efficiently order calls. The paper proves that Frashokereti is sound for every ORDT, i.e., Frashokereti is optimistic and non-aborting, and preserves convergence, integrity, and liveness properties. Experimental results show that Frashokereti significantly outperforms previous optimistic protocols.
Article Search
Article: oopslaa26main-p517-p
Context-Free Language Reachability via Efficient Relation Chaining
Chenghang Shi,
Haofeng Li,
Jie Lu, and
Lian Li
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Zhongguancun Laboratory, China)
Context-free language (CFL) reachability is a fundamental framework widely used to model a variety of program analysis tasks, though it often suffers from inherent inefficiency due to its (sub)cubic time complexity.
In this paper, we propose a novel perspective, relation chaining, which interprets CFL-reachability solving as the process of chaining labeled edges representing binary relations.
This formulation exposes substantial derivation redundancy (in terms of frequent and repetitive chaining operations) arising from inefficient chaining strategies employed by existing approaches.
To address this, we introduce SQUID, a new algorithm that incorporates two simple yet effective chaining techniques—adaptive chaining and differential chaining—built upon an enhanced graph representation.
We have implemented SQUID as a standalone tool and evaluated it against two state-of-the-art CFL-reachability solvers and a leading Datalog solver across three key program analyses: field-sensitive alias analysis and context-sensitive value-flow analysis for C/C++, and field-sensitive points-to analysis for Java.
Experimental results show that SQUID substantially improves the scalability of CFL-reachability solving by effectively reducing a large portion of redundant chaining operations.
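The relation-chaining view of CFL-reachability can be sketched as a worklist algorithm in Python (illustrative only, not SQUID itself; the grammar encoding is an assumption): each label denotes a binary relation over nodes, and a production A → B C is solved by chaining the B-relation with the C-relation.

```python
from collections import defaultdict

def cfl_reach(edges, productions):
    """Solve CFL-reachability by chaining labeled edges as binary relations.

    edges: set of (u, label, v); productions:
      (head, (body,))        # unary:  A -> B
      (head, (left, right))  # binary: A -> B C, chaining B then C
    A derived edge (u, A, v) means u reaches v along a path spelling a word
    in L(A). The worklist processes only newly derived edges, the repeated
    re-chaining of which is what efficient chaining strategies try to avoid.
    """
    out = defaultdict(set)  # label -> set of (u, v)
    for u, l, v in edges:
        out[l].add((u, v))
    worklist = list(edges)
    def add(u, l, v):
        if (u, v) not in out[l]:
            out[l].add((u, v))
            worklist.append((u, l, v))
    while worklist:
        u, l, v = worklist.pop()
        for head, body in productions:
            if body == (l,):
                add(u, head, v)
            elif len(body) == 2:
                b, c = body
                if l == b:  # chain (u,B,v) with some (v,C,w)
                    for x, w in list(out[c]):
                        if x == v:
                            add(u, head, w)
                if l == c:  # chain some (w,B,u) with (u,C,v)
                    for w, x in list(out[b]):
                        if x == u:
                            add(w, head, v)
    return out
```

For example, the grammar P → e | P P computes transitive closure over e-edges; each derived P-edge is chained exactly once against the current relations, and the redundancy SQUID targets shows up as repeated chaining of the same pairs in naive strategies.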
Article Search
Artifacts Available
Article: oopslaa26main-p533-p
Process-Centric Analysis of Agentic Software Systems
Shuyang Liu,
Yang Chen,
Rahul Krishna,
Saurabh Sinha,
Jatin Ganhotra, and
Reyhaneh Jabbarvand
(University of Illinois at Urbana-Champaign, USA; IBM Research, USA)
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution, i.e., trajectories, is inherently stochastic and adaptive to the problems they are solving. Evaluation of such systems is often outcome-centric, i.e., judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows.
Using Graphectory, we automatically analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses (completed within four minutes) reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents’ problem-solving strategies vary with both problem difficulty and the underlying LLM—for resolved issues, the strategies often follow coherent localization–patching–validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; and (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories.
We also implement a novel technique for real-time construction and analysis of Graphectory and Langutory during the agent’s execution to flag trajectory issues. Upon detecting such issues in the trajectory, the proposed technique notifies the agent with a diagnostic message and, when applicable, rolls back the trajectory. The experimental results show that online monitoring and process-centric analysis, when accompanied by appropriate interventions, can improve resolution rates by 6.9%-23.5% across models for problematic instances, while significantly shortening trajectories with near-zero overhead.
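The abstract does not spell out Graphectory's schema; as a hedged sketch of the general idea it names (encoding a trajectory's steps as graph nodes connected by temporal order and by semantic kind), one might write something like the following, where the trajectory, action names, and grouping are illustrative assumptions rather than the paper's actual representation:

```python
from collections import defaultdict

# Hypothetical agent trajectory: (step index, action kind) pairs.
# The action names are illustrative, not the paper's actual vocabulary.
trajectory = [
    (0, "localize"), (1, "read_file"), (2, "edit"),
    (3, "run_tests"), (4, "edit"), (5, "run_tests"), (6, "submit"),
]

def build_graph(traj):
    """Encode a trajectory as temporal edges plus semantic groupings."""
    # Temporal edges: each step points to its successor in execution order.
    temporal = [(a[0], b[0]) for a, b in zip(traj, traj[1:])]
    # Semantic relation: group step indices by the kind of action taken.
    semantic = defaultdict(list)
    for step, kind in traj:
        semantic[kind].append(step)
    return temporal, dict(semantic)

temporal, semantic = build_graph(trajectory)
# Repeated edit -> run_tests steps surface as structure in the graph,
# which a process-centric metric could flag as a retry or backtracking loop.
assert semantic["edit"] == [2, 4]
```

On such a structure, process-centric metrics (path length, repetition, branching) become ordinary graph queries rather than manual trajectory inspection.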
Article Search
Article: oopslaa26main-p534-p
Reframing Paths as Logic: Semantic Segmentation for Vulnerability Detection
Zong Cao,
Yuqiang Sun,
Zhengzi Xu,
Kaixuan Li,
Yeqi Fu,
Yiran Zhang,
Ziqiao Kong, and
Yang Liu
(Imperial Global Singapore, Singapore; Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Path-sensitive vulnerabilities, such as use-after-free, integer overflows, and command injection, pose significant challenges for traditional static analysis tools, which often face trade-offs between precision, scalability, and interpretability. To address these challenges, we present SEVDF (Semantic-Enhanced Vulnerability Detection Framework), a novel methodology that integrates may-analysis taint propagation with large language models (LLMs) to detect path-related vulnerabilities in large C/C++ codebases. SEVDF begins by constructing a program dependency graph and performing a sound but incomplete taint analysis to extract all potentially vulnerable paths. After segmentation, deduplication, feasibility checking, and semantic summarization by LLMs, the vulnerable paths are reassembled and confirmed by LLMs for inter-procedural feasibility and semantic consistency.
We evaluate SEVDF on the Juliet Test Suite (thirteen representative CWE categories) and a curated real-world dataset of 71 vulnerabilities across 9 projects. SEVDF consistently outperforms the default CodeQL rules, CodeQL rules with all unnecessary constraints removed, and three open-source detectors: Infer, Cppcheck, and CodeChecker. SEVDF achieves 100% precision on several CWEs while maintaining or improving recall on the Juliet benchmark. Moreover, our segment-based design reduces the analysis workload for LLMs by 90.6% compared to direct-path prompting through Logic Unit deduplication, making SEVDF cost-effective for large-scale deployment. Finally, SEVDF uncovered and reported 29 zero-day vulnerabilities (12 confirmed to date), including 3 CVEs in VirtualBox, demonstrating practical value.
Article Search
Article: oopslaa26main-p549-p
IRIDIUM: A Framework for Statically Optimizing JavaScript Programs
Meetesh Kalpesh Mehta,
Anirudh Garg,
Aneeket Yadav, and
Manas Thakur
(IIT Bombay, India; IIT Delhi, India)
Static analysis of JavaScript remains notoriously difficult due to the language’s dynamically typed nature, unconventional scoping rules, and pervasive side effects. Unlike mature infrastructures such as LLVM for C/C++ or Soot for Java, comparable frameworks for JavaScript are fragmented and limited in scope. In this paper, we introduce IRIDIUM, a first-of-its-kind framework to statically optimize JavaScript programs. IRIDIUM systematically lowers JavaScript into a structured intermediate representation (called IRI) that models bindings, environments, and control flow explicitly. The resultant expressiveness enables more predictable analyses and transformations, ranging from dataflow tracking to optimization passes to executable code generation for existing runtimes, that are otherwise hindered by the language’s complexity. By bridging the gap between JavaScript’s surface syntax and the requirements of static analysis, IRIDIUM, thus, lays the foundation for a new generation of tools that can reason effectively about modern JavaScript applications.
Preprint
Artifacts Available
Article: oopslaa26main-p554-p
Block Tests
Kevin Guan,
Pengyue Jiang,
Milos Gligoric, and
Owolabi Legunsen
(Cornell University, USA; University of Texas at Austin, USA)
Inline tests validate single program statements and were shown to find single-statement bugs or kill mutants that unit tests miss. Inline tests complement unit tests by enabling testing at a finer granularity than methods. So, inline tests can more easily find faults in target statements that unit tests do not reach, or where errors do not propagate to unit tests’ oracles. But, the limitation to single statements means inline tests cannot validate data or control flow across code fragments—sequences of multiple statements in a method.
We motivate the need for testing arbitrary fragments and propose block tests, which generalize inline tests and validate code fragments. To motivate, we discuss six software testing needs (e.g., due to increasing usage of lambdas in imperative code) for which unit tests are too coarse grained and inline tests are too fine grained. To bridge this gap, we propose syntax and semantics for specifying inputs, expected outputs, and scope of block tests. We also implement a block-test development kit (BDK) for writing and running block tests in Java.
We evaluate block tests and BDK in two ways. First, we write 1,012 block tests for 346 fragments in 146 open-source projects. Developer-written unit tests do not cover 58.7% of these fragments, and automated unit-test generation does not reach 46% of them even after 30.8 CPU days. But, each block test takes us 2.2 minutes to write and 0.9 seconds to run on average. Second, we use mutation testing to evaluate the fault-finding effectiveness of block tests in fragments that unit tests cover. Block tests kill 4,418 of 9,554 mutants that survived unit tests. These results provide initial but strong evidence of block tests’ feasibility and utility. We outline an agenda for future research on block testing.
Article Search
Article: oopslaa26main-p561-p
A Minimalist Proof Language for Neural Theorem Proving over Isabelle/HOL
Qiyuan Xu,
Renxi Wang,
Peixin Wang,
Haonan Li, and
Conrad Watt
(Nanyang Technological University, Singapore; MBZUAI, United Arab Emirates; East China Normal University, China)
Neural Theorem Proving (NTP) employs Large Language Models (LLMs) to automate formal proofs in proof assistants.
While LLMs have achieved remarkable success in informal reasoning tasks using natural languages, the transition to mechanized formal theorem proving presents persistent challenges.
Mechanized proof languages often contain many syntactic constructs and diverse, specialized proof tactics, which facilitate expert use but have no direct counterpart in informal mathematical proofs.
These prover-specific idioms represent an additional burden for LLM-based NTPs that might be otherwise successful in generating informal proofs.
Seeking to bridge this gap between formal proof construction and informal reasoning, in order to better facilitate NTP, this work approaches these challenges from a language design perspective.
We look at common reasoning patterns in informal proofs and in existing mechanized proofs, and design Minilang (formally named Isabelle/Minilang), a minimalist proof language that captures these reasoning patterns.
In contrast to proof languages (informal and formal) that often feature a large collection of operations with unclear semantic boundaries, Minilang is deliberately kept minimalist --- its core design comprises only 10 proof operations, each with clear semantic distinctions.
We further develop a rule-based translator from Isabelle's proof language (Isar) to Minilang, translating ~340,000 existing Isabelle proofs with an ~85% success rate.
Using this translated corpus, we finetune two LLMs to compare machine learning performance on Minilang versus the original Isar language.
Experiments show Minilang benefits the two LLMs, improving the pass@1 success rate on the PISA benchmark by up to 20 and 29 percentage points compared to Isar-based LLMs with and without Sledgehammer, respectively.
The pass@1 rate reaches 69.1%, exceeding the prior work Baldur's pass@64 (65.7%); the pass@8 rate reaches 79.2%, exceeding the state-of-the-art on PISA (71.0%) achieved by Magnushammer.
Preprint
Info
Artifacts Available
Article: oopslaa26main-p581-p