OOPSLA2 2025
Proceedings of the ACM on Programming Languages, Volume 9, Number OOPSLA2

OOPSLA2 – Journal Issue


Frontmatter

Title Page


Editorial Message
The Proceedings of the ACM series presents the highest-quality research conducted in diverse areas of computer science, as represented by the ACM Special Interest Groups (SIGs). Proceedings of the ACM on Programming Languages (PACMPL) focuses on research on all aspects of programming languages, from design to implementation and from mathematical formalisms to empirical studies. The journal operates in close collaboration with the Special Interest Group on Programming Languages (SIGPLAN) and is committed to making high-quality peer-reviewed scientific research in programming languages free of restrictions on both access and use.

Sponsors



Regular Papers

Exploring the Theory and Practice of Concurrency in the Entity-Component-System Pattern
Patrick Redmond, Jonathan Castello, José Manuel Calderón Trilla, and Lindsey Kuper
(University of California at Santa Cruz, USA; Haskell Foundation, Canada)
The Entity-Component-System (ECS) software design pattern, long used in game development, encourages a clean separation of identity (entities), data properties (components), and computational behaviors (systems). Programs written using the ECS pattern are naturally concurrent, and the pattern offers modularity, flexibility, and performance benefits that have led to a proliferation of ECS frameworks. Nevertheless, the ECS pattern is little-known and not well understood outside of a few domains. Existing explanations of the ECS pattern tend to be mired in the concrete details of particular ECS frameworks, or they explain the pattern in terms of imperfect metaphors or in terms of what it is not. We seek a rigorous understanding of the ECS pattern via the design of a formal model, Core ECS, that abstracts away the details of specific implementations to reveal the essence of software using the ECS pattern. We identify a class of Core ECS programs that behave deterministically regardless of scheduling, enabling use of the ECS pattern as a deterministic-by-construction concurrent programming model. With Core ECS as a point of comparison, we then survey several real-world ECS frameworks and find that they all leave opportunities for deterministic concurrency unexploited. Our findings point out a space for new ECS implementation techniques that better leverage such opportunities.
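
For readers unfamiliar with the pattern, the following minimal Python sketch (illustrative only, not the paper's Core ECS model) shows the three ingredients the abstract names: entities as plain identifiers, components as per-entity data stores, and systems as functions over those stores.

    # Minimal ECS sketch (illustrative; not the paper's Core ECS model).
    # Entities are integer identifiers, components are per-entity data
    # stores, and systems are functions that read and update those stores.
    positions = {}   # entity id -> (x, y) component
    velocities = {}  # entity id -> (dx, dy) component

    def spawn(eid, pos, vel):
        positions[eid] = pos
        velocities[eid] = vel

    def movement_system(dt):
        # A system iterates over the entities carrying the components it needs.
        for eid, (dx, dy) in velocities.items():
            x, y = positions[eid]
            positions[eid] = (x + dx * dt, y + dy * dt)

    spawn(0, (0.0, 0.0), (1.0, 0.5))
    movement_system(dt=1.0)
    print(positions[0])  # (1.0, 0.5)

Two systems that touch disjoint component stores can run concurrently without affecting the result, which is the kind of scheduling-independent behavior the Core ECS determinism result characterizes.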

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Quantified Underapproximation via Labeled Bunches
Lang Liu, Farzaneh Derakhshan, Limin Jia, Gabriel A. Moreno, and Mark Klein
(Illinois Institute of Technology, USA; Carnegie Mellon University, USA)
Given the high cost of formal verification, a large system may include differently analyzed components: a few are fully verified, and the rest are tested. Currently, there is no reasoning system that can soundly compose these heterogeneous analyses and derive the overall formal guarantees of the entire system. The traditional compositional reasoning technique—rely-guarantee reasoning—is effective for verified components, which undergo over-approximated reasoning, but not for those components that undergo under-approximated reasoning, e.g., using testing or other program analysis techniques.
The goal of this paper is to develop a formal, logical foundation for composing heterogeneous analyses, deploying both over-approximated (verification) and under-approximated (testing) reasoning. We focus on systems that can be modeled as a collection of communicating processes. Each process owns its internal resources and a set of channels through which it communicates with other processes. The key idea is to quantify the guarantees obtained about the behavior of a process as a test level, which captures the constraints under which this guarantee is analyzed to be true. We design a novel proof system LabelBI based on the logic of bunched implications that enables rely-guarantee reasoning principles for a system of differently analyzed components. We develop trace semantics for this logic, against which we prove our logic is sound. We also prove cut elimination of our sequent calculus. We demonstrate the expressiveness of our logic via a case study.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Pyrosome: Verified Compilation for Modular Metatheory
Dustin Jamner, Gabriel Kammer, Ritam Nag, and Adam Chlipala
(Massachusetts Institute of Technology, USA; Intel, USA)
We present Pyrosome, a generic framework for modular language metatheory that embodies a novel approach to extensible semantics and compilation, implemented in Coq. Common techniques for semantic reasoning are often tied to the specific structures of the languages and compilers that they support. Contextual equivalence is difficult to work with directly, and both logical relations and transition system-based approaches typically fix a specific notion of effect globally. While modular transition systems have been effective in imperative settings, they are suboptimal for functional code. These limitations restrict the extension and composition of semantics in these systems. In Pyrosome, verified compilers are fully extensible, meaning that to extend a language simply requires defining and verifying the compilation of the new feature, reusing the old correctness theorem for all other cases. The novel enabling idea is an inductive formulation of equivalence preservation that supports the addition of new rules to the source language, target language, and compiler.
Pyrosome defines a formal, deeply embedded notion of programming languages with semantics given by dependently sorted equational theories, so all compiler-correctness proofs boil down to type-checking and equational reasoning. We support vertical composition of any compilers expressed in our framework in addition to feature extension. Since our design requires compilers to support open programs, our correctness guarantees support linking with any target code of the appropriate type. As a case study, we present a multipass compiler from System F with simple references, through CPS translation and closure conversion. Specifically, we demonstrate how we can build such a compiler incrementally by starting with a compiler for simply typed lambda-calculus and adding natural numbers, the unit type, recursive functions, and a global heap, then extending judgments with a type environment and adding type abstraction, all while reusing the original theorems. We also present a linear version of the simply typed CPS pass and compile a small imperative language to the simply typed target to show how Pyrosome handles substructural typing and imperative features.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Boosting Program Reduction with the Missing Piece of Syntax-Guided Transformations
Zhenyang Xu, Yongqiang Tian, Mengxiao Zhang, and Chengnian Sun
(University of Waterloo, Canada; Monash University, Australia)
Program reduction is a widely used technique in testing and debugging language processors. Given a program that triggers a bug in a language processor, program reduction searches for a canonicalized and minimized program that triggers the same bug, thereby facilitating bug deduplication and simplifying the debugging process. To improve reduction performance without sacrificing generality, prior research has leveraged the formal syntax of the programming language as guidance. Two key syntax-guided transformations—Compatible Substructure Hoisting and Quantified Node Reduction—were introduced to enhance this process. While these transformations have proven effective to some extent, their application excessively prunes the search space, preventing the discovery of many smaller results. Consequently, there remains significant potential for further improvement in overall reduction performance.
To this end, we propose a novel syntax-guided transformation named Structure Form Conversion (SFC) to complement the aforementioned two transformations. Building on SFC, we introduce three reduction methods: Smaller Structure Replacement, Identifier Elimination, and Structure Canonicalization, designed to effectively and efficiently leverage SFC for program reduction. By integrating these reduction methods into previous language-agnostic program reducers, Perses and Vulcan, we implement two prototypes named SFCPerses and SFCVulcan. Extensive evaluations show that SFCPerses and SFCVulcan significantly outperform Perses and Vulcan in both minimization and canonicalization. Specifically, compared to Perses, SFCPerses produces programs that are 36.82%, 18.71%, and 41.05% smaller on average on the C, Rust, and SMT-LIBv2 benchmarks at the cost of 3.65×, 16.99×, and 1.42× the time of Perses, respectively. Similarly, SFCVulcan generates programs that are 14.51%, 7.65%, and 7.66% smaller than those produced by Vulcan at the cost of 1.56×, 2.35×, and 1.42× the execution time of Vulcan. Furthermore, in an experiment with a benchmark suite containing 3,796 C programs that trigger 46 unique bugs, SFCPerses and SFCVulcan reduce 442 and 435 more duplicates (programs that trigger the same bug) to identical programs than Perses and Vulcan, respectively.

Publisher's Version
Static Inference of Regular Grammars for Ad Hoc Parsers
Michael Schröder and Jürgen Cito
(TU Wien, Austria)
Parsing—the process of structuring a linear representation according to a given grammar—is a fundamental activity in software engineering. While formal language theory has provided theoretical foundations for parsing, the most common kind of parsers used in practice are written ad hoc. They use common string operations without explicitly defining an input grammar. These ad hoc parsers are often intertwined with application logic and can result in subtle semantic bugs. Grammars, which are complete formal descriptions of input languages, can enhance program comprehension, facilitate testing and debugging, and provide formal guarantees for parsing code. But writing grammars—e.g., in the form of regular expressions—can be tedious and error-prone. Inspired by the success of type inference in programming languages, we propose a general approach for static inference of regular input string grammars from unannotated ad hoc parser source code. We use refinement type inference to synthesize logical and string constraints that represent regular parsing operations, which we then interpret with an abstract semantics into regular expressions. Our contributions include a core calculus λΣ for representing ad hoc parsers, a formulation of (regular) grammar inference as refinement inference, an abstract interpretation framework for solving string refinement variables, and a set of abstract domains for efficiently representing the constraints encountered during regular ad hoc parsing. We implement our approach in the PANINI system and evaluate its efficacy on a benchmark of 204 Python ad hoc parsers. Compared with state-of-the-art approaches, PANINI produces better grammars (100% precision, 93% average recall) in less time (0.82 ± 2.85 s) without prior knowledge of the input space.
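
As a hypothetical illustration (not one of the paper's benchmark programs), the following Python function is the kind of ad hoc parser the abstract describes, and the comment shows the regular grammar a tool in the spirit of PANINI would aim to recover from the code alone.

    # A hypothetical ad hoc parser: it structures its input with ordinary
    # string operations but never states an input grammar explicitly.
    def parse_version(s):
        parts = s.split(".")
        if len(parts) != 3:
            raise ValueError("expected MAJOR.MINOR.PATCH")
        if not all(p.isdigit() for p in parts):
            raise ValueError("fields must be numeric")
        return tuple(int(p) for p in parts)

    # The accepted input language is (roughly) the regular expression
    #   [0-9]+ \. [0-9]+ \. [0-9]+
    # and static grammar inference aims to recover it without running the code.
    print(parse_version("1.12.3"))  # (1, 12, 3)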

Publisher's Version Published Artifact Artifacts Available
RestPi: Path-Sensitive Type Inference for REST APIs
Mark W. Aldrich, Kyla H. Levin, Michael Coblenz, and Jeffrey S. Foster
(Tufts University, USA; University of Massachusetts at Amherst, USA; University of California at San Diego, USA)
REST APIs form the backbone of modern interconnected systems by providing a language-agnostic communication interface. REST API specifications should clearly describe all response types, but automatically generating specifications is difficult with existing tools.
We present REST𝜋, a type inference engine capable of automatically generating REST API specifications. The novel contribution of REST𝜋 is our use of path-sensitive type inference, which encodes symbolic path-constraints directly into a type system. This allows REST𝜋 to enumerate all response types by considering each distinct execution path through an endpoint implementation. We implement path-sensitive type inference for Ruby, a popular language used for REST API servers. We evaluate REST𝜋 by using it to infer types for 132 endpoints across 5 open-source REST API implementations without utilizing existing specifications or test suites. We find REST𝜋 performs type inference efficiently and produces types that are more precise and complete than those obtained via an HTTP proxy. Our results suggest that path-sensitivity is a key technique to enumerate distinct response types for REST endpoints.
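
To illustrate why path sensitivity matters here, consider the following hypothetical endpoint, written in Python for consistency with the other sketches in this listing (REST𝜋 itself targets Ruby): each execution path returns a differently shaped response, and a complete specification must enumerate both.

    # Hypothetical endpoint with two execution paths and two response shapes.
    def get_user(user_id, db):
        user = db.get(user_id)
        if user is None:
            return {"status": 404, "body": {"error": "not found"}}
        return {"status": 200, "body": {"id": user_id, "name": user["name"]}}

    # Path-sensitive inference enumerates one response type per path, informally:
    #   {status: 404, body: {error: str}} | {status: 200, body: {id: int, name: str}}
    print(get_user(1, {1: {"name": "Ada"}}))
    print(get_user(2, {1: {"name": "Ada"}}))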

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Compositional Quantum Control Flow with Efficient Compilation in Qunity
Mikhail Mints, Finn Voichick, Leonidas Lampropoulos, and Robert Rand
(California Institute of Technology, Pasadena, USA; University of Maryland, College Park, USA; University of Chicago, Chicago, USA)
Most existing quantum programming languages are based on the quantum circuit model of computation, as higher-level abstractions are particularly challenging to implement—especially ones relating to quantum control flow. The Qunity language, proposed by Voichick et al., offered such an abstraction in the form of a quantum control construct, with great care taken to ensure that the resulting language is still realizable. However, Qunity lacked a working implementation, and the originally proposed compilation procedure was very inefficient, with even simple quantum algorithms compiling to unreasonably large circuits.
In this work, we focus on the efficient compilation of high-level quantum control flow constructs, using Qunity as our starting point. We introduce a wider range of abstractions on top of Qunity's core language that offer compelling trade-offs compared to its existing control construct. We create a complete implementation of a Qunity compiler, which converts high-level Qunity code into the quantum assembly language OpenQASM 3. We develop optimization techniques for multiple stages of the Qunity compilation procedure, including both low-level circuit optimizations as well as methods that consider the high-level structure of a Qunity program, greatly reducing the number of qubits and gates used by the compiler.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Liberating Merges via Apartness and Guarded Subtyping
Han Xu, Xuejing Huang, and Bruno C. d. S. Oliveira
(Princeton University, USA; University of Hong Kong, China; IRIF - Université Paris Cité, France)
The merge operator is a powerful construct in programming languages, enabling flexible composition of various components such as functions, records, or classes. Unfortunately, its application often leads to ambiguity and non-determinism, especially when dealing with overlapping types. To prevent ambiguity, approaches such as disjoint intersection types have been proposed. However, disjointness imposes strict constraints to ensure determinism, at the cost of limiting expressiveness, particularly for function overloading. This paper introduces a novel concept called type apartness, which relaxes the strict disjointness constraints, while maintaining type safety and determinism. Type apartness allows some overlap for overloaded functions as long as the calling contexts of those functions can be used to disambiguate upcasts in function calls. By incorporating the notion of guarded subtyping to prevent ambiguity when upcasting, our approach is the first to support function overloading, return type overloading, extensible records, and nested composition in a single calculus while preserving determinism. We formalize our calculi and proofs using Coq and prove their type soundness and determinism. Additionally, we demonstrate how type normalization and type difference provide more convenience and help resolve conflicts, enhancing the flexibility and expressiveness of the merge operator.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Probabilistic Inference for Datalog with Correlated Inputs
Jingbo Wang, Shashin Halalingaiah, Weiyi Chen, Chao Wang, and Işıl Dillig
(Purdue University, USA; University of Texas at Austin, USA; University of Southern California, USA)
Probabilistic extensions of logic programming languages, such as ProbLog, integrate logical reasoning with probabilistic inference to evaluate probabilities of output relations; however, prior work does not account for potential statistical correlations among input facts. This paper introduces Praline, a new extension to Datalog designed for precise probabilistic inference in the presence of (partially known) input correlations. We formulate the inference task as a constrained optimization problem, where the solution yields sound and precise probability bounds for output facts. However, due to the complexity of the resulting optimization problem, this approach alone often does not scale to large programs. To address scalability, we propose a more efficient δ-exact inference algorithm that leverages constraint solving, static analysis, and iterative refinement. Our empirical evaluation on challenging real-world benchmarks, including side-channel analysis, demonstrates that our method not only scales effectively but also delivers tight probability bounds.
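
A simple illustration of why partially known correlations force interval answers (a standard fact, not an example taken from the paper): for input facts a and b with marginal probabilities p and q but unspecified correlation, only the Fréchet bounds on their conjunction are sound.

    % Fréchet bounds for a conjunction under unknown correlation:
    \max(0,\; p + q - 1) \;\le\; P(a \wedge b) \;\le\; \min(p,\, q)
    % e.g. p = q = 0.5 bounds P(a \wedge b) to [0, 0.5]; assuming independence
    % would (unsoundly) pin it to 0.25.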

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
im2im: Automatically Converting In-Memory Image Representations using a Knowledge Graph Approach
Fei Chen, Sunita Saha, Manuela Schuler, Philipp Slusallek, and Tim Dahmen
(German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany; Saarland University, Saarbrücken, Germany; Aalen University, Aalen, Germany)
Image processing workflows typically consist of a series of different functions, each working with “image” inputs and outputs in an abstract sense. However, the specific in-memory representation of images differs between and sometimes within libraries. Conversion is therefore necessary when integrating functions from several sources into a single program. The conversion process forces users to consider low-level implementation details, including data types, color channels, channel order, minibatch layout, memory locations, and pixel intensity ranges. Specifically in the case of visual programming languages (VPLs), this distracts from high-level operations. We introduce im2im, a Python library that automates the conversion of in-memory image representations. The central concept of this library is a knowledge graph that describes image representations and how to convert between them. The system queries this knowledge graph to generate the conversion code and execute it, converting an image to the desired representation. The effectiveness of the approach is evaluated through two case studies in VPLs. In each case, we compared a workflow created in a basic block-based VPL with the same workflow enhanced using im2im. These evaluations show that im2im automates type conversions and eliminates the need for manual intervention. Additionally, we compared the overhead of using explicit intermediate representations versus im2im, both of which avoid manual type conversions in VPLs. The results indicate that im2im generates only the necessary conversions, avoiding the runtime overhead associated with converting to and from intermediate formats. A performance comparison between the step-by-step approach used by im2im and a single-function approach demonstrates that the overhead introduced by im2im does not impact practical usability. While focused on block-based VPLs, im2im can be generalized to other VPLs and textual programming environments. Its principles are also applicable to domains other than images. The source code and analyses are available via GitHub.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Model-Guided Fuzzing of Distributed Systems
Ege Berkay Gulcan, Burcu Kulahcioglu Ozkan, Rupak Majumdar, and Srinidhi Nagendra
(Delft University of Technology, Netherlands; MPI-SWS, Germany; IRIF - CNRS - Université Paris Cité, France; Chennai Mathematical Institute, India)
We present a coverage-guided testing algorithm for distributed systems implementations. Our main innovation is the use of an abstract formal model of the system that is used to define coverage. Such abstract models are frequently developed in the early phases of protocol design and verification but are infrequently used at testing time. We show that guiding random test generation using model coverage can be effective in covering interesting points in the implementation state space. We have implemented a fuzzer for distributed system implementations and abstract models written in TLA+. Our algorithm achieves better coverage over purely random exploration as well as random exploration guided by different notions of scheduler coverage and mutation. In particular, we show consistently higher coverage on implementations of distributed consensus protocols such as Two-Phase Commit and the Raft implementations in Etcd-raft and RedisRaft and detect bugs faster. Moreover, we discovered 12 previously unknown bugs in their implementations, four of which could only be detected by model-guided fuzzing.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Flexible and Expressive Typed Path Patterns for GQL
Wenjia Ye, Matías Toro, Tomás Díaz, Bruno C. d. S. Oliveira, Manuel Rigger, Claudio Gutierrez, and Domagoj Vrgoč
(National University of Singapore, Singapore; University of Chile, Chile; IMFD, Chile; University of Hong Kong, China; Pontificia Universidad Católica de Chile, Chile)
Graph databases have become an important data management technology across various domains, including biology, sociology, industry (e.g. fraud detection, supply chain management, financial services), and investigative journalism, due to their ability to efficiently store and query large-scale knowledge graphs and networks. Recently, the Graph Query Language (GQL) was introduced as a new ISO standard providing a unified framework for querying graphs. However, this initial specification lacks a formal type system for query validation. As a result, queries can fail at runtime due to type inconsistencies or produce empty results without prior warning. Solving this issue could have great benefits for users in writing correct queries, especially when handling large datasets. To address this gap, we introduce a formal type model for a core fragment of GQL extended with property-based filtering and imprecise types both in the schema and the queries. This model, named FPPC, enables static detection of semantically incorrect and stuck queries, improving user feedback. We establish key theoretical properties, including emptiness (detecting empty queries due to type mismatches) and type safety (guaranteeing that well-typed queries do not fail at runtime). Additionally, we prove a gradual guarantee, ensuring that removing type annotations either does not introduce static type errors or only increases the result set. By integrating imprecision into GQL, FPPC offers a flexible solution for handling schema evolution and incomplete type information. This work contributes to making GQL more robust, improving both its usability and its formal foundation.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Mind the Abstraction Gap: Bringing Equality Saturation to Real-World ML Compilers
Arya Vohra, Leo Seojun Lee, Jakub Bachurski, Oleksandr Zinenko, Phitchaya Mangpo Phothilimthana, Albert Cohen, and William S. Moses
(University of Chicago, USA; University of Oxford, UK; University of Cambridge, UK; Brium, France; OpenAI, USA; Google, France; University of Illinois at Urbana-Champaign, USA)
Machine learning (ML) compilers rely on graph-level transformations to enhance the runtime performance of ML models. However, performing local transformations on individual operations can create effects far beyond the location of the rewrite. In particular, a local rewrite can change the profitability or legality of hard-to-predict downstream transformations, particularly regarding data layout, parallelization, fine-grained scheduling, and memory management. As a result, program transformations are often driven by manually-tuned compiler heuristics, which are quickly rendered obsolete by new hardware and model architectures.
Instead of hand-written local heuristics, we propose the use of equality saturation. We replace such heuristics with a more robust global performance model, which accounts for downstream transformations. Equality saturation addresses the challenge of local optimizations inadvertently constraining or negating the benefits of subsequent transformations, thereby providing a solution that is inherently adaptable to newer workloads. While this approach still requires a global performance model to evaluate the profitability of transformations, it holds significant promise for increased automation and adaptability.
This paper addresses challenges in applying equality saturation on real-world ML compute graphs and state-of-the-art hardware. By doing so, we present an improved method for discovering effective compositions of graph optimizations. We study different cost modeling approaches to deal with fusion and layout optimization, and tackle scalability issues that arise from considering a very wide range of algebraic optimizations. We design an equality saturation pass for the XLA compiler, with an implementation in C++ and Rust. We demonstrate an average speedup of 3.45% over XLA’s optimization flow across our benchmark suite on various CPU and GPU platforms, with a maximum speedup of 56.26% for NasRNN on CPU.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Coinductive Proofs of Regular Expression Equivalence in Zero Knowledge
John C. Kolesar, Shan Ali, Timos Antonopoulos, and Ruzica Piskac
(Yale University, USA)
Zero-knowledge (ZK) protocols enable software developers to provide proofs of their programs' correctness to other parties without revealing the programs themselves. Regular expressions are pervasive in real-world software, and zero-knowledge protocols have been developed in the past for the problem of checking whether an individual string appears in the language of a regular expression, but no existing protocol addresses the more complex PSPACE-complete problem of proving that two regular expressions are equivalent.
We introduce Crepe, the first ZK protocol for encoding regular expression equivalence proofs and also the first ZK protocol to target a PSPACE-complete problem. Crepe uses a custom calculus of proof rules based on regular expression derivatives and coinduction, and we introduce a sound and complete algorithm for generating proofs in our format. We test Crepe on a suite of hundreds of regular expression equivalence proofs. Crepe can validate large proofs in only a few seconds each.
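
The following sketch (illustrative only; it is not Crepe's proof calculus or its zero-knowledge encoding) shows the two ingredients the abstract names, Brzozowski derivatives plus a coinductive bisimulation check, in Python.

    # Regexes as nested tuples: None is the empty language, "" is epsilon,
    # a one-character string is a literal, ('|', r, s) is union,
    # ('.', r, s) is concatenation, and ('*', r) is Kleene star.

    def nullable(r):
        if r is None: return False
        if isinstance(r, str): return r == ""
        if r[0] == '|': return nullable(r[1]) or nullable(r[2])
        if r[0] == '.': return nullable(r[1]) and nullable(r[2])
        return True  # '*'

    def alt(r, s):
        # Smart constructors keep derivatives small enough to reach a fixpoint
        # on simple examples (full ACI normalization is needed in general).
        if r is None: return s
        if s is None: return r
        return r if r == s else ('|', r, s)

    def seq(r, s):
        if r is None or s is None: return None
        if r == "": return s
        if s == "": return r
        return ('.', r, s)

    def deriv(r, c):
        # Derivative of r with respect to c: the words w such that c + w is in L(r).
        if r is None or r == "": return None
        if isinstance(r, str): return "" if r == c else None
        if r[0] == '|': return alt(deriv(r[1], c), deriv(r[2], c))
        if r[0] == '.':
            left = seq(deriv(r[1], c), r[2])
            return alt(left, deriv(r[2], c)) if nullable(r[1]) else left
        return seq(deriv(r[1], c), r)  # '*'

    def equivalent(r, s, alphabet="ab"):
        # Coinduction: try to close a candidate bisimulation under derivatives.
        # The alphabet must cover the characters used by both regexes.
        seen, todo = set(), [(r, s)]
        while todo:
            pair = todo.pop()
            if pair in seen: continue
            seen.add(pair)
            a, b = pair
            if nullable(a) != nullable(b): return False
            todo.extend((deriv(a, c), deriv(b, c)) for c in alphabet)
        return True

    # (a|b)* and (a*b*)* denote the same language.
    print(equivalent(('*', ('|', 'a', 'b')),
                     ('*', ('.', ('*', 'a'), ('*', 'b')))))  # True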

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Reasoning about External Calls
Sophia Drossopoulou, Julian Mackay, Susan Eisenbach, and James Noble
(Imperial College London, UK; Kry10, New Zealand; Victoria University of Wellington, New Zealand; Creative Research & Programming, New Zealand)
In today’s complex software, internal trusted code is tightly intertwined with external untrusted code. To reason about internal code, programmers must reason about the potential effects of calls to external code, even though that code is not trusted and may not even be available.
The effects of external calls can be limited if internal code is programmed defensively, limiting potential effects by limiting access to the capabilities necessary to cause those effects.
This paper addresses the specification and verification of internal code that relies on encapsulation and object capabilities to limit the effects of external calls. We propose new assertions for access to capabilities, new specifications for limiting effects, and a Hoare logic to verify that a module satisfies its specification, even while making external calls. We illustrate the approach through a running example with mechanised proofs, and prove soundness of the Hoare logic.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Fast Client-Driven CFL-Reachability via Regularization-Based Graph Simplification
Chenghang Shi, Dongjie He, Haofeng Li, Jie Lu, Lian Li, and Jingling Xue
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Chongqing University, China; Zhongguancun Laboratory, China; UNSW, Australia)
Context-free language (CFL) reachability is a critical framework for various program analyses, widely adopted despite its computational challenges due to cubic or near-cubic time complexity. This often leads to significant performance degradation in client applications. Notably, in real-world scenarios, clients typically require reachability information only for specific source-to-sink pairs, offering opportunities for targeted optimization.
We introduce MoYe, an effective regularization-based graph simplification technique designed to enhance the performance of client-driven CFL-reachability analyses by pruning non-contributing edges—those that do not participate in any specified CFL-reachable paths. MoYe employs a regular approximation to ensure exact reachability results for all designated node pairs and operates linearly with respect to the number of edges in the graph. This lightweight efficiency makes MoYe a valuable pre-processing step that substantially reduces both computational time and memory requirements for CFL-reachability analysis, outperforming a recent leading graph simplification approach. Our evaluations with two prominent CFL-reachability client applications demonstrate that MoYe can substantially improve performance and reduce resource consumption.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
qblaze: An Efficient and Scalable Sparse Quantum Simulator
Hristo Venev, Thien Udomsrirungruang, Dimitar Dimitrov, Timon Gehr, and Martin Vechev
(INSAIT at Sofia University St. Kliment Ohridski, Bulgaria; University of Oxford, UK; ETH Zurich, Switzerland)
Classical simulation of quantum circuits is critical for the development of implementations of quantum algorithms: it does not require access to specialized hardware, facilitates debugging by allowing direct access to the quantum state, and is the only way to test on inputs that are too big for current NISQ computers.
Many quantum algorithms rely on invariants that result in sparsity in the state vector. A sparse state vector simulator only computes with non-zero amplitudes. For important classes of algorithms, this results in an asymptotic improvement in simulation time. While prior work has investigated promising ways to exploit sparsity, it is still unclear how best to scale sparse simulation to modern multi-core architectures.
In this work, we address this challenge and present qblaze, a highly optimized sparse state vector simulator based on (i) a compact sorted array representation, and (ii) new, easily parallelizable and highly-scalable algorithms for all quantum operations. Our extensive experimental evaluation shows that qblaze is often orders-of-magnitude more efficient than prior sparse state vector simulators even on a single thread, and also that qblaze scales well to a large number of CPU cores.
Overall, our work enables testing quantum algorithms on input sizes that were previously out of reach.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
React-tRace: A Semantics for Understanding React Hooks
Jay Lee, Joongwon Ahn, and Kwangkeun Yi
(Seoul National University, Republic of Korea)
React has become the most widely used web front-end framework, enabling the creation of user interfaces in a declarative and compositional manner. Hooks are a set of APIs that manage side effects in function components in React. However, their semantics are often seen as opaque to developers, leading to UI bugs. We introduce React-tRace, a formalization of the semantics of the essence of React Hooks that clarifies their behavior. We demonstrate that our model captures the behavior of React by theoretically showing that it embodies essential properties of Hooks and empirically comparing our React-tRace definitional interpreter against a test suite. Furthermore, we showcase a practical visualization tool based on the formalization to demonstrate how developers can better understand the semantics of Hooks.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced
Incremental Certified Programming
Tomás Díaz, Kenji Maillard, Nicolas Tabareau, and Éric Tanter
(University of Chile, Chile; Inria, France)
Certified programming, as carried out in proof assistants and dependently-typed programming languages, ensures that software meets its requirements by supporting the definition of both specifications and proofs. However, proofs easily break with partial definitions and incremental changes because specifications are not designed to account for the intermediate incomplete states of programs. We advocate for proper support for incremental certified programming by analyzing its objectives and inherent challenges, and propose a formal framework for achieving incremental certified programming in a principled manner. The key idea is to define appropriate notions of completion refinement and completeness to capture incrementality, and to systematically produce specifications that are valid at every stage of development while preserving the intent of the original statements. We provide a prototype implementation in the Rocq Prover, called IncRease, which exploits typeclasses for automation and extensibility, and is independent of any specific mechanism used to handle incompleteness. We illustrate its use with both an incremental textbook formalization of the simply-typed λ-calculus, and a more complex case study of incremental certified programming for an existing dead-code elimination optimization pass of the CompCert project. We show that the approach is compatible with randomized property-based testing as provided by QuickChick. Finally, we study how to combine incremental certified programming with deductive synthesis, using a novel incrementality-friendly adaptation of the Fiat library. This work provides theoretical and practical foundations towards systematic support for incremental certified programming, highlighting challenges and perspectives for future developments.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Correct Black-Box Monitors for Distributed Deadlock Detection: Formalisation and Implementation
Radosław Jan Rowicki, Adrian Francalanza, and Alceste Scalas
(DTU, Denmark; University of Malta, Malta)
Many software applications rely on concurrent and distributed (micro)services that interact via message-passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the "core" of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging.
In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and detect deadlocks by only observing the incoming and outgoing messages, and exchanging probes with other monitors. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and a distributed black-box monitoring algorithm that we prove sound and complete (i.e., identifies deadlocked services with neither false positives nor false negatives). We implement our results in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance.
This is the first work that formalises, proves the correctness, and implements distributed black-box monitors for deadlock detection. Our results are mechanised in Coq. DDMon is the companion artifact of this paper.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Abstraction Refinement-Guided Program Synthesis for Robot Learning from Demonstrations
Guofeng Cui, Yuning Wang, Wensen Mao, Yuanlin Duan, and He Zhu
(Rutgers University, USA)
Over the past decade, deep reinforcement learning (RL) techniques have significantly advanced robotic systems. However, due to the complex architectures of neural network models, ensuring their trustworthiness is a considerable challenge. Programmatic reinforcement learning has surfaced as a promising approach. Nonetheless, synthesizing robot-control programs remains challenging. Existing methods rely on domain-specific languages (DSLs) populated with user-defined state abstraction predicates and a library of low-level controllers as abstract actions to bootstrap synthesis, which is impractical in unknown environments that lack such predefined components. To address this limitation, we introduce RoboScribe, a novel abstraction refinement-guided program synthesis framework that automatically derives robot state and action abstractions from raw, unsegmented task demonstrations in high-dimensional, continuous spaces. It iteratively enriches and refines an initially coarse abstraction until it generates a task-solving program over the abstracted robot environment. RoboScribe is effective in synthesizing iterative programs by inferring recurring subroutines directly from the robot’s raw, continuous state and action spaces, without needing predefined abstractions. Experimental results show that RoboScribe programs inductively generalize to long-horizon robot tasks involving arbitrary numbers of objects, outperforming baseline methods in terms of both interpretability and efficiency.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
A Sound Static Analysis Approach to I/O API Migration
Shangyu Li, Zhaoyang Zhang, Sizhe Zhong, Diyu Zhou, and Jiasi Shen
(Hong Kong University of Science and Technology, China; Peking University, China)
The advances in modern storage technologies necessitate the development of new input/output (I/O) APIs to maximize their performance benefits. However, migrating existing software to use different APIs poses significant challenges due to mismatches in computational models and complex code structures surrounding stateful, non-contiguous multi-API call sites. We present Sprout, a new system for automatically migrating programs across I/O APIs that guarantees behavioral equivalence. Sprout uses flow-sensitive pointer analysis to identify semantic variables, which enables the typestate analysis for matching API semantics and the synthesis of migrated programs. Experimental results with real-world C programs highlight the efficiency and effectiveness of our approach. We also show that Sprout can be adapted to other domains, such as databases.

Publisher's Version
Homomorphism Calculus for User-Defined Aggregations
Ziteng Wang, Ruijie Fang, Linus Zheng, Dixin Tang, and Işıl Dillig
(University of Texas at Austin, USA)
Data processing frameworks like Apache Spark and Flink provide built-in support for user-defined aggregation functions (UDAFs), enabling the integration of domain-specific logic. However, for these frameworks to support efficient UDAF execution, the function needs to satisfy a homomorphism property, which ensures that partial results from independent computations can be merged correctly. Motivated by this problem, this paper introduces a novel homomorphism calculus that can both verify and refute whether a UDAF is a dataframe homomorphism. If so, our calculus also enables the construction of a corresponding merge operator which can be used for incremental computation and parallel execution. We have implemented an algorithm based on our proposed calculus and evaluate it on real-world UDAFs, demonstrating that our approach significantly outperforms two leading synthesizers.
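
As a concrete example of the property in question (a standard illustration, not the paper's calculus), an averaging UDAF becomes a homomorphism once its state is a (sum, count) pair, because partial states from independent partitions can be merged:

    from functools import reduce

    def init():
        return (0.0, 0)             # (running sum, element count)

    def update(state, x):
        s, n = state
        return (s + x, n + 1)

    def merge(a, b):                # the merge operator that makes it a homomorphism
        return (a[0] + b[0], a[1] + b[1])

    def finalize(state):
        s, n = state
        return s / n if n else None

    part1, part2 = [1.0, 2.0, 3.0], [4.0, 5.0]
    merged = merge(reduce(update, part1, init()),
                   reduce(update, part2, init()))
    print(finalize(merged))         # 3.0, the same as averaging all five values at once

A UDAF that kept only the running average, by contrast, admits no correct merge operator, which illustrates the refutation side of the problem.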

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Synthesizing DSLs for Few-Shot Learning
Paul Krogmeier and P. Madhusudan
(University of Illinois at Urbana-Champaign, USA)
We study the problem of synthesizing domain-specific languages (DSLs) for few-shot learning in symbolic domains. Given a base language and instances of few-shot learning problems, where each instance is split into training and testing samples, the DSL synthesis problem asks for a grammar over the base language that guarantees that small expressions solving training samples also solve corresponding testing samples. We prove that the problem is decidable for a class of languages whose semantics over fixed structures can be evaluated by tree automata and when expression size corresponds to parse tree depth in the grammar, and, furthermore, the grammars solving the problem correspond to a regular set of trees. We also prove decidability results for variants of the problem where DSLs are only required to express solutions for input learning problems and where DSLs are defined using macro grammars.

Publisher's Version
REPTILE: Performant Tiling of Recurrences
Muhammad Usman Tariq, Shiv Sundram, and Fredrik Kjolstad
(Stanford University, USA)
We introduce REPTILE, a compiler that performs tiling optimizations for programs expressed as mathematical recurrence equations. REPTILE recursively decomposes a recurrence program into a set of unique tiles and then simplifies each into a different set of recurrences. Given declarative user specifications of recurrence equations, optimizations, and optional mappings of recurrence subexpressions to external library calls, REPTILE generates C code that composes compiler-generated loops with calls to external hand-optimized libraries. We show that for direct linear solvers expressible as recurrence equations, the generated C code matches and often exceeds the performance of standard hand-optimized libraries. We evaluate REPTILE's generated C code against hand-optimized implementations of linear solvers in Intel MKL, as well as two non-solver recurrences from bioinformatics: Needleman-Wunsch and Smith-Waterman. When the user provides good tiling specifications, REPTILE achieves parity with MKL, achieving between 0.79--1.27x speedup for the LU decomposition, 0.97--1.21x speedup for the Cholesky decomposition, 1.61--2.72x speedup for lower triangular matrix inversion, 1.01--1.14x speedup for triangular solve with multiple right-hand sides, and 1.14--1.73x speedup over handwritten implementations of the bioinformatics recurrences.
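
For reference, the Smith-Waterman score mentioned above is a typical instance of the recurrence equations REPTILE takes as input; in standard textbook form (not REPTILE's own specification syntax), with substitution score s and linear gap penalty g:

    H(i, 0) = H(0, j) = 0
    H(i, j) = \max\bigl(0,\; H(i-1, j-1) + s(a_i, b_j),\; H(i-1, j) - g,\; H(i, j-1) - g\bigr)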

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
SafeRace: Assessing and Addressing WebGPU Memory Safety in the Presence of Data Races
Reese Levine, Ashley Lee, Neha Abbas, Kyle Little, and Tyler Sorensen
(University of California at Santa Cruz, USA; Microsoft, USA)
In untrusted execution environments such as web browsers, code from remote sources is regularly executed. To harden these environments against attacks, constituent programming languages and their implementations must uphold certain safety properties, such as memory safety. These properties must be maintained across the entire compilation stack, which may include intermediate languages that do not provide the same safety guarantees. Any case where properties are not preserved could lead to a serious security vulnerability.
In this work, we identify a specification vulnerability in the WebGPU Shading Language (WGSL) where code with data races can be compiled to intermediate representations in which an optimizing compiler could legitimately remove memory safety guardrails. To address this, we present SafeRace, a collection of threat assessments and specification proposals across the WGSL execution stack. While our threat assessment showed that this vulnerability does not appear to be exploitable on current systems, it creates a ”ticking time bomb”, especially as compilers in this area are rapidly evolving. Given this, we introduce the SafeRace Memory Safety Guarantee (SMSG), two components that preserve memory safety in the WGSL execution stack even in the presence of data races. The first component specifies that program slices contributing to memory indexing must be race free and is implemented via a compiler pass for WGSL programs. The second component is a requirement on intermediate representations that limits the effects of data races so that they cannot impact race-free program slices. While the first component is not guaranteed to apply to all possible WGSL programs due to limitations on how some data types can be accessed, we show that existing language constructs are sufficient to implement this component with minimal performance overhead on many existing important WebGPU applications. We test the second component by performing a fuzzing campaign of 81 hours across 21 compilation stacks; our results show violations on only one (likely buggy) machine, thus providing evidence that lower-level GPU frameworks could relatively straightforwardly support this constraint. Finally, our assessments discovered GPU memory isolation vulnerabilities in Apple and AMD GPUs, as well as a security-critical miscompilation of WGSL in a pre-release version of Firefox.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
HEMVM: A Heterogeneous Blockchain Framework for Interoperable Virtual Machines
Vladyslav Nekriach, Sidi Mohamed Beillahi, Chenxing Li, Peilun Li, Ming Wu, Andreas Veneris, and Fan Long
(University of Toronto, Canada; Shanghai Tree-Graph Blockchain Research Institute, China)
This paper introduces HEMVM, an innovative heterogeneous blockchain framework that seamlessly integrates diverse virtual machines (VMs), including the Ethereum Virtual Machine (EVM) and the Move Virtual Machine (MoveVM), into a unified system. This integration facilitates interoperability while retaining compatibility with existing Ethereum and Move toolchains by preserving high-level language constructs. HEMVM’s unique cross-VM operations allow users to interact with contracts across various VMs using any wallet software, effectively resolving the fragmentation in user experience caused by differing VM designs. Our experimental results demonstrate that HEMVM is both fast and efficient, incurring minimal overhead (less than 4.4%) for intra-VM transactions and achieving up to 9,300 TPS for cross-VM transactions. Our results also show that the cross-VM operations in HEMVM are sufficiently expressive to support complex decentralized finance interactions across multiple VMs. Finally, the parallelized prototype of HEMVM shows performance improvements up to 44.8% compared to the sequential version of HEMVM under workloads with mixed transaction types.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Towards Verifying Crash Consistency
Keonho Lee, Conan Truong, and Brian Demsky
(University of California at Irvine, USA)
Compute Express Link (CXL) memory sharing, persistent memory, and other related technologies allow data to survive crash events. A key challenge is ensuring that data is consistent after crashes such that it can be safely accessed. While there has been much work on bug-finding tools for persistent memory programs, these tools cannot guarantee that a program is crash-consistent. In this paper, we present a language, CrashLang, and its type system, that together guarantee that well-typed data structure implementations written in CrashLang are crash-consistent. CrashLang leverages the well-known commit-store pattern in which a single store logically commits an entire data structure operation. In this paper, we prove that well-typed CrashLang programs are crash-consistent, and provide a prototype implementation of the CrashLang compiler. We have evaluated CrashLang on five benchmarks: the Harris linked list, the Treiber stack, the Michael–Scott queue, a Read-Copy-Update binary search tree, and a Cache-Line Hash Table. We experimentally verified that each implementation correctly survives crashes.
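
The commit-store pattern the abstract relies on can be sketched as follows (illustrative Python, not CrashLang syntax; the flush and fence comments mark where a real persistent-memory implementation would order writes): all fields of a new node are made durable first, and a single final store of the head pointer commits the operation, so a crash observes either the old structure or the fully initialized new one.

    # Illustrative sketch of the commit-store pattern (not CrashLang syntax).
    class Node:
        def __init__(self, value, nxt):
            self.value = value
            self.next = nxt

    class PersistentStack:
        def __init__(self):
            self.head = None            # imagined to live in persistent memory

        def push(self, value):
            node = Node(value, self.head)
            # flush(node); fence()      -- in real persistent-memory code the new
            #                              node must be durable before the commit
            self.head = node            # the single store that commits the push
            # flush(&self.head); fence()

    s = PersistentStack()
    s.push(1)
    s.push(2)
    print(s.head.value)  # 2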

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
A Domain-Specific Probabilistic Programming Language for Reasoning about Reasoning (Or: A Memo on memo)
Kartik Chandra, Tony Chen, Joshua B. Tenenbaum, and Jonathan Ragan-Kelley
(Massachusetts Institute of Technology, USA)
The human ability to think about thinking ("theory of mind") is a fundamental object of study in many disciplines. In recent decades, researchers across these disciplines have converged on a rich computational paradigm for modeling theory of mind, grounded in recursive probabilistic reasoning. However, practitioners often find programming in this paradigm challenging: first, because thinking-about-thinking is confusing for programmers, and second, because models are slow to run. This paper presents memo, a new domain-specific probabilistic programming language that overcomes these challenges: first, by providing specialized syntax and semantics for theory of mind, and second, by taking a unique approach to inference that scales well on modern hardware via array programming. memo enables practitioners to write dramatically faster models with much less code, and has already been adopted by several research groups.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced ACM SIGPLAN Distinguished Paper Award
Interleaving Large Language Models for Compiler Testing
Yunbo Ni and Shaohua Li
(Chinese University of Hong Kong, Hong Kong)
Testing compilers with AI models, especially large language models (LLMs), has shown great promise. However, current approaches struggle with two key problems: The generated programs for testing compilers are often too simple, and extensive testing with the LLMs is computationally expensive. In this paper, we propose a novel compiler testing framework that decouples the testing process into two distinct phases: an offline phase and an online phase. In the offline phase, we use LLMs to generate a collection of small but feature-rich code pieces. In the online phase, we reuse these code pieces by strategically combining them to build high-quality and valid test programs, which are then used to test compilers.
We implement this idea in a tool, LegoFuzz, for testing C compilers. The results are striking: we found 66 bugs in GCC and LLVM, the most widely used C compilers. Almost half of the bugs are miscompilation bugs, which are serious and hard-to-find bugs that none of the existing LLM-based tools could find. We believe this efficient design opens up new possibilities for using AI models in software testing beyond just C compilers.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
A Hoare Logic for Symmetry Properties
Vaibhav Mehta and Justin Hsu
(Cornell University, USA)
Many natural program correctness properties can be stated in terms of symmetries, but existing formal methods have little support for reasoning about such properties. We consider how to formally verify a broad class of symmetry properties expressed in terms of group actions. To specify these properties, we design a syntax for group actions, supporting standard constructions and a natural notion of entailment. Then, we develop a Hoare-style logic for verifying symmetry properties of imperative programs, where group actions take the place of the typical pre- and post-condition assertions. Finally, we develop a prototype tool SymVerif, and use it to verify symmetry properties on a series of handcrafted benchmarks. Our tool uncovered an error in a model of a dynamical system described by McLachlan and Quispel [Acta Numerica 2002].

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional Results Reproduced
Two Approaches to Fast Bytecode Frontend for Static Analysis
Chenxi Li, Haoran Lin, Tian Tan, and Yue Li
(Nanjing University, China)
In static analysis frameworks for Java, the bytecode frontend serves as a critical component, transforming complex, stack-based Java bytecode into a more analyzable register-based, typed 3-address code representation. This transformation often significantly influences the overall performance of analysis frameworks, particularly when processing large-scale Java applications, rendering the efficiency of the bytecode frontend paramount for static analysis. However, the bytecode frontends of currently dominant Java static analysis frameworks, Soot and WALA, despite being time-tested and widely adopted, exhibit limitations in efficiency, hindering their ability to offer a better user experience.
To tackle efficiency issues, we introduce a new bytecode frontend. Typically, bytecode frontends consist of two key stages: (1) translating Java bytecode to untyped 3-address code, and (2) performing type inference on this code. For 3-address code translation, we identified common patterns in bytecode that enable more efficient processing than traditional methods. For type inference, we found that traditional algorithms often include redundant computations that hinder performance. Leveraging these insights, we propose two novel approaches: pattern-aware 3-address code translation and pruning-based type inference, which together form our new frontend and lead to significant efficiency improvements. Besides, our approach can also generate SSA IR, enhancing its usability for various static analysis techniques.
We implemented our new bytecode frontend in Tai-e, a recent state-of-the-art static analysis framework for Java, and evaluated its performance across a diverse set of Java applications. Experimental results demonstrate that our frontend significantly outperforms Soot, WALA, and SootUp (an overhaul of Soot)—in terms of efficiency, being on average 14.2×, 14.5×, and 75.2× faster than Soot, WALA, and SootUp, respectively. Moreover, additional experiments reveal that our frontend exhibits superior reliability in processing Java bytecode compared to these tools, thus providing a more robust foundation for Java static analysis.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Tuning Random Generators: Property-Based Testing as Probabilistic Programming
Ryan Tjoa, Poorva Garg, Harrison Goldstein, Todd Millstein, Benjamin C. Pierce, and Guy Van den Broeck
(University of Washington, USA; University of California at Los Angeles, USA; University of Maryland, USA; University of Pennsylvania, USA)
Property-based testing validates software against an executable specification by evaluating it on randomly generated inputs. The standard way that PBT users generate test inputs is via generators that describe how to sample test inputs through random choices. To achieve a good distribution over test inputs, users must tune their generators, i.e., decide on the weights of these individual random choices. Unfortunately, it is very difficult to understand how to choose individual generator weights in order to achieve a desired distribution, so today this process is tedious and limits the distributions that can be practically achieved.
In this paper, we develop techniques for the automatic and offline tuning of generators. Given a generator with undetermined symbolic weights and an objective function, our approach automatically learns values for these weights that optimize for the objective. We describe useful objective functions that allow users to (1) target desired distributions and (2) improve the diversity and validity of their test cases. We have implemented our approach in a novel discrete probabilistic programming system, Loaded Dice, that supports differentiation and parameter learning, and use it as a language for generators. We empirically demonstrate that our approach is effective at optimizing generator distributions according to the specified objective functions. We also perform a thorough evaluation on PBT benchmarks, demonstrating that, when automatically tuned for diversity and validity, the generators exhibit a 3.1-7.4x speedup in bug finding.
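
A minimal sketch of the tuning problem (illustrative Python, not the Loaded Dice language): a tree generator with one symbolic weight p, the probability of stopping at a leaf, induces very different size distributions for different values of p, and the paper's approach learns such weights offline against an objective instead of leaving them to manual trial and error.

    import random

    def gen_tree(p, depth=0, max_depth=20):
        # With probability p produce a leaf; otherwise branch into two subtrees.
        if depth >= max_depth or random.random() < p:
            return "leaf"
        return ("node",
                gen_tree(p, depth + 1, max_depth),
                gen_tree(p, depth + 1, max_depth))

    def size(t):
        return 1 if t == "leaf" else 1 + size(t[1]) + size(t[2])

    for p in (0.7, 0.5):
        sizes = [size(gen_tree(p)) for _ in range(10_000)]
        print(p, sum(sizes) / len(sizes))  # smaller p yields larger trees on average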

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Efficient Abstract Interpretation via Selective Widening
Jiawei Wang, Xiao Cheng, and Yulei Sui
(UNSW, Australia; Macquarie University, Australia)
Abstract interpretation provides a systematic framework for static analysis, where widening operators are crucial for ensuring termination when analyzing programs over infinite-height lattices. Current abstract interpreters apply widening operations and fixpoint detection uniformly across all variables at identified widening points (e.g., control-flow loop headers), leading to costly computations. Through an empirical study, we observe that infinite ascending chains typically originate from only a subset of variables involved in value-flow cycles, providing opportunities for selective widening and targeted fixpoint detection. This paper introduces an efficient approach to optimize abstract interpretation over non-relational domains through selective widening guided by value-flow analysis. We develop a modular and condensed value-flow graph (MVFG) that enables precise identification of variables requiring widening by detecting value-flow cycles across procedure boundaries. Our MVFG design incorporates efficient shortcut edges that summarize interprocedural value flows, achieving the precision of context-sensitive analysis with linear complexity. By aligning value-flow cycles with the weak topological ordering (WTO) of the control-flow graph, we identify the minimal set of variables requiring widening operations, applying widening exclusively to variables that participate in value-flow back edges. Our evaluation on large-scale open-source projects shows that our selective widening approach reduces analysis time by up to 41.2% while maintaining identical precision. The method significantly reduces the number of widened variables by up to 99.5%, with greater benefits observed in larger codebases.
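The following Python sketch illustrates the general idea of selective widening on an interval domain: a standard widening operator is applied only to a chosen set of variables, while the others are merged with a plain join. The toy loop and the hand-picked selection set stand in for what the paper derives automatically from its MVFG and WTO analysis.

    INF = float("inf")

    def join(a, b):                      # interval least upper bound
        return (min(a[0], b[0]), max(a[1], b[1]))

    def widen(old, new):                 # standard interval widening: unstable bounds jump to +/- infinity
        lo = old[0] if new[0] >= old[0] else -INF
        hi = old[1] if new[1] <= old[1] else INF
        return (lo, hi)

    def analyze_loop(transfer, init, widen_vars):
        state = dict(init)
        while True:
            post = transfer(state)
            nxt = {}
            for v in state:
                merged = join(state[v], post[v])
                # Selective widening: only variables suspected of unbounded growth are widened.
                nxt[v] = widen(state[v], merged) if v in widen_vars else merged
            if nxt == state:
                return state
            state = nxt

    # Loop body: i = i + 1; x = 0   (only i can ascend forever)
    def body(s):
        return {"i": (s["i"][0] + 1, s["i"][1] + 1), "x": (0, 0)}

    print(analyze_loop(body, {"i": (0, 0), "x": (0, 0)}, widen_vars={"i"}))
    # {'i': (0, inf), 'x': (0, 0)}  -- widening applied to 'i' only; 'x' stays precise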

Publisher's Version
Revamping Verilog Semantics for Foundational Verification
Joonwon Choi, Jaewoo Kim, and Jeehoon Kang
(Amazon Web Services, USA; KAIST, Republic of Korea; FuriosaAI, Republic of Korea)
In formal hardware verification, particularly for Register-Transfer Level (RTL) designs in Verilog, model checking has been the predominant technique. However, it suffers from state explosion, limited expressive power, and a large trusted computing base (TCB). Deductive verification offers greater expressive power and enables foundational verification with a minimal TCB. Nevertheless, Verilog's standard semantics, characterized by its nondeterministic and global scheduling, pose significant challenges to its application.
To address these challenges, we propose a new Verilog semantics designed to facilitate deductive verification. Our semantics is based on least fixpoints to enable cycle-level functional evaluation and modular reasoning. For foundational verification, we prove our semantics equivalent to the standard scheduling semantics for synthesizable designs. We demonstrate the benefits of our semantics with a modular verification of a pipelined RISC-V processor's functional correctness and progress guarantees. All our results are mechanized in Rocq.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced ACM SIGPLAN Distinguished Paper Award
Tracing Just-in-Time Compilation for Effects and Handlers
Marcial Gaißert, CF Bolz-Tereick, and Jonathan Immanuel Brachthäuser
(University of Tübingen, Germany; Heinrich-Heine-Universität Düsseldorf, Germany)
Effect handlers are a programming language feature that has recently gained popularity. They allow for non-local yet structured control flow and subsume features like generators, exceptions, asynchronicity, etc. However, current implementations of effect handlers often sacrifice features to achieve efficiency. Meta-tracing just-in-time (JIT) compilers promise to yield the performance of a compiler by implementing only an interpreter. They record execution in a trace, dynamically detect hot loops, and aggressively optimize those using information available at runtime. They excel at optimizing dynamic control flow, which is exactly what effect handlers introduce. We present the first evaluation of tracing JIT compilation specifically for effect handlers. To this end, we developed RPython-based tracing JIT implementations for Eff, Effekt, and Koka by compiling them to a common bytecode format. We evaluate their performance, discuss which classes of effectful programs are optimized well, and examine how our additional optimizations influence performance. We also benchmark against a baseline of state-of-the-art mainstream language implementations.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced
Convex Hull Approximation for Activation Functions
Zhongkui Ma, Zihan Wang, and Guangdong Bai
(University of Queensland, Australia; CSIRO’s Data61, Australia)
The wide adoption of deep learning in safety-critical domains has driven the need for formally verifying the robustness of neural networks. A critical challenge in this endeavor lies in addressing the inherent non-linearity of activation functions. The convex hull of the activation function has emerged as a promising solution, as it effectively tightens variable ranges and provides multi-neuron constraints, which together enhance verification precision. Given that constructing exact convex hulls is computationally expensive and even infeasible in most cases, existing research has focused on over-approximating them. Several ad-hoc methods have been devised for specific functions such as ReLU and Sigmoid. Nonetheless, there remains a substantial gap in developing broadly applicable approaches for general activation functions.
In this work, we propose WraAct, an approach to efficiently constructing tight over-approximations for activation function hulls. Its core idea is to introduce linear constraints that smooth out the fluctuations in the target function, by leveraging double-linear-piece (DLP) functions to simplify the local geometry. In this way, the problem is reduced to over-approximating DLP functions, which can be handled efficiently. We evaluate WraAct against SBLM+PDDM, the state-of-the-art (SOTA) multi-neuron over-approximation method based on decomposing functions into segments. WraAct outperforms it on commonly used functions such as Sigmoid, Tanh, and MaxPool, offering superior efficiency (on average 400× faster) and precision (on average 150×) while constructing fewer constraints (on average a 50% reduction). It can complete the computation for up to 8 input dimensions in 10 seconds. We also integrate WraAct into a neural network verifier to evaluate its capability in verification tasks. On 100 benchmark samples, it raises the number of samples verified with single-neuron constraints from under 10 to over 40, and outperforms the multi-neuron verifier PRIMA with up to 20 additional verified samples. On large networks such as ResNets with 22k neurons, it can complete the verification of one sample within one minute.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Statically Analyzing the Dataflow of R Programs
Florian Sihler and Matthias Tichy
(Ulm University, Germany)
The R programming language is primarily designed for statistical computing and mostly used by researchers without a background in computer science. R provides a wide range of dynamic features and peculiarities that are difficult to analyze statically, like dynamic scoping and lazy evaluation with dynamic side effects. At the same time, the R ecosystem lacks sophisticated analysis tools that support researchers in understanding and improving their code. In this paper, we present a novel static dataflow analysis framework for the R programming language that is capable of handling the dynamic nature of R programs and produces the dataflow graph of given R programs. This graph can be essential in a range of analyses, including program slicing, which we implement as a proof of concept. The core analysis works as a stateful fold over a normalized version of the abstract syntax tree of the R program, which tracks (re-)definitions, values, function calls, side effects, external files, and dynamic control flow to produce one dataflow graph per program. We evaluate the correctness of our analysis using output equivalence testing on a manually curated dataset of 779 sensible slicing points from executable real-world R scripts. Additionally, we use a set of systematic test cases based on the capabilities of the R language and the implementation of the R interpreter, and measure the runtimes as well as the memory consumption on a set of 4,230 real-world R scripts and 20,815 packages available on R’s package manager CRAN. Furthermore, we evaluate the recall of our program slicer, its accuracy using shrinking, and its improvement over the state of the art. We correctly analyze almost all programs in our equivalence test suite, preserving the identical output for 99.7% of the manually curated slicing points. On average, we require 576ms to analyze the dataflow and around 213kB to store the graph of a research script. This shows that our analysis is capable of analyzing real-world sources quickly and correctly. Our slicer achieves an average reduction of 84.8% of tokens, indicating its potential to improve program comprehension.

Publisher's Version
Synthesizing Sound and Precise Abstract Transformers for Nonlinear Hyperbolic PDE Solvers
Jacob Laurel, Ignacio Laguna, and Jan Hückelheim
(Georgia Institute of Technology, USA; Lawrence Livermore National Laboratory, USA; Argonne National Laboratory, USA)
Partial Differential Equations (PDEs) play a ubiquitous role in scientific computing and engineering. While numerical methods make solving PDEs tractable, these numerical solvers encounter several issues, particularly for hyperbolic PDEs. These issues arise from multiple sources including the PDE’s physical model, which can lead to effects like shock wave formation, and the PDE solver’s inherent approximations, which can introduce spurious numerical artifacts. These issues can cause the solver’s program execution to crash (due to overflow) or return results with unacceptable levels of inaccuracy (due to spurious oscillations or dissipation). Moreover, these challenges are compounded by the nonlinear nature of many of these PDEs. In addition, PDE solvers must obey numerical invariants like the CFL condition. Hence there exists a critical need to apply program analysis to PDE solvers to certify such problems do not arise and that invariants are always satisfied.
As a solution, we develop Phocus, which is the first abstract interpretation of hyperbolic PDE solvers. Phocus can certify precise bounds on nonlinear PDE solutions and certify key invariants such as the CFL condition and a solution’s total variation bound. Hence Phocus can verify the absence of shock formation, the stability of the solver, and bounds on the amount of spurious numerical effects. To enable effective abstract interpretation of hyperbolic PDE solvers, Phocus uses a novel optimization-based procedure to synthesize precise abstract transformers for multiple finite difference schemes. To evaluate Phocus, we develop a new set of PDE benchmark programs and use them to perform an extensive experimental evaluation which demonstrates Phocus’s significant precision benefits and scalability to several thousand mesh points.
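For reference, the CFL condition mentioned above is usually stated, for a one-dimensional linear advection equation u_t + a u_x = 0 discretized with an explicit finite-difference scheme, as a bound on the Courant number; Phocus's internal formulation of the invariant may differ in detail:

    C \;=\; \frac{|a|\,\Delta t}{\Delta x} \;\le\; C_{\max}, \qquad \text{with } C_{\max} = 1 \text{ for many explicit first-order schemes.}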

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
HybridPersist: A Compiler Support for User-Friendly and Efficient PM Programming
Yiyu Zhang, Yongzhi Wang, Yanfeng Gao, Xuandong Li, and Zhiqiang Zuo
(Nanjing University, China)
Persistent memory (PM), with its data persistence, has found widespread applications. However, programmers have to manually annotate PM operations in programming to achieve crash consistency, which is labor-intensive and error-prone. In this paper, to alleviate the burden of programming PM applications, we develop HybridPersist, a compiler support for user-friendly and efficient PM programming. On the one hand, HybridPersist automatically achieves crash consistency, minimally intruding on programmers with negligible annotations. On the other hand, it enhances both performance and correctness of PM programs through a series of dedicated analysis passes. The evaluations on well-known benchmarks validate that HybridPersist offers superior programming productivity and runtime performance compared to the state-of-the-art.

Publisher's Version Published Artifact Artifacts Available
Memory-Safety Verification of Open Programs with Angelic Assumptions
Gourav Takhar, Baldip Bijlani, Prantik Chatterjee, Akash Lal, and Subhajit Roy
(IIT Kanpur, India; Microsoft Research, India)
An open program is one for which the complete source code is not available, which is a reality for real-world program verification. Software verification tools tend to assume the worst about any unconstrained behavior and this can yield an enormous number of spurious warnings for open programs. For any serious verification effort, the engineer must invest time up-front in building a suitable model (or mock) of any missing code, which is time-consuming and error-prone. Inaccuracies in the mocks can lead to incorrect verification results.
In this paper, we demonstrate a technique that is capable of distinguishing, with high accuracy, between false positives and actual bugs among potential memory-safety violations in an open program. Central to the technique is the ability to make angelic assumptions about missing code. To accomplish this, we first mine a set of idiomatic patterns in buffer-manipulating programs using a large language model (LLM). This is complemented by a formal synthesis strategy that performs property-directed reasoning to select, adapt, and instantiate these idiomatic patterns into angelic assumptions on the target program. Overall, our system, Seeker, guarantees that a program is deemed correct only if it can be verified under a well-defined set of "trusted" idiomatic patterns. In our experiments over a set of benchmarks curated from popular open-source software, our tool Seeker is able to identify 79% of the false positives with zero false negatives.

Publisher's Version Published Artifact Artifacts Available
Structural Temporal Logic for Mechanized Program Verification
Eleftherios Ioannidis, Yannick Zakowski, Steve Zdancewic, and Sebastian Angel
(University of Pennsylvania, USA; Inria, France)
Mechanized verification of liveness properties for infinite programs with effects and nondeterminism is challenging. Existing temporal reasoning frameworks operate at the level of models such as traces and automata. Reasoning happens at a very low level, requiring complex nested (co-)inductive proof techniques and familiarity with proof assistant mechanics (e.g., the guardedness checker). Further, reasoning at the level of models instead of program constructs creates a verification gap that loses the benefits of modularity and composition enjoyed by structural program logics such as Hoare Logic. To address this verification gap, and the lack of compositional proof techniques for temporal specifications, we propose Ticl, a new structural temporal logic. Using Ticl, we encode complex (co-)inductive proof techniques as structural lemmas and focus our reasoning on variants and invariants. We show that it is possible to perform compositional proofs of general temporal properties in a proof assistant while working at a high level of abstraction. We demonstrate the benefits of Ticl by giving mechanized proofs of safety and liveness properties for programs with scheduling, concurrent shared memory, and distributed consensus, achieving a low proof-to-code ratio.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
MTP: A Meaning-Typed Language Abstraction for AI-Integrated Programming
Jayanaka L. Dantanarayana, Yiping Kang, Kugesan Sivasothynathan, Christopher Clarke, Baichuan Li, Savini Kashmira, Krisztian Flautner, Lingjia Tang, and Jason Mars
(University of Michigan, USA; Jaseci Labs, USA)
Software development is shifting from traditional programming to AI-integrated applications that leverage generative AI and large language models (LLMs) during runtime. However, integrating LLMs remains complex, requiring developers to manually craft prompts and process outputs. Existing tools attempt to assist with prompt engineering, but often introduce additional complexity.
This paper presents Meaning-Typed Programming (MTP), a novel paradigm that abstracts LLM integration through intuitive language-level constructs. By leveraging the inherent semantic richness of code, MTP automates prompt generation and response handling without additional developer effort. We introduce (1) the by operator for seamless LLM invocation, (2) MT-IR, a meaning-based intermediate representation for semantic extraction, and (3) MT-Runtime, an automated system for managing LLM interactions. We implement MTP in Jac, a programming language that supersets Python, and find that MTP significantly reduces coding complexity while maintaining accuracy and efficiency. MTP reduces development complexity, the number of lines of code that must be modified, and costs, while improving run-time performance and maintaining or exceeding the accuracy of existing approaches. Our user study shows that developers using MTP completed tasks 3.2× faster with 45% fewer lines of code compared to existing frameworks. Moreover, MTP demonstrates resilience even when up to 50% of naming conventions are degraded, showing robustness to suboptimal code. MTP is developed as part of the Jaseci open-source project and is available under the module byLLM.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Validating SMT Rewriters via Rewrite Space Exploration Supported by Generative Equality Saturation
Maolin Sun, Yibiao Yang, Jiangchang Wu, and Yuming Zhou
(Nanjing University, China)
Satisfiability Modulo Theories (SMT) solvers are widely used for program analysis and other applications that require automated reasoning. Rewrite systems, as crucial integral components of SMT solvers, are responsible for simplifying and transforming formulas to optimize the solving process. The effectiveness of an SMT solver heavily depends on the robustness of its rewrite system, making its validation crucial. Despite ongoing advancements in SMT solver testing, rewrite system validation remains largely unexplored. Our empirical analysis reveals that developers invest significant effort in ensuring the correctness and reliability of rewrite systems. However, existing testing techniques do not adequately address this aspect. In this paper, we introduce Aries, a novel technique designed to validate SMT solver rewrite systems. First, Aries employs mimetic mutation, a targeted strategy that actively reshapes input formulas to provoke and diversify rewrite opportunities. By aligning mutated terms with known rewrite patterns, Aries can conduct a thorough exploration of the rewrite space in the following phase. Second, Aries utilizes deductive rewriting, leveraging generative equality saturation to effectively explore rewrite space and produce semantically equivalent mutants for the purpose of validation. We implemented Aries as a practical validation tool and evaluated it on leading SMT solvers, including Z3 and cvc5. Our experiments demonstrate that Aries effectively identifies bugs, with 27 new issues detected, of which 22 have been confirmed or fixed by developers. Most of these issues involve the rewrite systems, highlighting Aries's strength in exploring the rewrite space.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Fuzzing C++ Compilers via Type-Driven Mutation
Bo Wang, Chong Chen, Ming Deng, Junjie Chen, Xing Zhang, Youfang Lin, Dan Hao, and Jun Sun
(Beijing Jiaotong University, China; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, China; Tianjin University, China; Peking University, China; Singapore Management University, Singapore)
C++ is a system-level programming language for modern software development, which supports multiple programming paradigms, including object-oriented, generic, and functional programming. The intrinsic complexity of these paradigms and their interactions grants C++ powerful expressiveness while posing significant challenges for compilers in correctly implementing its type system. A type system encompasses various aspects such as type inference, type checking, subtyping, type conversions, generics, scoping, and binding. However, systematic testing of the type systems of C++ compilers remains largely underexplored in existing studies.
In this work, we present TyMut, the first approach specifically designed to test the C++ type system. TyMut is a mutation-based compiler fuzzer equipped with advanced type-driven mutation operators, carefully crafted to target intricate type-related features such as template generics, type conversions, and inheritance. Beyond differential testing, TyMut introduces enhanced test oracles through a must analysis that partially confirms the validity of generated programs. Specifically, mutation operators are classified into well-formed and not-well-formed: Programs generated by well-formed mutation operators are valid and must be accepted by compilers. Programs generated by not-well-formed operators are validated against a set of well-formedness rules. Any violation indicates the program is invalid and must be rejected. For programs that pass the rules but lack a definitive oracle, TyMut applies differential testing to identify behavioral inconsistencies across compilers.
The testing campaign took about 32 hours to generate and test 250,584 programs. The must analysis provides definite test oracles for nearly 80% of all generated programs. TyMut uncovered 102 bugs in recent versions of GCC and Clang, with 56 confirmed as new bugs by compiler developers. Among the confirmed bugs, 26 cause compiler crashes, and more than 50% cause miscompilation. Additionally, 7 of them had remained hidden for over 20 years, 22 for over 10 years, and 39 for over 5 years. One long-standing bug discovered by TyMut was later confirmed as the root cause of a real-world issue in TensorFlow. By the time this paper was submitted, 13 bugs had been fixed, most of them within 60 days. Notably, some unconfirmed bugs have led to in-depth discussions among developers. For instance, one bug led a compiler developer to submit a new issue to the C++ language standard, showing that we uncovered ambiguities in the language specification.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
MetaKernel: Enabling Efficient Encrypted Neural Network Inference through Unified MVM and Convolution
Peng Yuan, Yan Liu, JianXin Lai, Long Li, Tianxiang Sui, Linjie Xiao, Xiaojing Zhang, Qing Zhu, and Jingling Xue
(Ant Group, China; UNSW, Australia)
Practical encrypted neural network inference under the CKKS fully homomorphic encryption (FHE) scheme relies heavily on accelerating two key kernel operations: Matrix-Vector Multiplication (MVM) and Convolution (Conv). However, existing solutions—such as expert-tuned libraries and domain-specific languages—are designed in an ad hoc manner, leading to significant inefficiencies caused by excessive rotations.
We introduce MKR, a novel composition-based compiler approach that optimizes MVM and Conv kernel operations for DNN models under CKKS within a unified framework. MKR decomposes each kernel into composable units, called MetaKernels, to enhance SIMD parallelism within ciphertexts (via horizontal batching) and computational parallelism across them (via vertical batching). Our approach tackles previously unaddressed challenges, including reducing rotation overhead through a rotation-aware cost model for data packing, while also ensuring high slot utilization, uniform handling of inputs with arbitrary sizes, and compatibility with the output tensor layout. Implemented in a production-quality FHE compiler, MKR achieves inference time speedups of 10.08×–185.60× for individual MVM and Conv kernels and 1.75×–11.84× for end-to-end inference compared to a state-of-the-art FHE compiler. Moreover, MKR enables homomorphic execution of large DNN models, where prior methods fail, significantly advancing the practicality of FHE compilers.

Publisher's Version Published Artifact Artifacts Available
Qualified Types with Boolean Algebras
Edward Lee, Jonathan Lindegaard Starup, Ondřej Lhoták, and Magnus Madsen
(University of Waterloo, Canada; Aarhus University, Denmark)
We propose type qualifiers based on Boolean algebras. Traditional type systems with type qualifiers have been based on lattices, but lattices lack the ability to express exclusion. We argue that Boolean algebras, which permit exclusion, are a practical and useful choice of domain for qualifiers.
In this paper, we present a calculus System F<:B that extends System F<: with type qualifiers over Boolean algebras and has support for negation, qualifier polymorphism, and subqualification. We illustrate how System F<:B can be used as a design recipe for a type and effect system, System F<:BE, with effect polymorphism, subeffecting, and polymorphic effect exclusion. We use System F<:BE to establish formal foundations of the type and effect system of the Flix programming language. We also pinpoint and implement a practical form of subeffecting: abstraction-site subeffecting. Experimental results show that abstraction-site subeffecting allows us to eliminate all effect upcasts present in the current Flix Standard Library.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
PReMM: LLM-Based Program Repair for Multi-method Bugs via Divide and Conquer
Linna Xie, Zhong Li, Yu Pei, Zhongzhen Wen, Kui Liu, Tian Zhang, and Xuandong Li
(Nanjing University, China; Hong Kong Polytechnic University, China; Huawei, China)
Large language models (LLMs) have been leveraged to enhance the capability of automated program repair techniques in recent research. While existing LLM-based program repair techniques compare favorably to techniques based on heuristics, constraint-solving, and learning in producing high-quality patches, they mainly target bugs that can be corrected by changing a single faulty method, which greatly limits their effectiveness in repairing bugs that demand patches spanning multiple methods. In this work, we propose the PReMM technique to effectively propose patches that change multiple methods. PReMM builds on three core component techniques: the faulty method clustering technique, which partitions the faulty methods into clusters based on the dependence relationships among them, enabling a divide-and-conquer strategy for the repair task; the fault context extraction technique, which gathers extra information about the fault context to better guide fault diagnosis and the generation of correct patches; and the dual-agent-based patch generation technique, which employs two LLM-based agents with different roles to analyze the fault more precisely and generate higher-quality patches. We have implemented the PReMM technique in a tool of the same name and applied the tool to repair real-world bugs from the Defects4J V1.2 and V2.0 datasets. PReMM produced correct patches for 307 bugs in total. Compared with ThinkRepair, the state-of-the-art LLM-based program repair technique, PReMM correctly repaired 102 more bugs, achieving an improvement of 49.8%.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Understanding and Improving Flaky Test Classification
Shanto Rahman, Saikat Dutta, and August Shi
(University of Texas at Austin, USA; Cornell University, USA)
Regression testing is an essential part of software development, but it suffers from the presence of flaky tests – tests that pass and fail non-deterministically when run on the same code. These unpredictable failures waste developers’ time and often hide real bugs. Prior work showed that fine-tuned large language models (LLMs) can classify flaky tests into different categories with very high accuracy. However, we find that prior approaches over-estimated the accuracy of the models due to incorrect experimental design and unrealistic datasets – making the flaky test classification problem seem simpler than it is.
In this paper, we first show how prior flaky test classifiers over-estimate the prediction accuracy due to 1) flawed experiment design and 2) mis-representation of the real distribution of flaky (and non-flaky) tests in their datasets. After we fix the experimental design and construct a more realistic dataset (which we name FlakeBench), the prior state-of-the-art model shows a steep drop in F1-score, from 81.82% down to 56.62%. Motivated by these observations, we develop a new training strategy to fine-tune a flaky test classifier, FlakyLens, that improves the classification F1-score to 65.79% (9.17pp higher than the state-of-the-art). We also compare FlakyLens against recent pre-trained LLMs, such as CodeLlama and DeepSeekCoder, on the same classification task. Our results show that FlakyLens consistently outperforms these models, highlighting that general-purpose LLMs still fall short on this specialized task.
Using our improved flaky test classifier, we identify the important tokens in the test code that influence the models in making correct or incorrect predictions. By leveraging attribution scores computed per code token in each test, we investigate which tokens have the highest impact on the flaky test classifier’s decision-making for each flaky test category. To assess the influence of these important tokens, we introduce adversarial perturbations of these tokens into the tests and observe whether the model’s predictions change. Our findings show that, when introducing perturbations using the most important tokens, the classification accuracy can drop by as much as 18.37pp. These results highlight that these models still struggle to generalize beyond their training data and rely on identifying category-specific tokens (instead of understanding their semantic context), calling for further research into more robust training methodologies.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Agora: Trust Less and Open More in Verification for Confidential Computing
Hongbo Chen, Quan Zhou, Sen Yang, Sixuan Dang, Xing Han, Danfeng Zhang, Fan Zhang, and XiaoFeng Wang
(Indiana University at Bloomington, USA; Pennsylvania State University, USA; Yale University, USA; Duke University, USA; Hong Kong University of Science and Technology, China; Nanyang Technological University, Singapore)
Confidential computing (CC), designed for security-critical scenarios, uses remote attestation to guarantee code integrity on cloud servers. However, CC alone cannot provide assurance of high-level security properties (e.g., no data leak) on the code. In this paper, we introduce a novel framework, Agora, scrupulously designed to provide a trustworthy and open verification platform for CC. To promote trustworthiness, we observe that certain verification tasks can be delegated to untrusted entities, while the corresponding (smaller) validators are securely housed within the trusted computing base (TCB). Moreover, through a novel blockchain-based bounty task manager, Agora utilizes crowdsourcing to remove trust in complex theorem provers. These synergistic techniques successfully ameliorate the TCB size burden associated with two procedures: binary analysis and theorem proving. To promote openness, Agora supports a versatile assertion language that allows verification of various security policies. Moreover, the design of Agora enables untrusted parties to participate in complex processes outside of Agora’s TCB. By implementing verification workflows for software-based fault isolation, information flow control, and side-channel mitigation policies, our evaluation demonstrates the efficacy of Agora.

Publisher's Version Published Artifact Artifacts Available
Shaking Up Quantum Simulators with Fuzzing and Rigour
Vasileios Klimis, Avner Bensoussan, Elena Chachkarova, Karine Even-Mendoza, Sophie Fortz, and Connor Lenihan
(Queen Mary University of London, UK; King’s College London, UK)
Quantum computing platforms rely on simulators for modelling circuit behaviour prior to hardware execution, where inconsistencies can lead to costly errors. While existing formal validation methods typically target specific compiler components to manage state explosion, they often miss critical bugs. Meanwhile, conventional testing lacks systematic exploration of corner cases and realistic execution scenarios, resulting in both false positives and negatives.
We present FuzzQ, a novel framework that bridges this gap by combining formal methods with structured test generation and fuzzing for quantum simulators. Our approach employs differential benchmarking complemented by mutation testing and invariant checking. At its core, FuzzQ utilises our Alloy-based formal model of QASM 3.0, which encodes the semantics of quantum circuits to enable automated analysis and to generate structurally diverse, constraint-guided quantum circuits with guaranteed properties. We introduce several test oracles to assess both Alloy’s modelling of QASM 3.0 and simulator correctness, including invariant-based checks, statistical distribution tests, and a novel cross-simulator unitary consistency check that verifies functional equivalence modulo global phase, revealing discrepancies that standard statevector comparisons fail to detect in cross-platform differential testing.
We evaluate FuzzQ on both Qiskit and Cirq, demonstrating its platform-agnostic effectiveness. By executing over 800,000 quantum circuits to completion, we assess throughput, code and circuit coverage, and simulator performance metrics, including sensitivity, correctness, and memory overhead. Our analysis revealed eight simulator bugs, six previously undocumented. We also outline a path for extending the framework to support mixed-state simulations under realistic noise models.
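As a small illustration of such an oracle (a simplified NumPy version, not necessarily FuzzQ's exact implementation), two unitaries of dimension n implement the same circuit up to a global phase exactly when |tr(V†U)| equals n:

    import numpy as np

    def same_up_to_global_phase(U, V, tol=1e-9):
        """True iff U = exp(i*t) * V for some real t, assuming U and V are unitary
        matrices of equal dimension n; equality |tr(V^H U)| = n holds exactly then."""
        n = U.shape[0]
        return abs(abs(np.trace(V.conj().T @ U)) - n) < tol

    # A global phase does not change the circuit's behaviour...
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)              # Hadamard gate
    print(same_up_to_global_phase(H, np.exp(1j * 0.7) * H))   # True
    # ...but composing with another gate does.
    Z = np.diag([1, -1])
    print(same_up_to_global_phase(H, Z @ H))                  # False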

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Non-interference Preserving Optimising Compilation
Julian Rosemann, Sebastian Hack, and Deepak Garg
(Saarland University, Germany; MPI-SWS, Germany)
To protect security-critical applications, secure compilers have to preserve security policies, such as non-interference, during compilation. The preservation of security policies goes beyond the classical notion of compiler correctness which only enforces the preservation of the semantics of the source program. Therefore, several standard compiler optimisations are prone to break standard security policies like non-interference. Existing approaches to secure compilation are very restrictive with respect to the compiler optimisations that they permit or to the security policies they support because of conceptual limitations in their formal setup.
In this paper, we present hyperproperty simulations, a novel framework for secure compilation that models the preservation of arbitrary k-hyperproperties during compilation and overcomes several limitations of existing approaches; in particular, it is more expressive and more flexible. We demonstrate this by designing and proving a generic non-interference preserving code transformation that can be applied to different optimisations and leakage models. This approach reduces the proof burden per optimisation to a minimum. We instantiate this code transformation on different leakage models with various standard compiler optimisations that could be handled only in a very limited and less modular way (if at all) by existing approaches. Our results are formally verified in the Rocq theorem prover.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Active Learning for Neurosymbolic Program Synthesis
Celeste Barnaby, Qiaochu Chen, Ramya Ramalingam, Osbert Bastani, and Işıl Dillig
(University of Texas at Austin, USA; New York University, USA; University of Pennsylvania, USA)
The goal of active learning for program synthesis is to synthesize the desired program by asking targeted questions that minimize user interaction. While prior work has explored active learning in the purely symbolic setting, such techniques are inadequate for the increasingly popular paradigm of neurosymbolic program synthesis, where the synthesized program incorporates neural components. When applied to the neurosymbolic setting, such techniques can, and in practice do, return an unintended program due to mispredictions of neural components. This paper proposes a new active learning technique that can handle the unique challenges posed by neural network mispredictions. Our approach is based upon a new evaluation strategy called constrained conformal evaluation (CCE), which accounts for neural mispredictions while taking into account user-provided feedback. Our proposed method iteratively makes CCE more precise until all remaining programs are guaranteed to be observationally equivalent. We have implemented this method in a tool called SmartLabel and experimentally evaluated it on three neurosymbolic domains. Our results demonstrate that SmartLabel identifies the ground truth program for 98% of the benchmarks, requiring under 5 rounds of user interaction on average. In contrast, prior techniques for active learning are only able to converge to the ground truth program for at most 65% of the benchmarks.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Heap-Snapshot Matching and Ordering using CAHPs: A Context-Augmented Heap-Path Representation for Exact and Partial Path Matching using Prefix Trees
Matteo Basso, Aleksandar Prokopec, Andrea Rosà, and Walter Binder
(USI Lugano, Switzerland; Oracle Labs, Switzerland)
GraalVM Native Image is increasingly used to optimize the startup performance of applications that run on the Java Virtual Machine (JVM), and particularly of Function-as-a-Service and Serverless workloads. Native Image resorts to Ahead-of-Time (AOT) compilation to produce a binary from a JVM application that contains a snapshot of the pre-initialized heap memory, reducing the initialization time and hence improving startup performance. However, this performance improvement is hindered by page faults that occur when accessing objects in the heap snapshot. Related work has proposed profile-guided approaches to reduce page faults by reordering objects in the heap snapshot of an optimized binary based on the order in which objects are first accessed, obtaining this information by profiling an instrumented binary of the same application. This reordering is effective only if objects in the instrumented binary can be matched to the semantically equivalent ones in the optimized binary. Unfortunately, this is very challenging because objects do not have unique identities and the heap-snapshot contents are not persistent across Native-Image builds of the same program. This work tackles the problem of matching heap snapshots, and proposes a novel approach to improve the mapping between semantically equivalent objects in different binaries of a Native-Image application. We introduce the concept of context-augmented heap path (CAHP), a list of elements that describes a path to an object stored in the heap snapshot. Our approach associates a CAHP with each object in a way that is as unique as possible. Objects with the same CAHP across different binaries are considered semantically equivalent. Moreover, since some semantically equivalent objects may have different CAHPs in the instrumented and optimized binaries (due to nondeterminism in the image-build process and other factors), we present an approach that finds, for each unmatched CAHP in the optimized binary, the most similar CAHP in the instrumented binary, associating the two objects. We integrate our approach into Native Image, reordering the objects stored in the heap snapshot more efficiently using the improved mapping. Our experiments show that our approach leads to far fewer page faults (a 2.98× reduction on average) and considerably improves startup time (by 1.98× on average) w.r.t. the original Native-Image implementation.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ABC: Towards a Universal Code Styler through Model Merging
Yitong Chen, Zhiqiang Gao, Chuanqi Shi, Baixuan Li, and Miao Gao
(Southeast University, China)
Code style transformation models built on code Language Models (code LMs) have achieved remarkable success. However, they typically focus on basis style transformations, where the target style follows a single criterion, and often struggle with combination styles, where the target style involves multiple criteria. In practice, style guides encompass multiple criteria, making the lack of effective combination style transformation a major limitation to their real-world applicability.
In this paper, we propose Absent-Basis-Combination (abbreviated as ABC), a novel framework for code style transformation that significantly improves combination style transformation and overcomes the limitations of existing approaches. We implement four variants of ABC with parameter sizes of 0.5B, 1.3B, 1.5B, and 3B, demonstrating consistent superiority over existing approaches across all model sizes in both basis and combination style transformations. Specifically, ABC achieves performance gains of up to 86.70%, and remains superior even when baseline approaches use three times the parameters. Furthermore, to address the lack of high-quality datasets and evaluation metrics, we construct and release a new style transformation dataset, Basis & Combination Code Style (abbreviated as BCCStyle), and introduce Code Sequence, Syntactic, Semantic and Stylistic BLEU (abbreviated as CS4BLEU), a novel code similarity metric that surpasses existing metrics in accuracy and consistency.

Publisher's Version Published Artifact Artifacts Available
Synchronized Behavior Checking: A Method for Finding Missed Compiler Optimizations
Yi Zhang, Yu Wang, Linzhang Wang, and Ke Wang
(Nanjing University, China)
Compilers are among the most foundational software ever developed. A critical component of a compiler is its optimization phase, which enhances the efficiency of the generated code. Given the sheer size and complexity of modern compilers, automated techniques for improving their optimization component have been a major area of research. This paper focuses on a specific category of issues, namely missed optimizations, where compilers fail to apply an optimization that could have made the generated code more efficient.
To detect missed optimizations, we propose Synchronized Behavior Checking (SBC), a novel approach that cross-validates multiple optimizations by leveraging their coordinated behaviors. The key insight behind SBC is that the outcome of one optimization can validate whether the conditions required for another optimization were met.
For a practical implementation of SBC, we cross-validate two optimizations at once based on two kinds of relationships: co-occurring and complementary. First, in the co-occurring relationship, if an optimization is applied based on a specific semantic constraint (i.e., an optimization condition) in an input program, another optimization that depends on the same semantic constraint should be applied as well. Second, when two optimizations are enabled by complementary semantic constraints, exactly one of the two should be applied. When an optimization should have been applied (according to either relationship) but was not, we regard it as a missed optimization.
We conduct an extensive evaluation of SBC on two state-of-the-art industry compilers, LLVM and GCC. SBC successfully detects a large number of missed optimizations in both compilers; notably, these missed optimizations arise from a wide range of compiler analyses. Based on our evaluation results, we reported 101 issues to LLVM and GCC, of which 84 have been confirmed and 39 have been fixed or assigned (for planned fixes).
SBC opens up a new, exciting direction for finding missed compiler optimizations.
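Schematically, the two cross-validation rules can be phrased as the following Python check over a report of which optimizations a compiler applied to a given input program. The optimization names and the report format are invented for the example; SBC's actual implementation works on compiler behavior rather than on such summaries.

    # Schematic check of SBC's two relationships for one pair of optimizations.
    # 'applied' maps (invented) optimization names to whether the compiler
    # reports having applied them on a given input program.
    def check_pair(applied, a, b, relationship):
        if relationship == "co-occurring":
            # Both are enabled by the same semantic constraint:
            # one firing without the other suggests a missed optimization.
            suspicious = applied[a] != applied[b]
        elif relationship == "complementary":
            # Enabled by complementary constraints: exactly one should fire.
            suspicious = applied[a] == applied[b]
        else:
            raise ValueError(relationship)
        return "potential missed optimization" if suspicious else "consistent"

    print(check_pair({"opt_a": True, "opt_b": False}, "opt_a", "opt_b", "co-occurring"))
    print(check_pair({"opt_a": True, "opt_b": False}, "opt_a", "opt_b", "complementary"))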

Publisher's Version
Formalizing Linear Motion G-Code for Invariant Checking and Differential Testing of Fabrication Tools
Yumeng He, Chandrakana Nandi, and Sreepathi Pai
(University of Utah, USA; Certora, USA; University of Washington at Seattle, USA; University of Rochester, USA)
The computational fabrication pipeline for 3D printing is much like a compiler: users design models in Computer Aided Design (CAD) tools that are lowered to polygon meshes to be ultimately compiled to machine code by 3D slicers. For traditional compilers and programming languages, techniques for checking program invariants are well-established. Similarly, methods like differential testing are frequently used to uncover bugs in compilers themselves, which makes them more reliable.
The fabrication pipeline would benefit from similar techniques but traditional approaches do not directly apply to the representations used in this domain. Unlike traditional programs, 3D models exist both as geometric objects (a CAD model or a polygon mesh) as well as machine code that ultimately runs on the hardware. The machine code, like in traditional compiling, is affected by many factors like the model, the slicer being used, and numerous user-configurable parameters that control the slicing process.
In this work, we propose a new algorithm for lifting G-code (a common language used in many fabrication pipelines) by denoting a G-code program as a set of cuboids, and then defining an approximate point cloud representation for efficiently operating on these cuboids. Our algorithm opens up new opportunities: we show three use cases that demonstrate how it enables (1) error localization in CAD models through invariant checking, (2) quantitative comparisons between slicers, and (3) evaluation of the efficacy of mesh repair tools. We present a prototype implementation of our algorithm in a tool, GlitchFinder, and evaluate it on 58 real-world CAD models. Our results show that GlitchFinder is particularly effective in identifying slicing issues due to small features, can highlight differences in how popular slicers (Cura and PrusaSlicer) slice the same model, and can identify cases where mesh repair tools (MeshLab and Meshmixer) introduce new errors during repair.
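A highly simplified Python sketch of the lifting step is shown below: it parses linear-motion (G0/G1) commands, turns each extruding move into an axis-aligned cuboid padded by an assumed half track width, and leaves aside the point-cloud approximation and everything else GlitchFinder adds on top. The parsing rules and the PAD constant are illustrative assumptions, not the paper's algorithm.

    import re

    PAD = 0.2   # assumed half extrusion track width, in mm (illustrative value)

    def parse_moves(gcode):
        """Yield (start, end, extruding) segments for linear-motion commands,
        assuming absolute coordinates and absolute extrusion."""
        pos, e = {"X": 0.0, "Y": 0.0, "Z": 0.0}, 0.0
        for line in gcode.splitlines():
            if not re.match(r"\s*G0?[01]\b", line):
                continue                        # only linear motion (G0/G1) is modelled
            new, new_e = dict(pos), e
            for word, val in re.findall(r"([XYZE])(-?\d+\.?\d*)", line):
                if word == "E":
                    new_e = float(val)
                else:
                    new[word] = float(val)
            yield dict(pos), dict(new), new_e > e
            pos, e = new, new_e

    def to_cuboids(gcode):
        """Lift each extruding segment to an axis-aligned, padded cuboid."""
        cuboids = []
        for start, end, extruding in parse_moves(gcode):
            if not extruding:
                continue
            lo = {k: min(start[k], end[k]) - PAD for k in "XYZ"}
            hi = {k: max(start[k], end[k]) + PAD for k in "XYZ"}
            cuboids.append((lo, hi))
        return cuboids

    sample = "G1 X10 Y0 Z0.2 E1.0\nG0 X10 Y10\nG1 X0 Y10 E2.0\n"
    print(to_cuboids(sample))   # two cuboids: the G0 travel move deposits no material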

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Correct-by-Construction: Certified Individual Fairness through Neural Network Training
Ruihan Zhang and Jun Sun
(Singapore Management University, Singapore)
Fairness in machine learning is more important than ever as ethical concerns continue to grow. Individual fairness demands that individuals differing only in sensitive attributes receive the same outcomes. However, commonly used machine learning algorithms often fail to achieve such fairness. To improve individual fairness, various training methods have been developed, such as incorporating fairness constraints as optimisation objectives. While these methods have demonstrated empirical effectiveness, they lack formal guarantees of fairness. Existing approaches that aim to provide fairness guarantees primarily rely on verification techniques, which can sometimes fail to produce definitive results. Moreover, verification alone does not actively enhance individual fairness during training. To address this limitation, we propose a novel framework that formally guarantees individual fairness throughout training. Our approach consists of two parts: (1) provably fair initialisation that ensures the model starts in a fair state, and (2) a fairness-preserving training algorithm that maintains fairness as the model learns. A key element of our method is the use of randomised response mechanisms, which protect sensitive attributes while maintaining fairness guarantees. We formally prove that this mechanism sustains individual fairness throughout the training process. Experimental evaluations confirm that our approach is effective, i.e., it produces models that are empirically fair and accurate. Furthermore, our approach is much more efficient than the alternative approach based on certified training (which requires neural network verification during training).
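Randomised response, the mechanism referred to above, is a classical local-privacy technique: report the true sensitive attribute with probability p and a fair coin flip otherwise, so that any single report is deniable while population statistics remain recoverable. The Python sketch below shows the textbook binary-attribute version, which may differ in detail from the variant used in the paper.

    import random

    def randomized_response(sensitive_bit, p=0.75, rng=random):
        """Report the true sensitive attribute with probability p,
        otherwise report a fair coin flip (plausible deniability)."""
        if rng.random() < p:
            return sensitive_bit
        return rng.randint(0, 1)

    def debias_mean(reports, p=0.75):
        """Recover an unbiased estimate of the population mean of the true bits,
        using E[report] = p * mean + (1 - p) * 0.5."""
        observed = sum(reports) / len(reports)
        return (observed - (1 - p) * 0.5) / p

    true_bits = [1] * 3000 + [0] * 7000
    reports = [randomized_response(b) for b in true_bits]
    print(debias_mean(reports))   # close to the true mean, 0.3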

Publisher's Version
Float Self-Tagging
Olivier Melançon, Manuel Serrano, and Marc Feeley
(Université de Montréal, Canada; Inria, France)
Dynamic and polymorphic languages attach information, such as types, to run-time objects, and therefore adapt the memory layout of values to include space for this information. This makes it difficult to efficiently implement IEEE 754 floating-point numbers, as this format does not leave an easily accessible space to store type information. The three main floating-point number encodings in use today (tagged pointers, NaN-boxing, and NuN-boxing) all have drawbacks. Tagged pointers entail a heap allocation of all float objects, and NaN/NuN-boxing puts additional run-time costs on type checks and the handling of other objects.
This paper introduces self-tagging, a new approach to object tagging that uses an invertible bitwise transformation to map floating-point numbers to tagged values that contain the correct type information at the correct position in their bit pattern, superimposing both their value and type information in a single machine word. Such a transformation can only map a subset of all floats to correctly typed tagged values, hence self-tagging takes advantage of the non-uniform distribution of floating point numbers used in practice to avoid heap allocation of the most frequently encountered floats.
Variants of self-tagging were implemented in two distinct Scheme compilers and evaluated on four microarchitectures to assess their performance and compare them to tagged pointers, NaN-boxing, and NuN-boxing. Experiments demonstrate that, in practice, the approach eliminates heap allocation of nearly all floating-point numbers and provides good execution speed of float-intensive benchmarks in Scheme with a negligible performance impact on other benchmarks, making it an attractive alternative to tagged pointers, alongside NaN-boxing and NuN-boxing.
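To convey the flavour of such an invertible transformation, the Python toy below rotates the 64-bit pattern of a double by a 3-bit tag width so that its top bits land in the tag position: doubles whose rotated low bits already equal the chosen float tag can be stored unboxed in a tagged word, and decoding simply undoes the rotation. The tag value, tag width, and which floats consequently qualify are arbitrary choices for this example; the paper's encoding is engineered so that commonly occurring floats qualify.

    import struct

    TAG_BITS = 3
    FLOAT_TAG = 0b001          # illustrative tag; which floats qualify depends on this choice
    MASK64 = (1 << 64) - 1

    def float_bits(x):
        return struct.unpack("<Q", struct.pack("<d", x))[0]

    def bits_to_float(b):
        return struct.unpack("<d", struct.pack("<Q", b))[0]

    def rotl(b, k):
        return ((b << k) | (b >> (64 - k))) & MASK64

    def try_self_tag(x):
        """Return a tagged word if x's rotated bit pattern already carries FLOAT_TAG,
        otherwise None (meaning this float would have to be heap-boxed)."""
        word = rotl(float_bits(x), TAG_BITS)
        return word if word & ((1 << TAG_BITS) - 1) == FLOAT_TAG else None

    def untag(word):
        """Invert the transformation: rotate back and reinterpret as a double."""
        return bits_to_float(rotl(word, 64 - TAG_BITS))

    for x in (1.5, 3.141592653589793, -2.0, 1e300):
        w = try_self_tag(x)
        print(x, "self-tagged" if w is not None else "boxed",
              "" if w is None else untag(w))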

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
TailTracer: Continuous Tail Tracing for Production Use
Tianyi Liu, Yi Li, Yiyu Zhang, Zhuangda Wang, Rongxin Wu, Xuandong Li, and Zhiqiang Zuo
(Nanjing University, China; Xiamen University, China)
Despite extensive in-house testing, bugs often escape to deployed software. Whenever a failure occurs in production software, it is desirable to collect as much execution information as possible so as to help developers reproduce, diagnose and fix the bug. To reconcile the tension between trace capability, runtime overhead, and trace scale, we propose continuous tail tracing for production use. Instead of capturing only crash stacks, we produce the complete sequence of function calls and returns. Importantly, to avoid the overwhelming stress to I/O, storage, and network transfer caused by the tremendous amount of trace data, we only retain the final segment of trace. To accomplish it, we design a novel trace decoder to support precise tail trace decoding, and an effective path-based instrumentation-site selection algorithm to reduce overhead. We implemented our approach as a tool called TailTracer on top of LLVM, and conducted the evaluations over the SPEC CPU 2017 benchmark suite, the open-source database system, and real-world bugs. The experimental results validate that TailTracer achieves low-overhead tail tracing, while providing more informative trace data than the baseline.
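A rough in-process Python analogue of the idea is sketched below: only the final segment of the call/return event stream is retained in a bounded buffer, here via sys.setprofile and a fixed-capacity deque. TailTracer itself instruments compiled code via LLVM, with a dedicated trace decoder and instrumentation-site selection, so this is only an analogy.

    import sys
    from collections import deque

    class TailTrace:
        """Keep only the last `capacity` call/return events (the tail of the trace)."""
        def __init__(self, capacity=1000):
            self.events = deque(maxlen=capacity)   # older events are dropped automatically

        def _profiler(self, frame, event, arg):
            if event in ("call", "return"):
                self.events.append((event, frame.f_code.co_name))

        def __enter__(self):
            sys.setprofile(self._profiler)
            return self

        def __exit__(self, *exc_info):
            sys.setprofile(None)
            return False                            # do not swallow the failure

    def helper(n):
        return n * 2

    def workload():
        for i in range(5):
            helper(i)
        raise RuntimeError("injected failure")

    with TailTrace(capacity=8) as trace:
        try:
            workload()
        except RuntimeError:
            pass
    print(list(trace.events))   # the final calls/returns leading up to the failure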

Publisher's Version Published Artifact Artifacts Available
Tabby: A Synthesis-Aided Compiler for High-Performance Zero-Knowledge Proof Circuits
Junrui Liu, Jiaxin Song, Yanning Chen, Hanzhi Liu, Hongbo Wen, Luke Pearson, Yanju Chen, and Yu Feng
(University of California at Santa Barbara, USA; University of Illinois at Urbana-Champaign, USA; University of Toronto, Canada; Polychain Capital, USA)
Zero-knowledge proof (ZKP) applications require translating high-level programs into arithmetic circuits—a process that demands both correctness and efficiency. While recent DSLs improve usability, they often yield suboptimal circuits, and hand-optimized implementations remain difficult to construct and verify. We present Tabby, a synthesis-aided compiler that automates the generation of high-performance ZK circuits from high-level code. Tabby introduces a domain-specific intermediate representation designed for symbolic reasoning and applies sketch-based program synthesis to derive optimized low-level implementations. By decomposing programs into reusable components and verifying semantic equivalence via SMT-based reasoning, Tabby ensures correctness while achieving substantial performance improvements. We evaluate Tabby on a suite of real-world ZKP applications and demonstrate significant reductions in proof generation time and circuit size against mainstream ZK compilers.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Towards a Theoretically-Backed and Practical Framework for Selective Object-Sensitive Pointer Analysis
Chaoyue Zhang, Longlong Lu, Yifei Lu, Minxue Pan, and Xuandong Li
(Nanjing University, China)
Context sensitivity is a foundational technique in pointer analysis, critical and essential for improving precision but often incurring significant efficiency costs. Recent advances focus on selective context-sensitive analysis, where only a subset of program elements, such as methods or heap objects, are analyzed under context sensitivity while the rest are analyzed under context insensitivity, aiming to balance precision with efficiency. However, despite the proliferation of such approaches, existing methods are typically driven by specific code patterns, therefore lacking a comprehensive theoretical foundation for systematically identifying code scenarios that benefit from context sensitivity.
This paper presents a novel and foundational theory that establishes a sound over-approximation of the ground truth, i.e., the objects that really improve precision under context sensitivity. The proposed theory reformulates the identification of this upper bound into graph reachability problems over a typical Pointer Flow Graph (PFG), each of which can be efficiently solved under context insensitivity. Building on this theoretical foundation, we introduce our selective context-sensitive analysis approach, Moon. Moon performs both backward and forward traversal on a Variable Flow Graph (VFG), an optimized variant of the PFG designed to facilitate efficient traversal. This traversal systematically identifies all objects that improve precision under context sensitivity. Our theoretical foundation, along with carefully designed trade-offs within our approach, allows Moon to limit the scope of objects to be selected, leading to an effective balance between analysis precision and efficiency. Extensive experiments with Moon across 30 Java programs demonstrate that Moon achieves 37.2× and 382.0× speedups for 2-object-sensitive and 3-object-sensitive analyses, respectively, with negligible precision losses of only 0.1% and 0.2%. These results highlight that Moon strikes a balance between efficiency and precision that significantly outperforms all previous approaches.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
What’s in the Box: Ergonomic and Expressive Capture Tracking over Generic Data Structures
Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, and Martin Odersky
(EPFL, Switzerland)
Capturing types in Scala unify static effect and resource tracking with object capabilities, enabling lightweight effect polymorphism with minimal notational overhead. However, their expressiveness has been insufficient for tracking capabilities embedded in generic data structures, preventing them from scaling to the standard collections library – an essential prerequisite for broader adoption. This limitation stems from the inability to name capabilities within the system’s notion of box types.
This paper develops System Capless, a new foundation for capturing types that provides the theoretical basis for reach capabilities (rcaps), a novel mechanism for naming “what’s in the box”. The calculus refines the universal capability notion into a new scheme with existential and universal capture set quantification. Intuitively, rcaps witness existentially quantified capture sets inside the boxes of generic types in a way that does not require exposing existential capture types in the surface language. We have fully mechanized the formal metatheory of System Capless in Lean, including proofs of type soundness and scope safety. System Capless supports the same lightweight notation of capturing types plus rcaps, as certified by a type-preserving translation, and also enables fully optional explicit capture-set quantification to increase expressiveness.
Finally, we present a full reimplementation of capture checking in Scala 3 based on System Capless and migrate the entire Scala collections library and an asynchronous programming library to evaluate its practicality and ergonomics. Our results demonstrate that reach capabilities enable the adoption of capture checking in production code with minimal changes and minimal-to-zero notational overhead in a vast majority of cases.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
GALA: A High Performance Graph Neural Network Acceleration LAnguage and Compiler
Damitha Lenadora, Nikhil Jayakumar, Chamika Sudusinghe, and Charith Mendis
(University of Illinois at Urbana-Champaign, USA; University of Texas at Austin, USA)
Multiple frameworks and optimizations have been proposed for accelerating Graph Neural Network (GNN) workloads over the years, achieving sizable runtime performance improvements. However, we notice that existing systems usually explore optimizing either at the intra-operator level or at the inter-operator level, missing synergies that exist due to their compositions. Further, most existing works focus primarily on optimizing the forward computation of GNNs, often overlooking opportunities for training-specific optimizations. To exploit these missed optimization opportunities, we introduce GALA, a domain-specific language (DSL) and a compiler that allows composing optimizations at different levels. The GALA DSL exposes intra-operator transformations as scheduling commands, while we introduce novel inter-operator transformations as part of the compiler. The composition of these transformations is made possible through the introduction of two novel intermediate representations (IRs) in the GALA compiler that track and compose transformations at both the intra- and inter-operator levels. Further, the IRs maintain a global view of the GNN program, including its training process. This allows us to introduce training-specific transformations to aggressively optimize GNN training. Our evaluations show that GALA achieves a geo-mean speedup of 2.55× for inference and 2.52× for training across multiple systems, graphs, and GNN models. We also show that GALA performs well across different graph sizes and GNN model configurations, and allows users to explore different methods of performing similar optimizations, leading to different tradeoff spaces.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Choreographic Quick Changes: First-Class Location (Set) Polymorphism
Ashley Samuelson, Andrew K. Hirsch, and Ethan Cecchetti
(University of Wisconsin-Madison, USA; SUNY Buffalo, USA)
Choreographic programming is a promising new paradigm for programming concurrent systems where a developer writes a single centralized program that compiles to individual programs for each node. Existing choreographic languages, however, lack critical features integral to modern systems, like the ability of one node to dynamically compute who should perform a computation and send that decision to others. This work addresses this gap with λQC, the first typed choreographic language with first-class process names and polymorphism over both types and (sets of) locations. λQC also improves expressive power over previous work by supporting algebraic and recursive data types as well as multiply-located values. We formalize and mechanically verify our results in Rocq, including the standard choreographic guarantee of deadlock freedom.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Structural Abstraction and Refinement for Probabilistic Programs
Guanyan Li, Juanen Li, Zhilei Han, Peixin Wang, Hongfei Fu, and Fei He
(Tsinghua University, China; University of Oxford, UK; Beijing Normal University, China; East China Normal University, China; Shanghai Jiao Tong University, China)
In this paper, we present structural abstraction refinement, a novel framework for verifying the threshold problem of probabilistic programs. Our approach represents the structure of a Probabilistic Control-Flow Automaton (PCFA) as a Markov Decision Process (MDP) by abstracting away statement semantics. The maximum reachability of the MDP naturally provides a proper upper bound of the violation probability, termed the structural upper bound. This introduces a fresh “structural” characterization of the relationship between PCFA and MDP, contrasting with the traditional “semantical” view, where the MDP reflects semantics. The method uniquely features a clean separation of concerns between probability and computational semantics: the abstraction focuses solely on probabilistic computation, while the refinement handles only the semantic aspect, which allows non-probabilistic program verification techniques to be employed without modification.
Building upon this feature, we propose a general counterexample-guided abstraction refinement (CEGAR) framework, capable of leveraging established non-probabilistic techniques for probabilistic verification. We explore its instantiations using trace abstraction. Our method was evaluated on a diverse set of examples against state-of-the-art tools, and the experimental results highlight its versatility and ability to handle more flexible structures swiftly.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Modeling Reachability Types with Logical Relations: Semantic Type Soundness, Termination, Effect Safety, and Equational Theory
Yuyan Bao, Songlin Jia, Guannan Wei, Oliver Bračevac, and Tiark Rompf
(Augusta University, USA; Purdue University, USA; Tufts University, USA; EPFL, Switzerland)
Reachability types are a recent proposal to bring Rust-style reasoning about memory properties to higher-level languages, with a focus on higher-order functions, parametric types, and shared mutable state – features that are only partially supported by current techniques as employed in Rust. While prior work has established key type soundness results for reachability types using the usual syntactic techniques of progress and preservation, stronger metatheoretic properties have so far been unexplored. This paper presents an alternative semantic model of reachability types using logical relations, providing a framework in which we study key properties of interest: (1) semantic type soundness, including for code fragments that are not syntactically well-typed, (2) termination, especially in the presence of higher-order mutable references, (3) effect safety, especially the absence of observable mutation, and, finally, (4) program equivalence, especially reordering of non-interfering expressions for parallelization or compiler optimization.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Incremental Bidirectional Typing via Order Maintenance
Thomas J. Porter, Marisa Kirisame, Ivan Wei, Pavel Panchekha, and Cyrus Omar
(University of Michigan, USA; University of Utah, USA)
Live programming environments provide various semantic services, including type checking and evaluation, continuously as the user is editing the program. The live paradigm promises to improve the developer experience, but liveness is an implementation challenge, particularly when working with large programs. This paper specifies and efficiently implements a system that is able to incrementally update type information for a live program in response to fine-grained program edits. This information includes type error marks and information about the expected and actual type of every expression. The system is specified type-theoretically as a small-step dynamics that propagates updates through the marked and annotated program. Most updates flow according to a base bidirectional type system. Additional pointers are maintained to connect bound variables to their binding locations, with type updates traversing these pointers directly. Order maintenance data structures are employed to efficiently maintain these pointers and to prioritize the order of update propagation. We prove this system is equivalent to naive reanalysis in the Agda theorem prover, along with other important metatheoretic properties. We then provide an efficient OCaml implementation, detailing a number of impactful optimizations. We evaluate this implementation's performance with a large stress-test and find that it is able to achieve multiple orders of magnitude speed-up compared to from-scratch reanalysis.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced ACM SIGPLAN Distinguished Paper Award
Quantization with Guaranteed Floating-Point Neural Network Classifications
Anan Kabaha and Dana Drachsler Cohen
(Technion, Israel)
Despite the wide success of neural networks, their computational cost is very high. Quantization techniques reduce this cost, but they can change the classifications of the original floating-point network, even if the training is quantization-aware. In this work, we rely on verification to design correction systems that detect classification inconsistencies at inference time and eliminate them. The key idea is to overapproximate the space of inconsistent inputs with their maximal classification confidence. The main challenge in computing this confidence is that it involves analyzing a quantized network, which introduces a very high degree of nonlinearity, over all possible inputs. We propose CoMPAQt, an algorithm for computing this confidence. CoMPAQt relies on a novel encoding of quantization in mixed-integer linear programming (MILP), along with customized linear relaxations to reduce the high complexity. To prune the search space, it ties the computations of the quantized network and its floating-point counterpart. Given this confidence, we propose two correction mechanisms. The first mechanism guarantees to return the classification of the floating-point network and relies on networks with increasing bit precisions. The second mechanism mitigates classification inconsistencies by an ensemble of quantized networks. We evaluate our approach on MNIST, ACAS-Xu, and tabular datasets over fully connected and convolutional networks. Results show that our first correction mechanism guarantees 100% consistency with the floating-point network’s classifications while reducing its computational cost by 3.8x, on average. Our second mechanism reaches an almost perfect consistency guarantee in our experiments while reducing the computational cost by 4.1x. Our work is the first to provide a formal guarantee on the classification consistency of a quantized network.
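A minimal sketch of the inference-time cascade behind the first correction mechanism, under our own simplifying assumptions (it is not CoMPAQt's exact algorithm, and all names and the confidence-bound representation are hypothetical): try networks of increasing bit precision and accept a prediction only if its confidence exceeds a precomputed bound on the confidence achievable by inconsistent inputs.

def classify_with_guarantee(x, quantized_models, float_model, inconsistent_confidence_bounds):
    # quantized_models: ordered from lowest to highest precision.
    # inconsistent_confidence_bounds[i]: verified upper bound on the confidence that
    # model i can assign to any input whose quantized and float classifications differ.
    for model, bound in zip(quantized_models, inconsistent_confidence_bounds):
        scores = model(x)                                  # class scores for input x
        label, confidence = max(enumerate(scores), key=lambda kv: kv[1])
        if confidence > bound:                             # cannot be an inconsistent input
            return label
    scores = float_model(x)                                # last resort: original network
    return max(enumerate(scores), key=lambda kv: kv[1])[0]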

Publisher's Version
A Refinement Methodology for Distributed Programs in Rust
Aurel Bílý, João Pereira, and Peter Müller
(ETH Zurich, Switzerland)
Refinement relates an abstract system model to a concrete, executable program, such that properties established for the abstract model carry over to the concrete implementation. Refinement has been used successfully in the development of substantial verified systems. Nevertheless, existing refinement techniques have limitations that impede their practical usefulness. Top-down refinement techniques that automatically generate executable code generally produce implementations with sub-optimal performance. Bottom-up refinement allows one to reason about existing, efficient implementations, but imposes strict requirements on the structure of the code, the structure of the refinement proofs, as well as the employed verification logic and tools.
In this paper, we present a novel bottom-up refinement methodology that removes these limitations. Our methodology uses the familiar notion of guarded transition systems to express abstract models, but combines guards with a novel notion of locally inductive invariants to relate the abstract model locally to concrete state. This approach is much more flexible than standard coupling invariants; in particular, it supports a wide range of program structures, data representations, and proof structures. We integrate our methodology as a library into Rust, leveraging the Rust type system to reason about ownership of guards. This integration allows one to use our methodology with an off-the-shelf Rust verification tool. It also facilitates practical applications, as we demonstrate on a number of substantial case studies including a concurrent implementation of Memcached.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Divide and Conquer: A Compositional Approach to Game-Theoretic Security
Ivana Bocevska, Anja Petković Komel, Laura Kovács, Sophie Rain, and Michael Rawson
(TU Wien, Austria; Argot Collective, Switzerland; University of Southampton, UK)
We propose a compositional approach to combine and scale automated reasoning in the static analysis of decentralized system security, such as blockchains. Our focus lies in the game-theoretic security analysis of such systems, allowing us to examine economic incentives behind user actions. In this context, it is particularly important to certify that deviating from the intended, honest behavior of the decentralized protocol is not beneficial: as long as users follow the protocol, they cannot be financially harmed, regardless of how others behave. Such an economic analysis of blockchain protocols can be encoded as an automated reasoning problem in the first-order theory of real arithmetic, reducing game-theoretic reasoning to satisfiability modulo theories (SMT). However, analyzing an entire game-theoretic model (called a game) as a single SMT instance does not scale to protocols with millions of interactions. We address this challenge and propose a divide-and-conquer security analysis based on compositional reasoning over games. Our compositional analysis is incremental: we divide games into subgames such that changes to one subgame do not necessitate re-analyzing the entire game, but only the ancestor nodes. Our approach is sound, complete, and effective: combining the security properties of subgames yields security of the entire game. Experimental results show that compositional reasoning discovers intra-game properties and errors while scaling to games with millions of nodes, enabling security analysis of large protocols.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Automated Discovery of Tactic Libraries for Interactive Theorem Proving
Yutong Xin, Jimmy Xin, Gabriel Poesia, Noah D. Goodman, Qiaochu Chen, and Işıl Dillig
(University of Texas at Austin, USA; Stanford University, USA; New York University, USA)
Enabling more concise and modular proofs is essential for advancing formal reasoning using interactive theorem provers (ITPs). Since many ITPs, such as Rocq and Lean, use tactic-style proofs, learning higher-level custom tactics is crucial for proof modularity and automation. This paper presents a novel approach to tactic discovery, which leverages Tactic Dependence Graphs (TDGs) to identify reusable proof strategies across multiple proofs. TDGs capture logical dependencies between tactic applications while abstracting away irrelevant syntactic details, allowing for both the discovery of new tactics and the refactoring of existing proofs into more modular forms. We have implemented this technique in a tool called TacMineR and compare it against an anti-unification-based approach (Peano) to tactic discovery. Our evaluation demonstrates that TacMineR can learn 3× as many tactics as Peano and reduces the size of proofs by 26% across all benchmarks. Furthermore, our evaluation demonstrates the benefits of learning custom tactics for proof automation, allowing a state-of-the-art proof automation tool to achieve a relative increase of 172% in terms of success rate.
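As a rough illustration of the tactic-dependence idea, the following sketch builds a dependency edge whenever one tactic application consumes a fact produced by an earlier one; the `produces`/`consumes` fields are hypothetical and this is a simplification of TDGs, not the tool's construction.

def tactic_dependence_graph(proof):
    # proof: list of tactic applications, each a dict with sets of hypothesis/goal
    # identifiers it introduces ("produces") and uses ("consumes").
    producer = {}
    edges = set()
    for j, step in enumerate(proof):
        for fact in step["consumes"]:
            if fact in producer:
                edges.add((producer[fact], j))   # step j depends on step producer[fact]
        for fact in step["produces"]:
            producer[fact] = j
    return edges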

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Place Capability Graphs: A General-Purpose Model of Rust’s Ownership and Borrowing Guarantees
Zachary Grannan, Aurel Bílý, Jonáš Fiala, Jasper Geer, Markus de Medeiros, Peter Müller, and Alexander J. Summers
(University of British Columbia, Canada; ETH Zurich, Switzerland; New York University, USA)
Rust’s novel type system has proved an attractive target for verification and program analysis tools, due to the rich guarantees it provides for controlling aliasing and mutability. However, fully understanding, extracting and exploiting these guarantees is subtle and challenging: existing models for Rust’s type checking either support a smaller idealised language disconnected from real-world Rust code, or come with severe limitations in terms of precise modelling of Rust borrows, composite types storing them, function signatures and loops.
In this paper, we present Place Capability Graphs: a novel model of Rust’s type-checking results, which lifts these limitations, and which can be directly calculated from the Rust compiler’s own programmatic representations and analyses. We demonstrate that our model supports over 97% of Rust functions in the most popular public crates, and show its suitability as a general-purpose basis for verification and program analysis tools by developing promising new prototype versions of the existing Flowistry and Prusti tools.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Efficient Decrease-and-Conquer Linearizability Monitoring
Zheng Han Lee and Umang Mathur
(National University of Singapore, Singapore)
Linearizability has become the de facto standard for specifying correctness of implementations of concurrent data structures. While formally verifying such implementations remains challenging, linearizability monitoring has emerged as a promising first step to rule out early problems in the development of custom implementations, and serves as a key component in approaches that stress test such implementations. In this work, we undertake an algorithmic investigation of the linearizability monitoring problem, which asks to check if an execution history obtained from a concurrent data structure implementation is linearizable.
While this problem is largely understood to be intractable in general, a systematic understanding of when it becomes tractable has remained elusive. We revisit this problem and first present a unified ‘decrease-and-conquer’ algorithmic framework for designing linearizability monitors. At its heart, this framework asks to identify special linearizability-preserving values in a given history — values whose presence yields an equi-linearizable sub-history (obtained by removing operations of such values), and whose absence indicates non-linearizability. More importantly, we prove that a polynomial time algorithm for the problem of identifying linearizability-preserving values immediately yields a polynomial time algorithm for the linearizability monitoring problem, while, conversely, intractability of this problem implies intractability of monitoring.
We demonstrate the effectiveness of our decrease-and-conquer framework by instantiating it for several popular concurrent data types — registers, sets, stacks, queues and priority queues — deriving polynomial time algorithms for them, under the (unambiguity) restriction that each insertion to the underlying data structure adds a distinct value. We further optimize these algorithms to achieve log-linear running time through the use of efficient data structures for amortizing the cost of solving induced sub-problems. Our implementation and evaluation on publicly available implementations of concurrent data structures show that our approach scales to very large histories and significantly outperforms existing state-of-the-art tools.
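A minimal sketch of the outer decrease-and-conquer loop, under our assumptions about how operations and values are represented (the data-type-specific oracle is left abstract and all names are hypothetical):

def monitor_linearizability(history, find_preserving_value):
    # history: list of operations, each a dict tagged with the value it acts on.
    # find_preserving_value(history): data-type-specific procedure returning a
    # linearizability-preserving value, or None if no such value exists.
    while history:
        v = find_preserving_value(history)
        if v is None:
            return False                                   # no preserving value: not linearizable
        history = [op for op in history if op["value"] != v]
    return True                                            # empty history is trivially linearizable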

Publisher's Version Published Artifact Artifacts Available
Debugging WebAssembly? Put Some Whamm on It!
Elizabeth Gilbert, Matthew Schneider, Zixi An, Suhas Thalanki, Wavid Bowman, Alexander Y. Bai, Ben L. Titzer, and Heather Miller
(Carnegie Mellon University, USA; University of Florida, USA; New York University, USA)
Debugging and monitoring programs are integral to engineering and deploying software. Dynamic analyses monitor applications through source code or IR injection, machine code or bytecode rewriting, virtual machine APIs, or direct hardware support. While these techniques are viable within their respective domains, common tooling across techniques is rare, leading to fragmentation of skills, duplicated efforts, and inconsistent feature support. We address this problem in the WebAssembly ecosystem with Whamm, an instrumentation framework for Wasm that uses engine-level probing and has a bytecode rewriting fallback to promote portability. Whamm solves three problems: 1) tooling fragmentation, 2) prohibitive instrumentation overhead of general-purpose frameworks, and 3) tedium of tailoring low-level high-performance mechanisms. Whamm provides fully-programmable instrumentation with declarative match rules, static and dynamic predication, automatic state reporting, and user library support, achieving high performance through compiler and engine optimizations. The Whamm engine API allows instrumentation to be provided to a Wasm engine as Wasm code, reusing existing engine optimizations and unlocking new ones, most notably intrinsification, to minimize overhead. A key insight of our work is that explicitly requesting program state in match rules, rather than reflection, enables the engine to efficiently bundle arguments and even inline compiled probe logic. Whamm streamlines the tooling effort, as its bytecode-rewriting target can run instrumented programs everywhere, lowering fragmentation and advancing the state of the art for engine support. We evaluate Whamm with case studies of non-trivial monitors and show it is expressive, powerful, and efficient.

Publisher's Version Published Artifact Artifacts Available
Sound and Modular Activity Analysis for Automatic Differentiation in MLIR
Mai Jacob Peng, William S. Moses, Oleksandr Zinenko, and Christophe Dubach
(McGill University, Canada; University of Illinois at Urbana-Champaign, USA; Brium, France; Mila, Canada)
Computing derivatives is paramount for multiple domains ranging from training neural networks to precise climate simulations. While derivatives can be generated by Automatic Differentiation (AD) tools, they often require aggressive optimization to avoid compromising program performance. One of the central optimizations consists of identifying inactive operations that do not contribute to the partial derivatives of interest.
Multiple tools provide activity analyses for a variety of input languages, though often with only informal correctness guarantees. This paper formally defines activity analysis for AD as an abstract interpretation, proves its soundness, and implements it within the MLIR compiler infrastructure. To account for MLIR’s genericity, a subset of MLIR’s internal representation amenable to AD is formalized for the first time. Furthermore, the paper proposes a sound intraprocedural approximation of the whole-program activity analysis via function summaries along with a mechanism to automatically derive these summaries from function definitions.
The implementation is evaluated on a differentiation-specific benchmark suite. It achieves a 1.24× geometric-mean speedup on CPU and a 1.7× geometric-mean speedup on GPU in the runtime of generated programs, when compared to a baseline that does not use activity analysis. The evaluation also demonstrates that the intraprocedural analysis with function summaries proves inactive 100% of the instructions that the whole-program analysis proves inactive.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Flix: A Design for Language-Integrated Datalog
Magnus Madsen and Ondřej Lhoták
(Aarhus University, Denmark; University of Waterloo, Canada)
We present a comprehensive overview of the Datalog facilities in the Flix programming language. We show how programmers can write functions implemented as Datalog programs and we demonstrate how to build modular and reusable families of Datalog programs using first-class Datalog program values, rho abstraction, parametric polymorphism, and type classes. We describe several features that improve the ergonomics, flexibility, and expressive power of Datalog programming in Flix, including the inject and query program constructs, head and guard expressions, functional predicates, lattice semantics, and more.
We illustrate Datalog programming in Flix with several applications, including implementations of Ullman's algorithm to stratify Datalog programs, the Ford-Fulkerson algorithm for maximum flow, and the IFDS and IDE algorithms for context-sensitive program analysis. The implementations of IFDS and IDE fulfill a long-term goal: to have fully modular, polymorphic, typed, and declarative formulations of these algorithms that can be instantiated with any abstract domain.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
SafeTree: Expressive Tree Policies for Microservices
Karuna Grewal, Brighten Godfrey, and Justin Hsu
(Cornell University, USA; University of Illinois at Urbana-Champaign, USA)
A microservice-based application is composed of multiple self-contained components called microservices, and controlling inter-service communication is important for enforcing safety properties. Presently, inter-service communication is configured using microservice deployment tools. However, such tools only support a limited class of single-hop policies, which can be overly permissive because they ignore the rich service tree structure of microservice calls. Policies that can express the service tree structure can offer development and security teams more fine-grained control over communication patterns.
To this end, we design an expressive policy language to specify service tree structures, and we develop a visibly pushdown automata-based dynamic enforcement mechanism to enforce service tree policies. Our technique is non-invasive: it does not require any changes to service implementations, and does not require access to microservice code. To realize our method, we build a runtime monitor on top of a service mesh, an emerging network infrastructure layer that can control inter-service communication during deployment. In particular, we employ the programmable network traffic filtering capabilities of Istio, a popular service mesh implementation, to implement an online and distributed monitor. Our experiments show that our monitor can enforce rich safety properties while adding minimal latency overhead on the order of milliseconds.
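To illustrate the enforcement idea only (this is not SafeTree or Istio code; the services and the policy are hypothetical): inter-service calls and returns can be treated as a nested word, with a per-request stack in the spirit of visibly pushdown automata, and a policy predicate over the current call path decides whether a new hop is admitted.

class TreePolicyMonitor:
    # Stack-based monitor: calls push, returns pop, and allowed_path is a
    # predicate over the current service-tree path (a list of service names).
    def __init__(self, allowed_path):
        self.stack = []
        self.allowed_path = allowed_path

    def on_call(self, service):
        path = self.stack + [service]
        if not self.allowed_path(path):
            raise PermissionError(f"call path {' -> '.join(path)} violates policy")
        self.stack.append(service)

    def on_return(self):
        self.stack.pop()

# Example policy: the (hypothetical) payment service may only be reached via checkout.
monitor = TreePolicyMonitor(lambda p: "payment" not in p or "checkout" in p[:-1])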

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
TraceLinking Implementations with Their Verified Designs
Finn Hackett and Ivan Beschastnikh
(University of British Columbia, Canada)
An important correctness gap exists between formally verifiable distributed system designs and their implementations. Recently proposed work bridges this gap by automatically extracting, or compiling, an implementation from the formally-verified design. The runtime behavior of this compiled implementation, however, may deviate from its design. For example, the compiler may contain bugs, the design may make incorrect assumptions about the deployment environment, or the implementation might be misconfigured.
In this paper we develop TraceLink, a methodology to detect such deviations through trace validation. TraceLink maps traces, which capture an execution's behavior, to the corresponding formal design. Unlike previous work on trace validation, our approach is completely automated.
We implement TraceLink for PGo, a compiler from Modular PlusCal to both TLA+ and Go. We present a formal semantics for interpreting execution traces as TLA+, along with a templatization strategy to minimize the size of the TLA+ tracing specification. We also present a novel trace path validation strategy, called sidestep, which detects bugs faster and with little additional overhead.
We evaluated TraceLink on several distributed systems, including an MPCal implementation of a Raft key-value store. Our evaluation demonstrates that TraceLink is able to find 9 previously undetected and diverse bugs in PGo's TCB, including a bug in the PGo compiler itself. We also show the effectiveness of the templatization approach and the sidestep path validation strategy.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Universal Scalability in Declarative Program Analysis (with Choice-Based Combination Pruning)
Anastasios Antoniadis, Ilias Tsatiris, Neville Grech, and Yannis Smaragdakis
(University of Athens, Greece; Dedaub, Greece; University of Malta, Malta; Dedaub, Malta)
Datalog engines for fixpoint evaluation have brought great benefits to static program analysis over the past decades. A Datalog specification of an analysis allows a declarative, easy-to-maintain specification, without sacrificing performance, and indeed often achieving significant speedups compared to hand-coded algorithms.
However, these benefits come with a certain loss of control. Datalog evaluation is bottom-up, meaning that all inferences (from a set of initial facts) are performed and all their conclusions are outputs of the computation. In practice, virtually every program analysis expressed in Datalog becomes unscalable for some inputs, due to the worst-case blowup of computing all results, even when a partial answer would have been perfectly satisfactory.
In this work, we present a simple, uniform, and elegant solution to the problem, with great practical effectiveness and application to virtually any Datalog-based analysis. The approach consists of leveraging the choice construct, supported natively in modern Datalog engines like Souffle. The choice construct allows the definition of functional dependencies in a relation and has been used in the past for expressing worklist algorithms. We show a near-universal construction that allows the choice construct to flexibly limit evaluation of predicates. The technique is applicable to practically any analysis architecture imaginable, since it adaptively prunes evaluation results when a (programmer-controlled) projection of a relation exceeds a desired cardinality.
We apply the technique to what are probably the largest pre-existing Datalog analysis frameworks: Doop (for Java bytecode) and the main client analyses from the Gigahorse framework (for Ethereum smart contracts). Without needing to understand the existing analysis logic, and with minimal, local-only changes, the performance of each framework increases dramatically, by over 20x for the hardest inputs, with near-negligible sacrifice in completeness.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Verifying Asynchronous Hyperproperties in Reactive Systems
Raven Beutner and Bernd Finkbeiner
(CISPA Helmholtz Center for Information Security, Germany)
Hyperproperties are system properties that relate multiple execution traces and commonly occur when specifying information-flow and security policies. Logics like HyperLTL utilize explicit quantification over execution traces to express temporal hyperproperties in reactive systems, i.e., hyperproperties that reason about the temporal behavior along infinite executions. An often unwanted side-effect of such logics is that they compare the quantified traces synchronously. This prohibits the logics from expressing properties that compare multiple traces asynchronously, such as Zdancewic and Myers’s observational determinism, McLean’s non-inference, or stuttering refinement. We study the model-checking problem for a variant of asynchronous HyperLTL (A-HLTL), a temporal logic that can express hyperproperties where multiple traces are compared across timesteps. In addition to quantifying over system traces, A-HLTL features secondary quantification over stutterings of these traces. Consequently, A-HLTL allows for a succinct specification of many widely used asynchronous hyperproperties. Model-checking A-HLTL requires finding suitable stutterings, which, thus far, has been only possible for very restricted fragments or terminating systems. In this paper, we propose a novel game-based approach for the verification of arbitrary ∀** A-HLTL formulas in reactive systems. In our method, we consider the verification as a game played between a verifier and a refuter, who challenge each other by controlling parts of the underlying traces and stutterings. A winning strategy for the verifier then corresponds to concrete witnesses for existentially quantified traces and asynchronous alignments for existentially quantified stutterings. We identify fragments for which our game-based interpretation is complete and thus constitutes a finite-state decision procedure. We contribute a prototype implementation for finite-state systems and report on encouraging experimental results.

Publisher's Version
Synthesizing Implication Lemmas for Interactive Theorem Proving
Ana Brendel, Aishwarya Sivaraman, and Todd Millstein
(University of California at Los Angeles, USA)
Interactive theorem provers (ITP) enable programmers to formally verify properties of their software systems. One burden for users of ITPs is identifying the necessary helper lemmas to complete a proof, for example those that define key inductive invariants. Existing approaches to lemma synthesis for ITPs have limited, if any, support for synthesizing implications: lemmas of the form P1 ∧ ⋯ ∧ Pn → Q. In this paper, we propose a technique and associated tool for synthesizing useful implication lemmas. Our approach employs a form of data-driven invariant inference to explore strengthenings of the current proof state, based on sample valuations of the current goal and assumptions. We have implemented our approach in a Rocq tactic called dilemma. We demonstrate its effectiveness in synthesizing necessary helper lemmas for proofs from the Verified Functional Algorithms textbook as well as from prior benchmark suites for lemma synthesis.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
AccelerQ: Accelerating Quantum Eigensolvers with Machine Learning on Quantum Simulators
Avner Bensoussan, Elena Chachkarova, Karine Even-Mendoza, Sophie Fortz, and Connor Lenihan
(King’s College London, UK)
We present AccelerQ, a framework for automatically tuning quantum eigensolver (QE) implementations – quantum programs implementing a specific QE algorithm – using machine learning and search-based optimisation. Rather than redesigning quantum algorithms or manually tweaking the code of an existing implementation, AccelerQ treats QE implementations as black-box programs and learns to optimise their hyperparameters to improve accuracy and efficiency. It incorporates search-based techniques and genetic algorithms (GA) alongside ML models to efficiently explore the hyperparameter space of QE implementations and avoid local minima.
Our approach leverages two ideas: 1) train on data from smaller, classically simulable systems, and 2) use program-specific ML models, exploiting the fact that local physical interactions in molecular systems persist across scales, supporting generalisation to larger systems. We present an empirical evaluation of AccelerQ on two fundamentally different QE implementations: ADAPT-QSCI and QCELS. For each, we trained a QE predictor model, a lightweight XGBoost Python regressor, using data extracted classically from systems of up to 16 qubits. We deployed the model to optimise hyperparameters for executions on larger systems of 20-, 24-, and 28-qubit Hamiltonians, where direct classical simulation becomes impractical. We observed a reduction in error from 5.48% to 5.3% with only the ML model and further to 5.05% with GA for ADAPT-QSCI, and from 7.5% to 6.5%, with no additional gain with GA for QCELS. Given inconclusive results for some 20- and 24-qubit systems, we recommend further analysis of training data concerning Hamiltonian characteristics. Nonetheless, our results highlight the potential of ML and optimisation techniques for quantum programs and suggest promising directions for integrating software engineering methods into quantum software stacks.
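A small sketch of surrogate-guided hyperparameter tuning in this general style, under our own assumptions (this is not AccelerQ's pipeline; the functions, hyperparameter encoding, and GA details are hypothetical, and any regressor could stand in for XGBoost):

import random
from xgboost import XGBRegressor   # assumed available; any regressor would do

def mutate(h, scale=0.1):
    # Perturb a numeric hyperparameter vector (illustrative GA mutation).
    return [x * (1 + random.uniform(-scale, scale)) for x in h]

def tune(train_X, train_y, seed_candidates, generations=20, pop=32):
    # Fit a surrogate on (hyperparameters -> observed error) gathered from small,
    # classically simulable systems, then evolve candidates whose predicted error
    # is lowest for the larger target system.
    model = XGBRegressor().fit(train_X, train_y)
    population = [random.choice(seed_candidates) for _ in range(pop)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda h: float(model.predict([h])[0]))
        parents = ranked[: pop // 4]
        population = parents + [mutate(random.choice(parents)) for _ in range(pop - len(parents))]
    return min(population, key=lambda h: float(model.predict([h])[0]))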

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Modular Reasoning about Global Variables and Their Initialization
João Pereira, Isaac van Bakel, Patricia Firlejczyk, Marco Eilers, and Peter Müller
(ETH Zurich, Switzerland)
Many imperative programming languages offer global variables to implement common functionality such as global caches and counters. Global variables are typically initialized by module initializers (e.g., static initializers in Java), code blocks that are executed automatically by the runtime system. When or in what order these initializers run is typically not known statically and modularly. For instance in Java, initialization is triggered dynamically upon the first use of a class, while in Go, the order depends on all packages of a program. As a result, reasoning modularly about global variables and their initialization is difficult, especially because module initializers may perform arbitrary side effects and may have cyclic dependencies. Consequently, existing modular verification techniques either do not support global state or impose drastic restrictions that are not satisfied by mainstream languages and programs.
In this paper, we present the first practical verification technique to reason formally and modularly about global state and its initialization. Our technique is based on separation logic and uses module invariants to specify ownership and values of global variables. A partial order on modules and methods allows us to reason modularly about when a module invariant may be soundly assumed to hold, irrespective of when exactly the module initializer establishing it runs. Our technique supports both thread-local and shared global state. We formalize it as a program logic in Iris and prove its soundness in Rocq. We make only minimal assumptions about the initialization semantics, making our technique applicable to a wide range of programming languages. We implemented our technique in existing verifiers for Java and Go and demonstrate its effectiveness on typical use cases of global state as well as a substantial codebase implementing an Internet router.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Modal Abstractions for Virtualizing Memory Addresses
Ismail Kuru and Colin S. Gordon
(Drexel University, USA)
Virtual memory management (VMM) code is a critical piece of general-purpose OS kernels, but verification of this functionality is challenging due to the complexity of the hardware interface (the page tables are updated via writes to those memory locations, using addresses which are themselves virtualized). Prior work on verification of VMM code has either only handled a single address space, or trusted significant pieces of assembly code.
In this paper, we introduce a modal abstraction to describe the truth of assertions relative to a specific virtual address space: [r]P indicating that P holds in the virtual address space rooted at r. Such modal assertions allow different address spaces to refer to each other, enabling complete verification of instruction sequences manipulating multiple address spaces. Using them effectively requires working with other assertions, such as points-to assertions about memory contents — which implicitly depend on the address space they are used in. We therefore define virtual points-to assertions to definitionally mimic hardware address translation, relative to a page table root. We demonstrate our approach with challenging fragments of VMM code showing that our approach handles examples beyond what prior work can address, including reasoning about a sequence of instructions as it changes address spaces. Our results are formalized for a RISC-like fragment of x86-64 assembly in Rocq.
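For intuition, one plausible schematic reading of the modal and virtual points-to assertions, written in our own notation rather than the paper's exact definitions (translate stands for a page-table walk from root r):

% Schematic only; notation ours, not the paper's definitions.
% [r]P : assertion P holds when interpreted in the virtual address space rooted at r.
% Virtual points-to, defined to mimic hardware address translation:
[r](\mathit{va} \mapsto v) \;\triangleq\; \exists\, \mathit{pa}.\;
    \mathrm{translate}(r, \mathit{va}) = \mathit{pa} \;\wedge\; \mathit{pa} \mapsto_{\mathrm{phys}} v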

Publisher's Version
A Language for Quantifying Quantum Network Behavior
Anita Buckley, Pavel Chuprikov, Rodrigo Otoni, Robert Soulé, Robert Rand, and Patrick Eugster
(USI Lugano, Switzerland; Télécom Paris, France; Institut Polytechnique de Paris, France; Yale University, USA; University of Chicago, USA)
Quantum networks have capabilities that are impossible to achieve using only classical information. They connect quantum capable nodes, with their fundamental unit of communication being the Bell pair, a pair of entangled quantum bits. Due to the nature of quantum phenomena, Bell pairs are fragile and difficult to transmit over long distances, thus requiring a network of repeaters along with dedicated hardware and software to ensure the desired results. The intrinsic challenges associated with quantum networks, such as competition over shared resources and high probabilities of failure, require quantitative reasoning about quantum network protocols. This paper develops PBKAT, an expressive language for specification, verification and optimization of quantum network protocols for Bell pair distribution. Our language is equipped with primitives for expressing probabilistic and possibilistic behaviors, and with semantics modeling protocol executions. We establish the properties of PBKAT’s semantics, which we use for quantitative analysis of protocol behavior. We further implement a tool to automate PBKAT’s usage, which we evaluated on real-world protocols drawn from the literature. Our results indicate that PBKAT is well suited for both expressing real-world quantum network protocols and reasoning about their quantitative properties.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
MIO: Multiverse Debugging in the Face of Input/Output
Tom Lauwaerts, Maarten Steevens, and Christophe Scholliers
(Universiteit Gent, Belgium)
Debugging non-deterministic programs on microcontrollers is notoriously challenging, especially when bugs manifest in unpredictable, input-dependent execution paths. A recent approach, called multiverse debugging, makes it easier to debug non-deterministic programs by allowing programmers to explore all potential execution paths. Current multiverse debuggers enable both forward and backward traversal of program paths, and some facilitate jumping to any previously visited states, potentially branching into alternative execution paths within the state space.
Unfortunately, debugging programs that involve input/output operations using existing multiverse debuggers can reveal inaccessible program states, i.e. states which are not encountered during regular execution. This can significantly hinder the debugging process, as the programmer may spend substantial time exploring and examining inaccessible program states, or worse, may mistakenly assume a bug is present in the code, when in fact, the issue is caused by the debugger.
This paper presents a novel approach to multiverse debugging, which can accommodate a broad spectrum of input/output operations. We provide the semantics of our approach and prove the correctness of our debugger, ensuring that, despite having support for a wide range of input/output operations, the debugger will only explore those program states which can be reached during regular execution.
We have developed a prototype, called MIO, leveraging the WARDuino WebAssembly virtual machine to demonstrate the feasibility and efficiency of our techniques. As a demonstration of the approach, we highlight a color dial built with a Lego Mindstorms motor and color sensor, providing a tangible example of how our approach enables multiverse debugging for programs running on an STM32 microcontroller.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Cost of Soundness in Mixed-Precision Tuning
Anastasia Isychev and Debasmita Lohar
(TU Wien, Austria; KIT, Germany)
Numerical code is often executed repetitively and on hardware with limited resources, which makes it a perfect target for optimizations. One of the most effective ways to boost performance—especially in terms of runtime—is by reducing the precision of computations. However, low precision can introduce significant rounding errors, potentially compromising the correctness of results. Mixed-precision tuning addresses this trade-off by assigning the lowest possible precision to a subset of variables and arithmetic operations in the program while ensuring that the overall error remains within acceptable bounds. State-of-the-art tools validate the accuracy of optimized programs using either sound static analysis or dynamic sampling. Sound methods are often considered safer but overly conservative, while dynamic methods are more aggressive and potentially more effective; the question remains: how do these approaches compare in practice? In this paper, we present the first comprehensive evaluation of existing mixed-precision tuning tools for floating-point programs, offering a quantitative comparison between sound static and (unsound) dynamic approaches. We measure the trade-offs between performance gains, utilization of the optimization potential, and soundness guarantees on accuracy – what we refer to as the cost of soundness. Our experiments on the standard FPBench benchmark suite challenge the common belief that dynamic optimizers consistently generate faster programs. In fact, for small straight-line numerical programs, we find that sound tools enhanced with regime inference match or outperform dynamic ones, while providing formal correctness guarantees, albeit at the cost of increased optimization time. Standalone sound tools, however, are overly conservative, especially when accuracy constraints are tight; whereas dynamic tools are consistently effective for different targets, but exceed the maximum allowed error by up to 9 orders of magnitude.

Publisher's Version Published Artifact Artifacts Available
Encode the ∀∃ Relational Hoare Logic into Standard Hoare Logic
Shushu Wu, Xiwei Wu, and Qinxiang Cao
(Shanghai Jiao Tong University, China)
Verifying a real-world program’s functional correctness can be decomposed into (1) a refinement proof showing that the program implements a more abstract high-level program and (2) an algorithm correctness proof at the high level. Relational Hoare logic serves as a powerful tool to establish refinement but often necessitates formalization beyond standard Hoare logic. Particularly in the nondeterministic setting, the ∀∃ relational Hoare logic is required. Existing approaches encode this logic into a Hoare logic with ghost states and invariants, yet these extensions significantly increase formalization complexity and soundness proof overhead. This paper proposes a generic encoding theory that reduces the ∀∃ relational Hoare logic to standard (unary) Hoare logic. Precisely, we propose to redefine the validity of relational Hoare triples while preserving the original proof rules and then encapsulate the ∀∃ pattern within assertions. We have proved that the validity of encoded standard Hoare triples is equivalent to the validity of the desired relational Hoare triples. Moreover, the encoding theory demonstrates how common relational Hoare logic proof rules are indeed special cases of standard Hoare logic proof rules, and relational proof steps correspond to standard proof steps. Our theory enables standard Hoare logic to prove ∀∃ relational properties by defining a predicate Exec, without requiring modifications to the logic framework or re-verification of soundness.
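For orientation, the standard shape of ∀∃ relational validity that such encodings target is shown below, in our own notation with a big-step evaluation relation ⇓; the paper's precise definitions, and its Exec predicate, may differ.

% Every terminating behavior of c_1 must be matched by SOME behavior of c_2:
\models \{P\}\; c_1 \sim c_2\; \{Q\}
  \;\iff\;
  \forall \sigma_1, \sigma_2.\; (\sigma_1, \sigma_2) \models P \implies
  \forall \sigma_1'.\; \langle c_1, \sigma_1 \rangle \Downarrow \sigma_1' \implies
  \exists \sigma_2'.\; \langle c_2, \sigma_2 \rangle \Downarrow \sigma_2'
     \,\wedge\, (\sigma_1', \sigma_2') \models Q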

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Work Packets: A New Abstraction for GC Software Engineering, Optimization, and Innovation
Wenyu Zhao, Stephen M. Blackburn, and Kathryn S. McKinley
(Australian National University, Australia; Google, Australia; Google, USA)
Garbage collection (GC) implementations must meet efficiency and maintainability requirements, which are often perceived to be at odds. Moreover, the desire for efficiency typically sacrifices agility, undermining rapid development and innovation, with unintended consequences on longer-term performance aspirations. Prior GC implementations struggle to: i) maximize efficiency, parallelism, and hardware utilization, while ii) correctly and elegantly implementing optimizations and scheduling constraints. This struggle is reflected in today’s implementations, which tend to be monolithic and depend on coarse phase-based synchronization.
This paper presents a new design for GC implementations that emphasizes both agility and efficiency. The design simplifies and unifies all GC tasks into work packets, which define: i) work items, ii) kernels that process them, and iii) scheduling constraints. Our simple insights are that execution is dominated by a few very small, heavily executed kernels, and that GC implementations are high-level algorithms that orchestrate vast numbers of performance-critical work items. Work packets comprise groups of like work items, such as the scanning of a thread’s stack or the tracing of a single object in a multi-million object heap. The kernel attached to a packet specifies how to process items within the packet, such as how to scan a stack, or how to trace an object. The scheduling constraints express dependencies, e.g., all mutators must stop before copying any objects. Fully parallel activities, such as scanning roots and performing a transitive closure, proceed with little synchronization. The implementation of a GC algorithm reduces to declaring required work packets, their kernels, and dependencies. The execution model operates transparently across GC algorithms and work packet types. We broaden the scope of work-stealing, applying it to any type of GC work, and introduce a novel two-tier work-stealing algorithm to further optimize parallelism at fine granularity.
We show the software engineering benefits of this design via eight collectors that use 23 common work packet types in the MMTk GC framework. We use the LXR collector to show that the work packet abstraction supports innovation and high performance: i) comparing versions of LXR, work packets deliver performance benefits over a phase-based approach, and ii) LXR with work packets outperforms the highly tuned, state-of-the-art G1 garbage collector in the latest OpenJDK (24). We thus demonstrate that work packets achieve high performance, while simplifying GC implementation, making them inherently easier to optimize and verify.
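A toy sketch of the work-packet abstraction under our own assumptions (it is not MMTk code; the class shape and the naive sequential scheduler are illustrative, whereas a real collector would run ready packets on a worker pool with two-tier work stealing):

from collections import deque

class WorkPacket:
    # A batch of like work items, the kernel that processes one item,
    # and the names of packets that must complete first.
    def __init__(self, name, items, kernel, deps=()):
        self.name, self.items, self.kernel, self.deps = name, list(items), kernel, set(deps)
        self.done = False

def run(packets):
    # Naive scheduler honoring dependencies between packets.
    ready = deque(p for p in packets if not p.deps)
    finished = set()
    while ready:
        p = ready.popleft()
        for item in p.items:
            p.kernel(item)                 # e.g. scan a stack, trace an object
        p.done = True
        finished.add(p.name)
        for q in packets:
            if not q.done and q not in ready and q.deps <= finished:
                ready.append(q)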

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Abstract Interpretation of Temporal Safety Effects of Higher Order Programs
Mihai Nicola, Chaitanya Agarwal, Eric Koskinen, and Thomas Wies
(Stevens Institute of Technology, USA; New York University, USA)
This paper describes a new abstract interpretation-based approach to verify temporal safety properties of recursive, higher-order programs. While prior works have provided theoretical impact and some automation, they have had limited scalability. We begin with a new automata-based "abstract effect domain" for summarizing context-sensitive dependent effects, capable of abstracting relations between the program environment and the automaton control state. Our analysis includes a new transformer for abstracting event prefixes to automatically computed context-sensitive effect summaries, and is instantiated in a type-and-effect system grounded in abstract interpretation. Since the analysis is parametric on the automaton, we next instantiate it to a broader class of history/register (or "accumulator") automata, going beyond finite state automata to express some context-free properties, input-dependency, event summation, resource usage, cost, equal event magnitude, etc. We implemented a prototype evDrift that computes dependent effect summaries (and validates assertions) for OCaml-like recursive higher-order programs. As a basis of comparison, we describe reductions to assertion checking for higher-order but effect-free programs, and demonstrate that our approach outperforms prior tools Drift, RCaml/Spacer, MoCHi, and ReTHFL. Overall, across a set of 23 benchmarks, Drift verified 12 benchmarks, RCaml/Spacer verified 6, MoCHi verified 11, ReTHFL verified 18, and evDrift verified 21; evDrift also achieved a 6.3x, 5.3x, 16.8x, and 6.4x speedup over Drift, RCaml/Spacer, MoCHi, and ReTHFL, respectively, on the benchmarks that both evDrift and the respective tool could solve.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ApkDiffer: Accurate and Scalable Cross-Version Diffing Analysis for Android Applications
Jiarun Dai, Mingyuan Luo, Yuan Zhang, Min Yang, and Minghui Yang
(Fudan University, China; OPPO, China)
Software diffing (a.k.a. code alignment) is a fundamental technique to differentiate similar and dissimilar code pieces between two given software products. It can enable various kinds of critical security analysis, e.g., n-day bug localization, software plagiarism detection, etc. To date, many diffing tools have been proposed dedicated to aligning binaries. However, few research efforts have elaborated on cross-version Android app diffing, largely hindering the security assessment of apps in the wild. In short, existing diffing works usually adopt scalability-oriented alignment algorithms and suffer from significant alignment errors when handling the large codebases of modern apps.
To fill this gap, we propose ApkDiffer, a method-level (i.e., function-level) diffing tool dedicated to aligning versions of the same closed-source Android app. ApkDiffer achieves a good balance between scalability and effectiveness, by featuring a two-stage decomposition-based alignment solution. It first decomposes the codebase of each app version, respectively, into multiple functionality units; then tries to precisely align methods that serve equivalent app functionalities across versions. In evaluation, the results show that ApkDiffer noticeably outperforms existing alignment algorithms in precision and recall, while still having a satisfactory time cost. In addition, we used ApkDiffer to track the one-year evolution of 100 popular Google Play apps. By pinpointing the detailed code locations where app versions deviate in privacy collection, we convincingly revealed that app updates may pose ever-evolving privacy threats to end-users.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Software Model Checking via Summary-Guided Search
Ruijie Fang, Zachary Kincaid, and Thomas Reps
(University of Texas at Austin, USA; Princeton University, USA; University of Wisconsin, USA)
In this work, we describe a new software model-checking algorithm called GPS. GPS treats the task of model checking a program as a directed search of the program states, guided by a compositional, summary-based static analysis. The summaries produced by static analysis are used both to prune away infeasible paths and to drive test generation to reach new, unexplored program states. GPS can find both proofs of safety and counter-examples to safety (i.e., inputs that trigger bugs), and features a novel two-layered search strategy that renders it particularly efficient at finding bugs in programs featuring long, input-dependent error paths. To make GPS refutationally complete (in the sense that, given enough time, it will find an error if one exists), we introduce an instrumentation technique and show that it helps GPS achieve refutation-completeness without sacrificing overall performance. We benchmarked GPS on a diverse suite of benchmarks including programs from the Software Verification Competition (SV-COMP) and from prior literature, as well as synthetic programs based on examples in this paper. We found that our implementation of GPS outperforms state-of-the-art software model checkers (including the top performers in the SV-COMP ReachSafety-Loops category), both in terms of the number of benchmarks solved and in terms of running time.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Opportunistically Parallel Lambda Calculus
Stephen Mell, Konstantinos Kallas, Steve Zdancewic, and Osbert Bastani
(University of Pennsylvania, USA; University of California at Los Angeles, USA)
Scripting languages are widely used to compose external calls such as native libraries and network services. In such scripts, execution time is often dominated by waiting for these external calls, rendering traditional single-language optimizations ineffective. To address this, we propose a novel opportunistic evaluation strategy for scripting languages based on a core lambda calculus that automatically dispatches independent external calls in parallel and streams their results. We prove that our approach is confluent, ensuring that it preserves the programmer’s original intent, and that it eventually executes every external call. We implement this approach in a scripting language called Opal. We demonstrate the versatility and performance of Opal, focusing on programs that invoke heavy external computation through the use of large language models (LLMs) and other APIs. Across five scripts, we compare to several state-of-the-art baselines and show that opportunistic evaluation improves total running time (up to 6.2×) and latency (up to 12.7×) compared to standard sequential Python, while performing very close to hand-optimized asynchronous Rust (between 1.3% and 18.5% running-time overhead). For Tree-of-Thoughts, a prominent LLM reasoning approach, we achieve a 6.2× performance improvement over the authors’ own implementation.
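The gist of the evaluation strategy can be seen in a small, hypothetical Python sketch (not Opal code): the two external calls below are independent, so an opportunistic evaluator may dispatch them concurrently instead of waiting for each in turn. Here the parallelism is written by hand with asyncio purely to illustrate the effect the paper describes Opal as obtaining automatically.

```python
import asyncio, time

async def external_call(name, seconds):
    # Stand-in for a slow external service (e.g., an LLM or web API).
    await asyncio.sleep(seconds)
    return f"{name}-result"

async def sequential():
    # Each call waits for the previous one to finish.
    a = await external_call("a", 1.0)
    b = await external_call("b", 1.0)
    return [a, b]

async def opportunistic():
    # The calls are independent, so they can be dispatched together.
    return await asyncio.gather(external_call("a", 1.0),
                                external_call("b", 1.0))

for run in (sequential, opportunistic):
    start = time.perf_counter()
    asyncio.run(run())
    print(run.__name__, f"{time.perf_counter() - start:.1f}s")  # ~2.0s vs ~1.0s
```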

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
A Lightweight Type-and-Effect System for Invalidation Safety: Tracking Permanent and Temporary Invalidation with Constraint-Based Subtype Inference
Cunyuan Gao and Lionel Parreaux
(Hong Kong University of Science and Technology, China)
In many programming paradigms, some program entities are only valid within delimited regions of the program, such as resources that might be automatically deallocated at the end of specific scopes. Outside their live scopes, the corresponding entities are no longer valid – they are permanently invalidated. Sometimes, even within the live scope of a resource, the use of that resource must become temporarily invalid, such as when iterating over a mutable collection, as mutating the collection during iteration might lead to undefined behavior. However, high-level general-purpose programming languages rarely allow this information to be reflected on the type level. Most previously proposed solutions to this problem have relied on restricting either the aliasing or the capture of variables, which can reduce the expressiveness of the language. In this paper, we propose a higher-rank polymorphic type-and-effect system to statically track the permanent and temporary invalidation of sensitive values and resources, without any aliasing or capture restrictions. We use Boolean-algebraic types (unions, intersections, and negations) to precisely model the side effects of program terms and guarantee they are invalidation-safe. Moreover, we present a complete and practical type inference algorithm, whereby programmers only need to annotate the types of higher-rank and polymorphically-recursive functions. Our system, nicknamed InvalML, has a wide range of applications where tracking invalidation is needed, including stack-based and region-based memory management, iterator invalidation, data-race-free concurrency, mutable state encapsulation, type-safe exception and effect handlers, and even scope-safe metaprogramming.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
P³: Reasoning about Patches via Product Programs
Arindam Sharma, Daniel Schemmel, and Cristian Cadar
(Imperial College London, UK)
Software systems change on a continuous basis, with each patch prone to introducing new errors and security vulnerabilities. While providing a full functional specification for the program is a notoriously difficult task, writing a patch specification that describes the behaviour of the patched version in terms of the unpatched one (e.g., “the post-patch version is a refactoring of the pre-patch one”) is often easy. To reason about such specifications, program analysers have to concomitantly analyse the pre- and post-patch software versions.
In this paper, we propose P3, a framework for automated reasoning about patches via product programs. While product programs have been used before, particularly in a security context, P3 is the first framework that automatically constructs product programs for a real-world language (namely C), supports diverse and complex patches found in real software, and provides runtime support enabling techniques as varied as greybox fuzzing and symbolic execution to run unmodified. Our experimental evaluation on a set of complex software patches from the challenging CoREBench suite shows that P3 can successfully handle intricate code, inter-operate with the widely-used analysers AFL++ and KLEE, and enable reasoning over patch specifications.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
The Continuous Tensor Abstraction: Where Indices Are Real
Jaeyeon Won, Willow Ahrens, Teodoro Fields Collin, Joel S. Emer, and Saman Amarasinghe
(Massachusetts Institute of Technology, USA; Georgia Institute of Technology, USA; NVIDIA, USA)
This paper introduces the continuous tensor abstraction, allowing indices to take real-number values (e.g., A[3.14]). It also presents continuous tensor algebra expressions, such as C[x, y] = A[x, y] · B[x, y], where indices are defined over a continuous domain. This work expands the traditional tensor model to include continuous tensors. Our implementation supports piecewise-constant tensors, on which infinite domains can be processed in finite time. We also introduce a new tensor format for efficient storage and a code generation technique for automatic kernel generation. For the first time, our abstraction expresses domains like computational geometry and computer graphics in the language of tensor programming. Our approach demonstrates competitive or better performance compared to hand-optimized kernels in leading libraries across diverse applications. Compared to hand-implemented libraries on a CPU, our compiler-based implementation achieves an average speedup of 9.20× on 2D radius search with ∼60× fewer lines of code (LoC), 1.22× on genomic interval overlapping queries (with ∼18× LoC saving), and 1.69× on trilinear interpolation in Neural Radiance Fields (with ∼6× LoC saving).
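As a rough illustration of how a piecewise-constant representation makes an infinite index domain tractable, the following minimal Python sketch (hypothetical, unrelated to the paper's implementation or tensor format) stores a 1D "continuous tensor" as breakpoints plus per-interval values; pointwise multiplication then only needs to visit the finitely many merged breakpoints.

```python
import bisect

class PiecewiseConstant:
    """A 1D piecewise-constant 'continuous tensor': A[x] for real x.

    Represented by sorted breakpoints b_0 < b_1 < ... and the constant value
    held on each half-open interval [b_i, b_{i+1}); before b_0 it is `default`.
    """
    def __init__(self, breakpoints, values, default=0.0):
        assert len(breakpoints) == len(values)
        self.breakpoints, self.values, self.default = list(breakpoints), list(values), default

    def __getitem__(self, x):
        # Find the last breakpoint <= x; the tensor is constant on that interval.
        i = bisect.bisect_right(self.breakpoints, x) - 1
        return self.default if i < 0 else self.values[i]

    def __mul__(self, other):
        # Pointwise product over an infinite domain, computed in finite time
        # by merging the two finite breakpoint sets.
        pts = sorted(set(self.breakpoints) | set(other.breakpoints))
        vals = [self[p] * other[p] for p in pts]
        return PiecewiseConstant(pts, vals, self.default * other.default)

A = PiecewiseConstant([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
B = PiecewiseConstant([0.5, 1.5], [10.0, 20.0])
C = A * B                    # C[x] = A[x] * B[x] for every real x
print(A[3.14], C[1.25])      # 3.0 and 2.0 * 10.0 = 20.0
```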

Publisher's Version
Translation Validation for LLVM’s AArch64 Backend
Ryan Berger, Mitch Briles, Nader Boushehrinejad Moradi, Nicholas Coughlin, Kait Lam, Nuno P. Lopes, Stefan Mada, Tanmay Tirpankar, and John Regehr
(NVIDIA, USA; University of Utah, USA; Defence Science and Technology Group, Australia; University of Queensland, Australia; INESC-ID, Portugal; Instituto Superior Técnico - University of Lisbon, Portugal)
LLVM's backends translate its intermediate representation (IR) to assembly or object code. Alongside register allocation and instruction selection, these backends contain many analogues of components traditionally associated with compiler middle ends: dataflow analyses, common subexpression elimination, loop invariant code motion, and a first-class IR -- MIR, the "machine IR." In effect, this kind of compiler backend is a highly optimizing compiler in its own right, with all of the correctness hazards entailed by a million lines of intricate C++. As a step towards gaining confidence in the correctness of work done by LLVM backends, we have created arm-tv, which formally verifies translations between LLVM IR and AArch64 (64-bit ARM) code. Ours is not the first translation validation work for LLVM, but we have advanced the state of the art along multiple fronts: arm-tv is a checking validator that enforces numerous ABI rules; we have extended Alive2 (which we reuse as a verification backend) to deal with unstructured mixes of pointers and integers that are typical of assembly code; we investigate the tradeoffs between hand-written AArch64 semantics and those derived mechanically from ARM's published formal semantics; and, we have used arm-tv to discover 45 previously unknown miscompilation bugs in this LLVM backend, most of which are now fixed in upstream LLVM.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Certified Decision Procedures for Width-Independent Bitvector Predicates
Siddharth Bhat, Léo Stefanesco, Chris Hughes, and Tobias Grosser
(University of Cambridge, UK; Independent Researcher, UK)
Bitvectors are foundational for automated reasoning. A few interactive theorem provers (ITP), such as Lean, have strong support for deciding fixed-width bitvector predicates by means of bitblasting. However, even these ITPs provide little automation for width-independent bitvector predicates. To fill this gap, we contribute novel, mechanized decision procedures for width-independent bitvector predicates in Lean. Classical algorithms to decide fragments of width-independent bitvector theory can be viewed through the lens of model checking, where the formula corresponds to an automaton and the correctness of the formula is a safety property. However, we cannot currently use this lens in mechanized proofs, as there are no executable, fast, and formally verified model checking algorithms that can be used interactively from within ITPs. To fill this gap, we mechanize key algorithms in the model checking literature: k-induction, automata reachability, automata emptiness checking, and automata minimization. Using these mechanized algorithms, we contribute scalable, mechanized decision procedures for width-independent bitvector predicates. Furthermore, for controlled fragments of mixtures of arithmetic and bitwise operations that occur in the deobfuscation literature, we mechanize a recent fast algorithm (MBA-Blast), which outperforms the more general procedures on this fragment. Finally, we evaluate our decision procedures on benchmarks from classical compiler problems such as Hacker’s Delight and the LLVM peephole optimizer, as well as on equivalence checking problems for program obfuscation. Our tools solve 100% of Hacker’s Delight, two of our tools solve 100% of the deobfuscation dataset, and up to 27% of peephole rewrites extracted from LLVM’s peephole rewriting test suite. Our new decision procedures provide a push-button experience for width-independent bitvector reasoning in interactive theorem provers, and, more broadly, pave the way for foundational algorithms for fast, formally verified model checking.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
CoSSJIT: Combining Static Analysis and Speculation in JIT Compilers
Aditya Anand, Vijay Sundaresan, Daryl Maier, and Manas Thakur
(IIT Bombay, India; IBM, Canada)
Just-in-time (JIT) compilers typically sacrifice the precision of program analysis for efficiency, but are capable of performing sophisticated speculative optimizations based on run-time profiles to generate code that is specialized to a given execution. In contrast, ahead-of-time static compilers can often afford precise flow-sensitive interprocedural analysis, but produce conservative results in scenarios where higher precision could be derived from run-time specialization. In this paper, we propose a first-of-its-kind approach to enrich static analysis with the possibility of speculative optimization during JIT compilation, as well as its usage to perform aggressive stack allocation on a production Java Virtual Machine.
Our approach of combining static analysis with JIT speculation -- named CoSSJIT -- involves three key contributions. First, we identify the scenarios where a static analysis would make conservative assumptions but a JIT could deliver precision based on run-time speculation. Second, we present the notion of "speculative conditions" and plug them into a static interprocedural dataflow analyzer (whose aim is to identify heap objects that can be allocated on the stack), to generate partial results that can be specialized at run time. Finally, we extend a production JIT compiler to read and enrich static-analysis results with the resolved values of speculative conditions, leading to a practical approach that efficiently combines the best of both worlds. As the cherry on top, using CoSSJIT we obtain a 5.7x improvement in stack allocation (which translates into performance gains), while building on a system that ensures functional correctness during JIT compilation.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Mini-Batch Robustness Verification of Deep Neural Networks
Saar Tzour-Shaday and Dana Drachsler-Cohen
(Technion, Israel)
Neural network image classifiers are ubiquitous in many safety-critical applications. However, they are susceptible to adversarial attacks. To understand their robustness to attacks, many local robustness verifiers have been proposed to analyze є-balls of inputs. Yet, existing verifiers incur long analysis times or lose too much precision, making them less effective for large sets of inputs. In this work, we propose a new approach to local robustness: group local robustness verification. The key idea is to leverage the similarity of the network computations of certain є-balls to reduce the overall analysis time. We propose BaVerLy, a sound and complete verifier that boosts the local robustness verification of a set of є-balls by dynamically constructing and verifying mini-batches. BaVerLy adaptively identifies successful mini-batch sizes, accordingly constructs mini-batches of є-balls that have similar network computations, and verifies them jointly. If a mini-batch is verified, all its є-balls are proven robust. Otherwise, one є-ball is suspected of not being robust, which guides the refinement. BaVerLy leverages the analysis results to expedite the analysis of that є-ball as well as the analysis of the mini-batch with the other є-balls. We evaluate BaVerLy on fully connected and convolutional networks for MNIST and CIFAR-10. Results show that BaVerLy speeds up standard one-by-one verification by 2.3x on average and up to 4.1x, in which case it reduces the total analysis time from 24 hours to 6 hours.
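To convey the flavor of verifying a mini-batch of є-balls jointly, here is a small Python sketch based on plain interval (box) propagation, a sound but incomplete check that is not BaVerLy's actual analysis: the batch is covered by one bounding box, checked in a single pass, and only refined ball-by-ball if the joint pass fails. The network weights and inputs are made up for the example.

```python
import numpy as np

# A made-up 2-2-1 ReLU network used only for illustration.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]]); b1 = np.array([0.1, -0.2])
W2 = np.array([[2.0, -1.0]]);             b2 = np.array([0.0])

def bounds(lo, hi, W, b):
    # Sound interval bounds for W @ x + b when lo <= x <= hi.
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def verify_box(lo, hi, threshold=0.0):
    lo1, hi1 = bounds(lo, hi, W1, b1)
    lo1, hi1 = np.maximum(lo1, 0.0), np.maximum(hi1, 0.0)   # ReLU
    lo2, _ = bounds(lo1, hi1, W2, b2)
    return lo2[0] > threshold        # output provably above the threshold?

centers = np.array([[0.9, 0.1], [1.0, 0.0], [1.1, -0.1]]); eps = 0.05
lo, hi = centers.min(axis=0) - eps, centers.max(axis=0) + eps
if verify_box(lo, hi):               # one pass covers the whole mini-batch
    print("mini-batch verified jointly")
else:                                # refinement: fall back to single balls
    for c in centers:
        print(c, verify_box(c - eps, c + eps))
```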

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Compositional Symbolic Execution for the Next 700 Memory Models
Andreas Lööw, Seung Hoon Park, Daniele Nantes-Sobrinho, Sacha-Élie Ayoun, Opale Sjöstedt, and Philippa Gardner
(Imperial College London, UK)
Multiple successful compositional symbolic execution (CSE) tools and platforms exploit separation logic (SL) for compositional verification and/or incorrectness separation logic (ISL) for compositional bug-finding, including VeriFast, Viper, Gillian, CN, and Infer-Pulse. Previous work on the Gillian platform, the only CSE platform that is parametric on the memory model, meaning that it can be instantiated to different memory models, suggests that the ability to use custom memory models allows for more flexibility in supporting analysis of a wide range of programming languages, for implementing custom automation, and for improving performance. However, the literature lacks a satisfactory formal foundation for memory-model-parametric CSE platforms.
In this paper, inspired by Gillian, we provide a new formal foundation for memory-model-parametric CSE platforms. Our foundation advances the state of the art in four ways. First, we mechanise our foundation (in the interactive theorem prover Rocq). Second, we validate our foundation by instantiating it to a broad range of memory models, including models for C and CHERI. Third, whereas previous memory-model-parametric work has only covered SL analyses, we cover both SL and ISL analyses. Fourth, our foundation is based on standard definitions of SL and ISL (including definitions of function specification validity, to ensure sound interoperation with other tools and platforms also based on standard definitions).

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing
Qiong Feng, Xiaotian Ma, Ziyuan Feng, Marat Akhin, Wei Song, and Peng Liang
(Nanjing University of Science and Technology, China; JetBrains, Netherlands; Wuhan University, China)
Compilers play a central role in translating high-level code into executable programs, making their correctness essential for ensuring code safety and reliability. While extensive research has focused on verifying the correctness of compilers for single-language compilation, the correctness of cross-language compilation — which involves the interaction between two languages and their respective compilers — remains largely unexplored. To fill this research gap, we propose CrossLangFuzzer, a novel framework that introduces a universal intermediate representation (IR) for JVM-based languages and automatically generates cross-language test programs with diverse type parameters and complex inheritance structures. After generating the initial IR, CrossLangFuzzer applies three mutation techniques — LangShuffler, FunctionRemoval, and TypeChanger — to enhance program diversity. By evaluating both the original and mutated programs across multiple compiler versions, CrossLangFuzzer successfully uncovered 10 confirmed bugs in the Kotlin compiler, 4 confirmed bugs in the Groovy compiler, 7 confirmed bugs in the Scala 3 compiler, 2 confirmed bugs in the Scala 2 compiler, and 1 confirmed bug in the Java compiler. Among all mutators, TypeChanger is the most effective, detecting 11 of the 24 compiler bugs. Furthermore, we analyze the symptoms and root causes of cross-compilation bugs, examining the respective responsibilities of language compilers when incorrect behavior occurs during cross-language compilation. To the best of our knowledge, this is the first work specifically focused on identifying and diagnosing compiler bugs in cross-language compilation scenarios. Our research helps to understand these challenges and contributes to improving compiler correctness in multi-language environments.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Scalable Equivalence Checking and Verification of Shallow Quantum Circuits
Nengkun Yu, Xuan Du Trinh, and Thomas Reps
(Stony Brook University, USA; University of Wisconsin, USA)
This paper concerns the problem of checking whether two shallow (i.e., constant-depth) quantum circuits perform equivalent computations. Equivalence checking is a fundamental correctness question—needed, for example, to ensure that transformations applied to a quantum circuit do not alter its behavior. For quantum circuits, the problem is challenging because a straightforward representation on a classical computer of each circuit’s quantum state can require time and space that are exponential in the number of qubits n.
The paper presents Projection-Based Equivalence Checking (PBEC), which provides decision procedures for two variants of the equivalence-checking problem. Both can be carried out on a classical computer in time and space that, for any fixed depth, are linear in n. Our key insight is that local projections can serve as constraints that fully characterize the output state of a shallow quantum circuit. The output state is the unique quantum state that satisfies all the constraints. Beyond equivalence checking, we show how to use the constraint representation to check a class of assertions, both statically and at run time. Our assertion-checking methods are sound and complete for assertions expressed as conjunctions of local projections.
Our experiments showed that computing the constraint representation of a random 100-qubit 1D circuit of depth 6 takes 129.64 seconds. Equivalence checking between two random 100-qubit 1D circuits of depth 3 requires 4.46 seconds for the fixed input 0^⊗100, and no more than 31.96 seconds for arbitrary inputs. Computing the constraint representation for a random 100-qubit circuit of depth 3 takes 6.99 seconds for a 2D structure, compared to 10.67 seconds for a circuit with arbitrary connectivity. At depth 2, equivalence checking takes 0.20 seconds for fixed input and 0.44 seconds for arbitrary input, with similar performance for both 2D and arbitrary-connectivity circuits.

Publisher's Version
Extraction and Mutation at a High Level: Template-Based Fuzzing for JavaScript Engines
Wai Kin Wong, Dongwei Xiao, Cheuk Tung Lai, Yiteng Peng, Daoyuan Wu, and Shuai Wang
(Hong Kong University of Science and Technology, Hong Kong; VX Research, UK; Lingnan University, Hong Kong)
JavaScript (JS) engines implement complex language semantics and optimization strategies to support the dynamic nature of JS, making them difficult to test thoroughly and prone to subtle, security-critical bugs. Existing fuzzers often struggle to generate diverse and valid test cases. They either rely on syntax-level mutations that lack semantic awareness or perform limited, local mutations on concrete code, thus failing to explore deeper, more complex program behaviors. This paper presents TemuJs, a novel fuzzing framework that performs extraction and mutation at a high level, operating on abstract templates derived from real-world JS programs. These templates capture coarse-grained program structures with semantic placeholders, enabling semantics-aware mutations that preserve the high-level intent of the original code while diversifying its behavior. By decoupling mutation from concrete syntax and leveraging a structured intermediate representation for the templates, TemuJs explores a broader and more meaningful space of program behaviors. Evaluated on three major JS engines, namely, V8, SpiderMonkey, and JavaScriptCore, TemuJs discovers 44 bugs and achieves a 10.3% increase in edge coverage compared to state-of-the-art fuzzers on average. Our results demonstrate the efficacy of high-level, template-mutation fuzzing in testing JS engines.

Publisher's Version
Dynamic Wind for Effect Handlers
David Voigt, Philipp Schuster, and Jonathan Immanuel Brachthäuser
(University of Tübingen, Germany)
Effect handlers offer an attractive way of abstracting over effectful computation. Moreover, languages with effect handlers usually statically track effects, which ensures the user is aware of all side effects different parts of a program might have. Similarly to exception handlers, effect handlers discharge effects by locally defining their behavior. In contrast to exception handlers, they allow for resuming computation, possibly later and possibly multiple times. In this paper we present a design, formalization, and implementation for a variant of dynamic wind that integrates well with lexical effect handlers. It has well-defined semantics in the presence of arbitrary control effects in arbitrary places. Specifically, the behavior of capturing and resuming continuations in the pre- or postlude is well-defined and respects resource bracketing. We demonstrate how these features can be used to express backtracking of external state and finalization of external resources.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Detecting and Explaining (In-)equivalence of Context-Free Grammars
Marko Schmellenkamp, Thomas Zeume, Sven Argo, Sandra Kiefer, Cedric Siems, and Fynn Stebel
(Ruhr University Bochum, Germany; University of Oxford, UK)
We propose a scalable framework for deciding, proving, and explaining (in-)equivalence of context-free grammars. We present an implementation of the framework and evaluate it on large datasets collected within educational support systems. Even though the equivalence problem for context-free languages is undecidable in general, the framework is able to handle a large portion of these datasets. It introduces and combines techniques from several areas, such as an abstract grammar transformation language to identify equivalent grammars as well as sufficiently similar inequivalent grammars, theory-based comparison algorithms for a large class of context-free languages, and a graph-theory-inspired grammar canonization that allows isomorphic grammars to be identified efficiently.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Embedding Quantum Program Verification into Dafny
Feifei Cheng, Sushen Vangeepuram, Henry Allard, Seyed Mohammad Reza Jafari, Alex Potanin, and Liyi Li
(Iowa State University, USA; Australian National University, Australia)
Despite recent developments, quantum program verification is still in its early stages: many quantum programs are hard to verify due to their inherently probabilistic nature and the parallelism of quantum superposition. We propose QafnyC, a system that compiles quantum program verification into Dafny, a well-established classical program verifier, enabling the formal verification of quantum programs. The key insight behind QafnyC is the separation of quantum program verification from its execution, leveraging the strength of classical verifiers to ensure correctness before compiling certified quantum programs into executable circuits. Using QafnyC, we have successfully verified 37 diverse quantum programs by compiling their verification into Dafny. To the best of our knowledge, this is the most extensive formally verified set of quantum programs.

Publisher's Version Info
We’ve Got You Covered: Type-Guided Repair of Incomplete Input Generators
Patrick LaFontaine, Zhe Zhou, Ashish Mishra, Suresh Jagannathan, and Benjamin Delaware
(Purdue University, USA; IIT Hyderabad, India)
Property-based testing (PBT) is a popular technique for automatically testing semantic properties of a program, specified as a pair of pre- and post-conditions. The efficacy of this approach depends on being able to quickly generate inputs that meet the precondition, in order to maximize the set of program behaviors that are probed. For semantically rich preconditions, purely random generation is unlikely to produce many valid inputs; when this occurs, users are forced to manually write their own specialized input generators. One common problem with handwritten generators is that they may be incomplete, i.e., they are unable to generate some values meeting the target precondition. This paper presents a novel program repair technique that patches an incomplete generator so that its range includes every valid input. Our approach uses a novel enumerative synthesis algorithm that leverages the recently developed notion of coverage types to characterize the set of missing test values as well as the coverage provided by candidate repairs. We have implemented a repair tool for OCaml generators, called Cobb, and used it to repair a suite of benchmarks drawn from the PBT literature.
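The notion of an incomplete generator is easy to see in a small property-based-testing sketch; the example below is hypothetical Python (Cobb itself targets OCaml generators and coverage types): the handwritten generator never produces the empty list even though it satisfies the precondition, and the "repair" adds a branch that restores coverage.

```python
import random

def is_sorted(xs):
    # Target precondition for the property under test.
    return all(a <= b for a, b in zip(xs, xs[1:]))

def gen_sorted_incomplete():
    # Handwritten generator: always emits at least one element, so the
    # empty list (a valid, sorted input) is never covered.
    n = random.randint(1, 5)
    return sorted(random.randint(0, 10) for _ in range(n))

def gen_sorted_repaired():
    # Patched generator: a new branch restores coverage of the missing
    # value while keeping the original generation branch intact.
    if random.random() < 0.2:
        return []                      # previously unreachable output
    return gen_sorted_incomplete()

samples = [gen_sorted_repaired() for _ in range(1000)]
print("all satisfy precondition:", all(is_sorted(s) for s in samples))
print("covers empty list:", any(len(s) == 0 for s in samples))
```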

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Stencil-Lifting: Hierarchical Recursive Lifting System for Extracting Summary of Stencil Kernel in Legacy Codes
Mingyi Li, Junmin Xiao, Siyan Chen, Hui Ma, Xi Chen, Peihua Bao, Liang Yuan, and Guangming Tan
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
We introduce Stencil-Lifting, a novel system for automatically converting stencil kernels written in low-level languages in legacy code into semantically equivalent Domain-Specific Language (DSL) implementations. Targeting the efficiency bottlenecks of existing verified lifting systems, Stencil-Lifting achieves scalable stencil kernel abstraction through two key innovations. First, we propose a hierarchical recursive lifting theory that represents stencil kernels, structured as nested loops, using invariant subgraphs, which are customized data dependency graphs that capture loop-carried computation and structural invariants. Each vertex in the invariant subgraph is associated with a predicate-based summary, encoding its computational semantics. By enforcing self-consistency across these summaries, Stencil-Lifting ensures the derivation of correct loop invariants and postconditions for nested loops, eliminating the need for external verification. Second, we develop a hierarchical recursive lifting algorithm that guarantees termination through a convergent recursive process, avoiding the inefficiencies of search-based synthesis. The algorithm efficiently derives the valid summaries of stencil kernels, and its completeness is formally proven. We evaluate Stencil-Lifting on diverse stencil benchmarks from two different suites and on four real-world applications. Experimental results demonstrate that Stencil-Lifting achieves 31.62× and 5.8× speedups compared to the state-of-the-art verified lifting systems STNG and Dexter, respectively, while maintaining full semantic equivalence. Our work significantly enhances the translation efficiency of low-level stencil kernels to DSL implementations, effectively bridging the gap between legacy optimization techniques and modern DSL-based paradigms.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Products of Recursive Programs for Hypersafety Verification
Ruotong Cheng and Azadeh Farzan
(University of Toronto, Canada)
We study the problem of automated hypersafety verification of infinite-state recursive programs. We propose an infinite class of product programs, specifically designed with recursion in mind, that reduce the hypersafety verification of a recursive program to standard safety verification. For this, we combine insights from language theory and concurrency theory to propose an algorithmic solution for constructing an infinite class of recursive product programs. One key insight is that, using the simple theory of visibly pushdown languages, one can maintain the recursive structure of syntactic program alignments which is vital to constructing a new product program that can be viewed as a classic recursive program — that is, one that can be executed on a single stack. Another key insight is that techniques from concurrency theory can be generalized to help define product programs based on the view that the parallel composition of individual recursive programs includes all possible alignments from which a sound set of alignments that faithfully preserve the satisfaction of the hypersafety property can be selected. On the practical side, we formulate a family of parametric canonical product constructions that are intuitive to programmers and can be used as building blocks to specify recursive product programs for the purpose of relational and hypersafety verification, with the idea that the right product program can be verified automatically using existing techniques. We demonstrate the effectiveness of these techniques through an implementation and highly promising experimental results.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
DESIL: Detecting Silent Bugs in MLIR Compiler Infrastructure
Chenyao Suo, Jianrong Wang, Yongjia Wang, Jiajun Jiang, Qingchao Shen, and Junjie Chen
(Tianjin University, China)
MLIR (Multi-Level Intermediate Representation) compiler infrastructure provides an efficient framework for introducing a new abstraction level for programming languages and domain-specific languages. It has attracted widespread attention in recent years and has been applied in various domains, such as deep learning compiler construction. Recently, several MLIR compiler fuzzing techniques, such as MLIRSmith and MLIRod, have been proposed. However, none of them can detect silent bugs, i.e., bugs that incorrectly optimize code silently. The difficulty in detecting silent bugs arises from two main requirements: (1) UB-free program generation: generating programs that are free from undefined behavior, to satisfy the non-UB assumptions required by compiler optimizations; and (2) lowering support: converting the given MLIR program into an executable form via a suitable lowering path that reduces redundant lowering passes and improves the efficiency of fuzzing. To address these issues, we propose DESIL. DESIL enables silent-bug detection by defining a set of UB-elimination rules based on the MLIR documentation and applying them to input programs. To convert the dialects in the given MLIR program into executable form, DESIL designs a lowering-path optimization strategy. Furthermore, DESIL incorporates differential testing for silent-bug detection: it introduces an operation-aware optimization recommendation strategy into the compilation process to generate diverse executable files. We applied DESIL to the latest revisions of the MLIR compiler infrastructure. It detected 23 silent bugs and 19 crash bugs, of which 17 and 16, respectively, have been confirmed or fixed.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
HieraSynth: A Parallel Framework for Complete Super-Optimization with Hierarchical Space Decomposition
Sirui Lu and Rastislav Bodík
(University of Washington, USA; Google DeepMind, USA)
Modern optimizing compilers generate efficient code but rarely achieve theoretical optimality, often necessitating manual fine-tuning. This is especially the case for processors with vector instructions, which can grow the instruction set by an order of magnitude. Super-optimizers can synthesize optimal code, but they face a fundamental scalability constraint: as the size of the instruction set increases, the length of the longest synthesizable program decreases rapidly.
To help super-optimizers deal with large instruction sets, we introduce HieraSynth, a parallel framework for super-optimization that decomposes the problem by hierarchically partitioning the space of candidate programs, effectively decreasing the instruction set size. It also prunes search branches when the solver proves unrealizability, and explores independent subspaces in parallel, achieving near-linear speedup. HieraSynth is sufficiently efficient to run to completeness even on many hard problems, which means that it exhaustively explores the program space. This ensures that the synthesized program is optimal according to a cost model.
We implement HieraSynth as a library and demonstrate its effectiveness with a RISC-V Vector super-optimizer capable of handling instruction sets with up to 700 instructions while synthesizing 7–8-instruction programs. This is a significant advancement over previous approaches that were limited to 1–3 instructions with similar instruction set sizes. Specifically, HieraSynth can handle instruction sets up to 10.66× larger for a given program size, or synthesize up to 4.75× larger programs for a fixed instruction set. Evaluations show that HieraSynth can synthesize code surpassing human-expert optimizations and significantly reduce synthesis time, making super-optimization more practical for modern vector architectures.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Large Language Model Powered Symbolic Execution
Yihe Li, Ruijie Meng, and Gregory J. Duck
(National University of Singapore, Singapore)
Large Language Models (LLMs) have emerged as a promising alternative to traditional static program analysis methods, such as symbolic execution, offering the ability to reason over code directly without relying on theorem provers or SMT solvers. However, LLMs are also inherently approximate by nature, and therefore face significant challenges in relation to the accuracy and scale of analysis in real-world applications. Such issues often necessitate the use of larger LLMs with higher token limits, but this requires enterprise-grade hardware (GPUs) and thus limits accessibility for many users. In this paper, we propose LLM-based symbolic execution—a novel approach that enhances LLM inference via a path-based decomposition of the program analysis tasks into smaller (more tractable) subtasks. The core idea is to generalize path constraints using a generic code-based representation that the LLM can directly reason over, without translation into another (less-expressive) formal language. We implement our approach in the form of AutoBug, an LLM-based symbolic execution engine that is lightweight and language-agnostic, making it a practical tool for analyzing code that is challenging for traditional approaches. We show that AutoBug can improve both the accuracy and scale of LLM-based program analysis, especially for smaller LLMs that can run on consumer-grade hardware.
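A deliberately simplified, hypothetical sketch of the path-based decomposition idea (not AutoBug's engine): each feasible branch combination of a tiny program becomes its own small subtask whose path condition is kept as ordinary code text, the kind of representation an LLM (or a solver) could be asked to reason over directly.

```python
from itertools import product

# Toy program under analysis (kept as text the "analyzer" reasons about):
#     y = x - 10 if x > 10 else 0
#     assert y != 32        # the bug is reached exactly when y == 32
guards = ["x > 10"]         # branch conditions appearing in the program
target = "y == 32"          # the error condition we want to reach

def path_subtasks():
    # One subtask per combination of branch outcomes; each keeps its
    # constraints as plain Python code rather than SMT formulas.
    for taken in product([True, False], repeat=len(guards)):
        conds = [g if t else f"not ({g})" for g, t in zip(guards, taken)]
        effect = "y = x - 10" if taken[0] else "y = 0"
        yield {"path": conds, "effect": effect, "query": target}

for task in path_subtasks():
    # Each record is a self-contained question, e.g. "given {effect},
    # can {path} and {query} hold together? If so, give x."
    print(task)
# A concrete witness for the first path: x = 42 gives y = 32.
```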

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Proof Repair across Quotient Type Equivalences
Cosmo Viola, Max Fan, and Talia Ringer
(University of Illinois at Urbana-Champaign, USA; Cornell University, USA)
Proofs in proof assistants like Rocq can be brittle, breaking easily in response to changes. To address this, recent work introduced an algorithm and tool in Rocq to automatically repair broken proofs in response to changes that correspond to type equivalences. However, many changes remained out of the scope of this algorithm and tool—especially changes in underlying behavior. We extend this proof repair algorithm so that it can express certain changes in behavior that were previously out of scope. We focus in particular on equivalences between quotient types—types equipped with a relation that describes what it means for any two elements of that type to be equal. Quotient type equivalences can be used to express interesting changes in representations of mathematical structures, as well as changes in the implementations of data structures.
We extend this algorithm and tool to support quotient type equivalences in Rocq. Notably, since Rocq lacks quotient types entirely, our extensions use Rocq’s setoid machinery in place of quotients. Specifically, (1) our extension to the algorithm supports new changes corresponding to setoids, and (2) our extension to the tool supports this new class of changes and further automates away some of the new proof obligations. We demonstrate our extensions on proof repair case studies for previously unsupported changes. We also perform manual proof repair in Cubical Agda, a language with a univalent metatheory, which allows us to construct the first ever internal proofs of correctness for proof repair.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
The Power of Regular Constraint Propagation
Matthew Hague, Artur Jeż, Anthony Widjaja Lin, Oliver Markgraf, and Philipp Rümmer
(Royal Holloway University of London, UK; University of Wrocław, Poland; RPTU Kaiserslautern-Landau, Germany; MPI-SWS, Germany; University of Regensburg, Germany; Uppsala University, Sweden)
The past decade has witnessed substantial developments in string solving. Motivated by the complexity of string solving strategies adopted in existing string solvers, we investigate a simple and generic method for solving string constraints: regular constraint propagation (RCP). The method repeatedly computes pre- or post-images of regular languages under the string functions present in a string formula, inferring more and more knowledge about the possible values of string variables, until either a conflict is found or satisfiability of the string formula can be concluded. Such a propagation strategy is applicable to string constraints with multiple operations like concatenation, replace, and almost all flavors of string transductions. We demonstrate the generality and effectiveness of this method theoretically and experimentally. On the theoretical side, we show that RCP is sound and complete for a large fragment of string constraints, subsuming both straight-line and chain-free constraints, two of the most expressive decidable fragments for which some modern string solvers provide formal completeness guarantees. On the practical side, we implement regular constraint propagation within the open-source string solver OSTRICH. Our experimental evaluation shows that this addition significantly improves OSTRICH’s performance and makes it competitive with existing solvers. In fact, it substantially outperforms other solvers on random PCP and bioinformatics benchmarks. The results also suggest that incorporating regular constraint propagation alongside other techniques could lead to substantial performance gains for existing solvers.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
On Abstraction Refinement for Bayesian Program Analysis
Yuanfeng Shi, Yifan Zhang, and Xin Zhang
(Peking University, China)
Bayesian program analysis is a systematic approach to learning from external information for better accuracy, by converting logical deduction in conventional program analysis into Bayesian inference. A key challenge in Bayesian program analysis is how to select program abstractions to effectively generalize from external information. A recent approach addresses this challenge by learning a selection policy from training programs, but it may perform sub-optimally on new programs because of its reliance on learning, especially when the training set is not well chosen. To address this problem, we propose an approach that is inspired by the framework of counterexample-guided refinement to search for an abstraction on the fly. Our key innovation is to apply the theory of conditional independence to refine the abstraction so that incorrect generalizations can be removed. To demonstrate the effectiveness of our approach, we have instantiated it on a Bayesian thread-escape analysis and a Bayesian datarace analysis and shown that it significantly improves the performance of the analyses.

Publisher's Version Published Artifact Artifacts Available
Interactive Bitvector Reasoning using Verified Bit-Blasting
Henrik Böving, Siddharth Bhat, Luisa Cicolini, Alex Keizer, Léon Frenot, Abdalrhman Mohamed, Léo Stefanesco, Harun Khan, Joshua Clune, Clark Barrett, and Tobias Grosser
(Lean FRO, Germany; University of Cambridge, UK; ENS Lyon, France; Stanford University, USA; Carnegie Mellon University, USA)
Bit-blasting SMT solvers enable efficient automatic reasoning about bitvectors, which are fundamental for the verification of compiler backends, cryptographic algorithms, hardware designs, and other software or hardware tasks. Despite the clear demand for efficient bitvector reasoning infrastructure and the impressive advancements in state-of-the-art bit-blasting SMT solvers such as Bitwuzla, effective bitvector reasoning within interactive theorem provers (ITPs) remains a challenge, hindering their use for mechanized proofs. Incomplete bitvector libraries, unavailable or only partially integrated decision procedures, complex and hard-to-bitblast operations, and limited integration with the host language prevent the wide adoption of bitvector reasoning in proving contexts. We introduce bv_decide: the first end-to-end verified bit-blaster designed for interactive bitvector reasoning in a dependently-typed ITP. Our verified bit-blaster is scalable, comes with a complete end-to-end proof (trusting only the Lean compiler and kernel), and is available as a proof tactic that allows interactive reasoning right from within a programming language, in our case Lean. We use Lean’s Functional But In-Place (FBIP) paradigm to efficiently encode our core data structures (e.g., AIGs), demonstrating that fast execution of an SMT solver need not come at the expense of rigorous formalization. We enable dependable interactive verification of user-written code by basing Lean’s C-style standard datatypes UInt/SInt on our bitvector type, adding a lowering from enums and structs to bitvectors to enable transparent bit-blasting support for composed types, and by offering an interactive tactic that either solves a goal or provides a counter-example. Moreover, we present the design of Lean’s canonical bitvector library, which supports all operations (with reasoning principles) for the SMT-LIB 2.7 standard (including overflow modeling), is fast-to-execute, and offers a comprehensive API and automation for bit-width-independent reasoning. We thoroughly evaluate our bit-blaster on a comprehensive set of benchmarks, including the full SMT-LIB dataset, where bv_decide solves more theorems than the state-of-the-art in verified bit-blasting, CoqQFBV. We also verify over 7000 SMT statements extracted from LLVM, providing the largest mechanized verification of LLVM rewrites to date, to our knowledge. By making bit-blasting bitvector reasoning a polished, well-supported, and interactive feature of modern ITPs, we enable effective, dependable white-box reasoning for bitvector-level verification.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
The Simple Essence of Overloading: Making Ad-Hoc Polymorphism More Algebraic with Flow-Based Variational Type-Checking
Jiří Beneš and Jonathan Immanuel Brachthäuser
(University of Tübingen, Germany)
Type-directed overload resolution allows programmers to reuse the same name, offloading disambiguation to the type checker. Since many programming languages implement overload resolution by performing backtracking in the type checker, it is commonly believed to be incompatible with Hindley-Milner-style type systems. In this paper, we present an approach to overload resolution that combines insights from variational type checking and algebraic subtyping. We formalize and discuss our flow-based variational framework that captures the essence of overloads by representing them as choices. This cleanly separates constraint collection, constraint solving, and overload resolution. We believe our framework not only gives rise to more modular and efficient implementations of type checkers, but also serves as a simpler mental model and paves the way for improved error messages.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
ROSpec: A Domain-Specific Language for ROS-Based Robot Software
Paulo Canelas, Bradley Schmerl, Alcides Fonseca, and Christopher S. Timperley
(Carnegie Mellon University, USA; University of Lisbon, Portugal)
Component-based robot software frameworks, such as the Robot Operating System (ROS), allow developers to quickly compose and execute systems by focusing on configuring and integrating reusable, off-the-shelf components. However, these components often lack documentation on how to configure and integrate them correctly. Even when documentation exists, its natural language specifications are not enforced, resulting in misconfigurations that lead to unpredictable and potentially dangerous robot behaviors. In this work, we introduce ROSpec, a ROS-tailored domain-specific language designed to specify and verify component configurations and their integration. ROSpec's design is grounded in ROS domain concepts and informed by a prior empirical study on misconfigurations, allowing the language to provide a usable and expressive way of specifying and detecting misconfigurations. At a high level, ROSpec verifies the correctness of argument and component configurations, ensures the correct integration of components by checking their communication properties, and checks if configurations respect the assumptions and constraints of their deployment context. We demonstrate ROSpec's ability to specify and verify components by modeling a medium-sized warehouse robot with 19 components, and by manually analyzing, categorizing, and implementing partial specifications for components from a dataset of 182 misconfiguration questions extracted from a robotics Q&A platform.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable Results Reproduced
Enhancing APR with PRISM: A Semantic-Based Approach to Overfitting Patch Detection
Dowon Song and Hakjoo Oh
(Korea University, Republic of Korea)
We present PRISM, a novel technique for detecting overfitting patches in automatic program repair (APR). Despite significant advances in APR, overfitting patches—those passing test suites but not fixing bugs—persist, degrading performance and increasing developer review burden. To mitigate overfitting, various automatic patch correctness classification (APCC) techniques have been proposed. However, while accurate, existing APCC methods often mislabel scarce correct patches as incorrect, significantly lowering the APR fix rate. To address this, we propose (1) novel semantic features capturing patch-induced behavioral changes and (2) a tailored learning algorithm that preserves correct patches while filtering incorrect ones. Experiments on ranked patch data from 10 APR tools show that PRISM uniquely reduces review burden and finds more correct patches, whereas other methods lower the fix rate by misclassifying correct patches. Evaluations on 1,829 labeled patches confirm that PRISM removes more incorrect patches at equal correct-patch preservation rates.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Validating Soundness and Completeness in Pattern-Match Coverage Analyzers
Cyril Flurin Moser, Thodoris Sotiropoulos, Chengyu Zhang, and Zhendong Su
(ETH Zurich, Switzerland)
Pattern matching is a powerful mechanism for writing safe and expressive conditional logic. Once primarily associated with functional programming, it has become a common paradigm even in non-functional languages, such as Java. Languages that support pattern matching include specific analyzers, known as pattern-match coverage analyzers, to ensure its correct and efficient use by statically verifying properties such as exhaustiveness and redundancy. However, these analyzers can suffer from soundness and completeness issues, leading to false negatives (unsafe patterns mistakenly accepted) or false positives (valid patterns incorrectly rejected).
In this work, we present a systematic approach for validating soundness and completeness in pattern-match coverage analyzers. The approach consists of a novel generator for algebraic data types and pattern-matching statements, supporting features that increase the complexity of coverage analysis, such as generalized algebraic data types. To establish the test oracle without building a reference implementation from scratch, the approach generates both exhaustive and inexhaustive pattern-matching cases, either by construction or by encoding them as SMT formulas. The latter leads to a universal test oracle that cross-checks coverage analysis results against a constraint solver, exposing soundness and completeness bugs in case of inconsistencies. We implement this approach in Ikaros, which we evaluate on three major compilers: Scala, Java, and Haskell. Despite pattern-match coverage analyzers being only a small part of these compilers, Ikaros has uncovered 16 bugs, of which 12 have been fixed. Notably, 7 instances were important soundness bugs that could lead to unexpected runtime errors. Additionally, Ikaros provides a scalable framework that can be extended to any language with ML-like pattern matching.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Complete the Cycle: Reachability Types with Expressive Cyclic References
Haotian Deng, Siyuan He, Songlin Jia, Yuyan Bao, and Tiark Rompf
(Purdue University, USA; Augusta University, USA)
Local reasoning about programs that combine aliasing and mutable state is a longstanding challenge. Existing approaches – ownership systems, linear and affine types, uniqueness types, and lexical effect tracking – impose global restrictions such as uniqueness or linearity, or rely on shallow syntactic analyses. These designs fall short with higher-order functions and shared mutable state. Reachability Types (RT) track aliasing and separation in higher-order programs, ensuring runtime safety and non-interference. However, RT systems face three key limitations: (1) they prohibit cyclic references, ruling out non-terminating computations and fixed-point combinators; (2) they require deep tracking, where a qualifier must include all transitively reachable locations, reducing precision and hindering optimizations like fine-grained parallelism; and (3) referent qualifier invariance prevents referents from escaping their allocation contexts, making reference factories inexpressible.
In this work, we address these limitations by extending RT with three mechanisms that enhance expressiveness. First, we introduce cyclic references, enabling recursive patterns to be encoded directly through the store. Second, we adopt shallow qualifier tracking, decoupling references from their transitively reachable values. Finally, we introduce an escaping rule with reference subtyping, allowing referent qualifiers to outlive their allocation context. These extensions are formalized in the F<:-calculus with a mechanized proof of type soundness, and case studies illustrate expressiveness through fixpoint combinators, non-interfering parallelism, and escaping read-only references.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Borrowing from Session Types
Hannes Saffrich, Janek Spaderna, Peter Thiemann, and Vasco T. Vasconcelos
(University of Freiburg, Germany; University of Lisbon, Portugal)
Session types provide a formal framework to enforce rich communication protocols, ensuring correctness properties such as type safety and deadlock freedom. However, the traditional API of functional session type systems with first-class channels often leads to problems with modularity and composability.
This paper proposes a new, alternative session type API based on borrowing, embodied in the core calculus BGV. The borrowing-based API enables building modular and composable code for session type clients without imposing clutter or undue limitations. Its basis is a novel type system, founded on ordered linear typing, for functional session types with an explicit operation for splitting ownership of channels. We establish the semantics of BGV via a type-preserving translation to PGV, a deadlock-free functional session type calculus. We establish type safety and deadlock freedom for BGV by this translation.
We also present an external version of BGV that supports the use of borrow notation. We developed an algorithmic version of the type system that includes a mechanized, verified translation from the external language to BGV. This establishes decidable type checking.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
AutoVerus: Automated Proof Generation for Rust Code
Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu
(University of Illinois at Urbana-Champaign, USA; Columbia University, USA; University of California at Irvine, USA; University of Toronto, Canada; Microsoft Research, USA; Microsoft Research, China; University of Chicago, USA)
Generative AI has shown its value for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLMs to automatically generate correctness proofs for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proofs for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
HeapBuffers: Why Not Just Using a Binary Serialization Format for Your Managed Memory?
Daniele Bonetta, Júnior Löff, Matteo Basso, and Walter Binder
(Vrije Universiteit Amsterdam, Netherlands; USI Lugano, Switzerland)
In modern cloud applications, data serialization and deserialization (DSD) are common yet expensive operations. Languages such as Java incur significant overhead from DSD operations because of their managed memory, which uses an IO-incompatible, sparse, platform-dependent data layout for storing objects. Can language VMs bypass costly DSD operations by directly operating on an IO-ready binary format? In this paper, we address this question by presenting a new DSD library for Java called HeapBuffers. HeapBuffers maintains a Java-friendly programming model through managed memory, yet it operates on data stored using a dense, platform-independent memory format. In this way, HeapBuffers data can be used to perform IO operations without the need to convert objects from/to a VM-internal memory representation, thus eliminating the need for expensive DSD operations. Compiler optimizations combined with specialized garbage collection ensure that accessing HeapBuffers data is nearly as efficient as handling regular Java objects. The answer to our question is largely positive: our experiments show that HeapBuffers data can be accessed at high performance and can be used as is to perform IO, making the cost of DSD effectively negligible. IO performance of applications using HeapBuffers is often orders of magnitude faster than that obtained using the prevailing DSD libraries.
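The underlying idea of operating directly on a dense, IO-ready byte layout rather than materializing language-level objects can be sketched independently of Java; the snippet below is a conceptual analogue using Python's struct module, not the HeapBuffers API.

    import struct

    # Conceptual analogue only (not the HeapBuffers API): a record stored in a
    # dense, fixed-endian byte layout that is already suitable for IO, so field
    # reads and writes touch the buffer in place instead of (de)serializing objects.
    POINT = struct.Struct("<ii")              # two little-endian 32-bit ints: x, y

    def write_point(buf, index, x, y):
        POINT.pack_into(buf, index * POINT.size, x, y)

    def read_point(buf, index):
        return POINT.unpack_from(buf, index * POINT.size)

    buf = bytearray(POINT.size * 1024)        # one flat buffer, ready to be sent as-is
    write_point(buf, 0, 3, 4)
    assert read_point(buf, 0) == (3, 4)       # no separate serialization step before IO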

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced ACM SIGPLAN Distinguished Paper Award
Tunneling through the Hill: Multi-way Intersection for Version-Space Algebras in Program Synthesis
Guanlin Chen, Ruyi Ji, Shuhao Zhang, and Yingfei Xiong
(Peking University, China)
Version space algebra (VSA) is an effective data structure for representing sets of programs and has been extensively used in program synthesis. Despite this success, a crucial shortcoming of VSA-based synthesis is its inefficiency when processing many examples. Given a set of IO examples, a typical VSA-based synthesizer runs by first constructing an individual VSA for each example and then iteratively intersecting these VSAs one by one. However, the intersection of two VSAs can be much larger than the original ones -- this effect accumulates during the iteration, making the scale of intermediate VSAs quickly explode.
In this paper, we aim to reduce the cost of intersecting VSAs in synthesis. We investigate the process of the iterative intersection and observe that, although this process may construct some huge intermediate VSAs, its final VSA is usually small in practice because only a few programs can pass all examples. Utilizing this observation, we propose the approach of multi-way intersection, which directly intersects multiple small VSAs into the final result, thus avoiding the previous bottleneck of constructing huge intermediate VSAs. Furthermore, since the previous intersection algorithm is inefficient for multiple VSAs, we design a novel algorithm to avoid most unnecessary VSA nodes. We integrated our approach into two state-of-the-art VSA-based synthesizers: a general synthesizer based on VSA and a specialized one for the string domain, Blaze. We evaluate them over 4 datasets comprising 994 synthesis tasks; the results show that our approach can significantly improve the performance of VSA-based synthesis, with up to 105 more tasks solved and a speedup of 7.36 times.
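A toy model conveys what multi-way intersection buys: instead of folding two-way intersections and materializing their intermediate results, the procedure below recurses over an n-tuple of version-space nodes at once. This is a simplified sketch of the general idea under an assumed toy node encoding, not the paper's algorithm or data structure.

    from functools import lru_cache
    from itertools import product

    # Toy version-space nodes (a simplified illustration): ("leaf", frozenset_of_programs),
    # ("join", op, children_tuple), or ("union", alternatives_tuple). Multi-way
    # intersection recurses over an n-tuple of nodes at once, so the potentially huge
    # intermediates of iterated two-way intersection are never materialized.
    def alternatives(node):
        return node[1] if node[0] == "union" else (node,)

    @lru_cache(maxsize=None)
    def intersect(*nodes):
        results = []
        for combo in product(*(alternatives(n) for n in nodes)):
            kinds = {n[0] for n in combo}
            if kinds == {"leaf"}:
                common = frozenset.intersection(*(n[1] for n in combo))
                if common:
                    results.append(("leaf", common))
            elif kinds == {"join"} and len({(n[1], len(n[2])) for n in combo}) == 1:
                kids = [intersect(*col) for col in zip(*(n[2] for n in combo))]
                if None not in kids:
                    results.append(("join", combo[0][1], tuple(kids)))
        if not results:
            return None                        # no program satisfies every example
        return results[0] if len(results) == 1 else ("union", tuple(results))

    # Programs consistent with two IO examples, intersected in a single step:
    vsa1 = ("union", (("leaf", frozenset({"x", "x+0"})), ("leaf", frozenset({"1"}))))
    vsa2 = ("leaf", frozenset({"x", "2"}))
    assert intersect(vsa1, vsa2) == ("leaf", frozenset({"x"}))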

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Zero-Overhead Lexical Effect Handlers
Cong Ma, Zhaoyi Ge, Max Jung, and Yizhou Zhang
(University of Waterloo, Canada)
Exception handlers—and effect handlers more generally—are language mechanisms for structured non-local control flow. A recent trend in language-design research has introduced lexically scoped handlers, which address a modularity problem with dynamic scoping. While dynamically scoped handlers allow zero-overhead implementations when no effects are raised, existing implementations of lexically scoped handlers require programs to pay a cost just for having handlers in the lexical context.
In this paper, we present a novel approach to implementing lexically scoped handlers of exceptional effects. It satisfies the zero-overhead principle—a property otherwise met by most modern compilers supporting dynamically scoped exception handlers. The key idea is a type-directed translation that emits information indicating how handlers come into the lexical context. This information guides the runtime in walking the stack to locate the right handler. Crucially, no reified lexical identifiers of handlers are needed, and mainline code is not slowed down by the presence of handlers in the program text.
We formalize the essential aspects of this compilation scheme and prove it correct. We integrate our approach into the Lexa language, allowing the compilation strategy to be customized for each declared effect based on its expected invocation rate. Empirical results suggest that the new Lexa compiler reduces run-time overhead in low-effect or no-effect scenarios while preserving competitive performance for effect-heavy workloads.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Multi-modal Sketch-Based Behavior Tree Synthesis
Wenmeng Zhang, Zhenbang Chen, and Weijiang Hong
(National University of Defense Technology, Changsha, China)
Behavior trees (BTs) are widely adopted in the field of agent control, particularly in robotics, due to their modularity and reactivity. However, constructing a BT that meets the desired expectations is time-consuming and challenging, especially for non-experts. This paper presents BtBot, a multi-modal sketch-based behavior tree synthesis technique. Given a natural language task description and a set of positive and negative examples, BtBot automatically generates a BT program that aligns with the natural language description and meets the requirements of the examples. Inside BtBot, an LLM is employed to understand the task’s natural language description and generate a sketch of the task execution. Then, BtBot searches the sketch to synthesize a candidate BT program consistent with the user-provided positive and negative examples. When the sketch is proven to be incapable of generating the target BT, BtBot provides a multi-step repairing method that modifies the control nodes and structure of the sketch to search for the desired BT. We have implemented BtBot in a prototype and evaluated it on a benchmark of 70 tasks across multiple scenarios. The experimental results indicate that BtBot outperforms the existing BT synthesis techniques in effectiveness and efficiency. In addition, two user studies have been conducted to demonstrate the usefulness of BtBot.
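For readers unfamiliar with behavior trees, the sketch below shows the kind of object being synthesized and how a candidate tree can be checked against positive and negative examples; it is a toy interpreter over a hypothetical navigation task, not BtBot's internal representation.

    # Toy behavior-tree interpreter and example-consistency check (a simplified
    # illustration; the obstacle-avoidance task is hypothetical).
    SUCCESS, FAILURE = "success", "failure"

    def sequence(*children):   # succeeds only if every child succeeds, left to right
        return lambda state: next((FAILURE for c in children if c(state) == FAILURE), SUCCESS)

    def fallback(*children):   # succeeds as soon as one child succeeds
        return lambda state: next((SUCCESS for c in children if c(state) == SUCCESS), FAILURE)

    def condition(pred):
        return lambda state: SUCCESS if pred(state) else FAILURE

    def action(name):          # record the executed action and report success
        return lambda state: (state["trace"].append(name), SUCCESS)[1]

    def consistent(tree, positives, negatives):
        def trace_of(env):
            state = {**env, "trace": []}
            tree(state)
            return state["trace"]
        return (all(trace_of(env) == t for env, t in positives) and
                all(trace_of(env) != t for env, t in negatives))

    tree = fallback(sequence(condition(lambda s: s["obstacle"]), action("turn")),
                    action("forward"))
    assert consistent(tree,
                      positives=[({"obstacle": True}, ["turn"]),
                                 ({"obstacle": False}, ["forward"])],
                      negatives=[({"obstacle": True}, ["forward"])])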

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Garbage Collection for Rust: The Finalizer Frontier
Jacob Hughes and Laurence Tratt
(King’s College London, UK)
Rust is a non-Garbage Collected (GCed) language, but the lack of GC makes expressing data structures that require shared ownership awkward, inefficient, or both. In this paper we explore a new design for, and implementation of, GC in Rust, called Alloy. Unlike previous approaches to GC in Rust, Alloy allows existing Rust destructors to be automatically used as GC finalizers: this makes Alloy integrate better with existing Rust code than previous solutions but introduces surprising soundness and performance problems. Alloy provides novel solutions for the core problems: finalizer safety analysis prevents unsound destructors from automatically being reused as finalizers; finalizer elision optimises away unnecessary finalizers; and premature finalizer prevention ensures that finalizers are only run when it is provably safe to do so.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy through Machine Code-Level Slowdown
Humphrey Burchell and Stefan Marr
(University of Kent, UK)
Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications, from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results. To assess the accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs. Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that Async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth through slowing down specific basic blocks and found large differences in accuracy between the profilers. We believe our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
On the Impact of Formal Verification on Software Development
Eric Mugnier, Yuanyuan Zhou, Ranjit Jhala, and Michael Coblenz
(University of California at San Diego, USA)
Auto-active verifiers like Dafny aim to make formal methods accessible to non-expert users through SMT automation. However, despite the automation and other programmer-friendly features, they remain sparsely used in real-world software development, due to the significant effort required to apply them in practice. We interviewed 14 experienced Dafny users about their experiences using it in large-scale projects. We apply grounded theory to analyze the interviews, systematically identifying how auto-active verification impacts software development and pinpointing opportunities to simplify its use and, hence, expand the adoption of verification in software development.

Publisher's Version Published Artifact Artifacts Available
Syntactic Completions with Material Obligations
David Moon, Andrew Blinn, Thomas J. Porter, and Cyrus Omar
(University of Michigan, USA)
Code editors provide essential services that help developers understand, navigate, and modify programs. However, these services often fail in the presence of syntax errors. Existing syntax error recovery techniques, like panic mode and multi-option repairs, are either too coarse, e.g. in deleting large swathes of code, or lead to a proliferation of possible completions. This paper introduces tall tylr, an error-handling parser and editor generator that completes malformed code with syntactic obligations that abstract over many possible completions. These obligations generalize the familiar notion of holes in structure editors to cover missing operands, operators, delimiters, and sort transitions.
tall tylr is backed by a novel theory of tile-based parsing, conceptually organized around a molder that turns tokens into tiles and a melder that completes and parses tiles into terms using an error-handling generalization of operator-precedence parsing. We formalize melding as a parsing calculus, meldr, that completes input tiles with additional obligations such that it can be parsed into a well-formed term, with success guaranteed over all inputs. We further describe how tall tylr implements molding and completion-ranking using the principle of minimizing obligations.
Obligations offer a useful way to scaffold internal program representations, but in tall tylr we go further to investigate the potential of materializing these obligations visually to the programmer. We conduct a user study to evaluate the extent to which an editor like tall tylr that materializes syntactic obligations might be usable and useful, finding both points of positivity and interesting new avenues for future work.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Faster Explicit-Trace Monitoring-Oriented Programming for Runtime Verification of Software Tests
Kevin Guan, Marcelo d'Amorim, and Owolabi Legunsen
(Cornell University, USA; North Carolina State University, USA)
Runtime verification (RV) monitors program executions for conformance with formal specifications (specs). This paper concerns Monitoring-Oriented Programming (MOP), the only RV approach shown to scale to thousands of open-source GitHub projects when simultaneously monitoring passing unit tests against dozens of specs. Explicitly storing traces—sequences of spec-related program events—can make it easier to debug spec violations or to monitor tests against hyperproperties, which requires reasoning about sets of traces. But most online MOP algorithms are implicit-trace, i.e., they work event by event to avoid the time and space costs of storing traces. Yet TraceMOP, the only explicit-trace online MOP algorithm, is often too slow and often fails.
We propose LazyMOP, a faster explicit-trace online MOP algorithm for RV of tests that is enabled by three simple optimizations. First, whereas all existing online MOP algorithms eagerly monitor all events as they occur, LazyMOP lazily stores only unique traces at runtime and monitors them just before the test run ends. Lazy monitoring is inspired by a recent finding: 99.87% of traces during RV of tests are duplicates. Second, to speed up trace storage, LazyMOP encodes events and their locations as integers, and amortizes the cost of looking up locations across events. Lastly, LazyMOP only synchronizes accesses to its trace store after detecting multi-threading, unlike TraceMOP’s eager and wasteful synchronization of all accesses.
On 179 Java open-source projects, LazyMOP is up to 4.9x faster and uses 4.8x less memory than TraceMOP, finding the same traces (modulo test non-determinism) and violations. We show LazyMOP’s usefulness in the context of software evolution, where tests are re-run after each code change. LazyMOPe optimizes LazyMOP in this context by generating fewer duplicate traces. Using unique traces from one code version, LazyMOPe finds all pairs of method 𝑚 and spec 𝑠, where all traces for 𝑠 in 𝑚 are identical. Then, in a future version, LazyMOPe generates and monitors only one trace of 𝑠 in 𝑚. LazyMOPe is up to 3.9x faster than LazyMOP and it speeds up two recent techniques that speed up RV during evolution by up to 4.6x with no loss in violations.
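The lazy, deduplicating trace store behind the first two optimizations can be sketched in a few lines; the class below is an illustrative sketch with a hypothetical monitor callable, not the LazyMOP implementation.

    # Sketch of a lazy, deduplicating trace store (illustration only; the monitor
    # callable is a hypothetical stand-in for a compiled spec monitor).
    class LazyTraceStore:
        def __init__(self):
            self.event_ids = {}          # (event name, code location) -> small int
            self.open_traces = {}        # monitor instance key -> list of event ids
            self.unique_traces = set()   # deduplicated, integer-encoded traces

        def record(self, key, event_name, location):
            eid = self.event_ids.setdefault((event_name, location), len(self.event_ids))
            self.open_traces.setdefault(key, []).append(eid)

        def finish(self, key):
            self.unique_traces.add(tuple(self.open_traces.pop(key, ())))

        def monitor_all(self, monitor):
            # Runs once, just before the test run ends: each unique trace is checked
            # exactly once, no matter how many duplicates occurred at runtime.
            return [trace for trace in self.unique_traces if not monitor(trace)]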

Publisher's Version
On Higher-Order Model Checking of Effectful Answer-Type-Polymorphic Programs
Taro Sekiyama, Ugo Dal Lago, and Hiroshi Unno
(National Institute of Informatics, Japan; SOKENDAI, Japan; University of Bologna, Italy; Inria, France; Tohoku University, Japan)
Applying higher-order model checking techniques to programs that use effect handlers is a major challenge, given the recent undecidability result obtained by Dal Lago and Ghyselen. This challenge has been addressed by using answer-type modifications, the use of a monomorphic version of which makes it possible to recover decidability. However, the absence of polymorphism leads to a loss of modularity, reusability, and even expressivity. In this work, we study the problem of defining a calculus that on the one hand supports answer-type polymorphism and subtyping but on the other hand ensures that the underlying model checking problem remains decidable. The solution proposed in this paper is based on the introduction of the polymorphic answer type □, whose role is to provide a good compromise between expressiveness and decidability, the latter demonstrated through the construction of a selective type-directed CPS transformation targeting a calculus without effect handlers or any form of polymorphism. Notably, the introduced calculus HEPCFATM allows the answer types of effects implemented by tail-resumptive effect handlers to be polymorphic. We also implemented a proof-of-concept model checker for HEPCFATM programs.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
DepFuzz: Efficient Smart Contract Fuzzing with Function Dependence Guidance
Chenyang Ma, Wei Song, and Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Fuzzing is an effective technique to detect vulnerabilities in smart contracts. The challenge of smart contract fuzzing lies in the statefulness of contracts, meaning that certain vulnerabilities manifest only in specific contract states. State-of-the-art fuzzers may generate and execute a plethora of meaningless or redundant transaction sequences during fuzzing, incurring a penalty in efficiency. To address this, we present DepFuzz, a hybrid fuzzer for efficient smart contract fuzzing, which introduces a symbolic execution module into the feedback-based fuzzer. Guided by distance-based dependencies between functions, DepFuzz can efficiently yield meaningful transaction sequences that contribute to vulnerability exposure or code coverage. The experiments on 286 benchmark smart contracts and 500 large real-world smart contracts corroborate that, compared to state-of-the-art approaches, DepFuzz achieves a higher instruction coverage rate and uncovers many more vulnerabilities in less time.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Type-Outference with Label-Listeners: Foundations for Decidable Type-Consistency for Nominal Object-Oriented Generics
Ross Tate
(Independent Researcher and Consultant, USA)
We present the first sound and complete decision procedure capable of validating nominal object-oriented programs without type annotations in expressions even in the presence of generic interfaces with irregularly-recursive signatures. Such interface signatures are exemplary of the challenges an algorithm must overcome in order to scale to the needs of modern major typed object-oriented languages. We do so by incorporating event-driven label-listeners into our constraint system, enabling more complex aspects of program validation to be generated on demand while still ensuring termination. Furthermore, we define type-consistency as a novel declarative notion of program validity that ensures safety without requiring a complex grammar of types to perform inference within. While type-inferability ensures type-consistency, the converse does not necessarily hold. Thus our algorithm decides program validity without inferring the types missing from the program, and so we instead call it a type-outference algorithm. By bypassing type-inference, the proofs involving type-consistency more directly connect the design of and reasoning about constraints to the operational semantics of the language, simplifying much of the design and verification process. We mechanically formalize and verify these concepts and techniques—type-consistency, type-outference, and label-listeners—in Rocq to provide a solid foundation for scaling decidability beyond the limitations of structural types.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Integrating Resource Analyses via Resource Decomposition
Long Pham, Yue Niu, Nathan Glover, Feras Saad, and Jan Hoffmann
(Carnegie Mellon University, USA; National Institute of Informatics, Japan)
Resource analysis aims to derive symbolic resource bounds of programs. Although numerous resource-analysis techniques have been developed—ranging from static to dynamic and manual to automated techniques—they each come with their own distinct strengths and weaknesses. To overcome the limitations of individual resource-analysis techniques, a promising approach is to combine them in such a way that retains their complementary strengths while mitigating their respective weaknesses. This article proposes a novel program translation method called resource decomposition that facilitates the combination of different resource-analysis techniques. The key idea of resource decomposition is to first identify and annotate the program with resource components, which are user-specified variables that serve as an interface between different analysis techniques. Using these resource components, our method generates a resource-guarded program, where one analysis technique is used to infer an overall cost bound parametric in the resource components, and other analysis techniques are used to infer symbolic bounds to be substituted for the resource components. We establish the soundness of resource decomposition using a denotational cost semantics and a binary logical relation. It states that composing sound bounds results in a sound bound for the original program. Furthermore, we present three instantiations of the resource-decomposition framework, each representing distinct combinations of static, data-driven, and manual resource analyses. The data-driven part of these instantiations is a novel Bayesian approach to inferring linear and logarithmic bounds of recursion depths. An implementation and empirical evaluation of resource decomposition demonstrates that it can effectively infer sound and asymptotically tight cost bounds for a number of challenging benchmarks that are beyond the reach of previous analysis methods.
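A small worked instance conveys the composition step: one analysis yields a cost bound parametric in a resource component d (here, recursion depth), a second analysis bounds d itself, and substituting the latter into the former gives the composed bound. The program and the bounds below are a toy illustration with hypothetical bounds, not one of the paper's benchmarks.

    import math
    import random

    # Toy resource decomposition (hypothetical bounds chosen for illustration):
    # comparisons(merge sort of n items) <= n * d, where the resource component d
    # is the recursion depth; a separate analysis bounds d by ceil(log2 n) + 1;
    # substitution yields the composed bound n * (ceil(log2 n) + 1).
    def merge_sort(xs, depth=1, stats=None):
        stats = stats if stats is not None else {"comparisons": 0, "depth": 0}
        stats["depth"] = max(stats["depth"], depth)
        if len(xs) <= 1:
            return list(xs), stats
        mid = len(xs) // 2
        left, _ = merge_sort(xs[:mid], depth + 1, stats)
        right, _ = merge_sort(xs[mid:], depth + 1, stats)
        merged = []
        while left and right:
            stats["comparisons"] += 1
            merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
        return merged + left + right, stats

    n = 1000
    _, stats = merge_sort([random.random() for _ in range(n)])
    depth_bound = math.ceil(math.log2(n)) + 1            # bound on the component d
    assert stats["depth"] <= depth_bound
    assert stats["comparisons"] <= n * stats["depth"]    # cost bound parametric in d
    assert stats["comparisons"] <= n * depth_bound       # composed, closed-form bound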

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
A Flow-Sensitive Refinement Type System for Verifying eBPF Programs
Ameer Hamza, Lucas Zavalía, Arie Gurfinkel, Jorge A. Navas, and Grigory Fedyukovich
(Florida State University, USA; University of Waterloo, Canada; Certora, USA)
The Extended Berkeley Packet Filter (eBPF) subsystem within an operating system’s kernel enables userspace programs to extend kernel functionality dynamically. Due to the security risks associated with runtime modification of the operating system, eBPF requires all programs to be verified before deploying them within the kernel. Existing approaches to eBPF verification are monolithic, requiring their entire analysis to be done in a secure environment, resulting in the need for extensive trusted codebases. We present a type-based verification approach that automatically infers proof certificates in userspace, thus reducing the size and complexity of the trusted codebase. At the same time, only the proof-checking component needs to be deployed in a secure environment. Moreover, compared to previous techniques, our type system enhances the debuggability of the programs for users through ergonomic type annotations when verification fails. We implemented our type inference algorithm in a tool called VeRefine and evaluated it against an existing eBPF verifier, Prevail. VeRefine outperformed Prevail on most of the industrial benchmarks.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
An Empirical Study of Bugs in the rustc Compiler
Zixi Liu, Yang Feng, Yunbo Ni, Shaohua Li, Xizhe Yin, Qingkai Shi, Baowen Xu, and Zhendong Su
(Nanjing University, China; Chinese University of Hong Kong, China; ETH Zurich, Switzerland)
Rust is gaining popularity for its well-known memory safety guarantees and high performance, distinguishing it from C/C++ and JVM-based languages. Its compiler, rustc, enforces these guarantees through specialized mechanisms such as trait solving, borrow checking, and specific optimizations. However, Rust's unique language mechanisms introduce complexity to its compiler, resulting in bugs that are uncommon in traditional compilers. With Rust's increasing adoption in safety-critical domains, understanding these language mechanisms and their impact on compiler bugs is essential for improving the reliability of both rustc and Rust programs. Such understanding could provide the foundation for developing more effective testing strategies tailored to rustc. Improving the quality of rustc testing is essential for enhancing compiler reliability, which in turn strengthens the safety and correctness of all Rust programs, as compiler bugs can silently propagate into every compiled program. Yet, we still lack a large-scale, detailed, and in-depth study of rustc bugs.

To bridge this gap, this work presents a comprehensive and systematic study of rustc bugs, specifically those originating in semantic analysis and intermediate representation (IR) processing, which are stages that implement essential Rust language features such as ownership and lifetimes. Our analysis examines issues and fixes reported between 2022 and 2024, with a manual review of 301 valid issues. We categorize these bugs based on their causes, symptoms, affected compilation stages, and test case characteristics. Additionally, we evaluate existing rustc testing tools to assess their effectiveness and limitations. Our key findings include: (1) rustc bugs primarily arise from Rust's type system and lifetime model, with frequent errors in the High-Level Intermediate Representation (HIR) and Mid-Level Intermediate Representation (MIR) modules due to complex checkers and optimizations; (2) bug-revealing test cases often involve unstable features, advanced trait usages, lifetime annotations, standard APIs, and specific optimization levels; (3) while both valid and invalid programs can trigger bugs, existing testing tools struggle to detect non-crash errors, underscoring the need for further advancements in rustc testing.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
An Empirical Evaluation of Property-Based Testing in Python
Savitha Ravi and Michael Coblenz
(University of California at San Diego, USA)
Property-based testing (PBT) is a testing methodology with origins in the functional programming community. In recent years, PBT libraries have been developed for non-functional languages, including Python. However, to date, there is little evidence regarding how effective property-based tests are at finding bugs, and whether some kinds of property-based tests might be more effective than others. To gather this evidence, we conducted a corpus study of 426 Python programs that use Hypothesis, Python’s most popular library for PBT. We developed formal definitions for 12 categories of property-based test and implemented an intraprocedural static analysis that categorizes tests. Then, we evaluated the efficacy of test suites of 40 projects using mutation testing, and found that on average, each property-based test finds about 50 times as many mutations as the average unit test. We also identified the categories with the tests most effective at finding mutations, finding that tests that look for exceptions, that test inclusion in collections, and that check types are over 19 times more effective at finding mutations than other kinds of property-based tests. Finally, we conducted a parameter sweep study to assess the strength of property-based tests as a function of the number of random inputs generated, finding that 76% of mutations found were found within the first 20 inputs.
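For readers unfamiliar with Hypothesis, the following properties illustrate three of the test categories the study examines (inclusion in collections, type checks, and expected exceptions); the properties themselves are illustrative examples, not tests drawn from the studied corpus.

    from hypothesis import given, strategies as st

    # Illustrative properties (not from the studied corpus) in three of the
    # analyzed categories: inclusion in collections, type checks, and expected
    # exceptions.
    @given(st.lists(st.integers()), st.integers())
    def test_removed_element_not_in_list(xs, x):
        trimmed = [y for y in xs if y != x]
        assert x not in trimmed                      # inclusion in collections

    @given(st.lists(st.integers(), min_size=1))
    def test_sorted_returns_list_of_ints(xs):
        result = sorted(xs)
        assert isinstance(result, list)              # type checks
        assert all(isinstance(y, int) for y in result)

    @given(st.lists(st.integers()))
    def test_out_of_range_index_raises(xs):
        try:
            xs[len(xs)]
            assert False, "expected IndexError"      # expected exceptions
        except IndexError:
            pass

    if __name__ == "__main__":                       # also runnable without pytest
        test_removed_element_not_in_list()
        test_sorted_returns_list_of_ints()
        test_out_of_range_index_raises()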

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Bennet: Randomized Specification Testing for Heap-Manipulating Programs
Zain K Aamer and Benjamin C. Pierce
(University of Pennsylvania, USA)
Property-based testing (PBT), widely used in functional languages and interactive theorem provers, works by randomly generating many inputs to a system under test. While PBT has also seen some use in low-level languages like C, users in this setting must craft all their own generators by hand, rather than letting the tool synthesize most generators automatically from types or logical specifications. For low-level code with complex memory ownership patterns, writing such generators can waste significant amounts of time. CN, a specification and verification framework for C, features a streamlined presentation of separation logic that is specially tuned to present only "easy" logical problems to an underlying constraint solver. Prior work on the Fulminate testing framework has shown that CN's streamlined specifications can also be checked effectively at run time, providing an oracle for testing whether a memory state satisfies a pre- or postcondition. We show that the restricted syntax of CN is also a good basis for deriving generators for random inputs satisfying separation-logic preconditions. We formalize the semantics for a DSL describing these generators, as well as optimizations that reorder when values are generated and propagate arithmetic constraints. Using this DSL, we implement a property-based testing tool, Bennet, that generates and runs random tests for C functions annotated with CN specifications. We evaluate Bennet on a corpus of programs with CN specifications and show that it can efficiently generate bug-revealing inputs for heap-manipulating programs with complex preconditions.
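The flavor of generator such a tool derives can be sketched with a toy heap model: randomly build a store that satisfies a "well-formed linked list" precondition with distinct ownership of every cell, then use an executable check as the oracle. The snippet below is a Python analogue under that toy model, not Bennet or CN syntax.

    import random

    # Toy analogue of a derived generator (not Bennet or CN syntax): randomly build
    # a heap, modeled as a dict from addresses to (value, next) cells, rooted at head.
    NULL = 0

    def gen_list_heap(max_len=5):
        heap, head = {}, NULL
        fresh = iter(range(1, 10_000))               # fresh, hence distinct, addresses
        for _ in range(random.randint(0, max_len)):
            addr = next(fresh)
            heap[addr] = (random.randint(-100, 100), head)   # cons onto the front
            head = addr
        return heap, head

    def is_well_formed_list(heap, p):
        seen = set()
        while p != NULL:
            if p in seen or p not in heap:           # no sharing, no dangling pointers
                return False
            seen.add(p)
            _, p = heap[p]
        return True

    heap, head = gen_list_heap()
    assert is_well_formed_list(heap, head)           # precondition holds by construction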

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Structural Information Flow: A Fresh Look at Types for Non-interference
Hemant Gouni, Frank Pfenning, and Jonathan Aldrich
(Carnegie Mellon University, USA)
Information flow control is a long-studied approach for establishing non-interference properties of programs. For instance, it can be used to prove that a secret does not interfere with some computation, thereby establishing that the former does not leak through the latter. Despite their potential as a holy grail for security reasoning and their maturity within the literature, information flow type systems have seen limited adoption. In practice, information flow specifications tend to be excessively complex and can easily spiral out of control even for simple programs. Additionally, while non-interference is well-behaved in an idealized setting where information leakage never occurs, most practical programs must violate non-interference in order to fulfill their purpose. Useful information flow type systems in prior work must therefore contend with a definition of non-interference extended with declassification, which often offers weaker modular reasoning properties.
We introduce structural information flow, which both illuminates and addresses these issues from a logical viewpoint. In particular, we draw on established insights from the modal logic literature to argue that information flow reasoning arises from hybrid logic, rather than conventional modal logic as previously imagined. We show with a range of examples that structural information flow specifications are straightforward to write and easy to visually parse. Uniquely in the structural setting, we demonstrate that declassification emerges not as an aberration to non-interference, but as a natural and unavoidable consequence of sufficiently general machinery for information flow. This flavor of declassification features excellent local reasoning and enables our approach to account for real-world information flow needs without compromising its theoretical elegance. Finally, we establish non-interference via a logical relations approach, showing off its simplicity in the face of the expressive power captured.

Publisher's Version Published Artifact Artifacts Available
From Linearity to Borrowing
Andrew Wagner, Olek Gierczak, Brianna Marshall, John M. Li, and Amal Ahmed
(Northeastern University, USA)
Linear type systems are powerful because they can statically ensure the correct management of resources like memory, but they can also be cumbersome to work with, since even benign uses of a resource require that it be explicitly threaded through during computation. Borrowing, as popularized by Rust, reduces this burden by allowing one to temporarily disable certain resource permissions (e.g., deallocation or mutation) in exchange for enabling certain structural permissions (e.g., weakening or contraction). In particular, this mechanism spares the borrower of a resource from having to explicitly return it to the lender but nevertheless ensures that the lender eventually reclaims ownership of the resource.
In this paper, we elucidate the semantics of borrowing by starting with a standard linear type system for ensuring safe manual memory management in an untyped lambda calculus and gradually augmenting it with immutable borrows, lexical lifetimes, reborrowing, and finally mutable borrows. We prove semantic type soundness for our Borrow Calculus (BoCa) using Borrow Logic (BoLo), a novel domain-specific separation logic for borrowing. We establish the soundness of this logic using a semantic model that additionally guarantees that our calculus is terminating and free of memory leaks. We also show that our Borrow Logic is robust enough to establish the semantic safety of some syntactically ill-typed programs that temporarily break but reestablish invariants.

Publisher's Version ACM SIGPLAN Distinguished Paper Award
Advancing Performance via a Systematic Application of Research and Industrial Best Practice
Wenyu Zhao, Stephen M. Blackburn, Kathryn S. McKinley, Man Cao, and Sara S. Hamouda
(Australian National University, Australia; Google, Australia; Google, USA; Canva, USA)
An elusive facet of high-impact research is translation to production. Production deployments are intrinsically complex and specialized, whereas research exploration requires stripping away incidental complexity and extraneous requirements to create clarity and generality. Conventional wisdom suggests that promising research rarely holds up once simplifying assumptions and missing features are addressed. This paper describes a productization methodology that led to a striking result: outperforming the mature and highly optimized state of the art by more than 10%.
Concretely, this experience paper captures lessons from translating a high-performance research garbage collector published at PLDI’22, called LXR, to a hyperscale revenue-critical application. Key to our success was a new process that dovetails best practice research methodology with industrial processes, at each step assuring that neither research metrics nor production measures regressed. This paper makes three contributions. i) We advance the state of the art, nearly halving the cost of garbage collection. ii) We advance research translation by sharing our approach and five actionable lessons. iii) We pose questions about how the community should evaluate innovative ideas in mature and heavily productized fields.
We deliver an optimized version of LXR. This collector, as far as we are aware, is the fastest general-purpose garbage collector for Java to date. On standard workloads, it substantially outperforms OpenJDK’s default G1 collector and the prior version of LXR, while also meeting Google internal correctness and uptime requirements. We address all of the limitations identified in the PLDI’22 paper. We use this experience to illustrate the following key lessons that are concrete, actionable, and transferable. L1) A systematic combination of research and production methodologies can meet production objectives while simultaneously advancing the research state-of-the-art. L2) A benchmark suite cannot and need not contain facsimiles of production workloads. It is sufficient to replicate individual pathologies that manifest in production (e.g., allocation rates, trace rates, heap constraints, etc.) or configure the JVM to force a workload to manifest the pathology. L3) Productization experiences can strengthen research workloads; we upstreamed one such workload. L4) Production environment requirements are myriad and sometimes prosaic, extending well beyond those that can reasonably be addressed in a research paper. L5) This collector is not yet in production, reminding us that replacing core technology in industry is a challenging sociotechnical problem.
The artifact we deliver gives practitioners and researchers a new benchmark for high performance garbage collection. The lessons we enumerate should help academic and industrial researchers with actionable steps to close the gap between research paper results and industrial impact. The 10% ‘productization dividend’ the process delivered should spark discussion about how our field evaluates innovative ideas in mature areas.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM
Ao Li, Byeongjee Kang, Vasudev Vikram, Isabella Laybourn, Samvid Dharanikota, Shrey Tiwari, and Rohan Padhye
(Carnegie Mellon University, USA)
Concurrency bugs are hard to discover and reproduce, even in well-synchronized programs that are free of data races. Thankfully, prior work on controlled concurrency testing (CCT) has developed sophisticated algorithms---such as partial-order based and selectively uniform sampling---to effectively search over the space of thread interleavings. Unfortunately, in practice, these techniques cannot easily be applied to real-world Java programs due to the difficulties of controlling concurrency in the presence of the managed runtime and complex synchronization primitives. So, mature Java projects that make heavy use of concurrency still rely on naive repeated stress testing in a loop. In this paper, we take a first-principles approach for elucidating the requirements and design space to enable CCT on arbitrary real-world JVM applications. We identify practical challenges with classical design choices described in prior work---such as concurrency mocking, VM hacking, and OS-level scheduling---that affect bug-finding effectiveness and/or the scope of target applications that can be easily supported.
Based on these insights, we present Fray, a new platform for performing push-button concurrency testing (beyond data races) of JVM programs. The key design principle behind Fray is to orchestrate thread interleavings without replacing existing concurrency primitives, using a concurrency control mechanism called shadow locking for faithfully expressing the set of all possible program behaviors. With full concurrency control, Fray can test applications using a number of search algorithms from a simple random walk to sophisticated techniques like PCT, POS, and SURW. In an empirical evaluation on 53 benchmark programs with known bugs (SCTBench and JaConTeBe), Fray with random walk finds 70% more bugs than JPF and 77% more bugs than RR's chaos mode. We also demonstrate Fray's push-button applicability on 2,664 tests from Apache Kafka, Lucene, and Google Guava. In these mature projects, Fray successfully discovered 18 real-world concurrency bugs that can cause 371 of the existing tests to fail under specific interleavings.
We believe that Fray serves as a bridge between classical academic research and industrial practice---empowering developers with advanced concurrency testing algorithms that demonstrably uncover more bugs, while simultaneously providing researchers a platform for large-scale evaluation of search techniques.
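The basic controlled-concurrency-testing idea of a random walk over interleavings can be sketched with cooperative "threads" that yield at every shared-memory step; the sketch below illustrates that general idea only and is not Fray's shadow-locking design.

    import random

    # Sketch of controlled concurrency testing as a random walk over interleavings
    # (general idea only). "Threads" are generators that yield at every shared-memory
    # step; a seeded scheduler picks who runs next, so any interleaving it finds can
    # be replayed deterministically.
    def racy_increment(shared):
        yield                                        # scheduling point before the read
        tmp = shared["counter"]
        yield                                        # scheduling point before the write
        shared["counter"] = tmp + 1

    def run_interleaving(seed, num_threads=2):
        rng = random.Random(seed)
        shared = {"counter": 0}
        runnable = [racy_increment(shared) for _ in range(num_threads)]
        while runnable:
            thread = rng.choice(runnable)            # the random-walk scheduling decision
            try:
                next(thread)
            except StopIteration:
                runnable.remove(thread)
        return shared["counter"]

    # A lost update shows up as a final counter below 2; the seed makes the failing
    # interleaving reproducible.
    buggy_seed = next(s for s in range(1000) if run_interleaving(s) != 2)
    assert run_interleaving(buggy_seed) == run_interleaving(buggy_seed)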

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable Results Reproduced
Scaling Instruction-Selection Verification against Authoritative ISA Semantics
Michael McLoughlin, Ashley Sheng, Chris Fallin, Bryan Parno, Fraser Brown, and Alexa VanHattum
(Carnegie Mellon University, USA; Wellesley College, USA; F5, USA)
Secure, performant execution of untrusted code—as promised by WebAssembly (Wasm)—requires correct compilation to native code that enforces a sandbox. Errors in instruction selection can undermine the sandbox's guarantees, but prior verification work struggles to scale to the complexity of realistic industrial compilers.
We present Arrival, an instruction-selection verifier for the Cranelift production Wasm-to-native compiler. Arrival enables end-to-end, high-assurance verification while reducing developer effort. Arrival (1) automatically reasons about chains of instruction-selection rules, thereby reducing the need for developer-supplied intermediate specifications, (2) introduces a lightweight, efficient method for reasoning about stateful instruction-selection rules, and (3) automatically derives high-assurance machine code specifications.
Our work verifies nearly all AArch64 instruction-selection rules reachable from Wasm core. Furthermore, Arrival reduces the developer effort required: 60% of all specifications benefit from our automation, thereby requiring 2.6X fewer hand-written specifications than prior approaches. Arrival finds new bugs in Cranelift's instruction selection, and it is viable for integration into production workflows.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Contract System Metatheories à la Carte: A Transition-System View of Contracts
Shu-Hung You, Christos Dimoulas, and Robert Bruce Findler
(Northwestern University, USA)
Over the last 15 years, researchers have studied a wide variety of important aspects of contract systems, ranging from internal consistency (complete monitoring and correct blame) to subtle details about the semantics of contract combinators (dependency) to the difficulty of efficient checking (avoiding asymptotically bad redundant checking). Although each paper offers essential insights about contract systems, the papers also differ in inessential ways, making it hard to know how their metatheories combine. Even worse, the metatheories share tremendous tedium in their definitions and proofs, occupying researchers' time with no benefit.
In this paper, we present the idea that higher-order contract systems can be viewed as transition systems and show that this perspective offers an important opportunity for reuse in their metatheories. We demonstrate the value of this perspective by proving representative properties from the literature, and by contributing a new proof establishing that elimination of redundant contract checks can eliminate asymptotic slowdowns. To confirm our claims and encourage the adoption of our ideas, we provide a mechanized development in Agda.
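As background for the kind of system such a metatheory ranges over, here is a minimal higher-order contract monitor with blame: flat contracts check immediately, while function contracts wrap the value and defer checking of arguments and results to each call. This is a toy sketch, not the paper's transition-system formulation, and the sqrt example is hypothetical.

    # Minimal higher-order contract monitor with blame (a toy illustration).
    class ContractViolation(Exception):
        pass

    def flat(pred):
        def attach(value, server, client):
            if not pred(value):
                raise ContractViolation(f"blame {server}: {value!r}")
            return value
        return attach

    def arrow(dom, cod):
        def attach(f, server, client):
            def wrapped(x):
                checked = dom(x, client, server)          # bad argument: blame the client
                return cod(f(checked), server, client)    # bad result: blame the server
            return wrapped
        return attach

    non_negative = flat(lambda v: isinstance(v, (int, float)) and v >= 0)
    spec = arrow(non_negative, non_negative)
    good = spec(lambda x: x ** 0.5, server="sqrt", client="main")
    assert good(4.0) == 2.0
    bad = spec(lambda x: -1.0, server="sqrt", client="main")
    try:
        bad(4.0)
        assert False, "expected a contract violation"
    except ContractViolation as violation:
        assert "sqrt" in str(violation)                   # the server is blamed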

Publisher's Version Info

Corrections

Corrigendum: PAFL: Enhancing Fault Localizers by Leveraging Project-Specific Fault Patterns
Donguk Kim, Minseok Jeon, Doha Hwang, and Hakjoo Oh
(Korea University, Republic of Korea; Samsung Electronics, Republic of Korea)
This is a corrigendum for the article “PAFL: Enhancing Fault Localizers by Leveraging Project-Specific Fault Patterns” by Donguk Kim, Minseok Jeon, Doha Hwang, and Hakjoo Oh, published in Proc. ACM Program. Lang. 9, OOPSLA1, Article 129 (April 2025), https://doi.org/10.1145/3720526. The designation of Minseok Jeon, Korea University, and Hakjoo Oh, Korea University, as co-corresponding authors was erroneously left off the article. The correct corresponding authors for this article are Minseok Jeon, Korea University, and Hakjoo Oh, Korea University.

Publisher's Version
