PAgE 2026 – Proceedings

Welcome from the Chairs
Welcome to the 1st Workshop on Principles of Agentic Engineering (PAgE 2026), held on June 15, 2026, in Boulder, Colorado, United States, and co-located with PLDI 2026. PAgE brings together researchers and practitioners from programming languages, formal methods, software engineering, and artificial intelligence to develop principled foundations for safe and reliable AI agents. The workshop focuses on techniques for specifying, testing, verifying, monitoring, debugging, and repairing agentic systems that plan, call tools, maintain state, and act over real-world services and data.

Article: pldiws26pageforeword-fm001-p doi:

Towards Verified Code Reasoning by LLMs
Meghana Sistla, Gogul Balakrishnan, Pat Rondon, José P. Cambronero, Michele Tufano, and Satish Chandra
(University of Texas at Austin, USA; Google DeepMind, USA; Google, USA; Meta Platforms, USA)
While LLM-based agents are able to tackle a wide variety of code reasoning questions, the answers are not always correct. As a result of this lack of trustworthiness, the agent's answers need to be manually verified before they can be trusted, which requires substantial effort and can result in lower developer productivity. We describe a method to automatically validate the answers provided by a code reasoning agent by extracting a formal representation of the agent's response, and using formal verification and program analysis tools to verify the reasoning steps. We applied this approach to a set of 20 program equivalence queries and found that the formal verification step successfully caught 6/8 incorrect agent judgments. This work lays the foundation for what may become a new class of high-precision, verifiable code agents, paving the way for their reliable use in critical software engineering workflows.

Publisher's Version Article: pldiws26pagemain-p4-p doi:10.1145/3819802.3820578

Testing, Credible Compilation, and Verification in the Axon Verified Compiler in Lean and Claude Code
Martin C. Rinard
(National University of Singapore, Singapore; Massachusetts Institute of Technology, USA)
This paper presents the use of testing, credible compilation/translation validation, verification, and audits in the Axon compiler. Axon comes with fully machine checked proofs that guarantee the correctness of the generated code. All code and proofs were written in Lean by Claude Code, with the correctness proofs eliminating any need to audit or examine any verified code. I present a development process for using these validation techniques, evaluate the use of this process during the development of the compiler, and discuss implications for other development efforts.

Publisher's Version Article: pldiws26pagemain-p5-p doi:10.1145/3819802.3820579

The Next Frontier for AI-Generated Kernels: Correctness
Guido Martínez and Tyler Sorensen
(Microsoft Research, USA)
KernelBench was recently proposed as a GPU-programming challenge set for LLMs, measuring their ability to generate correct and efficient GPU kernels for a variety of problems. With the advances of LLMs, most frontier models can now pass most of KernelBench with ease, and the performance (and corresponding complexity) of the generated kernels is increasing. In contrast, the correctness aspect of these kernels has gotten relatively little attention. KernelBench uses a fixed test suite to accept a kernel as valid. Is this perhaps enough, given the usual regularity of GPU kernels? Do buggy kernels make it through this test suite? We claim that testing is hopelessly inadequate for such tasks, finding serious problems in KernelBench’s test suite, with correct kernels being rejected and buggy kernels be- ing accepted. While some of these issues are easily solvable, others are fundamental and unavoidable, particularly if one considers an adversarial LLM. We argue that AI-generated kernel solutions should come with formal proofs of correct- ness, and show that this is now practical. Using Kuiper, a recent verified GPU programming framework, we present verified implementations for all 100 KernelBench Level 1 tasks, written almost entirely by LLM coding agents

Publisher's Version Article: pldiws26pagemain-p6-p doi:10.1145/3819802.3820580

Testing LLM-Generated Distributed Protocol Code
Brendan Coyne and Ankush Das
(Boston University, USA)
Large Language Models (LLMs) are increasingly used for code generation, including complex distributed systems. However, their reliability in stateful and failure-prone environments remains unclear. In this paper, we present a structured evaluation framework for assessing LLM-generated implementations of distributed protocols, including Two-Phase Commit, Ring Election, and Raft. We introduce a simulation environment that injects realistic network faults such as message loss, delay, and duplication. Our results show that while LLMs perform well on simpler protocols and code completion tasks, they struggle with complex consensus algorithms and exhibit inconsistent debugging behavior. We further analyze the relationship between model size and performance, finding that parameter count alone does not predict effectiveness. These findings highlight the need for systematic testing and verification in agentic code generation systems.

Publisher's Version Article: pldiws26pagemain-p11-p doi:10.1145/3819802.3820581

Lazy Validation and Self-Healing for Agentic Programs
Theodoros Tsampouris, Eleftherios Ioannidis, and Andreas Symeonidis
(Aristotle University of Thessaloniki, Greece; Microsoft Research, USA)
Effective use of LLM CLIs requires careful use of plan mode—a sequence of natural-language steps to be executed by the agent. Our central observation is that such plans are programs, which we call agentic workflows. If plans are programs, then program safety techniques such as type systems, static analysis, and authorization semantics should apply directly. A deeper observation is that the LLM is not merely a producer of values, but also a generator of program structure. Tool calls can produce types that govern subsequent data, workflows that are executed as subroutines, and policies that restrict future operations. We present Pact (Programmatic Agent Control Types), a typed imperative domain-specific language in which human interaction, LLM tool invocation, and LLM reasoning are first-class typed expressions. Pact introduces a novel safety mechanism called lazy validation, together with a recovery mechanism called self-healing, under which types, workflows, and policies produced by an LLM are validated at the point of generation against the surrounding program context. As a result, we obtain a programming language for agentic workflows with valuable safety properties formally verified in the Lean 4 theorem prover, as well as practical benefits: self-healing enables long-running tasks, reduces output token consumption by nearly threefold, and halves wall-clock time compared to assertion-based baselines.

Publisher's Version Article: pldiws26pagemain-p14-p doi:10.1145/3819802.3820582

PAgE 2026 – Proceedings

2026 ACM SIGPLAN International Workshop on Principles of Agentic Engineering (PAgE 2026)