AIware 2026 – Author Index |
| Abreu, Rui |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-K units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
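The aggregation step described in the abstract above (token-level attention rolled up into code units, then the top-K units shown to reviewers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `top_k_units`, the unit labels, and the choice of summing attention within a unit are all assumptions, since the abstract does not state how weights are combined.

```python
from collections import defaultdict


def top_k_units(token_attention, token_to_unit, k=2):
    """Return the k code units with the highest total attention.

    token_attention: one attention weight per token (hypothetical input).
    token_to_unit:   the unit (line, hunk, or file) each token belongs to.
    Assumption: weights are summed per unit; the paper may use another rule.
    """
    totals = defaultdict(float)
    for weight, unit in zip(token_attention, token_to_unit):
        totals[unit] += weight
    # Rank units by aggregated attention, highest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [unit for unit, _ in ranked[:k]]


# Hypothetical example: five tokens spread over three hunks.
attention = [0.10, 0.40, 0.05, 0.30, 0.15]
units = ["hunk_a", "hunk_b", "hunk_a", "hunk_c", "hunk_b"]
print(top_k_units(attention, units, k=2))
```

Summing favors longer units; averaging per token would be a reasonable alternative if unit length should not influence the ranking.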
|
| Abubakar, Muhammad Auwal |
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. 
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Ahmed, Md Basim Uddin |
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang (York University, Canada; Queen's University, Canada) Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks. Most existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a context-aware benchmark designed to evaluate LLMs on vulnerability detection with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches provide modest but promising improvements. The best-performing Claude-3.7-Sonnet-driven agent achieves an F1-score of 23.83% for vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware vulnerability detection with LLMs. |
|
| Aleithan, Reem |
Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang (York University, Canada) SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. However, despite its broad adoption, a systematic evaluation of the quality of the benchmark remains lacking. In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents in the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: the SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and the TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents. |
|
| Alfonso, Iván |
Nadia Daoudi, Iván Alfonso, and Jordi Cabot (Luxembourg Institute of Science and Technology, Luxembourg; University of Luxembourg, Luxembourg) The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate NN code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Our artefacts are available online. |
|
| Ali, Shaukat |
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan) Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)–based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide binary pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Our results show that Gemini achieves higher recall and GPT produces more precise predictions, while both models exhibit low uncertainty. |
|
| Al-Kaswan, Ali |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
|
| Almukhtar, Mohamed |
Mohamed Almukhtar, Anwar Ghammam, and Hua Ming (University of Michigan at Flint, USA; University of Michigan at Dearborn, USA) As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code-quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues—predominantly convention-level violations such as long lines—while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows. |
|
| ALMutairi, Mariam |
Mariam ALMutairi and Chang-Tien Lu (Virginia Polytechnic Institute, USA) Existing LLM security benchmarks evaluate code generation quality, leaving an open question: can LLMs generate tests that detect vulnerabilities? We address this with two technical contributions. First, we propose the Security Mutation Score (SMS), a metric that classifies mutant kills into semantic, functional, incidental, and crash categories using operator-aware heuristics, distinguishing genuine security awareness from coincidental detection. We further define Effective SMS (EffSMS = SMS × Secure-Pass Rate) to account for test validity. Second, we design 25 security-specific mutation operators spanning 30 CWE categories that transform secure Python code into realistic vulnerable variants, extending prior security mutation frameworks to Python and introducing 22 new operators. Evaluating eight LLMs and two static analysis baselines on 339 programs and 1,869 mutants reveals three findings: (i) traditional mutation scores overstate LLM security testing capability by 2.2× on average; (ii) the best LLM achieves only 19.7% EffSMS vs. 47.6% for expert-written tests—a 2.4× gap raw metrics obscure; and (iii) functional kills, not crashes, dominate non-semantic failures (15–36%), showing LLMs detect behavioral side-effects rather than security properties. Static analysis and mutation testing provide complementary coverage across syntactic vs. logic-flaw CWEs. Code and data are publicly available. |
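The Effective SMS formula quoted in the abstract above (EffSMS = SMS × Secure-Pass Rate) can be worked through in a few lines. This is a sketch under stated assumptions: the abstract does not define how SMS itself is computed, so the code assumes that only kills labeled "semantic" count toward SMS; the function names and the sample numbers are hypothetical.

```python
from collections import Counter


def security_mutation_score(kill_labels, total_mutants):
    """Fraction of mutants killed for a security-relevant reason.

    Assumption: only kills labeled "semantic" count; "functional",
    "incidental", and "crash" kills are excluded from the score.
    """
    if total_mutants == 0:
        return 0.0
    return Counter(kill_labels)["semantic"] / total_mutants


def effective_sms(sms, secure_pass_rate):
    """EffSMS = SMS x Secure-Pass Rate, as stated in the abstract."""
    return sms * secure_pass_rate


# Hypothetical run: 10 mutants, 6 killed, of which 3 are semantic kills.
labels = ["semantic"] * 3 + ["functional"] * 2 + ["crash"]
sms = security_mutation_score(labels, total_mutants=10)
print(round(sms, 2), round(effective_sms(sms, secure_pass_rate=0.8), 2))
```

In this made-up run a traditional mutation score would report 6/10 kills, while the semantic-only score is 3/10 before the validity discount, which illustrates how raw kill counts can overstate security awareness.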
|
| Arcaini, Paolo |
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan) Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)–based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide binary pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Our results show that Gemini achieves higher recall and GPT produces more precise predictions, while both models exhibit low uncertainty. |
|
| Arrieta, Aitor |
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan) Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)–based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide binary pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Our results show that Gemini achieves higher recall and GPT produces more precise predictions, while both models exhibit low uncertainty. |
|
| Azizian, Artin |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, and Elham Rastegari (Bellevue University, USA; McGill University, Canada; Creighton University, USA) Object-relational mapping (ORM) frameworks simplify persistence, yet routine mapping choices (including inheritance strategy, association encoding, and denormalization) can substantially alter the generated SQL and the trade-offs among query latency, insert/update cost, and storage footprint. Existing optimization approaches either guarantee semantic validity but incur expensive per-candidate deployment and benchmarking, or learn from schema structure alone and therefore miss the fact that workload behavior is mapping-dependent. We present TriORM, a workload-aware neural symbolic framework for multi-objective ORM mapping design that removes per-candidate workload execution from the online recommendation loop while preserving validity by construction. TriORM (i) enumerates admissible mappings via bounded relational synthesis in Alloy, (ii) concretizes an abstract workload into schema-specific SQL templates for each candidate, and (iii) predicts continuous objectives using a tri-input model that fuses a typed schema-graph encoder, a concretized-workload encoder, and interpretable static cost proxies. The resulting predictions enable Pareto filtering and user-weighted selection without deploying or executing each candidate at inference time; profiling is performed only offline to obtain supervision. On nine TradeMaker/Leant benchmark models, TriORM improves mean Pareto-front approximation over Leant (GD/HV 0.03/0.78 vs. 0.07/0.61) and reduces end-to-end recommendation time (3.6×10^3 s vs. 2.6×10^4 s), while preserving semantic correctness within the chosen synthesis bounds. 
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
|
| Azizian, Sasan |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, and Elham Rastegari (Bellevue University, USA; McGill University, Canada; Creighton University, USA) Object-relational mapping (ORM) frameworks simplify persistence, yet routine mapping choices (including inheritance strategy, association encoding, and denormalization) can substantially alter the generated SQL and the trade-offs among query latency, insert/update cost, and storage footprint. Existing optimization approaches either guarantee semantic validity but incur expensive per-candidate deployment and benchmarking, or learn from schema structure alone and therefore miss the fact that workload behavior is mapping-dependent. We present TriORM, a workload-aware neural symbolic framework for multi-objective ORM mapping design that removes per-candidate workload execution from the online recommendation loop while preserving validity by construction. TriORM (i) enumerates admissible mappings via bounded relational synthesis in Alloy, (ii) concretizes an abstract workload into schema-specific SQL templates for each candidate, and (iii) predicts continuous objectives using a tri-input model that fuses a typed schema-graph encoder, a concretized-workload encoder, and interpretable static cost proxies. The resulting predictions enable Pareto filtering and user-weighted selection without deploying or executing each candidate at inference time; profiling is performed only offline to obtain supervision. On nine TradeMaker/Leant benchmark models, TriORM improves mean Pareto-front approximation over Leant (GD/HV 0.03/0.78 vs. 0.07/0.61) and reduces end-to-end recommendation time (3.6×10^3 s vs. 2.6×10^4 s), while preserving semantic correctness within the chosen synthesis bounds. 
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
|
| Babu, Md. Abu Ahammed |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Bagheri, Hamid |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
|
| Baltes, Sebastian |
Christoph Treude, Sebastian Baltes, and Marc Cheong (Singapore Management University, Singapore; Ruprecht-Karls-Universität Heidelberg, Germany; University of Melbourne, Australia) As AI coding agents become embedded in software development workflows, developers are beginning to operationalize ethical principles by encoding behavioral rules into repository-level context files for AI agents, such as AGENTS.md files. Rather than examining the ethics of AI agents in the abstract, this vision paper investigates how ethics and values are already being translated for AI agents into actionable instructions that shape agent behavior. Through a preliminary investigation, we find that developers are already embedding guidance related to fairness, accessibility, sustainability, tone, and privacy. These artifacts function as a developer-authored governance layer, translating abstract principles into situated, natural-language directives within development workflows. We outline a research agenda for studying this emerging practice, including how encoded values vary across communities, what governance dynamics emerge when multiple contributors negotiate these files, and whether agents reliably adhere to the constraints specified. Understanding how ethics and values are operationalized for AI agents is essential to ground AI governance in modern software engineering practice. 
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. 
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Balusu, Krishna Chaitanya |
Krishna Chaitanya Balusu (Facebook, USA) LLM-based autonomous agents fail in ways that existing observability infrastructure cannot detect. OpenTelemetry’s GenAI semantic conventions cover LLM invocation and tool execution but leave five critical agent orchestration phases—planning, reasoning, safety monitoring, inter-agent delegation, and memory management—without span-level representation. We present AgentTelemetry, an open-source benchmark suite and toolkit for evaluating fault detection in agent systems. The benchmark defines (1) a taxonomy of 14 fault types mapped to 9 agent-specific span kinds, (2) a controlled evaluation harness of 490 fault-detection cells (14 faults × 5 observability conditions × 7 frameworks; enumerated as 2,940 raw configurations across 6 mock-LLM seeds), and (3) a pip-installable library (3,700+ LOC, 78 tests) with adapters for seven frameworks. On the controlled benchmark, the full span taxonomy achieves a Fault Detection Rate (FDR) of 1.000—an upper bound confirming structural completeness—compared to 0.429 for vanilla OpenTelemetry and OTel+GenAI. An ablation study proves all nine span kinds are necessary: removing any one makes at least one fault type undetectable. A case study on 112 SWE-bench Lite instances reveals that 84/112 agent runs (75%) exhausted the 8-iteration limit and are classified as reasoning loops by structural pattern (a definitional partition of the failed-trace population, not a sampling estimate)—a failure mode invisible to vanilla OTel—and a telemetry-guided intervention improves the patch rate by +12.5 pp over a matched control (Fisher’s exact p=0.53, two-sided; demonstrative not statistically significant at n=24). All code, data, and benchmark configurations are open-source for reproducibility. |
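The size of the evaluation grid described above can be checked with a short sketch. The counts (14 faults, 5 observability conditions, 7 frameworks, 6 seeds) come from the abstract; the placeholder labels and the `fault_detection_rate` helper are hypothetical and are not part of the AgentTelemetry library. Defining the rate as the fraction of fault types detected is likewise an assumption about how the metric is computed.

```python
from itertools import product

# Counts taken from the abstract; the labels themselves are placeholders.
faults = [f"fault_{i}" for i in range(14)]
conditions = [f"condition_{i}" for i in range(5)]
frameworks = [f"framework_{i}" for i in range(7)]
seeds = list(range(6))

# One fault-detection cell per (fault, condition, framework) combination.
cells = list(product(faults, conditions, frameworks))

# Each cell is repeated once per mock-LLM seed to give the raw configurations.
raw_configs = list(product(cells, seeds))


def fault_detection_rate(detected_by_fault):
    """Fraction of fault types detected (assumed definition of FDR)."""
    return sum(detected_by_fault.values()) / len(detected_by_fault)


# Hypothetical illustration: if only 6 of the 14 fault types were detectable,
# the rate would round to 0.429.
example = {fault: (i < 6) for i, fault in enumerate(faults)}
example_fdr = round(fault_detection_rate(example), 3)

print(len(cells), len(raw_configs), example_fdr)
```

Under this assumed definition, a rate of 0.429 would correspond to 6 of 14 fault types being detected; the abstract does not state the breakdown, so that reading is only illustrative.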
|
| Barnes, Marcus Emmanuel |
Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan (University of Toronto, Canada; Trent University, Canada) AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human–agent coordination. |
|
| Barr, Earl T. |
Claudio Spiess, Prem Devanbu, and Earl T. Barr (University of California at Davis, USA; University College London, UK) LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits surprising instruction-sensitive brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and reinforce and extend the value of using perturbation to evaluate code models. |
|
| Basu, Srijita |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. 
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Becker, Norman |
Norman Becker, Tural Mammadov, and Andreas Zeller (CISPA Helmholtz Center for Information Security, Germany) In the past years, large language models (LLMs) have demonstrated remarkable progress in code generation. However, their ability to reason about program behavior remains an open challenge—an ability that is relevant for applications including reverse engineering, debugging, secure code generation, test-driven synthesis, input reconstruction, reverse fuzzing, behavioral monitoring, and safe execution modeling. To study this ability, we examine the capacity of LLMs to reason about the semantics of code—specifically, their ability to relate code, its inputs, and its outputs to each other. To this end, we investigate whether and how well LLMs can predict one of these three components given the other two—that is, (1) predict the input given code and output, (2) predict the output given code and input, and (3) predict the code given input and output. This way, we assess how well LLMs can reason about and understand the underlying relationships that govern program execution. We construct four datasets covering string processing, array operations, and coding challenges in JavaScript and Python to evaluate diverse program-understanding capabilities, incorporating various code mutation techniques to increase complexity. In our evaluation on tasks covering string processing, array operations, and coding challenges, we find that closed-weight models achieve the strongest performance across all datasets, including perfect input recovery on deterministic string tasks. Across tasks, output prediction is comparatively stable, whereas code prediction remains the hardest setting and often fails for smaller models. Finally, cross-codebase transfer is feasible, especially for input prediction, but highly sensitive to model capacity and fine-tuning strategy. |
|
| Böhme, Levi |
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Businge, John |
Daniel Ogenrwot and John Businge (University of Nevada at Las Vegas, USA) Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code, and supplementary materials are available on Zenodo: 10.5281/zenodo.19396916 |
|
| Cabot, Jordi |
Nadia Daoudi, Iván Alfonso, and Jordi Cabot (Luxembourg Institute of Science and Technology, Luxembourg; University of Luxembourg, Luxembourg) The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate NN code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Our artefacts are available online. |
|
| Carmichael, Zachariah J. |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-K units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
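The aggregation step described above, rolling token-level attention up into code units and surfacing the top-K, might look roughly like the following. This is a minimal sketch, not the authors' implementation: the function name `top_k_units`, the toy weights, and the choice to sum attention per unit (rather than take a mean or maximum) are all assumptions.

```python
from collections import defaultdict


def top_k_units(token_attention, token_to_unit, k):
    """Sum per-token attention within each unit and return the k highest units.

    token_attention: one attention weight per token in the diff.
    token_to_unit:   the unit (line, hunk, or file id) each token belongs to.
    Summation is an assumed aggregation rule, chosen only for illustration.
    """
    totals = defaultdict(float)
    for weight, unit in zip(token_attention, token_to_unit):
        totals[unit] += weight
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [unit for unit, _ in ranked[:k]]


# Toy diff with six tokens spread across three hunks (hypothetical values).
attention = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
hunks = ["h1", "h1", "h2", "h3", "h3", "h2"]

highlighted = top_k_units(attention, hunks, k=2)
print(highlighted)
```

The same function works for lines or files by changing what `token_to_unit` maps to, which is one reason a single aggregation routine is a convenient way to think about the three granularities.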
|
| Chandra, Satish |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Chen, Yeheng |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Cheng, Xiaoyu |
Xiaoyu Cheng, Kundi Yao, Pengyu Nie, and Weiyi Shang (University of Waterloo, Canada; Ontario Tech University, Canada) Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in recent open-weight LLMs in code generation using a two-stage memorization evaluation pipeline, which combines a similarity-based extractability filter with a targeted data extraction attack. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42-64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by a complex interplay of training domain, dataset composition, architectural choices, and content characteristics than by parameter count alone. |
|
| Cheong, Marc |
Christoph Treude, Sebastian Baltes, and Marc Cheong (Singapore Management University, Singapore; Ruprecht-Karls-Universität Heidelberg, Germany; University of Melbourne, Australia) As AI coding agents become embedded in software development workflows, developers are beginning to operationalize ethical principles by encoding behavioral rules into repository-level context files for AI agents, such as AGENTS.md files. Rather than examining the ethics of AI agents in the abstract, this vision paper investigates how ethics and values are already being translated for AI agents into actionable instructions that shape agent behavior. Through a preliminary investigation, we find that developers are already embedding guidance related to fairness, accessibility, sustainability, tone, and privacy. These artifacts function as a developer-authored governance layer, translating abstract principles into situated, natural-language directives within development workflows. We outline a research agenda for studying this emerging practice, including how encoded values vary across communities, what governance dynamics emerge when multiple contributors negotiate these files, and whether agents reliably adhere to the constraints specified. Understanding how ethics and values are operationalized for AI agents is essential to ground AI governance in modern software engineering practice. |
|
| Chouchen, Moataz |
Mohammad Hamdaqa and Moataz Chouchen (Polytechnique Montréal, Canada; Université de Montréal, Canada; Concordia University, Canada) AI-driven software systems are increasingly developed through rapid, iterative practices that combine large language models, prompt engineering, and ad-hoc integration of external tools and services; a style often described as vibe coding. While these practices enable fast experimentation and deployment, they challenge the basic principles of software engineering. Documentation is informal and quickly outdated, requirements are often implicit, and code review and testing are applied to artifacts that only partially determine system behavior. As a result, critical questions about whether a system behaved as intended, permitted, or prohibited cannot be reliably answered after deployment. This paper presents a vision for spec-driven, contract-based AIWare systems in which specifications function as explicit communicative commitments defining required, permitted, and forbidden behavior. We argue that auditability cannot be achieved through code review alone, and instead requires specifications that are enforceable across continuous integration and deployment (CI/CD) pipelines, runtime execution, and post-hoc audit. We introduce a contract-driven framework structured around specification, execution, and audit planes, extend it with spec-driven CI/CD integration, and illustrate the approach through walkthrough examples. Our vision reframes auditability as a first-class system property and specifications as the authoritative source of correctness in AIWare systems. |
|
| Chung, Young Jo |
Young Jo Chung and Safwat Hassan (University of Toronto, Canada) |
|
| Cui, Juan |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
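Pareto-aware selection over predicted objectives can be illustrated with a generic non-dominated filter. This is a sketch of the general technique, not Y-Map's code: the schema names and predicted values are invented, and treating all three objectives (insert latency, query latency, memory footprint) as quantities to minimize is an assumption.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical predictions: (insert latency, query latency, memory footprint).
predicted = {
    "schema_a": (1.0, 5.0, 10.0),
    "schema_b": (2.0, 4.0, 10.0),
    "schema_c": (2.0, 6.0, 12.0),
}

front = pareto_front(predicted)
print(sorted(front))
```

Here `schema_c` is dropped because `schema_a` is at least as good on every objective, while `schema_a` and `schema_b` both survive because each wins on a different latency measure.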
|
| Daoudi, Nadia |
Nadia Daoudi, Iván Alfonso, and Jordi Cabot (Luxembourg Institute of Science and Technology, Luxembourg; University of Luxembourg, Luxembourg) The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate NN code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Our artefacts are available online. |
|
| Dedeler, Mehmet |
Mustafa Özkan İr, Mehmet Dedeler, Anıl Koyuncu, and Eray Tüzün (Bilkent University, Türkiye) Verifying bug fixes before patches are released to end users is a critical step in the software development lifecycle. However, this process is often manual, repetitive, and error-prone, especially for crash bugs triggered through Graphical User Interface (GUI) interactions in desktop applications. Despite recent advancements in LLM-driven software agents, existing work primarily targets bug reproduction without addressing bug fix verification, while approaches that do focus on verification rely on source code access, making them inapplicable to closed-source GUI-based desktop applications. This paper introduces Fixpad++, a framework designed to automatically verify bug fixes in the Notepad++ desktop application using LLM-powered agents. Fixpad++ employs a two-phase approach: first, a multi-modal multi-agent system interacts with the buggy version to reproduce the reported crash bug using visual parsing and LLM reasoning. Second, upon successful reproduction, a trajectory replay mechanism executes the recorded action sequence on the patched version to verify the fix. We evaluated Fixpad++ on FixPad-Bench, a new dataset of 105 evaluation instances derived from 22 real-world Notepad++ crash bugs, including correct and incorrect patches. The system achieved a reproduction success rate of 72.73% with an average time of 174.07 seconds. Among the successfully reproduced cases, Fixpad++ successfully verified correct fixes with 87.50% accuracy and detected incorrect fixes with 77.05% accuracy, outperforming OpenAI’s Computer-Using Agent (CUA). Fixpad++ demonstrates the effectiveness of specialized LLM agent architectures for automated bug fix verification in GUI-based desktop applications, offering a practical solution for automating verification workflows without requiring access to source code. |
|
| Devanbu, Prem |
Claudio Spiess, Prem Devanbu, and Earl T. Barr (University of California at Davis, USA; University College London, UK) LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits surprising instruction-sensitive brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and reinforce and extend the value of using perturbation to evaluate code models. |
|
| Dietrich, Jens |
Elliott Wen, Chenye Ni, Valerio Terragni, and Jens Dietrich (University of Auckland, New Zealand; Massey University, New Zealand) Reproducible independent rebuilds strengthen software supply-chain integrity by recreating the original build environment and enforcing bitwise equivalence between artifacts. However, this approach implicitly assumes a trustworthy toolchain and can fail under adversarial manipulation of the build process itself (e.g., the Ken Thompson attack). Prior work has explored introducing diversity across build environments to reduce reliance on any single toolchain, and has proposed AI-driven methods to establish behavioural equivalence while tolerating benign build variability in the Java ecosystem. In this work, we extend this line of research to Rust and present RustBuildEq, a benchmark for training and evaluating binary equivalence classifier models under realistic build variability. We curate a large corpus of crates drawn from the top 20% of the crates.io ecosystem and construct datasets of equivalent (EQ) and non-equivalent (NEQ) pairs with rich provenance metadata. EQ pairs are generated from identical source revisions under varying toolchain versions and build configurations, while NEQ pairs are derived from AST rewrites or API-breaking changes across versions. Many Rust crates rely heavily on generics and cannot be compiled into binaries without specifying concrete types; to address this, we develop an automated approach that combines heuristic type instantiation, witness-type synthesis, and an iterative AI repair loop. RustBuildEq comprises 19,184,671 EQ records and 273,848,531 NEQ records, and includes a Python API for dataset navigation. The dataset provides large-scale ground truth for training and evaluating AI-driven models for reasoning about binary equivalence and is publicly available at https://doi.org/10.5281/zenodo.19244908. |
|
| Dong, Tao |
Tao Dong, Sherry Shi, Harini Sampath, and Andrew Macvean (Google, USA) The ongoing transition of Large Language Models (LLMs) in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive software engineering (SWE) agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises. |
|
| Dwyer, Matthew B. |
Tasfia Tasnim, Matthew B. Dwyer, and Soneya Binta Hossain (University of Texas at Dallas, USA; University of Virginia at Charlottesville, USA) Test oracles determine whether a program execution is correct for a given input. Two common forms are assertion oracles, which compare observed outputs with expected results, and exception oracles, which verify that a program raises an expected exception. Automated test oracle generation (TOG) aims to reduce the manual effort involved in constructing such oracles. Although recent TOG methods, especially LLM-based approaches, have made rapid progress, their evaluation remains constrained by benchmarks that rely on automatically generated tests, narrow single-assert formulations, simplified developer-written tests, or limited oracle diversity. To address these limitations, we introduce OE25𝑑𝑒𝑣, a multi-variant dataset curated from developer-written unit tests across 25 open-source Java projects spanning 56 modules, and TOGBench, an end-to-end benchmark suite for TOG. OE25𝑑𝑒𝑣 captures six oracle categories and preserves realistic settings, including single- and multi-oracle configurations, mixed assertion-and-exception oracles, and developer-authored custom oracles. TOGBench supports end-to-end experimentation by reintegrating generated oracles into runnable test suites and evaluating them via compilation, execution, false-positive analysis, and mutation testing. Our evaluation further shows that OE25𝑑𝑒𝑣 preserves substantially greater structural complexity than prior benchmarks and exposes marked performance degradation of representative TOG models on developer-written tests, particularly for assertion oracles. |
|
| El Mezouar, Mariam |
Karla Gonzalez and Mariam El Mezouar (Royal Military College of Canada, Canada) Foundation models, particularly large language models, are increasingly embedded as core components of software systems. This shift has given rise to a growing body of research on testing such systems, referred to in this paper as AIware systems. While prior work proposes numerous techniques to expose undesirable behaviors, it remains unclear how these approaches align with established software testing practices and support the software lifecycle. This survey analyzes the AIware testing literature through the lens of classical software engineering concepts. We examine testing levels, oracle strategies, automation readiness, and diagnostic support, and assess how existing approaches map to lifecycle activities such as integration testing, regression testing, and CI-integrated workflows. Our results show that the literature is strongly concentrated on system-level, pre-release evaluation, with limited operational support for integration, regression, and deployment-time testing. We further show that many of these gaps stem from fundamental challenges in oracle design, including non-determinism, underspecified correctness, and limited diagnosability. Without stable and automatable decision criteria, AIware testing techniques remain difficult to integrate into continuous development and maintenance pipelines. Overall, this survey provides a structured characterization of the current state of AIware testing research and identifies key structural challenges that must be addressed to support lifecycle-aware, reliable AIware systems. |
|
| Enan, Nafid |
Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang (York University, Canada) SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. However, despite its broad adoption, a systematic evaluation of the quality of the benchmark remains lacking. In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents in the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: the SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and the TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents. |
|
| Farooque, Mahfuza |
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, and Mahfuza Farooque (Pennsylvania State University, USA) VeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) used as a high-precision acceptance gate, and canonical PL→CNF compilation, all executed via fixed API configuration (temperature = 0; fine-tuning runs use seed = 42) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On SatBench (2100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100-150 curated examples improves fidelity by about 1-1.5 pp without increasing latency (mean 25.8 s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability-coverage knob: at τ=75, roughly 68% of items are retained with ~94% correctness on the accepted set. Validator overhead contributes < 15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL→logic front-ends into auditable, reproducible components for reliability-critical workflows. |
|
| Fotrousi, Farnaz |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. 
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Galster, Matthias |
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. 
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Ge, Jun |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Ghaleb, Taher A. |
Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan (University of Toronto, Canada; Trent University, Canada) AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human–agent coordination. |
|
| Ghammam, Anwar |
Mohamed Almukhtar, Anwar Ghammam, and Hua Ming (University of Michigan at Flint, USA; University of Michigan at Dearborn, USA) As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code-quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues—predominantly convention-level violations such as long lines—while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows. |
|
| Ghorab, Mostafa Anouar |
Mostafa Anouar Ghorab, Ahmad Abdel Latif, and Mohamed Aymen Saied (Université Laval, Canada; University of Calgary, Canada) Kubernetes has become a central platform for orchestrating cloud-native applications, yet its declarative configuration model frequently introduces security misconfigurations that threaten system reliability and operational stability. Although automated detection tools are widely available, a systematic understanding of misconfiguration patterns and scalable correction mechanisms remains limited. This paper presents a comprehensive empirical study of Kubernetes security misconfigurations based on 2,662 developer-reported issues from Stack Overflow. From this dataset, we derive a structured taxonomy that captures recurring security weaknesses across configuration object types and misconfiguration categories. Using this taxonomy, we analyze how severity levels vary across objects and categories, and examine how security misconfigurations evolve between incubator and stable project stages. Our findings reveal that while some operational issues decrease as projects mature, critical security misconfigurations often persist or reappear, highlighting enduring risk patterns in cloud-native systems. Building on this empirical foundation, we evaluate the effectiveness of Large Language Models (LLMs) in automatically correcting Kubernetes security misconfigurations under progressively enriched contextual conditions. Results demonstrate that contextual grounding significantly improves correction accuracy, with the best standalone model achieving 89.06%. To further enhance structural correctness and schema compliance, we introduce Kubecurity, a schema-guided validation framework that enforces compliance with official Kubernetes specifications. By combining contextual LLM reasoning with deterministic schema enforcement, the proposed hybrid approach achieves 98.50% correction accuracy while substantially reducing newly introduced misconfigurations. 
Overall, this work advances both the understanding and automated remediation of Kubernetes security misconfigurations. |
|
| Gonzalez, Karla |
Karla Gonzalez and Mariam El Mezouar (Royal Military College of Canada, Canada) Foundation models, particularly large language models, are increasingly embedded as core components of software systems. This shift has given rise to a growing body of research on testing such systems, referred to in this paper as AIware systems. While prior work proposes numerous techniques to expose undesirable behaviors, it remains unclear how these approaches align with established software testing practices and support the software lifecycle. This survey analyzes the AIware testing literature through the lens of classical software engineering concepts. We examine testing levels, oracle strategies, automation readiness, and diagnostic support, and assess how existing approaches map to lifecycle activities such as integration testing, regression testing, and CI-integrated workflows. Our results show that the literature is strongly concentrated on system-level, pre-release evaluation, with limited operational support for integration, regression, and deployment-time testing. We further show that many of these gaps stem from fundamental challenges in oracle design, including non-determinism, underspecified correctness, and limited diagnosability. Without stable and automatable decision criteria, AIware testing techniques remain difficult to integrate into continuous development and maintenance pipelines. Overall, this survey provides a structured characterization of the current state of AIware testing research and identifies key structural challenges that must be addressed to support lifecycle-aware, reliable AIware systems. |
|
| Gruppi, Mauricio |
Xue Qin and Mauricio Gruppi (Villanova University, USA) The rise of "vibe coding" has marginalized requirement analysis, allowing unverified "Make it Work" assumptions to accumulate into complex defects that evade standard testing. In the absence of a reliable ground truth oracle, we advocate a paradigm shift from correctness verification to logic consistency. We introduce Differential Logic Analysis (DLA), a framework that utilizes Logical Inference to detect internal contradictions across parallel and sequential development workflows. Preliminary simulations demonstrate that DLA successfully intercepts logic drift, such as contradictions and unspecified assumptions, and offers a new perspective on reliability assessment in the AI-assistant development era. |
|
| Gu, Xiaodong |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Hájek, Maxim |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
|
| Hamdaqa, Mohammad |
Mohammad Hamdaqa and Moataz Chouchen (Polytechnique Montréal, Canada; Université de Montréal, Canada; Concordia University, Canada) AI-driven software systems are increasingly developed through rapid, iterative practices that combine large language models, prompt engineering, and ad-hoc integration of external tools and services; a style often described as vibe coding. While these practices enable fast experimentation and deployment, they challenge the basic principles of software engineering. Documentation is informal and quickly outdated, requirements are often implicit, and code review and testing are applied to artifacts that only partially determine system behavior. As a result, critical questions about whether a system behaved as intended, permitted, or prohibited cannot be reliably answered after deployment. This paper presents a vision for spec-driven, contract-based AIWare systems in which specifications function as explicit communicative commitments defining required, permitted, and forbidden behavior. We argue that auditability cannot be achieved through code review alone, and instead requires specifications that are enforceable across continuous integration and deployment (CI/CD) pipelines, runtime execution, and post-hoc audit. We introduce a contract-driven framework structured around specification, execution, and audit planes, extend it with spec-driven CI/CD integration, and illustrate the approach through walkthrough examples. Our vision reframes auditability as a first-class system property and specifications as the authoritative source of correctness in AIWare systems. |
|
| Haraldsson, Bengt |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. 
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Harzevili, Nima Shiri |
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang (York University, Canada; Queen's University, Canada) Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks. Most existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a context-aware benchmark designed to evaluate LLMs on vulnerability detection with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches provide modest but promising improvements. The best-performing Claude-3.7-Sonnet-driven agent achieves an F1-score of 23.83% for vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware vulnerability detection with LLMs. |
|
| Hassan, Safwat |
Young Jo Chung and Safwat Hassan (University of Toronto, Canada) 
Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan (University of Toronto, Canada; Trent University, Canada) AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human–agent coordination. |
|
| Hazrati, Ayoub |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, and Elham Rastegari (Bellevue University, USA; McGill University, Canada; Creighton University, USA) Object-relational mapping (ORM) frameworks simplify persistence, yet routine mapping choices (including inheritance strategy, association encoding, and denormalization) can substantially alter the generated SQL and the trade-offs among query latency, insert/update cost, and storage footprint. Existing optimization approaches either guarantee semantic validity but incur expensive per-candidate deployment and benchmarking, or learn from schema structure alone and therefore miss the fact that workload behavior is mapping-dependent. We present TriORM, a workload-aware neural symbolic framework for multi-objective ORM mapping design that removes per-candidate workload execution from the online recommendation loop while preserving validity by construction. TriORM (i) enumerates admissible mappings via bounded relational synthesis in Alloy, (ii) concretizes an abstract workload into schema-specific SQL templates for each candidate, and (iii) predicts continuous objectives using a tri-input model that fuses a typed schema-graph encoder, a concretized-workload encoder, and interpretable static cost proxies. The resulting predictions enable Pareto filtering and user-weighted selection without deploying or executing each candidate at inference time; profiling is performed only offline to obtain supervision. On nine TradeMaker/Leant benchmark models, TriORM improves mean Pareto-front approximation over Leant (GD/HV 0.03/0.78 vs. 0.07/0.61) and reduces end-to-end recommendation time (3.6×10³ s vs. 2.6×10⁴ s), while preserving semantic correctness within the chosen synthesis bounds. 
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
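Both abstracts above rely on Pareto filtering over predicted objectives. As a minimal sketch of that general technique (not the authors' implementation), the snippet below keeps only the non-dominated candidates, assuming every objective is to be minimized. The candidate names and objective tuples are hypothetical placeholders.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the non-dominated candidates (all objectives minimized)."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical predictions: (query latency, insert latency, memory footprint).
predicted = {
    "mapping_a": (1.0, 5.0, 3.0),
    "mapping_b": (2.0, 2.0, 2.0),
    "mapping_c": (2.5, 5.5, 3.5),  # worse than mapping_a on every objective
}
front = pareto_front(predicted)
```

A user-weighted selection step could then pick one mapping from `front`, for example by minimizing a weighted sum of the objectives.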
|
| Hellendoorn, Vincent Josua |
Manisha Mukherjee and Vincent Josua Hellendoorn (Carnegie Mellon University, USA; Google, USA) Large Language Models (LLMs) are widely used for automated code generation. Their reliance on infrequently updated pretraining data can leave them unaware of newly discovered vulnerabilities and evolving security standards, making them prone to producing insecure code. In contrast, developer communities on Stack Overflow (SO) provide an ever-evolving repository of knowledge, where security vulnerabilities are actively discussed and addressed through collective expertise. These community-driven insights remain largely untapped by LLMs. This paper introduces SOSecure, a Retrieval-Augmented Generation (RAG) system that leverages the collective security expertise found in SO discussions to improve the security of LLM-generated code. We build a security-focused knowledge base by extracting SO answers and comments that explicitly identify vulnerabilities. Unlike common uses of RAG, SOSecure triggers after code has been generated to find discussions that identify flaws in similar code. These are used in a prompt to an LLM to consider revising the code. Evaluation across three datasets (SALLM dataset, LLMSecEval, and LMSys) shows that SOSecure achieves strong fix rates of 71.7%, 91.3%, and 96.7% respectively, compared to prompting GPT-4 without relevant discussions (49.1%, 56.5%, and 37.5%), and outperforms multiple other baselines. SOSecure operates as a language-agnostic complement to existing LLMs, without requiring retraining or fine-tuning, making it easy to deploy. Our results underscore the importance of maintaining active developer forums, |
|
| Hernández López, José Antonio |
Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró (Linköping University, Sweden; University of Murcia, Spain) Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development. |
|
| Hossain, Soneya Binta |
Tasfia Tasnim, Matthew B. Dwyer, and Soneya Binta Hossain (University of Texas at Dallas, USA; University of Virginia at Charlottesville, USA) Test oracles determine whether a program execution is correct for a given input. Two common forms are assertion oracles, which compare observed outputs with expected results, and exception oracles, which verify that a program raises an expected exception. Automated test oracle generation (TOG) aims to reduce the manual effort involved in constructing such oracles. Although recent TOG methods, especially LLM-based approaches, have made rapid progress, their evaluation remains constrained by benchmarks that rely on automatically generated tests, narrow single-assert formulations, simplified developer-written tests, or limited oracle diversity. To address these limitations, we introduce OE25𝑑𝑒𝑣, a multi-variant dataset curated from developer-written unit tests across 25 open-source Java projects spanning 56 modules, and TOGBench, an end-to-end benchmark suite for TOG. OE25𝑑𝑒𝑣 captures six oracle categories and preserves realistic settings, including single- and multi-oracle configurations, mixed assertion-and-exception oracles, and developer-authored custom oracles. TOGBench supports end-to-end experimentation by reintegrating generated oracles into runnable test suites and evaluating them via compilation, execution, false-positive analysis, and mutation testing. Our evaluation further shows that OE25𝑑𝑒𝑣 preserves substantially greater structural complexity than prior benchmarks and exposes marked performance degradation of representative TOG models on developer-written tests, particularly for assertion oracles. |
|
| İr, Mustafa Özkan |
Mustafa Özkan İr, Mehmet Dedeler, Anıl Koyuncu, and Eray Tüzün (Bilkent University, Türkiye) Verifying bug fixes before patches are released to end users is a critical step in the software development lifecycle. However, this process is often manual, repetitive, and error-prone, especially for crash bugs triggered through Graphical User Interface (GUI) interactions in desktop applications. Despite recent advancements in LLM-driven software agents, existing work primarily targets bug reproduction without addressing bug fix verification, while approaches that do focus on verification rely on source code access, making them inapplicable to closed-source GUI-based desktop applications. This paper introduces Fixpad++, a framework designed to automatically verify bug fixes in the Notepad++ desktop application using LLM-powered agents. Fixpad++ employs a two-phase approach: first, a multi-modal multi-agent system interacts with the buggy version to reproduce the reported crash bug using visual parsing and LLM reasoning. Second, upon successful reproduction, a trajectory replay mechanism executes the recorded action sequence on the patched version to verify the fix. We evaluated Fixpad++ on FixPad-Bench, a new dataset of 105 evaluation instances derived from 22 real-world Notepad++ crash bugs, including correct and incorrect patches. The system achieved a reproduction success rate of 72.73% with an average time of 174.07 seconds. Among the successfully reproduced cases, Fixpad++ successfully verified correct fixes with 87.50% accuracy and detected incorrect fixes with 77.05% accuracy, outperforming OpenAI’s Computer-Using Agent (CUA). Fixpad++ demonstrates the effectiveness of specialized LLM agent architectures for automated bug fix verification in GUI-based desktop applications, offering a practical solution for automating verification workflows without requiring access to source code. |
|
| Izadi, Maliheh |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
|
| Jabre, Kosay |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
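The aggregation step described in this abstract can be illustrated with a small sketch. It assumes that per-token attention weights are summed per code unit and that units are ranked by that sum. The abstract does not state the aggregation function, so summation, the function name, and the toy inputs are all assumptions for illustration only.

```python
from collections import defaultdict


def top_k_units(token_attention, token_to_unit, k):
    """Sum token-level attention per unit (line, hunk, or file) and return the k top units."""
    scores = defaultdict(float)
    for weight, unit in zip(token_attention, token_to_unit):
        scores[unit] += weight
    # Highest score first; ties are broken by unit id so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [unit for unit, _ in ranked[:k]]


# Toy input: six tokens, each mapped to the hunk it belongs to (hypothetical values).
attention = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
hunks = ["h1", "h1", "h2", "h3", "h3", "h2"]
top_hunks = top_k_units(attention, hunks, k=2)
```

Passing line or file identifiers instead of hunk identifiers in `token_to_unit` gives the other two granularities mentioned above.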
|
| Jacopin, Éric |
Éric Jacopin (Cosmic AI, France) AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92–100%) across all models except Claude 3.5 Haiku, which strips all doctests; (2) separated tests expose stark model-tier gaps (0–100% correctness) and independence between preservation and correctness; (3) model behavior evolves across generations, and notably one model breaks the test suppression pattern of its three predecessors; (4) mechanistic analysis on 7 open-source architectures (6 transformers and a gated-linear Recurrent Neural Network (RNN)) reveals inline test markers receive 2.8–4.4× stronger attention in 5/7 models, with causal validation via knockout and steering experiments on the 4 code-specialized transformers and RWKV-6; the co-location mechanism extends to a non-transformer architecture, suggesting the design recommendation is robust to future architectural shifts. In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. We further qualify the effect as bounded by both model capability and programming language. |
|
| Jha, Smriti |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Jin, Xin |
Jun Yeon Won, Xin Jin, Shiqing Ma, and Zhiqiang Lin (Ohio State University, Columbus, USA; Meta, USA; University of Massachusetts at Amherst, USA) Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of two LLMs and the result demonstrates difficulties in complex tasks. |
|
| Kanyiki, Yanick |
Yanick Kanyiki (InvarLock, Canada) Release evaluation for FM-powered software often grows by habit rather than policy: teams repeat runs until budget or time is exhausted, without clear evidence that more passes change release decisions. We study a release-evaluation protocol that separates three concerns: artifact readiness, decision-stability stopping, and cross-hardware promotion gating. The study uses 340 runs spanning seven edit families (five core plus two probes), four model families, ten seeds, and dual-host H100/H200 execution. In this matrix and under this policy setting, additional seed repetition did not change promote/block outcomes, edit-family breadth remained decision-informative, and small H100/H200 score differences could still alter promotion outcomes near strict boundaries. These findings motivate workload-conditional resource allocation for release engineering: in this evidence setting, additional budget is more decision-informative when spent on edit diversity and host-parity checks than on deeper seed repetition. The contribution is an operational decision framework, with explicit sensitivity reporting, that turns release evaluation from a fixed checklist into a defensible governance process. In this matrix, seed-stop reduced measured GPU-hours by about 90% versus fixed 10-pass seed evaluation. Numeric thresholds are workload-derived; the transferable contribution is the gate-setting process. |
|
| Kassab, Mohamad |
Mohamad Kassab (Boston University, USA) Generative image systems are increasingly embedded in software engineering artifacts such as slides, documentation, and recruiting collateral, shaping implicit signals about who is seen to “belong.” We present a mixed-methods empirical audit of 880 images generated by four widely used text-to-image models (GPT-4o/DALL·E 3, Llama-4/Emu, Qwen3-235B-A22B, Stable Diffusion) using 22 demographically neutral prompts, intentionally uniform to isolate default model priors, varying role, seniority, team context, geography, and language. Independent human annotations, triangulated with automated raters and validated through per-category agreement analysis, capture both demographic representation (gender, race/ethnicity, age) and portrayal cues (setting, attire, props, emotion). We analyze intersectional distributions and benchmark them against occupational reference statistics using risk ratios, Jensen-Shannon divergence, and the Theil index. Across models, outputs consistently converge on a narrow archetype: young men dominate (95.8% male, 88% under 40), women and older professionals are rare, and several racial and ethnic groups are systematically underrepresented relative to workforce baselines. Prompt variation modestly shifts racialized appearance but leaves gender imbalance largely intact, while model differences are primarily of degree rather than direction. We translate these findings into actionable implications for AI-aware software engineering practice, including representational regression tests in CI/CD pipelines and diversity-aware generation defaults, arguing that evaluation of AI systems in software engineering must account for the societal signals conveyed by generated imagery alongside functional performance. |
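Two of the metrics named in this abstract, the risk ratio and the Jensen-Shannon divergence, can be computed as in the sketch below. It uses the standard base-2 definition, under which the divergence lies between 0 and 1. The distributions are made-up toy values, not figures from the study, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (0 = identical)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def risk_ratio(observed_share, reference_share):
    """Share of a group in generated images relative to its share in the reference data."""
    return observed_share / reference_share


# Toy example: generated-image shares versus a hypothetical workforce baseline.
generated = [0.9, 0.1]
baseline = [0.5, 0.5]
divergence = js_divergence(generated, baseline)
ratio = risk_ratio(generated[1], baseline[1])
```

A risk ratio below 1 indicates that a group appears less often in the generated images than in the reference population.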
|
| Khan, Euna Mehnaz |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Khanna, Simarjot |
Simarjot Khanna (Independent Researcher, Canada) Long-horizon autonomous agents suffer from semantic livelock: they continue generating tokens and calling tools without making progress. Unlike a crash, this "zombie" state consumes API budget and time while remaining operationally active. We treat this as a progress violation and propose the Convergence Monitor, a lightweight sidecar that fingerprints agent states in embedding space. In a forensic analysis of real-world failures (SWE-agent corpus), we found that 25% of the analyzed long-duration failures exhibited semantic livelock patterns. In one extreme case, an agent wasted 208 steps in a checkerboard oscillation pattern invisible to standard string-matching repetition guards. We argue that future Agentware requires a "liveness coprocessor" to ensure software makers do not pay for stalled execution. |
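To make the idea of fingerprinting agent states in embedding space concrete, the sketch below flags a run once several consecutive steps land close to a recently seen state. It is only an illustration of the general approach, not the paper's Convergence Monitor: the class name, window size, similarity threshold, and patience value are all assumptions, and real embeddings would come from an embedding model rather than hand-written vectors.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LivenessSketch:
    """Hypothetical monitor: counts consecutive steps that revisit a recent state."""

    def __init__(self, window=8, threshold=0.95, patience=4):
        self.history = deque(maxlen=window)  # recent state embeddings
        self.threshold = threshold           # similarity that counts as a revisit
        self.patience = patience             # consecutive revisits before flagging
        self.revisits = 0

    def observe(self, embedding):
        """Record one step; return True once a livelock is suspected."""
        revisited = any(cosine(embedding, past) >= self.threshold for past in self.history)
        self.revisits = self.revisits + 1 if revisited else 0
        self.history.append(embedding)
        return self.revisits >= self.patience


# A two-state "checkerboard": the agent alternates between states A and B.
state_a, state_b = [1.0, 0.0], [0.0, 1.0]
monitor = LivenessSketch()
flags = [monitor.observe(s) for s in [state_a, state_b] * 3]
```

Because each step is compared against a window of earlier states and not only the previous one, an A, B, A, B alternation is caught even though no two adjacent steps are alike.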
|
| Khatib, Lara |
Lara Khatib, Michael Pu, Bogdan Vasilescu, and Meiyappan Nagappan (University of Waterloo, Canada; Carnegie Mellon University, USA) As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code’s logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation–summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end), we find that, for GPT-4, summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist, as models often describe algorithmic intent rather than actual mutated behavior, with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as “bugs”, both models continue to struggle with distinguishing implementation details from standard algorithmic patterns. 
This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns, enabling direct comparison across models, prompts, and datasets. |
|
| Khavari, David |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Khomh, Foutse |
Anthonia Oluchukwu Njoku, Zohreh Sharafi, and Foutse Khomh (Polytechnique Montréal, Canada) Autonomous coding agents are increasingly participating in collaborative software development by generating repository-level pull requests (PRs) that must be reviewed and integrated by human teams. While prior work has examined the technical characteristics of agent-generated patches, less is known about how autonomous authorship reshapes human collaboration dynamics in real-world workflows. In this paper, we investigate large-scale human–agent collaboration by comparing 40,214 pull requests across 2,807 GitHub repositories, including 33,596 agent-authored PRs from five autonomous coding agents and 6,618 human-authored PRs. We examine differences across three dimensions: integration outcomes, structural characteristics, and collaboration signals. Our findings reveal a socio-technical trade-off. Agent-authored PRs are integrated significantly faster, yet exhibit lower merge rates overall. Task type moderates this effect: agents outperform humans in documentation tasks but underperform in behavior-changing contributions. Beyond outcomes, collaboration patterns differ systematically. Agent-authored PRs attract proportionally more bot-generated comments and elicit more analytic, less socially oriented review communication. In contrast, human-authored PRs receive more elaborative and socially engaged feedback. Incorporating psycholinguistic features into predictive models significantly improves merge outcome prediction, demonstrating that communication style carries explanatory power beyond structural code characteristics. These results suggest that autonomous agents do not merely introduce technical artifacts into repositories, but actively reshape review interaction patterns and evaluative behavior. The impact of coding agents is therefore fundamentally socio-technical, highlighting the importance of studying AI systems within authentic human–AI collaborative environments. |
|
| Kjellberg, Viktor |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged to a result. 
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Kodakandla, Dheeraj |
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, and Mahfuza Farooque (Pennsylvania State University, USA) VeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) used as a high-precision acceptance gate, and canonical PL→CNF compilation, all executed via fixed API configuration (temperature = 0; fine-tuning runs use seed = 42) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On SatBench (2100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100–150 curated examples improves fidelity by about 1–1.5 pp without increasing latency (mean 25.8 s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability–coverage knob: at τ = 75, roughly 68% of items are retained with ~94% correctness on the accepted set. Validator overhead contributes < 15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL→logic front-ends into auditable, reproducible components for reliability-critical workflows. |
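The thresholded acceptance policy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `round_trip_score` and `solver_correct`, the 0–100 score scale, the inclusive `>=` comparison, and the four sample items are all assumptions made for illustration.

```python
def acceptance_report(items, tau):
    """Return (coverage, accepted-set accuracy) for a round-trip threshold tau.

    Each item is a dict with hypothetical keys:
      round_trip_score - similarity of the PL->NL reconstruction, assumed 0-100
      solver_correct   - whether the SAT/UNSAT verdict matched ground truth
    """
    # Keep only items whose round-trip score clears the gate (inclusive, by assumption).
    accepted = [it for it in items if it["round_trip_score"] >= tau]
    coverage = len(accepted) / len(items) if items else 0.0
    correct = sum(1 for it in accepted if it["solver_correct"])
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Made-up items, purely illustrative (not data from the paper).
items = [
    {"round_trip_score": 92.0, "solver_correct": True},
    {"round_trip_score": 80.0, "solver_correct": True},
    {"round_trip_score": 76.0, "solver_correct": False},
    {"round_trip_score": 60.0, "solver_correct": False},
]
coverage, accuracy = acceptance_report(items, tau=75)
```

Raising `tau` in this sketch shrinks the accepted set, which is the reliability–coverage trade-off the abstract refers to.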
|
| Koyuncu, Anıl |
Mustafa Özkan İr, Mehmet Dedeler, Anıl Koyuncu, and Eray Tüzün (Bilkent University, Türkiye) Verifying bug fixes before patches are released to end users is a critical step in the software development lifecycle. However, this process is often manual, repetitive, and error-prone, especially for crash bugs triggered through Graphical User Interface (GUI) interactions in desktop applications. Despite recent advancements in LLM-driven software agents, existing work primarily targets bug reproduction without addressing bug fix verification, while approaches that do focus on verification rely on source code access, making them inapplicable to closed-source GUI-based desktop applications. This paper introduces Fixpad++, a framework designed to automatically verify bug fixes in the Notepad++ desktop application using LLM-powered agents. Fixpad++ employs a two-phase approach: first, a multi-modal multi-agent system interacts with the buggy version to reproduce the reported crash bug using visual parsing and LLM reasoning. Second, upon successful reproduction, a trajectory replay mechanism executes the recorded action sequence on the patched version to verify the fix. We evaluated Fixpad++ on FixPad-Bench, a new dataset of 105 evaluation instances derived from 22 real-world Notepad++ crash bugs, including correct and incorrect patches. The system achieved a reproduction success rate of 72.73% with an average time of 174.07 seconds. Among the successfully reproduced cases, Fixpad++ successfully verified correct fixes with 87.50% accuracy and detected incorrect fixes with 77.05% accuracy, outperforming OpenAI’s Computer-Using Agent (CUA). Fixpad++ demonstrates the effectiveness of specialized LLM agent architectures for automated bug fix verification in GUI-based desktop applications, offering a practical solution for automating verification workflows without requiring access to source code. |
|
| Kumar, Rajesh |
Naing Oo Lwin and Rajesh Kumar (Bucknell University, USA) Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality. |
|
| Latif, Ahmad Abdel |
Mostafa Anouar Ghorab, Ahmad Abdel Latif, and Mohamed Aymen Saied (Université Laval, Canada; University of Calgary, Canada) Kubernetes has become a central platform for orchestrating cloud-native applications, yet its declarative configuration model frequently introduces security misconfigurations that threaten system reliability and operational stability. Although automated detection tools are widely available, a systematic understanding of misconfiguration patterns and scalable correction mechanisms remains limited. This paper presents a comprehensive empirical study of Kubernetes security misconfigurations based on 2,662 developer-reported issues from Stack Overflow. From this dataset, we derive a structured taxonomy that captures recurring security weaknesses across configuration object types and misconfiguration categories. Using this taxonomy, we analyze how severity levels vary across objects and categories, and examine how security misconfigurations evolve between incubator and stable project stages. Our findings reveal that while some operational issues decrease as projects mature, critical security misconfigurations often persist or reappear, highlighting enduring risk patterns in cloud-native systems. Building on this empirical foundation, we evaluate the effectiveness of Large Language Models (LLMs) in automatically correcting Kubernetes security misconfigurations under progressively enriched contextual conditions. Results demonstrate that contextual grounding significantly improves correction accuracy, with the best standalone model achieving 89.06%. To further enhance structural correctness and schema compliance, we introduce Kubecurity, a schema-guided validation framework that enforces compliance with official Kubernetes specifications. By combining contextual LLM reasoning with deterministic schema enforcement, the proposed hybrid approach achieves 98.50% correction accuracy while substantially reducing newly introduced misconfigurations.
Overall, this work advances both the understanding and automated remediation of Kubernetes security misconfigurations. |
|
| Lin, Zhiqiang |
Jun Yeon Won, Xin Jin, Shiqing Ma, and Zhiqiang Lin (Ohio State University, Columbus, USA; Meta, USA; University of Massachusetts at Amherst, USA) Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of two LLMs, and the results demonstrate difficulties in complex tasks. |
|
| Liu, Kaijie |
Kaijie Liu and Yulei Sui (UNSW, Australia) Neural network (NN) verifiers are increasingly used to certify safety properties such as robustness (i.e., small allowed perturbations to an input should not alter a model’s decision). Since verifiers aim to prove the absence of violations by considering all possible specified behaviors, the soundness of their implementations is critical to guaranteeing correctness. Detecting unsoundness is particularly important and challenging, because a verifier typically spans multiple components, including specifications, neural networks, operator semantics, and constraint solving, where subtle implementation bugs can silently lead to false certified results. We present an approach for neural network robustness verifiers that detects and localizes soundness-relevant faults via two types of concrete–abstract consistency checks: (1) Counterexample-Based Refutation (CBR), where a certification is supposed to be refuted if a concrete counterexample is found at runtime; and (2) Bounds-Based Localization (BBL), which audits per-neuron containment (concrete activations must lie within abstract bounds as an invariant) to pinpoint incorrect implementations at particular NN layers or operators. To reduce representation drift, we use specification-embedded models that wrap the core NN with input and output specifications as two additional layers. We further develop an operator-aware NN generator that can produce diverse NN models spanning a wide range of layer types, parameters, and architectures, enabling systematic exposure and exercise of different operator behaviors. We evaluate verifiers on three abstract domains using six mutation operators. Across 450 soundness-violating instances, our framework detects 72% of injected soundness violations.
CBR mainly exposes input-output-level soundness failures when a concrete counterexample is found during input sampling, while BBL catches internal bound-containment violations and localizes them to specific layers/operators, even when CBR becomes ineffective in high-dimensional inputs. These results indicate that combining coarse refutation (CBR) with fine-grained invariant checking (BBL) provides assurance for verifiers, and operator-aware generation further boosts both coverage and discovery of unsoundness issues. |
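The per-neuron containment invariant behind BBL (concrete activations must lie within abstract bounds) can be sketched as a simple check. This is an illustrative sketch only, not the authors' tool: the function name, the nested-list layout of activations and bounds, and the tolerance `tol` are assumptions, and the numbers are made up.

```python
def find_containment_violations(activations, lower, upper, tol=1e-9):
    """Return (layer, neuron) pairs whose concrete activation escapes its abstract bounds.

    activations, lower, upper are lists of per-layer lists with matching shapes
    (a hypothetical layout chosen for this sketch).
    """
    violations = []
    for layer, (acts, los, his) in enumerate(zip(activations, lower, upper)):
        for neuron, (a, lo, hi) in enumerate(zip(acts, los, his)):
            # Invariant: lo <= a <= hi, up to a small numerical tolerance.
            if a < lo - tol or a > hi + tol:
                violations.append((layer, neuron))
    return violations


# Made-up values for a two-layer network with two neurons per layer.
activations = [[0.5, -0.2], [1.4, 0.0]]
lower = [[0.0, -1.0], [0.0, -0.5]]
upper = [[1.0, 0.0], [1.0, 0.5]]
violations = find_containment_violations(activations, lower, upper)
```

Because each violation carries a layer index, the earliest violating layer is a natural starting point when looking for the faulty operator.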
|
| Liu, Xuan |
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, and Mahfuza Farooque (Pennsylvania State University, USA) VeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) used as a high-precision acceptance gate, and canonical PL→CNF compilation, all executed via fixed API configuration (temperature = 0; fine-tuning runs use seed = 42) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On SatBench (2100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100–150 curated examples improves fidelity by about 1–1.5 pp without increasing latency (mean 25.8 s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability–coverage knob: at τ = 75, roughly 68% of items are retained with ~94% correctness on the accepted set. Validator overhead contributes < 15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL→logic front-ends into auditable, reproducible components for reliability-critical workflows. |
|
| Liu, Yalin |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
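Aggregating token-level attention into code units and presenting the top-K units, as this abstract describes, could look roughly like the sketch below. It is an assumption-laden illustration rather than the DRS implementation: summing weights per hunk is one possible aggregation (the abstract does not say which is used), and the function name, hunk identifiers, and attention values are hypothetical.

```python
from collections import defaultdict


def top_k_hunks(token_attention, token_to_hunk, k=2):
    """Rank hunks by summed token attention and return the k highest-scoring hunk ids.

    token_attention - one attention weight per token of the diff
    token_to_hunk   - the hunk id each token belongs to (same length)
    Summation is an assumed aggregation; a mean or max would be equally plausible.
    """
    totals = defaultdict(float)
    for weight, hunk in zip(token_attention, token_to_hunk):
        totals[hunk] += weight
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [hunk for hunk, _ in ranked[:k]]


# Made-up attention weights for six tokens spread over three hunks.
attn = [0.05, 0.10, 0.40, 0.05, 0.30, 0.10]
hunks = ["h0", "h0", "h1", "h2", "h2", "h0"]
top = top_k_hunks(attn, hunks, k=2)
```

The same function works for lines or files by swapping the token-to-unit mapping.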
|
| Lu, Chang-Tien |
Mariam ALMutairi and Chang-Tien Lu (Virginia Polytechnic Institute, USA) Existing LLM security benchmarks evaluate code generation quality, leaving an open question: can LLMs generate tests that detect vulnerabilities? We address this with two technical contributions. First, we propose the Security Mutation Score (SMS), a metric that classifies mutant kills into semantic, functional, incidental, and crash categories using operator-aware heuristics, distinguishing genuine security awareness from coincidental detection. We further define Effective SMS (EffSMS = SMS × Secure-Pass Rate) to account for test validity. Second, we design 25 security-specific mutation operators spanning 30 CWE categories that transform secure Python code into realistic vulnerable variants, extending prior security mutation frameworks to Python and introducing 22 new operators. Evaluating eight LLMs and two static analysis baselines on 339 programs and 1,869 mutants reveals three findings: (i) traditional mutation scores overstate LLM security testing capability by 2.2× on average; (ii) the best LLM achieves only 19.7% EffSMS vs. 47.6% for expert-written tests—a 2.4× gap raw metrics obscure; and (iii) functional kills, not crashes, dominate non-semantic failures (15–36%), showing LLMs detect behavioral side-effects rather than security properties. Static analysis and mutation testing provide complementary coverage across syntactic vs. logic-flaw CWEs. Code and data are publicly available. |
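The relation EffSMS = SMS × Secure-Pass Rate stated in this abstract can be worked through with a small sketch. Only the product comes from the abstract. Computing SMS as the share of mutants with a kill labelled semantic is an assumption made here for illustration, as are the label strings, the `survived` label, and the sample values.

```python
from collections import Counter


def security_mutation_score(kill_labels):
    """Share of mutants whose kill was labelled 'semantic' (assumed definition of SMS)."""
    if not kill_labels:
        return 0.0
    counts = Counter(kill_labels)
    return counts["semantic"] / len(kill_labels)


def effective_sms(sms, secure_pass_rate):
    """EffSMS = SMS x Secure-Pass Rate, as given in the abstract."""
    return sms * secure_pass_rate


# Made-up outcome labels for eight mutants; 'survived' marks a mutant no test killed.
labels = ["semantic", "functional", "survived", "crash",
          "semantic", "incidental", "survived", "survived"]
sms = security_mutation_score(labels)           # 2 semantic kills out of 8 mutants
eff = effective_sms(sms, secure_pass_rate=0.8)  # discounted by test validity
```

In this sketch, a test suite that rarely passes on the secure code sees its score reduced in proportion, which matches the stated purpose of accounting for test validity.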
|
| Lulla, Jai Lal |
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. 
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Lwin, Naing Oo |
Naing Oo Lwin and Rajesh Kumar (Bucknell University, USA) Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality. |
|
| Ma, Shiqing |
Jun Yeon Won, Xin Jin, Shiqing Ma, and Zhiqiang Lin (Ohio State University, Columbus, USA; Meta, USA; University of Massachusetts at Amherst, USA) Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of two LLMs, and the results demonstrate difficulties in complex tasks. |
|
| Macvean, Andrew |
Tao Dong, Sherry Shi, Harini Sampath, and Andrew Macvean (Google, USA) The ongoing transition of Large Language Models (LLMs) in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive software engineering (SWE) agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises. |
|
| Maddila, Chandra |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Majumdar, Arunabh |
Arunabh Majumdar (Independent Researcher, India) We present CrossCommitVuln-Bench, a curated benchmark of 30 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits—each individually benign to per-commit static analysis—but collectively critical. We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis, and baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding: the per-commit detection rate (CCDR) is 7% across all 30 vulnerabilities — 93% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other detects only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode (full codebase present), the detection rate is only 13%, confirming that snapshot-based SAST tools systematically miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection. |
|
| Mammadov, Tural |
Norman Becker, Tural Mammadov, and Andreas Zeller (CISPA Helmholtz Center for Information Security, Germany) In the past years, large language models (LLMs) have demonstrated remarkable progress in code generation. However, their ability to reason about program behavior remains an open challenge—an ability that is relevant for applications including reverse engineering, debugging, secure code generation, test-driven synthesis, input reconstruction, reverse fuzzing, behavioral monitoring, and safe execution modeling. To study this ability, we examine the capacity of LLMs to reason about the semantics of code—specifically, their ability to relate code, its inputs, and its outputs to each other. To this end, we investigate whether and how well LLMs can predict one of these three components given the other two—that is, (1) predict the input given code and output, (2) predict the output given code and input, and (3) predict the code given input and output. This way, we assess how well LLMs can reason about and understand the underlying relationships that govern program execution. We construct four datasets covering string processing, array operations, and coding challenges in JavaScript and Python to evaluate diverse program-understanding capabilities, incorporating various code mutation techniques to increase complexity. In our evaluation on tasks covering string processing, array operations, and coding challenges, we find that closed-weight models achieve the strongest performance across all datasets, including perfect input recovery on deterministic string tasks. Across tasks, output prediction is comparatively stable, whereas code prediction remains the hardest setting and often fails for smaller models. Finally, cross-codebase transfer is feasible, especially for input prediction, but highly sensitive to model capacity and fine-tuning strategy. |
|
| Meding, Wilhelm |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result.
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Ming, Hua |
Mohamed Almukhtar, Anwar Ghammam, and Hua Ming (University of Michigan at Flint, USA; University of Michigan at Dearborn, USA) As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code-quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues—predominantly convention-level violations such as long lines—while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows. |
|
| Mockus, Audris |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff)—beyond rule-based lints—where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Mohsenimofidi, Seyedmoein |
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. 
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. |
|
| Mothukuri, Viraaji |
Viraaji Mothukuri and Reza M. Parizi (Kennesaw State University, USA) The LLM foundation model era has inverted a fundamental assumption in software engineering. Code, once written, no longer belongs exclusively to its creators. Any publicly accessible code becomes training data, absorbed into models that can reproduce, adapt, and redistribute it without consent. This paper argues that such circumstances represent not merely a legal or ethical challenge, but a technical one requiring new defensive primitives. Without technical defenses, the most valuable software becomes training data for systems its authors do not control. We introduce the concept of Statistical Opacity, defined as the deliberate design of code representations that resist neural pattern extraction while preserving human readability and machine executability, by exploiting the gap between executability and learnability. We articulate a research agenda spanning theory, pathways, tools, and evaluation. |
|
| Mukherjee, Manisha |
Manisha Mukherjee and Vincent Josua Hellendoorn (Carnegie Mellon University, USA; Google, USA) Large Language Models (LLMs) are widely used for automated code generation. Their reliance on infrequently updated pretraining data can leave them unaware of newly discovered vulnerabilities and evolving security standards, making them prone to producing insecure code. In contrast, developer communities on Stack Overflow (SO) provide an ever-evolving repository of knowledge, where security vulnerabilities are actively discussed and addressed through collective expertise. These community-driven insights remain largely untapped by LLMs. This paper introduces SOSecure, a Retrieval-Augmented Generation (RAG) system that leverages the collective security expertise found in SO discussions to improve the security of LLM-generated code. We build a security-focused knowledge base by extracting SO answers and comments that explicitly identify vulnerabilities. Unlike common uses of RAG, SOSecure triggers after code has been generated to find discussions that identify flaws in similar code. These are used in a prompt to an LLM to consider revising the code. Evaluation across three datasets (SALLM dataset, LLMSecEval, and LMSys) shows that SOSecure achieves strong fix rates of 71.7%, 91.3%, and 96.7% respectively, compared to prompting GPT-4 without relevant discussions (49.1%, 56.5%, and 37.5%), and outperforms multiple other baselines. SOSecure operates as a language-agnostic complement to existing LLMs, without requiring retraining or fine-tuning, making it easy to deploy. Our results underscore the importance of maintaining active developer forums, |
|
| Murali, Vijayaraghavan |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff)—beyond rule-based lints—where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Nagappan, Meiyappan |
Lara Khatib, Michael Pu, Bogdan Vasilescu, and Meiyappan Nagappan (University of Waterloo, Canada; Carnegie Mellon University, USA) As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code’s logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation–summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end), we find that, for GPT-4, summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist, as models often describe algorithmic intent rather than actual mutated behavior, with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as “bugs”, both models continue to struggle with distinguishing implementation details from standard algorithmic patterns. 
This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns, enabling direct comparison across models, prompts, and datasets. |
|
| Nagappan, Nachiappan |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff)—beyond rule-based lints—where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Nanda, Rahul |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Ni, Chenye |
Elliott Wen, Chenye Ni, Valerio Terragni, and Jens Dietrich (University of Auckland, New Zealand; Massey University, New Zealand) Reproducible independent rebuilds strengthen software supply-chain integrity by recreating the original build environment and enforcing bitwise equivalence between artifacts. However, this approach implicitly assumes a trustworthy toolchain and can fail under adversarial manipulation of the build process itself (e.g., the Ken Thompson attack). Prior work has explored introducing diversity across build environments to reduce reliance on any single toolchain, and has proposed AI-driven methods to establish behavioural equivalence while tolerating benign build variability in the Java ecosystem. In this work, we extend this line of research to Rust and present RustBuildEq, a benchmark for training and evaluating binary equivalence classifier models under realistic build variability. We curate a large corpus of crates drawn from the top 20% of the crates.io ecosystem and construct datasets of equivalent (EQ) and non-equivalent (NEQ) pairs with rich provenance metadata. EQ pairs are generated from identical source revisions under varying toolchain versions and build configurations, while NEQ pairs are derived from AST rewrites or API-breaking changes across versions. Many Rust crates rely heavily on generics and cannot be compiled into binaries without specifying concrete types; to address this, we develop an automated approach that combines heuristic type instantiation, witness-type synthesis, and an iterative AI repair loop. RustBuildEq comprises 19,184,671 EQ records and 273,848,531 NEQ records, and includes a Python API for dataset navigation. The dataset provides large-scale ground truth for training and evaluating AI-driven models for reasoning about binary equivalence and is publicly available at https://doi.org/10.5281/zenodo.19244908. |
|
| Nie, Pengyu |
Xiaoyu Cheng, Kundi Yao, Pengyu Nie, and Weiyi Shang (University of Waterloo, Canada; Ontario Tech University, Canada) Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in recent open-weight LLMs in code generation using a two-stage memorization evaluation pipeline, which combines a similarity-based extractability filter with a targeted data extraction attack. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42-64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by a complex interplay of training domain, dataset composition, architectural choices, and content characteristics than by parameter count alone. |
|
| Nilsson, Ulf |
Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró (Linköping University, Sweden; University of Murcia, Spain) Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development. |
|
| Njoku, Anthonia Oluchukwu |
Anthonia Oluchukwu Njoku, Zohreh Sharafi, and Foutse Khomh (Polytechnique Montréal, Canada) Autonomous coding agents are increasingly participating in collaborative software development by generating repository-level pull requests (PRs) that must be reviewed and integrated by human teams. While prior work has examined the technical characteristics of agent-generated patches, less is known about how autonomous authorship reshapes human collaboration dynamics in real-world workflows. In this paper, we investigate large-scale human–agent collaboration by comparing 40,214 pull requests across 2,807 GitHub repositories, including 33,596 agent-authored PRs from five autonomous coding agents and 6,618 human-authored PRs. We examine differences across three dimensions: integration outcomes, structural characteristics, and collaboration signals. Our findings reveal a socio-technical trade-off. Agent-authored PRs are integrated significantly faster, yet exhibit lower merge rates overall. Task type moderates this effect: agents outperform humans in documentation tasks but underperform in behavior-changing contributions. Beyond outcomes, collaboration patterns differ systematically. Agent-authored PRs attract proportionally more bot-generated comments and elicit more analytic, less socially oriented review communication. In contrast, human-authored PRs receive more elaborative and socially engaged feedback. Incorporating psycholinguistic features into predictive models significantly improves merge outcome prediction, demonstrating that communication style carries explanatory power beyond structural code characteristics. These results suggest that autonomous agents do not merely introduce technical artifacts into repositories, but actively reshape review interaction patterns and evaluative behavior. The impact of coding agents is therefore fundamentally socio-technical, highlighting the importance of studying AI systems within authentic human–AI collaborative environments. |
|
| Ogenrwot, Daniel |
Daniel Ogenrwot and John Businge (University of Nevada at Las Vegas, USA) Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code, and supplementary materials are available on Zenodo: 10.5281/zenodo.19396916 |
|
| Paltenghi, Matteo |
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra (Meta Platforms, USA) Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale. |
|
| Parizi, Reza M. |
Viraaji Mothukuri and Reza M. Parizi (Kennesaw State University, USA) The LLM foundation model era has inverted a fundamental assumption in software engineering. Code, once written, no longer belongs exclusively to its creators. Any publicly accessible code becomes training data, absorbed into models that can reproduce, adapt, and redistribute it without consent. This paper argues that such circumstances represent not merely a legal or ethical challenge, but a technical one requiring new defensive primitives. Without technical defenses, the most valuable software becomes training data for systems its authors do not control. We introduce the concept of Statistical Opacity, defined as the deliberate design of code representations that resist neural pattern extraction while preserving human readability and machine executability, by exploiting the gap between executability and learnability. We articulate a research agenda spanning theory, pathways, tools, and evaluation. |
|
| Patel, Akshay |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff)—beyond rule-based lints—where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Pattanaik, Sitesh |
Xin Zhao, Brian Vu, and Sitesh Pattanaik (Seattle University, USA; University of California at Irvine, USA) Background: Artificial intelligence is rapidly changing the landscape of software development. With the unique ability to quickly generate code and the potential to disrupt traditional workflows, AI tools have found growing adoption within the software development process. Subsequently, this topic has been the focus of academic work, including research examining qualitative impacts to productivity and the analysis of sentiments from the developers who utilize AI tools. While this material is extensive, our research team identified a gap within existing literature: what do software managers have to say? Goals: The overarching goal of this study is to examine the views of software managers on how AI tools have affected software development. We seek to understand how managers, who leverage a top-down view of the development process, perceive the influence of AI on developers, their own roles, and the broader labor market. Methodology: To answer these questions, we conducted an empirical study by releasing an online questionnaire containing both qualitative and quantitative questions, sampling software managers employed across both tech-focused and non-tech-focused companies. Results: Through a survey of 42 managers, we found that managers hold nuanced views on the introduction of AI into software development. They encourage developers to use AI, perceive it as valuable for testing, and apply it themselves for knowledge work. At the same time, they raise concerns about privacy, responsibility, transparency, and over-reliance. Many also predict a loss of jobs within the software development market due to consolidation driven by AI. Conclusion: AI is seen by managers as both a powerful productivity tool and a source of new ethical challenges. 
Our investigation paves the way for a comprehensive understanding of how AI is perceived by those who directly manage the introduction of these tools into traditional software development workflows, revealing a road map for future endeavors for the software development community. |
|
| Pham, Hung Viet |
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang (York University, Canada; Queen's University, Canada) Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks. Most existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a context-aware benchmark designed to evaluate LLMs on vulnerability detection with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches provide modest but promising improvements. The best-performing Claude-3.7-Sonnet-driven agent achieves an F1-score of 23.83% for vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware vulnerability detection with LLMs. |
|
| Plotnikov, Maksim |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
|
| Pu, Michael |
Lara Khatib, Michael Pu, Bogdan Vasilescu, and Meiyappan Nagappan (University of Waterloo, Canada; Carnegie Mellon University, USA) As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code’s logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation–summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end), we find that, for GPT-4, summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist: models often describe algorithmic intent rather than actual mutated behavior, with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as “bugs”, both models continue to struggle with distinguishing implementation details from standard algorithmic patterns.
This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns, enabling direct comparison across models, prompts, and datasets. |
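The mutation step in this abstract (inject a targeted change, then check whether the summary follows the new behavior) can be illustrated with a minimal sketch. It assumes Python's standard `ast` module; the names `FlipOneComparison`, `flip_one_comparison`, `run`, and the sample program are hypothetical and are not the authors' tooling. Regenerating and judging the summary with an LLM is deliberately left out.

```python
import ast


class FlipOneComparison(ast.NodeTransformer):
    """Turn one `<` into `<=`: a hypothetical decision-type mutation."""

    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.LtE()
            self.done = True
        return node


def flip_one_comparison(source: str) -> str:
    """Return the source text with a single comparison operator mutated."""
    tree = ast.parse(source)
    mutated = FlipOneComparison().visit(tree)
    ast.fix_missing_locations(mutated)
    return ast.unparse(mutated)


# Illustrative program under test (made up for this sketch).
ORIGINAL = """
def count_below(values, limit):
    return sum(1 for v in values if v < limit)
"""


def run(source, *args):
    """Execute the program text and call its `count_below` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["count_below"](*args)
```

A summary that still says "counts values strictly below the limit" after the mutation would describe intent rather than behavior, which is the failure the abstract measures.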
|
| Qin, Xue |
Xue Qin and Mauricio Gruppi (Villanova University, USA) The rise of "vibe coding" has marginalized requirement analysis, allowing unverified "Make it Work" assumptions to accumulate into complex defects that evade standard testing. In the absence of a reliable ground truth oracle, we advocate a paradigm shift from correctness verification to logic consistency. We introduce Differential Logic Analysis (DLA), a framework that utilizes Logical Inference to detect internal contradictions across parallel and sequential development workflows. Preliminary simulations demonstrate that DLA successfully intercepts logic drift, such as contradictions and unspecified assumptions, and offers a new perspective on reliability assessment in the AI-assistant development era. |
|
| Rabbi, Fazle |
Fazle Rabbi and Jinqiu Yang (Concordia University, Canada) Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang (Concordia University, Canada) Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation. |
|
| Rastegari, Elham |
Sasan Azizian, Ayoub Hazrati, Artin Azizian, and Elham Rastegari (Bellevue University, USA; McGill University, Canada; Creighton University, USA) Object-relational mapping (ORM) frameworks simplify persistence, yet routine mapping choices (including inheritance strategy, association encoding, and denormalization) can substantially alter the generated SQL and the trade-offs among query latency, insert/update cost, and storage footprint. Existing optimization approaches either guarantee semantic validity but incur expensive per-candidate deployment and benchmarking, or learn from schema structure alone and therefore miss the fact that workload behavior is mapping-dependent. We present TriORM, a workload-aware neural-symbolic framework for multi-objective ORM mapping design that removes per-candidate workload execution from the online recommendation loop while preserving validity by construction. TriORM (i) enumerates admissible mappings via bounded relational synthesis in Alloy, (ii) concretizes an abstract workload into schema-specific SQL templates for each candidate, and (iii) predicts continuous objectives using a tri-input model that fuses a typed schema-graph encoder, a concretized-workload encoder, and interpretable static cost proxies. The resulting predictions enable Pareto filtering and user-weighted selection without deploying or executing each candidate at inference time; profiling is performed only offline to obtain supervision. On nine TradeMaker/Leant benchmark models, TriORM improves mean Pareto-front approximation over Leant (GD/HV 0.03/0.78 vs. 0.07/0.61) and reduces end-to-end recommendation time (3.6×10³ s vs. 2.6×10⁴ s), while preserving semantic correctness within the chosen synthesis bounds.
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui (Bellevue University, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA) Object-relational mapping (ORM) design remains largely driven by fixed heuristics that fail to capture workload-specific tradeoffs among query latency, insert cost, and memory footprint. We present Y-Map, a hybrid neural-symbolic framework for performance-aware ORM schema design that synthesizes valid schema candidates and predicts their performance without requiring workload execution at inference time. Y-Map leverages Alloy to enumerate correctness-preserving ORM schemas and ranks them using a multi-encoder regression model that fuses structural, syntactic, and semantic representations with compact schema-level features. By predicting continuous performance objectives (insert latency, query latency, and memory footprint), Y-Map enables Pareto-aware selection without per-candidate benchmarking during inference. We evaluate Y-Map on nine object models from e-commerce, banking, and healthcare. Relative to two representative baselines, Leant and DTS, Y-Map yields improved aggregate Pareto quality (Generational Distance and Hypervolume) while reducing inference time and memory overhead. The experimental results show that integrating symbolic validity guarantees with learned performance prediction provides a practical, scalable solution for workload-aware ORM optimization. |
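Both abstracts above select mappings by Pareto filtering over predicted objectives, followed by a user-weighted choice. A minimal sketch of that selection step follows, assuming all three objectives are minimized and combined by a plain weighted sum. The function names, the candidate labels, and the numbers are invented for illustration and do not come from either paper.

```python
from typing import Dict, List, Sequence, Tuple

# (query latency, insert latency, memory footprint); lower is better for all.
Objectives = Tuple[float, float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Keep only candidates that no other candidate dominates."""
    return [
        name
        for name, obj in candidates.items()
        if not any(
            dominates(other, obj)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


def weighted_choice(
    candidates: Dict[str, Objectives],
    front: List[str],
    weights: Sequence[float],
) -> str:
    """Pick the front member with the lowest weighted sum (an assumed policy)."""
    return min(front, key=lambda n: sum(w * v for w, v in zip(weights, candidates[n])))


# Made-up predicted objectives for four hypothetical mapping candidates.
PREDICTED = {
    "single_table": (1.0, 4.0, 9.0),
    "joined": (3.0, 2.0, 5.0),
    "table_per_class": (3.5, 2.5, 6.0),
    "union": (2.0, 3.0, 7.0),
}
```

With these numbers, `table_per_class` is dominated by `joined` and drops out; the weights then decide among the remaining three.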
|
| Rigby, Peter C. |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
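The aggregation of token-level attention into code units and the top-𝐾 selection described in this abstract can be sketched as follows. This is an illustration under stated assumptions: the abstract does not say how weights are pooled, so summation per hunk is assumed here, and `top_k_hunks` and its inputs are hypothetical names rather than the DRS model's interface.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def top_k_hunks(
    token_attention: Sequence[float],
    token_to_hunk: Sequence[Hashable],
    k: int = 2,
) -> List[Hashable]:
    """Sum attention per hunk (assumed pooling) and return the k highest."""
    totals = defaultdict(float)
    for weight, hunk in zip(token_attention, token_to_hunk):
        totals[hunk] += weight
    ranked = sorted(totals, key=lambda h: totals[h], reverse=True)
    return ranked[:k]


# Made-up attention weights for five tokens spread over three hunks.
ATTENTION = [0.10, 0.30, 0.05, 0.40, 0.15]
HUNKS = ["a", "a", "b", "c", "c"]
```

The same function works for lines or files by changing what `token_to_hunk` maps each token to.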
|
| Saha, Soumit Kanti |
Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang (Concordia University, Canada) Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation. |
|
| Saied, Mohamed Aymen |
Mostafa Anouar Ghorab, Ahmad Abdel Latif, and Mohamed Aymen Saied (Université Laval, Canada; University of Calgary, Canada) Kubernetes has become a central platform for orchestrating cloud-native applications, yet its declarative configuration model frequently introduces security misconfigurations that threaten system reliability and operational stability. Although automated detection tools are widely available, a systematic understanding of misconfiguration patterns and scalable correction mechanisms remains limited. This paper presents a comprehensive empirical study of Kubernetes security misconfigurations based on 2,662 developer-reported issues from Stack Overflow. From this dataset, we derive a structured taxonomy that captures recurring security weaknesses across configuration object types and misconfiguration categories. Using this taxonomy, we analyze how severity levels vary across objects and categories, and examine how security misconfigurations evolve between incubator and stable project stages. Our findings reveal that while some operational issues decrease as projects mature, critical security misconfigurations often persist or reappear, highlighting enduring risk patterns in cloud-native systems. Building on this empirical foundation, we evaluate the effectiveness of Large Language Models (LLMs) in automatically correcting Kubernetes security misconfigurations under progressively enriched contextual conditions. Results demonstrate that contextual grounding significantly improves correction accuracy, with the best standalone model achieving 89.06%. To further enhance structural correctness and schema compliance, we introduce Kubecurity, a schema-guided validation framework that enforces compliance with official Kubernetes specifications. By combining contextual LLM reasoning with deterministic schema enforcement, the proposed hybrid approach achieves 98.50% correction accuracy while substantially reducing newly introduced misconfigurations.
Overall, this work advances both the understanding and automated remediation of Kubernetes security misconfigurations. |
|
| Sampath, Harini |
Tao Dong, Sherry Shi, Harini Sampath, and Andrew Macvean (Google, USA) The ongoing transition of Large Language Models (LLMs) in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive software engineering (SWE) agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises. |
|
| Saurabh, Prasun |
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan) Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on manual human evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)–based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide binary pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Our results show that Gemini achieves higher recall and GPT produces more precise predictions, while both models exhibit low uncertainty. |
|
| Shang, Weiyi |
Xiaoyu Cheng, Kundi Yao, Pengyu Nie, and Weiyi Shang (University of Waterloo, Canada; Ontario Tech University, Canada) Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in recent open-weight LLMs in code generation using a two-stage memorization evaluation pipeline, which combines a similarity-based extractability filter with a targeted data extraction attack. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42-64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by a complex interplay of training domain, dataset composition, architectural choices, and content characteristics than by parameter count alone. |
|
| Sharafi, Zohreh |
Anthonia Oluchukwu Njoku, Zohreh Sharafi, and Foutse Khomh (Polytechnique Montréal, Canada) Autonomous coding agents are increasingly participating in collaborative software development by generating repository-level pull requests (PRs) that must be reviewed and integrated by human teams. While prior work has examined the technical characteristics of agent-generated patches, less is known about how autonomous authorship reshapes human collaboration dynamics in real-world workflows. In this paper, we investigate large-scale human–agent collaboration by comparing 40,214 pull requests across 2,807 GitHub repositories, including 33,596 agent-authored PRs from five autonomous coding agents and 6,618 human-authored PRs. We examine differences across three dimensions: integration outcomes, structural characteristics, and collaboration signals. Our findings reveal a socio-technical trade-off. Agent-authored PRs are integrated significantly faster, yet exhibit lower merge rates overall. Task type moderates this effect: agents outperform humans in documentation tasks but underperform in behavior-changing contributions. Beyond outcomes, collaboration patterns differ systematically. Agent-authored PRs attract proportionally more bot-generated comments and elicit more analytic, less socially oriented review communication. In contrast, human-authored PRs receive more elaborative and socially engaged feedback. Incorporating psycholinguistic features into predictive models significantly improves merge outcome prediction, demonstrating that communication style carries explanatory power beyond structural code characteristics. These results suggest that autonomous agents do not merely introduce technical artifacts into repositories, but actively reshape review interaction patterns and evaluative behavior. The impact of coding agents is therefore fundamentally socio-technical, highlighting the importance of studying AI systems within authentic human–AI collaborative environments. |
|
| Shi, Sherry |
Tao Dong, Sherry Shi, Harini Sampath, and Andrew Macvean (Google, USA) The ongoing transition of Large Language Models (LLMs) in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive software engineering (SWE) agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises. |
|
| Shi, Yuling |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Shin, Jiho |
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang (York University, Canada; Queen's University, Canada) Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks. Most existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a context-aware benchmark designed to evaluate LLMs on vulnerability detection with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches provide modest but promising improvements. The best-performing Claude-3.7-Sonnet-driven agent achieves an F1-score of 23.83% for vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware vulnerability detection with LLMs. |
|
| Spiess, Claudio |
Claudio Spiess, Prem Devanbu, and Earl T. Barr (University of California at Davis, USA; University College London, UK) LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits surprising instruction-sensitive brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and reinforce and extend the value of using perturbation to evaluate code models. |
|
| Srivastva, Kushagra |
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, and Mahfuza Farooque (Pennsylvania State University, USA) VeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL→PL translator, round-trip reconstruction (PL→NL) used as a high-precision acceptance gate, and canonical PL→CNF compilation, all executed via fixed API configuration (temperature = 0; fine-tuning runs use seed = 42) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On SatBench (2100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100-150 curated examples improves fidelity by about 1-1.5 pp without increasing latency (mean 25.8 s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability–coverage knob: at τ = 75, roughly 68% of items are retained with ~94% correctness on the accepted set. Validator overhead contributes < 15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL→logic front-ends into auditable, reproducible components for reliability-critical workflows. |
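The thresholded acceptance policy in this abstract trades coverage for correctness on the accepted set. A minimal sketch of how such a knob could be measured is shown below. It assumes each item carries a round-trip score on a 0-100 scale and a correctness flag, and that items scoring at or above τ are accepted; the function name and the sample data are hypothetical, and whether the real gate uses `>=` or `>` is not stated in the abstract.

```python
from typing import List, Optional, Tuple

# (round-trip similarity score, whether the SAT/UNSAT verdict was correct)
Item = Tuple[float, bool]


def acceptance_report(items: List[Item], tau: float) -> Tuple[float, Optional[float]]:
    """Return (coverage, correctness on the accepted set) for threshold tau."""
    accepted = [item for item in items if item[0] >= tau]  # assumed: inclusive gate
    coverage = len(accepted) / len(items) if items else 0.0
    if not accepted:
        return coverage, None  # correctness is undefined with nothing accepted
    correctness = sum(1 for _, ok in accepted if ok) / len(accepted)
    return coverage, correctness


# Made-up scores for four items; not data from the paper.
SAMPLE = [(90.0, True), (80.0, True), (70.0, False), (76.0, False)]
```

Sweeping `tau` over a range and plotting the two returned values against each other gives the reliability-coverage curve the abstract refers to.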
|
| Staron, Miroslaw |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result.
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Sui, Yulei |
Kaijie Liu and Yulei Sui (UNSW, Australia) Neural network (NN) verifiers are increasingly used to certify safety properties such as robustness (i.e., small allowed perturbations to an input should not alter a model’s decision). Since verifiers aim to prove the absence of violations by considering all possible specified behaviors, the soundness of their implementations is therefore critical to guaranteeing correctness. Detecting unsoundness is particularly important and challenging, because a verifier typically spans multiple components, including specifications, neural networks, operator semantics, and constraint solving, where subtle implementation bugs can silently lead to false certified results. We present an approach for neural network robustness verifiers that detects and localizes soundness-relevant faults via two types of concrete–abstract consistency checks: (1) Counterexample-Based Refutation (CBR), where a certification is supposed to be refuted if a concrete counterexample is found at runtime; and (2) Bounds-Based Localization (BBL), which audits per-neuron containment (concrete activations must lie within abstract bounds as an invariant) to pinpoint incorrect implementations at particular NN layers or operators. To reduce representation drift, we use specification-embedded models that wrap the core NN with input and output specifications as two additional layers. We further develop an operator-aware NN generator that can produce diverse NN models spanning a wide range of layer types, parameters, and architectures, enabling systematic exposure and exercise of different operator behaviors. We evaluate verifiers on three abstract domains using six mutation operators. Across 450 soundness-violating instances, our framework detects 72% of injected soundness violations. 
CBR mainly exposes input-output-level soundness failures when a concrete counterexample is found during input sampling, while BBL catches internal bound-containment violations and localizes them to specific layers/operators, even when CBR becomes ineffective in high-dimensional inputs. These results indicate that combining coarse refutation (CBR) with fine-grained invariant checking (BBL) provides assurance for verifiers, and operator-aware generation further boosts both coverage and discovery of unsoundness issues. |
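The Bounds-Based Localization invariant described above (every concrete activation must lie within its abstract bounds) can be sketched as a simple audit. This is an illustration only: it assumes per-layer lists of concrete activations and interval bounds are already available, the name `first_containment_violation` and the tolerance parameter are hypothetical, and computing the abstract bounds themselves is outside the sketch.

```python
from typing import List, Optional, Tuple

Layer = List[float]


def first_containment_violation(
    activations: List[Layer],
    lower: List[Layer],
    upper: List[Layer],
    tol: float = 1e-9,
) -> Optional[Tuple[int, int]]:
    """Return (layer index, neuron index) of the first activation outside
    its [lower, upper] interval, or None if every neuron is contained."""
    for layer, (acts, lows, highs) in enumerate(zip(activations, lower, upper)):
        for neuron, (value, low, high) in enumerate(zip(acts, lows, highs)):
            if value < low - tol or value > high + tol:
                return layer, neuron
    return None


# Made-up two-layer example: the second neuron of layer 1 escapes its bounds.
ACTIVATIONS = [[0.5, 1.0], [2.0, -0.1]]
LOWER = [[0.0, 0.0], [0.0, 0.0]]
UPPER = [[1.0, 1.0], [3.0, 1.0]]
```

Reporting the earliest violating layer is one plausible way to localize a faulty operator, since later layers may only inherit the error.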
|
| Sun, Simin |
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron (University of Gothenburg, Sweden; Chalmers University of Technology, Sweden) Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result.
These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE. |
|
| Sun, Weiyan |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff)—beyond rule-based lints—where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
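The aggregation step described in the abstract above (token-level attention rolled up into code units, then the top-𝐾 units shown to reviewers) can be illustrated with a minimal sketch. The abstract does not say how attention is aggregated, so summing per unit is an assumption here, and the function name `top_k_units` and the sample weights are hypothetical.

```python
from collections import defaultdict


def top_k_units(token_attention, token_to_unit, k=2):
    """Sum token-level attention per code unit and return the k highest-scoring units.

    Summation is an assumed aggregation rule; the abstract does not specify one.
    """
    totals = defaultdict(float)
    for weight, unit in zip(token_attention, token_to_unit):
        totals[unit] += weight
    # Sort by descending total attention; ties are broken by unit id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [unit for unit, _ in ranked[:k]]


# Hypothetical attention weights for six tokens spread over three hunks.
attention = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
hunks = ["hunk_a", "hunk_b", "hunk_b", "hunk_c", "hunk_c", "hunk_a"]
highlighted = top_k_units(attention, hunks, k=2)
```

The same function works for lines or files by changing what `token_to_unit` maps each token to.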
|
| Tasnim, Tasfia |
Tasfia Tasnim, Matthew B. Dwyer, and Soneya Binta Hossain (University of Texas at Dallas, USA; University of Virginia at Charlottesville, USA) Test oracles determine whether a program execution is correct for a given input. Two common forms are assertion oracles, which compare observed outputs with expected results, and exception oracles, which verify that a program raises an expected exception. Automated test oracle generation (TOG) aims to reduce the manual effort involved in constructing such oracles. Although recent TOG methods, especially LLM-based approaches, have made rapid progress, their evaluation remains constrained by benchmarks that rely on automatically generated tests, narrow single-assert formulations, simplified developer-written tests, or limited oracle diversity. To address these limitations, we introduce OE25𝑑𝑒𝑣, a multi-variant dataset curated from developer-written unit tests across 25 open-source Java projects spanning 56 modules, and TOGBench, an end-to-end benchmark suite for TOG. OE25𝑑𝑒𝑣 captures six oracle categories and preserves realistic settings, including single- and multi-oracle configurations, mixed assertion-and-exception oracles, and developer-authored custom oracles. TOGBench supports end-to-end experimentation by reintegrating generated oracles into runnable test suites and evaluating them via compilation, execution, false-positive analysis, and mutation testing. Our evaluation further shows that OE25𝑑𝑒𝑣 preserves substantially greater structural complexity than prior benchmarks and exposes marked performance degradation of representative TOG models on developer-written tests, particularly for assertion oracles. |
|
| Terragni, Valerio |
Elliott Wen, Chenye Ni, Valerio Terragni, and Jens Dietrich (University of Auckland, New Zealand; Massey University, New Zealand) Reproducible independent rebuilds strengthen software supply-chain integrity by recreating the original build environment and enforcing bitwise equivalence between artifacts. However, this approach implicitly assumes a trustworthy toolchain and can fail under adversarial manipulation of the build process itself (e.g., the Ken Thompson attack). Prior work has explored introducing diversity across build environments to reduce reliance on any single toolchain, and has proposed AI-driven methods to establish behavioural equivalence while tolerating benign build variability in the Java ecosystem. In this work, we extend this line of research to Rust and present RustBuildEq, a benchmark for training and evaluating binary equivalence classifier models under realistic build variability. We curate a large corpus of crates drawn from the top 20% of the crates.io ecosystem and construct datasets of equivalent (EQ) and non-equivalent (NEQ) pairs with rich provenance metadata. EQ pairs are generated from identical source revisions under varying toolchain versions and build configurations, while NEQ pairs are derived from AST rewrites or API-breaking changes across versions. Many Rust crates rely heavily on generics and cannot be compiled into binaries without specifying concrete types; to address this, we develop an automated approach that combines heuristic type instantiation, witness-type synthesis, and an iterative AI repair loop. RustBuildEq comprises 19,184,671 EQ records and 273,848,531 NEQ records, and includes a Python API for dataset navigation. The dataset provides large-scale ground truth for training and evaluating AI-driven models for reasoning about binary equivalence and is publicly available at https://doi.org/10.5281/zenodo.19244908. |
|
| Treude, Christoph |
Christoph Treude, Sebastian Baltes, and Marc Cheong (Singapore Management University, Singapore; Ruprecht-Karls-Universität Heidelberg, Germany; University of Melbourne, Australia) As AI coding agents become embedded in software development workflows, developers are beginning to operationalize ethical principles by encoding behavioral rules into repository-level context files for AI agents, such as AGENTS.md files. Rather than examining the ethics of AI agents in the abstract, this vision paper investigates how ethics and values are already being translated for AI agents into actionable instructions that shape agent behavior. Through a preliminary investigation, we find that developers are already embedding guidance related to fairness, accessibility, sustainability, tone, and privacy. These artifacts function as a developer-authored governance layer, translating abstract principles into situated, natural-language directives within development workflows. We outline a research agenda for studying this emerging practice, including how encoded values vary across communities, what governance dynamics emerge when multiple contributors negotiate these files, and whether agents reliably adhere to the constraints specified. Understanding how ethics and values are operationalized for AI agents is essential to ground AI governance in modern software engineering practice. Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore) Agentic AI coding tools increasingly automate software development tasks. Developers can configure these tools through versioned repository-level artifacts such as Markdown and JSON files. We present a systematic analysis of configuration mechanisms for agentic AI coding tools, covering Claude Code, GitHub Copilot, Cursor, Gemini, and Codex. 
We identify eight configuration mechanisms spanning from static context to executable and external integrations and, in an empirical study of 2,853 GitHub repositories, examine whether and how they are adopted, with a detailed analysis of Context Files, Skills, and Subagents. First, Context Files dominate the configuration landscape and are often the sole mechanism in a repository, with AGENTS.md emerging as an interoperable standard across tools. Second, few repositories adopt advanced mechanisms such as Skills and Subagents. Skills predominantly rely on static instructions rather than executable scripts. Third, distinct configuration practices are forming around different tools, with Claude Code users employing the broadest range of mechanisms. These findings establish an empirical baseline for understanding how developers configure agentic tools, suggest that AGENTS.md serves as a natural starting point, and motivate longitudinal and experimental research on how configuration strategies evolve and affect agent performance. Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes (Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore) Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. 
We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration. Christoph Treude (Singapore Management University, Singapore) AI coding assistants and autonomous agents are becoming integral to software development workflows, reshaping how code is produced, reviewed, and maintained. While recent research has focused mainly on the capabilities and productivity impacts of these systems, much less attention has been paid to accountability: who is responsible when agents generate, modify, or recommend code? In practice, accountability is defined through the Terms of Service (ToS) and related policy documents that govern the use of AI-powered development tools. In this vision paper, we present a comparative analysis of the Terms of Service for widely used AI coding assistants and agent-enabled development tools. We examine how these documents allocate ownership, responsibility, liability, and disclosure obligations between tool providers and software developers, and we identify common patterns and divergences between providers.
Our analysis reveals a consistent tendency to shift responsibility for correctness, safety, and legal compliance onto users, as well as substantial variation in how providers address issues such as indemnification, data reuse, and acceptable use. Based on these findings, we argue that existing policy frameworks are poorly aligned with increasingly agent-mediated and autonomous software development workflows. We outline a research roadmap for accountable agents in software engineering, identifying challenges and opportunities for modeling responsibility, designing governance artifacts, developing tooling that supports accountability, and conducting empirical studies of developers’ perceptions and practices. |
|
| Tüzün, Eray |
Mustafa Özkan İr, Mehmet Dedeler, Anıl Koyuncu, and Eray Tüzün (Bilkent University, Türkiye) Verifying bug fixes before patches are released to end users is a critical step in the software development lifecycle. However, this process is often manual, repetitive, and error-prone, especially for crash bugs triggered through Graphical User Interface (GUI) interactions in desktop applications. Despite recent advancements in LLM-driven software agents, existing work primarily targets bug reproduction without addressing bug fix verification, while approaches that do focus on verification rely on source code access, making them inapplicable to closed-source GUI-based desktop applications. This paper introduces Fixpad++, a framework designed to automatically verify bug fixes in the Notepad++ desktop application using LLM-powered agents. Fixpad++ employs a two-phase approach: first, a multi-modal multi-agent system interacts with the buggy version to reproduce the reported crash bug using visual parsing and LLM reasoning. Second, upon successful reproduction, a trajectory replay mechanism executes the recorded action sequence on the patched version to verify the fix. We evaluated Fixpad++ on FixPad-Bench, a new dataset of 105 evaluation instances derived from 22 real-world Notepad++ crash bugs, including correct and incorrect patches. The system achieved a reproduction success rate of 72.73% with an average time of 174.07 seconds. Among the successfully reproduced cases, Fixpad++ successfully verified correct fixes with 87.50% accuracy and detected incorrect fixes with 77.05% accuracy, outperforming OpenAI’s Computer-Using Agent (CUA). Fixpad++ demonstrates the effectiveness of specialized LLM agent architectures for automated bug fix verification in GUI-based desktop applications, offering a practical solution for automating verification workflows without requiring access to source code. |
|
| Uddin, Gias |
Haoran Xue, Gias Uddin, and Song Wang (York University, Canada) Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, it remains unclear which trace characteristics are informative and how good the reasoning chains are. In this paper, we present an empirical study examining the reasoning processes and the quality of thinking LLMs on code generation tasks. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) on 100 BigCodeBench code generation tasks (600 model–task instances; 3,772 reasoning steps). To characterize reasoning-chain structure, we measure step count and per-step verbosity, and compare successful versus failed attempts under difficulty stratification (Hard vs. Non-Hard). We further perform a 21-participant human evaluation of reasoning quality across three dimensions: efficiency, logical consistency, and completeness, and we build a taxonomy of problematic reasoning patterns. We find that the relationship between step count and success is model- and difficulty-dependent, and that verbosity is not a reliable correctness signal. Human analysis indicates that completeness issues dominate failures (44.5%), most often due to missed edge cases and boundary conditions, and incompleteness is a stronger predictor of failure on Hard tasks than on Non-Hard tasks (𝜌 = −0.219 vs. 𝜌 = −0.096). Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang (York University, Canada) SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. However, despite its broad adoption, a systematic evaluation of the quality of the benchmark remains lacking.
In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents in the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: the SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and the TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents. |
|
| Valle, Pablo |
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini (Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan) Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)–based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide binary pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Our results show that Gemini achieves higher recall and GPT produces more precise predictions, while both models exhibit low uncertainty. |
|
| Van Deursen, Arie |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
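The partial-credit scoring idea in the DeepRed abstract above can be sketched as the fraction of challenge-specific checkpoints an agent completes, averaged over challenges. Equal weighting of checkpoints and of challenges is an assumption, as are the function names and the sample counts; the abstract does not give the exact formula.

```python
def checkpoint_completion(completed, total):
    """Fraction of a challenge's checkpoints that the agent reached."""
    if total <= 0:
        raise ValueError("a challenge needs at least one checkpoint")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return completed / total


def average_completion(results):
    """Mean checkpoint completion over (completed, total) pairs, one pair per challenge."""
    if not results:
        raise ValueError("no challenge results given")
    scores = [checkpoint_completion(done, total) for done, total in results]
    return sum(scores) / len(scores)


# Hypothetical outcomes for one model on three challenges.
model_results = [(1, 4), (2, 4), (0, 5)]
score = average_completion(model_results)
```

Under a binary solved/unsolved metric all three hypothetical challenges would count as failures, whereas the checkpoint view still separates partial progress from none.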
|
| Varró, Dániel |
Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró (Linköping University, Sweden; University of Murcia, Spain) Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development. |
|
| Vasilescu, Bogdan |
Lara Khatib, Michael Pu, Bogdan Vasilescu, and Meiyappan Nagappan (University of Waterloo, Canada; Carnegie Mellon University, USA) As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code’s logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation–summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end), we find that, for GPT-4, summary accuracy decreases sharply with complexity from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples on 50 human-written programs from the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist, as models often describe algorithmic intent rather than actual mutated behavior, with a summary accuracy rate of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as “bugs”, both models continue to struggle with distinguishing implementation details from standard algorithmic patterns.
This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns, enabling direct comparison across models, prompts, and datasets. |
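The mutation-based check described in the abstract above (summarize, inject a mutation, see whether the new summary reflects the new behavior) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `summarize` and `reflects_mutation` are hypothetical stand-ins for an LLM call and a judging step, and the single decision mutation (`<` to `<=`) is chosen only for the example.

```python
import ast


class RelaxFirstLessThan(ast.NodeTransformer):
    """Decision mutation: rewrite the first `<` comparison found into `<=`."""

    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.LtE()
            self.done = True
        return node


def mutate(source):
    """Return the source with one decision mutation applied."""
    tree = RelaxFirstLessThan().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


def summary_accuracy(programs, summarize, reflects_mutation):
    """Share of mutants whose regenerated summary changed and matches the mutant."""
    hits = 0
    for original in programs:
        mutated = mutate(original)
        before, after = summarize(original), summarize(mutated)
        if after != before and reflects_mutation(after, mutated):
            hits += 1
    return hits / len(programs)


# Toy stand-ins (assumptions): a "summarizer" that reads the operator, and a
# stale one that always describes the intended behavior.
SOURCE = "def is_minor(age):\n    return age < 18\n"


def attentive(src):
    return "inclusive bound" if "<=" in src else "strict bound"


def stale(src):
    return "strict bound"


def judge(summary, src):
    return ("inclusive" in summary) == ("<=" in src)


attentive_score = summary_accuracy([SOURCE], attentive, judge)
stale_score = summary_accuracy([SOURCE], stale, judge)
```

The stale summarizer scores zero because its summary never changes, which mirrors the intent-versus-behavior failure the abstract describes.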
|
| Vízner, Roland |
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi (Delft University of Technology, Netherlands) Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation. |
|
| Vu, Brian |
Xin Zhao, Brian Vu, and Sitesh Pattanaik (Seattle University, USA; University of California at Irvine, USA) Background: Artificial intelligence is rapidly changing the landscape of software development. With the unique ability to quickly generate code and the potential to disrupt traditional workflows, AI tools have found growing adoption within the software development process. Subsequently, this topic has been the focus of academic work, including research examining qualitative impacts to productivity and the analysis of sentiments from the developers who utilize AI tools. While this material is extensive, our research team identified a gap within existing literature: what do software managers have to say? Goals: The overarching goal of this study is to examine the views of software managers on how AI tools have affected software development. We seek to understand how managers, who leverage a top-down view of the development process, perceive the influence of AI on developers, their own roles, and the broader labor market. Methodology: To answer these questions, we conducted an empirical study by releasing an online questionnaire containing both qualitative and quantitative questions, sampling software managers employed across both tech-focused and non-tech-focused companies. Results: Through a survey of 42 managers, we found that managers hold nuanced views on the introduction of AI into software development. They encourage developers to use AI, perceive it as valuable for testing, and apply it themselves for knowledge work. At the same time, they raise concerns about privacy, responsibility, transparency, and over-reliance. Many also predict a loss of jobs within the software development market due to consolidation driven by AI. Conclusion: AI is seen by managers as both a powerful productivity tool and a source of new ethical challenges. 
Our investigation paves the way for a comprehensive understanding of how AI is perceived by those who directly manage the introduction of these tools into traditional software development workflows, revealing a road map for future endeavors for the software development community. |
|
| Wang, Song |
Haoran Xue, Gias Uddin, and Song Wang (York University, Canada) Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, it remains unclear which trace characteristics are informative and how good the reasoning chains are. In this paper, we present an empirical study examining the reasoning processes and the quality of thinking LLMs on code generation tasks. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) on 100 BigCodeBench code generation tasks (600 model–task instances; 3,772 reasoning steps). To characterize reasoning-chain structure, we measure step count and per-step verbosity, and compare successful versus failed attempts under difficulty stratification (Hard vs. Non-Hard). We further perform a 21-participant human evaluation of reasoning quality across three dimensions: efficiency, logical consistency, and completeness, and we build a taxonomy of problematic reasoning patterns. We find that the relationship between step count and success is model- and difficulty-dependent, and that verbosity is not a reliable correctness signal. Human analysis indicates that completeness issues dominate failures (44.5%), most often due to missed edge cases and boundary conditions, and incompleteness is a stronger predictor of failure on Hard tasks than on Non-Hard tasks (𝜌 = −0.219 vs. 𝜌 = −0.096). Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang (York University, Canada) SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. However, despite its broad adoption, a systematic evaluation of the quality of the benchmark remains lacking.
In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents in the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: the SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and the TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents. Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang (York University, Canada; Queen's University, Canada) Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks.
Most existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a context-aware benchmark designed to evaluate LLMs on vulnerability detection with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches provide modest but promising improvements. The best-performing Claude-3.7-Sonnet-driven agent achieves an F1-score of 23.83% for vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware vulnerability detection with LLMs. |
|
| Wang, Yiran |
Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró (Linköping University, Sweden; University of Murcia, Spain) Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development. |
|
| Wang, Yongpan |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Wen, Elliott |
Elliott Wen, Chenye Ni, Valerio Terragni, and Jens Dietrich (University of Auckland, New Zealand; Massey University, New Zealand) Reproducible independent rebuilds strengthen software supply-chain integrity by recreating the original build environment and enforcing bitwise equivalence between artifacts. However, this approach implicitly assumes a trustworthy toolchain and can fail under adversarial manipulation of the build process itself (e.g., the Ken Thompson attack). Prior work has explored introducing diversity across build environments to reduce reliance on any single toolchain, and has proposed AI-driven methods to establish behavioural equivalence while tolerating benign build variability in the Java ecosystem. In this work, we extend this line of research to Rust and present RustBuildEq, a benchmark for training and evaluating binary equivalence classifier models under realistic build variability. We curate a large corpus of crates drawn from the top 20% of the crates.io ecosystem and construct datasets of equivalent (EQ) and non-equivalent (NEQ) pairs with rich provenance metadata. EQ pairs are generated from identical source revisions under varying toolchain versions and build configurations, while NEQ pairs are derived from AST rewrites or API-breaking changes across versions. Many Rust crates rely heavily on generics and cannot be compiled into binaries without specifying concrete types; to address this, we develop an automated approach that combines heuristic type instantiation, witness-type synthesis, and an iterative AI repair loop. RustBuildEq comprises 19,184,671 EQ records and 273,848,531 NEQ records, and includes a Python API for dataset navigation. The dataset provides large-scale ground truth for training and evaluating AI-driven models for reasoning about binary equivalence and is publicly available at https://doi.org/10.5281/zenodo.19244908. |
|
| Won, Jun Yeon |
Jun Yeon Won, Xin Jin, Shiqing Ma, and Zhiqiang Lin (Ohio State University, Columbus, USA; Meta, USA; University of Massachusetts at Amherst, USA) Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of two LLMs, and the results demonstrate difficulties on complex tasks. |
|
| Xie, Chaoxiang |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Xue, Haoran |
Haoran Xue, Gias Uddin, and Song Wang (York University, Canada) Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, it remains unclear which trace characteristics are informative and how good the reasoning chains are. In this paper, we present an empirical study examining the reasoning processes and reasoning quality of thinking LLMs on code generation tasks. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) on 100 BigCodeBench code generation tasks (600 model–task instances; 3,772 reasoning steps). To characterize reasoning-chain structure, we measure step count and per-step verbosity, and compare successful versus failed attempts under difficulty stratification (Hard vs. Non-Hard). We further perform a 21-participant human evaluation of reasoning quality across three dimensions: efficiency, logical consistency, and completeness, and we build a taxonomy of problematic reasoning patterns. We find that the relationship between step count and success is model- and difficulty-dependent, and that verbosity is not a reliable correctness signal. Human analysis indicates that completeness issues dominate failures (44.5%), most often due to missed edge cases and boundary conditions, and incompleteness is a stronger predictor of failure on Hard tasks than on Non-Hard tasks (𝜌 = −0.219 vs. 𝜌 = −0.096). Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang (York University, Canada) SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. However, despite its broad adoption, a systematic evaluation of the quality of the benchmark remains lacking. 
In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents in the SWE-Bench leaderboard during our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: the SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and the TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage, and 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on Lite and 36.27 percentage points on Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents. |
|
| Yang, Jinqiu |
Fazle Rabbi and Jinqiu Yang (Concordia University, Canada) Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations. Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang (Concordia University, Canada) Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. 
Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation. |
|
| Yao, Kundi |
Xiaoyu Cheng, Kundi Yao, Pengyu Nie, and Weiyi Shang (University of Waterloo, Canada; Ontario Tech University, Canada) Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in recent open-weight LLMs in code generation using a two-stage memorization evaluation pipeline, which combines a similarity-based extractability filter with a targeted data extraction attack. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42-64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by a complex interplay of training domain, dataset composition, architectural choices, and content characteristics than by parameter count alone. |
|
| Zeller, Andreas |
Norman Becker, Tural Mammadov, and Andreas Zeller (CISPA Helmholtz Center for Information Security, Germany) In the past years, large language models (LLMs) have demonstrated remarkable progress in code generation. However, their ability to reason about program behavior remains an open challenge—an ability that is relevant for applications including reverse engineering, debugging, secure code generation, test-driven synthesis, input reconstruction, reverse fuzzing, behavioral monitoring, and safe execution modeling. To study this ability, we examine the capacity of LLMs to reason about the semantics of code—specifically, their ability to relate code, its inputs, and its outputs to each other. To this end, we investigate whether and how well LLMs can predict one of these three components given the other two—that is, (1) predict the input given code and output, (2) predict the output given code and input, and (3) predict the code given input and output. This way, we assess how well LLMs can reason about and understand the underlying relationships that govern program execution. We construct four datasets covering string processing, array operations, and coding challenges in JavaScript and Python to evaluate diverse program-understanding capabilities, incorporating various code mutation techniques to increase complexity. In our evaluation on tasks covering string processing, array operations, and coding challenges, we find that closed-weight models achieve the strongest performance across all datasets, including perfect input recovery on deterministic string tasks. Across tasks, output prediction is comparatively stable, whereas code prediction remains the hardest setting and often fails for smaller models. Finally, cross-codebase transfer is feasible, especially for input prediction, but highly sensitive to model capacity and fine-tuning strategy. |
|
| Zeng, Wenhao |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Zhang, Cong |
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan (Facebook, USA; Facebook, Canada; Rice University, USA; Southern Methodist University, USA; University of Tennessee at Knoxville, USA; Concordia University, Montreal, Canada) Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. In the context of just-in-time defect prediction, highlighting small portions of a software change (diff), beyond rule-based lints, where risk may be concentrated has not yet been extensively investigated. In this work, we leverage attention weights from an LLM-based Diff Risk Score (DRS) model to highlight parts of a diff that the model focuses on when predicting risk. We aggregate token-level attention into interpretable code units (lines, hunks, and files), and present the top-𝐾 units to developers as a lightweight form of guidance during code review. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that the highlighted snippets cover expert-labeled outage-causing change lines 53.85% of the time when highlighting the top-2 hunks, while requiring developers to review 26.28% of the changed lines on average. Because attention is produced during standard model inference, the approach is scalable for large development workflows and can be surfaced in the code review UI with low additional latency. |
|
| Zhang, Hongyu |
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu (Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China) LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck. |
|
| Zhao, Xin |
Xin Zhao, Brian Vu, and Sitesh Pattanaik (Seattle University, USA; University of California at Irvine, USA) Background: Artificial intelligence is rapidly changing the landscape of software development. With the unique ability to quickly generate code and the potential to disrupt traditional workflows, AI tools have found growing adoption within the software development process. Subsequently, this topic has been the focus of academic work, including research examining qualitative impacts to productivity and the analysis of sentiments from the developers who utilize AI tools. While this material is extensive, our research team identified a gap within existing literature: what do software managers have to say? Goals: The overarching goal of this study is to examine the views of software managers on how AI tools have affected software development. We seek to understand how managers, who leverage a top-down view of the development process, perceive the influence of AI on developers, their own roles, and the broader labor market. Methodology: To answer these questions, we conducted an empirical study by releasing an online questionnaire containing both qualitative and quantitative questions, sampling software managers employed across both tech-focused and non-tech-focused companies. Results: Through a survey of 42 managers, we found that managers hold nuanced views on the introduction of AI into software development. They encourage developers to use AI, perceive it as valuable for testing, and apply it themselves for knowledge work. At the same time, they raise concerns about privacy, responsibility, transparency, and over-reliance. Many also predict a loss of jobs within the software development market due to consolidation driven by AI. Conclusion: AI is seen by managers as both a powerful productivity tool and a source of new ethical challenges. 
Our investigation paves the way for a comprehensive understanding of how AI is perceived by those who directly manage the introduction of these tools into traditional software development workflows, revealing a road map for future endeavors for the software development community. |
175 authors