AIware 2026
3rd ACM International Conference on AI-Powered Software (AIware 2026)

3rd ACM International Conference on AI-Powered Software (AIware 2026), July 6–7, 2026, Montreal, QC, Canada

AIware 2026 – Proceedings

Frontmatter

Title Page
Article: fseaiware26foreword-fm000-p doi:

Welcome from the Chairs
Welcome to the 3rd ACM International Conference on AI-Powered Software (AIware 2026), held on 6th and 7th July 2026 in Montreal, Canada, co-located with the ACM International Conference on the Foundations of Software Engineering (FSE 2026). Following the success of the first two editions, AIware 2026 continues to bring together researchers, practitioners, educators, and industry leaders to examine the transformative changes driven by Foundation Models, including Large Language Models, and by the emerging generation of autonomous and collaborative AI-powered software systems. The conference aims to foster cross-disciplinary dialogue, identify emerging research challenges, and establish a forward-looking research agenda for the era of Foundation Models and AI-powered software.
AIware is founded on the vision that “software for all and by all” is the future of humanity. AI-powered software has the potential to democratize software creation, enabling individuals of diverse backgrounds and levels of software engineering expertise to participate in the creation, evolution, and oversight of software with higher reliability and quality. As software evolves from human-driven Codeware to Neuralware, Promptware, Agentware, and eventually Mindware, AIware is reshaping the assumptions on which software engineering has long been built. The boundary between users, developers, tools, and software artifacts is becoming increasingly fluid: users can express intent directly, agents can plan and act across complex development environments, and software systems can increasingly adapt, generate, and evolve through interaction. This shift raises fundamental questions about specification, correctness, trust, accountability, collaboration, maintainability, education, and human oversight. The software engineering community must therefore reimagine its theories, methods, tools, education, and research agendas for a world in which software creation is no longer limited to expert developers, but increasingly involves Software Makers working together with intelligent agents. We believe AIware provides a timely forum for the community to examine these questions collectively, move beyond isolated advances in models or tools, and develop principled foundations for the next generation of software and software engineering.

Article: fseaiware26foreword-fm001-p doi:

AIware 2026 Organization
Article: fseaiware26foreword-fm002-p doi:

Main Track

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests
Mohamed Almukhtar, Anwar Ghammam, and Hua Ming
(University of Michigan at Flint, USA; University of Michigan at Dearborn, USA)

Publisher's Version Article: fseaiware26main-pp063-p doi:10.1145/3805760.3814886

Configuring Agentic AI Coding Tools: An Exploratory Study
Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes
(Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Singapore Management University, Singapore)

Publisher's Version Article: fseaiware26main-pp062-p doi:10.1145/3805760.3814887

Can LLMs Really Reason about Code? Studying How Well LLMs Understand the Relation between Input, Code, and Output
Norman Becker, Tural Mammadov, and Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)

Publisher's Version Article: fseaiware26main-pp061-p doi:10.1145/3805760.3814888

Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
Christoph Treude
(Singapore Management University, Singapore)

Publisher's Version Article: fseaiware26main-pp056-p doi:10.1145/3805760.3814889

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang
(Concordia University, Canada)

Publisher's Version Article: fseaiware26main-pp054-p doi:10.1145/3805760.3814890

Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization
Naing Oo Lwin and Rajesh Kumar
(Bucknell University, USA)

Publisher's Version Article: fseaiware26main-pp053-p doi:10.1145/3805760.3814891

Using Mutation-Analysis to Examine an LLM’s Ability to Summarize Code
Lara Khatib, Michael Pu, Bogdan Vasilescu, and Meiyappan Nagappan
(University of Waterloo, Canada; Carnegie Mellon University, USA)

Publisher's Version Article: fseaiware26main-pp051-p doi:10.1145/3805760.3814892

Collaborator or Assistant? How AI Coding Agents Partition Work across Pull Request Lifecycles
Young Jo Chung and Safwat Hassan
(University of Toronto, Canada)

Publisher's Version Article: fseaiware26main-pp050-p doi:10.1145/3805760.3814893

Testing AIware Systems: A Software Engineering Survey
Karla Gonzalez and Mariam El Mezouar
(Royal Military College of Canada, Canada)

Publisher's Version Article: fseaiware26main-pp049-p doi:10.1145/3805760.3814894

Zombie Agents: Detecting Semantic Livelock in Long-Horizon Autonomous Software
Simarjot Khanna
(Independent Researcher, Canada)

Publisher's Version Article: fseaiware26main-pp047-p doi:10.1145/3805760.3814895

Executable but Unlearnable: Designing Code That Resists LLM-Based Learning
Viraaji Mothukuri and Reza M. Parizi
(Kennesaw State University, USA)

Publisher's Version Article: fseaiware26main-pp044-p doi:10.1145/3805760.3814896

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan
(University of Toronto, Canada; Trent University, Canada)

Publisher's Version Article: fseaiware26main-pp043-p doi:10.1145/3805760.3814897

From Code Review to Spec-Driven Contracts: A Vision for Auditable AIWare Systems
Mohammad Hamdaqa and Moataz Chouchen
(Polytechnique Montréal, Canada; Concordia University, Canada)

Publisher's Version Article: fseaiware26main-pp042-p doi:10.1145/3805760.3814898

Operationalizing Ethics for AI Agents: How Developers Encode Values into Repository Context Files
Christoph Treude, Sebastian Baltes, and Marc Cheong
(Singapore Management University, Singapore; Ruprecht-Karls-Universität Heidelberg, Germany; University of Melbourne, Australia)

Publisher's Version Article: fseaiware26main-pp041-p doi:10.1145/3805760.3814899

Detecting Unsoundness in Neural Network Verifiers via Concrete–Abstract Consistency
Kaijie Liu and Yulei Sui
(UNSW, Australia)
Neural network (NN) verifiers are increasingly used to certify safety properties such as robustness (i.e., small allowed perturbations to an input should not alter a model’s decision). Since verifiers aim to prove the absence of violations by considering all possible specified behaviors, the soundness of their implementations is critical to guaranteeing correctness. Detecting unsoundness is both important and challenging, because a verifier typically spans multiple components, including specifications, neural networks, operator semantics, and constraint solving, where subtle implementation bugs can silently lead to false certified results.
We present an approach for neural network robustness verifiers that detects and localizes soundness-relevant faults via two types of concrete–abstract consistency checks: (1) Counterexample-Based Refutation (CBR), where a certification is supposed to be refuted if a concrete counterexample is found at runtime; and (2) Bounds-Based Localization (BBL), which audits per-neuron containment (concrete activations must lie within abstract bounds as an invariant) to pinpoint incorrect implementations at particular NN layers or operators. To reduce representation drift, we use specification-embedded models that wrap the core NN with input and output specifications as two additional layers. We further develop an operator-aware NN generator that can produce diverse NN models spanning a wide range of layer types, parameters, and architectures, enabling systematic exposure and exercise of different operator behaviors.
We evaluate verifiers on three abstract domains using six mutation operators. Across 450 soundness-violating instances, our framework detects 72% of injected soundness violations. CBR mainly exposes input-output-level soundness failures when a concrete counterexample is found during input sampling, while BBL catches internal bound-containment violations and localizes them to specific layers/operators, even when CBR becomes ineffective in high-dimensional inputs. These results indicate that combining coarse refutation (CBR) with fine-grained invariant checking (BBL) provides assurance for verifiers, and operator-aware generation further boosts both coverage and discovery of unsoundness issues.

Publisher's Version Article: fseaiware26main-pp040-p doi:10.1145/3805760.3814900

Artifact Readiness Gates with Saturation Stop Rules and Host-Parity Admissibility for FM Release Evaluation
Yanick Kanyiki
(InvarLock, Canada)

Publisher's Version Article: fseaiware26main-pp039-p doi:10.1145/3805760.3814901

When AI Coding Assistants Leak Training Data: A Study of LLM Memorization in Code Generation
Xiaoyu Cheng, Kundi Yao, Pengyu Nie, and Weiyi Shang
(University of Waterloo, Canada; Ontario Tech University, Canada)

Publisher's Version Article: fseaiware26main-pp038-p doi:10.1145/3805760.3814902

SOSecure: The Wisdom of the Crowd for Safer AI-Generated Code
Manisha Mukherjee and Vincent Josua Hellendoorn
(Carnegie Mellon University, USA; Google, USA)

Publisher's Version Article: fseaiware26main-pp037-p doi:10.1145/3805760.3814903

From Correctness to Consistency: Redefining Reliability for the Agentware Era
Xue Qin and Maurício Gruppi
(Villanova University, USA)

Publisher's Version Article: fseaiware26main-pp034-p doi:10.1145/3805760.3814904

An Empirical Study of Reasoning Steps in Thinking Code LLMs
Haoran Xue, Gias Uddin, and Song Wang
(York University, Canada)

Publisher's Version Article: fseaiware26main-pp033-p doi:10.1145/3805760.3814905

VeriTrans: Fine-Tuned LLM-Assisted NL→PL Translation via a Deterministic Neuro-symbolic Pipeline
Xuan Liu, Dheeraj Kodakandla, Kushagra Srivastva, and Mahfuza Farooque
(Pennsylvania State University, USA)

Publisher's Version Article: fseaiware26main-pp032-p doi:10.1145/3805760.3814906

Is Artificial Intelligence an Elixir to the Software Engineering Community? An Empirical Study among Managers
Xin Zhao, Brian Vu, and Sitesh Pattanaik
(Seattle University, USA; Amazon, USA)
Artificial intelligence is rapidly changing the landscape of software development. With the unique ability to quickly generate code and the potential to disrupt traditional workflows, AI tools have found growing adoption within the software development process. Consequently, this topic has been the focus of academic work, including research examining qualitative impacts on productivity and the analysis of sentiments from the developers who utilize AI tools. While this material is extensive, our research team identified a gap in the existing literature: what do software managers have to say? The overarching goal of this study is to examine the views of software managers on how AI tools have affected software development. We seek to understand how managers, who leverage a top-down view of the development process, perceive the influence of AI on developers, their own roles, and the broader labor market. To answer these questions, we conducted an empirical study by releasing an online questionnaire containing both qualitative and quantitative questions, sampling software managers employed across both tech-focused and non-tech-focused companies. Through a survey of 42 managers, we found that managers hold nuanced views on the introduction of AI into software development. They encourage developers to use AI, perceive it as valuable for testing, and apply it themselves for knowledge work. At the same time, they raise concerns about privacy, responsibility, transparency, and over-reliance. Many also predict a loss of jobs within the software development market due to consolidation driven by AI. Overall, AI is seen by managers as both a powerful productivity tool and a source of new ethical challenges.
Our investigation paves the way for a comprehensive understanding of how AI is perceived by those who directly manage the introduction of these tools into traditional software development workflows, revealing a road map for future endeavors for the software development community.

Publisher's Version

ACM SIGSOFT Distinguished Paper Award Article: fseaiware26main-pp031-p doi:10.1145/3805760.3814907

Kubernetes Misconfigurations in the Wild: Taxonomy, Evolution, and Automated Repair with Large Language Models
Mostafa Anouar Ghorab, Ahmad Abdellatif, and Mohamed Aymen Saied
(Université Laval, Canada; University of Calgary, Canada)
Kubernetes has become a central platform for orchestrating cloud-native applications, yet its declarative configuration model frequently introduces security misconfigurations that threaten system reliability and operational stability. Although automated detection tools are widely available, a systematic understanding of misconfiguration patterns and scalable correction mechanisms remains limited. This paper presents a comprehensive empirical study of Kubernetes security misconfigurations based on 2,662 developer-reported issues from Stack Overflow. From this dataset, we derive a structured taxonomy that captures recurring security weaknesses across configuration object types and misconfiguration categories. Using this taxonomy, we analyze how severity levels vary across objects and categories, and examine how security misconfigurations evolve between incubator and stable project stages. Our findings reveal that while some operational issues decrease as projects mature, critical security misconfigurations often persist or reappear, highlighting enduring risk patterns in cloud-native systems. Building on this empirical foundation, we evaluate the effectiveness of Large Language Models (LLMs) in automatically correcting Kubernetes security misconfigurations under progressively enriched contextual conditions. Results demonstrate that contextual grounding significantly improves correction accuracy, with the best standalone model achieving 89.06%. To further enhance structural correctness and schema compliance, we introduce Kubecurity, a schema-guided validation framework that enforces compliance with official Kubernetes specifications. By combining contextual LLM reasoning with deterministic schema enforcement, the proposed hybrid approach achieves 98.50% correction accuracy while substantially reducing newly introduced misconfigurations. Overall, this work advances both the understanding and automated remediation of Kubernetes security misconfigurations.

Publisher's Version Article: fseaiware26main-pp030-p doi:10.1145/3805760.3814908

When Code Authors Are Agents: A Large-Scale Study of Human–Agent Collaboration in Pull Requests
Anthonia Oluchukwu Njoku, Zohreh Sharafi, and Foutse Khomh
(Polytechnique Montréal, Canada)

Publisher's Version Article: fseaiware26main-pp028-p doi:10.1145/3805760.3814909

VISOR: A Vision-Language Model-Based Test Oracle for Testing Robots
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini
(Simula Research Laboratory, Norway; Oslo Metropolitan University, Norway; Mondragon University, Spain; Tokyo Institute of Technology, Japan)

Publisher's Version Article: fseaiware26main-pp025-p doi:10.1145/3805760.3814910

Wink: Recovering from Misbehaviors in Coding Agents
Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra
(Meta Platforms, USA)

Publisher's Version Article: fseaiware26main-pp024-p doi:10.1145/3805760.3814911

Co-located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Éric Jacopin
(Cosmic AI, France)

Publisher's Version

Published Artifact

Artifacts Available Article: fseaiware26main-pp023-p doi:10.1145/3805760.3814912

Towards AI as a Collaborative Partner: A Taxonomy of AI Agent Behavior in Software Engineering
Tao Dong, Sherry Shi, Harini Sampath, and Andrew Macvean
(Google, USA)

Publisher's Version Article: fseaiware26main-pp020-p doi:10.1145/3805760.3814913

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development
Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, and Miroslaw Staron
(Chalmers University of Technology - University of Gothenburg, Sweden; Scania, Sweden; Volvo Car Corporation, Sweden; Ericsson, Sweden)

Publisher's Version Article: fseaiware26main-pp019-p doi:10.1145/3805760.3814914

Fixpad++: Automated Bug Fix Verification using LLM Agents
Mustafa Özkan İr, Mehmet Dedeler, Anıl Koyuncu, and Eray Tüzün
(Bilkent University, Türkiye)

Publisher's Version Article: fseaiware26main-pp017-p doi:10.1145/3805760.3814915

A Preliminary Study on Explaining Risk of Code Changes using LLM-Based Prediction Models
Yalin Liu, Kosay Jabre, Rui Abreu, Zachariah J. Carmichael, Vijayaraghavan Murali, Akshay Patel, Jun Ge, Weiyan Sun, Cong Zhang, Audris Mockus, David Khavari, Peter C. Rigby, and Nachiappan Nagappan
(Meta Platforms, USA)

Publisher's Version Article: fseaiware26main-pp016-p doi:10.1145/3805760.3814916

Towards Migrating Neural Network Implementations
Nadia Daoudi, Iván Alfonso, and Jordi Cabot
(Luxembourg Institute of Science and Technology, Luxembourg; University of Luxembourg, Luxembourg)

Publisher's Version Article: fseaiware26main-pp014-p doi:10.1145/3805760.3814917

Auditing Who Appears to Belong: A Large-Scale Empirical Study of Bias in Deployed Text-to-Image Systems for Software Engineering
Mohamad Kassab
(Boston University, USA)

Publisher's Version Article: fseaiware26main-pp008-p doi:10.1145/3805760.3814918

How Robustly Do LLMs Understand Execution Semantics?
Claudio Spiess, Prem Devanbu, and Earl T. Barr
(University of California at Davis, USA; University College London, UK)

Publisher's Version Article: fseaiware26main-pp005-p doi:10.1145/3805760.3814919

TriORM: Workload-Aware Neural-Symbolic Multi-objective Optimization for ORM Mapping Design
Sasan Azizian, Ayoub Hazrati, Artin Azizian, and Elham Rastegari
(Bellevue University, USA; Vanguard Group, USA; McGill University, Canada; Creighton University, USA)

Publisher's Version Article: fseaiware26main-pp004-p doi:10.1145/3805760.3814920

Neural-Symbolic Multi-objective Optimization for Performance-Aware ORM Database Design
Sasan Azizian, Ayoub Hazrati, Artin Azizian, Elham Rastegari, Hamid Bagheri, and Juan Cui
(Bellevue University, USA; Vanguard Group, USA; McGill University, Canada; Creighton University, USA; University of Nebraska-Lincoln, USA)

Publisher's Version Article: fseaiware26main-pp002-p doi:10.1145/3805760.3814921

Benchmark and Dataset Track

A Dataset of Agentic AI Coding Tool Configurations
Matthias Galster, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes
(Otto-Friedrich-Universität Bamberg, Germany; Ruprecht-Karls-Universität Heidelberg, Germany; Universität Bayreuth, Germany; Singapore Management University, Singapore)

Publisher's Version Article: fseaiware26main-pp028-data-p doi:10.1145/3805760.3814922

AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
Daniel Ogenrwot and John Businge
(University of Nevada at Las Vegas, USA)

Publisher's Version Article: fseaiware26main-pp027-data-p doi:10.1145/3805760.3814923

SWE-Bench+: Enhanced LLM Coding Benchmark
Haoran Xue, Reem Aleithan, Nafid Enan, Gias Uddin, and Song Wang
(York University, Canada)

Publisher's Version Article: fseaiware26main-pp024-data-p doi:10.1145/3805760.3814924

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, and Xiaodong Gu
(Shanghai Jiao Tong University, China; Hohai University, China; Chongqing University, China)

Publisher's Version Article: fseaiware26main-pp023-data-p doi:10.1145/3805760.3814925

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, and Maliheh Izadi
(Delft University of Technology, Netherlands)

Publisher's Version Article: fseaiware26main-pp022-data-p doi:10.1145/3805760.3814926

TOGBench: A Developer-Written Multi-variant Dataset and Benchmark Suite for Test Oracle Generation
Tasfia Tasnim, Matthew B. Dwyer, and Soneya Binta Hossain
(University of Texas at Dallas, USA; University of Virginia at Charlottesville, USA)

Publisher's Version Article: fseaiware26main-pp018-data-p doi:10.1145/3805760.3814927

CrossCommitVuln-Bench: A Dataset of Multi-commit Python Vulnerabilities Invisible to Per-Commit Static Analysis
Arunabh Majumdar
(Independent Researcher, India)

Publisher's Version Article: fseaiware26main-pp012-data-p doi:10.1145/3805760.3814928

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
Fazle Rabbi and Jinqiu Yang
(Concordia University, Canada)

Publisher's Version Article: fseaiware26main-pp011-data-p doi:10.1145/3805760.3814929

RustBuildEq: A Benchmark for Binary Equivalence under Build Variability
Elliott Wen, Chenye Ni, Valerio Terragni, and Jens Dietrich
(University of Auckland, New Zealand; Victoria University of Wellington, New Zealand)

Publisher's Version Article: fseaiware26main-pp010-data-p doi:10.1145/3805760.3814930

AgentTelemetry: A Fault Detection Benchmark and Toolkit for LLM Agent Observability
Krishna Chaitanya Balusu
(Independent Researcher, USA)

Publisher's Version

Published Artifact

Artifacts Available Article: fseaiware26main-pp009-data-p doi:10.1145/3805760.3814931

SecVulEval: Context-Aware Benchmarking of LLMs for Vulnerability Detection
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang
(York University, Canada; Queen's University, Canada)

Publisher's Version Article: fseaiware26main-pp008-data-p doi:10.1145/3805760.3814932

SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection
Mariam ALMutairi and Chang-Tien Lu
(Virginia Polytechnic Institute, USA)

Publisher's Version Article: fseaiware26main-pp007-data-p doi:10.1145/3805760.3814933

JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró
(Linköping University, Sweden; University of Murcia, Spain)

Publisher's Version

Published Artifact

Artifacts Available Article: fseaiware26main-pp006-data-p doi:10.1145/3805760.3814934

REBench: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names
Jun Yeon Won, Xin Jin, Shiqing Ma, and Zhiqiang Lin
(Ohio State University, Columbus, USA; Meta, USA; University of Massachusetts at Amherst, USA)

Publisher's Version

Published Artifact

Artifacts Available Article: fseaiware26main-pp003-data-p doi:10.1145/3805760.3814935
