ICST Workshops 2025 – Proceedings

Welcome to the AIST 2025 Workshop
Welcome to the 5th International Workshop on Artificial Intelligence in Software Testing (AIST), co-located with the IEEE International Conference on Software Testing, Verification, and Validation (ICST) 2025, in the beautiful city of Naples, Italy. As the Program Co-Chairs of AIST, we are pleased to present is specialised event exploring the integration of Artificial Intelligence (AI) into software testing.

Welcome to the A-MOST 2025 Workshop
We are pleased to welcome you to the 21st edition of the Advances in Model-Based Testing Workshop (A-MOST 2025), collocated with the IEEE International Conference on Software Testing, Verification and Validation (ICST 2025). The increasing complexity and ever-evolving nature of software-based systems and the need for assurance pose new challenges for testing. Model-based testing (MBT) is an important research area, where new approaches, methods, and tools make MBT more deployable and useful for industry than ever, contributing to improving the testing process's effectiveness and efficiency. Models and different abstractions can ease the comprehension of complex systems and allow systematization and automation of test generation. The A-MOST workshop provides a forum for researchers from academia and industry to exchange ideas, address fundamental challenges, and discuss new applications of model-based testing.

Welcome to the A-TEST 2025 Workshop
elcome to the 16th edition of the Workshop on Automated Testing (A-TEST 2025), co-located with the ICST 2025 conference on April 1, 2025, in Naples, Italy. This year, A-TEST reaches an exciting milestone as it merges with INTUITESTBEDS (International Workshop on User Interface Test Automation and Testing Techniques for Event-Based Software), bringing together a broader community of researchers, practitioners, and tool developers. This unified workshop serves as a forum for exchanging ideas and discussing the latest advancements in automated UI testing and UI-based testing.

Welcome to the CCIW 2025 Workshop
Welcome to the 5th instance of the CI/CD industry workshop (CCIW). This workshop is a forum for industry practitioners and academics to meet with each other and share the challenging problems they are facing. It is also a time to celebrate accomplishments and new opportunities in build, test, and release automation. CCIW is an informal workshop, we don't solicit or publish full papers but are instead focused on sharing ideas, fostering connections, and encouraging broad participation.

Welcome to the InSTA 2025 Workshop
It is our pleasure to welcome you to the 12th International Workshop on Software Test Architecture (InSTA 2025) collocated with the IEEE International Conference on Software Testing, Verification and Validation (ICST 2025) in Naples, Italy. Software Test architectures must be approached indirectly as a part of the test strategies. Some organizations are working to establish new ways to design novel test architectures, but there is no unified understanding of the key concepts of test architectures. This workshop is intended to allow researchers and practitioners to comprehensively discuss the central concepts of test architectures. InSTA 2025 provides sessions of research, industry experiences and emerging idea papers about software test architecture. We would like to thank the program committee to contribute the workshop. Please enjoy InSTA 2025.

Welcome to the ITEQS 2025 Workshop
As software systems continue to integrate deeply into both social and physical environments, the importance of extra-functional properties (EFPs)—such as performance, security, robustness, and scalability—has grown significantly. The success of software products is no longer measured solely by their functional correctness but increasingly by their quality characteristics. These EFPs are critical, particularly in resource-constrained environments such as real-time embedded systems, cyber-physical systems, IoT devices, and edge-based services, where failure can have severe consequences. Testing these properties presents unique challenges, often requiring approaches beyond traditional functional testing methodologies. The intricate nature of EFPs, their interdependencies, and their susceptibility to both design and deployment environment make them a critical area for research and industrial application. ITEQS serves as a dedicated platform for researchers and practitioners to exchange ideas, discuss challenges, and propose solutions for advancing EFP testing techniques. This year, ITEQS 2025 attracted high-quality contributions from diverse research areas, including protocol fuzzing for IoT security, quality assurance for large language models with retrieval augmented generation, reinforcement learning for security testing, and fault localization techniques.

Welcome to the NEXTA 2025 Workshop
The IEEE NEXTA 2025 workshop marks the 8th edition of a growing forum that brings together researchers and practitioners to explore advancements in test automation. In today’s rapidly evolving software landscape, test automation plays a critical role not only in accelerating test cycles but also in ensuring secure, repeatable, and reliable validation of software systems. The workshop emphasizes emerging trends such as AI-powered testing, DevOps-driven automation pipelines, model-based testing, and the integration of Large Language Models (LLMs) into the software testing lifecycle. This year’s edition highlights the shift toward increased AI involvement in testing, with tools and techniques now supporting tasks like intelligent test case generation, self-healing scripts, and autonomous defect detection. These developments are reshaping the role of human testers, making testing more effective and efficient across industries. NEXTA 2025 will showcase the latest findings, tools, and practices shaping the next level of automation. We thank the contributors, reviewers, and ICST organizers for their efforts and support. We warmly invite the community to engage in the stimulating discussions and collaborations that define NEXTA.

Welcome to the SAFE-ML 2025 Workshop
Welcome to the 1st International Workshop on Secure, Accountable, and Verifiable Machine Learning (SAFE-ML 2025), co-located with the 18th IEEE International Conference on Software Testing, Verification, and Validation (ICST 2025) in Naples, Italy. As the Program Co-Chairs, it is an honor to introduce this event, which addresses the critical intersection of Machine Learning (ML) and software testing.

5th International Workshop on Artificial Intelligence in Software Testing (AIST 2025)

Session 1: AI/ML for Software Testing Applications

Generating Latent Space-Aware Test Cases for Neural Networks using Gradient-Based Search
Simon Speth, Christoph Jasper, Claudius Jordan, and Alexander Pretschner
(TUM, Germany)
Autonomous vehicles rely on deep learning (DL) models like object detectors and traffic sign classifiers. Assessing the robustness of these safety-critical components requires good test cases that are both realistic, lying in the distribution of the real-world data, and cost-effective in revealing potential failures. Unlike previous methods that use adversarial attacks on the pixel space, our approach identifies latent space-aware test cases using a conditional variational autoencoder (CVAE) through three steps: (1) Train a CVAE on the dataset. (2) Generate test cases by computing adversarial examples in the CVAE’s latent space. (3) Cluster challenging test cases based on their latent representations. The resulting clusters characterize regions that reveal potential defects in the DL model, which require further analysis. Our results show that our approach is capable of generating failing test cases for all classes of the MNIST and GTSRB datasets in a purely data-driven way, surpassing the baseline of random latent space sampling by up to 75 times. Finally, we validate our approach by detecting previously introduced faults in a faulty DL model. We suggest complementing expert-driven testing methods with our purely data-driven approach to uncover defects experts otherwise might miss. To strengthen transparency and facilitate replication, we provide a replication package and digital appendix to make our code, models, visualizations, and results publicly available.

Adaptive Test Healing using LLM/GPT and Reinforcement Learning
Nariman Mani and Salma Attaranasl
(Nutrosal Inc., Canada)
Flaky tests disrupt software development pipelines by producing inconsistent results, undermining reliability and efficiency. This paper introduces a hybrid framework for adaptive test healing, combining Large Language Models (LLMs) like GPT with Reinforcement Learning (RL) to address test flakiness dynamically. LLMs analyze test logs to classify failures and extract contextual insights, while the RL agent learns optimal strategies for test retries, parameter tuning, and environment resets. Experimental results demonstrate the framework's effectiveness in reducing flakiness and improving CI/CD pipeline stability, outperforming traditional approaches. This work paves the way for scalable, intelligent test automation in dynamic development environments.

Test2Text: AI-Based Mapping between Autogenerated Tests and Atomic Requirements
Elena Treshcheva, Iosif Itkin, Rostislav Yavorski, and Nikolay Dorofeev
(Exactpro, USA; Exactpro LLC, UK; Exactpro, Georgia)
Artificial intelligence is transforming software testing by scaling up test data generation and analysis, creating new possibilities, but also introducing new challenges. One of the common problems with large-scale test data is the lack of traceability between test scenarios and system requirements. The paper addresses this challenge by proposing a traceability solution tailored to an industrial setting employing a data-driven approach. Building on an existing model-based testing framework, the design extends its annotation capabilities through a multilayer taxonomy. The suggested architecture leverages AI techniques for bidirectional mapping: linking requirements to test scripts for coverage analysis and tracing test scripts back to requirements to understand the tested functionality.

Session 2: LLMs for Test Case Generation

LLM Prompt Engineering for Automated White-Box Integration Test Generation in REST APIs
André Mesquita Rincon, Auri Marcelo Rizzo Vincenzi, and João Pascoal Faria
(Federal Institute of Tocantins (IFTO); Federal University of São Carlos (UFSCar), Brazil; Department of Computing (DC), Federal University of São Carlos (UFSCar), Brazil; INESC TEC, Faculty of Engineering, University of Porto, Portugal)
This study explores prompt engineering for automated white-box integration testing of RESTful APIs using Large Language Models (LLMs). Four versions of prompts were designed and tested across three OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o) to assess their impact on code coverage, token consumption, execution time, and financial cost. The results indicate that different prompt versions, especially with more advanced models, achieved up to 90% coverage, although at higher costs. Additionally, combining test sets from different models increased coverage, reaching 96% in some cases. We also compared the results with EvoMaster, a specialized tool for generating tests for REST APIs, where LLM-generated tests achieved comparable or higher coverage in the benchmark projects. Despite higher execution costs, LLMs demonstrated superior adaptability and flexibility in test generation.

Info

A System for Automated Unit Test Generation using Large Language Models and Assessment of Generated Test Suites
Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, and Claudio Bartolini
(Polytechnic University of Bari - Wideverse s.r.l., Italy; Polytechnic University of Bari, Italy; University of Bari, Italy; Wideverse s.r.l., Italy)
Unit tests are fundamental for ensuring software correctness but are costly and time-intensive to design and create. Recent advances in Large Language Models (LLMs) have shown potential for automating test generation, though existing evaluations often focus on simple scenarios and lack scalability for real-world applications. To address these limitations, we present AgoneTest, an automated system for generating and assessing complex, class-level test suites for Java projects. Leveraging the Methods2Test dataset, we developed Classes2Test, a new dataset enabling the evaluation of LLM-generated tests against human-written tests. Our key contributions include a scalable automated software system, a new dataset, and a detailed methodology for evaluating test quality.

Info

From Implemented to Expected Behaviors: Leveraging Regression Oracles for Non-regression Fault Detection using LLMs
Stefano Ruberto, Judith Perera, Gunel Jahangirova, and Valerio Terragni
(JRC European Commission, Italy; University of Auckland, New Zealand; King’s College London, UK)
Automated test generation tools often produce assertions that reflect implemented behavior, limiting their usage to regression testing. In this paper, we propose LLMProphet, a black-box approach that applies Few-Shot Learning with LLMs, using automatically generated regression tests as context to identify non-regression faults without relying on source code. By employing iterative cross-validation and a leave-one-out strategy, LLMProphet identifies regression assertions that are misaligned with expected behaviors. We outline LLMProphet’s workflow, feasibility, and preliminary findings, demonstrating its potential for LLM-driven fault detection.

21th International Workshop on Advances in Model Based Testing (A-MOST 2025)

Coverage and Path-Based Testing

Novel Algorithm to Solve the Constrained Path-Based Testing Problem
Matej Klima, Miroslav Bures, Marek Miltner, Chad Zanocco, Gordon Fraser, Sebastian Schweikl, and Patric Feldmeier
(Czech Technical University in Prague, Czechia; Stanford University, USA; University of Passau, Germany)
Constrained Path-based Testing (CPT) is a technique that extends traditional path-based testing by adding constraints on the order or presence of specific sequences of actions in the tests of System Under Test (SUT) processes. Through such an extension, CPT enhances the ability of the model to capture more real-life situations. In CPT, we define four types of constraints that either enforce or prohibit the use of a pair of actions in the resulting test set. We propose a novel Constrained Path-based Testing Composition (CPC) algorithm to solve the Constrained Path-based Testing Problem. We compare the results returned by the CPC algorithm with two alternatives, (1) the Filter algorithm, which solves the CPT problem in a greedy manner, and (2) the Edge algorithm, which generates a set of test cases that satisfy edge coverage. We evaluated the algorithms on 200 problem instances, with the CPC algorithm returning test sets (T) that have, on average, 350 edges, which is 2.4% and 11.1% shorter than the average number of edges in T returned by the Filter algorithm and the Edge algorithm, respectively. Regarding the compliance of the generated T with the constraints, the CPC algorithm produced T that satisfied the constraints in 95% of the cases, the Filter algorithm in 45% cases, and the Edge algorithm returned T that satisfied the constraints only for 6% SUT instances. Regarding the coverage of edges, the CPC algorithm returned test sets that contained, on average, 91.5% of edges in the graphs, while for T returned by the Filter algorithm, it was 90.8% edges. When comparing the average results of the edge coverage criterion and the fulfillment of the constraint criterion by individual algorithms, we consider the incomplete edge coverage achieved by the CPC algorithm and, at the same time, 95% fulfillment of the graph constraints to be a reasonable compromise.

Archive submitted (42 MB)

CPT Manager: An Open Environment for Constrained Path-Based Testing
Matej Klima, Miroslav Bures, Daniel Holotik, Maximilian Herczeg, Marek Miltner, and Chad Zanocco
(Czech Technical University in Prague, Czechia; Stanford University, USA)
Path-based Testing is a common technique to test System Under Test (SUT) processes. Generally, a directed graph that models a system's workflow is input to the test path generation process, as well as the selected test coverage criterion. Several algorithms are proposed in the literature that traverse the graph and facilitate the generation of test cases for the selected coverage criterion. However, a plain directed graph used for modeling SUT processes does not allow for the capture of real-life dependencies and constraints between actions in the tested processes, which might pose a limit in application of this technique. Therefore, we defined an extended model that allows the specification of constraints upon the graph's elements and a set of algorithms that allow the generation of the set of test cases that satisfy the given constraints with the edge coverage. Considering the fact that in path-based testing, there is no platform in which engineers and researchers can share SUT models to be further assembled into open datasets to test performance of evolved path-based testing MBT algorithms, especially for the given problem of test paths generation with the constraints, this paper presents a summary of the problem and a novel management for creation and management of SUT models with constraints that allows the generation of test paths as well as to serve as a platform for creation of such benchmark datasets.

Towards Improving Automated Testing with GraphWalker
Yavuz Koroglu, Mutlu Beyazıt, Onur Kilincceker, Serge Demeyer, and Franz Wotawa
(Graz University of Technology, Austria; University of Antwerp and Flanders Make, Belgium)
GraphWalker is a widespread automated model-based testing tool that generates executable test cases from graph models of a system under test. GraphWalker implements only two random test generation algorithms and has no optimization-based algorithm, where an evaluation of its performance in test lengths and coverage remains an open question in the literature. In this work, we performed experiments on three realistic systems to evaluate redundancy, coverage, and length of GraphWalker test cases. The experimental results show that even the best GraphWalker test cases are highly redundant, limited in edge pair coverage, and need significantly longer test cases to increase edge pair coverage. Overall, we establish a baseline to compare future optimization-based algorithms with, while the amount of improvement and its impact are important research questions for the future.

Model and Machine Learning

Automata Learning for React Web Applications
Peter Grubelnik and Franz Wotawa
(Technische Universitaet Graz, Austria)
Testing is an inevitable part of any software engineering process to ensure quality and reliability. Model-based testing is a successful approach for the automated generation of test cases but requires a model of the system under test. Constructing such models can take time and effort. Hence, there is even a need to automate model generation, where automata learning is a promising approach. In this paper, we address the application of automata learning for React web forms used in a wide range of web applications. We provide information regarding a tool we developed to couple web forms to automata learning for obtaining a model. In addition, we discuss the application of the tool, limitations, challenges, and solutions for reducing the state space and making the approach feasible for practical applications.

Mutating Skeletons: Learning Timed Automata via Domain Knowledge
Felix Wallner, Bernhard K. Aichernig, Florian Lorber, and Martin Tappler
(Graz University of Technology and Silicon Austria Labs, TU Graz - SAL DES Lab, Austria; Graz University of Technology, Austria; Silicon Austria Labs, Austria; TU Wien, Austria)
Formal verification techniques, such as model checking, can provide valuable insights and guarantees for (safety-critical) devices and their possible behavior. However, these guarantees only hold true as long as the model correctly reflects the system. Automata learning provides a huge advantage there as it enables not only the automatic creation of the needed models but also ensures their correct reflection of the system behavior. However, and this holds especially true for real-time systems, model learning techniques can become very time consuming. To combat this, we show how to integrate given domain knowledge into an existing approach based on genetic programming to speed up the learning process. In particular, we show how the genetic programming approach can take a (possibly abstracted, incomplete or incorrect) untimed skeleton of an automaton, which can often be obtained very cheaply, and augment it with timing behavior to form timed automata in a fast and efficient manner. We demonstrate the approach on several examples of varying sizes.

SelfBehave, Generating a Synthetic Behaviour-Driven Development Dataset using SELF-INSTRUCT
Manon Galloy, Martin Balfroid, Benoît Vanderose, and Xavier Devroey
(NADI, University of Namur, Belgium)
While state-of-the-art large language models (LLMs) show great potential for automating various Behavioral-Driven Development (BDD) related tasks, such as test generation, smaller models depend on high-quality data, which are challenging to find in sufficient quantity. To address this challenge, we adapt the SELF-INSTRUCT method to generate a large synthetic dataset from a small set of human-written high-quality scenarios. We evaluate the impact of the initial seeded scenarios' quality on the generated scenarios by generating two synthetic datasets: one from 175 high-quality seeds and one from 175 seeds that did not meet all quality criteria. We performed a qualitative analysis using state-of-the-art quality criteria and found that the quality of seeds does not significantly influence the generation of complete and essential scenarios. However, it impacts the scenarios' capability to focus on a single action and outcome and their compliance with Gherkin syntactic rules. During our evaluation, we also found that while raters agreed on whether a scenario was of high quality or not, they often disagreed on individual criteria, indicating a need for quality criteria easier to apply in practice.

AI and Testing

Model-Based Testing Computer Games: Does It Work?
Wishnu Prasetya
(Utrecht University, Netherlands)
Model-based testing (MBT) allows a target software to be tested systematically and automatically by making use of a model of the software under test. It has been successfully applied in various domains. However its application for testing computer games has not been much studied. The highly dynamic nature of computer games makes it challenging for modeling. In this paper we propose a predicate-based modeling approach coupled with an on-line test generation approach. Both aspects ease the details that need to be incorporated in the model to facilitate effective test generation. Additionally we also leverage the use of intelligent agents, so that dealing with hazards and obstacles can optionally be delegated to such an agent, hence keeping the model clean from such aspects. A case study with a game called MiniDungeon is included to discuss the viability and benefit of the approach, e.g. in terms of code coverage and ability to cover deep states.

16th International Workshop on Automated Testing (A-TEST)

AI-Driven Testing Automation

Test Case Generation for Dialogflow Task-Based Chatbots
Rocco Gianni Rapisarda, Davide Ginelli, Diego Clerissi, and Leonardo Mariani
(University of Milano-Bicocca, Italy; University of Genova, Italy)
Chatbots are software typically embedded in Web and Mobile applications designed to assist the user in a plethora of activities, from chit-chatting to task completion. They enable diverse forms of interactions, like text and voice commands. As any software, even chatbots are susceptible to bugs, and their pervasiveness in our lives, as well as the underlying technological advancements, call for tailored quality assurance techniques. However, test case generation techniques for conversational chatbots are still limited. In this paper, we present Chatbot Test Generator (CTG), an automated testing technique designed for task-based chatbots. We conducted an experiment comparing CTG with state-of-the-art BOTIUM and CHARM tools with seven chatbots, observing that the test cases generated by CTG outperformed the competitors, in terms of robustness and effectiveness.

Automated Testing of the GUI of a Real-Life Engineering Software using Large Language Models
Tim Rosenbach, Alexander Weinert, and David Heidrich
(German Aerospace Center (DLR), Institute for Software Technology, Germany)
One important step in software development is testing the finished product with actual users. These tests aim, among other goals, at determining unintuitive behavior of the software as it is presented to the end-user. Moreover, they aim to determine inconsistencies in the user-facing interface. They provide valuable feedback for the development of the software, but are time-intensive to conduct. In this work, we present GERALLT, a system that uses Large Language Models (LLMs) to perform exploratory tests of the Graphical User Interface (GUI) of a real-life engineering software. GERALLT automatically generates a list of potential unintuitive and inconsistent parts of the interface. We present the architecture of GERALLT and evaluate it on a real-world use case of the engineering software, which has been extensively tested by developers and users. Our results show that GERALLT is able to determine issues with the interface that support the software development team in future development of the software.

SleepReplacer-GPT: AI-Based Thread Sleep Replacement in Selenium WebDriver Tests
Dario Olianas, Maurizio Leotta, and Filippo Ricca
(Università di Genova, Italy; DIBRIS, Università di Genova - Italy, Italy)
Ensuring the quality of modern web applications through end-to-end (E2E) testing is crucial, especially for dynamic systems like single-page applications. Managing asynchronous calls effectively is a key challenge, often addressed using thread sleeps or explicit waits. While thread sleeps are simple to use, they cause inefficiencies and flakiness, whereas explicit waits are more efficient but demand careful implementation. This work explores extending SleepReplacer, a tool that automatically replaces thread sleeps with explicit waits in Selenium WebDriver test suites. We aim to enhance its capabilities by integrating it with ChatGPT, enabling intelligent and automated replacement of thread sleeps with optimal explicit waits. This integration aims to improve code quality and reduce flakiness. We developed a structured procedure for interacting with ChatGPT and validated it on three test suites and synthetic examples covering diverse cases. Results show that the LLM-based approach correctly replaces thread sleeps with explicit waits on the first attempt, consistently outperforming SleepReplacer. These findings support integrating ChatGPT with SleepReplacer to create a smarter, more efficient tool for managing asynchronous behavior in test suites.

Curiosity Driven Multi-agent Reinforcement Learning for 3D Game Testing
Raihana Ferdous, Fitsum Meshesha Kifetew, Davide Prandi, and Angelo Susi
(Consiglio Nazionale delle Ricerche (CNR), Italy; Fondazione Bruno Kessler, Italy; FONDAZIONE BRUNO KESSLER, Italy; Fondazione Bruno Kessler - Irst, Italy)
Recently testing of games via autonomous agents has shown great promise in tackling challenges faced by the game industry, which mainly relied on either manual testing or record/replay. In particular Reinforcement Learning (RL) solutions have shown potential by learning directly from playing the game without the need for human intervention. In this paper, we present cMarlTest, an approach for testing 3D games through curiosity driven Multi-Agent Reinforcement Learning (MARL). cMarlTest deploys multiple agents that work collaboratively to achieve the testing objective. The use of multiple agents helps resolve issues faced by a single agent approach. We carried out experiments on different levels of a 3D game comparing the performance of cMarlTest with a single agent RL variant. Results are promising where, considering three different types of coverage criteria, cMarlTest achieved higher coverage. cMarlTest was also more efficient in terms of the time taken, with respect to the single agent based variant.

Automated Testing in Critical Systems

Introducing a Testing Automation Framework for Basic Integrity of Java Applications in Hitachi Railway Systems
Angelo Venditto, Chiara Aprile, Fabrizio Zanini, and Fausto Del Villano
(Hitachi Rail, Italy)
In the railway industry, where safety, reliability, and performance are critical, test automation is essential to ensure Verification and Validation (V&V) of software and its quality, especially for Java applications. In order to reduce validation time and minimize human error, automatic tests represent a powerful tool: they can make the validation process more efficient and systematic. Building an automated testing framework provides continuity and responsiveness, enabling both early issue detection and overall system performance optimization. These concepts are crucial in a constantly evolving environment that must comply with strict standards and several customer requirements. This paper outlines the automated testing framework developed for Java-based applications in Hitachi Rail, specifically targeting railway central control systems. The framework must support SW and V&V lifecycles and is made of three main stages: API Test Automation using Postman used to ensure proper API behavior under expected conditions and against the given requirements, development of test suites with JUnit to validate the reliability and accuracy of individual modules and their interactions, and system tests through stress and fault injection scenarios. The structured approach enables efficient testing by significantly increasing the number of tests completed in a shorter time, focusing on robustness and reliability, which are paramount in railway central control systems.

Automated Testing in Railways: Leveraging Robotic Arm Precision and Efficiency for Superior Customizability and Usability Test Case Execution
Giuseppe Guida, Mario D'Avino, Massimo Nisci, Roberto Villano, Barbara Di Giampaolo, Michele Schettino, and Erasmo Alessio Bortone
(Hitachi Rail STS, Italy)
In the context of software testing of SIL4 railway onboard signalling application, this study aims to investigate the benefits of using a testing environment composed of a robotic arm, supported by other tools, for the creation and execution of test cases. Starting from the application context of railway signalling, a brief introduction of the European Railway Traffic Management System is provided, describing the approaches and methodologies used by the V&V team of Hitachi Rail STS in using such environment for the validation of a SIL4 software based on a 2oo2 architecture, whose GUI is the Driver Machine Interface. Furthermore, the contribute of Visual Inspection DMI Robotic Arm (VIDRA) tool in the verification and validation process is presented. In order to collect data on the benefits brought by the robotic testing environment, a survey was conducted by interviewing colleagues from V&V department, asking them questions about the ease of use of the environment, the learning curve after one month of use, how much their productivity improved, together with personal considerations. The results show that regardless of the seniority and experience of the interviewed staff, the evaluations on both the ease of use and the learning curve are mostly positive, while productivity is the parameter that received extremely positive evaluations. The personal considerations have shown encouraging comments with propositive insights aimed to enrich the environment with innovative features. These results suggested that the testing environment can be further improved thus possible future developments are described in conclusion.

A Workflow for Automated Testing of a Safety-Critical Embedded Subsystem
Michele Ignarra, Maria Guarino, Andrea Aiello, Vincenzo Tonziello, Giovanni De Donato, Emanuele Pascale, Renato De Guglielmo, Antonio Costa, and Cosimo Affuso
(Hitachi Rail, Italy; Alten, Italy)
As embedded systems in safety-critical domains, such as transportation, become increasingly complex, ensuring their reliability thorough testing becomes essential. Manual testing methods are often time-consuming, error-prone, and inadequate to cover all possible failure scenarios, especially in systems that require a high level of functional safety and robustness. To address these challenges, the need for automation in the testing process is evident, particularly for real-time, embedded, and hardware-dependent systems where validation is crucial for ensuring safe operation
This paper proposes a structured workflow for automating the testing of a safety-critical embedded subsystem, utilizing a real hardware-in-the-loop (HIL) environment. The approach covers all major phases of the testing process, including test specification, execution, and results assessment. The inspiring idea is to try to shift the focus of intellectual effort toward the early stages of the validation process, facilitating a clear and shared understanding between testers and developers regarding the system's behavioral and functional requirements.
The first phase involves creating a detailed test specification, which includes: developing a behavioral model of the feature to be validated; defining data probes for monitoring the subsystem under test; creating simulation scenarios to provide stimuli to the subsystem’s external interfaces. In the second phase, executable test scripts are generated either automatically or semi-automatically using multi-technology scripting languages, followed by the creation of a sequence of test cases. The final phase involves executing the tests on the actual hardware platform, collecting execution logs, and running automated checks to evaluate the results.
The workflow and tools are evaluated using a case study of the onboard subsystem hardware and software platform developed by Hitachi Rail.

5th International CI/CD Industry Workshop (CCIW 2025)

The Purpose of CI

Continuous Evaluation: Using CI Techniques for Experimentation at Scale
Nilesh Jagnik
(Google, USA)
Experimentation is a critical product development practice which can help developers understand the impact of product changes. Through experimentation, various business metrics are generated that can help understand the usefulness of product changes. Software changes at Google undergo a cycle of rigorous experimentation and refinement before they are released. In this workshop, we present an overview of Google Search’s experimentation process. We discuss the working of Google Search’s experimentation platform, the challenges faced when supporting experimentation at scale, and how they were overcome.

CI at Scale

Shifting Gears in Continuous Integration: BMW’s Strategies for High-Velocity Builds
Maximilian Jungwirth, Simon Rummert, Alexander Scott, and Gordon Fraser
(BMW Group, University of Passau, Germany; BMW Group, Germany; University of Passau, Germany)
Increasing customer demands have lead to high complexity automotive software, necessitating continuous refinement of our software development strategies at BMW. The transition to Bazel and the adoption of remote caching, remote build execution, and metastatic nodes were major stepping stones in keeping our Continuous Integration (CI) process feasible. However, challenges persist, including non-cached test executions and long integration times. Our future plans include utilising Machine Learning to optimise our CI pipelines for faster developer feedback cycles and more efficient resource consumption.

Multi-architecture Testing at Google
Tim A. D. Henderson, Sushmita Azad, Avi Kondareddy, and Abhay Singh
(Google, USA; Google LLC, USA; Google LLC, UK)
With the end of Moore's law, the need for higher performance per watt, and the rise of domain-specific chips Google is adopting new general purpose compute architectures. Google now routinely runs large scale software (Big Query, Spanner, PubSub, Blobstore, etc.) on both x86 and Arm CPUs. Previously, most server software at Google was written for the x86-64 architecture. When adopting a new architecture, the software needs to be verified for the new platform. In this talk, we will discuss how we are changing our central continuous integration platforms to support multi-architecture testing, while ensuring that the associated increase in testing cost and complexity remains sub-linear.

The Art of Managing Flaky Tests at Scale
Oleg Andreyev
(Okta, USA)
Test flakiness, characterized by non-deterministic test outcomes despite unchanged code, poses a significant challenge to continuous integration system efficiency and developer productivity. We comprehensively study a year-long initiative to mitigate test flakiness within a large-scale engineering organization. We adjust test failure sensitivity thresholds to account for alert fatigue and resolution quality, refocus data visualization for prioritization and pattern analysis, and introduce incentives to improve the rate of solutions contributing to test stabilization. Key findings reveal the importance of balancing technical efficiency gains with organizational dynamics.

Centralized, ML-Based, CI Optimizations

Unlocking New Practical Advantages of Machine Learning via Generating Large Amounts of High-Quality Data about Software Faults
Neetha Jambigi, Marc Tuerke, Bartosz Bogacz, Thomas Bach, and Michael Felderer
(University of Cologne, Germany; SAP, Germany; German Aerospace Center (DLR), Germany)
Large Language Models (LLMs) are promising machine learning (ML) tools for fault detection in software engineering but require large datasets for adapting them for downstream tasks. To address data scarcity in public and industrial repositories, we generate synthetic faults by mutating C++ code in SAP HANA, creating high-quality training data with clear cause-effect linkages. This dataset enables ML models to predict crash causes, stack traces, and detect failing test cases, enhancing debugging and improving CI/CD processes. Our discussion highlights the practical benefits and applications of ML-based fault prediction in large industrial projects.

12th International Workshop on Software Test Architecture (InSTA 2025)

Research Papers

Establishing Utilization Decision Criteria for Open-Source Software Projects Based on Issue Resolution Time Evaluation
Keisuke Inoue and Kazuhiko Tsuda
(University of Tsukuba, Japan)
In the face of growing demand for high-quality software solutions at reduced costs and development times, the use of open-source software (OSS) has become increasingly prevalent. This study evaluates the speed of OSS support activities—an essential quality factor for OSS users—and proposes a method for early-stage evaluation based on issue resolution time. We analyzed 100 popular OSS projects hosted on GitHub, collecting data on issue creation and completion dates. The results indicate a correlation coefficient of 0.587 between 30 and 180 days of observation, aligning with previous findings on OSS support continuity. Notably, a satisfactory correlation of 0.525 was achieved at just 21 days, suggesting that organizations can assess OSS support speed earlier and potentially accelerate their decision-making process. Our findings provide a practical criterion for evaluating the speed of OSS support activities, which may help organizations adopt OSS with greater confidence and enhance productivity in software-related projects. Future research will expand these criteria by exploring additional factors affecting issue resolution time, such as project scale and corporate support.

The Evaluation of Ambiguity Based on the Distance between Transitive Verbs and Objects in Japanese
Toshiharu Kato and Kazuhiko Tsuda
(Graduate School of University of Tsukuba, Japan)
A method exists for detecting ambiguous expressions in requirements specifications using a dictionary focused on transitive verbs and their objects. This dictionary is constructed by extracting transitive verbs and objects from the requirements specifications and classifying them for inclusion based on specific conditions. Sentences with ambiguous expressions can be detected by searching for sentences containing transitive verbs and objects found in the dictionary. The accuracy of this dictionary has been observed to improve when more requirements specifications are used as input data. However, the selection criteria for constructing the dictionary and filtering ambiguous sentences after extraction have not been thoroughly examined, indicating that the accuracy of ambiguous sentence extraction could be further improved. In this study, to further enhance the accuracy of this dictionary, we measured the distance between the detected transitive verbs and objects within sentences to determine whether the sentence is ambiguous. We conducted a t-test to compare the average distance between transitive verbs and objects in ambiguous and non-ambiguous sentences. The two-tailed test yielded a p-value of 0.0367 at a 5% significance level, indicating a significant difference in the distances between transitive verbs and objects. Therefore, we confirmed that the accuracy of the dictionary can be improved by weighting the words in the dictionary based on the distance between transitive verbs and objects. Detecting ambiguous expressions with high accuracy is expected to increase the efficiency of reviewing the contents of requirements specifications and reduce the difficulty of clarifying specifications. This will contribute to better scope management in project management. Consequently, the proposed approach is expected to prevent delays and cost overruns, contributing to more effective schedule and cost management.

A Language Framework for Test(-ware) Architecture
Luis León
(e-Quallity, Mexico)
In a previous paper we presented the approach called Formal Testing, in which computer languages are used in testing activities. In this conceptual framework we showed definitions for Software Architecture and Architectural Pattern, and outlined the concept of Testware Architecture. This paper elaborates on that and describes how a framework of three integrated languages is used to approach Testware Architecture: a Process Definition Language, a Specification Language, and a GUI Description Language. It includes representations of several architectures and a pattern.

Industry Reports and Emgerging Ideas

Enhancing SaaS Product Reliability and Release Velocity through Optimized Testing Approach
Ren Karita and Tsuyoshi Yumoto
(freee K.K., Japan)
In mission-critical SaaS products such as payment systems, maintaining rapid release velocity while ensuring high reliability is essential. However, as these products grow more complex, the areas requiring regression testing expand, increasing testing costs and often compromising release speed. This paper proposes a methodology to achieve both rapid release velocity and high reliability by optimizing the regression test suite through defect severity analysis. Furthermore, a test architecture is proposed to balance unit, integration, and system testing. This approach reduces regression testing costs, shortens testing duration, and accelerates release cycles.

AIs Understanding of Software Test Architecture
Jon Hagar and Tom Wissink
(Grand Software Testing, USA)
The software test industry is discussing whether software test architectures (STA) with associated system hardware exist and whether supporting STA standards are needed. This paper reports on queries to artificial intelligence (AI) systems to understand what these tools know about architectures. The premise is that if AI systems know about the subject with depth, insight, and references, then STA has some common standing within the industry, and a standard might be in order.

9th International Workshop on Testing Extra-Functional Properties and Quality Characteristics of Software Systems (ITEQS 2025)

AI and Testing

Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
Bestoun S. Ahmed, Ludwig Otto Baader, Firas Bayram, Siri Jagstedt, and Peter Magnusson
(Karlstads Universitet, Sweden; American University of Bahrain, Bahrain; Ludwig Maximilian University Munich, Germany; Karlstad University, Sweden)
This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three different LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics that encompass syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study reveals significant information about how different architectural choices and parameter configurations affect system performance, particularly highlighting the impact of temperature and top-p parameters on response quality. The tests were carried out on a tourism recommendation system for the Värmland region in Sweden, utilizing standard and RAG-enhanced configurations. The results indicate that the newer LLM versions show modest improvements in performance metrics, though the differences are more pronounced in response length and complexity rather than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems, providing valuable guidance to organizations deploying these architectures in production environments.

Using Reinforcement Learning for Security Testing: A Systematic Mapping Study
Tanwir Ahmad, Matko Butkovic, and Dragos Truscan
(Åbo Akademi University, Finland)
Security of software systems has become increasingly important due to the advancement in technology that occurs on a daily basis and due to the interconnectivity that the Internet network system provides. The manual testing process is time-consuming process and inefficient, especially for very large and complex systems. Reinforcement learning has shown promising results in different test generation approaches due to its ability to optimize the test generation process towards relevant parts of the system. A considerable body of work has been developed in recent years to exploit reinforcement learning for security test generation. This study provides a list of approaches and tools for security test generation using Reinforcement Learning (RL). By searching popular research publication databases, a list of 47 relevant studies has been identified and classified according to the type of approach, RL algorithm, application domain and publication metadata.

Visual Spectrum-Based Fault Localization for Python Programs Based on the Differentiation of Execution Slices
Shehroz Khan, Gaadha Sudheerbabu, Bianca Elena Staicu, Tanwir Ahmad, and Dragos Truscan
(Åbo Akademi University, Finland; Abo Akademi University, Finland)
We present an automated fault localization technique that can assist developers to localize effectively faults in Python programs. The proposed method uses spectrum-based fault localization techniques, program slicing, and graph-based visualization to formulate an efficient method for reducing the effort needed in fault localization. The approach takes the source code of a program, a set of passed and failed tests and collects the program spectra information by executing the tests. A tool, FaultLocalizer, facilitates the generation of a call graph for inter-procedural dependency analysis and annotated control flow graphs for different modules with spectra information and suspiciousness scores. The focus of the approach is on the visual analysis of the source code, and it is intended to complement existing fault localization approaches. The effectiveness of the proposed approach is evaluated on a set of buggy Python programs. The results show that the approach reduces debugging efforts and can be applied to programs with conditional branching.

Cyber-Physical Systems

A Protocol Fuzzing Framework to Detect Remotely Exploitable Vulnerabilities in IoT Nodes
Phi Tuong Lau and Stefan Katzenbeisser
(Vietnam, Viet Nam; University of Passau, Germany)
Firmware attacks pose significant security risks to IoT devices due to often insufficient security protection. Such attacks can result in data breaches, privacy violations, and operational disruptions. For example, attackers can alter the structure of packets (e.g., header, payload) to create malformed packets and send them to nodes in a victim network. These packets can then remotely trigger vulnerabilities (e.g., buffer overflows) in the firmware, potentially resulting in crashes or unexpected behavior. In this paper, we first investigate how firmware vulnerabilities (e.g., out-of-bounds errors) can be remotely exploited in IoT networks by analyzing bug reports from leading IoT platforms. We categorize these findings and introduce a protocol fuzzing framework that leverages the GPT-4 model to guide the generation of test cases in form of malformed packets. These packets are later sent to 6LoWPAN-based nodes to observe their behavior and identify potential security vulnerabilities. As a result, we discovered four new vulnerabilities, classified as improper input validation, which can be remotely exploited in the Contiki-NG operating system.

Unified Search for Multi-requirement Falsification for Cyber-Physical Systems
Jesper Winsten and Ivan Porres
(Åbo Akademi University, Finland)
This paper addresses the challenge of efficiently falsifying multiple requirements in cyber-physical systems (CPSs). Traditional falsification approaches typically evaluate requirements sequentially, leading to redundant computations and decreased efficiency. We present Multi-Requirement Unified Search (MRUS), an algorithm that evaluates all requirements simultaneously using conjunctive Signal Temporal Logic (STL) formulas. MRUS combines an Online Generative Adversarial Network (OGAN) for test case generation with a unified search algorithm to evaluate multiple requirements conjunctively. The performance of the algorithm was evaluated using the ARCH-COMP 2024 falsification competition as a benchmark suite. The results demonstrate that MRUS achieves a high Falsification Rate (FR) across all benchmarks while requiring a small number of total execution counts to find falsifying inputs.

14th International Workshop on Combinatorial Testing (IWCT 2025)

Theoretical Aspects of CT

Utilizing Ontologies for Combinatorial Testing
Franz Wotawa
(Technische Universitaet Graz, Austria)
Test case generation using combinatorial testing requires an input model comprising parameters and their ranges of possible values, i.e., their domains, which might be appropriate for ordinary applications but not for more sophisticated application domains like autonomous systems. Therefore, we discuss the use of ontologies for modeling and its utilization for combinatorial testing. We focus on ontologies describing the interface between the system under test and its environment, discuss the rationale behind such an approach, and introduce the basic algorithm for mapping ontologies to input models. We further consider modeling challenges and provide some initial modeling principles.

Combinatorial Testing and ML/AI

A Combinatorial Approach to Reduce Machine Learning Dataset Size
Megan Olsen, M S Raunak, Rick Kuhn, Fenrir Badorf, Hans van Lierop, and Francis Durso
(Loyola University Maryland, USA; National Institute of Standards and Technology, USA; Natl Institute of Standards & Technology, USA; Johns Hopkins University, USA)
Although large datasets may seem to be the best choice for machine learning, a smaller dataset that better represents the important parts of the problem will be faster to train and potentially more effective. Although dimensionality reduction focuses on reducing features (columns) of data, dataset reduction focuses on removing data points (rows). Typical approaches for reducing dataset size include random sampling, or using machine learning to understand the data. We propose using combinatorial coverage and frequency difference (CFD) techniques from software testing to choose the most effective rows of data to generate a smaller training dataset. We explore the effectiveness of four approaches to reduce a dataset using CFD, and a case study showing that we can produce a significantly smaller dataset that is more effective in training a Support Vector Machine than the original dataset or datasets generated by other approaches.

Data Frequency Coverage Impact on AI Performance
Erin Lanus, Brian Lee, Jaganmohan Chandrasekaran, Laura Freeman, M S Raunak, Raghu Kacker, and Rick Kuhn
(Virginia Tech, USA; Virginia Polytechnic Institute and State University (Virginia Tech), USA; National Institute of Standards and Technology, USA; Natl Institute of Standards & Technology, USA)
Artificial Intelligence (AI) models use statistical learning over data to solve complex problems for which straightforward rules or algorithms may be difficult or impossible to design; however, a side effect is that models that are complex enough to sufficiently represent the function may be uninterpretable. Combinatorial testing, a black-box approach arising from software testing, has been applied to test AI models. A key differentiator between traditional software and AI is that many traditional software faults are deterministic, requiring a failure-inducing combination of inputs to appear only once in the test set for it to be discovered. On the other hand, AI models learn statistically by reinforcing weights through repeated appearances in the training dataset, and the frequency of input combinations plays a significant role in influencing the model’s behavior. Thus, a single occurrence of a combination of feature values may not be sufficient to influence the model’s behavior. Consequently, measures like simple combinatorial coverage that are applicable to software testing do not capture the frequency with which interactions are covered in the AI model’s input space. This work develops methods to characterize the data frequency coverage of feature interactions in training datasets and analyze the impact of imbalance, or skew, in the combinatorial frequency coverage of the training data on model performance. We demonstrate our methods with experiments on an open-source dataset using several classical machine learning algorithms. This pilot study makes three observations: performance may increase or decrease with data skew, feature importance methods do not predict skew impact, and adding more data may not mitigate skew effects.

Fairness Testing of Machine Learning Models using Combinatorial Testing in Latent Space
Arjun Dahal, Sunny Shree, Yu Lei, Raghu N. Kacker, and D. Richard Kuhn
(The University of Texas at Arlington, USA; Information Technology Laboratory, National Institute of Standards and Technology, USA)
Decision-making by Machine Learning (ML) models can exhibit biased behavior, resulting in unfair outcomes. Testing ML models for such biases is essential to ensure unbiased decision-making. In this paper, we propose a combinatorial testing-based approach in the latent space of a generative model to generate instances that assess the fairness of black-box ML models. Our approach involves a two-step process: generating t-way test cases in the latent space of a Variational AutoEncoder and performing fairness testing using the instances reconstructed from these test cases. We experimentally evaluated our approach against an approach that generates t-way instances in the input space for fairness testing. The results indicate that the latent-space approach produces more natural test cases while detecting the first fairness violation faster and achieving a higher ratio of discriminatory instances to the total number of generated instances.

Combinatorial Testing Tools

A Search-Based Benchmark Generator for Constrained Combinatorial Testing Models
Paolo Arcaini, Andrea Bombarda, and Angelo Gargantini
(National Institute of Informatics, Japan; University of Bergamo, Italy)
In combinatorial testing, the generation of benchmarks that meet specific constraints and complexity requirements is essential for the rigorous assessment of testing tools. In previous work, we presented BenCiGen, a benchmark generator for combinatorial testing, which had the limitation of often discarding input parameter models (IPMs) that fail to meet the targeted requirements in terms of ratio (i.e., the number of valid tests, or tuples, over the total number of possible tests or tuples) and solvability. This paper presents an extension to BenCiGen, BenCiGen_S, that integrates a search-based generation approach, aimed at reducing the number of discarded IPMs and enhancing the efficiency of benchmark generation. Instead of rejecting IPMs that do not fulfill the desired characteristics, BenCiGen_S employs search-based techniques that iteratively mutate IPMs, optimizing them accordingly to a fitness function that measures their distance from target requirements. Experimental results demonstrate that BenCiGen_S generates a significantly higher proportion of benchmarks adhering to specified characteristics, with sometimes a reduced computation time. This approach not only improves the generation of suitable benchmarks but also enhances the tool's overall effectiveness.

Towards Continuous Integration for Combinatorial Testing
Manuel Leithner, Jovan Zivanovic, Reinhard Kugler, and Dimitris E. Simos
(SBA Research, Austria; SBA-Research, Austria)
Combinatorial testing is an efficient black-box approach that permits practitioners to pseudo-exhaustively cover the input space of a system under test. It offers mathematically guaranteed coverage up to a user-defined strength while requiring a small number of test cases. Despite these advantages, industrial uptake of this technique has been slow, not least because of the significant investment required to construct and maintain an accurate input parameter model, create reliable oracles and automate testing processes. This work introduces a hierarchy of embeddings of combinatorial testing into continuous integration and deployment pipelines for use in real-world software development workflows. It further describes the practical implementation of a combinatorial security testing pipeline, enabling automated detection of SQL injection vulnerabilities throughout software evolution. Finally, it details lessons learned throughout the design, deployment and utilization of the resulting processes.

Testing Tool for Combinatorial Transition Testing in Dynamically Adaptive Software Systems
Pierre Martou, Benoît Duhoux, Kim Mens, and Axel Legay
(Université catholique de Louvain (UCLouvain), ICTEAM, Belgium; Université catholique de Louvain (UCL), ICTEAM, Belgium; Nexova, Belgium)
Dynamically adaptive software systems exhibit an exponential number of configurations, posing significant testing challenges. Combinatorial Interaction Testing (CIT) and Combinatorial Transition Testing (CTT) enable concise test suite generation to cover valid pairs of features and transitions. Detecting behavioural transition errors in such test suites with full transition coverage still remains complex and time-consuming. We propose a modular and adaptable testing tool implementing state-of-the-art CTT. It generates test suites using either CIT or CTT and integrates a dedicated test oracle for efficient detection of behavioural transition errors, with minimum user effort.

Towards Accessibility of Covering Arrays for Practitioners of Combinatorial Testing
Ulrike Grömping
(BHT - Berliner Hochschule für Technik, Germany)
Combinatorial testing can be useful not only for large software-related testing efforts. While it seems worthwhile to evaluate its benefits for more general applications, lack of immediate availability of suitable arrays is a major impediment against its wider use by practitioners. This paper discusses requirements for making covering arrays (CAs), a key tool in combinatorial testing, accessible to practitioners of various application areas in a way that satisfies their needs. It is argued that, by fulfilling such requirements, the expert community also improves exchange of results among its members. The paper is thus intended as a call for action for the combinatorial testing community to improve the sharing of CA constructions.

Applications of CT

Evaluating Large Language Model Robustness using Combinatorial Testing
Jaganmohan Chandrasekaran, Ankita Ramjibhai Patel, Erin Lanus, and Laura Freeman
(Virginia Tech, USA; Independent Researcher, USA)
Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLM’s versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question and answers, to name a few. Multiple choice questions (MCQ) have emerged as a widely used evaluation task to assess LLM’s understanding and reasoning across various subject areas. However, studies from the literature have revealed that LLMs exhibit sensitivity to the ordering of options in MCQ tasks, with performance variations based on option sequence, thus underscoring the robustness concerns in LLM performance. This work presents a combinatorial testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging the sequence covering array, the framework constructs test sets by systematically swapping the order of options, which are then used in ascertaining the robustness of LLMs. We performed an experimental evaluation using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset and evaluated the robustness of GPT 3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively minimal number of tests.

Combinatorial Methods for Enhancing the Resilience of Production Facilities
Klaus Kieseberg, Konstantin Gerner, Bernhard Garn, Wolfgang Czerni, Dimitris E. Simos, D. Richard Kuhn, and Raghu Kacker
(SBA Research, Austria; Infraprotect, Austria; National Institute of Standards and Technology, USA)
In this paper, we apply combinatorial methods to generate crisis scenarios for production facilities with the goal of strengthening their resilience by identifying weaknesses in operational aspects or crisis response plans, extending previous work from the domain of disaster research. We use notions of combinatorial coverage for quantifying diversity in scenarios and as one objective when sampling the considered scenario space. We report on a small case study targeting the resilience of a production facility. Specifically, we generated covering arrays of different strengths (for five ternary parameters, for strengths two to five) to obtain paths through the production chain and various combinatorial sequence test sets to obtain crisis scenarios (sequence covering arrays for twelve events of strengths two and three as well as sequence test sets with constraints of length six and seven featuring different constraints). Finally, we state reflections as well as practical feedback to the proposed combinatorial approach.

Combinatorial Test Design Model Creation using Large Language Models
Deborah Furman, Eitan Farchi, Michael Gildein, Andrew Hicks, and Ryan Rawlins
(IBM, USA; IBM HRL, Israel)
In this paper, we report on our initial experience in using Large Language Models (LLMs), which continue to impact a growing multitude of domains, further expanding machine learning applicability. One possible use case is to apply LLMs as a tool to help drive more optimized test coverage via assisting to generate Combinatorial Test Design (CTD) models. This can lower the entry barrier for new CTD practitioners by requiring less subject matter expertise to generate a basic CTD model. In this paper we report on our initial experience in using LLMs to generate a base CTD model and analyze the usefulness of the approach. In common testing scenarios, the LLMs easily provide the necessary attributes and values that are needed to define the CTD model. Prompting the LLM for additional use cases is useful in highlighting possible interactions and determining constraints of the attributes identified in the first stage. Combining the two stages together facilitates the creation of base CTD models.

Extended Abstract of Poster: STARS: Tree-Based Classification and Testing of Feature Combinations in the Automated Robotic Domain
Till Schallau, Dominik Schmid, Nick Pawlinorz, Stefan Naujokat, and Falk Howar
(TU Dortmund University, Germany)
Testing complex systems is crucial for ensuring safety, especially in automated driving, where diverse data sources and variable environments pose challenges. Here, robust safety validation is critical but exhaustive n-way combinatorial testing is impractical due to the vast number of test cases. The STARS framework uses tree-based scenario classifiers to limit feature combinations in a given domain.

International Workshop on Mutation Testing (Mutation 2025)

Session 1

Equivalent Mutants: Deductive Verification to the Rescue
Serge Demeyer and Reiner Hähnle
(Universiteit Antwerpen (ANSYMO), Belgium; Technische Universität Darmstadt, Germany)
Already since the dawn of mutation testing, equivalent mutants have been a subject of academic research. Up until now, all the investigated program analysis techniques (infeasible paths, trivial compiler equivalence, program slicing, symbolic execution) focused on shielding the test engineer from the decision whether a mutant is equivalent or not. This paper argues for a complementary viewpoint: providing test engineers with powerful analysis tools (namely deductive verification) which show why a mutant is equivalent, or come up with a counter example if not. We illustrate by means of a series of increasingly challenging examples (drawn from the MutantBench dataset) how such an approach provides valuable insights to the test engineer, as such paving the way for an actionable improvement of the test suite under analysis.

Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection
Max Hort, Linas Vidziunas, and Leon Moonen
(Simula Research Laboratory, Norway; Simula Research Laboratory & BI Norwegian Business School, Norway)
Recent advances in defect detection use language models. Existing works enhanced the training data to improve the models’ robustness when applied to semantically identical code (i.e., predictions should be the same). However, the use of semantically identical code has not been considered for improving the tools during their application - a concept closely related to metamorphic testing. The goal of our study is to determine whether we can use semantic-preserving transformations, analogue to mutation operators, to improve the performance of defect detection tools in the testing stage. We first collect existing publications which implemented semantic-preserving transformations and share their implementation, such that we can reuse them. We empirically study the effectiveness of three different ensemble strategies for enhancing defect detection tools. We apply the collected transformations on the Devign dataset, considering vulnerabilities as a type of defect, and two fine-tuned large language models for defect detection (VulBERTa, PLBART). We found 28 publications with 94 different transformation. We choose to implement 39 transformations from four of the publications, but a manual check revealed that 23 out 39 transformations change code semantics. Using the 16 remaining, correct transformations and three ensemble strategies, we were not able to increase the accuracy of the defect detection models. Our results show that reusing shared semantic-preserving transformation is difficult, sometimes even causing wrongful changes to the semantics.

Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants
Asma Hamidi, Ahmed Khanfir, and Mike Papadakis
(University of Luxembourg, Luxembourg; South Mediterranean University (MSB- MedTech-LCI), Tunis, Tunisia, Tunisia)
This paper presents intent-based mutation testing, a testing approach that generates mutations by changing the programming intents that are implemented in the programs under test. In contrast to traditional mutation testing, which changes (mutates) the way programs are written, intent mutation changes (mutates) the behavior of the programs by producing mutations that implement (slightly) different intents than those implemented in the original program. The mutations of the programming intents represent possible corner cases and misunderstandings of the program behavior, i.e., program specifications, and thus can capture different classes of faults than traditional (syntax-based) mutation. Moreover, since programming intents can be implemented in different ways, intent-based mutation testing can generate diverse and complex mutations that are close to the original programming intents (specifications) and thus direct testing towards the intent variants of the program behavior/specifications. We implement intent-based mutation testing using Large Language Models (LLMs) that mutate programming intents and transform them into mutants. We evaluate intent-based mutation on 29 programs and show that it generates mutations that are syntactically complex, semantically diverse, and quite different (semantically) from the traditional ones. We also show that 55% of the intent-based mutations are not subsumed by traditional mutations. Overall, our analysis shows that intent-based mutation testing can be a powerful complement to traditional (syntax-based) mutation testing.

Session 2

Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging
Philipp Straubinger, Marvin Kreis, Stephan Lukasczyk, and Gordon Fraser
(University of Passau, Germany; University of Passau and JetBrains, Germany)
Large Language Models (LLMs) can generate plausible test code. Intuitively they generate this by imitating tests seen in their training data, rather than reasoning about execution semantics. However, such reasoning is important when applying mutation testing, where individual tests need to demonstrate differences in program behavior between a program and specific artificial defects (mutants). In this paper, we evaluate whether Scientific Debugging, which has been shown to help LLMs when debugging, can also help them to generate tests for mutants. In the resulting approach, LLMs form hypotheses about how to kill specific mutants, and then iteratively generate and refine tests until they succeed, all with detailed explanations for each step. We compare this method to three baselines: (1) directly asking the LLM to generate tests, (2) repeatedly querying the LLM when tests fail, and (3) search-based test generation with Pynguin. Our experiments evaluate these methods based on several factors, including mutation score, code coverage, success rate, and the ability to identify equivalent mutants. The results demonstrate that LLMs, although requiring higher computational cost, consistently outperform Pynguin in generating tests with better fault detection and coverage. Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.

Exploring Robustness of Image Recognition Models on Hardware Accelerators
Nikolaos Louloudakis, Perry Gibson, José Cano, and Ajitha Rajan
(The University of Edinburgh, UK; University of Glasgow, UK)
As the usage of Artificial Intelligence (AI) on resource-intensive and safety-critical tasks increases, a variety of Machine Learning (ML) compilers have been developed, enabling compatibility of Deep Neural Networks (DNNs) with a variety of hardware acceleration devices. However, given that DNNs are widely utilized for challenging and demanding tasks, the behavior of these compilers must be verified. To this direction, we propose MutateNN, a tool that utilizes elements of both differential and mutation testing in order to examine the robustness of image recognition models when deployed on hardware accelerators with different capabilities, in the presence of faults in their target device code - introduced either by developers, or problems in their compilation process. We focus on the image recognition domain by applying mutation testing to 7 well-established DNN models, introducing 21 mutations of 6 different categories. We deployed our mutants on 4 different hardware acceleration devices of varying capabilities and observed that DNN models presented discrepancies of up to 90.3% in mutants related to conditional operators across devices. We also observed that mutations related to layer modification, arithmetic types and input affected severely the overall model performance (up to 99.8%) or led to model crashes, in a consistent manner across devices.

Video

Info

8th International IEEE Workshop on Next Level of Test Automation (NextA 2025)

Testing and LLMs

Adaptive Testing for LLM-Based Applications: A Diversity-Based Approach
Juyeon Yoon, Robert Feldt, and Shin Yoo
(Korea Advanced Institute of Science and Technology, South Korea; Chalmers, Sweden; Korea Advanced Institute of Science Technology, South Korea)
The recent surge of building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks, primarily focused on treating prompt templates as the unit of testing. Despite the significant costs associated with test input execution and output assessment, the curation of optimized test suites is yet overlooked in these tools, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our proposed adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from existing test suite and their labelling results. Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets and promotes the generation of more varied outputs.

Fault Localization

Witness Test Program Generation through AST Node Combinations
Heuichan Lim
(Davidson College, USA)
Compiler bugs can be critical, but localizing them is challenging because of the limited debugging information and complexity of compiler internals. A common approach to address this challenge is to analyze the compiler’s execution behavior while it processes witness test programs. To accurately localize a bug in the compiler, the approach must generate effective witness programs by identifying and modifying the code sections in the seed program that are directly associated with the bug. However, existing methods often struggle to generate such programs due to the random selection of code parts for mutation. In this paper, we propose a novel automated technique to (1) identify abstract syntax tree nodes that are most likely to be related to the bug in the compiler, which can transform a failing program into a passing one, and (2) use this information to generate witness test programs that can effectively localize compiler bugs. To evaluate the effectiveness of our approach, we built a prototype and experimented with it using 40 real-world LLVM bugs. Our approach localized 40% and 100% of bugs within the Top-1 and Top-5 compiler files, respectively.

Influence of Pure and Unit-Like Tests on SBFL Effectiveness: An Empirical Study
Attila Szatmári, Tamás Gergely, and Árpád Beszédes
(Szegedi Tudományegyetem, Hungary; Department of Software Engineering, University of Szeged, Hungary)
One of the most challenging and time-consuming aspects of debugging is identifying the exact location of the bug. We propose the concept of a Pure Unit Test (PUT), which, when fails, can unambiguously determine the location of the faulty method. Based on developer experience, we established three heuristics to evaluate the degree to which a test can be considered a unit test, if it cannot be considered as a PUT (we call these Unit-Like Tests, or ULTs). We examined how and when PUTs and ULTs affect Spectrum-Based Fault Localization efficiency. The results demonstrate that, for more complex systems, a higher proportion of unit tests in the relevant test cases can enhance the effectiveness of fault localization. When the number of PUTs is high enough, fault localization becomes trivial, in that case running SBFL is not necessary. Moreover, our findings indicate that different kinds of ULTs can have a large impact on the efficiency of fault localization, particularly for simpler bugs where they can quickly and effectively pinpoint the problem areas.

1st International Workshop on Secure, Accountable, and Verifiable Machine Learning (SAFE-ML 2025)

Robustness, Verification, and Security in AI Systems

Structural Backdoor Attack on IoT Malware Detectors via Graph Explainability
Yu-Cheng Chiu, Maina Bernard Mwangi, Shin-Ming Cheng, and Hahn-Ming Lee
(National Taiwan University of Science and Technology, Taiwan)
In AI-based malware detection, structural features such as function call graphs (FCGs) and control flow graphs (CFGs) are widely used for their ability to encapsulate program execution flow and facilitate cross-architectural malware detection in IoT environments. When combined with deep learning (DL) models, such as graph neural networks (GNNs) that capture node interdependencies, these features enhance malware detection and enable the identification of novel threats. However, AI-based detectors require frequent updates and retraining to adapt to evolving malware strains, often relying on datasets from online crowdsourced threat intelligence platforms, which are vulnerable to poisoning and backdoor attacks. Backdoor attacks implant triggers in training samples, embedding vulnerabilities into ML models that can later be exploited. While extensive research exists on backdoor attacks in other domains, their implications for structural AI-based IoT malware detection remain unexplored. This study proposes a novel backdoor attack targeting IoT malware detectors trained on structural features. By leveraging CFGExplainer, we identify the most influential subgraphs from benign samples, extract them to serve as triggers, and inject them into malicious samples in the training dataset. Additionally, we introduce a partition-trigger strategy that injects triggers into malware while splitting critical malicious nodes to reduce their influence on label prediction. Ultimately, we achieve high attack success rates of up to 100% against state-of-the-art structural IoT malware detectors, underscoring critical security vulnerabilities and emphasizing the need for advanced countermeasures against backdoor attacks.

Quantifying Correlations of Machine Learning Models
Yuanyuan Li, Neeraj Sarna, and Yang Lin
(Munich RE, USA; Munich RE, Germany; Hartford Steam Boiler, USA)
Machine Learning models are being extensively used in safety critical applications where errors from these models could cause harm to the user. Such risks are amplified when multiple machine learning models, which are deployed concurrently, interact and make errors simultaneously. This paper explores three scenarios where error correlations between multiple models arise, resulting in such aggregated risks. Using real-world data, we simulate these scenarios and quantify the correlations in errors of different models. Our findings indicate that aggregated risks are substantial, particularly when models share similar algorithms, training datasets, or foundational models. Overall, we observe that correlations across models are pervasive and likely to intensify with increased reliance on foundational models and widely used public datasets, highlighting the need for effective mitigation strategies to address these challenges.

Towards a Probabilistic Framework for Analyzing and Improving LLM-Enabled Software
Juan Manuel Baldonado, Flavia Bonomo-Braberman, and Víctor Adrián Braberman
(ICC UBA/CONICET and DC, FCEN, Universidad de Buenos Aires, Argentina)
Ensuring the reliability and verifiability of large language model (LLM)-enabled systems remains a significant challenge in software engineering. We propose a probabilistic framework for systematically analyzing and improving these systems by modeling and refining distributions over clusters of semantically equivalent outputs. This framework facilitates the evaluation and iterative improvement of Transference Models–key software components that utilize LLMs to transform inputs into outputs for downstream tasks. To illustrate its utility, we apply the framework to the autoformalization problem, where natural language documentation is transformed into formal program specifications. Our case illustrates how distribution-aware analysis enables the identification of weaknesses and guides focused alignment improvements, resulting in more reliable and interpretable outputs. This principled approach offers a foundation for addressing critical challenges in the development of robust LLM-enabled systems.

Black-Box Multi-Robustness Testing for Neural Networks
Mara Downing and Tevfik Bultan
(University of California Santa Barbara, USA; University of California at Santa Barbara, USA)
Neural networks are increasingly prevalent in day-to-day life, including in safety-critical applications such as self-driving cars and medical diagnoses. This prevalence has spurred extensive research into testing the robustness of neural networks against adversarial attacks, most commonly by determining if misclassified inputs can exist within a region around a correctly classified input. While most prior work focuses on robustness analysis around a single input at a time, in this paper we look at simultaneous analysis of multiple robustness regions. Our approach finds robustness violating inputs away from expected decision boundaries, identifies varied types of misclassifications by increasing confusion matrix coverage, and effectively discovers robustness violating inputs that do not violate input feasibility constraints. We demonstrate the capabilities of our approach on multiple networks trained from several datasets, including ImageNet and a street sign identification dataset.

Security and Privacy in Fedetated Learning Systems

Towards A Common Task Framework for Distributed Collaborative Machine Learning
Qianying Liao, Dimitri Van Landuyt, Davy Preuveneers, and Wouter Joosen
(KU Leuven, Belgium)
Distributed Collaborative Machine Learning (DCML) enables collaborative model training without the need to reveal or share datasets. However, its implementation incurs complexities in balancing performance, efficiency, and security and privacy. Different architectural approaches in how the learning can be distributed and coordinated over different parties have emerged, leading to significant diversity. Current academic work often lacks a comprehensive evaluation of the diverse architectural DCML approaches, which in turn hinders objective comparison of different DCML approaches and impedes broader adoption. This paper highlights the need for a general and holistic evaluation framework for DCML approaches, and proposes a number of relevant evaluation metrics and criteria.

Exploring and Mitigating Gradient Leakage Vulnerabilities in Federated Learning
Harshit Gupta, Ghena Barakat, Luca D'Agati, Francesco Longo, Giovanni Merlino, and Antonio Puliafito
(University of Messina, Italy)
Due to the importance of data privacy, Federated Learning (FL) is increasingly adopted in various applications. Its importance lies in sharing only local model gradients with the central server instead of the raw training data. The model gradients shared for aggregation are key components of FL. However, these gradients are mathematical values and are prone to attack, which may lead to data leakage issues. Therefore, implementing robust security measures is crucial to prevent data leakage when sharing model gradients with the central server. Such measures are essential to protect gradients from being exploited by attackers to retrieve real data. This work proposed and demonstrated how public gradients can be used to retrieve private training data and how this may be avoided by using the technique of differential privacy.

Federated Learning under Attack: Game-Theoretic Mitigation of Data Poisoning
Marco De Santis and Christian Esposito
(University of Salerno, Italy)
Federated Learning (FL) is vulnerable to various attacks, including data poisoning and model poisoning, which can degrade the quality of the global model. Current solutions are based on cryptographic primitives or byzantine aggregation but suffer from performance and/or quality worsening and are even ineffective in case of data poisoning. This work proposes a game-theoretic solution to identify malicious weights and differentiate between benign and compromised updates. We use the Prisoner's Dilemma and Signaling Games to model interactions between local learners and the aggregator, allowing a precise evaluation of the legitimacy of shared weights. Upon detecting an attack, the system activates a rollback mechanism to restore the model to a safe state. The proposed approach enhances FL robustness by mitigating attack impacts while preserving the global model's generalization capabilities.

Privacy-Preserving in Federated Learning: A Comparison between Differential Privacy and Homomorphic Encryption across Different Scenarios
Alessio Catalfamo, Maria Fazio, Antonio Celesti, and Massimo Villari
(Università degli Studi di Messina, Italy; University of Messina, Italy; Department of MIFT, University of Messina, Italy)
Federated Learning is a decentralized machine learning paradigm where multiple devices collaboratively train a model while keeping their training data local so to enhance data privacy and reduce communication costs. However, it is vulnerable to some privacy threats, such as model inversion and membership inference attacks, which can expose sensitive data in the devices. This paper explores the integration of different Privacy-Preserving approaches in FL, which are Differential Privacy and Homomorphic Encryption. We assess the impact of these techniques on model accuracy, training efficiency, and computational overhead through experimental evaluations across three different application use cases. Our evaluation results depict the trade-off between privacy enhancement and performance for different scenarios and models. They highlight the effectiveness of combining Federated Learning with advanced cryptographic and privacy-preserving techniques to achieve secure, scalable, and privacy-aware distributed learning.

ICST Workshops 2025 – Proceedings

Frontmatter

5th International Workshop on Artificial Intelligence in Software Testing (AIST 2025)

Session 1: AI/ML for Software Testing Applications

Session 2: LLMs for Test Case Generation

21th International Workshop on Advances in Model Based Testing (A-MOST 2025)

Coverage and Path-Based Testing

Model and Machine Learning

AI and Testing

16th International Workshop on Automated Testing (A-TEST)

AI-Driven Testing Automation

Automated Testing in Critical Systems

5th International CI/CD Industry Workshop (CCIW 2025)

The Purpose of CI

CI at Scale

Centralized, ML-Based, CI Optimizations

12th International Workshop on Software Test Architecture (InSTA 2025)

Research Papers

Industry Reports and Emgerging Ideas

9th International Workshop on Testing Extra-Functional Properties and Quality Characteristics of Software Systems (ITEQS 2025)

AI and Testing

Cyber-Physical Systems

14th International Workshop on Combinatorial Testing (IWCT 2025)

Theoretical Aspects of CT

Combinatorial Testing and ML/AI

Combinatorial Testing Tools

Applications of CT

International Workshop on Mutation Testing (Mutation 2025)

Session 1

Session 2

8th International IEEE Workshop on Next Level of Test Automation (NextA 2025)

Testing and LLMs

Fault Localization

1st International Workshop on Secure, Accountable, and Verifiable Machine Learning (SAFE-ML 2025)

Robustness, Verification, and Security in AI Systems

Security and Privacy in Fedetated Learning Systems