2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW),
March 31 – April 4, 2025,
Naples, Italy
Frontmatter
Welcome to the AIST 2025 Workshop
Welcome to the 5th International Workshop on Artificial Intelligence in Software Testing (AIST), co-located with the IEEE International Conference on Software Testing, Verification, and Validation (ICST) 2025, in the beautiful city of Naples, Italy. As the Program Co-Chairs of AIST, we are pleased to present this specialised event exploring the integration of Artificial Intelligence (AI) into software testing.
Welcome to the A-MOST 2025 Workshop
We are pleased to welcome you to the 21st edition of the Advances in Model-Based Testing Workshop (A-MOST 2025), co-located with the IEEE International Conference on Software Testing, Verification and Validation (ICST 2025). The increasing complexity and ever-evolving nature of software-based systems and the need for assurance pose new challenges for testing. Model-based testing (MBT) is an important research area, where new approaches, methods, and tools make MBT more deployable and useful for industry than ever, improving the effectiveness and efficiency of the testing process. Models and different abstractions can ease the comprehension of complex systems and allow systematization and automation of test generation. The A-MOST workshop provides a forum for researchers from academia and industry to exchange ideas, address fundamental challenges, and discuss new applications of model-based testing.
Welcome to the A-TEST 2025 Workshop
Welcome to the 16th edition of the Workshop on Automated Testing (A-TEST 2025), co-located with the ICST 2025 conference on April 1, 2025, in Naples, Italy. This year, A-TEST reaches an exciting milestone as it merges with INTUITESTBEDS (International Workshop on User Interface Test Automation and Testing Techniques for Event-Based Software), bringing together a broader community of researchers, practitioners, and tool developers. This unified workshop serves as a forum for exchanging ideas and discussing the latest advancements in automated UI testing and UI-based testing.
Welcome to the CCIW 2025 Workshop
Welcome to the 5th instance of the CI/CD Industry Workshop (CCIW). This workshop is a forum for industry practitioners and academics to meet and share the challenging problems they are facing. It is also a time to celebrate accomplishments and new opportunities in build, test, and release automation. CCIW is an informal workshop: we do not solicit or publish full papers but instead focus on sharing ideas, fostering connections, and encouraging broad participation.
Welcome to the InSTA 2025 Workshop
It is our pleasure to welcome you to the 12th International Workshop on Software Test Architecture (InSTA 2025), co-located with the IEEE International Conference on Software Testing, Verification and Validation (ICST 2025) in Naples, Italy.
Software test architectures must be approached indirectly, as part of overall test strategies. Some organizations are working to establish new ways to design novel test architectures, but there is no unified understanding of the key concepts of test architectures. This workshop is intended to allow researchers and practitioners to comprehensively discuss the central concepts of test architectures. InSTA 2025 provides sessions for research, industry experience, and emerging-idea papers about software test architecture.
We would like to thank the program committee for their contributions to the workshop. Please enjoy InSTA 2025.
Welcome to the ITEQS 2025 Workshop
As software systems continue to integrate deeply into both social and physical environments, the importance of extra-functional properties (EFPs)—such as performance, security, robustness, and scalability—has grown significantly. The success of software products is no longer measured solely by their functional correctness but increasingly by their quality characteristics. These EFPs are critical, particularly in resource-constrained environments such as real-time embedded systems, cyber-physical systems, IoT devices, and edge-based services, where failure can have severe consequences. Testing these properties presents unique challenges, often requiring approaches beyond traditional functional testing methodologies. The intricate nature of EFPs, their interdependencies, and their susceptibility to both design and deployment environment make them a critical area for research and industrial application.
ITEQS serves as a dedicated platform for researchers and practitioners to exchange ideas, discuss challenges, and propose solutions for advancing EFP testing techniques. This year, ITEQS 2025 attracted high-quality contributions from diverse research areas, including protocol fuzzing for IoT security, quality assurance for large language models with retrieval augmented generation, reinforcement learning for security testing, and fault localization techniques.
Welcome to the Mutation 2025 Workshop
It is a pleasure to welcome you to the 20th International Workshop on Mutation Analysis, Mutation 2025. Mutation is widely recognised as one of the most important techniques to assess the fault-revealing capabilities of test suites, test generators, and other testing techniques.
Welcome to the NEXTA 2025 Workshop
The IEEE NEXTA 2025 workshop marks the 8th edition of a growing forum that brings together researchers and practitioners to explore advancements in test automation. In today’s rapidly evolving software landscape, test automation plays a critical role not only in accelerating test cycles but also in ensuring secure, repeatable, and reliable validation of software systems. The workshop emphasizes emerging trends such as AI-powered testing, DevOps-driven automation pipelines, model-based testing, and the integration of Large Language Models (LLMs) into the software testing lifecycle.
This year’s edition highlights the shift toward increased AI involvement in testing, with tools and techniques now supporting tasks like intelligent test case generation, self-healing scripts, and autonomous defect detection. These developments are reshaping the role of human testers, making testing more effective and efficient across industries.
NEXTA 2025 will showcase the latest findings, tools, and practices shaping the next level of automation. We thank the contributors, reviewers, and ICST organizers for their efforts and support. We warmly invite the community to engage in the stimulating discussions and collaborations that define NEXTA.
Welcome to the SAFE-ML 2025 Workshop
Welcome to the 1st International Workshop on Secure, Accountable, and Verifiable Machine Learning (SAFE-ML 2025), co-located with the 18th IEEE International Conference on Software Testing, Verification, and Validation (ICST 2025) in Naples, Italy. As the Program Co-Chairs, it is an honor to introduce this event, which addresses the critical intersection of Machine Learning (ML) and software testing.
5th International Workshop on Artificial Intelligence in Software Testing (AIST 2025)
Session 1: AI/ML for Software Testing Applications
Generating Latent Space-Aware Test Cases for Neural Networks using Gradient-Based Search
Simon Speth,
Christoph Jasper,
Claudius Jordan, and
Alexander Pretschner
(TUM, Germany)
Autonomous vehicles rely on deep learning (DL) models like object detectors and traffic sign classifiers. Assessing the robustness of these safety-critical components requires good test cases that are both realistic, lying in the distribution of the real-world data, and cost-effective in revealing potential failures. Unlike previous methods that use adversarial attacks on the pixel space, our approach identifies latent space-aware test cases using a conditional variational autoencoder (CVAE) through three steps: (1) Train a CVAE on the dataset. (2) Generate test cases by computing adversarial examples in the CVAE’s latent space. (3) Cluster challenging test cases based on their latent representations. The resulting clusters characterize regions that reveal potential defects in the DL model, which require further analysis. Our results show that our approach is capable of generating failing test cases for all classes of the MNIST and GTSRB datasets in a purely data-driven way, surpassing the baseline of random latent space sampling by up to 75 times. Finally, we validate our approach by detecting previously introduced faults in a faulty DL model. We suggest complementing expert-driven testing methods with our purely data-driven approach to uncover defects experts otherwise might miss. To strengthen transparency and facilitate replication, we provide a replication package and digital appendix to make our code, models, visualizations, and results publicly available.
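As an illustration of step (2), the following minimal Python/PyTorch sketch shows one plausible form of gradient-based search in a CVAE's latent space; the decoder, classifier, starting point z0, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def latent_adversarial_search(decoder, classifier, z0, label, steps=100, lr=0.05):
    # Optimize a latent point so the decoded image maximizes the classifier's
    # loss on the true label, i.e., search for a failing test case.
    z = z0.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = classifier(decoder(z))          # decode, then classify
        loss = -F.cross_entropy(logits, label)   # negated: ascend the loss
        loss.backward()
        optimizer.step()
    return z.detach()  # candidate failing test case, still in latent space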
@InProceedings{ICST Workshops25p1,
author = {Simon Speth and Christoph Jasper and Claudius Jordan and Alexander Pretschner},
title = {Generating Latent Space-Aware Test Cases for Neural Networks using Gradient-Based Search},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {1--8},
doi = {},
year = {2025},
}
Adaptive Test Healing using LLM/GPT and Reinforcement Learning
Nariman Mani and
Salma Attaranasl
(Nutrosal Inc., Canada)
Flaky tests disrupt software development pipelines by producing inconsistent results, undermining reliability and efficiency. This paper introduces a hybrid framework for adaptive test healing, combining Large Language Models (LLMs) like GPT with Reinforcement Learning (RL) to address test flakiness dynamically. LLMs analyze test logs to classify failures and extract contextual insights, while the RL agent learns optimal strategies for test retries, parameter tuning, and environment resets. Experimental results demonstrate the framework's effectiveness in reducing flakiness and improving CI/CD pipeline stability, outperforming traditional approaches. This work paves the way for scalable, intelligent test automation in dynamic development environments.
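The paper does not publish its algorithm, but the division of labor it describes (the LLM classifies the failure, the RL agent picks the healing action) can be sketched as a contextual bandit; the failure classes, actions, and reward scheme below are illustrative assumptions.

import random
from collections import defaultdict

ACTIONS = ["retry", "tune_timeout", "reset_environment"]
q_table = defaultdict(float)  # (failure_class, action) -> estimated value

def choose_action(failure_class, epsilon=0.1):
    # failure_class would come from the LLM's analysis of the test log.
    if random.random() < epsilon:
        return random.choice(ACTIONS)            # explore
    return max(ACTIONS, key=lambda a: q_table[(failure_class, a)])

def learn(failure_class, action, passed, alpha=0.5):
    reward = 1.0 if passed else -1.0             # did the healing action work?
    key = (failure_class, action)
    q_table[key] += alpha * (reward - q_table[key])

action = choose_action("async_timing")           # hypothetical LLM-assigned class
learn("async_timing", action, passed=True)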
@InProceedings{ICST Workshops25p9,
author = {Nariman Mani and Salma Attaranasl},
title = {Adaptive Test Healing using LLM/GPT and Reinforcement Learning},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {9--16},
doi = {},
year = {2025},
}
Test2Text: AI-Based Mapping between Autogenerated Tests and Atomic Requirements
Elena Treshcheva,
Iosif Itkin,
Rostislav Yavorski, and
Nikolay Dorofeev
(Exactpro, USA; Exactpro LLC, UK; Exactpro, Georgia)
Artificial intelligence is transforming software testing by scaling up test data generation and analysis, creating new possibilities, but also introducing new challenges. One of the common problems with large-scale test data is the lack of traceability between test scenarios and system requirements. The paper addresses this challenge by proposing a traceability solution tailored to an industrial setting employing a data-driven approach. Building on an existing model-based testing framework, the design extends its annotation capabilities through a multilayer taxonomy. The suggested architecture leverages AI techniques for bidirectional mapping: linking requirements to test scripts for coverage analysis and tracing test scripts back to requirements to understand the tested functionality.
@InProceedings{ICST Workshops25p17,
author = {Elena Treshcheva and Iosif Itkin and Rostislav Yavorski and Nikolay Dorofeev},
title = {Test2Text: AI-Based Mapping between Autogenerated Tests and Atomic Requirements},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {17--20},
doi = {},
year = {2025},
}
Session 2: LLMs for Test Case Generation
LLM Prompt Engineering for Automated White-Box Integration Test Generation in REST APIs
André Mesquita Rincon,
Auri Marcelo Rizzo Vincenzi, and
João Pascoal Faria
(Federal Institute of Tocantins (IFTO); Federal University of São Carlos (UFSCar), Brazil; Department of Computing (DC), Federal University of São Carlos (UFSCar), Brazil; INESC TEC, Faculty of Engineering, University of Porto, Portugal)
This study explores prompt engineering for automated white-box integration testing of RESTful APIs using Large Language Models (LLMs). Four versions of prompts were designed and tested across three OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o) to assess their impact on code coverage, token consumption, execution time, and financial cost. The results indicate that different prompt versions, especially with more advanced models, achieved up to 90% coverage, although at higher costs. Additionally, combining test sets from different models increased coverage, reaching 96% in some cases. We also compared the results with EvoMaster, a specialized tool for generating tests for REST APIs, where LLM-generated tests achieved comparable or higher coverage in the benchmark projects. Despite higher execution costs, LLMs demonstrated superior adaptability and flexibility in test generation.
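The study's four prompt versions are not reproduced here, but a white-box prompt for this task plausibly embeds the endpoint's source so the model can target its branches; the template and the call_llm placeholder below are assumptions, not the authors' prompts.

PROMPT = """You are a test engineer. Write JUnit 5 integration tests for the
REST endpoint below. Cover every branch of the handler (white-box coverage).

Endpoint source:
{source}

Return only compilable Java test code."""

def build_prompt(source: str) -> str:
    return PROMPT.format(source=source)

# tests = call_llm("gpt-4o", build_prompt(endpoint_source))  # call_llm is a placeholder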
@InProceedings{ICST Workshops25p21,
author = {André Mesquita Rincon and Auri Marcelo Rizzo Vincenzi and João Pascoal Faria},
title = {LLM Prompt Engineering for Automated White-Box Integration Test Generation in REST APIs},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {21--28},
doi = {},
year = {2025},
}
A System for Automated Unit Test Generation using Large Language Models and Assessment of Generated Test Suites
Andrea Lops,
Fedelucio Narducci,
Azzurra Ragone,
Michelantonio Trizio, and
Claudio Bartolini
(Polytechnic University of Bari - Wideverse s.r.l., Italy; Polytechnic University of Bari, Italy; University of Bari, Italy; Wideverse s.r.l., Italy)
Unit tests are fundamental for ensuring software correctness but are costly and time-intensive to design and create. Recent advances in Large Language Models (LLMs) have shown potential for automating test generation, though existing evaluations often focus on simple scenarios and lack scalability for real-world applications. To address these limitations, we present AgoneTest, an automated system for generating and assessing complex, class-level test suites for Java projects. Leveraging the Methods2Test dataset, we developed Classes2Test, a new dataset enabling the evaluation of LLM-generated tests against human-written tests.
Our key contributions include a scalable automated software system, a new dataset, and a detailed methodology for evaluating test quality.
@InProceedings{ICST Workshops25p29,
author = {Andrea Lops and Fedelucio Narducci and Azzurra Ragone and Michelantonio Trizio and Claudio Bartolini},
title = {A System for Automated Unit Test Generation using Large Language Models and Assessment of Generated Test Suites},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {29--36},
doi = {},
year = {2025},
}
From Implemented to Expected Behaviors: Leveraging Regression Oracles for Non-regression Fault Detection using LLMs
Stefano Ruberto,
Judith Perera,
Gunel Jahangirova, and
Valerio Terragni
(JRC European Commission, Italy; University of Auckland, New Zealand; King’s College London, UK)
Automated test generation tools often produce assertions that reflect implemented behavior, limiting their usage to regression testing. In this paper, we propose LLMProphet, a black-box approach that applies Few-Shot Learning with LLMs, using automatically generated regression tests as context to identify non-regression faults without relying on source code. By employing iterative cross-validation and a leave-one-out strategy, LLMProphet identifies regression assertions that are misaligned with expected behaviors. We outline LLMProphet's workflow, feasibility, and preliminary findings, demonstrating its potential for LLM-driven fault detection.
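A minimal sketch of the leave-one-out idea, assuming each assertion record carries an input and an expected value and that llm_predict wraps a few-shot LLM call; the names and data structure are illustrative, not the authors' code.

def suspicious_assertions(assertions, llm_predict):
    # Hold out each regression assertion in turn; the remaining ones serve as
    # few-shot context. If the LLM's prediction disagrees with the recorded
    # expected value, the assertion may encode a non-regression fault.
    flagged = []
    for i, held_out in enumerate(assertions):
        context = assertions[:i] + assertions[i + 1:]
        predicted = llm_predict(context, held_out["input"])  # placeholder call
        if predicted != held_out["expected"]:
            flagged.append(held_out)
    return flagged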
@InProceedings{ICST Workshops25p37,
author = {Stefano Ruberto and Judith Perera and Gunel Jahangirova and Valerio Terragni},
title = {From Implemented to Expected Behaviors: Leveraging Regression Oracles for Non-regression Fault Detection using LLMs},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {37--40},
doi = {},
year = {2025},
}
21st International Workshop on Advances in Model-Based Testing (A-MOST 2025)
Coverage and Path-Based Testing
Novel Algorithm to Solve the Constrained Path-Based Testing Problem
Matej Klima,
Miroslav Bures,
Marek Miltner,
Chad Zanocco,
Gordon Fraser,
Sebastian Schweikl, and
Patric Feldmeier
(Czech Technical University in Prague, Czechia; Stanford University, USA; University of Passau, Germany)
Constrained Path-based Testing (CPT) is a technique that extends traditional path-based testing by adding constraints on the order or presence of specific sequences of actions in the tests of System Under Test (SUT) processes. Through such an extension, CPT enhances the ability of the model to capture more real-life situations. In CPT, we define four types of constraints that either enforce or prohibit the use of a pair of actions in the resulting test set. We propose a novel Constrained Path-based Testing Composition (CPC) algorithm to solve the Constrained Path-based Testing Problem. We compare the results returned by the CPC algorithm with two alternatives: (1) the Filter algorithm, which solves the CPT problem in a greedy manner, and (2) the Edge algorithm, which generates a set of test cases that satisfy edge coverage. We evaluated the algorithms on 200 problem instances, with the CPC algorithm returning test sets (T) that have, on average, 350 edges, which is 2.4% and 11.1% fewer than the average number of edges in the T returned by the Filter algorithm and the Edge algorithm, respectively. Regarding the compliance of the generated T with the constraints, the CPC algorithm produced T that satisfied the constraints in 95% of cases, the Filter algorithm in 45% of cases, and the Edge algorithm returned T that satisfied the constraints for only 6% of SUT instances. Regarding edge coverage, the CPC algorithm returned test sets that contained, on average, 91.5% of the edges in the graphs, while for T returned by the Filter algorithm it was 90.8%. When comparing the average results for the edge coverage criterion and the fulfillment of the constraint criterion across the algorithms, we consider the incomplete edge coverage achieved by the CPC algorithm, combined with 95% fulfillment of the graph constraints, to be a reasonable compromise.
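The paper defines four constraint types; the sketch below checks a single test path against two plausible ones, an enforced ordering and a prohibited ordering. The constraint semantics are simplified assumptions for illustration, not the paper's exact definitions.

def satisfies(path, enforced, prohibited):
    # enforced: (a, b) pairs where, if a occurs, b must occur later;
    # prohibited: (a, b) pairs where b must never follow a.
    pos = {action: i for i, action in enumerate(path)}
    for a, b in enforced:
        if a in pos and (b not in pos or pos[b] < pos[a]):
            return False
    for a, b in prohibited:
        if a in pos and b in pos and pos[a] < pos[b]:
            return False
    return True

print(satisfies(["login", "browse", "checkout"],
                enforced=[("login", "checkout")],
                prohibited=[("checkout", "login")]))  # True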
@InProceedings{ICST Workshops25p41,
author = {Matej Klima and Miroslav Bures and Marek Miltner and Chad Zanocco and Gordon Fraser and Sebastian Schweikl and Patric Feldmeier},
title = {Novel Algorithm to Solve the Constrained Path-Based Testing Problem},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {41--49},
doi = {},
year = {2025},
}
CPT Manager: An Open Environment for Constrained Path-Based Testing
Matej Klima,
Miroslav Bures,
Daniel Holotik,
Maximilian Herczeg,
Marek Miltner, and
Chad Zanocco
(Czech Technical University in Prague, Czechia; Stanford University, USA)
Path-based Testing is a common technique to test System Under Test (SUT) processes. Generally, a directed graph that models a system's workflow is input to the test path generation process, together with the selected test coverage criterion. Several algorithms proposed in the literature traverse the graph and facilitate the generation of test cases for the selected coverage criterion. However, a plain directed graph used for modeling SUT processes does not allow capturing real-life dependencies and constraints between actions in the tested processes, which can limit the applicability of this technique. Therefore, we defined an extended model that allows the specification of constraints upon the graph's elements, and a set of algorithms that generate sets of test cases satisfying the given constraints together with edge coverage. Because path-based testing currently lacks a platform in which engineers and researchers can share SUT models, to be assembled into open datasets for benchmarking evolving path-based MBT algorithms, especially for the problem of generating test paths under constraints, this paper summarizes the problem and presents a novel tool for the creation and management of SUT models with constraints. The tool supports the generation of test paths and can also serve as a platform for building such benchmark datasets.
@InProceedings{ICST Workshops25p50,
author = {Matej Klima and Miroslav Bures and Daniel Holotik and Maximilian Herczeg and Marek Miltner and Chad Zanocco},
title = {CPT Manager: An Open Environment for Constrained Path-Based Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {50--53},
doi = {},
year = {2025},
}
Towards Improving Automated Testing with GraphWalker
Yavuz Koroglu,
Mutlu Beyazıt,
Onur Kilincceker,
Serge Demeyer, and
Franz Wotawa
(Graz University of Technology, Austria; University of Antwerp and Flanders Make, Belgium)
GraphWalker is a widespread automated model-based testing tool that generates executable test cases from graph models of a system under test. GraphWalker implements only two random test generation algorithms and no optimization-based algorithm, and an evaluation of its performance in terms of test length and coverage remains an open question in the literature. In this work, we performed experiments on three realistic systems to evaluate the redundancy, coverage, and length of GraphWalker test cases. The experimental results show that even the best GraphWalker test cases are highly redundant, limited in edge-pair coverage, and need significantly longer test cases to increase edge-pair coverage. Overall, we establish a baseline against which to compare future optimization-based algorithms, while the amount of improvement and its impact remain important research questions for the future.
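Edge-pair coverage, one of the criteria the study reports, counts consecutive pairs of edges exercised by the test paths; a minimal sketch with toy edges as strings follows (the representation is an assumption for illustration).

def edge_pair_coverage(test_paths, model_pairs):
    covered = set()
    for path in test_paths:                   # each path is a list of edges
        covered.update(zip(path, path[1:]))   # consecutive edge pairs
    return len(covered & model_pairs) / len(model_pairs)

paths = [["A-B", "B-C", "C-A"], ["A-B", "B-C"]]
pairs = {("A-B", "B-C"), ("B-C", "C-A"), ("C-A", "A-B")}
print(edge_pair_coverage(paths, pairs))       # 2 of 3 pairs covered: ~0.67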
@InProceedings{ICST Workshops25p54,
author = {Yavuz Koroglu and Mutlu Beyazıt and Onur Kilincceker and Serge Demeyer and Franz Wotawa},
title = {Towards Improving Automated Testing with GraphWalker},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {54--58},
doi = {},
year = {2025},
}
Model and Machine Learning
Automata Learning for React Web Applications
Peter Grubelnik and
Franz Wotawa
(Technische Universitaet Graz, Austria)
Testing is an inevitable part of any software engineering process to ensure quality and reliability. Model-based testing is a successful approach for the automated generation of test cases but requires a model of the system under test. Constructing such models can take time and effort. Hence, there is a need to automate model generation as well, and automata learning is a promising approach. In this paper, we address the application of automata learning to React web forms, which are used in a wide range of web applications. We describe a tool we developed that couples web forms to automata learning for obtaining a model. In addition, we discuss the application of the tool, its limitations, challenges, and solutions for reducing the state space and making the approach feasible for practical applications.
@InProceedings{ICST Workshops25p59,
author = {Peter Grubelnik and Franz Wotawa},
title = {Automata Learning for React Web Applications},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {59--66},
doi = {},
year = {2025},
}
Mutating Skeletons: Learning Timed Automata via Domain Knowledge
Felix Wallner,
Bernhard K. Aichernig,
Florian Lorber, and
Martin Tappler
(Graz University of Technology and Silicon Austria Labs (TU Graz - SAL DES Lab), Austria; Graz University of Technology, Austria; Silicon Austria Labs, Austria; TU Wien, Austria)
Formal verification techniques, such as model checking, can provide valuable insights and guarantees for (safety-critical) devices and their possible behavior. However, these guarantees only hold true as long as the model correctly reflects the system. Automata learning provides a huge advantage there as it enables not only the automatic creation of the needed models but also ensures their correct reflection of the system behavior. However, and this holds especially true for real-time systems, model learning techniques can become very time consuming. To combat this, we show how to integrate given domain knowledge into an existing approach based on genetic programming to speed up the learning process. In particular, we show how the genetic programming approach can take a (possibly abstracted, incomplete or incorrect) untimed skeleton of an automaton, which can often be obtained very cheaply, and augment it with timing behavior to form timed automata in a fast and efficient manner. We demonstrate the approach on several examples of varying sizes.
@InProceedings{ICST Workshops25p67,
author = {Felix Wallner and Bernhard K. Aichernig and Florian Lorber and Martin Tappler},
title = {Mutating Skeletons: Learning Timed Automata via Domain Knowledge},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {67--77},
doi = {},
year = {2025},
}
SelfBehave, Generating a Synthetic Behaviour-Driven Development Dataset using SELF-INSTRUCT
Manon Galloy,
Martin Balfroid,
Benoît Vanderose, and
Xavier Devroey
(NADI, University of Namur, Belgium)
While state-of-the-art large language models (LLMs) show great potential for automating various Behaviour-Driven Development (BDD) related tasks, such as test generation, smaller models depend on high-quality data, which is challenging to find in sufficient quantity. To address this challenge, we adapt the SELF-INSTRUCT method to generate a large synthetic dataset from a small set of human-written, high-quality scenarios. We evaluate the impact of the initial seed scenarios' quality on the generated scenarios by generating two synthetic datasets: one from 175 high-quality seeds and one from 175 seeds that did not meet all quality criteria. We performed a qualitative analysis using state-of-the-art quality criteria and found that the quality of the seeds does not significantly influence the generation of complete and essential scenarios. However, it impacts the scenarios' ability to focus on a single action and outcome, and their compliance with Gherkin syntactic rules. During our evaluation, we also found that while raters agreed on whether a scenario was of high quality overall, they often disagreed on individual criteria, indicating a need for quality criteria that are easier to apply in practice.
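A minimal sketch of one SELF-INSTRUCT-style generation round as adapted here, assuming generate wraps an LLM call that turns sampled seed scenarios into a new Gherkin scenario and quality_filter applies the syntactic checks; both are placeholders, not the authors' code.

import random

def self_instruct_round(seed_pool, generate, quality_filter, n_candidates=5):
    accepted = []
    for _ in range(n_candidates):
        examples = random.sample(seed_pool, k=min(3, len(seed_pool)))
        candidate = generate(examples)        # placeholder LLM call
        if quality_filter(candidate):         # e.g., Gherkin syntax check
            accepted.append(candidate)
    seed_pool.extend(accepted)                # accepted scenarios seed later rounds
    return accepted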
@InProceedings{ICST Workshops25p78,
author = {Manon Galloy and Martin Balfroid and Benoît Vanderose and Xavier Devroey},
title = {SelfBehave, Generating a Synthetic Behaviour-Driven Development Dataset using SELF-INSTRUCT},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {78--84},
doi = {},
year = {2025},
}
AI and Testing
Model-Based Testing Computer Games: Does It Work?
Wishnu Prasetya
(Utrecht University, Netherlands)
Model-based testing (MBT) allows target software to be tested systematically and automatically by making use of a model of the software under test. It has been successfully applied in various domains. However, its application to testing computer games has not been studied much. The highly dynamic nature of computer games makes them challenging to model. In this paper we propose a predicate-based modeling approach coupled with an on-line test generation approach. Both aspects reduce the amount of detail that needs to be incorporated in the model to facilitate effective test generation. Additionally, we leverage intelligent agents, so that dealing with hazards and obstacles can optionally be delegated to such an agent, keeping the model clean of such aspects. A case study with a game called MiniDungeon is included to discuss the viability and benefits of the approach, e.g., in terms of code coverage and the ability to cover deep states.
@InProceedings{ICST Workshops25p85,
author = {Wishnu Prasetya},
title = {Model-Based Testing Computer Games: Does It Work?},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {85--92},
doi = {},
year = {2025},
}
16th International Workshop on Automated Testing (A-TEST 2025)
AI-Driven Testing Automation
Test Case Generation for Dialogflow Task-Based Chatbots
Rocco Gianni Rapisarda,
Davide Ginelli,
Diego Clerissi, and
Leonardo Mariani
(University of Milano-Bicocca, Italy; University of Genova, Italy)
Chatbots are software applications, typically embedded in Web and Mobile applications, designed to assist the user in a plethora of activities, from chit-chatting to task completion. They enable diverse forms of interaction, like text and voice commands. As any software, chatbots are susceptible to bugs, and their pervasiveness in our lives, as well as the underlying technological advancements, call for tailored quality assurance techniques. However, test case generation techniques for conversational chatbots are still limited.
In this paper, we present Chatbot Test Generator (CTG), an automated testing technique designed for task-based chatbots. We conducted an experiment comparing CTG with the state-of-the-art BOTIUM and CHARM tools on seven chatbots, observing that the test cases generated by CTG outperformed the competitors in terms of robustness and effectiveness.
@InProceedings{ICST Workshops25p93,
author = {Rocco Gianni Rapisarda and Davide Ginelli and Diego Clerissi and Leonardo Mariani},
title = {Test Case Generation for Dialogflow Task-Based Chatbots},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {93--102},
doi = {},
year = {2025},
}
Automated Testing of the GUI of a Real-Life Engineering Software using Large Language Models
Tim Rosenbach,
Alexander Weinert, and
David Heidrich
(German Aerospace Center (DLR), Institute for Software Technology, Germany)
One important step in software development is testing the finished product with actual users. These tests aim, among other goals, at determining unintuitive behavior of the software as it is presented to the end-user. Moreover, they aim to determine inconsistencies in the user-facing interface. They provide valuable feedback for the development of the software, but are time-intensive to conduct. In this work, we present GERALLT, a system that uses Large Language Models (LLMs) to perform exploratory tests of the Graphical User Interface (GUI) of a real-life engineering software. GERALLT automatically generates a list of potential unintuitive and inconsistent parts of the interface. We present the architecture of GERALLT and evaluate it on a real-world use case of the engineering software, which has been extensively tested by developers and users. Our results show that GERALLT is able to determine issues with the interface that support the software development team in future development of the software.
@InProceedings{ICST Workshops25p103,
author = {Tim Rosenbach and Alexander Weinert and David Heidrich},
title = {Automated Testing of the GUI of a Real-Life Engineering Software using Large Language Models},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {103--110},
doi = {},
year = {2025},
}
SleepReplacer-GPT: AI-Based Thread Sleep Replacement in Selenium WebDriver Tests
Dario Olianas,
Maurizio Leotta, and
Filippo Ricca
(Università di Genova, Italy; DIBRIS, Università di Genova, Italy)
Ensuring the quality of modern web applications through end-to-end (E2E) testing is crucial, especially for dynamic systems like single-page applications. Managing asynchronous calls effectively is a key challenge, often addressed using thread sleeps or explicit waits. While thread sleeps are simple to use, they cause inefficiencies and flakiness, whereas explicit waits are more efficient but demand careful implementation.
This work explores extending SleepReplacer, a tool that automatically replaces thread sleeps with explicit waits in Selenium WebDriver test suites. We aim to enhance its capabilities by integrating it with ChatGPT, enabling intelligent and automated replacement of thread sleeps with optimal explicit waits. This integration aims to improve code quality and reduce flakiness.
We developed a structured procedure for interacting with ChatGPT and validated it on three test suites and synthetic examples covering diverse cases.
Results show that the LLM-based approach correctly replaces thread sleeps with explicit waits on the first attempt, consistently outperforming SleepReplacer. These findings support integrating ChatGPT with SleepReplacer to create a smarter, more efficient tool for managing asynchronous behavior in test suites.
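For readers unfamiliar with the two mechanisms, the contrast looks roughly like this in Selenium's Python bindings (the URL, element id, and timeout are illustrative assumptions):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")             # placeholder URL

# Thread sleep: always waits 5 s, even if the element appeared after 0.2 s.
time.sleep(5)

# Explicit wait: polls until the condition holds, up to a 10 s timeout.
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "result"))  # hypothetical id
)
element.click()
driver.quit()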
@InProceedings{ICST Workshops25p111,
author = {Dario Olianas and Maurizio Leotta and Filippo Ricca},
title = {SleepReplacer-GPT: AI-Based Thread Sleep Replacement in Selenium WebDriver Tests},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {111--120},
doi = {},
year = {2025},
}
Curiosity Driven Multi-agent Reinforcement Learning for 3D Game Testing
Raihana Ferdous,
Fitsum Meshesha Kifetew,
Davide Prandi, and
Angelo Susi
(Consiglio Nazionale delle Ricerche (CNR), Italy; Fondazione Bruno Kessler, Italy)
Recently, testing of games via autonomous agents has shown great promise in tackling challenges faced by the game industry, which has mainly relied on either manual testing or record/replay. In particular, Reinforcement Learning (RL) solutions have shown potential by learning directly from playing the game without the need for human intervention.
In this paper, we present cMarlTest, an approach for testing 3D games through curiosity-driven Multi-Agent Reinforcement Learning (MARL). cMarlTest deploys multiple agents that work collaboratively to achieve the testing objective. The use of multiple agents helps resolve issues faced by a single-agent approach.
We carried out experiments on different levels of a 3D game, comparing the performance of cMarlTest with a single-agent RL variant. Results are promising: considering three different types of coverage criteria, cMarlTest achieved higher coverage, and it was also more efficient in terms of time taken than the single-agent variant.
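The abstract does not spell out its curiosity formulation; a common one rewards the prediction error of a learned forward model, sketched below with NumPy (the scale factor and interface are assumptions, not cMarlTest's implementation).

import numpy as np

def curiosity_bonus(predicted_next_obs, actual_next_obs, scale=0.1):
    # States the agent's forward model predicts poorly are "novel", so the
    # agent earns an intrinsic reward for reaching them, encouraging exploration.
    error = float(np.mean((predicted_next_obs - actual_next_obs) ** 2))
    return scale * error

# total_reward = extrinsic_reward + curiosity_bonus(pred, obs)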
@InProceedings{ICST Workshops25p121,
author = {Raihana Ferdous and Fitsum Meshesha Kifetew and Davide Prandi and Angelo Susi},
title = {Curiosity Driven Multi-agent Reinforcement Learning for 3D Game Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {121--129},
doi = {},
year = {2025},
}
Automated Testing in Critical Systems
Introducing a Testing Automation Framework for Basic Integrity of Java Applications in Hitachi Railway Systems
Angelo Venditto,
Chiara Aprile,
Fabrizio Zanini, and
Fausto Del Villano
(Hitachi Rail, Italy)
In the railway industry, where safety, reliability, and performance are critical, test automation is essential to ensure Verification and Validation (V&V) of software and its quality, especially for Java applications. In order to reduce validation time and minimize human error, automatic tests are a powerful tool: they can make the validation process more efficient and systematic. Building an automated testing framework provides continuity and responsiveness, enabling both early issue detection and overall system performance optimization. These concepts are crucial in a constantly evolving environment that must comply with strict standards and numerous customer requirements. This paper outlines the automated testing framework developed for Java-based applications at Hitachi Rail, specifically targeting railway central control systems. The framework must support the software and V&V lifecycles and consists of three main stages: API test automation using Postman, used to ensure proper API behavior under expected conditions and against the given requirements; development of test suites with JUnit to validate the reliability and accuracy of individual modules and their interactions; and system tests through stress and fault-injection scenarios. The structured approach enables efficient testing by significantly increasing the number of tests completed in a shorter time, focusing on robustness and reliability, which are paramount in railway central control systems.
@InProceedings{ICST Workshops25p130,
author = {Angelo Venditto and Chiara Aprile and Fabrizio Zanini and Fausto Del Villano},
title = {Introducing a Testing Automation Framework for Basic Integrity of Java Applications in Hitachi Railway Systems},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {130--138},
doi = {},
year = {2025},
}
Automated Testing in Railways: Leveraging Robotic Arm Precision and Efficiency for Superior Customizability and Usability Test Case Execution
Giuseppe Guida,
Mario D'Avino,
Massimo Nisci,
Roberto Villano,
Barbara Di Giampaolo,
Michele Schettino, and
Erasmo Alessio Bortone
(Hitachi Rail STS, Italy)
In the context of software testing of a SIL4 railway onboard signalling application, this study investigates the benefits of using a testing environment composed of a robotic arm, supported by other tools, for the creation and execution of test cases. Starting from the application context of railway signalling, a brief introduction to the European Railway Traffic Management System is provided, describing the approaches and methodologies used by the V&V team of Hitachi Rail STS in using such an environment for the validation of SIL4 software based on a 2oo2 architecture, whose GUI is the Driver Machine Interface. Furthermore, the contribution of the Visual Inspection DMI Robotic Arm (VIDRA) tool to the verification and validation process is presented. To collect data on the benefits brought by the robotic testing environment, a survey was conducted by interviewing colleagues from the V&V department, asking them about the ease of use of the environment, the learning curve after one month of use, and how much their productivity improved, together with personal considerations. The results show that regardless of the seniority and experience of the interviewed staff, the evaluations of both the ease of use and the learning curve are mostly positive, while productivity received extremely positive evaluations. The personal considerations included encouraging comments with constructive suggestions aimed at enriching the environment with innovative features. These results suggest that the testing environment can be further improved; possible future developments are therefore described in the conclusion.
@InProceedings{ICST Workshops25p139,
author = {Giuseppe Guida and Mario D'Avino and Massimo Nisci and Roberto Villano and Barbara Di Giampaolo and Michele Schettino and Erasmo Alessio Bortone},
title = {Automated Testing in Railways: Leveraging Robotic Arm Precision and Efficiency for Superior Customizability and Usability Test Case Execution},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {139--148},
doi = {},
year = {2025},
}
A Workflow for Automated Testing of a Safety-Critical Embedded Subsystem
Michele Ignarra,
Maria Guarino,
Andrea Aiello,
Vincenzo Tonziello,
Giovanni De Donato,
Emanuele Pascale,
Renato De Guglielmo,
Antonio Costa, and
Cosimo Affuso
(Hitachi Rail, Italy; Alten, Italy)
As embedded systems in safety-critical domains, such as transportation, become increasingly complex, ensuring their reliability through thorough testing becomes essential. Manual testing methods are often time-consuming, error-prone, and inadequate to cover all possible failure scenarios, especially in systems that require a high level of functional safety and robustness. To address these challenges, the need for automation in the testing process is evident, particularly for real-time, embedded, and hardware-dependent systems where validation is crucial for ensuring safe operation.
This paper proposes a structured workflow for automating the testing of a safety-critical embedded subsystem, utilizing a real hardware-in-the-loop (HIL) environment. The approach covers all major phases of the testing process, including test specification, execution, and results assessment. The guiding idea is to shift the focus of intellectual effort toward the early stages of the validation process, facilitating a clear and shared understanding between testers and developers regarding the system's behavioral and functional requirements.
The first phase involves creating a detailed test specification, which includes: developing a behavioral model of the feature to be validated; defining data probes for monitoring the subsystem under test; and creating simulation scenarios to provide stimuli to the subsystem's external interfaces. In the second phase, executable test scripts are generated either automatically or semi-automatically using multi-technology scripting languages, followed by the creation of a sequence of test cases. The final phase involves executing the tests on the actual hardware platform, collecting execution logs, and running automated checks to evaluate the results.
The workflow and tools are evaluated using a case study of the onboard subsystem hardware and software platform developed by Hitachi Rail.
@InProceedings{ICST Workshops25p149,
author = {Michele Ignarra and Maria Guarino and Andrea Aiello and Vincenzo Tonziello and Giovanni De Donato and Emanuele Pascale and Renato De Guglielmo and Antonio Costa and Cosimo Affuso},
title = {A Workflow for Automated Testing of a Safety-Critical Embedded Subsystem},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {149--156},
doi = {},
year = {2025},
}
5th International CI/CD Industry Workshop (CCIW 2025)
The Purpose of CI
Continuous Evaluation: Using CI Techniques for Experimentation at Scale
Nilesh Jagnik
(Google, USA)
Experimentation is a critical product development practice that can help developers understand the impact of product changes. Through experimentation, various business metrics are generated that help assess the usefulness of product changes. Software changes at Google undergo a cycle of rigorous experimentation and refinement before they are released. In this talk, we present an overview of Google Search's experimentation process. We discuss how Google Search's experimentation platform works, the challenges faced when supporting experimentation at scale, and how they were overcome.
@InProceedings{ICST Workshops25p157,
author = {Nilesh Jagnik},
title = {Continuous Evaluation: Using CI Techniques for Experimentation at Scale},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {157--157},
doi = {},
year = {2025},
}
CI at Scale
Shifting Gears in Continuous Integration: BMW’s Strategies for High-Velocity Builds
Maximilian Jungwirth,
Simon Rummert,
Alexander Scott, and
Gordon Fraser
(BMW Group, University of Passau, Germany; BMW Group, Germany; University of Passau, Germany)
Increasing customer demands have led to highly complex automotive software, necessitating continuous refinement of our software development strategies at BMW. The transition to Bazel and the adoption of remote caching, remote build execution, and metastatic nodes were major stepping stones in keeping our Continuous Integration (CI) process feasible. However, challenges persist, including non-cached test executions and long integration times. Our future plans include utilising Machine Learning to optimise our CI pipelines for faster developer feedback cycles and more efficient resource consumption.
@InProceedings{ICST Workshops25p158,
author = {Maximilian Jungwirth and Simon Rummert and Alexander Scott and Gordon Fraser},
title = {Shifting Gears in Continuous Integration: BMW’s Strategies for High-Velocity Builds},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {158--159},
doi = {},
year = {2025},
}
Multi-architecture Testing at Google
Tim A. D. Henderson,
Sushmita Azad,
Avi Kondareddy, and
Abhay Singh
(Google, USA; Google LLC, USA; Google LLC, UK)
With the end of Moore's law, the need for higher performance per watt, and the rise of domain-specific chips, Google is adopting new general-purpose compute architectures. Google now routinely runs large-scale software (BigQuery, Spanner, PubSub, Blobstore, etc.) on both x86 and Arm CPUs. Previously, most server software at Google was written for the x86-64 architecture. When adopting a new architecture, the software needs to be verified for the new platform. In this talk, we will discuss how we are changing our central continuous integration platforms to support multi-architecture testing, while ensuring that the associated increase in testing cost and complexity remains sub-linear.
@InProceedings{ICST Workshops25p160,
author = {Tim A. D. Henderson and Sushmita Azad and Avi Kondareddy and Abhay Singh},
title = {Multi-architecture Testing at Google},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {160--160},
doi = {},
year = {2025},
}
The Art of Managing Flaky Tests at Scale
Oleg Andreyev
(Okta, USA)
Test flakiness, characterized by non-deterministic test outcomes despite unchanged code, poses a significant challenge to continuous integration system efficiency and developer productivity. We comprehensively study a year-long initiative to mitigate test flakiness within a large-scale engineering organization. We adjust test failure sensitivity thresholds to account for alert fatigue and resolution quality, refocus data visualization for prioritization and pattern analysis, and introduce incentives to improve the rate of solutions contributing to test stabilization. Key findings reveal the importance of balancing technical efficiency gains with organizational dynamics.
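As a toy illustration of the threshold tuning the talk describes, a flakiness score per test might be computed from rerun disagreement and compared against an alerting threshold; the metric and threshold below are assumptions, not Okta's actual mechanism.

def flake_rate(outcomes):
    # Share of reruns (same code, same test) that disagree with the first run.
    disagreements = sum(1 for o in outcomes[1:] if o != outcomes[0])
    return disagreements / max(len(outcomes) - 1, 1)

def should_alert(outcomes, threshold=0.25):
    # A higher threshold trades missed flakes for less alert fatigue.
    return flake_rate(outcomes) > threshold

print(should_alert(["pass", "fail", "pass", "pass"]))  # 1/3 disagree -> True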
@InProceedings{ICST Workshops25p161,
author = {Oleg Andreyev},
title = {The Art of Managing Flaky Tests at Scale},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {161--162},
doi = {},
year = {2025},
}
Centralized, ML-Based CI Optimizations
Unlocking New Practical Advantages of Machine Learning via Generating Large Amounts of High-Quality Data about Software Faults
Neetha Jambigi,
Marc Tuerke,
Bartosz Bogacz,
Thomas Bach, and
Michael Felderer
(University of Cologne, Germany; SAP, Germany; German Aerospace Center (DLR), Germany)
Large Language Models (LLMs) are promising machine learning (ML) tools for fault detection in software engineering but require large datasets for adapting them for downstream tasks. To address data scarcity in public and industrial repositories, we generate synthetic faults by mutating C++ code in SAP HANA, creating high-quality training data with clear cause-effect linkages. This dataset enables ML models to predict crash causes, stack traces, and detect failing test cases, enhancing debugging and improving CI/CD processes. Our discussion highlights the practical benefits and applications of ML-based fault prediction in large industrial projects.
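A minimal sketch of the kind of mutation that yields such cause-effect-linked faults: swap one operator in a source line, record the swap as the known cause, and let the failing test or crash be the observed effect. The operator table is illustrative, not SAP HANA's actual mutation set.

import random

MUTATIONS = [("==", "!="), ("<", "<="), ("&&", "||"), ("+", "-")]

def mutate_line(line: str):
    # Returns (mutated_line, applied_swap), or None if no operator matches.
    applicable = [(old, new) for old, new in MUTATIONS if old in line]
    if not applicable:
        return None
    old, new = random.choice(applicable)
    return line.replace(old, new, 1), (old, new)

print(mutate_line("if (count == limit && ready) {"))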
@InProceedings{ICST Workshops25p163,
author = {Neetha Jambigi and Marc Tuerke and Bartosz Bogacz and Thomas Bach and Michael Felderer},
title = {Unlocking New Practical Advantages of Machine Learning via Generating Large Amounts of High-Quality Data about Software Faults},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {163--163},
doi = {},
year = {2025},
}
12th International Workshop on Software Test Architecture (InSTA 2025)
Research Papers
Establishing Utilization Decision Criteria for Open-Source Software Projects Based on Issue Resolution Time Evaluation
Keisuke Inoue and
Kazuhiko Tsuda
(University of Tsukuba, Japan)
In the face of growing demand for high-quality software solutions at reduced costs and development times, the use of open-source software (OSS) has become increasingly prevalent. This study evaluates the speed of OSS support activities—an essential quality factor for OSS users—and proposes a method for early-stage evaluation based on issue resolution time. We analyzed 100 popular OSS projects hosted on GitHub, collecting data on issue creation and completion dates. The results indicate a correlation coefficient of 0.587 between 30 and 180 days of observation, aligning with previous findings on OSS support continuity. Notably, a satisfactory correlation of 0.525 was achieved at just 21 days, suggesting that organizations can assess OSS support speed earlier and potentially accelerate their decision-making process. Our findings provide a practical criterion for evaluating the speed of OSS support activities, which may help organizations adopt OSS with greater confidence and enhance productivity in software-related projects. Future research will expand these criteria by exploring additional factors affecting issue resolution time, such as project scale and corporate support.
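The evaluation metric is, in essence, a correlation between short-window and long-window measurements per project; a sketch with invented numbers follows (the per-project resolution-rate framing is an assumption, not the study's exact metric).

import numpy as np

# Per-project share of issues resolved within each observation window
# (illustrative values, not the study's data).
rate_21d = np.array([0.40, 0.55, 0.30, 0.70, 0.50])
rate_180d = np.array([0.65, 0.80, 0.45, 0.90, 0.75])

r = np.corrcoef(rate_21d, rate_180d)[0, 1]
print(f"21-day vs. 180-day correlation: {r:.3f}")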
@InProceedings{ICST Workshops25p164,
author = {Keisuke Inoue and Kazuhiko Tsuda},
title = {Establishing Utilization Decision Criteria for Open-Source Software Projects Based on Issue Resolution Time Evaluation},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {164--170},
doi = {},
year = {2025},
}
The Evaluation of Ambiguity Based on the Distance between Transitive Verbs and Objects in Japanese
Toshiharu Kato and
Kazuhiko Tsuda
(Graduate School of University of Tsukuba, Japan)
A method exists for detecting ambiguous expressions in requirements specifications using a dictionary focused on transitive verbs and their objects. This dictionary is constructed by extracting transitive verbs and objects from the requirements specifications and classifying them for inclusion based on specific conditions. Sentences with ambiguous expressions can be detected by searching for sentences containing transitive verbs and objects found in the dictionary. The accuracy of this dictionary has been observed to improve when more requirements specifications are used as input data. However, the selection criteria for constructing the dictionary and filtering ambiguous sentences after extraction have not been thoroughly examined, indicating that the accuracy of ambiguous sentence extraction could be further improved.
In this study, to further enhance the accuracy of this dictionary, we measured the distance between the detected transitive verbs and objects within sentences to determine whether the sentence is ambiguous. We conducted a t-test to compare the average distance between transitive verbs and objects in ambiguous and non-ambiguous sentences. The two-tailed test yielded a p-value of 0.0367 at a 5% significance level, indicating a significant difference in the distances between transitive verbs and objects. Therefore, we confirmed that the accuracy of the dictionary can be improved by weighting the words in the dictionary based on the distance between transitive verbs and objects.
Detecting ambiguous expressions with high accuracy is expected to increase the efficiency of reviewing the contents of requirements specifications and reduce the difficulty of clarifying specifications. This will contribute to better scope management in project management. Consequently, the proposed approach is expected to prevent delays and cost overruns, contributing to more effective schedule and cost management.
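The reported two-tailed t-test can be reproduced in outline with SciPy; the distance samples below are invented for illustration and do not come from the study.

from scipy import stats

# Token distances between each transitive verb and its object (illustrative).
ambiguous = [5, 7, 6, 9, 8, 7]
non_ambiguous = [2, 3, 2, 4, 3, 2]

t_stat, p_value = stats.ttest_ind(ambiguous, non_ambiguous)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")  # p < 0.05: significant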
@InProceedings{ICST Workshops25p171,
author = {Toshiharu Kato and Kazuhiko Tsuda},
title = {The Evaluation of Ambiguity Based on the Distance between Transitive Verbs and Objects in Japanese},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {171--177},
doi = {},
year = {2025},
}
A Language Framework for Test(-ware) Architecture
Luis León
(e-Quallity, Mexico)
In a previous paper we presented the approach called Formal Testing, in which computer languages are used in testing activities. In this conceptual framework we showed definitions for Software Architecture and Architectural Pattern, and outlined the concept of Testware Architecture. This paper elaborates on that and describes how a framework of three integrated languages is used to approach Testware Architecture: a Process Definition Language, a Specification Language, and a GUI Description Language. It includes representations of several architectures and a pattern.
@InProceedings{ICST Workshops25p178,
author = {Luis León},
title = {A Language Framework for Test(-ware) Architecture},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {178--187},
doi = {},
year = {2025},
}
Industry Reports and Emerging Ideas
Enhancing SaaS Product Reliability and Release Velocity through Optimized Testing Approach
Ren Karita and
Tsuyoshi Yumoto
(freee K.K., Japan)
In mission-critical SaaS products such as payment systems, maintaining rapid release velocity while ensuring high reliability is essential. However, as these products grow more complex, the areas requiring regression testing expand, increasing testing costs and often compromising release speed. This paper proposes a methodology to achieve both rapid release velocity and high reliability by optimizing the regression test suite through defect severity analysis. Furthermore, a test architecture is proposed to balance unit, integration, and system testing. This approach reduces regression testing costs, shortens testing duration, and accelerates release cycles.
@InProceedings{ICST Workshops25p188,
author = {Ren Karita and Tsuyoshi Yumoto},
title = {Enhancing SaaS Product Reliability and Release Velocity through Optimized Testing Approach},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {188--193},
doi = {},
year = {2025},
}
AIs Understanding of Software Test Architecture
Jon Hagar and
Tom Wissink
(Grand Software Testing, USA)
The software test industry is discussing whether software test architectures (STA) with associated system hardware exist and whether supporting STA standards are needed. This paper reports on queries to artificial intelligence (AI) systems to understand what these tools know about architectures. The premise is that if AI systems know about the subject with depth, insight, and references, then STA has some common standing within the industry, and a standard might be in order.
@InProceedings{ICST Workshops25p194,
author = {Jon Hagar and Tom Wissink},
title = {AIs Understanding of Software Test Architecture},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {194--199},
doi = {},
year = {2025},
}
9th International Workshop on Testing Extra-Functional Properties and Quality Characteristics of Software Systems (ITEQS 2025)
AI and Testing
Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
Bestoun S. Ahmed,
Ludwig Otto Baader,
Firas Bayram,
Siri Jagstedt, and
Peter Magnusson
(Karlstads Universitet, Sweden; American University of Bahrain, Bahrain; Ludwig Maximilian University Munich, Germany; Karlstad University, Sweden)
This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three different LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics that encompass syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study reveals significant information about how different architectural choices and parameter configurations affect system performance, particularly highlighting the impact of temperature and top-p parameters on response quality. The tests were carried out on a tourism recommendation system for the Värmland region in Sweden, utilizing standard and RAG-enhanced configurations. The results indicate that the newer LLM versions show modest improvements in performance metrics, though the differences are more pronounced in response length and complexity rather than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems, providing valuable guidance to organizations deploying these architectures in production environments.
@InProceedings{ICST Workshops25p200,
author = {Bestoun S. Ahmed and Ludwig Otto Baader and Firas Bayram and Siri Jagstedt and Peter Magnusson},
title = {Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {200--207},
doi = {},
year = {2025},
}
Using Reinforcement Learning for Security Testing: A Systematic Mapping Study
Tanwir Ahmad,
Matko Butkovic, and
Dragos Truscan
(Åbo Akademi University, Finland)
The security of software systems has become increasingly important due to the rapid, daily advancement of technology and the interconnectivity that the Internet provides. Manual testing is time-consuming and inefficient, especially for very large and complex systems. Reinforcement learning has shown promising results in different test generation approaches due to its ability to steer the test generation process toward relevant parts of the system. A considerable body of work has been developed in recent years to exploit reinforcement learning for security test generation. This study provides a list of approaches and tools for security test generation using Reinforcement Learning (RL). By searching popular research publication databases, a list of 47 relevant studies has been identified and classified according to the type of approach, RL algorithm, application domain, and publication metadata.
@InProceedings{ICST Workshops25p208,
author = {Tanwir Ahmad and Matko Butkovic and Dragos Truscan},
title = {Using Reinforcement Learning for Security Testing: A Systematic Mapping Study},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {208--216},
doi = {},
year = {2025},
}
Visual Spectrum-Based Fault Localization for Python Programs Based on the Differentiation of Execution Slices
Shehroz Khan,
Gaadha Sudheerbabu,
Bianca Elena Staicu,
Tanwir Ahmad, and
Dragos Truscan
(Åbo Akademi University, Finland)
We present an automated fault localization technique that can assist developers in effectively localizing faults in Python programs. The proposed method uses spectrum-based fault localization techniques, program slicing, and graph-based visualization to formulate an efficient method for reducing the effort needed in fault localization. The approach takes the source code of a program and a set of passing and failing tests, and collects program spectra information by executing the tests. A tool, FaultLocalizer, facilitates the generation of a call graph for inter-procedural dependency analysis and of annotated control flow graphs for different modules, enriched with spectra information and suspiciousness scores. The focus of the approach is on the visual analysis of the source code, and it is intended to complement existing fault localization approaches. The effectiveness of the proposed approach is evaluated on a set of buggy Python programs. The results show that the approach reduces debugging effort and can be applied to programs with conditional branching.
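The suspiciousness scores mentioned above are not spelled out in the abstract; as background, a minimal sketch of one standard spectrum-based formula (Ochiai) follows, assuming per-element counts of covering passing and failing tests (the toy spectra are invented, not FaultLocalizer output):

import math

def ochiai(failed_cov, passed_cov, total_failed):
    # failed_cov: number of failing tests that execute the element
    # passed_cov: number of passing tests that execute the element
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

# Toy spectra: element -> (covered by #failing, covered by #passing)
spectra = {"line 3": (2, 0), "line 7": (2, 5), "line 9": (0, 4)}
scores = {e: ochiai(f, p, total_failed=2) for e, (f, p) in spectra.items()}
for element, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(element, round(score, 3))  # most suspicious element first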
@InProceedings{ICST Workshops25p217,
author = {Shehroz Khan and Gaadha Sudheerbabu and Bianca Elena Staicu and Tanwir Ahmad and Dragos Truscan},
title = {Visual Spectrum-Based Fault Localization for Python Programs Based on the Differentiation of Execution Slices},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {217--225},
doi = {},
year = {2025},
}
Cyber-Physical Systems
A Protocol Fuzzing Framework to Detect Remotely Exploitable Vulnerabilities in IoT Nodes
Phi Tuong Lau and
Stefan Katzenbeisser
(Viet Nam; University of Passau, Germany)
Firmware attacks pose significant security risks to IoT devices due to often insufficient security protection. Such attacks can result in data breaches, privacy violations, and operational disruptions. For example, attackers can alter the structure of packets (e.g., header, payload) to create malformed packets and send them to nodes in a victim network. These packets can then remotely trigger vulnerabilities (e.g., buffer overflows) in the firmware, potentially resulting in crashes or unexpected behavior.
In this paper, we first investigate how firmware vulnerabilities (e.g., out-of-bounds errors) can be remotely exploited in IoT networks by analyzing bug reports from leading IoT platforms. We categorize these findings and introduce a protocol fuzzing framework that leverages the GPT-4 model to guide the generation of test cases in the form of malformed packets. These packets are later sent to 6LoWPAN-based nodes to observe their behavior and identify potential security vulnerabilities. As a result, we discovered four new vulnerabilities, classified as improper input validation, which can be remotely exploited in the Contiki-NG operating system.
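The GPT-4-guided generation itself is not reproduced here; as a rough illustration of the underlying idea of malforming packet fields, consider byte-level mutation over a seed packet (the frame layout below is invented for the example):

import random

def mutate_packet(packet: bytes, n_mutations: int = 3) -> bytes:
    # Flip random bytes to produce a malformed variant of the seed packet.
    buf = bytearray(packet)
    for _ in range(n_mutations):
        pos = random.randrange(len(buf))
        buf[pos] ^= random.randrange(1, 256)  # XOR with 0 would be a no-op
    return bytes(buf)

# Toy 6LoWPAN-like frame: two header bytes followed by a payload.
seed = bytes.fromhex("41d8") + b"payload"
for _ in range(3):
    print(mutate_packet(seed).hex())  # send each variant to the node under test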
@InProceedings{ICST Workshops25p226,
author = {Phi Tuong Lau and Stefan Katzenbeisser},
title = {A Protocol Fuzzing Framework to Detect Remotely Exploitable Vulnerabilities in IoT Nodes},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {226--234},
doi = {},
year = {2025},
}
Unified Search for Multi-requirement Falsification for Cyber-Physical Systems
Jesper Winsten and
Ivan Porres
(Åbo Akademi University, Finland)
This paper addresses the challenge of efficiently falsifying multiple requirements in cyber-physical systems (CPSs). Traditional falsification approaches typically evaluate requirements sequentially, leading to redundant computations and decreased efficiency. We present Multi-Requirement Unified Search (MRUS), an algorithm that evaluates all requirements simultaneously using conjunctive Signal Temporal Logic (STL) formulas. MRUS combines an Online Generative Adversarial Network (OGAN) for test case generation with a unified search algorithm to evaluate multiple requirements conjunctively.
The performance of the algorithm was evaluated using the ARCH-COMP 2024 falsification competition as a benchmark suite. The results demonstrate that MRUS achieves a high Falsification Rate (FR) across all benchmarks while requiring a small number of total execution counts to find falsifying inputs.
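The conjunctive evaluation rests on a standard fact of STL quantitative semantics: the robustness of a conjunction is the minimum of the conjuncts' robustness values, so one simulation can score all requirements at once. A minimal sketch with made-up requirements over a sampled trace:

# Robustness of "always(signal <= bound)" over a sampled trace:
# the worst-case margin min(bound - value); negative means falsified.
def rob_always_le(trace, bound):
    return min(bound - v for v in trace)

trace = [0.2, 0.8, 0.5, 0.9]                  # toy signal samples
requirements = [lambda t: rob_always_le(t, 1.0),
                lambda t: rob_always_le(t, 0.95)]

per_req = [r(trace) for r in requirements]
conjunctive = min(per_req)                    # robustness of R1 and R2 together
print(per_req, conjunctive)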
@InProceedings{ICST Workshops25p235,
author = {Jesper Winsten and Ivan Porres},
title = {Unified Search for Multi-requirement Falsification for Cyber-Physical Systems},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {235--243},
doi = {},
year = {2025},
}
14th International Workshop on Combinatorial Testing (IWCT 2025)
Theoretical Aspects of CT
Utilizing Ontologies for Combinatorial Testing
Franz Wotawa
(Technische Universitaet Graz, Austria)
Test case generation using combinatorial testing requires an input model comprising parameters and their ranges of possible values, i.e., their domains, which might be appropriate for ordinary applications but not for more sophisticated application domains like autonomous systems. Therefore, we discuss the use of ontologies for modeling and their utilization for combinatorial testing. We focus on ontologies describing the interface between the system under test and its environment, discuss the rationale behind such an approach, and introduce the basic algorithm for mapping ontologies to input models. We further consider modeling challenges and provide some initial modeling principles.
@InProceedings{ICST Workshops25p244,
author = {Franz Wotawa},
title = {Utilizing Ontologies for Combinatorial Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {244--247},
doi = {},
year = {2025},
}
Combinatorial Testing and ML/AI
A Combinatorial Approach to Reduce Machine Learning Dataset Size
Megan Olsen,
M S Raunak,
Rick Kuhn,
Fenrir Badorf,
Hans van Lierop, and
Francis Durso
(Loyola University Maryland, USA; National Institute of Standards and Technology, USA; Johns Hopkins University, USA)
Although large datasets may seem to be the best choice for machine learning, a smaller dataset that better represents the important parts of the problem will be faster to train on and potentially more effective. Whereas dimensionality reduction focuses on reducing the features (columns) of data, dataset reduction focuses on removing data points (rows). Typical approaches for reducing dataset size include random sampling or using machine learning to understand the data. We propose using combinatorial coverage and frequency difference (CFD) techniques from software testing to choose the most effective rows of data to generate a smaller training dataset. We explore the effectiveness of four approaches to reducing a dataset using CFD, and present a case study showing that we can produce a significantly smaller dataset that is more effective in training a Support Vector Machine than the original dataset or datasets generated by other approaches.
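The CFD techniques themselves are detailed in the paper; the combinatorial-coverage half of the idea can be pictured as a greedy pass that keeps a row only if it contributes a not-yet-covered 2-way value combination (the data below is invented):

from itertools import combinations

def two_way_combos(row):
    # All (column pair, value pair) interactions present in one row.
    return {((i, j), (row[i], row[j]))
            for i, j in combinations(range(len(row)), 2)}

data = [("a", 0, "x"), ("a", 0, "x"), ("b", 1, "y"), ("b", 1, "x")]
covered, reduced = set(), []
for row in data:
    new = two_way_combos(row) - covered
    if new:                      # keep rows that add unseen interactions
        reduced.append(row)
        covered |= new
print(len(reduced), "of", len(data), "rows kept")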
@InProceedings{ICST Workshops25p248,
author = {Megan Olsen and M S Raunak and Rick Kuhn and Fenrir Badorf and Hans van Lierop and Francis Durso},
title = {A Combinatorial Approach to Reduce Machine Learning Dataset Size},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {248--257},
doi = {},
year = {2025},
}
Data Frequency Coverage Impact on AI Performance
Erin Lanus,
Brian Lee,
Jaganmohan Chandrasekaran,
Laura Freeman,
M S Raunak,
Raghu Kacker, and
Rick Kuhn
(Virginia Tech, USA; National Institute of Standards and Technology, USA)
Artificial Intelligence (AI) models use statistical learning over data to solve complex problems for which straightforward rules or algorithms may be difficult or impossible to design; however, a side effect is that models that are complex enough to sufficiently represent the function may be uninterpretable. Combinatorial testing, a black-box approach arising from software testing, has been applied to test AI models. A key differentiator between traditional software and AI is that many traditional software faults are deterministic, requiring a failure-inducing combination of inputs to appear only once in the test set for it to be discovered. On the other hand, AI models learn statistically by reinforcing weights through repeated appearances in the training dataset, and the frequency of input combinations plays a significant role in influencing the model’s behavior. Thus, a single occurrence of a combination of feature values may not be sufficient to influence the model’s behavior. Consequently, measures like simple combinatorial coverage that are applicable to software testing do not capture the frequency with which interactions are covered in the AI model’s input space. This work develops methods to characterize the data frequency coverage of feature interactions in training datasets and analyze the impact of imbalance, or skew, in the combinatorial frequency coverage of the training data on model performance. We demonstrate our methods with experiments on an open-source dataset using several classical machine learning algorithms. This pilot study makes three observations: performance may increase or decrease with data skew, feature importance methods do not predict skew impact, and adding more data may not mitigate skew effects.
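A hedged sketch of the counting step (not the authors' implementation): tally how often each 2-way value combination appears in a training set, so that skew across equally strong interactions becomes visible.

from collections import Counter
from itertools import combinations

rows = [("red", "small", 0), ("red", "large", 1),
        ("red", "small", 0), ("blue", "small", 1)]

freq = Counter()
for row in rows:
    for i, j in combinations(range(len(row)), 2):
        freq[((i, j), (row[i], row[j]))] += 1

# Same interaction strength, very different frequencies: that is the skew.
for combo, count in freq.most_common(3):
    print(combo, count)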
@InProceedings{ICST Workshops25p258,
author = {Erin Lanus and Brian Lee and Jaganmohan Chandrasekaran and Laura Freeman and M S Raunak and Raghu Kacker and Rick Kuhn},
title = {Data Frequency Coverage Impact on AI Performance},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {258--267},
doi = {},
year = {2025},
}
Fairness Testing of Machine Learning Models using Combinatorial Testing in Latent Space
Arjun Dahal,
Sunny Shree,
Yu Lei,
Raghu N. Kacker, and
D. Richard Kuhn
(The University of Texas at Arlington, USA; Information Technology Laboratory, National Institute of Standards and Technology, USA)
Decision-making by Machine Learning (ML) models can exhibit biased behavior, resulting in unfair outcomes. Testing ML models for such biases is essential to ensure unbiased decision-making. In this paper, we propose a combinatorial testing-based approach in the latent space of a generative model to generate instances that assess the fairness of black-box ML models. Our approach involves a two-step process: generating t-way test cases in the latent space of a Variational AutoEncoder and performing fairness testing using the instances reconstructed from these test cases. We experimentally evaluated our approach against an approach that generates t-way instances in the input space for fairness testing. The results indicate that the latent-space approach produces more natural test cases while detecting the first fairness violation faster and achieving a higher ratio of discriminatory instances to the total number of generated instances.
@InProceedings{ICST Workshops25p268,
author = {Arjun Dahal and Sunny Shree and Yu Lei and Raghu N. Kacker and D. Richard Kuhn},
title = {Fairness Testing of Machine Learning Models using Combinatorial Testing in Latent Space},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {268--277},
doi = {},
year = {2025},
}
Combinatorial Testing Tools
A Search-Based Benchmark Generator for Constrained Combinatorial Testing Models
Paolo Arcaini,
Andrea Bombarda, and
Angelo Gargantini
(National Institute of Informatics, Japan; University of Bergamo, Italy)
In combinatorial testing, the generation of benchmarks that meet specific constraints and complexity requirements is essential for the rigorous assessment of testing tools. In previous work, we presented BenCiGen, a benchmark generator for combinatorial testing, which had the limitation of often discarding input parameter models (IPMs) that fail to meet the targeted requirements in terms of ratio (i.e., the number of valid tests, or tuples, over the total number of possible tests or tuples) and solvability.
This paper presents an extension to BenCiGen, BenCiGen_S, that integrates a search-based generation approach, aimed at reducing the number of discarded IPMs and enhancing the efficiency of benchmark generation.
Instead of rejecting IPMs that do not fulfill the desired characteristics, BenCiGen_S employs search-based techniques that iteratively mutate IPMs, optimizing them according to a fitness function that measures their distance from the target requirements.
Experimental results demonstrate that BenCiGen_S generates a significantly higher proportion of benchmarks adhering to the specified characteristics, sometimes with reduced computation time.
This approach not only improves the generation of suitable benchmarks but also enhances the tool's overall effectiveness.
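As a toy analogue of the search loop (not BenCiGen_S itself): mutate the constraint set of a small model, score it by its distance from a target ratio of valid tests, and keep improvements.

import random
from math import prod

sizes = [3, 3, 2]                          # toy IPM: three parameter domains
all_tests = prod(sizes)

def ratio(forbidden):
    # Fraction of the input space left valid by the constraints.
    return 1 - len(forbidden) / all_tests

def mutate(forbidden):
    new = set(forbidden)
    t = tuple(random.randrange(s) for s in sizes)
    new.symmetric_difference_update({t})   # toggle one forbidden test
    return new

target, current = 0.75, set()
best = abs(ratio(current) - target)        # fitness: distance from target
for _ in range(200):                       # simple hill climbing
    cand = mutate(current)
    fit = abs(ratio(cand) - target)
    if fit < best:
        current, best = cand, fit
print(round(ratio(current), 3), "vs. target", target)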
@InProceedings{ICST Workshops25p278,
author = {Paolo Arcaini and Andrea Bombarda and Angelo Gargantini},
title = {A Search-Based Benchmark Generator for Constrained Combinatorial Testing Models},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {278--287},
doi = {},
year = {2025},
}
Towards Continuous Integration for Combinatorial Testing
Manuel Leithner,
Jovan Zivanovic,
Reinhard Kugler, and
Dimitris E. Simos
(SBA Research, Austria)
Combinatorial testing is an efficient black-box approach that permits practitioners to pseudo-exhaustively cover the input space of a system under test. It offers mathematically guaranteed coverage up to a user-defined strength while requiring a small number of test cases. Despite these advantages, industrial uptake of this technique has been slow, not least because of the significant investment required to construct and maintain an accurate input parameter model, create reliable oracles and automate testing processes.
This work introduces a hierarchy of embeddings of combinatorial testing into continuous integration and deployment pipelines for use in real-world software development workflows. It further describes the practical implementation of a combinatorial security testing pipeline, enabling automated detection of SQL injection vulnerabilities throughout software evolution. Finally, it details lessons learned throughout the design, deployment and utilization of the resulting processes.
@InProceedings{ICST Workshops25p288,
author = {Manuel Leithner and Jovan Zivanovic and Reinhard Kugler and Dimitris E. Simos},
title = {Towards Continuous Integration for Combinatorial Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {288--291},
doi = {},
year = {2025},
}
Testing Tool for Combinatorial Transition Testing in Dynamically Adaptive Software Systems
Pierre Martou,
Benoît Duhoux,
Kim Mens, and
Axel Legay
(Université catholique de Louvain (UCLouvain), ICTEAM, Belgium; Nexova, Belgium)
Dynamically adaptive software systems exhibit an exponential number of configurations, posing significant testing challenges. Combinatorial Interaction Testing (CIT) and Combinatorial Transition Testing (CTT) enable concise test suite generation to cover valid pairs of features and transitions.
Detecting behavioural transition errors in such test suites with full transition coverage still remains complex and time-consuming.
We propose a modular and adaptable testing tool implementing state-of-the-art CTT. It generates test suites using either CIT or CTT and integrates a dedicated test oracle for efficient detection of behavioural transition errors, with minimum user effort.
@InProceedings{ICST Workshops25p292,
author = {Pierre Martou and Benoît Duhoux and Kim Mens and Axel Legay},
title = {Testing Tool for Combinatorial Transition Testing in Dynamically Adaptive Software Systems},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {292--295},
doi = {},
year = {2025},
}
Towards Accessibility of Covering Arrays for Practitioners of Combinatorial Testing
Ulrike Grömping
(BHT - Berliner Hochschule für Technik, Germany)
Combinatorial testing can be useful beyond large software-related testing efforts. While it seems worthwhile to evaluate its benefits for more general applications, the lack of immediate availability of suitable arrays is a major impediment to its wider use by practitioners. This paper discusses requirements for making covering arrays (CAs), a key tool in combinatorial testing, accessible to practitioners of various application areas in a way that satisfies their needs. It is argued that, by fulfilling such requirements, the expert community also improves the exchange of results among its members. The paper is thus intended as a call for action for the combinatorial testing community to improve the sharing of CA constructions.
@InProceedings{ICST Workshops25p296,
author = {Ulrike Grömping},
title = {Towards Accessibility of Covering Arrays for Practitioners of Combinatorial Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {296--299},
doi = {},
year = {2025},
}
Applications of CT
Evaluating Large Language Model Robustness using Combinatorial Testing
Jaganmohan Chandrasekaran,
Ankita Ramjibhai Patel,
Erin Lanus, and
Laura Freeman
(Virginia Tech, USA; Independent Researcher, USA)
Recent advancements in large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like text, leading to widespread adoption across domains. Given LLMs' versatile capabilities, current evaluation practices assess LLMs across a wide variety of tasks, including answer generation, sentiment analysis, text completion, and question answering, to name a few. Multiple-choice questions (MCQ) have emerged as a widely used evaluation task to assess an LLM's understanding and reasoning across various subject areas. However, studies from the literature have revealed that LLMs exhibit sensitivity to the ordering of options in MCQ tasks, with performance varying based on the option sequence, underscoring robustness concerns in LLM performance.
This work presents a combinatorial testing-based framework for systematic and comprehensive robustness assessment of pre-trained LLMs. By leveraging sequence covering arrays, the framework constructs test sets by systematically swapping the order of options, which are then used to ascertain the robustness of LLMs. We performed an experimental evaluation using the Measuring Massive Multitask Language Understanding (MMLU) dataset, a widely used MCQ dataset, and evaluated the robustness of GPT-3.5 Turbo, a pre-trained LLM. Results suggest the framework can effectively identify numerous robustness issues with a relatively small number of tests.
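One combinatorial fact behind such constructions (stated generally, not as the paper's exact construction): for strength 2, a sequence covering array needs only two orderings, a permutation and its reverse, since every pair of options then appears in both relative orders; higher strengths need more permutations. A sketch:

def strength2_orderings(options):
    # Two orderings covering every option pair in both relative orders.
    return [list(options), list(reversed(options))]

options = ["A", "B", "C", "D"]
for ordering in strength2_orderings(options):
    prompt = "Question...\n" + "\n".join(
        f"({i + 1}) {opt}" for i, opt in enumerate(ordering))
    print(prompt, end="\n\n")  # query the LLM once per ordering, compare answers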
@InProceedings{ICST Workshops25p300,
author = {Jaganmohan Chandrasekaran and Ankita Ramjibhai Patel and Erin Lanus and Laura Freeman},
title = {Evaluating Large Language Model Robustness using Combinatorial Testing},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {300--309},
doi = {},
year = {2025},
}
Combinatorial Methods for Enhancing the Resilience of Production Facilities
Klaus Kieseberg,
Konstantin Gerner,
Bernhard Garn,
Wolfgang Czerni,
Dimitris E. Simos,
D. Richard Kuhn, and
Raghu Kacker
(SBA Research, Austria; Infraprotect, Austria; National Institute of Standards and Technology, USA)
In this paper, we apply combinatorial methods to generate crisis scenarios for production facilities with the goal of strengthening their resilience by identifying weaknesses in operational aspects or crisis response plans, extending previous work from the domain of disaster research. We use notions of combinatorial coverage for quantifying diversity in scenarios and as one objective when sampling the considered scenario space. We report on a small case study targeting the resilience of a production facility. Specifically, we generated covering arrays of different strengths (for five ternary parameters, for strengths two to five) to obtain paths through the production chain, and various combinatorial sequence test sets to obtain crisis scenarios (sequence covering arrays for twelve events of strengths two and three, as well as sequence test sets with constraints of lengths six and seven featuring different constraints). Finally, we offer reflections on, and practical feedback about, the proposed combinatorial approach.
@InProceedings{ICST Workshops25p310,
author = {Klaus Kieseberg and Konstantin Gerner and Bernhard Garn and Wolfgang Czerni and Dimitris E. Simos and D. Richard Kuhn and Raghu Kacker},
title = {Combinatorial Methods for Enhancing the Resilience of Production Facilities},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {310--313},
doi = {},
year = {2025},
}
Combinatorial Test Design Model Creation using Large Language Models
Deborah Furman,
Eitan Farchi,
Michael Gildein,
Andrew Hicks, and
Ryan Rawlins
(IBM, USA; IBM HRL, Israel)
Large Language Models (LLMs) continue to impact a growing multitude of domains, further expanding the applicability of machine learning. One possible use case is to apply LLMs as a tool to drive more optimized test coverage by assisting in the generation of Combinatorial Test Design (CTD) models. This can lower the entry barrier for new CTD practitioners by requiring less subject matter expertise to generate a basic CTD model. In this paper, we report on our initial experience in using LLMs to generate a base CTD model and analyze the usefulness of the approach. In common testing scenarios, the LLMs easily provide the necessary attributes and values that are needed to define the CTD model. Prompting the LLM for additional use cases is useful for highlighting possible interactions and determining constraints on the attributes identified in the first stage. Combining the two stages facilitates the creation of base CTD models.
@InProceedings{ICST Workshops25p314,
author = {Deborah Furman and Eitan Farchi and Michael Gildein and Andrew Hicks and Ryan Rawlins},
title = {Combinatorial Test Design Model Creation using Large Language Models},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {314--323},
doi = {},
year = {2025},
}
Extended Abstract of Poster: STARS: Tree-Based Classification and Testing of Feature Combinations in the Automated Robotic Domain
Till Schallau,
Dominik Schmid,
Nick Pawlinorz,
Stefan Naujokat, and
Falk Howar
(TU Dortmund University, Germany)
Testing complex systems is crucial for ensuring safety, especially in automated driving, where diverse data sources and variable environments pose challenges. Here, robust safety validation is critical but exhaustive n-way combinatorial testing is impractical due to the vast number of test cases. The STARS framework uses tree-based scenario classifiers to limit feature combinations in a given domain.
@InProceedings{ICST Workshops25p324,
author = {Till Schallau and Dominik Schmid and Nick Pawlinorz and Stefan Naujokat and Falk Howar},
title = {Extended Abstract of Poster: STARS: Tree-Based Classification and Testing of Feature Combinations in the Automated Robotic Domain},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {324--325},
doi = {},
year = {2025},
}
International Workshop on Mutation Testing (Mutation 2025)
Session 1
Equivalent Mutants: Deductive Verification to the Rescue
Serge Demeyer and
Reiner Hähnle
(Universiteit Antwerpen (ANSYMO), Belgium; Technische Universität Darmstadt, Germany)
Since the dawn of mutation testing, equivalent mutants have been a subject of academic research. Up until now, all the investigated program analysis techniques (infeasible paths, trivial compiler equivalence, program slicing, symbolic execution) have focused on shielding the test engineer from the decision whether a mutant is equivalent or not. This paper argues for a complementary viewpoint: providing test engineers with powerful analysis tools (namely deductive verification) which show why a mutant is equivalent, or come up with a counterexample if not. We illustrate by means of a series of increasingly challenging examples (drawn from the MutantBench dataset) how such an approach provides valuable insights to the test engineer, thus paving the way for an actionable improvement of the test suite under analysis.
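For readers new to the problem, a minimal Python illustration of an equivalent mutant (illustrative, not drawn from MutantBench): the mutation changes the syntax but never changes the observable behavior, so no test can kill it.

def count_up(n):
    total = 0
    for _ in range(n):
        total += 1
    return total

def count_up_mutant(n):
    total = 0
    for _ in range(n):
        total = total + 1   # mutated from "total += 1": equivalent for ints
    return total

# No input distinguishes the two; a deductive proof would show why.
assert all(count_up(n) == count_up_mutant(n) for n in range(100))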
@InProceedings{ICST Workshops25p326,
author = {Serge Demeyer and Reiner Hähnle},
title = {Equivalent Mutants: Deductive Verification to the Rescue},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {326--336},
doi = {},
year = {2025},
}
Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection
Max Hort,
Linas Vidziunas, and
Leon Moonen
(Simula Research Laboratory, Norway; Simula Research Laboratory & BI Norwegian Business School, Norway)
Recent advances in defect detection use language models. Existing works enhanced the training data to improve the models' robustness when applied to semantically identical code (i.e., predictions should be the same). However, the use of semantically identical code has not been considered for improving the tools during their application, a concept closely related to metamorphic testing.
The goal of our study is to determine whether we can use semantic-preserving transformations, analogous to mutation operators, to improve the performance of defect detection tools in the testing stage. We first collect existing publications which implemented semantic-preserving transformations and shared their implementations, such that we can reuse them. We empirically study the effectiveness of three different ensemble strategies for enhancing defect detection tools. We apply the collected transformations to the Devign dataset, considering vulnerabilities as a type of defect, and to two fine-tuned large language models for defect detection (VulBERTa, PLBART).
We found 28 publications with 94 different transformations. We chose to implement 39 transformations from four of the publications, but a manual check revealed that 23 out of 39 transformations change code semantics. Using the 16 remaining, correct transformations and three ensemble strategies, we were not able to increase the accuracy of the defect detection models. Our results show that reusing shared semantic-preserving transformations is difficult, sometimes even causing wrongful changes to the semantics.
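An example of the kind of operator the study collects, shown as an illustration rather than one of the 94 surveyed transformations: rewriting a for loop as an equivalent while loop before feeding the code to a defect detector.

# Original input to the defect detector
def sum_list(xs):
    s = 0
    for x in xs:
        s += x
    return s

# Semantic-preserving variant: same behavior, different surface form,
# so a robust detector should give the same prediction for both.
def sum_list_transformed(xs):
    s, i = 0, 0
    while i < len(xs):
        s += xs[i]
        i += 1
    return s

assert sum_list([1, 2, 3]) == sum_list_transformed([1, 2, 3])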
@InProceedings{ICST Workshops25p337,
author = {Max Hort and Linas Vidziunas and Leon Moonen},
title = {Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {337--346},
doi = {},
year = {2025},
}
Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants
Asma Hamidi,
Ahmed Khanfir, and
Mike Papadakis
(University of Luxembourg, Luxembourg; South Mediterranean University (MSB-MedTech-LCI), Tunis, Tunisia)
This paper presents intent-based mutation testing, a testing approach that generates mutations by changing the programming intents that are implemented in the programs under test. In contrast to traditional mutation testing, which changes (mutates) the way programs are written, intent mutation changes (mutates) the behavior of the programs by producing mutations that implement (slightly) different intents than those implemented in the original program. The mutations of the programming intents represent possible corner cases and misunderstandings of the program behavior, i.e., program specifications, and thus can capture different classes of faults than traditional (syntax-based) mutation. Moreover, since programming intents can be implemented in different ways, intent-based mutation testing can generate diverse and complex mutations that are close to the original programming intents (specifications) and thus direct testing towards the intent variants of the program behavior/specifications. We implement intent-based mutation testing using Large Language Models (LLMs) that mutate programming intents and transform them into mutants. We evaluate intent-based mutation on 29 programs and show that it generates mutations that are syntactically complex, semantically diverse, and quite different (semantically) from the traditional ones. We also show that 55% of the intent-based mutations are not subsumed by traditional mutations. Overall, our analysis shows that intent-based mutation testing can be a powerful complement to traditional (syntax-based) mutation testing.
@InProceedings{ICST Workshops25p347,
author = {Asma Hamidi and Ahmed Khanfir and Mike Papadakis},
title = {Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {347--357},
doi = {},
year = {2025},
}
Session 2
Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging
Philipp Straubinger,
Marvin Kreis,
Stephan Lukasczyk, and
Gordon Fraser
(University of Passau, Germany; University of Passau and JetBrains, Germany)
Large Language Models (LLMs) can generate plausible test code. Intuitively, they generate this code by imitating tests seen in their training data rather than by reasoning about execution semantics. However, such reasoning is important when applying mutation testing, where individual tests need to demonstrate differences in program behavior between a program and specific artificial defects (mutants). In this paper, we evaluate whether Scientific Debugging, which has been shown to help LLMs when debugging, can also help them to generate tests for mutants. In the resulting approach, LLMs form hypotheses about how to kill specific mutants, and then iteratively generate and refine tests until they succeed, all with detailed explanations for each step. We compare this method to three baselines: (1) directly asking the LLM to generate tests, (2) repeatedly querying the LLM when tests fail, and (3) search-based test generation with Pynguin. Our experiments evaluate these methods based on several factors, including mutation score, code coverage, success rate, and the ability to identify equivalent mutants. The results demonstrate that LLMs, although requiring higher computational cost, consistently outperform Pynguin in generating tests with better fault detection and coverage. Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.
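A heavily simplified sketch of the iterative structure described above (query_llm is a canned stand-in, not the paper's interface): propose a test, run it against the program and the mutant, and refine on failure.

def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; here it returns a fixed candidate test.
    return "assert impl(3) == 6"

def passes(test_code, impl):
    try:
        exec(test_code, {"impl": impl})
        return True
    except AssertionError:
        return False

def kill_loop(program, mutant, rounds=5):
    feedback = ""
    for _ in range(rounds):
        test = query_llm("Hypothesize how the mutant differs, then write "
                         "a test calling impl().\n" + feedback)
        # Killed = passes on the original program but fails on the mutant.
        if passes(test, program) and not passes(test, mutant):
            return test
        feedback = "Previous test did not kill the mutant:\n" + test
    return None

program = lambda x: x * 2   # original behavior
mutant = lambda x: x + 2    # artificial defect
print(kill_loop(program, mutant))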
@InProceedings{ICST Workshops25p358,
author = {Philipp Straubinger and Marvin Kreis and Stephan Lukasczyk and Gordon Fraser},
title = {Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {358--367},
doi = {},
year = {2025},
}
Exploring Robustness of Image Recognition Models on Hardware Accelerators
Nikolaos Louloudakis,
Perry Gibson,
José Cano, and
Ajitha Rajan
(The University of Edinburgh, UK; University of Glasgow, UK)
As the usage of Artificial Intelligence (AI) on resource-intensive and safety-critical tasks increases, a variety of Machine Learning (ML) compilers have been developed, enabling compatibility of Deep Neural Networks (DNNs) with a variety of hardware acceleration devices. However, given that DNNs are widely utilized for challenging and demanding tasks, the behavior of these compilers must be verified.
To this end, we propose MutateNN, a tool that utilizes elements of both differential and mutation testing to examine the robustness of image recognition models when deployed on hardware accelerators with different capabilities, in the presence of faults in their target device code, introduced either by developers or by problems in the compilation process.
We focus on the image recognition domain by applying mutation testing to 7 well-established DNN models, introducing 21 mutations of 6 different categories. We deployed our mutants on 4 hardware acceleration devices of varying capabilities and observed that DNN models presented discrepancies of up to 90.3% across devices in mutants related to conditional operators. We also observed that mutations related to layer modification, arithmetic types, and input severely affected overall model performance (by up to 99.8%) or led to model crashes, in a consistent manner across devices.
@InProceedings{ICST Workshops25p368,
author = {Nikolaos Louloudakis and Perry Gibson and José Cano and Ajitha Rajan},
title = {Exploring Robustness of Image Recognition Models on Hardware Accelerators},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {368--374},
doi = {},
year = {2025},
}
8th International IEEE Workshop on Next Level of Test Automation (NextA 2025)
Testing and LLMs
Adaptive Testing for LLM-Based Applications: A Diversity-Based Approach
Juyeon Yoon,
Robert Feldt, and
Shin Yoo
(Korea Advanced Institute of Science and Technology, South Korea; Chalmers University of Technology, Sweden)
The recent surge in building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks, primarily focused on treating prompt templates as the unit of testing. Despite the significant costs associated with test input execution and output assessment, the curation of optimized test suites is still overlooked in these tools, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our proposed adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from the existing test suite and its labelling results. Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets and promotes the generation of more varied outputs.
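A minimal sketch of the adaptive selection step under simple assumptions (Levenshtein distance over a fixed candidate pool; the inputs are invented, and the authors' scoring also uses labelling results): pick the candidate farthest from everything already executed.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def pick_next(candidates, executed):
    # ART step: maximize the distance to the nearest already-executed input.
    return max(candidates,
               key=lambda c: min(levenshtein(c, e) for e in executed))

executed = ["book a flight", "book a hotel"]
candidates = ["book a flight now", "cancel my booking", "weather tomorrow"]
print(pick_next(candidates, executed))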
@InProceedings{ICST Workshops25p375,
author = {Juyeon Yoon and Robert Feldt and Shin Yoo},
title = {Adaptive Testing for LLM-Based Applications: A Diversity-Based Approach},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {375--382},
doi = {},
year = {2025},
}
Fault Localization
Witness Test Program Generation through AST Node Combinations
Heuichan Lim
(Davidson College, USA)
Compiler bugs can be critical, but localizing them is challenging because of the limited debugging information and the complexity of compiler internals. A common approach to address this challenge is to analyze the compiler's execution behavior while it processes witness test programs. To accurately localize a bug in the compiler, the approach must generate effective witness programs by identifying and modifying the code sections in the seed program that are directly associated with the bug. However, existing methods often struggle to generate such programs due to the random selection of code parts for mutation. In this paper, we propose a novel automated technique to (1) identify abstract syntax tree nodes that are most likely to be related to the bug in the compiler, which can transform a failing program into a passing one, and (2) use this information to generate witness test programs that can effectively localize compiler bugs. To evaluate the effectiveness of our approach, we built a prototype and experimented with it using 40 real-world LLVM bugs. Our approach localized 40% and 100% of bugs within the Top-1 and Top-5 compiler files, respectively.
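The identification step can be pictured with Python's ast module (an illustrative sketch, not the paper's prototype): enumerate the nodes of a seed program that are candidate sites for mutation.

import ast

seed = """
def f(x):
    if x > 0:
        return x * 2
    return -x
"""

tree = ast.parse(seed)
# Candidate mutation sites: every AST node with a source location.
for node in ast.walk(tree):
    if hasattr(node, "lineno"):
        print(node.lineno, type(node).__name__)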
@InProceedings{ICST Workshops25p383,
author = {Heuichan Lim},
title = {Witness Test Program Generation through AST Node Combinations},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {383--391},
doi = {},
year = {2025},
}
Influence of Pure and Unit-Like Tests on SBFL Effectiveness: An Empirical Study
Attila Szatmári,
Tamás Gergely, and
Árpád Beszédes
(University of Szeged, Hungary)
One of the most challenging and time-consuming aspects of debugging is identifying the exact location of the bug.
We propose the concept of a Pure Unit Test (PUT), which, when it fails, can unambiguously determine the location of the faulty method.
Based on developer experience, we established three heuristics to evaluate the degree to which a test that cannot be considered a PUT can still be considered a unit test (we call these Unit-Like Tests, or ULTs).
We examined how and when PUTs and ULTs affect Spectrum-Based Fault Localization efficiency.
The results demonstrate that, for more complex systems, a higher proportion of unit tests in the relevant test cases can enhance the effectiveness of fault localization.
When the number of PUTs is high enough, fault localization becomes trivial; in that case, running SBFL is not necessary.
Moreover, our findings indicate that different kinds of ULTs can have a large impact on the efficiency of fault localization, particularly for simpler bugs where they can quickly and effectively pinpoint the problem areas.
@InProceedings{ICST Workshops25p392,
author = {Attila Szatmári and Tamás Gergely and Árpád Beszédes},
title = {Influence of Pure and Unit-Like Tests on SBFL Effectiveness: An Empirical Study},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {392--399},
doi = {},
year = {2025},
}
1st International Workshop on Secure, Accountable, and Verifiable Machine Learning (SAFE-ML 2025)
Robustness, Verification, and Security in AI Systems
Structural Backdoor Attack on IoT Malware Detectors via Graph Explainability
Yu-Cheng Chiu,
Maina Bernard Mwangi,
Shin-Ming Cheng, and
Hahn-Ming Lee
(National Taiwan University of Science and Technology, Taiwan)
In AI-based malware detection, structural features such as function call graphs (FCGs) and control flow graphs (CFGs) are widely used for their ability to encapsulate program execution flow and facilitate cross-architectural malware detection in IoT environments. When combined with deep learning (DL) models, such as graph neural networks (GNNs) that capture node interdependencies, these features enhance malware detection and enable the identification of novel threats. However, AI-based detectors require frequent updates and retraining to adapt to evolving malware strains, often relying on datasets from online crowdsourced threat intelligence platforms, which are vulnerable to poisoning and backdoor attacks. Backdoor attacks implant triggers in training samples, embedding vulnerabilities into ML models that can later be exploited. While extensive research exists on backdoor attacks in other domains, their implications for structural AI-based IoT malware detection remain unexplored. This study proposes a novel backdoor attack targeting IoT malware detectors trained on structural features. By leveraging CFGExplainer, we identify the most influential subgraphs from benign samples, extract them to serve as triggers, and inject them into malicious samples in the training dataset. Additionally, we introduce a partition-trigger strategy that injects triggers into malware while splitting critical malicious nodes to reduce their influence on label prediction. Ultimately, we achieve high attack success rates of up to 100% against state-of-the-art structural IoT malware detectors, underscoring critical security vulnerabilities and emphasizing the need for advanced countermeasures against backdoor attacks.
@InProceedings{ICST Workshops25p400,
author = {Yu-Cheng Chiu and Maina Bernard Mwangi and Shin-Ming Cheng and Hahn-Ming Lee},
title = {Structural Backdoor Attack on IoT Malware Detectors via Graph Explainability},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {400--409},
doi = {},
year = {2025},
}
Quantifying Correlations of Machine Learning Models
Yuanyuan Li,
Neeraj Sarna, and
Yang Lin
(Munich RE, USA; Munich RE, Germany; Hartford Steam Boiler, USA)
Machine Learning models are being extensively used in safety critical applications where errors from these models could cause harm to the user. Such risks are amplified when multiple machine learning models, which are deployed concurrently, interact and make errors simultaneously. This paper explores three scenarios where error correlations between multiple models arise, resulting in such aggregated risks. Using real-world data, we simulate these scenarios and quantify the correlations in errors of different models. Our findings indicate that aggregated risks are substantial, particularly when models share similar algorithms, training datasets, or foundational models. Overall, we observe that correlations across models are pervasive and likely to intensify with increased reliance on foundational models and widely used public datasets, highlighting the need for effective mitigation strategies to address these challenges.
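One way to quantify this notion of error correlation (a sketch assuming per-example 0/1 error indicators are available for both models; the data is invented): the Pearson correlation of the two error vectors, i.e., the phi coefficient.

import numpy as np

labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
preds_a = np.array([1, 0, 0, 1, 0, 0, 0, 1])   # model A predictions
preds_b = np.array([1, 0, 0, 1, 1, 0, 0, 1])   # model B predictions

err_a = (preds_a != labels).astype(int)
err_b = (preds_b != labels).astype(int)
# Close to 1.0 means the models tend to fail on the same examples.
print(np.corrcoef(err_a, err_b)[0, 1])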
@InProceedings{ICST Workshops25p410,
author = {Yuanyuan Li and Neeraj Sarna and Yang Lin},
title = {Quantifying Correlations of Machine Learning Models},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {410--417},
doi = {},
year = {2025},
}
Towards a Probabilistic Framework for Analyzing and Improving LLM-Enabled Software
Juan Manuel Baldonado,
Flavia Bonomo-Braberman, and
Víctor Adrián Braberman
(ICC UBA/CONICET and DC, FCEN, Universidad de Buenos Aires, Argentina)
Ensuring the reliability and verifiability of large language model (LLM)-enabled systems remains a significant challenge in software engineering. We propose a probabilistic framework for systematically analyzing and improving these systems by modeling and refining distributions over clusters of semantically equivalent outputs. This framework facilitates the evaluation and iterative improvement of Transference Models, key software components that utilize LLMs to transform inputs into outputs for downstream tasks. To illustrate its utility, we apply the framework to the autoformalization problem, where natural language documentation is transformed into formal program specifications. Our case study illustrates how distribution-aware analysis enables the identification of weaknesses and guides focused alignment improvements, resulting in more reliable and interpretable outputs. This principled approach offers a foundation for addressing critical challenges in the development of robust LLM-enabled systems.
@InProceedings{ICST Workshops25p418,
author = {Juan Manuel Baldonado and Flavia Bonomo-Braberman and Víctor Adrián Braberman},
title = {Towards a Probabilistic Framework for Analyzing and Improving LLM-Enabled Software},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {418--422},
doi = {},
year = {2025},
}
Black-Box Multi-Robustness Testing for Neural Networks
Mara Downing and
Tevfik Bultan
(University of California, Santa Barbara, USA)
Neural networks are increasingly prevalent in day-to-day life, including in safety-critical applications such as self-driving cars and medical diagnoses. This prevalence has spurred extensive research into testing the robustness of neural networks against adversarial attacks, most commonly by determining if misclassified inputs can exist within a region around a correctly classified input. While most prior work focuses on robustness analysis around a single input at a time, in this paper we look at simultaneous analysis of multiple robustness regions. Our approach finds robustness violating inputs away from expected decision boundaries, identifies varied types of misclassifications by increasing confusion matrix coverage, and effectively discovers robustness violating inputs that do not violate input feasibility constraints. We demonstrate the capabilities of our approach on multiple networks trained from several datasets, including ImageNet and a street sign identification dataset.
@InProceedings{ICST Workshops25p423,
author = {Mara Downing and Tevfik Bultan},
title = {Black-Box Multi-Robustness Testing for Neural Networks},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {423--432},
doi = {},
year = {2025},
}
Security and Privacy in Federated Learning Systems
Towards A Common Task Framework for Distributed Collaborative Machine Learning
Qianying Liao,
Dimitri Van Landuyt,
Davy Preuveneers, and
Wouter Joosen
(KU Leuven, Belgium)
Distributed Collaborative Machine Learning (DCML) enables collaborative model training without the need to reveal or share datasets. However, its implementation incurs complexities in balancing performance, efficiency, and security and privacy. Different architectural approaches in how the learning can be distributed and coordinated over different parties have emerged, leading to significant diversity. Current academic work often lacks a comprehensive evaluation of the diverse architectural DCML approaches, which in turn hinders objective comparison of different DCML approaches and impedes broader adoption. This paper highlights the need for a general and holistic evaluation framework for DCML approaches, and proposes a number of relevant evaluation metrics and criteria.
@InProceedings{ICST Workshops25p433,
author = {Qianying Liao and Dimitri Van Landuyt and Davy Preuveneers and Wouter Joosen},
title = {Towards A Common Task Framework for Distributed Collaborative Machine Learning},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {433--437},
doi = {},
year = {2025},
}
Exploring and Mitigating Gradient Leakage Vulnerabilities in Federated Learning
Harshit Gupta,
Ghena Barakat,
Luca D'Agati,
Francesco Longo,
Giovanni Merlino, and
Antonio Puliafito
(University of Messina, Italy)
Due to the importance of data privacy, Federated Learning (FL) is increasingly adopted in various applications. Its appeal lies in sharing only local model gradients with the central server instead of the raw training data. The model gradients shared for aggregation are key components of FL. However, these gradients are mathematical values and are prone to attack, which may lead to data leakage. Therefore, implementing robust security measures is crucial to prevent data leakage when sharing model gradients with the central server. Such measures are essential to protect gradients from being exploited by attackers to retrieve real data. This work proposes and demonstrates how public gradients can be used to retrieve private training data, and how this can be avoided using the technique of differential privacy.
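The mitigation mentioned above can be pictured as the standard clip-and-noise step of differential privacy, sketched here with numpy (the parameters are illustrative, not the paper's configuration):

import numpy as np

def privatize(grad, clip_norm=1.0, noise_mult=1.1, seed=0):
    # Clip the gradient to a bounded norm, then add calibrated Gaussian noise.
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=grad.shape)
    return clipped + noise

grad = np.array([0.8, -2.4, 0.3])   # a client's local model gradient
print(privatize(grad))              # what the client would actually share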
@InProceedings{ICST Workshops25p438,
author = {Harshit Gupta and Ghena Barakat and Luca D'Agati and Francesco Longo and Giovanni Merlino and Antonio Puliafito},
title = {Exploring and Mitigating Gradient Leakage Vulnerabilities in Federated Learning},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {438--442},
doi = {},
year = {2025},
}
Federated Learning under Attack: Game-Theoretic Mitigation of Data Poisoning
Marco De Santis and
Christian Esposito
(University of Salerno, Italy)
Federated Learning (FL) is vulnerable to various attacks, including data poisoning and model poisoning, which can degrade the quality of the global model. Current solutions are based on cryptographic primitives or Byzantine-robust aggregation, but they suffer from degraded performance and/or model quality and are even ineffective in the case of data poisoning.
This work proposes a game-theoretic solution to identify malicious weights and differentiate between benign and compromised updates. We use the Prisoner's Dilemma and Signaling Games to model interactions between local learners and the aggregator, allowing a precise evaluation of the legitimacy of shared weights. Upon detecting an attack, the system activates a rollback mechanism to restore the model to a safe state. The proposed approach enhances FL robustness by mitigating attack impacts while preserving the global model's generalization capabilities.
@InProceedings{ICST Workshops25p443,
author = {Marco De Santis and Christian Esposito},
title = {Federated Learning under Attack: Game-Theoretic Mitigation of Data Poisoning},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {443--450},
doi = {},
year = {2025},
}
Privacy-Preserving in Federated Learning: A Comparison between Differential Privacy and Homomorphic Encryption across Different Scenarios
Alessio Catalfamo,
Maria Fazio,
Antonio Celesti, and
Massimo Villari
(University of Messina, Italy)
Federated Learning is a decentralized machine learning paradigm in which multiple devices collaboratively train a model while keeping their training data local, so as to enhance data privacy and reduce communication costs. However, it is vulnerable to privacy threats, such as model inversion and membership inference attacks, which can expose sensitive data on the devices. This paper explores the integration of two privacy-preserving approaches in FL: Differential Privacy and Homomorphic Encryption.
We assess the impact of these techniques on model accuracy, training efficiency, and computational overhead through experimental evaluations across three different application use cases.
Our evaluation results depict the trade-off between privacy enhancement and performance for different scenarios and models. They highlight the effectiveness of combining Federated Learning with advanced cryptographic and privacy-preserving techniques to achieve secure, scalable, and privacy-aware distributed learning.
@InProceedings{ICST Workshops25p451,
author = {Alessio Catalfamo and Maria Fazio and Antonio Celesti and Massimo Villari},
title = {Privacy-Preserving in Federated Learning: A Comparison between Differential Privacy and Homomorphic Encryption across Different Scenarios},
booktitle = {Proc.\ ICST Workshops},
publisher = {IEEE},
pages = {451--459},
doi = {},
year = {2025},
}