ICST 2025
2025 IEEE Conference on Software Testing, Verification and Validation (ICST)

Powered by

2025 IEEE Conference on Software Testing, Verification and Validation (ICST), March 31 – April 4, 2025, Naples, Italy

ICST 2025 – Proceedings

Contents - Abstracts - Authors

Frontmatter

Title Page

Message from the General Chairs

Info

Message from the Program Co-Chairs

ICST 2025 Organization

Journal First Papers

ICST 2025 Sponsors and Supporters

Technical-Research Track

SPIDER: Fuzzing for Stateful Performance Issues in the ONOS Software-Defined Network Controller
Ao Li, Rohan Padhye, and Vyas Sekar
(Carnegie Mellon University, USA)

Detecting and Evaluating Order-Dependent Flaky Tests in JavaScript
Negar Hashemi, Amjed Tahir, Shawn Rasheed, August Shi, and Rachel Blagojevic
(Massey University, New Zealand; UCOL, New Zealand; University of Texas at Austin, USA)

The Impact of List Reduction for Language Agnostic Test Case Reducers
Tobias Heineken and Michael Philippsen
(Friedrich-Alexander University Erlangen-Nürnberg, Germany)

Hybrid Equivalence/Non-equivalence Testing
Laboni Sarker and Tevfik Bultan
(University of California at Santa Barbara, USA)

Archive submitted (1 GB)

Metamorphic Testing for Pose Estimation Systems
Matias Duran, Thomas Laurent, Ellen Rushe, and Anthony Ventresque
(SFI Lero, Ireland; Trinity College Dublin, Ireland; Dublin City University, Ireland)

Mutation-Based Fuzzing of the Swift Compiler with Incomplete Type Information
Sarah Canto Hyatt and Kyle Dewey
(University of California at Santa Barbara, USA; California State University, USA)

Scalable SMT Sampling for Floating-Point Formulas via Coverage-Guided Fuzzing
Manuel Carrasco, Cristian Cadar, and Alastair F. Donaldson
(Imperial College London, UK)

Info

Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
Shahin Honarvar, Mark van der Wilk, and Alastair F. Donaldson
(Imperial College London, UK; University of Oxford, UK)

An Empirical Study of Web Flaky Tests: Understanding and Unveiling DOM Event Interaction Challenges
Yu Pei, Jeongju Sohn, and Mike Papadakis
(University of Luxembourg, Luxembourg; Kyungpook National University, South Korea)

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
Avishree Khare, Saikat Dutta, Ziyang Li, Alaia Solko-Breslin, Rajeev Alur, and Mayur Naik
(University of Pennsylvania, USA; Cornell University, USA)
Security vulnerabilities in modern software are prevalent and harmful. While automated vulnerability detection techniques have made promising progress, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore if LLMs can be used to detect security vulnerabilities. In this paper, we perform a more comprehensive study by examining a larger and more diverse set of datasets, languages, and LLMs, and qualitatively evaluating detection performance across prompts and vulnerability classes. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples—1,000 randomly selected each from five diverse security datasets. These balanced datasets encompass synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Our results show that LLMs across all scales and families show modest effectiveness in end-to-end reasoning about vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across all datasets. LLMs are significantly better at detecting vulnerabilities that typically only need intra-procedural reasoning, such as OS Command Injection and NULL Pointer Dereference. Moreover, LLMs report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications (e.g., sources and sinks) and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We believe our insights can motivate future work on LLM-augmented vulnerability detection systems.

ADGE: Automated Directed GUI Explorer for Android Applications
Yue Jiang, Xiaobo Xiang, Qingli Guo, Qi Gong, and Xiaorui Gong
(University of Chinese Academy of Sciences, China; Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China; Singular Security Lab, Beijing, China)

Multi-project Just-in-Time Software Defect Prediction Based on Multi-task Learning for Mobile Applications
Feng Chen, Yuxin Ke, Xin Liu, and Qingjie Wei
(Chongqing University of Posts and Telecommunications, China)

A Taxonomy of Integration-Relevant Faults for Microservice Testing
Lena Gregor, Anja Hentschel, Leon Kastner, and Alexander Pretschner
(TU Munich, Germany; Siemens, Germany)

Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems
Stefano Carlo Lambertenghi, Hannes Leonhard, and Andrea Stocco
(TU Munich, Germany; fortiss, Germany)

Improving the Readability of Automatically Generated Tests using Large Language Models
Matteo Biagiola, Gianluca Ghislotti, and Paolo Tonella
(USI Lugano, Switzerland)

Benchmarking Generative AI Models for Deep Learning Test Input Generation
Maryam Maryam, Matteo Biagiola, Andrea Stocco, and Vincenzo Riccio
(University of Udine, Italy; USI Lugano, Switzerland; TU Munich, Germany; fortiss, Germany)

Challenges, Strategies, and Impacts: A Qualitative Study on UI Testing in CI/CD Processes from GitHub Developers’ Perspectives
Xiaoxiao Gan, Huayu Liang, and Chris Brown
(Virginia Tech, USA)
Continuous Integration and Continuous Delivery (CI/CD) processes are vital to meet the growing demands of open source software (OSS), providing a pipeline to enhance project quality and productivity. To ensure the user interfaces (UIs) of these systems work as intended, UI testing is crucial for verifying visual elements of software. Integrating UI tests in CI/CD pipelines should provide fast delivery and comprehensive test coverage. However, there is a gap in understanding how popular UI testing frameworks are adopted within CI/CD workflows—and the effects of this integration on OSS development. Aims: This study aims to explore developers’ perceptions of the challenges, strategies, and impacts of incorporating UI testing into CI/CD environments. In particular, we focus on OSS developers utilizing popular web-based UI testing frameworks—such as Selenium, Cypress, and Playwright—and popular CI/CD platforms—including GitHub Actions, Travis CI, CircleCI, and Jenkins—on public GitHub repositories. Method: We conducted an online survey targeting OSS developers (n = 94) from GitHub with experience integrating UI testing frameworks into configuration files for CI/CD platforms. To augment our results, we conducted follow-up interviews (n = 18) to gain insights on the challenges, opportunities, and impacts of integrating UI testing into CI/CD pipelines. Results: Our results indicate adapting testing strategy, flakiness and longer executions are major challenges in integrating UI testing into CI pipelines—negatively impacting development practices. Alternatively, the benefits include support for realistic test cases and increased detection of issues. However, developers lack effective strategies to mitigate the challenges, relying on ad hoc trial-and-error based approaches, such as temporarily removing flaky tests from CI workflows until they are resolved. Conclusion: Our findings provide implications for OSS developers working on or considering including UI tests in CI/CD pipelines. We also motivate future directions for research and tooling to improve UI testing integration in CI/CD workflows.

Coverage Metrics for T-Wise Feature Interactions
Sabrina Böhm, Tim Jannik Schmidt, Sebastian Krieter, Tobias Pett, Thomas Thüm, and Malte Lochau
(University of Ulm, Germany; University of Paderborn, Germany; TU Braunschweig, Germany; Karlsruhe Institute for Technology, Germany; University of Siegen, Germany)

Info

Code, Test, and Coverage Evolution in Mature Software Systems: Changes over the Past Decade
Thomas Bailey and Cristian Cadar
(Imperial College London, UK)

Info

Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation
Azat Abdullin, Pouria Derakhshanfar, and Annibale Panichella
(JetBrains Research, Netherlands; Delft University of Technology, Netherlands)

Suspicious Types and Bad Neighborhoods: Filtering Spectra with Compiler Information
Leonhard Applis, Matthías Páll Gissursarson, and Annibale Panichella
(National University of Singapore, Singapore; Chalmers University of Technology, Sweden; Delft University of Technology, Netherlands)

Many-Objective Neuroevolution for Testing Games
Patric Feldmeier, Katrin Schmelz, and Gordon Fraser
(University of Passau, Germany)

Differential Testing of Concurrent Classes
Valerio Terragni and Shing-Chi Cheung
(University of Auckland, New Zealand; Hong Kong University of Science and Technology, China)

On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering
Lauren Lyons and Ali Ghanbari
(Auburn University, USA)

AugmenTest: Enhancing Tests with LLM-Driven Oracles
Shaker Mahmud Khandaker, Fitsum Kifetew, Davide Prandi, and Angelo Susi
(Fondazione Bruno Kessler, Italy)
Automated test generation is crucial for ensuring the reliability and robustness of software applications while at the same time reducing the effort needed. While significant progress has been made in test generation research, generating valid test oracles still remains an open problem. To address this challenge, we present AugmenTest, an approach leveraging Large Language Models (LLMs) to infer correct test oracles based on available documentation of the software under test. Unlike most existing methods that rely on code, AugmenTest utilizes the semantic capabilities of LLMs to infer the intended behavior of a method from documentation and developer comments, without looking at the code. AugmenTest includes four variants: Simple Prompt, Extended Prompt, RAG with a generic prompt (without the context of class or method under test), and RAG with Simple Prompt, each offering different levels of contextual information to the LLMs. To evaluate our work, we selected 142 Java classes and generated multiple mutants for each. We then generated tests from these mutants, focusing only on tests that passed on the mutant but failed on the original class, to ensure that the tests effectively captured bugs. This resulted in 203 unique tests with distinct bugs, which were then used to evaluate AugmenTest. Results show that in the most conservative scenario, AugmenTest’s Extended Prompt consistently outperformed the Simple Prompt, achieving a success rate of 30% for generating correct assertions. In comparison, the state-of-the-art TOGA approach achieved 8.2%. Contrary to our expectations, the RAG-based approaches did not lead to improvements, with performance of 18.2% success rate for the most conservative scenario. Our study demonstrates the potential of LLMs in improving the reliability of automated test generation tools, while also highlighting areas for future enhancement

Info

Testing Practices, Challenges, and Developer Perspectives in Open-Source IoT Platforms
Daniel Rodriguez-Cardenas, Safwat Ali Khan, Prianka Mandal, Adwait Nadkarni, Kevin Moran, and Denys Poshyvanyk
(William & Mary, USA; George Mason University, USA; University of Central Florida, USA)

Impact of Large Language Models of Code on Fault Localization
Suhwan Ji, Sanghwa Lee, Changsup Lee, Yo-Sub Han, and Hyeonseung Im
(Yonsei University, South Korea; Kangwon National University, South Korea)
Identifying the point of error is imperative in software debugging. Traditional fault localization (FL) techniques rely on executing the program and using the code coverage matrix in tandem with test case results to calculate a suspiciousness score for each method or line. Recently, learning-based FL techniques have harnessed machine learning models to extract meaningful features from the code coverage matrix and improve FL performance. These techniques, however, require compilable source code, existing test cases, and specialized tools for generating the code coverage matrix for each programming language of interest. In this paper, we propose, for the first time, a simple but effective sequence generation approach for fine-tuning large language models of code (LLMCs) for FL tasks. LLMCs have recently received much attention for various software engineering problems. In line with these, we leverage the innate understanding of code that LLMCs have acquired through pre-training on large code corpora. Specifically, we fine-tune 13 representative encoder, encoder-decoder, and decoder-based LLMCs (across 7 different architectures) for FL tasks. Unlike previous approaches, LLMCs can analyze code sequences that do not compile. Still, they have a limitation on the length of the input data. Therefore, for a fair comparison with existing FL techniques, we extract methods with errors from the project-level benchmark, Defects4J, and analyze them at the line level. Experimental results show that LLMCs fine-tuned with our approach successfully pinpoint error positions in 50.6%, 64.2%, and 72.3% of 1,291 methods in Defects4J for Top-1/3/5 prediction, outperforming the best learning-based state-of-the-art technique by up to 1.35, 1.12, and 1.08 times, respectively. We also conduct an in-depth investigation of key factors that may affect the FL performance of LLMCs. Our findings suggest promising research directions for FL and automated program repair tasks using LLMCs.

Benchmarking Open-Source Large Language Models for Log Level Suggestion
Yi Wen Heng, Zeyang Ma, Zhenhao Li, Dong Jae Kim, and Tse-Hsun (Peter) Chen
(Concordia University, Canada; York University, Canada; DePaul University, USA)

Understanding and Enhancing Attribute Prioritization in Fixing Web UI Tests with LLMs
Zhuolin Xu, Qiushi Li, and Shin Hwei Tan
(Concordia University, Canada)

RustyRTS: Regression Test Selection for Rust
Simon Hundsdorfer, Roland Würsching, and Alexander Pretschner
(TU Munich, Germany)

Info

An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification
Riddhi More and Jeremy S. Bradbury
(Ontario Tech University, Canada)

On the Energy Consumption of Test Generation
Fitsum Kifetew, Davide Prandi, and Angelo Susi
(Fondazione Bruno Kessler, Italy)
Research in the area of automated test generation has seen remarkable progress in recent years, resulting in several approaches and tools for effective and efficient generation of test cases. In particular, the EvoSuite tool has been at the forefront of this progress embodying various algorithms for automated test generation of Java programs. EvoSuite has been used to generate test cases for a wide variety of programs as well. While there are a number of empirical studies that report results on the effectiveness, in terms of code coverage and other related metrics, of the various test generation strategies and algorithms implemented in EvoSuite, there are no studies, to the best of our knowledge, on the energy consumption associated to the automated test generation. In this paper, we set out to investigate this aspect by measuring the energy consumed by EvoSuite when generating tests. We also measure the energy consumed in the execution of the test cases generated, comparing them with those manually written by developers. The results show that the different test generation algorithms consumed different amounts of energy, in particular on classes with high cyclomatic complexity. Furthermore, we also observe that manual tests tend to consume more energy as compared to automatically generated tests, without necessarily achieving higher code coverage. Our results also give insight into the methods that consume significantly higher levels of energy, indicating potential points of improvement both for EvoSuite as well as the different programs under test.

Info

Industry Track

Practical Pipeline-Aware Regression Test Optimization for Continuous Integration
Daniel Schwendner, Maximilian Jungwirth, Martin Gruber, Martin Knoche, Daniel Merget, and Gordon Fraser
(BMW Group, Germany; University of Passau, Germany)
Massive, multi-language, monolithic repositories form the backbone of many modern, complex software systems. To ensure consistent code quality while still allowing fast development cycles, Continuous Integration (CI) is commonly applied. However, operating CI at such scale not only leads to a single point of failure for many developers, but also requires computational resources that may reach feasibility limits and cause long feedback latencies. To address these issues, developers commonly split test executions across multiple pipelines, running small and fast tests in pre- submit stages while executing long-running and flaky tests in post- submit pipelines. Given the long runtimes of many pipelines and the substantial proportion of passing test executions (98% in our pre-submit pipelines), there not only a need but also potential for further improvements by prioritizing and selecting tests. However, many previously proposed regression optimization techniques are unfit for an industrial context, because they (1) rely on complex and difficult-to-obtain features like per-test code coverage that are not feasible in large, multi-language environments, (2) do not automatically adapt to rapidly changing systems where new tests are continuously added or modified, and (3) are not designed to distinguish the different objectives of pre- and post-submit pipelines: While pre-submit testing should prioritize failing tests, post-submit pipelines should prioritize tests that indicate non- flaky changes by transitioning from pass to fail outcomes or vice versa. To overcome these issues, we developed a lightweight and pipeline-aware regression test optimization approach that employs Reinforcement Learning models trained on language- agnostic features. We evaluated our approach on a large industry dataset collected over a span of 20 weeks of CI test executions. When predicting the failure likelihood in pre-submit pipelines, our approach scheduled the first failing test within the first 16% of tests, outperforming existing approaches. When predicting test transitions in the post-submit pipeline, it was able to select 87% of developer-relevant tests by cutting the test execution time in half and over 99% within five cycles.

Introducing Black-Box Fuzz Testing for REST APIs in Industry: Challenges and Solutions
Andrea Arcuri, Alexander Poth, and Olsi Rrjolli
(Kristiania University College, Norway; Oslo Metropolitan University, Norway; Volkswagen, Germany)

Integrating LLM-Based Text Generation with Dynamic Context Retrieval for GUI Testing
Juyeon Yoon, Seah Kim, Somin Kim, Sukchul Jung, and Shin Yoo
(KAIST, South Korea; Samsung Research, South Korea)

Assessing the Uncertainty and Robustness of the Laptop Refurbishing Software
Chengjie Lu, Jiahui Wu, Shaukat Ali, and Mikkel Labori Olsen
(Simula Research Laboratory, Norway; University of Oslo, Norway; Danish Technological Institute, Denmark)

Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces
Neetha Jambigi, Bartosz Bogacz, Moritz Mueller, Thomas Bach, and Michael Felderer
(University of Cologne, Germany; SAP, Germany; DLR, Germany)

LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine
Erblin Isaku, Christoph Laaber, Hassan Sartaj, Shaukat Ali, Thomas Schwitalla, and Jan F. Nygård
(Simula Research Laboratory, Norway; University of Oslo, Norway; Cancer Registry of Norway, Norway; UiT The Arctic University of Norway, Norway)

Compiler Fuzzing in Continuous Integration: A Case Study on Dafny
Karnbongkot Boonriong, Stefan Zetzsche, and Alastair F. Donaldson
(Imperial College London, UK; Amazon, UK)

LLM-Based Labelling of Recorded Automated GUI-Based Test Cases
Diogo Buarque Franzosi, Emil Alégroth, and Maycel Isaac
(Blekinge Institute of Technology, Sweden; Synteda, Sweden)
Graphical User Interface (GUI) based testing is a commonly used practice in industry. Although valuable and, in many cases, necessary, it is associated with challenges such as high cost and requirements on both technical and domain expertise. Augmented testing, a novel approach to GUI test automation, aims to mitigate these challenges by allowing users to record and render test cases and test data directly on the GUI of the system under test (SUT). In this context, Scout is an augmented testing tool that captures system states and transitions during manual interaction with the SUT, storing them in a test model that is visually represented in the form of state trees and reports. While this representation provides basic overview of a test suite, e.g. about its size and number of scenarios, it is limited in terms of analysis depth, interpretability, and reproducibility. In particular, without human state labeling, it is challenging to produce meaningful and easily understandable test reports. To address this limitation, we present a novel solution and a demonstrator, integrated into Scout, which leverages large language models (LLMs) to enrich the model-based test case representation by automatically labeling and describing states and describing transitions. We conducted two experiments to evaluate the impact of thesolution. First, we compared LLM-enhanced reports with expert-generated reports using embedding distance evaluation metrics. Second, we assessed the usability and perceived value of the enhanced reports through an industrial survey. The results of the study indicate that the plugin can improve both readability and interpretability of test reports. This work contributes to the automation of GUI testing by reducing the need for manual intervention, e.g. labeling, and technical expertise, e.g. to understand test case models. Although the solution is studied in the context of augmented testing, we argue for the solution’s generalizability to related test automation techniques. In addition, we argue that this approach enables actionable insights and lays the groundwork for further research into autonomous testing based on Generative AI.

Info

Taming Uncertainty in Critical Scenario Generation for Testing Automated Driving Systems
Selma Grosse, Adam Molin, Dejan Ničković, Alessio Gambi, and Cristinel Mateis
(DENSO Automotive, Germany; Austrian Institute of Technology, Austria)

ML-Based Test Case Prioritization: A Research and Production Perspective in CI Environments
Md Asif Khan, Akramul Azim, Ramiro Liscano, Kevin Smith, Yee-Kang Chang, Gkerta Seferi, and Qasim Tauseef
(Ontario Tech University, Canada; IBM, United Kingdom; IBM, Canada; IBM, UK)

Evaluation of the Choice of LLM in a Multi-agent Solution for GUI-Test Generation
Stevan Tomic, Emil Alégroth, and Maycel Isaac
(Blekinge Institute of Technology, Sweden; Synteda, Sweden)

Early V&V in Knowledge-Centric Systems Engineering: Advances and Benefits in Practice
Jose Luis de la Vara, Juan Manuel Morote, Clara Ayora, Giovanni Giachetti, Luis Alonso, Roy Mendieta, David Muñoz, Ricardo Ruiz Nolasco, and Antonio González
(University of Castilla-La Mancha, Spain; Independent Researcher, Spain; Universidad de Castilla la Mancha, Spain; Universitat Politecnica de Valencia, Spain; The REUSE Company, Spain; RGB Medical Devices, Spain)

Speculative Testing at Google with Transition Prediction
Avi Kondareddy, Sushmita Azad, Abhayendra Singh, and Tim A. D. Henderson
(Google, USA; Google, UK)

Evaluating Machine Learning-Based Test Case Prioritization in the Real World: An Experiment with SAP HANA
Jeongki Son, Gabin An, Jingun Hong, and Shin Yoo
(SAP Labs, South Korea; KAIST, South Korea)

FuzzE, Development of a Fuzzing Approach for Odoo’s Tours Integration Testing Plateform
Gabriel Benoit, François Georis, Géry Debongnie, Benoît Vanderose, and Xavier Devroey
(University of Namur, Belgium; Odoo, Belgium)

Accessible Smart Contracts Verification: Synthesizing Formal Models with Tamed LLMs
Jan Corazza, Ivan Gavran, Gabriela Moreira, and Daniel Neider
(TU Dortmund, Germany; Informal Systems, Austria; Informal Systems, Brazil)

A Tale from the Trenches: Applying Metamorphic and Differential Testing to Bioinformatics Software
Alexis Marsh, Myra B. Cohen, and Robert Cottingham
(Iowa State University, USA; Oak Ridge National Laboratory, USA)

CubeTesterAI: Automated JUnit Test Generation using the LLaMA Model
Daniele Gorla, Shivam Kumar, Pietro Nicolaus Roselli Lorenzini, and Alireza Alipourfaz
(Sapienza University of Rome, Italy; PCCube, Italy)

Short Papers, Vision, and Emerging Results

Leveraging Large Language Models for Explicit Wait Management in End-to-End Web Testing
Dario Olianas, Maurizio Leotta, and Filippo Ricca
(University of Genoa, Italy)

Weighted Call Frequency-Based Fault Localization
Attila Szatmári, Aondowase James Orban, and Tamás Gergely
(University of Szeged, Hungary)

Addressing Data Leakage in HumanEval using Combinatorial Test Design
Jeremy S. Bradbury and Riddhi More
(Ontario Tech University, Canada)

Towards Cross-Build Differential Testing
Jens Dietrich, Tim White, Valerio Terragni, and Behnaz Hassanshahi
(Victoria University of Wellington, New Zealand; University of Auckland, New Zealand; Oracle Labs, Australia)

Test Generation from Use Case Specifications for IoT Systems: Custom, LLM-Based, and Hybrid Approaches
Zacharie Chenail-Larcher, Jean Baptiste Minani, and Naouel Moha
(ÉTS Montréal, Canada; Concordia University, Canada)

Batch Execution of Microbenchmarks for Efficient Performance Testing
Mostafa Jangali, Kundi Yao, Yiming Tang, Diego Elias Costa, and Weiyi Shang
(University of Waterloo, Canada; Rochester Institute of Technology, USA; Concordia University, Canada)

Pre-trained Models for Bytecode Instructions
Donggyu Kim, Taemin Kim, Ji-ho Shin, Song Wang, Heeyoul Choi, and Jaechang Nam
(Handong Global University, South Korea; York University, Canada)

Info

Towards Refined Code Coverage: A New Predictive Problem in Software Testing
Carolin Brandt and Aurora Ramírez
(Delft University of Technology, Netherlands; University of Córdoba, Spain)

Info

EnCus: Customizing Search Space for Automated Program Repair
Seongbin Kim, Sechang Jang, Jindae Kim, and Jaechang Nam
(Handong Global University, South Korea; Seoul National University of Science and Technology, South Korea)

Info

Harnessing Test Call Structures for Improved Fault Localization Effectiveness
Attila Szatmári
(University of Szeged, Hungary)

Improving the Comprehensibility of Generated Test Suites using Test Case Clustering
Mitchell Olsthoorn
(Delft University of Technology, Netherlands)

Education Track

Black-Box Testing for Practitioners: A Case of the New ISTQB Test Analyst Syllabus
Matthias Hamburg and Adam Roman
(International Software Testing Qualifications Board, Belgium; ISTQB German Testing Board, Germany; Jagiellonian University, Poland; ISTQB Polish Testing Board, Poland)

Combining Logic and Large Language Models for Assisted Debugging and Repair of ASP Programs
Ricardo Brancas, Vasco Manquinho, and Ruben Martins
(INESC-ID, Portugal; Universidade de Lisboa, Portugal; Carnegie Mellon University, USA)

Teaching Bug Advocacy through Flipped Classroom
Andreea Galbin-Nasui and Andreea Vescan
(Babes-Bolyai University, Cluj-Napoca, Romania)

Experience Report on using Experiential Learning to Facilitate Learning of Bug Investigation Steps
Adina Moldovan, Oana Casapu, and Andreea Vescan
(Altom, Romania; Babes-Bolyai University, Cluj-Napoca, Romania)

Requirements for an Automated Assessment Tool for Learning Programming by Doing
Arthur Rump, Vadim Zaytsev, and Angelika Mader
(University of Twente, Netherlands)

Info

A System-Level Testing Framework for Automated Assessment of Programming Assignments Allowing Students Object-Oriented Design Freedom
Valerio Terragni and Nasser Giacaman
(University of Auckland, New Zealand)

Can Test Generation and Program Repair Inform Automated Assessment of Programming Projects?
Ruizhen Gu, José Miguel Rojas, and Donghwan Shin
(University of Sheffield, UK)

A Tool-Assisted Training Approach for Empowering Localization and Internationalization Testing Proficiency
Maria Couto, Breno Miranda, and Kiev Gama
(Federal University of Pernambuco, Brazil)

Posters

Poster: Empirical Evaluation of SC-MCC Meta Program Efficiency using Dynamic Symbolic Execution Engine
Monika Rani Golla and Sangharatna Godboley
(NIT Warangal, India)

Poster: Reporting Unique-Cause MC/DC Score using Formal Verification
Monika Rani Golla, Sangharatna Godboley, Avijit Das, and P. Krishna Radha
(NIT Warangal, India; LRDE DRDO, India)

Poster: Quantification of Feature-Interaction Masking in JHipster
Tim Jannik Schmidt, Sabrina Böhm, Sebastian Krieter, Thomas Thüm, and Mathieu Acher
(University of Ulm, Germany; University of Paderborn, Germany; TU Braunschweig, Germany; Univ Rennes - CNRS - Inria - IRISA - IUF, France)

Poster: Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency
Rudolf Ramler, Philipp Straubinger, Reinhold Plösch, and Dietmar Winkler
(Software Competence Center Hagenberg, Austria; University of Passau, Germany; JKU Linz, Austria; Center for Digital Production, Austria; TU Wien, Austria)

Testing Tools and Data Showcase Track

Rocket: A System-Level Fuzz-Testing Framework for the XRPL Consensus Algorithm
Wishaal Kanhai, Ivar van Loon, Yuraj Mangalgi, Thijs van der Valk, Lucas Witte, Annibale Panichella, Mitchell Olsthoorn, and Burcu Kulahcioglu Ozkan
(Delft University of Technology, Netherlands)

Video

Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces
Max Hort and Leon Moonen
(Simula Research Laboratory, Norway; BI Norwegian Business School, Norway)

E2E-Loader: A Tool to Generate Performance Tests from End-to-End GUI-Level Tests
Sergio Di Meglio, Luigi Libero Lucio Starace, and Sergio Di Martino
(Federico II University of Naples, Italy)

ViMoTest: A Tool to Specify ViewModel-Based GUI Test Scenarios using Projectional Editing
Mario Fuksa, Sandro Speth, and Steffen Becker
(University of Stuttgart, Germany)

Video

RESTgym: A Flexible Infrastructure for Empirical Assessment of Automated REST API Testing Tools
Davide Corradini, Michele Pasqua, and Mariano Ceccato
(University of Luxembourg, Luxembourg; University of Verona, Italy)

Video

AMBER: AI-Enabled Java Microbenchmark Harness
Antonio Trovato, Luca Traini, Federico Di Menna, and Dario Di Nucci
(University of Salerno, Italy; University of L'Aquila, Italy)

Video

Technical Briefings and Tutorials

Scenario-Based Testing with BeamNG.tech (Hands-On Training)
Chrysanthi Papamichail, David Stark, and Alessio Gambi
(BeamNG, Greece; BeamNG, UK; Austrian Institute of Technology, Austria)

A Developer’s Guide to Building and Testing Accessible Mobile Apps
Juan Pablo Sandoval Alcocer, Leonel Merino, Alison Fernandez-Blanco, William Ravelo-Mendez, Camilo Escobar-Velásquez, and Mario Linares-Vásquez
(Pontificia Universidad Católica de Chile, Chile; Universidad de Los Andes, Colombia)

Doctoral Research

Autonomous Systems

Adversarial Testing with Reinforcement Learning
Andrea Doreste
(USI Lugano, Switzerland)

A Method for Systematically Assessing the Safety of Automated Driving Systems via Simulation
Ali Güllü
(University of Tartu, Estonia)

Uncertainty-Aware Autonomous Driving System Testing with Large Language Models
Jiahui Wu
(Simula Research Laboratory, Norway; University of Oslo, Norway)

Web/Mobile Systems

Identifying and Mitigating Flaky Tests in JavaScript
Negar Hashemi
(Massey University, New Zealand)

End-to-End Testing in Web Environments: Addressing Practical Challenges
Sergio Di Meglio
(Federico II University of Naples, Italy)

Advancing Mobile UI Testing by Learning Screen Usage Semantics
Safwat Ali Khan
(George Mason University, USA)
The demand for quality in mobile applications has increased greatly given users’ high reliance on them for daily tasks. Developers work tirelessly to ensure that their applications are both functional and user-friendly. In pursuit of this, Automated Input Generation (AIG) tools have emerged as a promising solution for testing mobile applications by simulating user interactions and exploring app functionalities. However, these tools face significant challenges in navigating complex Graphical User Interfaces (GUIs), and developers often have trouble understanding their output. More specifically, AIG tools face difficulties in navigating out of certain screens, such as login pages and advertisements, due to a lack of contextual understanding which leads to suboptimal testing coverage. Furthermore, while AIG tools can provide interaction traces consisting of action and screen details, there is limited understanding of its coverage of higher level functionalities, such as logging in, setting alarms, or saving notes. Understanding these covered use cases are essential to ensure comprehensive test coverage of app functionalities. Difficulty in testing mobile UIs can lead to the design of complex interfaces, which can adversely affect users of advanced age who often face usability barriers due to small buttons, cluttered layouts, and unintuitive navigation. There exists many studies that highlight these issues, but automated solutions for improving UI accessibility needs more attention. To address these interconnected challenges faced by mobile developers and app users, my PhD dissertation works towards advancing automated techniques for mobile UI testing. This research seeks to enhance automated UI testing techniques by learning the screen usage semantics of mobile apps and helping them navigate more efficiently, offer more insights about tested functionalities and also improve the usability of a mobile app’s interface by identifying and mitigating UI design issues.

Debugging and Reliability

Enhancing Spectrum-Based Fault Localization in the Context of Reactive Programming
Aondowase James Orban
(University of Szeged, Hungary)

Toward Tool-Agnostic Guidelines for Expert Debugging Strategies
Homayoun Safarpour
(University of Szeged, Hungary)

On Service-to-Service Integration Testing in Microservice Systems
Lena Gregor
(TU Munich, Germany)

Evaluating Correct-Consistency and Robustness in Code-Generating LLMs
Shahin Honarvar
(Imperial College London, UK)

Tool Competition – Self-Driving Car Testing Track

ICST Tool Competition 2025 – Self-Driving Car Testing Track
Christian Birchler, Stefan Klikovits, Mattia Fazzini, and Sebastiano Panichella
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland; Johannes Kepler University Linz, Austria; University of Minnesota, USA)

DETOUR at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Paolo Arcaini and Ahmet Cetinkaya
(National Institute of Informatics, Japan; Shibaura Institute of Technology, Japan)

DRVN at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Antony Bartlett, Cynthia Liem, and Annibale Panichella
(Delft University of Technology, Netherlands)

ITS4SDC at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Ali Güllü, Faiz Ali Shah, and Dietmar Pfahl
(University of Tartu, Estonia)

CertiFail at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Fasih Munir Malik and Sajad Mazraeh Khatiri
(University of Bern, Switzerland)

NN-SDCTest at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Prakash Aryana and Sajad Khatiri
(Birla Institute of Technology and Science, India; University of Bern, Switzerland; USI Lugano, Switzerland; Zurich University of Applied Sciences, Switzerland)

Tool Competition – Unmanned Aerial Vehicles Testing Track

ICST Tool Competition 2025 – UAV Testing Track
Sajad Khatiri, Tahereh Zohdinasab, Prasun Saurabh, Dmytro Humeniuk, and Sebastiano Panichella
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland; USI Lugano, Switzerland; Polytechnique Montréal, Canada)

Evolv-1 at the ICST 2025 Tool Competition – UAV Testing Track
Pietro Lechthaler, Davide Prandi, Fitsum Kifetew, and Angelo Susi
(Fondazione Bruno Kessler, Italy)

TGen-UQ at the ICST 2025 Tool Competition – UAV Testing Track
Ali Javadi and Christian Birchler
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland)

PALM at the ICST 2025 Tool Competition – UAV Testing Track
Shuncheng Tang, Zhenya Zhang, Ahmet Cetinkaya, and Paolo Arcaini
(University of Science and Technology of China, China; Kyushu University, Japan; Shibaura Institute of Technology, Japan; National Institute of Informatics, Japan)

proc time: 1.3