2025 IEEE Conference on Software Testing, Verification and Validation (ICST),
March 31 – April 4, 2025,
Naples, Italy
Frontmatter
Technical-Research Track
SPIDER: Fuzzing for Stateful Performance Issues in the ONOS Software-Defined Network Controller
Ao Li,
Rohan Padhye, and Vyas Sekar
(Carnegie Mellon University, USA)
Performance issues in software-defined network (SDN) controllers can have serious impacts on the performance and availability of networks. In this paper, we consider a special class of SDN vulnerabilities called stateful performance issues (SPIs), where a sequence of initial input messages drives the controller into a state such that its performance degrades pathologically when processing subsequent messages. Uncovering SPIs in large complex software such as the widely used ONOS SDN controller is challenging because of the large state space of input sequences and the complex software architecture of inter-dependent network services. We present SPIDER, a practical fuzzing framework for identifying SPIs in this setting. The key contribution in our work is to leverage the event-driven modular software architecture of the SDN controller to (a) separately target each network service for SPIs and (b) use static analysis to identify all services whose event handlers can affect the state of the target service directly or indirectly. SPIDER implements this novel dependency-aware modular performance fuzzing approach for 157 network services in ONOS and successfully identifies 10 new performance issues. We present an evaluation of SPIDER against prior work, a sensitivity analysis of design decisions, and case studies of two uncovered SPIs.
Detecting and Evaluating Order-Dependent Flaky Tests in JavaScript
Negar Hashemi, Amjed Tahir, Shawn Rasheed,
August Shi, and
Rachel Blagojevic
(Massey University, New Zealand; UCOL, New Zealand; University of Texas at Austin, USA)
Flaky tests pose a significant issue for software testing. A test with a non-deterministic outcome may undermine the reliability of the testing process, making tests untrustworthy. Previous research has identified test order dependency as one of the most prevalent causes of flakiness, particularly in Java and Python. However, little is known about test order dependency in JavaScript tests. This paper aims to investigate test order dependency in JavaScript projects that use Jest, a widely used JavaScript testing framework. We implemented a systematic approach to randomise tests, test suites, and describe blocks, and produced 10 unique test reorders for each level. We reran each order 10 times (100 reruns for each test suite/project) and recorded any changes in test outcomes. We then manually analysed each case that showed flaky outcomes to determine the cause of flakiness. We evaluated our detection approach on a dataset of 81 projects obtained from GitHub. Our results revealed 55 order-dependent tests across 10 projects. Most order-dependent tests (52) occurred between tests, while the remaining three occurred between describe blocks. These order-dependent tests are caused by either shared files (13) or shared mocking state (42) between tests. While sharing files is a known cause of order-dependent tests in other languages, our results underline a new cause (shared mocking state) that was not reported previously.
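To illustrate the most common cause reported above, shared mocking state, the following minimal Jest test file sketches how a mock configured in one test can leak into another, so the outcome depends on execution order. This sketch is ours, not taken from the paper; the module "./config" and its exported getFlag() function are hypothetical.

```typescript
// Hypothetical example: order-dependent flakiness via shared mocking state in Jest.
// The module "./config" and its exported getFlag(): boolean are assumed for illustration.
import { getFlag } from "./config";

jest.mock("./config"); // auto-mock: getFlag becomes a jest.fn() returning undefined

const mockedGetFlag = getFlag as jest.MockedFunction<typeof getFlag>;

test("feature is shown when the flag is on", () => {
  // Configures the shared mock and never resets it.
  mockedGetFlag.mockReturnValue(true);
  expect(getFlag()).toBe(true);
});

test("feature is hidden by default", () => {
  // Implicitly relies on the auto-mock's default return value (undefined).
  // This passes when run first, but fails when run after the test above,
  // because the mockReturnValue(true) configuration persists across tests
  // unless clearMocks is enabled or jest.clearAllMocks() runs in beforeEach.
  expect(getFlag()).toBeFalsy();
});
```

Randomising the order inside such a file, as the study does at the test level, exposes the dependency because only one of the two orders fails.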
The Impact of List Reduction for Language Agnostic Test Case Reducers
Tobias Heineken and Michael Philippsen
(Friedrich-Alexander University Erlangen-Nürnberg, Germany)
To find and fix bugs in compilers or in other code processing tools, modern language agnostic test case reducers boil the input files down to small bug-triggering versions. To do so they carefully craft lists of potentially irrelevant items and apply a list reduction to minimize them. We show that substituting the chosen list reduction algorithm improves the overall reducer runtime without affecting the final file sizes much. In a comparative study we combine 6 renowned test case reducers with 7 established list reductions. Most renowned reducers become faster by switching to another list reduction. We also present three ways to preprocess the crafted lists before the test case reducers pass them to the list reductions. We discuss the conditions for the preprocessings to improve the reducers’ speeds even more. On a benchmark of 321 C and SMT-LIB2 compiler bugs, selecting a different list reduction saves up to 74.7% of the runtime. Most test case reducers benefit from such a substitution. Preprocessing saves up to 9.1 additional percentage points. Combining these ideas saves up to 75.2% of the runtime.
Hybrid Equivalence/Non-equivalence Testing
Laboni Sarker and
Tevfik Bultan
(University of California at Santa Barbara, USA)
Code equivalence analysis is a critical problem in software engineering. In this paper, we focus on assessing if two code segments exhibit diverging behaviors. While symbolic analysis, exemplified by techniques like differential symbolic execution and impacted summaries, has been employed for equivalence analysis, it faces challenges due to limitations of symbolic execution, producing inconclusive results in many cases. To mitigate these limitations, we introduce a hybrid approach that integrates symbolic analysis with differential fuzzing. Differential fuzzing, although unable to prove equivalence like symbolic analysis, is valuable in identifying non-equivalent instances quickly. Our proposed hybrid approach leverages the strengths of symbolic reasoning and fuzzing within a single workflow for more effective analysis. The contributions of this paper include the introduction of a hybrid equivalence/non-equivalence testing approach, multiple heuristics for hybrid analysis, and differential fuzzing. Our experimental evaluation on multiple benchmarks, including EQBench and ARDiff, demonstrates that our proposed hybrid techniques can prove equivalence/non-equivalence for 28.52% more cases while taking 43.21% less time on average compared to state-of-the-art symbolic-execution-based techniques.
Metamorphic Testing for Pose Estimation Systems
Matias Duran, Thomas Laurent, Ellen Rushe, and Anthony Ventresque
(SFI Lero, Ireland; Trinity College Dublin, Ireland; Dublin City University, Ireland)
Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems, we propose MeT-Pose, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MeT-Pose thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MeT-Pose in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MeT-Pose by applying it to Mediapipe Holistic, a state-of-the-art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MeT-Pose can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand-labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.
Mutation-Based Fuzzing of the Swift Compiler with Incomplete Type Information
Sarah Canto Hyatt and
Kyle Dewey
(University of California at Santa Barbara, USA; California State University, USA)
Fuzzing statically-typed language compilers practically necessitates the generation of well-typed programs, which is a major challenge for fuzzing modern languages with rich type systems. Existing fuzzers only handle languages with simplistic type systems (e.g., C), only generate programs from small language subsets, or frequently generate ill-typed programs. In this work, we propose a mutation-based method for guaranteed well-typed program generation, even with minimal type system knowledge. With our method, we take a known well-typed seed program and annotate understood nodes with their types. We then try to replace annotated nodes with new type-equivalent ones, using a generator which fails if the input type is not understood. The end result is guaranteed overall well-typed as long as the original program was well-typed, even if most nodes are unannotated or the generator usually fails. We applied this approach to fuzzing the Swift compiler, and to the best of our knowledge, we are the first to fuzz Swift. To implement our generator, we adapted constraint logic programming (CLP)-based fuzzing to work in a mutation-based context without a CLP engine; this is the first such adaptation. Our fuzzer generates ∼22k programs per second, and we used it to find 13 Swift bugs, of which 7 have been confirmed or fixed by developers to date. Five bugs involved the compiler rejecting a well-typed program and were only discoverable thanks to our well-typed generation guarantee.
Scalable SMT Sampling for Floating-Point Formulas via Coverage-Guided Fuzzing
Manuel Carrasco,
Cristian Cadar, and
Alastair F. Donaldson
(Imperial College London, UK)
SMT sampling involves finding numerous satisfying assignments (samples) for an SMT formula, and is increasingly finding applications in software testing. An effective SMT sampler should achieve high throughput, yielding a large number of samples in a given time budget, and high diversity, yielding samples that cover disparate parts of the solution space. Most SMT samplers rely on off-the-shelf SMT solvers and thus inherit those solvers’ scalability issues. Because SMT solvers tend to scale poorly when applied to floating-point constraints, the scalability and diversity of SMT sampling are correspondingly limited in the floating-point domain. We propose JFSampler, the first SMT sampling technique built on top of coverage-guided fuzzing. JFSampler extends Just Fuzz-it Solver (JFS), a scalable but incomplete SMT solver that is effective at finding solutions to floating-point formulas by encoding satisfiability as a reachability problem that is then offloaded to a fuzzer. By building on JFS, JFSampler has an advantage over other SMT samplers in the floating-point domain. Further, we propose two novel strategies that increase both throughput and diversity of sampled solutions. First, JFSampler enhances the fuzzer’s code coverage feedback signal by measuring coverage of the formula’s solution space. Second, JFSampler incorporates a custom mutator tailored for SMT sampling. By design, these two novel techniques can be combined, having a positive synergy on throughput and diversity. We present a large evaluation over QF_FP and QF_BVFP formulas from the SMT-LIB benchmark. Our results show that JFSampler achieves substantial improvements over SMTSampler, a state-of-the-art SMT sampling technique, when applied to floating-point formulas.
Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
Shahin Honarvar, Mark van der Wilk, and
Alastair F. Donaldson
(Imperial College London, UK; University of Oxford, UK)
We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM’s code generation abilities to be identified, including anomalies where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting robustness issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
An Empirical Study of Web Flaky Tests: Understanding and Unveiling DOM Event Interaction Challenges
Yu Pei, Jeongju Sohn, and
Mike Papadakis
(University of Luxembourg, Luxembourg; Kyungpook National University, South Korea)
Flaky tests, which exhibit non-deterministic behavior and fail without changes to the codebase, pose significant challenges to the reliability and efficiency of software testing processes. Despite extensive research on flaky tests in traditional unit and integration testing, their impact and prevalence within web user interface (UI) testing remain relatively unexplored, especially concerning Document Object Model (DOM) events. In web applications, DOM-related flakiness, resulting from unstable interactions between the DOM and events, is particularly prevalent. This study conducts an empirical analysis of 123 flaky tests in 49 open-source web projects, focusing on the correlation between DOM event interactions and test flakiness.
Our findings indicate that DOM events, and their associated interactions with the application, can introduce flakiness in web UI tests; these events are frequently associated with Event-DOM interactions (32.5%), Event operations (22.8%), and Response evaluations (16.3%). The analysis of DOM consistency and event interaction levels reveals that element-level interactions across multiple DOMs are more likely to cause flakiness than interactions confined to a single DOM or occurring at the page level. Furthermore, the primary strategies used by developers to handle these issues involve synchronizing DOM interactions (50.4%), managing conditional event completion (38.2%), and ensuring consistent DOM state transitions (11.4%). We discovered that the Event-DOM category is fixed most frequently (2.6 times), while the DOM category alone takes the longest time to resolve (153.4 days). This study provides practical insights into improving web application testing practices by highlighting the importance of understanding and managing DOM event interactions.
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
Avishree Khare, Saikat Dutta,
Ziyang Li, Alaia Solko-Breslin,
Mayur Naik, and Rajeev Alur
(University of Pennsylvania, USA; Cornell University, USA)
Security vulnerabilities in modern software are prevalent and harmful. While automated vulnerability detection techniques have made promising progress, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore if LLMs can be used to detect security vulnerabilities. In this paper, we perform a more comprehensive study by examining a larger and more diverse set of datasets, languages, and LLMs, and qualitatively evaluating detection performance across prompts and vulnerability classes.
Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples—1,000 randomly selected each from five diverse security datasets. These balanced datasets encompass synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes.
Our results show that LLMs across all scales and families exhibit modest effectiveness in end-to-end reasoning about vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across all datasets. LLMs are significantly better at detecting vulnerabilities that typically only need intra-procedural reasoning, such as OS Command Injection and NULL Pointer Dereference. Moreover, LLMs report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL.
We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications (e.g., sources and sinks) and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We believe our insights can motivate future work on LLM-augmented vulnerability detection systems.
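For reference, the accuracy and F1 figures quoted above follow the standard classification-metric definitions; the formulas below are the textbook ones (our addition, not taken from the paper), with TP/FP/TN/FN denoting true/false positives/negatives.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```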
ADGE: Automated Directed GUI Explorer for Android Applications
Yue Jiang, Xiaobo Xiang, Qingli Guo, Qi Gong, and Xiaorui Gong
(Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China; Singular Security Lab, Beijing, China)
With the continuous growth in the number of Android applications and the size of their codebases, it has become increasingly difficult for testers to manually analyze and trigger the functionalities of interest in each application. For instance, it is hard to trigger vulnerability points reported by scanners or reproduce captured crash scenarios. On the other hand, most existing automated exploration techniques exhibit slow performance when triggering specified targets due to the extensive exploration of different paths. Target-directed techniques can effectively address this issue but are relatively underexplored in existing research. The only target-directed exploration tool, GOALEXPLORER, is constrained by the limitations in the precision of its static analysis, which negatively impacts both exploration efficiency and effectiveness.
To boost the efficiency of target-directed exploration, we propose an automated GUI testing method guided by target functions called Automated Directed GUI Explorer (ADGE). Specifically, ADGE first generates a tainted Inter-procedural Control Flow Graph with the GUI widgets by modeling the role of GUI widgets in the control flow as well as their relationship with the target using static analysis. In the dynamic exploration phase, ADGE constructs a real-time model of the fragments and menus on the current screen to guide its exploration decisions with the knowledge of the static model. To validate the effectiveness of ADGE, we conduct extensive comparisons of ADGE with the state-of-the-art baseline GOALEXPLORER on 55 benchmark applications. The results demonstrate that ADGE reduced the average time to trigger targets by 44% compared to GOALEXPLORER, while also successfully triggering 5.24% more targets. Furthermore, during the testing process, ADGE successfully triggered 5 crash events.
Multi-project Just-in-Time Software Defect Prediction Based on Multi-task Learning for Mobile Applications
Chen Feng, Ke Yuxin, Liu Xin, and Wei Qingjie
(Chongqing University of Posts and Telecommunications, China)
In the rapid development of mobile applications, frequent code commits pose significant challenges for quality assurance. Just-in-Time Software Defect Prediction (JIT-SDP) helps at the commit level but often struggles due to insufficient labeled data, particularly in newer applications. To address this issue, we introduce JMFM, a novel approach leveraging Multi-Task Learning (MTL), Fuzzy C-Means (FCM) clustering, and Multi-Head Attention (MHA) for JIT-SDP. JMFM integrates multiple projects for training under the MTL framework, treating each project as a distinct task and enabling cross-project learning. In JMFM, FCM clustering determines the membership of each data sample to various clusters, which is then used in the MHA module to compute similarity weights between samples. By integrating the weighted sum of other samples' data, each sample is augmented with additional information for shared learning. Simultaneously, each project is trained in a task-specific layer to retain its unique features. We calculate the joint loss by giving greater weight to projects with fewer samples, to ensure that they are not overshadowed by larger projects. Experiments on 15 Android mobile applications show that JMFM outperforms existing models on metrics such as F1, MCC, and AUC, especially for projects with scarce data.
A Taxonomy of Integration-Relevant Faults for Microservice Testing
Lena Gregor, Anja Hentschel, Leon Kastner, and
Alexander Pretschner
(TU Munich, Germany; Siemens, Germany; fortiss, Germany)
Microservices have emerged as a popular architectural paradigm, offering a flexible and scalable approach to software development. However, their distributed nature and diverse technology stacks introduce inherent complexities, surpassing those of monolithic systems. The integration of microservices presents numerous challenges, from communication failures to compatibility issues, compromising system reliability. Understanding faults in these distributed components is crucial for preventing defects, devising test strategies, and implementing robustness testing. Despite the significance of these software systems, existing taxonomies are limited, as they primarily focus on non-functional attributes or lack empirical validation. To address these gaps, this paper proposes an extensive taxonomy of the most common integration-relevant faults observed in large-scale microservice systems in industry. Leveraging insights from a systematic literature review and ten semi-structured interviews with industry experts, we identify common integration-related faults encountered in real-world microservice projects. Our final taxonomy was validated through a survey with an additional set of 16 practitioners, confirming that almost all fault categories (21/23) were experienced by at least 50% of the survey participants.
Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems
Stefano Carlo Lambertenghi, Hannes Leonhard, and Andrea Stocco
(fortiss, Germany; TU Munich, Germany)
Advanced Driver Assistance Systems (ADAS) based on deep neural networks (DNNs) are widely used in autonomous vehicles for critical perception tasks such as object detection, semantic segmentation, and lane recognition. However, these systems are highly sensitive to input variations, such as noise and changes in lighting, which can compromise their effectiveness and potentially lead to safety-critical failures.
This study offers a comprehensive empirical evaluation of image perturbations, techniques commonly used to assess the robustness of DNNs, to validate and improve the robustness and generalization of ADAS perception systems. We first conducted a systematic review of the literature, identifying 38 categories of perturbations. Next, we evaluated their effectiveness in revealing failures in two different ADAS, both at the component and at the system level. Finally, we explored the use of perturbation-based data augmentation and continuous learning strategies to improve ADAS adaptation to new operational design domains. Our results demonstrate that all categories of image perturbations successfully expose robustness issues in ADAS and that the use of dataset augmentation and continuous learning significantly improves ADAS performance in novel, unseen environments.
Improving the Readability of Automatically Generated Tests using Large Language Models
Matteo Biagiola, Gianluca Ghislotti, and
Paolo Tonella
(USI Lugano, Switzerland)
Search-based test generators are effective at producing unit tests with high coverage. However, such automatically generated tests have no meaningful test and variable names, making them hard to understand and interpret by developers. On the other hand, large language models (LLMs) can generate highly readable test cases, but they are not able to match the effectiveness of search-based generators in terms of achieved code coverage.
In this paper, we propose to combine the effectiveness of search-based generators with the readability of LLM-generated tests. Our approach focuses on improving the test and variable names produced by search-based tools, while keeping their semantics (i.e., their coverage) unchanged.
Our evaluation on nine industrial and open-source LLMs shows that our readability-improvement transformations are overall semantics-preserving and stable across multiple repetitions. Moreover, a human study with ten professional developers shows that our LLM-improved tests are as readable as developer-written tests, regardless of the LLM employed.
Benchmarking Generative AI Models for Deep Learning Test Input Generation
Maryam Maryam,
Matteo Biagiola, Andrea Stocco, and
Vincenzo Riccio
(University of Udine, Italy; USI Lugano, Switzerland; TU Munich, Germany)
Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training.
In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
Challenges, Strategies, and Impacts: A Qualitative Study on UI Testing in CI/CD Processes from GitHub Developers’ Perspectives
Xiaoxiao Gan, Huayu Liang, and
Chris Brown
(Virginia Tech, USA)
Continuous Integration and Continuous Delivery (CI/CD) processes are vital to meet the growing demands of open source software (OSS), providing a pipeline to enhance project quality and productivity. To ensure the user interfaces (UIs) of these systems work as intended, UI testing is crucial for verifying visual elements of software. Integrating UI tests in CI/CD pipelines should provide fast delivery and comprehensive test coverage. However, there is a gap in understanding how popular UI testing frameworks are adopted within CI/CD workflows—and the effects of this integration on OSS development. Aims: This study aims to explore developers’ perceptions of the challenges, strategies, and impacts of incorporating UI testing into CI/CD environments. In particular, we focus on OSS developers utilizing popular web-based UI testing frameworks—such as Selenium, Cypress, and Playwright—and popular CI/CD platforms—including GitHub Actions, Travis CI, CircleCI, and Jenkins—on public GitHub repositories. Method: We conducted an online survey targeting OSS developers (n = 94) from GitHub with experience integrating UI testing frameworks into configuration files for CI/CD platforms. To augment our results, we conducted follow-up interviews (n = 18) to gain insights on the challenges, opportunities, and impacts of integrating UI testing into CI/CD pipelines. Results: Our results indicate that adapting the testing strategy, flakiness, and longer execution times are major challenges in integrating UI testing into CI pipelines, negatively impacting development practices. Alternatively, the benefits include support for realistic test cases and increased detection of issues. However, developers lack effective strategies to mitigate the challenges, relying on ad hoc trial-and-error approaches, such as temporarily removing flaky tests from CI workflows until they are resolved. Conclusion: Our findings provide implications for OSS developers working on or considering including UI tests in CI/CD pipelines. We also motivate future directions for research and tooling to improve UI testing integration in CI/CD workflows.
Coverage Metrics for T-Wise Feature Interactions
Sabrina Böhm, Tim Jannik Schmidt, Sebastian Krieter, Tobias Pett, Thomas Thüm, and Malte Lochau
(University of Ulm, Germany; University of Paderborn, Germany; TU Braunschweig, Germany; Karlsruhe Institute for Technology, Germany; University of Siegen, Germany)
Software is typically configurable by means of compile-time or runtime variability. As testing every valid configuration is infeasible, t-wise sampling has been proposed to systematically derive a relevant subset of the configurations for testing in order to cover interactions among t features. Practitioners have started to apply t-wise sampling algorithms, but can often only test samples partially due to restricted resources, and compare those partial samples based on their t-wise coverage. However, there is no consensus in the literature on how to compute t-wise coverage. We propose the first systematic framework to define coverage metrics for t-wise feature interactions. These metrics differ in the features and feature interactions being considered. We found evidence for at least six different metrics in the literature. In an empirical evaluation, we show that for a partial sample the coverage differs by up to 21% and that for some metrics only half of the feature interactions need to be covered. As a long-term impact, our work may help to improve the efficiency and effectiveness of both t-wise sampling and coverage computations.
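To make the notion of t-wise coverage concrete, one commonly used formulation (our illustration, not necessarily one of the six metrics identified in the paper) counts the fraction of valid t-wise feature interactions covered by a partial sample S:

```latex
% Illustrative t-wise coverage metric; \mathcal{I}_t denotes the set of valid
% t-wise feature interactions of the configurable system, and S the (partial) sample.
\[
\mathrm{cov}_t(S) \;=\;
\frac{\bigl|\{\, I \in \mathcal{I}_t \;:\; \exists\, c \in S \text{ such that } c \text{ covers } I \,\}\bigr|}
     {\bigl|\mathcal{I}_t\bigr|}
\]
```

The metrics surveyed in the paper differ precisely in which features and feature interactions are included in the denominator.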
Code, Test, and Coverage Evolution in Mature Software Systems: Changes over the Past Decade
Thomas Bailey and
Cristian Cadar
(Imperial College London, UK)
Despite the central role of test suites in the software development process, there is surprisingly limited information on how code and tests co-evolve to exercise different parts of the codebase.
A decade ago, the Covrig project examined the code, test, and coverage evolution in six mature open-source C/C++ projects, spanning a combined development time of twelve years. In this study, we significantly expand the analysis to nine mature C/C++ projects and a combined period of 78 years of development time. Our focus is on understanding how development practices have changed and how these changes have impacted the way in which software is tested.
We report on the co-evolution of code and tests; the adoption of CI, coverage, and fuzzing services; the changes to the overall code coverage achieved by developer test suites; the distribution of patch coverage across revisions; how code changes drive changes in coverage; and the occurrence and evolution of flaky tests.
Our large-scale study paints a mixed picture in terms of how software development and testing have changed over the past decade. While developers put more emphasis on software testing and the overall code coverage achieved by developer test suites has increased in most projects, coverage and fuzzing services are not widely adopted, many patches are still poorly tested, and the fraction of flaky tests has increased.
Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation
Azat Abdullin,
Pouria Derakhshanfar, and
Annibale Panichella
(JetBrains Research, n.n.; Delft University of Technology, Netherlands)
Generating tests automatically is a key and ongoing area of focus in software engineering research. The emergence of Large Language Models (LLMs) has opened up new opportunities, given their ability to perform a wide spectrum of tasks. However, the effectiveness of LLM-based approaches compared to traditional techniques such as search-based software testing (SBST) and symbolic execution remains uncertain. In this paper, we perform an extensive study of automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. We evaluate the tools’ performance on the GitBug Java dataset and compare them using various execution-based and feature-based metrics. Our results show that while LLM-based test generation is promising, it falls behind traditional methods w.r.t. coverage. However, it significantly outperforms them in mutation scores, suggesting that LLMs provide a deeper semantic understanding of code. The LLM-based approach performed worse than the SBST and symbolic execution-based approaches w.r.t. fault detection capabilities. Additionally, our feature-based analysis shows that all tools are affected by the complexity and internal dependencies of the class under test (CUT), with LLM-based approaches being especially sensitive to the CUT size.
Suspicious Types and Bad Neighborhoods: Filtering Spectra with Compiler Information
Leonhard Applis, Matthías Páll Gissursarson, and
Annibale Panichella
(Delft University of Technology, Netherlands; Chalmers University of Technology, Sweden)
Spectrum-based fault localization (SBFL) and its formulas often struggle with large spectra containing many expressions irrelevant to the fault, which impacts its overall effectiveness. Spectra can inflate for large programs or at finer granularity, such as the expression-level coverage used in languages like Haskell. To address this, we introduce 25 rules to filter the spectra based on type information, AST attributes, and test results. These aim to reduce the suspiciousness of innocent locations (bug-free expressions) and improve the performance of SBFL formulas w.r.t. the TOP50 and TOP100 metrics. Our experiment, conducted on 11 Haskell programs, shows that individual filters significantly reduce spectra size, although some data points (faulty expressions) become unsolvable. By applying established SBFL formulas like Ochiai and Tarantula to these reduced spectra, we observe average improvements of up to 40% w.r.t. TOP50 for individual soft rules, such as proximity to failure. Combining the best-performing filters yields improvements of 45.5% for Ochiai, 67.4% for DStar2, and 45.5% for Tarantula. The most effective filtering rules over all formulas captured proximity to failing expressions, usage of a non-unique type, and whether a failing test covered the expression. Our results suggest that simple, straightforward filters can produce substantial performance gains. We further identify 4 uncovered bugs originating from code generation (common in functional programming) and system tests, which cannot be addressed purely by spectrum-based fault localization.
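For readers unfamiliar with the SBFL formulas named above, the standard textbook definitions are shown below (our addition, not taken from the paper); e_f and e_p are the numbers of failing and passing tests that cover expression e, and F and P are the total numbers of failing and passing tests.

```latex
\[
\mathrm{Ochiai}(e) = \frac{e_f}{\sqrt{F \cdot (e_f + e_p)}}, \qquad
\mathrm{Tarantula}(e) = \frac{e_f / F}{e_f / F + e_p / P}, \qquad
\mathrm{DStar2}(e) = \frac{e_f^{2}}{e_p + (F - e_f)}
\]
```

The filters proposed in the paper shrink the spectra that these formulas are applied to, rather than changing the formulas themselves.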
Many-Objective Neuroevolution for Testing Games
Patric Feldmeier, Katrin Schmelz, and
Gordon Fraser
(University of Passau, Germany)
Games are designed to challenge human players, but this also makes it challenging to generate software tests for computer games automatically. Neural networks have therefore been proposed to serve as dynamic test cases trained to reach statements in the underlying code, similar to how static test cases consisting of event sequences would do in traditional software. The NEATEST approach combines search-based software testing principles with neuroevolution to generate such dynamic test cases. However, it may take long or even be impossible to evolve a network that can cover individual program statements, and since NEATEST is a single-objective algorithm, it will have to be sequentially invoked for a potentially large number of coverage goals. In this paper, we therefore propose to treat the neuroevolution of dynamic test cases as a many-objective search problem. By targeting all coverage goals at the same time, easy goals are covered quickly, and the search can focus on more challenging ones. We extend the state-of-the-art many-objective test generation algorithms MIO and MOSA as well as the state-of-the-art many-objective neuroevolution algorithm NEWS/D to generate dynamic test cases. Experiments on 20 SCRATCH games show that targeting several objectives simultaneously increases NEATEST’s average branch coverage from 75.88% to 81.33% while reducing the search time by 93.28%.
Differential Testing of Concurrent Classes
Valerio Terragni and
Shing-Chi Cheung
(University of Auckland, New Zealand; Hong Kong University of Science and Technology, China)
Concurrent programs are pervasive, yet difficult to write. The inherent complexity of thread synchronization makes the evolution of concurrent programs prone to concurrency faults. Previous work on regression testing concurrent programs focused on reducing the cost of re-running the existing tests. However, existing tests may not be able to expose the regression faults in the modified program. In this paper, we present ConDiff, a differential testing technique that generates concurrent tests and oracles to expose behavioral differences between two versions of a given concurrent class. Since concurrent programs are non-deterministic, this involves exploring all possible non-deterministic thread interleavings of each generated test on both versions. However, we can afford to analyze only a few concurrent tests due to the high cost of exhaustive interleaving exploration. To address the challenge, ConDiff leverages the information of code changes and trace analysis to analyze only those concurrent tests that are likely to expose behavioral differences (if they exist). We evaluated ConDiff on a set of Java classes. Our results show that ConDiff can effectively generate concurrent tests that expose behavioral differences.
On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering
Lauren Lyons and
Ali Ghanbari
(Auburn University, USA)
Mutation analysis of deep neural networks (DNNs) is a promising method for effective evaluation of test data quality and model robustness, but it can be computationally expensive, especially for large models. To alleviate this, we present DEEPMAACC, a technique and a tool that speeds up DNN mutation analysis through neuron and mutant clustering. DEEPMAACC implements two methods: (1) neuron clustering to reduce the number of generated mutants and (2) mutant clustering to reduce the number of mutants to be tested by selecting representative mutants for testing. Both use hierarchical agglomerative clustering to group neurons and mutants with similar weights, with the goal of improving efficiency while maintaining the mutation score. DEEPMAACC has been evaluated on 8 DNN models across 4 popular classification datasets and two DNN architectures. When compared to exhaustive, or vanilla, mutation analysis, the results provide empirical evidence that the neuron clustering approach, on average, accelerates mutation analysis by 69.77%, with an average -26.84% error in mutation score. Meanwhile, the mutant clustering approach, on average, accelerates mutation analysis by 35.31%, with an average 1.96% error in mutation score. Our results demonstrate that a trade-off can be made between mutation testing speed and mutation score error.
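As background for the error figures above, the mutation score is conventionally defined as the fraction of mutants killed by the test data, and the error of an accelerated analysis can be expressed as the signed difference from the exhaustive score. These are our illustrative definitions; the paper may compute the error differently.

```latex
% M is the set of mutants, T the test data; "kills" follows the chosen killing criterion.
\[
\mathrm{MS}(T, M) = \frac{\bigl|\{\, m \in M : T \text{ kills } m \,\}\bigr|}{|M|}, \qquad
\mathrm{Error} = \mathrm{MS}_{\mathrm{accelerated}} - \mathrm{MS}_{\mathrm{exhaustive}}
\]
```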
AugmenTest: Enhancing Tests with LLM-Driven Oracles
Shaker Mahmud Khandaker,
Fitsum Kifetew,
Davide Prandi, and
Angelo Susi
(Fondazione Bruno Kessler, Italy)
Automated test generation is crucial for ensuring the reliability and robustness of software applications while at the same time reducing the effort needed. While significant progress has been made in test generation research, generating valid test oracles still remains an open problem.
To address this challenge, we present AugmenTest, an approach leveraging Large Language Models (LLMs) to infer correct test oracles based on available documentation of the software under test. Unlike most existing methods that rely on code, AugmenTest utilizes the semantic capabilities of LLMs to infer the intended behavior of a method from documentation and developer comments, without looking at the code. AugmenTest includes four variants: Simple Prompt, Extended Prompt, RAG with a generic prompt (without the context of the class or method under test), and RAG with Simple Prompt, each offering different levels of contextual information to the LLMs.
To evaluate our work, we selected 142 Java classes and generated multiple mutants for each. We then generated tests from these mutants, focusing only on tests that passed on the mutant but failed on the original class, to ensure that the tests effectively captured bugs. This resulted in 203 unique tests with distinct bugs, which were then used to evaluate AugmenTest. Results show that in the most conservative scenario, AugmenTest’s Extended Prompt consistently outperformed the Simple Prompt, achieving a success rate of 30% for generating correct assertions. In comparison, the state-of-the-art TOGA approach achieved 8.2%. Contrary to our expectations, the RAG-based approaches did not lead to improvements, achieving a success rate of 18.2% in the most conservative scenario.
Our study demonstrates the potential of LLMs in improving the reliability of automated test generation tools, while also highlighting areas for future enhancement.
Testing Practices, Challenges, and Developer Perspectives in Open-Source IoT Platforms
Daniel Rodriguez-Cardenas,
Safwat Ali Khan,
Prianka Mandal,
Adwait Nadkarni,
Kevin Moran, and
Denys Poshyvanyk
(William & Mary, USA; George Mason University, USA; University of Central Florida, USA)
As the popularity of Internet of Things (IoT) platforms grows, users gain unprecedented control over their homes, health monitoring, and daily task automation. However, the testing of software for these platforms poses significant challenges due to their diverse composition; e.g., common smart home platforms are often composed of varied types of devices that use a diverse array of communication protocols, connections to mobile apps, and cloud services, as well as integrations among various platforms. This paper is the first to uncover both the practices and perceptions behind testing in IoT platforms, particularly open-source smart home platforms. Our study is composed of two key components. First, we mine and empirically analyze the code and integrations of two highly popular and well-maintained open-source IoT platforms, OpenHAB and HomeAssistant. Our analysis involves the identification of functional and related test methods based on the focal method approach. We find that OpenHAB has a test ratio of only 0.04 (≈ 4K focal test methods from ≈ 76K functional methods) in Java files, while HomeAssistant exhibits a higher test ratio of 0.42, which reveals a significant dearth of testing. Second, to understand the developers’ perspective on testing in IoT, and to explain our empirical observations, we survey 80 open-source developers actively engaged in IoT platform development. Our analysis of survey responses reveals a significant focus on automated (unit) testing and a lack of manual testing, which supports our empirical observations, as well as testing challenges specific to IoT. Together, our empirical analysis and survey yield 10 key findings that uncover the current state of testing in IoT platforms and reveal key perceptions and challenges. These findings provide valuable guidance to the research community in navigating the complexities of effectively testing IoT platforms.
Impact of Large Language Models of Code on Fault Localization
Suhwan Ji, Sanghwa Lee, Changsup Lee, Yo-Sub Han, and Hyeonseung Im
(Yonsei University, South Korea; Kangwon National University, South Korea)
Identifying the point of error is imperative in software debugging. Traditional fault localization (FL) techniques rely on executing the program and using the code coverage matrix in tandem with test case results to calculate a suspiciousness score for each method or line. Recently, learning-based FL techniques have harnessed machine learning models to extract meaningful features from the code coverage matrix and improve FL performance. These techniques, however, require compilable source code, existing test cases, and specialized tools for generating the code coverage matrix for each programming language of interest.
In this paper, we propose, for the first time, a simple but effective sequence generation approach for fine-tuning large language models of code (LLMCs) for FL tasks. LLMCs have recently received much attention for various software engineering problems. In line with this, we leverage the innate understanding of code that LLMCs have acquired through pre-training on large code corpora. Specifically, we fine-tune 13 representative encoder, encoder-decoder, and decoder-based LLMCs (across 7 different architectures) for FL tasks. Unlike previous approaches, LLMCs can analyze code sequences that do not compile. Still, they have a limitation on the length of the input data. Therefore, for a fair comparison with existing FL techniques, we extract methods with errors from the project-level benchmark, Defects4J, and analyze them at the line level. Experimental results show that LLMCs fine-tuned with our approach successfully pinpoint error positions in 50.6%, 64.2%, and 72.3% of 1,291 methods in Defects4J for Top-1/3/5 prediction, outperforming the best learning-based state-of-the-art technique by up to 1.35, 1.12, and 1.08 times, respectively. We also conduct an in-depth investigation of key factors that may affect the FL performance of LLMCs. Our findings suggest promising research directions for FL and automated program repair tasks using LLMCs.
Benchmarking Open-Source Large Language Models for Log Level Suggestion
Yi Wen Heng, Zeyang Ma, Zhenhao Li,
Dong Jae Kim, and
Tse-Hsun (Peter) Chen
(Concordia University, Canada; York University, Canada; DePaul University, USA)
Large Language Models (LLMs) have become a focal point of research across various domains, including software engineering, where their capabilities are increasingly leveraged. Recent studies have explored the integration of LLMs into software development tools and frameworks, revealing their potential to enhance performance in text and code-related tasks. The log level is a key part of a logging statement that allows software developers to control the information recorded during system runtime. Given that log messages often mix natural language with code-like variables, LLMs' language translation abilities could be applied to determine the suitable verbosity level for logging statements. In this paper, we undertake a detailed empirical analysis to investigate the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion. We opted for open-source models because they enable us to utilize in-house code while effectively protecting sensitive information and maintaining data security. We examine several prompting strategies, including Zero-shot, Few-shot, and fine-tuning techniques, across different LLMs to identify the most effective combinations for accurate log level suggestions. Our research is supported by experiments conducted on 9 large-scale Java systems. The results indicate that although smaller LLMs can perform effectively with appropriate instruction and suitable techniques, there is still considerable potential for improvement in their ability to suggest log levels.
Understanding and Enhancing Attribute Prioritization in Fixing Web UI Tests with LLMs
Zhuolin Xu, Qiushi Li, and
Shin Hwei Tan
(Concordia University, Canada)
The rapid evolution of Web UIs incurs considerable time and effort in UI test maintenance. Prior techniques in Web UI test repair focus on locating the target elements on the new webpage that match the old ones so that the corresponding broken statements can be repaired. These techniques usually rely on prioritizing certain attributes (e.g., XPath) during matching, ranking the similarity of those attributes before others, which indicates a potential bias towards particular attributes during matching.
To mitigate the bias, we present the first study that investigates the feasibility of using prior Web UI repair techniques for initial matching and then using ChatGPT to perform subsequent matching. Our key insight is that given a list of elements matched by prior techniques, ChatGPT can leverage language understanding to perform subsequent matching and use its code generation model for fixing the broken statements.
To mitigate hallucination in ChatGPT, we design an explanation validator that checks if the provided explanation for the matching results is consistent, and provides hints to ChatGPT via a self-correction prompt to further improve its results. Our evaluation on a widely used dataset shows that the ChatGPT-enhanced techniques improve the effectiveness of existing Web test repair techniques. Our study also shares several important insights in improving future Web UI test repair techniques.
RustyRTS: Regression Test Selection for Rust
Simon Hundsdorfer, Roland Würsching, and
Alexander Pretschner
(TU Munich, Germany)
Regression testing is a testing activity that aims to ensure that existing functionality is preserved when introducing changes. The goal of regression test selection (RTS) is to reduce the cost of regression testing by only re-executing tests that are affected by changes. Lately, research on RTS has focused on the languages Java and C++. Despite Rust being an increasingly relevant systems programming language, there are no RTS tools available for this language so far. In this paper, we present and evaluate RustyRTS, the first RTS technique and tool for Rust. It provides both module-level and function-level RTS. Its function-level variants can rely on either static or dynamic code analysis to select affected tests. We evaluate RustyRTS in an empirical study in terms of safety, precision, and effectiveness. When applied to changes resulting from mutation testing on 9 open-source projects, RustyRTS selected 99.99%, 97.87%, and 99.97% of all tests that failed as a consequence of a code modification, using the module-level, static, and dynamic RTS approach respectively. For cases of unsafe behavior, i.e., tests that have not been selected although failing due to such changes, we find plausible explanations. By applying RustyRTS to changes from the Git history of 13 repositories on GitHub, we effectively reduce the average end-to-end (e2e) testing time on the majority of projects. Our results show that the two function-level approaches outperform the coarser module-level one, especially on longer-running test suites. On average, the e2e testing time has been reduced to 67.80%, 62.52%, and 52.79% of retest-all by module-level, static, and dynamic RustyRTS respectively. Lastly, we provide a novel solution for dynamic dispatch and compile-time function evaluation, two contexts that impose a special challenge on approaches to RTS.
An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification
Riddhi More and Jeremy S. Bradbury
(Ontario Tech University, Canada)
Flaky tests exhibit non-deterministic behavior during execution and they may pass or fail without any changes to the program under test. Detecting and classifying these flaky tests is crucial for maintaining the robustness of automated test suites and ensuring the overall reliability and confidence in the testing. However, flaky test detection and classification is challenging due to the variability in test behavior, which can depend on environmental conditions and subtle code interactions.
Large Language Models (LLMs) offer promising approaches to address this challenge, with fine-tuning and few-shot learning (FSL) emerging as viable techniques. With enough data, fine-tuning a pre-trained LLM can achieve high accuracy, making it suitable for organizations with more resources. Alternatively, we introduce FlakyXbert, an FSL approach that employs a Siamese network architecture to train efficiently with limited data. To understand the performance and cost differences between these two methods, we compare fine-tuning on larger datasets with FSL in scenarios restricted by smaller datasets. Our evaluation involves two existing flaky test datasets, FlakyCat and IDoFT. Our results suggest that while fine-tuning can achieve high accuracy, FSL provides a cost-effective approach with competitive accuracy, which is especially beneficial for organizations or projects with limited historical data available for training. These findings underscore the viability of both fine-tuning and FSL in flaky test detection and classification, with each suited to different organizational needs and resource availability.
On the Energy Consumption of Test Generation
Fitsum Kifetew,
Davide Prandi, and
Angelo Susi
(Fondazione Bruno Kessler, Italy)
Research in the area of automated test generation has seen remarkable progress in recent years, resulting in several approaches and tools for effective and efficient generation of test cases. In particular, the EvoSuite tool has been at the forefront of this progress, embodying various algorithms for automated test generation of Java programs. EvoSuite has been used to generate test cases for a wide variety of programs as well. While there are a number of empirical studies that report results on the effectiveness, in terms of code coverage and other related metrics, of the various test generation strategies and algorithms implemented in EvoSuite, there are no studies, to the best of our knowledge, on the energy consumption associated with automated test generation. In this paper, we set out to investigate this aspect by measuring the energy consumed by EvoSuite when generating tests. We also measure the energy consumed in the execution of the generated test cases, comparing them with those manually written by developers. The results show that the different test generation algorithms consumed different amounts of energy, in particular on classes with high cyclomatic complexity. Furthermore, we also observe that manual tests tend to consume more energy compared to automatically generated tests, without necessarily achieving higher code coverage. Our results also give insight into the methods that consume significantly higher levels of energy, indicating potential points of improvement both for EvoSuite and for the different programs under test.
Industry Track
Practical Pipeline-Aware Regression Test Optimization for Continuous Integration
Daniel Schwendner, Maximilian Jungwirth, Martin Gruber, Martin Knoche, Daniel Merget, and
Gordon Fraser
(BMW Group, Germany; University of Passau, Germany)
Massive, multi-language, monolithic repositories form the backbone of many modern, complex software systems. To ensure consistent code quality while still allowing fast development cycles, Continuous Integration (CI) is commonly applied. However, operating CI at such scale not only leads to a single point of failure for many developers, but also requires computational resources that may reach feasibility limits and cause long feedback latencies. To address these issues, developers commonly split test executions across multiple pipelines, running small and fast tests in pre-submit stages while executing long-running and flaky tests in post-submit pipelines. Given the long runtimes of many pipelines and the substantial proportion of passing test executions (98% in our pre-submit pipelines), there is not only a need but also potential for further improvements by prioritizing and selecting tests. However, many previously proposed regression optimization techniques are unfit for an industrial context, because they (1) rely on complex and difficult-to-obtain features like per-test code coverage that are not feasible in large, multi-language environments, (2) do not automatically adapt to rapidly changing systems where new tests are continuously added or modified, and (3) are not designed to distinguish the different objectives of pre- and post-submit pipelines: while pre-submit testing should prioritize failing tests, post-submit pipelines should prioritize tests that indicate non-flaky changes by transitioning from pass to fail outcomes or vice versa. To overcome these issues, we developed a lightweight and pipeline-aware regression test optimization approach that employs Reinforcement Learning models trained on language-agnostic features. We evaluated our approach on a large industry dataset collected over a span of 20 weeks of CI test executions. When predicting the failure likelihood in pre-submit pipelines, our approach scheduled the first failing test within the first 16% of tests, outperforming existing approaches. When predicting test transitions in the post-submit pipeline, it was able to select 87% of developer-relevant tests while cutting the test execution time in half, and over 99% within five cycles.
Introducing Black-Box Fuzz Testing for REST APIs in Industry: Challenges and Solutions
Andrea Arcuri, Alexander Poth, and Olsi Rrjolli
(Kristiania University College, Norway; Oslo Metropolitan University, Norway; Volkswagen, Germany)
REST APIs are widely used in industry, in all kinds of domains. An example is Volkswagen AG, a German automobile manufacturer. Established testing approaches for REST APIs are time-consuming and require expertise from professional test engineers. Because of this cost and importance, several approaches have been proposed in the scientific literature to automatically test REST APIs. The open-source, search-based fuzzer EVOMASTER is one such tool proposed in the academic literature. However, how academic prototypes can be integrated in industry and have real impact on software engineering practice requires more investigation. In this paper, we report on our experience in using EVOMASTER at Volkswagen. We share the lessons we learnt and identify real-world research challenges that need to be solved.
Integrating LLM-Based Text Generation with Dynamic Context Retrieval for GUI Testing
Juyeon Yoon, Seah Kim, Somin Kim, Sukchul Jung, and
Shin Yoo
(KAIST, South Korea; Samsung Research, South Korea)
Automated GUI testing plays a crucial role for smartphone vendors who have to ensure that widely used mobile apps, which are not necessarily developed by the vendors themselves, are compatible with new devices and system updates. While existing testing techniques can automatically generate event sequences to reach different GUI views, inputs such as strings and numbers remain difficult to generate, as their generation often involves semantic understanding of the app functionality. Recently, Large Language Models (LLMs) have been successfully adopted to generate string inputs that are semantically relevant to the test case. This paper evaluates LLM-based input generation in the industrial context of vendor testing of both in-house and 3rd-party mobile apps. We present DroidFiller, an LLM-based input generation technique that builds upon existing work with more sophisticated prompt engineering and customisable context retrieval. DroidFiller is empirically evaluated using 120 text fields collected from 45 apps, including both in-house and 3rd-party ones. The results show that DroidFiller can outperform both vanilla LLM-based input generation and the existing resource pool approach. We integrate DroidFiller into the existing GUI testing framework used at Samsung, evaluate its performance, and discuss the challenges and considerations for practical adoption of LLM-based input generation in the industry.
Assessing the Uncertainty and Robustness of the Laptop Refurbishing Software
Chengjie Lu, Jiahui Wu,
Shaukat Ali, and Mikkel Labori Olsen
(Simula Research Laboratory, Norway; University of Oslo, Norway; Danish Technological Institute, Denmark)
Refurbishing laptops extends their lives while contributing to reducing electronic waste, which promotes building a sustainable future. To this end, the Danish Technological Institute (DTI) focuses on the research and development of several robotic applications empowered with software, including laptop refurbishing. Cleaning represents a major step in refurbishing and involves identifying and removing stickers from laptop surfaces. Software plays a crucial role in the cleaning process. For instance, the software integrates various object detection models to identify and remove stickers from laptops automatically. However, given the diversity in types of stickers (e.g., shapes, colors, locations), identification of the stickers is highly uncertain, thereby requiring explicit quantification of uncertainty associated with the identified stickers. Such uncertainty quantification can help reduce risks in removing stickers, which, for example, could otherwise result in software faults damaging laptop surfaces. For uncertainty quantification, we adopted the Monte Carlo Dropout method to evaluate six sticker detection models (SDMs) from DTI using three datasets: the original image dataset from DTI and two datasets generated with vision language models, i.e., DALL-E-3 and Stable Diffusion-3. In addition, we presented novel robustness metrics concerning detection accuracy and uncertainty to assess the robustness of the SDMs based on adversarial datasets generated from the three datasets using a dense adversary method. Our evaluation results show that different SDMs perform differently regarding different metrics. Based on the results, we provide SDM selection guidelines and lessons learned from various perspectives.
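The Monte Carlo Dropout idea used in the abstract can be sketched as follows: keep the dropout layers active at inference time and aggregate several stochastic forward passes into a mean prediction and a spread that serves as the uncertainty estimate. The model interface below is a placeholder, not DTI's sticker detection models.

    # Sketch of Monte Carlo Dropout at inference time (placeholder model that
    # returns a score tensor; real detection models return richer outputs).
    import torch

    def mc_dropout(model, image, passes=30):
        model.eval()
        for m in model.modules():              # re-enable only the dropout layers
            if isinstance(m, torch.nn.Dropout):
                m.train()
        with torch.no_grad():
            scores = torch.stack([model(image) for _ in range(passes)])
        return scores.mean(dim=0), scores.std(dim=0)   # prediction, uncertainty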
Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces
Neetha Jambigi, Bartosz Bogacz, Moritz Mueller,
Thomas Bach, and
Michael Felderer
(University of Cologne, Germany; SAP, Germany; DLR, Germany)
Abrupt and unexpected terminations of software are termed software crashes, and they can be challenging to analyze. Finding the root cause requires extensive manual effort and expertise to connect information sources like stack traces, source code, and logs. Typical approaches to fault localization require either test failures or source code. Crashes occurring in production environments, such as that of SAP HANA, provide solely crash logs and stack traces. We present a novel approach to localize faults based only on stack trace information, with no additional runtime information, by fine-tuning large language models (LLMs). We address complex cases where the root cause of a crash differs from the technical cause and is not located in the innermost frame of the stack trace. As the number of historic crashes is insufficient to fine-tune LLMs, we augment our dataset by leveraging code mutators to inject synthetic crashes into the code base. By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the HANA code base, we can correctly predict the root cause location of a crash with an accuracy of 66.9%, while baselines only achieve 12.6% and 10.6%. We substantiate the generalizability of our approach by evaluating it on two additional open-source databases, SQLite and DuckDB, achieving accuracies of 63% and 74%, respectively. Across all our experiments, fine-tuning consistently outperformed prompting non-finetuned LLMs for localizing faults in our datasets.
LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine
Erblin Isaku,
Christoph Laaber,
Hassan Sartaj,
Shaukat Ali,
Thomas Schwitalla, and
Jan F. Nygård
(Simula Research Laboratory, Norway; University of Oslo, Norway; Oslo Metropolitan University, Norway; Cancer Registry of Norway, Norway; UiT The Arctic University of Norway, Norway)
The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e., data capture, data curation, and producing data products and statistics for various stakeholders. GURI is a core component of CaReSS that is responsible for validating incoming data against medical rules. Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs when generating tests, and these tests’ ability to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules where implementation inconsistencies were discovered (e.g., regarding handling rule versions). Finally, we provide insights for practitioners and researchers based on the results.
Compiler Fuzzing in Continuous Integration: A Case Study on Dafny
Karnbongkot Boonriong, Stefan Zetzsche, and
Alastair F. Donaldson
(Imperial College London, UK; Amazon, UK)
We present the design of CompFuzzCI, a framework for incorporating compiler fuzzing into the continuous integration (CI) workflow of the compiler for Dafny, an open-source programming language that is increasingly used in and contributed to by industry. CompFuzzCI explores the idea of running a brief fuzzing campaign as part of the CI workflow of each pull request to a compiler project. Making this effective involved devising solutions for various challenges, including how to deduplicate bugs, how to bisect the project’s revision history to find the commit responsible for a regression (challenging when project interfaces change over time), and how to ensure that fuzz testing complements existing regression testing efforts. We explain how we have engaged with the Dafny development team at Amazon to approach these and other problems in the design of CompFuzzCI, and the lessons learned in the process. As a by-product of our work with CompFuzzCI, we found and reported three previously-unknown bugs in the Dafny compiler. We also present a controlled experiment simulating the use of CompFuzzCI over time on a range of Dafny commits, to assess its ability to find historic bugs. CompFuzzCI prioritises support for the Dafny compiler and the fuzz-d fuzzer but has a generalisable design: with modest modification to its internal interfaces, it could be adapted to work with other fuzzers, and the lessons learned from our experience will be relevant to teams considering including fuzzing in the CI of other industrial software projects.
LLM-Based Labelling of Recorded Automated GUI-Based Test Cases
Diogo Buarque Franzosi, Emil Alégroth, and Maycel Isaac
(Blekinge Institute of Technology, Sweden; Synteda, Sweden)
Graphical User Interface (GUI) based testing is a commonly used practice in industry. Although valuable and, in many cases, necessary, it is associated with challenges such as high cost and the need for both technical and domain expertise. Augmented testing, a novel approach to GUI test automation, aims to mitigate these challenges by allowing users to record and render test cases and test data directly on the GUI of the system under test (SUT).
In this context, Scout is an augmented testing tool that captures system states and transitions during manual interaction with the SUT, storing them in a test model that is visually represented in the form of state trees and reports. While this representation provides a basic overview of a test suite, e.g., its size and number of scenarios, it is limited in terms of analysis depth, interpretability, and reproducibility. In particular, without human state labeling, it is challenging to produce meaningful and easily understandable test reports.
To address this limitation, we present a novel solution and a demonstrator, integrated into Scout, which leverages large language models (LLMs) to enrich the model-based test case representation by automatically labeling and describing states and transitions.
We conducted two experiments to evaluate the impact of the solution. First, we compared LLM-enhanced reports with expert-generated reports using embedding distance evaluation metrics. Second, we assessed the usability and perceived value of the enhanced reports through an industrial survey. The results of the study indicate that the plugin can improve both the readability and interpretability of test reports.
This work contributes to the automation of GUI testing by reducing the need for manual intervention, e.g., labeling, and technical expertise, e.g., to understand test case models. Although the solution is studied in the context of augmented testing, we argue for its generalizability to related test automation techniques. In addition, we argue that this approach enables actionable insights and lays the groundwork for further research into autonomous testing based on Generative AI.
Info
Taming Uncertainty in Critical Scenario Generation for Testing Automated Driving Systems
Selma Grosse,
Dejan Nickovic, Cristinel Mateis,
Alessio Gambi, and Adam Molin
(DENSO Automotive, Germany; Austrian Institute of Technology, Austria)
Scenario-based testing in simulation has become a cornerstone of industrial practice for systematically assessing autonomous driving systems across diverse and relevant situations. Generating critical scenarios is central to this methodology, yet it remains challenging due to the inherent uncertainties resulting from scenario parameterization. While parameterization is essential for modeling unpredictable factors, such as weather, an excess of parameters hampers testing effectiveness. To address these challenges, this paper introduces a methodology that guides testers in selecting scenario parameters and managing the associated uncertainties. Our approach integrates specification-driven and optimization-based test generation with sensitivity analysis, enabling testers to assess the impact of scenario parameters on scenario criticality. We implemented our approach using well-established industry technologies and evaluated it in a highway case study on three reference search-based scenario generation methods with varying degrees of exploitativeness. Results from our evaluation suggest that reducing parameter-induced uncertainty can improve the ability of some testing methods to identify critical scenarios while maintaining the diversity of input parameter values.
ML-Based Test Case Prioritization: A Research and Production Perspective in CI Environments
Md Asif Khan, Akramul Azim, Ramiro Liscano, Kevin Smith, Yee-Kang Chang, Gkerta Seferi, and Qasim Tauseef
(Ontario Tech University, Canada; IBM, United Kingdom; IBM, Canada; IBM, UK)
Test case prioritization (TCP) is essential for improving testing efficiency in large-scale continuous integration (CI) environments by reducing feedback time and using resources efficiently. Machine learning (ML) has shown promise in enhancing TCP; however, demonstrating its effectiveness in production environments remains a challenge. Using the IBM Open Liberty dataset, we developed and validated an ML-based TCP framework, showing how we identified the best-performing model step by step, from feature extraction and model training to hyperparameter tuning. After validating the framework in a research setting, we deployed it in IBM’s live production system. The practical implications of this study are as follows. The production results closely mirrored the research outcomes, with models trained on recent data consistently outperforming older models and non-prioritized approaches. Specifically, prioritized builds achieved a mean Average Percentage of Faults Detected (APFD) value 50% higher than that of non-prioritized builds, leading to a substantial improvement in early fault detection. The consistent improvement of models trained on newer data (M-2023) over those trained on older data (M-2022) underscores the importance of regular model updates in maintaining optimal performance. This paper comprehensively compares research and production data, illustrating how our ML-driven TCP framework ensures optimal performance and detailing the steps necessary for successful implementation in dynamic CI environments.
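For reference, the APFD value reported above is the standard Average Percentage of Faults Detected metric; a small self-contained computation (with invented test and fault identifiers) is sketched below.

    # Standard APFD over a prioritized order; test/fault data here are invented.
    def apfd(order, faults_detected_by):
        """order: test ids in execution order;
        faults_detected_by: fault id -> set of test ids that expose it."""
        n, m = len(order), len(faults_detected_by)
        position = {t: i + 1 for i, t in enumerate(order)}
        first_hits = [min(position[t] for t in tests)
                      for tests in faults_detected_by.values()]
        return 1 - sum(first_hits) / (n * m) + 1 / (2 * n)

    # two faults exposed early by the prioritized order -> APFD = 0.75
    print(apfd(["t3", "t1", "t4", "t2"], {"f1": {"t3"}, "f2": {"t1", "t2"}}))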
Evaluation of the Choice of LLM in a Multi-agent Solution for GUI-Test Generation
Stevan Tomic, Emil Alégroth, and Maycel Isaac
(Blekinge Institute of Technology, Sweden; Synteda, Sweden)
Automated testing, particularly for GUI-based systems, remains a costly, labor-intensive, and error-prone process. Despite advancements in automation, manual testing still dominates in industrial practice, resulting in delays, higher costs, and increased error rates. Large Language Models (LLMs) have shown great potential to automate tasks traditionally requiring human intervention, leveraging their cognitive-like abilities for test generation and evaluation.
In this study, we present PathFinder, a Multi-Agent LLM (MALLM) framework that incorporates four agents responsible for (a) perception and summarization, (b) decision-making, (c) input handling and extraction, and (d) validation, which work collaboratively to automate exploratory web-based GUI testing.
The goal of this study is to assess how different LLMs, applied to different agents, affect the efficacy of automated exploratory GUI testing. We evaluate PathFinder with three models, Mistral-Nemo, Gemma2, and Llama3.1, on four e-commerce websites. This yields 27 permutations of the LLMs across three agents (excluding the validation agent), which we use to test the hypothesis that a solution with multiple agents, each using different LLMs, is more efficacious (efficient and effective) than a multi-agent solution where all agents use the same LLM.
The results indicate that the choice of LLM constellation (combination of LLMs) significantly impacts efficacy, and suggest that a single LLM across agents may yield the best balance of efficacy (measured by F1-score). Hypotheses to explain this result include, but are not limited to, improved decision-making consistency and reduced task coordination discrepancies. The contributions of this study are an architecture for MALLM-based GUI testing, empirical results on its performance, and novel insights into how LLM selection impacts the efficacy of automated testing.
Early V&V in Knowledge-Centric Systems Engineering: Advances and Benefits in Practice
Jose Luis de la Vara, Juan Manuel Morote, Clara Ayora, Giovanni Giachetti, Luis Alonso, Roy Mendieta, David Muñoz, Ricardo Ruiz Nolasco, and Antonio González
(University of Castilla-La Mancha, Spain; Independent Researcher, Spain; Universidad de Castilla la Mancha, Spain; Universitat Politecnica de Valencia, Spain; The REUSE Company, Spain; RGB Medical Devices, Spain; RGB Medical Devices S.A., Spain)
Knowledge-Centric Systems Engineering is an industrial approach to systems and software engineering that advocates the development and use of knowledge bases that represent system domains. This approach can also exploit artificial intelligence techniques. These means can be used for early V&V (verification and validation) activities, e.g., for system artefact quality analysis and traceability management. This paper presents advances made in these activities with the SES Engineering Studio industrial tool and its underlying methods to meet further early-V&V needs in practice. The methods and the tool have been improved to better address model quality analysis, traceability project management, trace specification, and compliance with standards. For validation, we have initially applied the new features on a medical device. This has also allowed us to study the benefits of the features. The results show that the advances made can lead to wider system artefact analyses, more precise traceability management, better system artefacts, lower V&V effort, and lower issue resolution costs. All in all, the paper presents specific examples of how early-V&V industrial practices and tools can be improved.
Speculative Testing at Google with Transition Prediction
Avi Kondareddy, Sushmita Azad, Abhayendra Singh, and Tim A. D. Henderson
(Google, USA; Google, UK)
Google’s approach to testing includes both testing prior to code submission (for fast validation) and after code submission (for comprehensive validation). However, Google’s ever-growing testing demand has led to increased continuous integration cycle latency and machine costs. When post-submission continuous integration cycles get longer, detecting breakages in the main repository is delayed, which increases developer friction and lowers productivity. To mitigate this without increasing resource demand, Google is implementing Postsubmit Speculative Cycles in its Test Automation Platform (TAP). Speculative Cycles prioritize finding novel breakages faster. In this paper we present our new test scheduling architecture and the machine learning system (Transition Prediction) driving it. Both the ML system and the end-to-end test scheduling system are empirically evaluated on three months of our production data (120 billion test×cycle pairs, 7.7 million breaking targets, with ∼20 thousand unique breakages). Using Speculative Cycles we observed a median (p50) reduction of approximately 65% (from 107 to 37 minutes) in the time taken to detect novel breaking targets.
Evaluating Machine Learning-Based Test Case Prioritization in the Real World: An Experiment with SAP HANA
Jeongki Son,
Gabin An, Jingun Hong, and
Shin Yoo
(SAP Labs, South Korea; KAIST, South Korea)
Test Case Prioritization (TCP) aims to find orderings of regression test suite execution so that failures can be detected as early as possible. Recently, Machine Learning (ML) based techniques have been proposed and evaluated using open-source projects and their test histories. We report our evaluation of these ML-based TCP techniques, both Reinforcement Learning (RL) and Supervised Learning (SL) based ones, using industrial testing data collected from SAP HANA, a large-scale database management system. Specifically, our study compares 37 different TCP techniques, including 14 RL models on two datasets, 4 SL models, and 5 non-ML baselines, using real-world testing data of SAP HANA collected over eight months. Our evaluation focuses on both the performance and cost-efficiency of these techniques in the context of Continuous Integration for large-scale industrial projects. The results reveal that while RL models show promising performance, they require significant training time. RL models with sampled data offer a balance between performance and efficiency. Interestingly, the best-performing RL model outperformed or matched non-ML baselines. However, the gradient-boosted SL technique consistently outperformed both RL models and baselines in terms of effectiveness and efficiency, even with complete retraining at each test cycle. Despite RL's capability for incremental learning, it demands substantial training time and still falls short in accuracy compared to SL. Our findings suggest that, even in a large-scale industrial setting, fully retraining an SL model for each cycle proves to be the most effective and efficient approach for TCP, offering superior performance and cost-efficiency compared to RL and traditional methods.
FuzzE, Development of a Fuzzing Approach for Odoo’s Tours Integration Testing Platform
Gabriel Benoit, François Georis, Géry Debongnie,
Benoît Vanderose, and
Xavier Devroey
(University of Namur, Belgium; Odoo, Belgium)
For many years, Odoo, an open-source add-on-based platform offering an extensive range of functionalities, including Enterprise Resource Planning, has constantly expanded its scope, resulting in an increased complexity of its software. To cope with this evolution, Odoo has developed an integration testing system called tour execution, which executes predefined testing scenarios (i.e., tours) on the web user interface to test the integration between the front, back, and data layers. This paper reports our effort and experience in extending the tour system with fuzzing. Inspired by action research, we followed an iterative approach to devise FuzzE, a plugin for Odoo's tour system to create new tours. FuzzE was eventually developed in three iterations. Our results show that mutational fuzzing is the most effective approach when integrating with an existing testing infrastructure. We also reported one issue to the Odoo issue tracker. Finally, we present lessons learned from our endeavor, including the necessity to consider testability aspects earlier when developing web-based systems to help the fuzzing effort, and the difficulty faced when performing triage and root cause analysis on failing tours.
Accessible Smart Contracts Verification: Synthesizing Formal Models with Tamed LLMs
Jan Corazza, Ivan Gavran, Gabriela Moreira, and Daniel Neider
(TU Dortmund, Germany; Informal Systems, Austria; Informal Systems, Brazil)
When blockchain systems are said to be trustless, what this really means is that all the trust is put into software. Thus, there are strong incentives to ensure blockchain software is correct—vulnerabilities here cost millions and break businesses. One of the most powerful ways of establishing software correctness is by using formal methods. Approaches based on formal methods, however, induce a significant overhead in terms of the time and expertise required to employ them successfully. Our work addresses this critical disadvantage by automating the creation of a formal model—a mathematical abstraction of the software system—which is often a core task when employing formal methods. We perform model synthesis in three phases: we first transpile the code into model stubs; then we “fill in the blanks” using a large language model (LLM); finally, we iteratively repair the generated model at both the syntactic and semantic level. In this way, we significantly reduce the amount of time necessary to create formal models and increase the accessibility of valuable software verification methods that rely on them. The practical context of our work was reducing the time-to-value of using formal models for correctness audits of smart contracts.
A Tale from the Trenches: Applying Metamorphic and Differential Testing to Bioinformatics Software
Alexis Marsh,
Myra B. Cohen, and Robert Cottingham
(Iowa State University, USA; Oak Ridge National Laboratory, USA)
Metamorphic and differential testing have been proposed as best practices for testing software that is difficult to test, such as for programs in scientific domains. An assumption is that these approaches can be easily customized and applied to almost any domain. However, scientific software is often data-driven, and metamorphic relations may require significant domain knowledge to develop. In addition, tools are often written for ad-hoc experimentation by the scientists and often embed many assumptions about the importance and representation of different natural phenomena. In this paper, we present our experience applying both metamorphic and differential testing to a set of four computational biology tools that predict the growth of an organism. While our original goal was to evaluate these techniques to improve our system-level testing, we encountered multiple roadblocks along the way. Although we did find faults (some confirmed by developers), we also uncovered a set of challenges, including the considerable manual effort required for (a) defining domain-specific tests, (b) validating correctness, and (c) distinguishing between issues stemming from poor data and those arising from incorrect software.
CubeTesterAI: Automated JUnit Test Generation using the LLaMA Model
Daniele Gorla, Shivam Kumar, Pietro Nicolaus Roselli Lorenzini, and Alireza Alipourfaz
(Sapienza University of Rome, Italy; PCCube, Italy)
This paper presents an approach to automating JUnit test generation for Java applications using the Spring Boot framework, leveraging the LLaMA (Large Language Model Architecture) model to enhance the efficiency and accuracy of the testing process. The resulting tool, called CUBETESTERAI, includes a user-friendly web interface and the integration of a CI/CD pipeline using GitLab and Docker. These components streamline the automated test generation process, allowing developers to generate JUnit tests directly from their code snippets with minimal manual intervention. The final implementation executes the LLaMA models through RunPod, an online GPU service, which also enhances the privacy of our tool. Using the advanced natural language processing capabilities of the LLaMA model, CUBETESTERAI is able to generate test cases that provide high code coverage and accurate validation of software functionalities in Java-based Spring Boot applications. Furthermore, it efficiently manages resource-intensive operations and refines the generated tests to address common issues like missing imports and handling of private methods. By comparing CUBETESTERAI with some state-of-the-art tools, we show that our proposal consistently demonstrates competitive and, in many cases, better performance in terms of code coverage in different real-life Java programs.
Short Papers, Vision, and Emerging Results
Weighted Call Frequency-Based Fault Localization
Attila Szatmári, Aondowase James Orban, and Tamás Gergely
(University of Szeged, Hungary)
Spectrum-based fault localization (SBFL) is an automated technique that helps developers identify and isolate the origin of suspicious errors during software development. Despite being a well-researched topic, it is rarely used in industry. The primary reason is that, in its basic version, it uses only local information on the coverage of a program element to estimate its probability of failure, rarely utilizing additional contextual information on the element or the test cases. Other researchers have tried solving the problem using contextual information with varying success. In this paper, we enhance the approach called Call Frequency-based Fault Localization, which analyzes a method's occurrence frequency in call-stack instances of failed tests. While it boosts SBFL's effectiveness, it overlooks the test scope. We propose that identifying unit and unit-like tests, followed by adjusting the frequency of the method by test type, can further enhance the fault localization ability of FL techniques. We empirically evaluated our method's effectiveness with the Defects4J benchmark. We found that utilizing weights in Call Frequency-based Fault Localization often ranks faulty methods higher, increasing the number of items in the top-10 positions.
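A minimal sketch of the general weighting idea described above: scale a base suspiciousness score by a method's call-stack occurrences, weighting unit-like tests more heavily. The weights and the combination formula are illustrative assumptions, not the paper's exact definition.

    # Illustrative weighting of a base SBFL score by call-stack frequency,
    # with heavier weight for unit-like failing tests (weights are assumptions).
    def weighted_suspiciousness(base_score, stack_hits,
                                unit_weight=2.0, other_weight=1.0):
        """stack_hits: list of (is_unit_like_test, occurrences_in_failed_stacks)."""
        freq = sum((unit_weight if is_unit else other_weight) * n
                   for is_unit, n in stack_hits)
        return base_score * (1 + freq)

    # a method seen twice in a failing unit test's stacks and once in an
    # integration test's stacks
    print(weighted_suspiciousness(0.6, [(True, 2), (False, 1)]))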
Addressing Data Leakage in HumanEval using Combinatorial Test Design
Jeremy Bradbury and Riddhi More
(Ontario Tech University, Canada)
The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common approach to assess LLMs with respect to their ability to solve problem-specific tasks as well as assess different versions of an LLM to solve tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from data leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the black-box nature of LLM training data which makes it difficult to even know if data leakage has occurred. To address the data leakage problem, we propose a new benchmark construction method where a benchmark is composed of template tasks that can be instantiated into new concrete tasks using combinatorial test design. Concrete tasks for the same template task must be different enough that data leakage has minimal impact and similar enough that the tasks are interchangeable with respect to performance evaluation. To assess our benchmark construction method, we propose HumanEval_T, an alternative benchmark to HumanEval that was constructed using template tasks and combinatorial test design.
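To illustrate the template-based construction, the sketch below instantiates a made-up template task with all parameter combinations via itertools.product; in practice a covering-array generator from combinatorial test design would select a smaller subset, and the template and parameters here are invented for illustration only.

    # Invented template task instantiated combinatorially (illustration only).
    from itertools import product

    TEMPLATE = ("Write a function that returns the {agg} of the {k} "
                "{order} numbers in a list.")
    PARAMS = {
        "agg": ["sum", "product"],
        "k": ["two", "three"],
        "order": ["largest", "smallest"],
    }

    tasks = [TEMPLATE.format(**dict(zip(PARAMS, combo)))
             for combo in product(*PARAMS.values())]
    print(len(tasks), "concrete tasks;", tasks[0])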
Towards Cross-Build Differential Testing
Jens Dietrich, Tim White,
Valerio Terragni, and Behnaz Hassanshahi
(Victoria University of Wellington, New Zealand; University of Auckland, New Zealand; Oracle Labs, Australia)
Recent concerns about software supply chain security have led to the emergence of different binaries built from the same source code. This will sometimes result in binaries that are not identical and therefore have different cryptographic hashes. The question arises whether those binaries are still equivalent, i.e., whether they have the same behaviour. We explore whether differential testing can be used to provide evidence for non-equivalence.
We study this for 3,541 pairs of binaries built for the same Maven artifact version, distributed on Maven Central, Google Assured Open Source Software and/or Oracle Build-From-Source. We use EvoSuite to generate tests for the baseline binary from Maven Central, run these tests against this baseline binary and any available alternately built binaries, and compare the results for consistency. We argue that any differences may indicate variations in program behaviour and could, therefore, be used to detect compromised binaries or failures at runtime.
Although our preliminary experiments did not reveal any compromised builds, our approach successfully identified three build configuration errors that caused changes in runtime behaviour. These findings underscore the potential of our method to uncover subtle build differences, highlighting opportunities for improvement.
Test Generation from Use Case Specifications for IoT Systems: Custom, LLM-Based, and Hybrid Approaches
Zacharie Chenail-Larcher, Jean Baptiste Minani, and Naouel Moha
(ÉTS Montréal, Canada; Concordia University, Canada)
IoT systems are increasingly developed and deployed across various domains, where End-to-End (E2E) testing is critical to ensure reliability and expected behavior. However, generating comprehensive tests remains challenging due to the heterogeneity, distributed nature, and unique characteristics of IoT systems, which limit the effectiveness of generic test generation approaches. Recent studies demonstrated the effectiveness of Large Language Models (LLMs) for test generation in traditional software systems.
Building on this foundation, this study explores and evaluates four distinct approaches for generating E2E tests from use case specifications (UCSs) tailored to IoT systems. These include (1) a custom approach, (2) a single-stage LLM approach, (3) a multi-stage LLM approach, and (4) a hybrid approach combining custom and LLM capabilities. We evaluated these approaches on an IoT system, focusing on correctness and scenario coverage criteria.
Experimental results indicate that all approaches perform well, with notable variations in specific aspects of test generation. The custom and hybrid approaches are more reliable in producing correctly structured and complete tests, with the hybrid approach slightly outperforming others. This study is a work in progress, requiring further investigation to fully realize its potential.
Pre-trained Models for Bytecode Instructions
Donggyu Kim, Taemin Kim,
Ji-ho Shin,
Song Wang, Heeyoul Choi, and Jaechang Nam
(Handong Global University, South Korea; York University, Canada)
Recent advancements in pre-trained models have rapidly expanded their applicability to various software engineering challenges. Despite this progress, current research predominantly focuses on source code and natural language processing, largely overlooking Java bytecode. Java bytecode, with its well-defined structure and high availability, presents a promising yet under-explored domain for leveraging pre-trained models. Its inherent properties, such as platform independence and optimized performance, make Java bytecode an ideal candidate for developing robust and efficient software engineering solutions. Addressing this gap could unlock new opportunities for enhancing automated program analysis, bug detection, and code generation tasks.
In this study, we propose byteT5 and byteBERT, which are pre-trained models with hexadecimal bytecode. To build our models, we developed a bytecode tokenizer, ByteTok, to generate hexadecimal input representations for our pre-trained models. We conduct an empirical study comparing our models and GPT-4o. The results indicate that byteT5 and byteBERT outperform GPT-4o in the span masking task. We anticipate these findings will pave the way for novel approaches to addressing various software engineering challenges, particularly live patching.
Info
Towards Refined Code Coverage: A New Predictive Problem in Software Testing
Carolin Brandt and Aurora Ramírez
(Delft University of Technology, Netherlands; University of Córdoba, Spain)
To measure and improve the strength of test suites, software projects and their developers commonly use code coverage and aim for a threshold of around 80%. But what is the 80% of the source code that should be covered? To prepare for the development of new, more refined code coverage criteria, we introduce a novel predictive problem in software testing: whether a code line is, or should be, covered by the test suite. In this short paper, we propose the collection of coverage information, source code metrics, and abstract syntax tree data and explore whether they are relevant to predict whether a code line is exercised by the test suite or not. We present a preliminary experiment using four machine learning (ML) algorithms and an open source Java project. We observe that ML classifiers can achieve high accuracy (up to 90%) on this novel predictive problem. We also apply an explainable method to better understand the characteristics of code lines that make them more “appealing” to be covered. Our work opens a research line worth investigating further, where the focus of the prediction is the code to be tested. Our innovative approach contrasts with most predictive problems in software testing, which aim to predict the test case failure probability.
Info
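As a rough sketch of the predictive setup described above, one could train an off-the-shelf classifier on per-line features to predict coverage; the feature names and CSV file below are invented placeholders, not the authors' dataset or pipeline.

    # Placeholder line-coverage classifier (invented features and file name).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("line_features.csv")                     # hypothetical data
    X = df[["nesting_depth", "cyclomatic", "is_branch", "num_identifiers"]]
    y = df["covered"]                                         # 1 if exercised

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("accuracy:", clf.score(X_te, y_te))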
EnCus: Customizing Search Space for Automated Program Repair
Seongbin Kim, Sechang Jang, Jindae Kim, and Jaechang Nam
(Handong Global University, South Korea; Seoul National University of Science and Technology, South Korea)
The primary challenge faced by Automated Program Repair (APR) techniques in fixing buggy programs is the search space problem. To generate a patch, APR techniques must address three critical decisions: where to fix (location), how to fix (operation), and what to fix with (ingredient). In this study, we propose EnCus, a novel approach that customizes the search space of ingredients and mutation operators during patch generation. EnCus acts as an APR wingman, using an ensemble-based strategy to customize the search space. The search space is customized by extracting edit operations that are used to fix similar bug-introducing changes from existing patches. EnCus applies an ensemble of edit operations extracted from three open source project pools and three Abstract Syntax Tree (AST)-level code differencing tools. This ensemble provides complementary perspectives on the buggy context. To evaluate this approach, we integrate EnCus into an existing context-based APR tool, ConFix. Using EnCus, the extensive search space of ConFix is reduced to ten recommended patches. EnCus was evaluated on single-line Defects4J bugs, successfully generating 20 correct patches and performing comparably to state-of-the-art context-based APR techniques.
Info
Harnessing Test Call Structures for Improved Fault Localization Effectiveness
Attila Szatmári
(Szegedi Tudományegyetem, Hungary)
Identifying the cause of a software bug is a difficult and costly task that requires a detailed understanding of codebase structures. We argue that a smaller target space can reduce the developer’s investigation effort, and thus it should be utilized in fault localization. In this paper, we propose a new spectrum-based fault localization (SBFL) approach based on test call-structure information, called Test Call-Structure-Based (TCS) fault localization. We evaluate the effectiveness of our approach using the Barinel formula on the Defects4J bug benchmark. The results show that our approach can achieve an average rank improvement in 75% of the projects in Defects4J. Furthermore, our approach can place more buggy methods among the Top-10 most suspicious elements. Moreover, our approach identifies more bugs that were previously unlikely to be found using the hit-based SBFL method.
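For context, the Barinel formula referenced above is commonly used in its simplified spectrum form, as sketched below; the call-structure filtering proposed in the paper is not reproduced here.

    # Simplified Barinel suspiciousness from a coverage spectrum:
    # susp = 1 - e_p / (e_p + e_f), where e_f / e_p count the failing / passing
    # tests that cover the element.
    def barinel(e_f, e_p):
        return 0.0 if e_f == 0 else 1 - e_p / (e_p + e_f)

    # covered by 3 failing and 1 passing test -> 0.75
    print(barinel(3, 1))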
Education Track
Black-Box Testing for Practitioners: A Case of the New ISTQB Test Analyst Syllabus
Matthias Hamburg and Adam Roman
(International Software Testing Qualifications Board, Belgium; ISTQB German Testing Board, Germany; Jagiellonian University, Poland; ISTQB Polish Testing Board, Poland)
The International Software Testing Qualifications Board (ISTQB) is a volunteer organization aiming to qualify software testing and quality assurance practitioners. It issues a certification scheme with many certification products for this subject matter. These products are based on the current state of practice and consist of a syllabus and a set of sample exam questions. They can also serve as a basis for academic courses in software testing. We recently revised and updated the Advanced Level Test Analyst Syllabus, which focuses on black-box test techniques. In this paper, we share our experience in developing the new syllabus. We describe the methodology for shaping the syllabus's content and selecting black-box techniques. We also show how the ISTQB framework is designed to fill the gap left by academic programs regarding black-box techniques by providing structured syllabi with clear learning objectives and business outcomes that align with industry demands.
Combining Logic and Large Language Models for Assisted Debugging and Repair of ASP Programs
Ricardo Brancas, Vasco Manquinho, and
Ruben Martins
(INESC-ID, Portugal; Universidade de Lisboa, Portugal; Carnegie Mellon University, USA)
Logic programs are a powerful approach for solving NP-Hard problems. However, their declarative nature poses significant challenges in debugging. Unlike procedural paradigms, which allow for step-by-step inspection of program state, logic programs require reasoning about logical statements for fault localization. This complexity is especially significant in learning environments due to students' inexperience.
We introduce FormHe, a novel tool that integrates logic-based techniques with Large Language Models (LLMs) to detect and correct issues in Answer Set Programming submissions. FormHe consists of two main components: a fault localization module and a program repair module. First, the fault localization module identifies specific faulty statements in need of modification. Next, FormHe applies program mutation techniques and leverages LLMs to repair the flawed code. The resulting repairs are then used to generate hints that guide students in correcting their programs.
Our experiments with real buggy programs submitted by students show that FormHe accurately detects faults in 94% of cases and successfully repairs 58% of incorrect submissions.
Teaching Bug Advocacy through Flipped Classroom
Andreea Galbin-Nasui and
Andreea Vescan
(Babes-Bolyai University, Cluj-Napoca, Romania)
Software testing plays a critical role in the development workflow. Nowadays, the significance of teaching software testing principles is recognized to a greater extent than ever before. The aim of this paper is twofold: (1) to investigate the effectiveness of using a flipped classroom-based context to teach software bug advocacy, and (2) to provide the student’s perspective on using flipped classroom to learn how to advocate for a bug. A seminar activity dedicated to bug reports and how to advocate for a bug is the framework for this investigation, with students split into teams to perform two major activities: creating a poster for one of the strategies from the RIMGEN mnemonic and providing advocacy strategies for a 3-year-old bug. The created artifacts and the answers to a questionnaire dedicated to the learning experience are used as tools to analyze and provide answers to the research questions. The results show that flipped classroom-based learning is effective in teaching how to advocate for a software bug. Around 87.75% of students agreed that the poster creation activity helped them better retain the information. 61.22% of students agreed that time was spent more effectively in class since the text was read outside the classroom, and 82.66% of students also agreed that this type of learning provides them with the opportunity to communicate with other students.
Experience Report on using Experiential Learning to Facilitate Learning of Bug Investigation Steps
Adina Moldovan, Oana Casapu, and
Andreea Vescan
(Altom, Romania; Babes-Bolyai University, Cluj-Napoca, Romania)
Testing with proper bug investigation steps is an essential component in the development process. Teaching and learning bug investigation are nowadays performed in different contexts, with learners having different testing skills.
The aim of this paper is to report on using experiential learning for discovering the bug investigation steps. Two learning settings were investigated, differing in formality (informal vs. formal learning), mode (in-person vs. online learning), and participants (testing practitioners vs. students). Both meetings used game-based activities to engage participants and facilitate learning.
We report on the results of activities with both practitioners and students, distilling valuable lessons for reproducing this approach of experiential learning for learning bug investigation: the games used as systems under test provided a fun way of learning and motivated students to participate in the activities, and reflection on their actions and the reasons behind them led to the development of bug investigation models by the two groups. There are similarities and differences in the bug investigation step models and in the way the groups perceived the experiential learning. Using games to experience the testing process was considered essential to the learning process, along with the experiential learning methodology.
Requirements for an Automated Assessment Tool for Learning Programming by Doing
Arthur Rump, Vadim Zaytsev, and Angelika Mader
(University of Twente, Netherlands)
Assessment of open-ended assignments such as programming projects is a complex and time-consuming task. When students learn to program, however, they benefit from receiving timely feedback, which requires an assessment of their current work. Our goal is to build a tool that assists in this process by partially automating the assessment of open-ended programming assignments. In this paper we discuss the requirements for this tool, based on interviews with teachers and other relevant stakeholders.
Info
A System-Level Testing Framework for Automated Assessment of Programming Assignments Allowing Students Object-Oriented Design Freedom
Valerio Terragni and Nasser Giacaman
(University of Auckland, New Zealand)
Automated assessment of programming assignments is essential in software engineering education, especially for large classes where manual grading is impractical. While static analysis can evaluate code style and syntax correctness, it cannot assess the functional correctness of students’ implementations. Dynamic analysis through software testing can verify program behavior and provide automated feedback to students. However, traditional unit and integration tests often restrict students’ design freedom by requiring predefined interfaces and method declarations. In this paper, we present SYSCLI, a novel testing framework for system-level testing of Java-based command-line interface applications. SYSCLI enables test suites that evaluate the functional correctness of students’ implementations without limiting their design choices. We also share our experience using SYSCLI in a second-year programming course at the University of Auckland, which focuses on object-oriented programming and design patterns and enrolls over 300 students each offering. Analysis of student assignments from 2023 and 2024 shows that SYSCLI is effective in automating grading, allows software design flexibility, and provides actionable feedback to students. Our experience report offers valuable insights into assessing students’ implementation of object-oriented concepts and design patterns.
Can Test Generation and Program Repair Inform Automated Assessment of Programming Projects?
Ruizhen Gu,
José Miguel Rojas, and Donghwan Shin
(University of Sheffield, UK)
Computer Science educators assessing student programming assignments are typically responsible for two challenging tasks: grading and providing feedback. Producing grades that are fair and feedback that is useful to students is a goal common to most educators. In this context, automated test generation and program repair offer promising solutions for detecting bugs and suggesting corrections in students' code, which could be leveraged to inform grading and feedback generation. Previous research on the applicability of these techniques to simple programming tasks (e.g., single-method algorithms) has shown promising results, but their effectiveness for more complex programming tasks remains unexplored. To fill this gap, this paper investigates the feasibility of applying existing test generation and program repair tools for assessing complex programming assignment projects. In a case study using a real-world Java programming assignment project with 296 incorrect student submissions, we found that generated tests were insufficient in detecting bugs in over 50% of cases, while full repairs could be automatically generated for only 2.1% of submissions. Our findings indicate significant limitations in current tools for detecting bugs and repairing student submissions, highlighting the need for more advanced techniques to support automated assessment of complex assignment projects.
A Tool-Assisted Training Approach for Empowering Localization and Internationalization Testing Proficiency
Maria Couto,
Breno Miranda, and Kiev Gama
(Federal University of Pernambuco, Brazil)
Software testing is an important area in the Software Engineering domain, yet it faces significant gaps within Computer Science education. In that context, there is an even less addressed topic: internationalization (i18n) and localization (l10n) tests, which are essential for ensuring software quality in global markets, supporting multiple languages. A key challenge for both large and small companies is the onboarding of new testers without adequate skills, often resulting in insufficient real-world preparation. This paper proposes a tool-assisted training approach designed to enhance the proficiency of l10n and i18n testers through practical, hands-on exercises using real-world failure examples. The effectiveness of our approach was evaluated from two complementary perspectives: i) To assess the effectiveness of the training approach, a case study was conducted within a software industry training setting, comparing novice testers trained with our tool against those receiving conventional training. Results indicated a 150% improvement in fault identification and resolution for testers using the tool, underscoring its effectiveness in enhancing testing skills and overall software quality. ii) To evaluate the perceived usability of the tool developed to support the training activities, a System Usability Scale (SUS) questionnaire was given to five technical leaders responsible for training novice testers. The tool achieved a score of 94.5, positioning its usability between “excellent” and “best imaginable”. This paper highlights the tool’s design, deployment context, and its potential for adoption by other practitioners, aiming to address the current gaps in i18n and l10n tester training, and represents a promising approach for use in Software Engineering education.
Posters
Testing Tools and Data Showcase Track
E2E-Loader: A Tool to Generate Performance Tests from End-to-End GUI-Level Tests
Sergio Di Meglio, Luigi Libero Lucio Starace, and Sergio Di Martino
(Federico II University of Naples, Italy)
Performance testing is essential for ensuring that web applications deliver a satisfactory user experience under varying workloads. Crafting meaningful workloads is a key challenge, addressed in previous research by analyzing system logs that reflect real user behaviors.
However, these approaches face limitations: they require the system under test to be deployed to collect usage data, offer limited automation for managing data dependencies, and often lack support for modern protocols like WEBSOCKET.
We present E2E-LOADER, a tool for automating the generation of performance testing workloads for Web Applications. E2E-LOADER leverages existing End-to-End (E2E) GUI-level test cases to create workloads, allowing its use at early stages of development before user data is available. The tool fully supports HTTP and WEBSOCKET-based interactions and includes customizable heuristics to detect data dependencies automatically. E2E-LOADER has been evaluated in previous research in an industrial case study, demonstrating that it produces workloads comparable in quality to those manually designed by practitioners, with significantly less effort and time. The tool and its source code are openly available to support researchers and practitioners in advancing performance testing practices. A screencast showcasing E2E-LOADER in action is available at https://youtu.be/pDWN1l1kAhU
ViMoTest: A Tool to Specify ViewModel-Based GUI Test Scenarios using Projectional Editing
Mario Fuksa, Sandro Speth, and Steffen Becker
(University of Stuttgart, Germany)
Automated GUI testing is crucial in ensuring that presentation logic behaves as expected. However, existing tools often apply end-to-end approaches and face challenges such as high specification effort, maintenance difficulties, and flaky tests, while being coupled to GUI framework specifics. To address these challenges, we introduce the ViMoTest tool, which leverages Behavior-driven Development, the ViewModel architectural pattern, and projectional Domain-specific Languages (DSLs) to isolate and test presentation logic independently of GUI frameworks. We demonstrate the tool with a small JavaFX-based task manager example and generate executable code.
Video
AMBER: AI-Enabled Java Microbenchmark Harness
Antonio Trovato, Luca Traini, Federico Di Menna, and Dario Di Nucci
(University of Salerno, Italy; University of L'Aquila, Italy)
JMH is the standard framework for developing and running Java microbenchmarks—lightweight performance tests used to evaluate the execution time of small Java code segments. A key challenge in designing JMH microbenchmarks is determining the appropriate number of warm-up iterations, i.e., repeated executions needed to bring microbenchmarks to a performance steady state. Too few warm-up iterations can compromise result quality, as performance measurements may not accurately reflect steady-state behavior. Conversely, too many warm-up iterations can unnecessarily increase testing time.
Here, we present AMBER, an AI-enabled extension of JMH, which leverages Time Series Classification algorithms to predict the beginning of the steady-state phase at run-time and dynamically halt warm-up iterations accordingly. Empirical results show the potential of AMBER in enhancing the cost-effectiveness of Java microbenchmarks. A demo video of AMBER is available at https://www.youtube.com/watch?v=7zOngDQ1z_k.
Video
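The dynamic warm-up idea can be sketched as a loop that feeds the growing series of iteration times to a steady-state predictor and stops early once it fires; the predictor, window size, and iteration cap below are placeholders, not AMBER's trained time-series classifiers.

    # Sketch of dynamically halting warm-up iterations (placeholder predictor).
    def run_with_dynamic_warmup(run_iteration, is_steady,
                                max_warmup=50, min_window=10):
        times = []
        for _ in range(max_warmup):
            times.append(run_iteration())          # one warm-up iteration's time
            if len(times) >= min_window and is_steady(times):
                break                              # steady state predicted: stop
        return times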
Technical Briefings and Tutorials
Scenario-Based Testing with BeamNG.tech (Hands-On Training)
Chrysanthi Papamichail, David Stark, and
Alessio Gambi
(BeamNG, Greece; BeamNG, UK; Austrian Institute of Technology, Austria)
Autonomous Driving Systems (ADS) are safety-critical Cyber-Physical Systems that require thorough validation. Currently, scenario-based testing in simulations is the cornerstone of ADS validation, complementing expensive and dangerous natural field operational testing.
Scenario-based testing in simulation can systematically assess ADS in diverse, relevant, and critical driving situations. However, it requires sophisticated tools and rich content to let developers quickly generate the intended testing scenarios.
This hands-on training illustrates how to use the BeamNG.tech framework for effective scenario-based testing, focusing on manual scenario generation, which is often neglected in research despite its central role in practice.
Learning material and additional descriptions are available at: https://github.com/BeamNG/scenario-based-testing-tutorial
A Developer’s Guide to Building and Testing Accessible Mobile Apps
Juan Pablo Sandoval Alcocer,
Leonel Merino,
Alison Fernandez-Blanco,
William Ravelo-Mendez,
Camilo Escobar-Velásquez, and
Mario Linares-Vásquez
(Pontificia Universidad Católica de Chile, Chile; Universidad de Los Andes, Colombia)
Mobile applications play an important role in users’ daily lives by simplifying everyday tasks such as commuting or making financial transactions. These interactions improve the usability of commonly used services. Nevertheless, such improvements should also account for special execution environments, such as weak network connections, and for requirements arising from a user’s individual condition. For this reason, the design of mobile applications should be driven by the goal of improving the user experience. This tutorial covers inclusive and accessible design in the development process of mobile apps. Making sure that applications are accessible to all users, regardless of disabilities, is not just about following the law or fulfilling ethical obligations; it is crucial for creating inclusive and fair digital environments. This tutorial will educate participants on accessibility principles and the available tools. They will gain practical experience with specific Android and iOS platform features, as well as become acquainted with state-of-the-art automated and manual testing tools.
Doctoral Research
Autonomous Systems
Adversarial Testing with Reinforcement Learning
Andrea Doreste
(USI Lugano, Switzerland)
Ensuring the proper behavior of autonomous systems, such as Autonomous Driving Systems (ADSs), is essential for their safety. However, testing them effectively and efficiently is still an open research challenge. Existing testing techniques rely on simulations and manipulate objects in the virtual environment to trigger misbehavior of the system under test. Techniques such as Reinforcement Learning (RL) have been applied to effectively modify static or dynamic objects of the environment, such as properties of obstacles and behaviors of pedestrians or other vehicles. However, these approaches implement centralized controllers of the environment, resulting in possibly unrealistic and even invalid failures of the system. Considering these limitations, in my Ph.D. I aim to use RL to create adversarial agents that are fully autonomous and independent and act to challenge the ADS under test, finding failures in critical scenarios and contributing to improving the robustness of the ADS under test.
A Method for Systematically Assessing the Safety of Automated Driving Systems via Simulation
Ali Güllü
(University of Tartu, Estonia)
Automated Driving Systems (ADS) pose significant challenges for testing within their operational design domains (ODD), especially at Levels 4 and 5 of autonomy. Testing these systems requires evaluating their functionality within the defined ODD in a simulation environment, making necessary improvements and corrections, and subsequently implementing them in real-world scenarios. My goal is to develop a testing method that allows for a comprehensive evaluation of scenarios in a simulation environment, especially for scenarios that are challenging or expensive to replicate in the real world. The proposed method will allow for comparative safety analysis by evaluating ADS performance against human drivers, enabling a thorough assessment of its safety profile.
Uncertainty-Aware Autonomous Driving System Testing with Large Language Models
Jiahui Wu
(Simula Research Laboratory, Norway; University of Oslo, Norway)
Autonomous Driving Systems (ADSs) operate in highly dynamic and uncertain environments, requiring testing methods that address both internal (e.g., algorithm randomness, sensor limitations) and external (e.g., unpredictable events, human interactions) uncertainties. However, current approaches often fall short in realistically quantifying these uncertainties, as they struggle to address the complex interplay between known and unknown factors in real-world scenarios. To address these challenges, this work explores leveraging Large Language Models (LLMs) to enhance ADS testing by incorporating human-like reasoning and domain-specific knowledge through prompt engineering, retrieval-augmented generation, and fine-tuning. By integrating LLMs with techniques like Search-Based Testing, we aim to improve testing realism, enhance efficiency, and handle uncertainties more effectively. The proposed strategies seek to develop optimized ADS testing frameworks, enabling safer and more reliable ADS deployments.
Web/Mobile Systems
Identifying and Mitigating Flaky Tests in JavaScript
Negar Hashemi
(Massey University, New Zealand)
Flaky tests, which have non-deterministic outcomes (pass or fail when running on the same code), are a significant issue that affects software quality and makes it difficult to rely on test results. This research aims to explore the root causes of flaky tests in open-source JavaScript projects, develop practical techniques to manifest possible flaky behaviour, and provide a mechanism to mitigate some of those flaky tests.
End-to-End Testing in Web Environments: Addressing Practical Challenges
Sergio Di Meglio
(Federico II University of Naples, Italy)
End-to-End (E2E) testing is a critical practice for ensuring the functionality and reliability of software applications in real-world scenarios. The two main approaches within E2E testing are GUI testing and performance testing. Despite their importance, their adoption remains limited due to several challenges, including limited automation in test generation, high fragility that complicates maintenance, and the lack of comprehensive datasets. These obstacles hinder both industrial adoption and academic progress. My Ph.D. research, carried out in collaboration with an industry partner, addresses these challenges by proposing solutions to automate the generation of web workloads and to estimate the fragility of web GUI tests. However, the work also highlighted the persistent lack of a comprehensive dataset. To fill this gap, I have developed a curated dataset of open-source repositories that allows these approaches to be explored and opens up new avenues of research.
Advancing Mobile UI Testing by Learning Screen Usage Semantics
Safwat Ali Khan
(George Mason University, USA)
The demand for quality in mobile applications has increased greatly given users’ high reliance on them for daily tasks. Developers work tirelessly to ensure that their applications are both functional and user-friendly. In pursuit of this, Automated Input Generation (AIG) tools have emerged as a promising solution for testing mobile applications by simulating user interactions and exploring app functionalities. However, these tools face significant challenges in navigating complex Graphical User Interfaces (GUIs), and developers often have trouble understanding their output. More specifically, AIG tools have difficulty navigating out of certain screens, such as login pages and advertisements, due to a lack of contextual understanding, which leads to suboptimal testing coverage. Furthermore, while AIG tools can provide interaction traces consisting of action and screen details, there is limited understanding of their coverage of higher-level functionalities, such as logging in, setting alarms, or saving notes. Understanding these covered use cases is essential to ensure comprehensive test coverage of app functionalities. Difficulty in testing mobile UIs can lead to the design of complex interfaces, which can adversely affect users of advanced age, who often face usability barriers due to small buttons, cluttered layouts, and unintuitive navigation. Many studies highlight these issues, but automated solutions for improving UI accessibility need more attention. To address these interconnected challenges faced by mobile developers and app users, my PhD dissertation works towards advancing automated techniques for mobile UI testing. This research seeks to enhance automated UI testing techniques by learning the screen usage semantics of mobile apps, helping AIG tools navigate more efficiently, offering more insight into the tested functionalities, and improving the usability of a mobile app’s interface by identifying and mitigating UI design issues.
Debugging and Reliability
Enhancing Spectrum-Based Fault Localization in the Context of Reactive Programming
Aondowase Orban
(University of Szeged, Hungary)
Considerable research has been devoted to spectrum-based fault localization (SBFL), a technique that aims to identify and isolate suspicious, likely faulty program elements during software development in traditional programming paradigms such as imperative and object-oriented programming.
An extensive review of this work showed that few studies connect SBFL and reactive programming.
Since reactive programming is centered on data streams and the propagation of changes, where state is updated asynchronously in response to external stimuli, the traditional coverage-based spectrum seems less appropriate for reactive programs.
Given these limitations, my aim is to investigate and develop a new technique to apply and enhance traditional SBFL for reactive programming.
In this doctoral symposium paper, I present my research topic and the related research questions, and outline plans for the remaining period of my PhD research.
On Service-to-Service Integration Testing in Microservice Systems
Lena Gregor
(TU Munich, Germany)
Microservices enable scalable and flexible software systems but introduce significant testing challenges, particularly at the interaction level between services.
Traditional test adequacy criteria, such as branch coverage and conventional mutation operators, are unsuitable for evaluating interactions across microservices due to their fine-grained focus and reliance on language-specific, resource-intensive methods.
Despite recent advancements in end-to-end coverage metrics and mutation operators, no approach currently targets mutation testing for integration-level tests in microservice systems.
This thesis addresses the need for a novel mutation testing approach tailored to microservice systems, leveraging a fault taxonomy derived from real-world projects and the literature.
The proposed tool enables runtime fault injection without requiring system redeployment, addressing the polyglot nature and high deployment costs of microservices.
Preliminary results include a validated and published fault taxonomy and a prototype tool.
Future work involves further refinement of the tool, comprehensive evaluations, and comparisons with traditional methods to evaluate its effectiveness and efficiency.
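As a loose, hypothetical illustration of runtime fault injection at service boundaries (not the author's prototype), the Python sketch below wraps outgoing HTTP calls and consults an in-memory fault plan that can be modified while the system is running, so no service needs to be redeployed to activate a fault.

    # Illustrative client-side fault injector for service-to-service HTTP calls;
    # the fault plan is mutable at runtime, so faults can be toggled without
    # redeploying any service. Not the prototype described in the abstract.
    import random
    import time
    import requests

    # Fault plan keyed by target service name; could be updated at runtime,
    # e.g. via a small admin endpoint or a watched configuration file.
    FAULT_PLAN = {
        "inventory": {"kind": "delay", "seconds": 2.0, "probability": 0.5},
        "payment":   {"kind": "error", "status": 503,  "probability": 0.2},
    }

    class InjectedFault(Exception):
        pass

    def call_service(service_name, url, **kwargs):
        fault = FAULT_PLAN.get(service_name)
        if fault and random.random() < fault["probability"]:
            if fault["kind"] == "delay":
                time.sleep(fault["seconds"])        # simulate a slow dependency
            elif fault["kind"] == "error":
                raise InjectedFault(f"{service_name} -> HTTP {fault['status']}")
        return requests.get(url, timeout=5, **kwargs)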
Tool Competition – Self-Driving Car Testing Track
ICST Tool Competition 2025 – Self-Driving Car Testing Track
Christian Birchler, Stefan Klikovits, Mattia Fazzini, and Sebastiano Panichella
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland; Johannes Kepler University Linz, Austria; University of Minnesota, USA)
This is the first edition of the tool competition on testing self-driving cars (SDCs) at the International Conference on Software Testing, Verification and Validation (ICST). The aim is to provide a platform for software testers to submit their tools addressing the test selection problem for simulation-based testing of SDCs, which is considered an emerging and vital domain. The competition provides an advanced software platform and representative case studies to ease participants' entry into SDC regression testing, enabling them to develop their initial test generation tools for SDCs. In this first edition, the competition includes five tools from different authors. All tools were evaluated using (regression) metrics for test selection and compared against a baseline approach. This paper provides an overview of the competition, detailing its context, framework, participating tools, evaluation methodology, and key findings.
DETOUR at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Paolo Arcaini and Ahmet Cetinkaya
(National Institute of Informatics, Japan; Shibaura Institute of Technology, Japan)
DETOUR is a selector of road test cases for self-driving cars that participated in the "ICST Tool Competition 2025 - Self-Driving Car Testing Track". DETOUR first transforms road tests from the provided Cartesian representation into a curvature representation based on the Frenet frame. Then, DETOUR follows a two-step process. In the first step, tests are clustered according to their similarity; this step considers both tests that have been previously executed (for which it is known whether they pass or fail) and tests that have not been executed. In the second step, the tool selects, from the obtained clusters, the non-executed tests that are closest to executed tests that are known to fail.
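The sketch below mimics this pipeline in simplified form: roads given as Cartesian points are converted to a curvature-versus-arc-length profile (as in a Frenet-frame representation), profiles are clustered, and unexecuted tests closest to known-failing tests within the same cluster are selected. The resampling length, cluster count, and distance metric are illustrative assumptions, not DETOUR's actual configuration.

    # Simplified curvature-based clustering and selection; parameters and
    # distance metric are illustrative only.
    import numpy as np
    from sklearn.cluster import KMeans

    def curvature_profile(points, samples=50):
        """Cartesian road points -> fixed-length curvature-vs-arc-length vector."""
        pts = np.asarray(points, dtype=float)
        d = np.diff(pts, axis=0)
        heading = np.unwrap(np.arctan2(d[:, 1], d[:, 0]))
        s = np.cumsum(np.hypot(d[:, 0], d[:, 1]))     # arc length
        kappa = np.gradient(heading, s)               # d(heading)/d(arc length)
        grid = np.linspace(s[0], s[-1], samples)
        return np.interp(grid, s, kappa)

    def select_tests(executed, unexecuted, budget=10, n_clusters=5):
        """executed: list of (road_points, failed_flag); unexecuted: list of road_points."""
        exec_feats = np.array([curvature_profile(r) for r, _ in executed])
        unexec_feats = np.array([curvature_profile(r) for r in unexecuted])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
            np.vstack([exec_feats, unexec_feats]))
        exec_labels, unexec_labels = labels[:len(executed)], labels[len(executed):]
        fail_mask = np.array([f for _, f in executed], dtype=bool)
        ranked = []
        for i, feat in enumerate(unexec_feats):
            # Distance to the nearest failing executed test in the same cluster.
            same = fail_mask & (exec_labels == unexec_labels[i])
            dist = (np.linalg.norm(exec_feats[same] - feat, axis=1).min()
                    if same.any() else np.inf)
            ranked.append((dist, i))
        ranked.sort()
        return [i for _, i in ranked[:budget]]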
DRVN at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Antony Bartlett, Cynthia Liem, and
Annibale Panichella
(Delft University of Technology, Netherlands)
DRVN is a regression testing tool that aims to diversify the test scenarios (road maps) to execute for testing and validating self-driving cars. DRVN harnesses the power of convolutional neural networks to identify possibly failing roads in a set of generated examples before applying a greedy algorithm that selects and prioritizes the most diverse roads during regression testing. Initial experiments showed that DRVN performs well compared with random test selection.
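The greedy diversification step can be sketched as follows: given per-road failure probabilities (in DRVN produced by a CNN; here taken as given) and a feature vector per road, the selection repeatedly picks the candidate that maximises a combination of predicted failure likelihood and distance to the roads already selected. The weighting and distance measure below are illustrative assumptions.

    # Greedy selection trading off predicted failure probability against
    # diversity; the CNN that produces the probabilities is assumed, not shown.
    import numpy as np

    def greedy_diverse_selection(features, fail_probs, budget, alpha=0.5):
        """features: (n, d) road descriptors; fail_probs: (n,) predicted scores."""
        features = np.asarray(features, dtype=float)
        fail_probs = np.asarray(fail_probs, dtype=float)
        selected = [int(np.argmax(fail_probs))]       # start with the riskiest road
        while len(selected) < budget:
            chosen = np.array(selected)
            # Minimum distance of every candidate to the already-selected roads.
            dists = np.min(
                np.linalg.norm(features[:, None, :] - features[chosen][None, :, :],
                               axis=2),
                axis=1,
            )
            dists = dists / (dists.max() + 1e-9)      # normalise to [0, 1]
            score = alpha * fail_probs + (1 - alpha) * dists
            score[chosen] = -np.inf                   # never re-pick a road
            selected.append(int(np.argmax(score)))
        return selected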
ITS4SDC at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Ali Ihsan Güllü, Faiz Ali Shah, and Dietmar Pfahl
(University of Tartu, Estonia)
Testing and verification of self-driving cars are essential for ensuring their safety and reliability. In the context of the ICST 2025 self-driving car testing tool competition, we present ITS4SDC, our tool for selecting roads that challenge lane-keeping assist systems by leading the car off the road. ITS4SDC leverages a long short-term memory (LSTM)-based model to classify roads as safe or unsafe and subsequently selects unsafe roads for testing.
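A minimal PyTorch sketch of an LSTM-based safe/unsafe road classifier in this spirit is shown below; the input encoding (fixed-length curvature sequences), layer sizes, and decision threshold are assumptions and do not reflect ITS4SDC's actual model.

    # Minimal LSTM-based road classifier sketch (not ITS4SDC's actual model).
    import torch
    import torch.nn as nn

    class RoadClassifier(nn.Module):
        def __init__(self, input_size=1, hidden_size=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, 1)

        def forward(self, x):                    # x: (batch, seq_len, 1) curvature values
            _, (h_n, _) = self.lstm(x)
            return torch.sigmoid(self.head(h_n[-1]))   # P(road is unsafe)

    def select_unsafe_roads(model, road_sequences, threshold=0.5):
        """road_sequences: tensor of shape (n_roads, seq_len, 1)."""
        model.eval()
        with torch.no_grad():
            probs = model(road_sequences).squeeze(1)
        return [i for i, p in enumerate(probs.tolist()) if p >= threshold]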
CertiFail at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Fasih Munir Malik and Sajad Mazraeh Khatiri
(University of Bern, Switzerland)
In the context of Cyber-Physical Systems such as self-driving cars, the transition from field operational testing to simulation-based testing offers the advantages of lower cost, higher efficiency, and the possibility of recording and repeating exact failing conditions. Yet traditional simulation-based testing requires executing a long list of test cases, which means high computational cost and long execution time. To combat this drawback, one can use different techniques to select and run test cases that are more likely to fail. In this work, we propose CertiFail, a test case selection model that is an ensemble of several machine learning models used to more accurately predict whether a test case is likely to fail. With CertiFail we achieve an accuracy of 73.7 percent in selecting test cases that fail. In addition, CertiFail surpasses baseline models in terms of accurately predicting test cases that fail.
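As a rough sketch of an ensemble-based failure predictor of this kind, the snippet below combines several scikit-learn classifiers with soft voting; the specific base models, features, and voting scheme are assumptions, not CertiFail's actual design.

    # Illustrative ensemble for predicting failing test cases; base models and
    # features are assumptions only.
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def build_failure_predictor():
        return VotingClassifier(
            estimators=[
                ("rf", RandomForestClassifier(n_estimators=200)),
                ("gb", GradientBoostingClassifier()),
                ("lr", make_pipeline(StandardScaler(),
                                     LogisticRegression(max_iter=1000))),
            ],
            voting="soft",               # average predicted probabilities
        )

    # Usage sketch: X are per-test feature vectors (e.g. road geometry statistics),
    # y are pass/fail labels from previously executed tests.
    # model = build_failure_predictor().fit(X_train, y_train)
    # likely_failing = model.predict_proba(X_new)[:, 1] > 0.5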
NN-SDCTest at the ICST 2025 Tool Competition – Self-Driving Car Testing Track
Prakash Aryana and Sajad Khatiri
(Birla Institute of Technology and Science, India; USI Lugano, Switzerland)
Testing self-driving cars (SDCs) requires extensive simulation-based testing, making efficient test case selection important. This paper presents two approaches for test case selection in SDC testing: a curvature-based selector that analyzes road geometry and a graph neural network (GNN) based selector that learns failure patterns. The curvature-based approach uses road geometry analysis, turn detection, and group-based selection strategies, while the GNN approach has a four-layer neural architecture with feature engineering for predicting test failures. Both approaches are implemented and evaluated as part of the ICST 2025 Tool Competition for SDC testing. Our experimental results show that the GNN selector achieves superior computational efficiency (initialization: 1.45s vs 15.27s, selection: 0.30s vs 4.06s) and a better time-to-fault ratio (209.78 vs 239.48), while the curvature-based selector demonstrates stronger fault detection capabilities with a higher fault-to-selection ratio (0.214 vs 0.177). Both approaches maintain comparable diversity scores (0.037 and 0.039 respectively), demonstrating their effectiveness in achieving comprehensive test coverage. The comparative analysis provides insights into the strengths of geometric analysis and machine learning approaches in SDC test selection.
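The core idea of the curvature-based selector, scoring roads by their sharp turns, can be sketched as below; the turn-angle threshold and scoring weights are illustrative assumptions rather than the tool's actual heuristics.

    # Sketch of curvature-based road scoring: roads with more or sharper turns
    # are ranked as more likely to challenge the lane-keeping system.
    import numpy as np

    def turn_angles(points):
        """Angle change (radians) at each interior point of a polyline road."""
        pts = np.asarray(points, dtype=float)
        d = np.diff(pts, axis=0)
        headings = np.arctan2(d[:, 1], d[:, 0])
        return np.abs(np.diff(np.unwrap(headings)))

    def road_risk_score(points, sharp_turn=np.radians(30)):
        angles = turn_angles(points)
        # Combine the number of sharp turns with the overall accumulated turning.
        return 2.0 * np.sum(angles > sharp_turn) + np.sum(angles)

    def select_by_curvature(roads, budget):
        ranked = sorted(range(len(roads)),
                        key=lambda i: road_risk_score(roads[i]), reverse=True)
        return ranked[:budget]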
Tool Competition – Unmanned Aerial Vehicles Testing Track
ICST Tool Competition 2025 – UAV Testing Track
Sajad Khatiri,
Tahereh Zohdinasab, Prasun Saurabh, Dmytro Humeniuk, and Sebastiano Panichella
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland; USI Lugano, Switzerland; Polytechnique Montréal, Canada)
Simulation-based testing plays a crucial role in ensuring the safety of autonomous Unmanned Aerial Vehicles (UAVs); however, this area remains underexplored. The UAV Testing Competition aims to engage the software testing community by highlighting UAVs as an emerging and vital domain. This initiative offers a straightforward software platform and representative case studies to ease participants’ entry into UAV testing, enabling them to develop their initial test generation tools for UAVs. In this second iteration of the competition, three tools were submitted, assessed, and thoroughly compared against each other, as well as the baseline approach. Our benchmarking framework analyzed their test generation capabilities across three distinct case studies. The resulting test suites were evaluated and ranked based on their failure detection and diversity. This paper provides an overview of the competition, detailing its context, platform, participating tools, evaluation methodology, and key findings.
TGen-UQ at the ICST 2025 Tool Competition – UAV Testing Track
Ali Javadi and
Christian Birchler
(University of Bern, Switzerland; Zurich University of Applied Sciences, Switzerland)
Testing of autonomous UAV systems poses significant challenges due to its complex nature. The complexity lies in creating realistic and diverse test scenarios. This report documents a method that applies Q-learning combined with Upper Confidence Bound (UCB) to generate obstacle configurations in a simulated environment. The goal is to evaluate the PX4-Avoidance system by inducing unsafe UAV behaviors through generated obstacle placements. This approach enhances fault detection by balancing exploration and exploitation in test case generation, ultimately increasing scenario diversity and system robustness.
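A skeletal version of Q-learning with UCB-based action selection for choosing obstacle configurations is sketched below; the state/action encoding, the reward (derived from observed UAV behaviour in simulation), and the hyperparameters are placeholders, not the tool's actual setup.

    # Skeleton of Q-learning with UCB action selection for obstacle placement;
    # states, actions, reward, and hyperparameters are placeholders.
    import math
    from collections import defaultdict

    class UCBQLearner:
        def __init__(self, actions, alpha=0.1, gamma=0.9, c=1.4):
            self.actions = actions                   # e.g. candidate obstacle placements
            self.alpha, self.gamma, self.c = alpha, gamma, c
            self.q = defaultdict(float)              # Q[(state, action)]
            self.visits = defaultdict(int)           # N[(state, action)]
            self.state_visits = defaultdict(int)     # N[state]

        def choose(self, state):
            self.state_visits[state] += 1
            def ucb(a):
                n = self.visits[(state, a)]
                if n == 0:
                    return float("inf")              # try unexplored actions first
                bonus = self.c * math.sqrt(math.log(self.state_visits[state]) / n)
                return self.q[(state, a)] + bonus
            return max(self.actions, key=ucb)

        def update(self, state, action, reward, next_state):
            self.visits[(state, action)] += 1
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            td_target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

    # Usage sketch (the simulator call is assumed): the reward would be higher the
    # closer the generated obstacle configuration drives the UAV to unsafe behaviour.
    # learner = UCBQLearner(actions=candidate_placements)
    # action = learner.choose(state)
    # reward, next_state = run_simulation(state, action)
    # learner.update(state, action, reward, next_state)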
PALM at the ICST 2025 Tool Competition – UAV Testing Track
Shuncheng Tang, Zhenya Zhang, Ahmet Cetinkaya, and
Paolo Arcaini
(University of Science and Technology of China, China; Kyushu University, Japan; Shibaura Institute of Technology, Japan; National Institute of Informatics, Japan)
PALM is a scenario generator for UAV testing that participated in the ICST Tool Competition 2025 - CPS-UAV Test Case Generation Track. PALM adopts Monte Carlo Tree Search (MCTS) to search for different placements of obstacles of different sizes in the mission environment. Increasing the tree depth adds a new obstacle to the environment, whereas adding a new node at the current tree level re-optimises the placement and the dimension of the last added obstacle.
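A generic MCTS skeleton reflecting this search structure (deeper nodes add obstacles, siblings at the same level re-try the last obstacle's placement and size) is given below; the expansion policy, rollout, and simulator-based reward are placeholder assumptions, not PALM's implementation.

    # Generic MCTS skeleton for obstacle placement; expansion, rollout, and the
    # simulator-based reward are placeholders.
    import math
    import random

    MAX_CHILDREN = 4   # siblings at one level re-try the last obstacle's placement/size

    class Node:
        def __init__(self, obstacles, parent=None):
            self.obstacles = obstacles        # list of (x, y, size) placed so far
            self.parent, self.children = parent, []
            self.visits, self.value = 0, 0.0

        def ucb1(self, c=1.4):
            if self.visits == 0:
                return float("inf")
            return (self.value / self.visits +
                    c * math.sqrt(math.log(self.parent.visits) / self.visits))

    def random_obstacle():
        return (random.uniform(0, 50), random.uniform(0, 50), random.uniform(1, 5))

    def mcts(simulate, iterations=200, max_obstacles=5):
        """simulate(obstacles) -> reward, e.g. how close the UAV came to a crash."""
        root = Node(obstacles=[])
        for _ in range(iterations):
            node = root
            # Selection: descend only while the node is fully expanded.
            while node.children and len(node.children) >= MAX_CHILDREN:
                node = max(node.children, key=Node.ucb1)
            # Expansion: a deeper node adds one more obstacle; siblings at the
            # same level correspond to alternative placements/sizes of it.
            if len(node.obstacles) < max_obstacles:
                node.children.append(Node(node.obstacles + [random_obstacle()], node))
                node = node.children[-1]
            reward = simulate(node.obstacles)             # rollout via the simulator
            while node is not None:                       # backpropagation
                node.visits += 1
                node.value += reward
                node = node.parent
        return (max(root.children, key=lambda n: n.visits).obstacles
                if root.children else [])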