Proceedings of the ACM on Software Engineering, Volume 2, Number FSE,
June 23–27, 2025,
Trondheim, Norway
Frontmatter
Papers
Gleipner: A Benchmark for Gadget Chain Detection in Java Deserialization Vulnerabilities
Bruno Kreyssig and
Alexandre Bartel
(Umeå University, Sweden)
While multiple recent publications on detecting Java deserialization vulnerabilities highlight the increasing relevance of the topic, no proper benchmark has been established so far to evaluate the individual approaches. Hence, it has become increasingly difficult to demonstrate improvements over previous tools and to expose the trade-offs they make. In this work, we synthesize the main challenges in gadget chain detection; more specifically, we unveil the constraints program analysis faces in the context of gadget chain detection. From there, we develop Gleipner: the first synthetic, large-scale, and systematic benchmark to validate the effectiveness of algorithms for detecting gadget chains in the Java programming language. We then benchmark seven previous publications in the field using Gleipner. Our results show that (1) our benchmark provides a transparent, qualitative, and sound measurement of the maturity of gadget chain detection tools, (2) Gleipner alleviates severe benchmarking flaws that were previously common in the field, and (3) state-of-the-art tools still struggle with most challenges in gadget chain detection.
Detecting Smart Contract State-Inconsistency Bugs via Flow Divergence and Multiplex Symbolic Execution
Yinxi Liu,
Wei Meng, and
Yinqian Zhang
(Rochester Institute of Technology, USA; Chinese University of Hong Kong, China; Southern University of Science and Technology, China)
Ethereum smart contracts determine state transition results not only from the previous states, but also from a mutable global state consisting of storage variables. This has resulted in state-inconsistency bugs, which grant an attacker the ability to modify contract states either through recursive function calls to a contract (reentrancy) or by exploiting transaction order dependence (TOD). Current studies have determined that identifying data races on global storage variables can capture all state-inconsistency bugs. Nevertheless, eliminating false positives poses a significant challenge, given the extensive number of execution paths that could potentially cause a data race.
For simplicity, existing research considers a data race to be vulnerable as long as the variable involved could have inconsistent values under different execution orders. However, such a data race could be benign when the inconsistent value does not affect any critical computation or decision-making process in the program. Moreover, the data race could also be infeasible when there is no valid state in the contract that allows the execution of both orders.
In this paper, we aim to appreciably reduce these false positives without introducing false negatives. We present DivertScan, a precise framework to detect exploitable state-inconsistency bugs in smart contracts. We first introduce the use of flow divergence to check where the involved variable may flow to. This allows DivertScan to precisely infer the potential effects of a data race and determine whether it can be exploited to induce unexpected program behaviors. We also propose multiplex symbolic execution to examine different execution orders in a single round of solving. This helps DivertScan determine whether a common starting state could potentially exist. To address the scalability issue in symbolic execution, DivertScan utilizes overapproximated pre-checking and a selective exploration strategy. As a result, it only needs to explore a limited state space.
DivertScan significantly outperformed state-of-the-art tools, improving the precision rate by 20.72% to 74.93% while introducing no false negatives. It also identified five exploitable real-world vulnerabilities that other tools missed. The detected vulnerabilities could potentially lead to a loss of up to $68.2M, based on trading records and rate limits.
MendelFuzz: The Return of the Deterministic Stage
Han Zheng,
Flavio Toffalini,
Marcel Böhme, and
Mathias Payer
(EPFL, Switzerland; Ruhr Universität Bochum, Germany; MPI-SP, Germany)
Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage, and instead applying multiple mutation operators at the same time to a single input, achieves more coverage, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%) and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called MendelFuzz, to identify these critical seeds and bytes so as to prune a large number of unnecessary inputs. MendelFuzz retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate our MendelFuzz implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that MendelFuzz outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. MendelFuzz also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, MendelFuzz has been independently evaluated and integrated into AFL++ as the default option.
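For readers unfamiliar with the classic deterministic stage discussed above, the following minimal sketch (hypothetical, written only for illustration; not the AFL++ or MendelFuzz code) enumerates one slightly corrupted input per mutation operator and seed position, which is exactly the exhaustive behavior whose cost the paper then prunes down to critical seeds and bytes.

    # Illustrative "deterministic stage": one input per (operator, position) pair.
    def bit_flips(seed: bytes):
        """Flip every single bit of the seed, one generated input per bit."""
        for i in range(len(seed) * 8):
            buf = bytearray(seed)
            buf[i // 8] ^= 1 << (i % 8)
            yield bytes(buf)

    def arith_add(seed: bytes, max_delta: int = 4):
        """Add small deltas to every byte, one input per (position, delta)."""
        for pos in range(len(seed)):
            for delta in range(1, max_delta + 1):
                buf = bytearray(seed)
                buf[pos] = (buf[pos] + delta) & 0xFF
                yield bytes(buf)

    def deterministic_stage(seed: bytes):
        yield from bit_flips(seed)
        yield from arith_add(seed)

    seed = b"PNG\x0d"
    # 4 bytes * 8 bit flips + 4 bytes * 4 deltas = 48 deterministic inputs.
    print(len(list(deterministic_stage(seed))))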
SmartShot: Hunt Hidden Vulnerabilities in Smart Contracts using Mutable Snapshots
Ruichao Liang,
Jing Chen,
Ruochen Cao,
Kun He,
Ruiying Du,
Shuhua Li,
Zheng Lin, and
Cong Wu
(Wuhan University, China; University of Hong Kong, Hong Kong)
Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using a contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective in detecting hidden vulnerabilities, achieving the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts, which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts.
On-Demand Scenario Generation for Testing Automated Driving Systems
Songyang Yan,
Xiaodong Zhang,
Kunkun Hao,
Haojie Xin,
Yonggang Luo,
Jucheng Yang,
Ming Fan,
Chao Yang,
Jun Sun, and
Zijiang Yang
(Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China)
The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulator using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels.
Element-Based Automated DNN Repair with Fine-Tuned Masked Language Model
Xu Wang,
Mingming Zhang,
Xiangxin Meng,
Jian Zhang,
Yang Liu, and
Chunming Hu
(Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore)
Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper.
Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM
Susheng Wu,
Ruisi Wang,
Yiheng Cao,
Bihuan Chen,
Zhuotong Zhou,
Yiheng Huang,
JunPeng Zhao, and
Xin Peng
(Fudan University, China)
Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook the essential semantic and syntactic context of the vulnerability and fail to detect or refine faulty patches.
We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantically related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures of both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at the function level and 0.924 at the CVE level, outperforming state-of-the-art approaches by at least 13.2% at the function level and 12.3% at the CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches.
Artifacts Available
Smart Contract Fuzzing Towards Profitable Vulnerabilities
Ziqiao Kong,
Cen Zhang,
Maoyi Xie,
Ming Hu,
Yue Xue,
Ye Liu,
Haijun Wang, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China)
Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection, and 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects these profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. identification criteria for potentially profitable candidates, which check whether an input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average loss of over 1.1 million dollars. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than the state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than the real-world attack exploits (1.01 to 11.45 times more). VERITE has also been applied by auditors in contract auditing, where 6 zero-day vulnerabilities (5 of high severity) were found, earning over $2,500 in bounty rewards.
Preprint
Artifacts Available
CKTyper: Enhancing Type Inference for Java Code Snippets by Leveraging Crowdsourcing Knowledge in Stack Overflow
Anji Li,
Neng Zhang,
Ying Zou,
Zhixiang Chen,
Jian Wang, and
Zibin Zheng
(Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China)
Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types for the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, they are limited by ignoring information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve performance.
In this paper, we propose a novel type inference approach, named CKTyper, that leverages crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet, and ChatGPT's hallucinations are eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the best precision/recall of 97.80% and 95.54% on the two datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT.
Ransomware Detection through Temporal Correlation between Encryption and I/O Behavior
Lihua Guo,
Yiwei Hou,
Chijin Zhou,
Quan Zhang, and
Yu Jiang
(Tsinghua University, China)
In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operations. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach that detects ransomware attacks through the temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors: to disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computational load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, which is 31.82 higher than existing detectors on average.
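The invariant described above can be pictured with the short sketch below (hypothetical per-window numbers and threshold; not the RansomRadar implementation): per time window, a proxy for encryption workload is correlated with the volume of data written back to disk, and a strong positive correlation is treated as ransomware-like behavior.

    import math

    def pearson(xs, ys):
        """Plain Pearson correlation between two equally long series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Hypothetical per-second samples: encryption-work proxy vs. bytes written.
    # Ransomware that reads, encrypts, and rewrites files keeps the two series
    # tightly coupled; benign workloads usually do not.
    crypto_load = [120, 400, 390, 850, 30, 820]
    bytes_written = [4000, 13000, 12500, 28000, 1000, 27000]

    if pearson(crypto_load, bytes_written) > 0.9:  # threshold is illustrative
        print("strong encryption/I-O coupling: ransomware-like behavior")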
DeclarUI: Bridging Design and Development with Automated Declarative UI Code Generation
Ting Zhou,
Yanjie Zhao,
Xinyi Hou,
Xiaoyu Sun,
Kai Chen, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Australian National University, Australia)
Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development.
COFFE: A Code Efficiency Benchmark for Code Generation
Yun Peng,
Jun Wan,
Yichen Li, and
Xiaoxue Ren
(Chinese University of Hong Kong, China; Zhejiang University, China)
Code generation has greatly improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions from detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. Moreover, the current execution time measurement is neither stable nor comprehensive, threatening the validity of the time efficiency evaluation.
To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases that improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw several implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.
Pinning Is Futile: You Need More Than Local Dependency Versioning to Defend against Supply Chain Attacks
Hao He,
Bogdan Vasilescu, and
Christian Kästner
(Carnegie Mellon University, USA)
Recent high-profile incidents in open-source software have greatly raised practitioners' attention to software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependencies to specific versions rather than floating within version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely.
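To make the pinned-versus-floating trade-off concrete, the sketch below uses a deliberately simplified semver matcher (hypothetical helper functions, not npm's actual resolution algorithm): a caret range silently adopts the newest compatible release, which is how a hijacked update propagates, while a pin keeps an older and possibly vulnerable version.

    # Simplified illustration of pinned vs. floating dependency resolution.
    def parse(v):
        return tuple(int(x) for x in v.split("."))

    def satisfies(version, constraint):
        """Tiny subset of semver: exact pins and caret ranges only."""
        if constraint.startswith("^"):
            base, v = parse(constraint[1:]), parse(version)
            return v >= base and v[0] == base[0]  # same major, not older than base
        return parse(version) == parse(constraint)

    def resolve(constraint, published):
        candidates = [v for v in published if satisfies(v, constraint)]
        return max(candidates, key=parse) if candidates else None

    published = ["1.2.3", "1.3.0", "1.4.0"]     # imagine 1.4.0 is a hijacked release
    print(resolve("1.2.3", published))    # pinned   -> 1.2.3 (stays outdated/vulnerable)
    print(resolve("^1.2.3", published))   # floating -> 1.4.0 (picks up the new release)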
An Empirical Study of Suppressed Static Analysis Warnings
Huimin Hu,
Yingying Wang,
Julia Rubin, and
Michael Pradel
(University of Stuttgart, Germany; University of British Columbia, Canada)
Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentionally suppressing future warnings, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries.
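For illustration, the suppressions studied here are analyzer-specific comments such as the Python examples below (each pragma silences one check at one location; the exact syntax depends on the analyzer):

    # flake8: keep the unused import below from being reported (rule F401).
    import hashlib  # noqa: F401

    # pylint: silence the naming-convention check for this one assignment.
    apiKey = "read-from-config"  # pylint: disable=invalid-name

    def lookup(table, key):
        # mypy: ignore a potential type error on the indexing expression below.
        return table[key]  # type: ignore[index]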
Towards Diverse Program Transformations for Program Simplification
Haibo Wang,
Zezhong Xing,
Chengnian Sun,
Zheng Wang, and
Shin Hwei Tan
(Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK)
By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, seven of which were fixed. Our study also identifies gaps in applying existing approaches to automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool that automatically produces simplified programs, i.e., semantically equivalent programs with fewer lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification.
Preprint
Today’s Cat Is Tomorrow’s Dog: Accounting for Time-Based Changes in the Labels of ML Vulnerability Detection Approaches
Ranindya Paramitha,
Yuan Feng, and
Fabio Massacci
(University of Trento, Italy; Vrije Universiteit Amsterdam, Netherlands)
Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested in the field, one can only use the labels available at the time of training and testing (e.g., seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change, and past performance is not necessarily aligned with future performance. Past works only considered slices of the whole history (e.g., DiverseVul) or individual differences between releases (e.g., Jimenez et al., ESEC/FSE 2019).
Such approaches are either too optimistic in training (e.g., the whole history) or too conservative (e.g., consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If a model is actually learning, it should improve its performance over time as more data becomes available and the data becomes more stable, an effect that can be checked with the Mann-Kendall trend test.
We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from the BigVul dataset plus Vuldeepecker's NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, indicating that most models are not learning.
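The Mann-Kendall check mentioned above reduces to a simple rank-based statistic; the sketch below (illustrative numbers, not the paper's data or its full significance test) computes the S statistic over a detector's yearly scores and reads its sign as the direction of the trend.

    def mann_kendall_s(series):
        """Mann-Kendall S statistic: concordant minus discordant pairs."""
        s = 0
        for i in range(len(series)):
            for j in range(i + 1, len(series)):
                diff = series[j] - series[i]
                s += (diff > 0) - (diff < 0)
        return s

    # Hypothetical yearly F1 scores of one detector as labels accumulate.
    f1_by_year = [0.41, 0.38, 0.44, 0.40, 0.39]
    s = mann_kendall_s(f1_by_year)
    # S > 0 suggests an upward trend (the model benefits from more data);
    # S <= 0 suggests no consistent improvement.  A full test would also
    # compare S against its variance to assess statistical significance.
    print("S =", s)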
A New Approach to Evaluating Nullability Inference Tools
Nima Karimipour,
Erfan Arvan,
Martin Kellogg, and
Manu Sridharan
(University of California at Riverside, USA; New Jersey Institute of Technology, USA)
Null-pointer exceptions are a serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making it challenging to compare their effectiveness. In this work, we identify a systematic bias in some prior experimental evaluations of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation shows the complementary strengths of the tools and the remaining weaknesses that could be addressed in future work.
A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems
Yuntianyi Chen,
Yuqi Huai,
Yirui He,
Shilong Li,
Changnam Hong,
Qi Alfred Chen, and
Joshua Garcia
(University of California at Irvine, USA)
As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, studying the symptoms, root causes, and bug-fix patterns of 1,331 bug fixes. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to eliminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances.
Preprint
Artifacts Available
An Empirical Study on Release-Wise Refactoring Patterns
Shayan Noei,
Heng Li, and
Ying Zou
(Queen's University, Canada; Polytechnique Montréal, Canada)
Refactoring is a technical approach to increase the internal quality of software without altering its external functionalities. Developers often invest significant effort in refactoring. With the increased adoption of continuous integration and deployment (CI/CD), refactoring activities may vary within and across different releases and be influenced by various release goals. For example, developers may consistently allocate refactoring activities throughout a release, or prioritize new features early on in a release and only pick up refactoring late in a release. Different approaches to allocating refactoring tasks may have different implications for code quality. However, there is a lack of existing research on how practitioners allocate their refactoring activities within a release and their impact on code quality. Therefore, we first empirically study the frequent release-wise refactoring patterns in 207 open-source Java projects and their characteristics. Then, we analyze how these patterns and their transitions affect code quality. We identify four major release-wise refactoring patterns: early active, late active, steady active, and steady inactive. We find that adopting the late active pattern—characterized by gradually increasing refactoring activities as the release approaches—leads to the best code quality. We observe that as projects mature, refactoring becomes more active, reflected in the increasing use of the steady active release-wise refactoring pattern and the decreasing utilization of the steady inactive release-wise refactoring pattern. While the steady active pattern shows improvement in quality-related code metrics (e.g., cohesion), it can also lead to more architectural problems. Additionally, we observe that developers tend to adhere to a single refactoring pattern rather than switching between different patterns. The late active pattern, in particular, can be a safe release-wise refactoring pattern that is used repeatedly. Our results can help practitioners understand existing release-wise refactoring patterns and their effects on code quality, enabling them to utilize the most effective pattern to enhance release quality.
Hallucination Detection in Large Language Models with Metamorphic Relations
Borui Yang,
Md Afif Al Mamun,
Jie M. Zhang, and
Gias Uddin
(King's College London, UK; University of Calgary, Canada; York University, Canada)
Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. Other methods depend on output probabilities, which are often inaccessible for closed-source LLMs like the GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relations and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs.
MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and F1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 to 0.113 (for precision), 0.143 to 0.430 (for recall), and 0.154 to 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all categories of questions.
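The hypothesis can be sketched as follows (placeholder prompt mutations and a toy agreement check standing in for MetaQA's actual design): the same question is asked in several meaning-preserving forms, and disagreement among the answers is taken as a hallucination signal.

    def mutate(question):
        """Meaning-preserving rewrites; the true answer must stay the same."""
        return [
            f"Please answer concisely: {question}",
            f"{question} Reply with only the fact itself.",
        ]

    def agree(a, b):
        """Crude agreement check; MetaQA scores agreement more carefully."""
        return a.strip().lower() == b.strip().lower()

    def looks_like_hallucination(ask, question):
        """Flag a response whose metamorphic variants disagree with it."""
        base = ask(question)
        return not all(agree(base, ask(q)) for q in mutate(question))

    # Stand-in for a real LLM call: its answer drifts once the prompt changes.
    fake_llm = lambda p: "James Joyce" if p == "Who wrote Dubliners?" else "Samuel Beckett"
    print(looks_like_hallucination(fake_llm, "Who wrote Dubliners?"))  # True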
One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE)
Xu Yang,
Shaowei Wang,
Jiayuan Zhou, and
Wenhan Zhu
(University of Manitoba, Canada; Huawei, Canada)
Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks: CWE type classification and CWE-specific vulnerability detection. By splitting the task, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these types by at least 7.3%, effectively addressing long-tailed issues.
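The two-stage decomposition can be pictured with the toy routing sketch below (keyword stubs stand in for the trained router and experts; not the MoEVD models): a router first predicts the CWE family of a code snippet, and the matching CWE-specific expert then decides whether it is vulnerable.

    # Toy Mixture-of-Experts routing for vulnerability detection (illustrative only).
    def route_cwe(code):
        """Stage 1: predict the CWE family (placeholder keyword heuristic)."""
        if "strcpy" in code or "memcpy" in code:
            return "CWE-787"   # out-of-bounds write
        if "SELECT" in code and "+" in code:
            return "CWE-89"    # SQL injection
        return "CWE-other"

    EXPERTS = {
        # Stage 2: one detector per CWE family; here just keyword stubs.
        "CWE-787": lambda code: "strcpy(" in code,
        "CWE-89": lambda code: "executeQuery(" in code,
        "CWE-other": lambda code: False,
    }

    def is_vulnerable(code):
        cwe = route_cwe(code)
        return cwe, EXPERTS[cwe](code)

    print(is_vulnerable("strcpy(dst, src);"))                                    # ('CWE-787', True)
    print(is_vulnerable('stmt.executeQuery("SELECT * FROM t WHERE id=" + uid)')) # ('CWE-89', True)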
LlamaRestTest: Effective REST API Testing with Small Language Models
Myeongsoo Kim,
Saurabh Sinha, and
Alessandro Orso
(Georgia Institute of Technology, USA; IBM Research, USA)
Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on these specifications. While Large Language Models (LLMs) have shown promising results in various testing domains, their application to REST API testing remains unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover parameter dependencies during the testing process by incorporating server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as FDIC and Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller LLMs to outperform much larger models in detecting actionable rules and generating inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, striking a balance between effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools utilize specifications enhanced by RESTGPT. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance.
Code Change Intention, Development Artifact, and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM
Xu Yang,
Wenhan Zhu,
Michael Pacheco,
Jiayuan Zhou,
Shaowei Wang,
Xing Hu, and
Kui Liu
(University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China)
Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1% to 145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification.
QSF: Multi-objective Optimization Based Efficient Solving for Floating-Point Constraints
Xu Yang,
Zhenbang Chen,
Wei Dong, and
Ji Wang
(National University of Defense Technology, China)
Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. In addition, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and a benchmark from real-world floating-point programs. The results demonstrate that, compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered average speedups of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs.
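The two objectives can be illustrated with a small fitness sketch (hypothetical constraint encoding, not QSF's solver or its mutation operators): each constraint reports a violation degree, and a candidate assignment is scored by the number of unsatisfied constraints together with their summed violation, both to be minimized by the search.

    # Illustrative two-objective fitness for search-based floating-point solving.
    # Each constraint returns its violation degree: 0.0 when satisfied, otherwise
    # a positive measure of how far the assignment is from satisfying it.

    def fitness(assignment, constraints):
        violations = [c(assignment) for c in constraints]
        unsat = sum(1 for v in violations if v > 0.0)
        return unsat, sum(violations)          # minimize both objectives

    constraints = [
        lambda a: max(0.0, a["x"] * a["x"] - a["y"]),   # encodes x*x <= y
        lambda a: max(0.0, 1.0 - a["x"]),               # encodes x >= 1.0
    ]

    print(fitness({"x": 0.5, "y": 0.1}, constraints))   # (2, 0.65): both violated
    print(fitness({"x": 1.2, "y": 2.0}, constraints))   # (0, 0.0): a solution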
Adaptive Random Testing with Q-grams: The Illusion Comes True
Matteo Biagiola,
Robert Feldt, and
Paolo Tonella
(USI Lugano, Switzerland; Chalmers University of Technology, Sweden)
Adaptive Random Testing (ART) has faced criticism, particularly for its computational inefficiency, as highlighted by Arcuri and Briand. Their analysis clarified how ART requires a quadratic number of distance computations as the number of test executions increases, which limits its scalability in scenarios requiring extensive testing to uncover faults. Simulation results support this, showing that the computational overhead of these distance calculations often outweighs ART’s benefits. While various ART variants have attempted to reduce these costs, they frequently do so at the expense of fault detection, lack complexity guarantees, or are restricted to specific input types, such as numerical or discrete data.
In this paper, we introduce a novel framework for adaptive random testing that replaces pairwise distance computations with a compact aggregation of past executions, such as counting the q-grams observed in previous runs. Test case selection then leverages this aggregated data to measure diversity (e.g., entropy of q-grams), allowing us to reduce the computational complexity from quadratic to linear. Experiments with a benchmark of six web applications show that ART with q-grams covers, on average, 4× more unique targets than random testing, and 3.5× more than ART using traditional distance-based methods.
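The aggregation idea can be sketched roughly as follows (a simplified illustration under assumed trace and candidate representations, not the authors' implementation): q-grams observed in past executions are kept in a single counter, and the next test case is the candidate whose trace yields the highest q-gram entropy once merged into that counter.

    import math
    from collections import Counter

    def qgrams(trace, q=2):
        """Consecutive q-grams of an execution trace (e.g., a branch-id sequence)."""
        return [tuple(trace[i:i + q]) for i in range(len(trace) - q + 1)]

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    history = Counter(qgrams(["a", "b", "a", "b", "c"]))   # aggregate of past runs
    candidates = {
        "test1": ["a", "b", "a", "b"],   # repeats already-seen behavior
        "test2": ["c", "d", "e", "f"],   # mostly new behavior
    }
    # Pick the candidate whose trace adds the most q-gram diversity.
    best = max(candidates,
               key=lambda n: entropy(history + Counter(qgrams(candidates[n]))))
    print("selected:", best)   # test2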
Understanding and Characterizing Mock Assertions in Unit Tests
Hengcheng Zhu,
Valerio Terragni,
Lili Wei,
Shing-Chi Cheung,
Jiarong Wu, and
Yepang Liu
(Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China)
Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usage. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in the system under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usage and provide a foundation for future related research, such as automated test generation that supports mock assertions.
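Although the studied projects are written in Java (typically using Mockito), the idea is easy to show with Python's unittest.mock (an illustrative example, not taken from the studied projects): the mock assertion is the only way to validate a side effect that the method's return value and state never expose.

    import unittest
    from unittest import mock

    def sync_user(user, gateway):
        """Pushes a dirty user record to a remote service; returns nothing."""
        if user.get("dirty"):
            gateway.push(user["id"], user["name"])

    class SyncUserTest(unittest.TestCase):
        def test_dirty_user_is_pushed(self):
            gateway = mock.Mock()
            sync_user({"id": 7, "name": "ada", "dirty": True}, gateway)
            # Mock assertion: the interaction with the external resource is the
            # only observable effect, so it is verified on the mock itself.
            gateway.push.assert_called_once_with(7, "ada")

    if __name__ == "__main__":
        unittest.main()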
Artifacts Available
Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning
Yuqing Wang,
Mika V. Mäntylä,
Serge Demeyer,
Mutlu Beyazıt,
Joanna Kisaakye, and
Jesse Nyyssölä
(University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium)
Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human effort from practitioners is still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for an MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from an MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique, respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to an MSS different from the one it is trained on, it achieves an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system.
Detecting and Handling WoT Violations by Learning Physical Interactions from Device Logs
Bingkun Sun,
Shiqi Sun,
Jialin Ren,
Mingming Hu,
Kun Hu,
Liwei Shen, and
Xin Peng
(Fudan University, China; Northwestern Polytechnique University, China)
The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness.
To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs a dynamic probabilistic graphical model (DPGM) to model the physical interactions as a physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interactions by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules and monitors runtime device logs to predict violation states and generate handling policies by inferring over the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling.
The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub
Jaydeb Sarker,
Asif Kamal Turzo, and
Amiangshu Bosu
(University of Nebraska at Omaha, USA; Wayne State University, USA)
Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is positively associated with the prevalence of toxicity, its issue resolution rate has the opposite association. Corporate-sponsored projects are less toxic, but gaming projects are seven times more toxic than non-gaming ones. OSS contributors who have authored toxic comments in the past are significantly more likely to repeat such behavior. Moreover, such individuals are more likely to become targets of toxic texts.
Article Search
Artifacts Available
DuoReduce: Bug Isolation for Multi-layer Extensible Compilation
Jiyuan Wang,
Yuxin Qiu,
Ben Limpanukorn,
Hong Jin Kang,
Qian Zhang, and
Miryung Kim
(University of California at Los Angeles, USA; University of California at Riverside, USA)
In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators.
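For readers unfamiliar with delta debugging, the sketch below shows a generic ddmin-style reduction over a list of compilation passes with a stubbed bug oracle. It illustrates only the pass-reduction dimension and is not DuoReduce's ordering-aware, cross-dependent algorithm; the pass names and the oracle are placeholders.

```python
# Generic ddmin-style reducer over a list of compilation passes (illustrative).

def still_fails(passes):
    # Placeholder oracle: in practice this would run the MLIR pipeline with
    # `passes` on the IR file and check whether the bug symptom reproduces.
    return {"convert-scf-to-cf", "convert-func-to-llvm"} <= set(passes)

def ddmin(passes, oracle):
    n = 2
    while len(passes) >= 2:
        chunk = max(1, len(passes) // n)
        subsets = [passes[i:i + chunk] for i in range(0, len(passes), chunk)]
        reduced = False
        for subset in subsets:
            complement = [p for p in passes if p not in subset]
            if oracle(complement):            # bug still reproduces without `subset`
                passes, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(passes):
                break
            n = min(len(passes), n * 2)
    return passes

pipeline = ["canonicalize", "cse", "convert-scf-to-cf",
            "convert-func-to-llvm", "symbol-dce", "inline"]
print(ddmin(pipeline, still_fails))   # -> the two passes the fake oracle requires
```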
Article Search
ROSCallBaX: Statically Detecting Inconsistencies in Callback Function Setup of Robotic Systems
Sayali Kate,
Yifei Gao,
Shiwei Feng, and
Xiangyu Zhang
(Purdue University, USA)
The increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming languages (e.g., C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and developers' incomplete knowledge of the semantics of these elements in various versions of the framework can lead to inconsistencies in the setup of callback execution. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of a robotic system's callbacks and their execution setup, and then checking it against composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on a dataset created from posts on developer forums and publicly available ROS projects. The evaluation results show that our approach can detect real inconsistencies.
Article Search
Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models
Yanlin Wang,
Tianyue Jiang,
Mingwei Liu,
Jiachi Chen,
Mingzhi Mao,
Xilin Liu,
Yuchi Ma, and
Zibin Zheng
(Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China)
Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and derive a taxonomy of coding style inconsistencies. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem.
Article Search
Automated Unit Test Refactoring
Yi Gao,
Xing Hu,
Xiaohu Yang, and
Xin Xia
(Zhejiang University, China)
Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manually refactoring test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction. UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement.
Article Search
HornBro: Homotopy-Like Method for Automated Quantum Program Repair
Siwei Tan,
Liqiang Lu,
Debin Xiang,
Tianyao Chu,
Congliang Lang,
Jintao Chen,
Xing Hu, and
Jianwei Yin
(Zhejiang University, China)
Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch.
Article Search
Artifacts Available
Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead
Yanqi Su,
Zhenchang Xing,
Chong Wang,
Chunyang Chen,
Sherry (Xiwei) Xu,
Qinghua Lu, and
Liming Zhu
(Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; CSIRO, Australia; UNSW, Australia)
Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research.
Article Search
LLM-Based Method Name Suggestion with Automatically Generated Context-Rich Prompts
Waseem Akram,
Yanjie Jiang,
Yuxia Zhang,
Haris Ali Khan, and
Hui Liu
(Beijing Institute of Technology, China; Peking University, China)
Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Models (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for an LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks pivot words that are semantically related to tokens in the corresponding method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results), together with the query and the selected examples, are then filled into a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggest that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggest that the proposed approach works well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3.
Article Search
Demystifying LLM-Based Software Engineering Agents
Chunqiu Steven Xia,
Yinlin Deng,
Soren Dunn, and
Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless
Article Search
Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation
Mengzhuo Chen,
Zhe Liu,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Jun Hu, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany)
In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback.
Article Search
PDCAT: Preference-Driven Compiler Auto-tuning
Mingxuan Zhu,
Zeyu Sun, and
Dan Hao
(Peking University, China; Institute of Software at Chinese Academy of Sciences, China)
Compilers are crucial software tools that usually convert programs in high-level languages into machine code. A compiler provides hundreds of optimizations to improve the performance of the compiled code, which are controlled by enabled or disabled optimization flags. However, the vast number of combinations of these flags makes it extremely challenging to select the desired settings for compiler optimization flags (i.e., an optimization sequence) for a given target program. In the literature, many auto-tuning techniques have been proposed to select a desired optimization sequence via different strategies across the entire optimization space. However, due to the huge optimization space, these techniques commonly suffer from the widely recognized efficiency problem. To address this problem, in this paper, we propose a preference-driven selection approach PDCAT, which reduces the search space of optimization sequences through three components. In particular, PDCAT first identifies combined optimizations based on compiler documentation to exclude optimization sequences violating the combined constraints, and then categorizes the optimizations into a common optimization set (whose optimization flags are fixed) and an exploration set containing the remaining optimizations. Finally, within the search process, PDCAT assigns distinct enable probabilities to the explored optimization flags and ultimately selects a desired optimization sequence. The former two components reduce the search space by removing invalid optimization sequences and fixing some optimization flags, whereas the latter performs a biased search in the search space. To evaluate the performance of the proposed approach PDCAT, we conducted an extensive experimental study on the latest version of the GCC compiler with two widely used benchmarks, cBench and PolyBench. The results show that PDCAT significantly outperforms the four compared techniques, including the state-of-the-art technique SRTuner. Moreover, each component of PDCAT not only contributes to its performance, but also improves the acceleration performance of the compared techniques.
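A minimal sketch of the biased search described above: flags in a fixed common set stay enabled, while each explored flag is toggled according to a per-flag enable probability. The GCC flag names and probabilities below are illustrative placeholders, not PDCAT's learned values.

```python
import random

# Illustrative preference-driven sampling of optimization sequences.
COMMON = ["-ftree-dce", "-fguess-branch-probability"]   # fixed, always enabled
EXPLORE = {                       # per-flag enable probabilities (placeholders)
    "-funroll-loops": 0.7,
    "-fipa-cp-clone": 0.4,
    "-ftree-vectorize": 0.8,
    "-fpeel-loops": 0.3,
}

def sample_sequence():
    flags = list(COMMON)
    for flag, p in EXPLORE.items():
        if random.random() < p:
            flags.append(flag)
        else:
            flags.append(flag.replace("-f", "-fno-", 1))   # explicitly disable
    return flags

for _ in range(3):
    print(" ".join(sample_sequence()))
```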
Article Search
Towards Understanding Docker Build Faults in Practice: Symptoms, Root Causes, and Fix Patterns
Yiwen Wu,
Yang Zhang,
Tao Wang,
Bo Ding, and
Huaimin Wang
(National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China)
Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem and previous studies have demonstrated that developers spend non-trivial time in resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study for understanding the real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes, in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management.
Article Search
Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”
Francesco Sovrano,
Adam Bauer, and
Alberto Bacchelli
(ETH Zurich, Switzerland; University of Zurich, Switzerland)
Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the lost-in-the-end effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
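A minimal sketch of the mitigation the findings point to: rather than submitting an entire file, split it into windows no larger than an empirically determined optimal input size before querying an LLM. The token budget, the four-characters-per-token heuristic, and the scanning stub are assumptions for illustration, not the authors' procedure.

```python
# Illustrative chunking of a source file to a fixed token budget before
# LLM-based vulnerability scanning; numbers and stubs are placeholders.

OPTIMAL_TOKENS = 1500                       # hypothetical per-query budget

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)           # rough 4-chars-per-token heuristic

def split_into_windows(source: str, budget: int = OPTIMAL_TOKENS):
    window, used = [], 0
    for line in source.splitlines():
        cost = approx_tokens(line)
        if window and used + cost > budget:
            yield "\n".join(window)
            window, used = [], 0
        window.append(line)
        used += cost
    if window:
        yield "\n".join(window)

def scan_file(source: str):
    findings = []
    for i, chunk in enumerate(split_into_windows(source)):
        # An ask_llm(chunk) call would prompt the model for XSS, SQL injection,
        # and path traversal issues in this window only.
        findings.append((i, len(chunk.splitlines())))
    return findings

windows = scan_file("query = build_sql(user_input)\n" * 2000)
print(f"{len(windows)} windows, first window covers {windows[0][1]} lines")
```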
Article Search
Artifacts Available
Empirically Evaluating the Impact of Object-Centric Breakpoints on the Debugging of Object-Oriented Programs
Valentin Bourcier,
Pooja Rani,
Maximilian Ignacio Willembrinck Santander,
Alberto Bacchelli, and
Steven Costiou
(University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland)
Debugging consists of understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects—entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research.
Article Search
Artifacts Available
ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
Lars Gröninger,
Beatriz Souza, and
Michael Pradel
(University of Stuttgart, Germany)
Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a function before and after a change. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations.
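A toy illustration of pairwise behavioral comparison: execute the old and the new version of a function on the same inputs and report divergent outcomes. ChangeGuard does this at scale with learning-guided execution on real projects; the functions and inputs below are invented.

```python
# Toy pairwise comparison of a function before and after a "refactoring".

def mean_old(xs):
    return sum(xs) / len(xs)

def mean_new(xs):                         # changed version with a subtle difference
    return sum(xs) // len(xs)             # integer division alters behavior

def compare(old, new, inputs):
    divergences = []
    for x in inputs:
        outcomes = []
        for f in (old, new):
            try:
                outcomes.append(("returned", f(x)))
            except Exception as e:
                outcomes.append(("raised", type(e).__name__))
        if outcomes[0] != outcomes[1]:
            divergences.append((x, outcomes[0], outcomes[1]))
    return divergences

print(compare(mean_old, mean_new, [[1, 2], [3, 4, 5], []]))
# Only [1, 2] diverges (1.5 vs 1); [] raises ZeroDivisionError in both versions.
```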
Article Search
Integrating Large Language Models and Reinforcement Learning for Non-linear Reasoning
Yoav Alon and
Cristina David
(University of Bristol, UK)
Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM’s space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM’s training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, denoting the binary classification, and the intermediate reasoning steps. Our approach compares positively against CoT and ToT.
Article Search
How Do Programming Students Use Generative AI?
Christian Rahe and
Walid Maalej
(University of Hamburg, Germany)
Programming students have a widespread access to powerful Generative AI tools like ChatGPT.
While this can help understand the learning material and assist with exercises, educators are voicing more and more concerns about an overreliance on generated outputs and lack of critical thinking skills.
It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior.
To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code authoring exercise.
The task was not directly solvable by ChatGPT and required code comprehension and reasoning.
While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution.
We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions.
Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix.
Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution.
Our findings indicate that concerns about a potential decrease in programmers' agency and productivity with Generative AI are justified.
We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on Generative AI.
We also discuss potential modifications to our study design for large-scale replications.
Article Search
LLMDroid: Enhancing Automated Mobile App GUI Testing Coverage with Large Language Model Guidance
Chenxu Wang,
Tianming Liu,
Yanjie Zhao,
Minghui Yang, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China)
With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving optimal performance at $0.49 per hour using GPT-4o among tested models, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption.
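A minimal sketch of the coverage-growth trigger described above: explore autonomously while coverage keeps improving, and fall back to LLM guidance only when growth stalls. The window size, threshold, and simulated step functions are placeholders, not LLMDroid's implementation.

```python
import random

# Illustrative switch between autonomous exploration and LLM guidance.
STALL_WINDOW = 5        # number of recent coverage measurements considered
MIN_GROWTH = 0.2        # minimum coverage growth (percentage points) per window

coverage = 0.0

def autonomous_step():                    # cheap: the existing GUI testing tool
    global coverage
    coverage += random.uniform(0.0, 0.3)

def llm_guided_step():                    # expensive: summarize pages, ask for a target
    global coverage
    coverage += random.uniform(1.0, 3.0)  # pretend guidance unlocks new screens

def run_testing(steps):
    history = []
    for _ in range(steps):
        stalled = (len(history) >= STALL_WINDOW and
                   history[-1] - history[-STALL_WINDOW] < MIN_GROWTH)
        if stalled:
            llm_guided_step()
            history.clear()               # give exploration a fresh window
        else:
            autonomous_step()
        history.append(coverage)
    return coverage

print(f"final simulated coverage: {run_testing(30):.1f}%")
```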
Article Search
The Struggles of LLMs in Cross-Lingual Code Clone Detection
Micheline Bénédicte Moumoula,
Abdoul Kader Kaboré,
Jacques Klein, and
Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso)
With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and eight (08) prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
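A minimal sketch of the embedding-plus-classifier baseline the abstract describes: map fragments from different languages into one vector space, derive pair features such as cosine similarity, and train a small classifier. The embedding function here is a deterministic random stand-in for a real pre-trained code embedding model, and the toy pairs are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(fragment: str) -> np.ndarray:
    # Stand-in for a cross-lingual code embedding model (deterministic toy).
    seed = sum(ord(c) for c in fragment) % (2**32)
    return np.random.default_rng(seed).normal(size=64)

def pair_features(a: str, b: str) -> np.ndarray:
    ea, eb = embed(a), embed(b)
    cos = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))
    return np.array([cos, float(np.linalg.norm(ea - eb))])

# Toy training pairs: (Python fragment, Java fragment, is_clone).
pairs = [
    ("def add(a, b): return a + b", "int add(int a, int b) { return a + b; }", 1),
    ("def mul(a, b): return a * b", "int mul(int a, int b) { return a * b; }", 1),
    ("def add(a, b): return a + b", "void log(String m) { System.out.println(m); }", 0),
    ("def mul(a, b): return a * b", "boolean empty(List l) { return l.isEmpty(); }", 0),
]

X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))     # with real embeddings, clone pairs score closer together
```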
Article Search
Has My Code Been Stolen for Model Training? A Naturalness Based Approach to Code Contamination Detection
Haris Ali Khan,
Yanjie Jiang,
Qasim Umer,
Yuxia Zhang,
Waseem Akram, and
Hui Liu
(Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia)
It is often valuable to know whether a given piece of source code has or has not been used to train a given deep learning model. On one hand, this helps avoid data contamination problems that may exaggerate the performance of evaluated models. On the other hand, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches, known as data contamination detection, have been proposed. Such approaches often rely heavily on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than clean data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict. Consequently, difficult-to-predict contaminated data and easy-to-predict clean data are often misclassified. As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from clean code. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given piece of source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (e.g., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperforms the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on a method name suggestion task, where it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This suggests that Natural-DaCoDe could be applied to various source-code-related tasks besides code completion.
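To make the intuition concrete, the sketch below combines a naturalness signal (per-token cross-entropy under a toy unigram model) with a completion-accuracy signal: code that is hard to predict in general yet reproduced almost perfectly by a model is suspicious. The unigram model, thresholds, and perfect-echo assumption are illustrative stand-ins, not Natural-DaCoDe's actual features or classifier.

```python
import math
from collections import Counter

def cross_entropy(tokens, counts, total, vocab):
    # Per-token cross-entropy under a Laplace-smoothed unigram model.
    return -sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens) / len(tokens)

def completion_accuracy(predicted_tokens, reference_tokens):
    hits = sum(a == b for a, b in zip(predicted_tokens, reference_tokens))
    return hits / max(1, len(reference_tokens))

corpus = "def get name return self name def set name self name value".split()
counts, total, vocab = Counter(corpus), len(corpus), len(set(corpus))

snippet = "def get name return self name".split()
ce = cross_entropy(snippet, counts, total, vocab)
acc = completion_accuracy(snippet, snippet)   # pretend the model echoed it perfectly

# A hard-to-predict snippet that the model nonetheless completes with very high
# accuracy may have been memorized during training (i.e., contaminated).
suspicious = ce > 1.5 and acc > 0.9
print(f"cross-entropy={ce:.2f}, accuracy={acc:.2f}, flagged={suspicious}")
```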
Article Search
Automated and Accurate Token Transfer Identification and Its Applications in Cryptocurrency Security
Shuwei Song,
Ting Chen,
Ao Qiao,
Xiapu Luo,
Leqing Wang,
Zheyuan He,
Ting Wang,
Xiaodong Lin,
Peng He,
Wensheng Zhang, and
Xiaosong Zhang
(University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada)
Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them.
Article Search
Revolutionizing Newcomers’ Onboarding Process in OSS Communities: The Future AI Mentor
Xin Tan,
Xiao Long,
Yinghao Zhu,
Lin Shi,
Xiaoli Lian, and
Li Zhang
(Beihang University, China)
Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies for this ``AI mentor'', we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for a future AI mentor. Participants envisioned the AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as ``recommending projects based on personalized requirements'' and ``assessing and categorizing project issues by difficulty''. We also collected participants' perceptions of a prototype, named ``OSSerCopilot'', that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as ``discovering an interesting project''. Our study has the potential to revolutionize the current newcomer-expert mentorship model and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems.
Article Search
Impact of Request Formats on Effort Estimation: Are LLMs Different Than Humans?
Gül Çalıklı and
Mohammed Alhamed
(University of Glasgow, UK; Applied Behaviour Systems, UK)
Software development Effort Estimation (SEE) comprises predicting the most realistic amount of effort (e.g., in work hours) required to develop or maintain software based on incomplete, uncertain, and noisy input. Expert judgment is the dominant SEE strategy used in the industry. Yet, expert-based judgment can provide inaccurate effort estimates, leading to projects’ poor budget planning and cost and time overruns, negatively impacting the world economy. Large Language Models (LLMs) are good candidates to assist software professionals in effort estimation. However, effectively leveraging them for SEE requires thoroughly investigating their limitations and to what extent they overlap with those of (human) software practitioners. One primary limitation of LLMs is the sensitivity of their responses to prompt changes. Similarly, empirical studies showed that changes in the request format (e.g., rephrasing) could impact (human) software professionals’ effort estimates. This paper reports the first study that replicates a series of SEE experiments, which were initially carried out with software professionals (humans) in the literature. Our study aims to investigate how LLMs’ effort estimates change due to the transition from the traditional request format (i.e., “How much effort is required to complete X?”) to the alternative request format (i.e., “How much can be completed in Y work hours?”). Our experiments involved three different LLMs (GPT-4, Gemini 1.5 Pro, Llama 3.1) and 88 software project specifications (per treatment in each experiment), resulting in 880 prompts in total, which we prepared using 704 user stories from three large-scale open-source software projects (Hyperledger Fabric, Mulesoft Mule, Spring XD). Our findings align with the original experiments conducted with software professionals: the first four experiments showed that LLMs provide lower effort estimates when transitioning from the traditional to the alternative request format. The first and fifth experiments detected that LLMs display patterns analogous to anchoring bias, a human cognitive bias defined as the tendency to stick to the anchor (i.e., the “Y work hours” in the alternative request format). Our findings provide crucial insights into facilitating future human-AI collaboration and prompt designs for improved effort estimation accuracy.
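The two request formats under study can be written down as simple prompt templates; the wording below paraphrases the abstract and is not the authors' exact prompt text.

```python
# Paraphrased request formats from the abstract (not the authors' exact prompts).
TRADITIONAL = ("How much effort (in work hours) is required to complete the "
               "following user story?\n\n{story}")
ALTERNATIVE = ("How much of the following user story can be completed in "
               "{hours} work hours?\n\n{story}")

story = "As a user, I want to reset my password via email so that I can regain access."
print(TRADITIONAL.format(story=story))
print(ALTERNATIVE.format(hours=8, story=story))
```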
Article Search
Mitigating Emergent Malware Label Noise in DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Guohan Zhang,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China)
Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems.
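A minimal sketch of the combination described above: estimate each sample's predictive uncertainty (here via Monte Carlo dropout-style stochastic scoring) and hand the scores to an unsupervised anomaly detector, treating flagged outliers as candidate label noise. The toy features, scoring function, and contamination rate are placeholders, not MalCleanse's pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

def mc_dropout_scores(features, passes=20, drop=0.3):
    # Stand-in for T stochastic forward passes of a dropout-enabled detector:
    # each pass randomly masks features before a fixed scoring projection.
    w = rng.normal(size=features.shape[1])
    outs = []
    for _ in range(passes):
        mask = rng.random(features.shape) > drop
        outs.append((features * mask) @ w)
    return np.stack(outs).std(axis=0)      # per-sample predictive uncertainty

apps = rng.normal(size=(200, 16))          # toy feature vectors of labeled APKs
uncertainty = mc_dropout_scores(apps)

detector = IsolationForest(contamination=0.1, random_state=0)
flags = detector.fit_predict(uncertainty.reshape(-1, 1))   # -1 marks anomalies
print(f"{(flags == -1).sum()} samples flagged as potential emergent-malware label noise")
```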
Article Search
Detecting Metadata-Related Bugs in Enterprise Applications
Md Mahir Asef Kabir,
Xiaoyin Wang, and
Na Meng
(Virginia Tech, USA; University of Texas at San Antonio, USA)
When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to correctly use metadata, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and the resulting defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL---a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck---a tool that takes in RSL rules and EAs to check for rule violations.
With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers that scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 156 bugs in the second dataset, 53 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata.
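As a simplified illustration of the kind of cross-file consistency rule MeCheck enforces, the sketch below checks that every bean class declared in a Spring-style XML snippet refers to a class known to the project. The XML, class list, and rule are stand-ins for RSL rules and the real analyzers.

```python
import xml.etree.ElementTree as ET

# Toy metadata: one bean class name contains a typo ("MailServce").
XML_CONFIG = """
<beans>
  <bean id="userService" class="com.example.UserService"/>
  <bean id="mailService" class="com.example.MailServce"/>
</beans>
"""

PROJECT_CLASSES = {"com.example.UserService", "com.example.MailService"}

def check_bean_classes(xml_text, known_classes):
    violations = []
    for bean in ET.fromstring(xml_text).iter("bean"):
        cls = bean.get("class")
        if cls not in known_classes:
            violations.append((bean.get("id"), cls))
    return violations

for bean_id, cls in check_bean_classes(XML_CONFIG, PROJECT_CLASSES):
    print(f"bean '{bean_id}' refers to unknown class {cls}")
```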
Article Search
An Adaptive Language-Agnostic Pruning Method for Greener Language Models for Code
Mootez Saad,
José Antonio Hernández López,
Boqi Chen,
Dániel Varró, and
Tushar Sharma
(Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden)
Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain as significant challenges.
This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models.
The proposed method offers a pluggable layer that can be integrated with all Transformer-based models.
With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to three times smaller than their initial size, resulting in significantly reduced computational load.
Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves this reduction in computational resources while maintaining up to 98.1% of the original predictive performance.
These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance,
contributing to the overall sustainability of their adoption in software development.
Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE.
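A minimal sketch of attention-based sequence compression between Transformer layers: keep only the tokens that receive the most attention mass so that later layers process a shorter sequence. The random attention matrix and fixed keep ratio are illustrative; ALPINE's pluggable layer decides adaptively when and how much to compress.

```python
import numpy as np

def compress(hidden_states, attention, keep_ratio=0.5):
    """hidden_states: (seq_len, dim); attention: (seq_len, seq_len), rows sum to 1."""
    received = attention.sum(axis=0)             # attention mass each token receives
    k = max(1, int(len(received) * keep_ratio))
    keep = np.sort(np.argsort(received)[-k:])    # top-k tokens, original order kept
    return hidden_states[keep], keep

rng = np.random.default_rng(1)
seq_len, dim = 12, 8
h = rng.normal(size=(seq_len, dim))
attn = rng.random((seq_len, seq_len))
attn /= attn.sum(axis=1, keepdims=True)          # make rows stochastic

compressed, kept = compress(h, attn, keep_ratio=1/3)
print(f"kept {len(kept)}/{seq_len} tokens at positions {kept.tolist()}")
```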
Article Search
10 Years Later: Revisiting How Developers Search for Code
Kathryn T. Stolee,
Tobias Welp,
Caitlin Sadowski, and
Sebastian Elbaum
(North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA)
Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development.
Article Search
IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models
Sayem Mohammad Imtiaz,
Astha Singh,
Fraol Batole, and
Hridesh Rajan
(Iowa State University, USA; Tulane University, USA)
Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
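A minimal sketch of the selective-repair idea: rank layers by a sensitivity score (here, a stand-in for gradient norms measured on error-exposing data) and restrict repair fine-tuning to the top slice, freezing the rest. The layer names, scores, and 20% fraction are placeholders, and the sketch omits IRepair's dynamic slicing.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend per-layer gradient norms measured on toxicity-exposing prompts.
layer_grad_norms = {f"transformer.h.{i}": float(rng.gamma(2.0, 1.0)) for i in range(12)}

def select_layers_for_repair(grad_norms, fraction=0.2):
    k = max(1, int(len(grad_norms) * fraction))
    ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)
    return ranked[:k]

to_repair = select_layers_for_repair(layer_grad_norms)
print("layers selected for targeted repair:", to_repair)
# The remaining layers would stay frozen during the repair fine-tuning step.
```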
Preprint
Clone Detection for Smart Contracts: How Far Are We?
Zuobin Wang,
Zhiyuan Wan,
Yujing Chen,
Yun Zhang,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore)
In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research.
Article Search
Dissecting Real-World Cross-Language Bugs
Haoran Yang and
Haipeng Cai
(Washington State University, USA; SUNY Buffalo, USA)
Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages the systems are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed at discovering cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them along five aspects, namely their symptoms, locations, manifestation, root causes, and fixes, as well as the relationships among these aspects. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects, classified from 54,356 relevant code commits in their GitHub repositories, we reveal not only the bug characteristics along those five aspects but also how they compare between the two top language combinations in the multilingual world (Python-C and Java-C). In addition to the findings of the study as well as its enabling tools and datasets, we also provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs.
Preprint
Info
Less Is More: On the Importance of Data Quality for Unit Test Generation
Junwei Zhang,
Xing Hu,
Shan Gao,
Xin Xia,
David Lo, and
Shanping Li
(Zhejiang University, China; Huawei, China; Singapore Management University, Singapore)
Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, and categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest to two widely used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset improves their performance by 67% in branch coverage, on average, on the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.
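The three-filter pipeline can be pictured with a small Python sketch; the concrete rules, the coverage_model interface, and all names below are hypothetical placeholders rather than CleanTest's actual implementation.

import re

def syntax_filter(pair):
    """Rule-based syntax check: the test must be annotated and contain an assertion."""
    focal_method, test = pair
    return "@Test" in test and re.search(r"\bassert\w*\s*\(", test) is not None

def relevance_filter(pair):
    """Rule-based relevance check: the test should mention the focal method's name."""
    focal_method, test = pair
    match = re.search(r"\b(\w+)\s*\(", focal_method)   # crude guess at the method name
    return bool(match) and match.group(1) in test

def coverage_filter(pair, coverage_model):
    """Model-based check: keep pairs the (assumed) model predicts to cover the focal method."""
    return coverage_model.predict(pair) >= 0.5

def clean(dataset, coverage_model):
    """Apply the three filters in sequence to produce a noise-reduced dataset."""
    return [pair for pair in dataset
            if syntax_filter(pair) and relevance_filter(pair)
            and coverage_filter(pair, coverage_model)]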
Article Search
Protecting Privacy in Software Logs: What Should Be Anonymized?
Roozbeh Aghili,
Heng Li, and
Foutse Khomh
(Polytechnique Montréal, Canada)
Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To mitigate this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by performing an analysis of 25 publicly available log datasets to identify potentially sensitive attributes. Based on the result of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues, while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs.
Article Search
MiSum: Multi-modality Heterogeneous Code Graph Learning for Multi-intent Binary Code Summarization
Kangchen Zhu,
Zhiliang Tian,
Shangwen Wang,
Weiguo Chen,
Zixuan Dong,
Mingyue Leng, and
Xiaoguang Mao
(National University of Defense Technology, China)
The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis.
Article Search
Info
Artifacts Available
SemBIC: Semantic-Aware Identification of Bug-Inducing Commits
Xiao Chen,
Hengcheng Zhu,
Jialun Cao,
Ming Wen, and
Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China)
Debugging can be greatly facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (a.k.a. the bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We find that SEMBIC can identify BICs with high accuracy: it ranks the BIC first for 88 out of 199 bugs and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively.
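The core ranking idea, scoring each candidate commit by the semantic change it introduces on the failing path and evaluating with MRR, might look roughly like the following Python sketch; semantic_delta stands in for the paper's three fine-grained semantic properties and is an assumed hook, not something specified here.

def rank_bic_candidates(commits, failing_path, semantic_delta):
    """Rank candidate commits by the semantic change they introduce on the failing path."""
    scored = [(semantic_delta(commit, failing_path), commit) for commit in commits]
    return [commit for _, commit in sorted(scored, key=lambda pair: pair[0], reverse=True)]

def mean_reciprocal_rank(ranked_lists, true_bics):
    """Standard MRR: average of 1 / rank of the true BIC over all bugs (0 if it is missing)."""
    reciprocal = [1.0 / (ranking.index(bic) + 1)
                  for ranking, bic in zip(ranked_lists, true_bics) if bic in ranking]
    return sum(reciprocal) / len(ranked_lists)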
Article Search
Artifacts Available
Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Weisong Sun,
Yuchen Chen,
Chunrong Fang,
Yebo Feng,
Yuan Xiao,
An Guo,
Quanjun Zhang,
Zhenyu Chen,
Baowen Xu, and
Yang Liu
(Nanyang Technological University, Singapore; Nanjing University, China)
Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and then unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, since backdoor triggers can be viewed as backdoor (adversarial) perturbations, EliBadCode introduces a sample-specific trigger position identification method that reduces the interference of non-backdoor (adversarial) perturbations during subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify it. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline).
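The final unlearning step alone could look roughly like the following Python sketch, which fine-tunes the backdoored model so that inputs stamped with the inverted trigger keep their correct labels; insert_trigger, the data layout, and the training loop are assumptions made for illustration, not EliBadCode's implementation, and the trigger inversion itself is omitted.

import torch
import torch.nn.functional as F

def unlearn_backdoor(model, optimizer, clean_batches, inverted_trigger, insert_trigger, epochs=1):
    """Fine-tune so that trigger-stamped inputs are still classified with their true labels."""
    model.train()
    for _ in range(epochs):
        for inputs, labels in clean_batches:
            triggered = insert_trigger(inputs, inverted_trigger)   # stamp the inverted trigger
            logits = torch.cat([model(inputs), model(triggered)])  # clean and triggered views
            targets = torch.cat([labels, labels])                  # triggered inputs keep true labels
            loss = F.cross_entropy(logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()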
Article Search
Understanding Debugging as Episodes: A Case Study on Performance Bugs in Configurable Software Systems
Max Weber,
Alina Mailach,
Sven Apel,
Janet Siegmund,
Raimund Dachselt, and
Norbert Siegmund
(Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany)
Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and practitioners and teachers guidance on successful debugging strategies.
Article Search
Detecting and Reducing the Factual Hallucinations of Large Language Models with Metamorphic Testing
Weibin Wu,
Yuhang Cao,
Ning Yi,
Rongyi Ou, and
Zibin Zheng
(Sun Yat-sen University, China)
Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation.
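The underlying consistency check can be sketched in a few lines of Python; ask_llm, transforms, and agree are hypothetical hooks for the model, the metamorphic relations, and the answer comparison, and the majority-vote threshold is an illustrative choice rather than DrHall's actual decision rule.

def is_likely_hallucination(question, ask_llm, transforms, agree):
    """Ask the original question and metamorphic variants of it; flag the answer as a likely
    hallucination when most variants disagree with the original answer."""
    baseline = ask_llm(question)
    variant_answers = [ask_llm(transform(question)) for transform in transforms]
    disagreements = sum(1 for answer in variant_answers if not agree(baseline, answer))
    return disagreements > len(variant_answers) // 2, baseline

The intuition matches the abstract: a grounded answer tends to survive re-execution along different paths, while a hallucinated one drifts.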
Article Search
On the Unnecessary Complexity of Names in X.509 and Their Impact on Implementations
Yuteng Sun,
Joyanta Debnath,
Wenzheng Hong,
Omar Chowdhury, and
Sze Yiu Chau
(Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China)
The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security.
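As a rough illustration of the kind of pre-processing the standard expects before comparing directory-string names (Unicode normalization, insignificant-whitespace handling, and case folding), a simplified Python sketch follows; it is deliberately not the full LDAP string-preparation profile and omits the encoding-specific rules the paper examines.

import unicodedata

def normalize_directory_string(value: str) -> str:
    """Roughly: normalize the Unicode form, collapse internal whitespace, fold case."""
    value = unicodedata.normalize("NFKC", value)
    value = " ".join(value.split())   # trim ends and collapse runs of whitespace
    return value.casefold()

def names_match(a: str, b: str) -> bool:
    """Compare two directory-string attribute values after normalization."""
    return normalize_directory_string(a) == normalize_directory_string(b)

Even this small amount of normalization is easy to skip or get subtly wrong, which is the class of implementation divergence the paper's tests expose.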
Article Search
RePurr: Automated Repair of Block-Based Learners’ Programs
Sebastian Schweikl and
Gordon Fraser
(University of Passau, Germany)
Programming is increasingly taught using dedicated block-based programming environments such as Scratch. While the use of blocks instead of text prevents syntax errors, learners can still make semantic mistakes, implying a need for feedback and help. Since teachers may be overwhelmed by help requests in a classroom, may not have the required programming education themselves, and may simply not be available in independent learning scenarios, automated hint generation is desirable. Automated program repair can provide the foundation for automated hints, but relies on multiple assumptions: (1) Program repair usually aims to produce localized patches for fixing single bugs, but learners may fundamentally misunderstand programming concepts and tasks or request help for substantially incomplete programs. (2) Software tests are required to guide the search and to localize broken statements, but test suites for block-based programs differ from those considered in past research on fault localization and repair: they consist of system tests, where very few tests are sufficient to fully cover the code. At the same time, these tests have vastly longer execution times, caused by the use of animations and interactions in Scratch programs, thus inhibiting the applicability of metaheuristic search. (3) The plastic surgery hypothesis assumes that the code necessary for repairs already exists in the codebase; block-based programs tend to be small and may lack this necessary redundancy. To study whether automated program repair of block-based programs is nevertheless feasible, in this paper we introduce, to the best of our knowledge, the first automated program repair approach for Scratch programs based on evolutionary search. Our RePurr prototype includes novel refinements of fault localization to compensate for the limited guidance provided by the test suites, restores the plastic surgery hypothesis by exploiting the fact that a learning scenario provides model and student solutions as alternatives, and uses parallelization and accelerated executions to reduce the cost of fitness evaluations. An empirical evaluation of RePurr on a set of real learners' programs confirms the anticipated challenges, but also demonstrates that repair can nonetheless effectively improve and fix learners' programs, thus enabling automated generation of hints and feedback for learners.
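The overall generate-and-validate loop of evolutionary repair can be sketched in Python as follows; the mutate, crossover, and fitness operators for Scratch programs are left abstract, and nothing here reflects RePurr's concrete operators or its parallelized, accelerated test execution.

import random

def repair(program, tests, mutate, crossover, fitness, generations=100, pop_size=20):
    """Generate-and-validate loop: evolve candidate programs until one passes every test."""
    population = [program] + [mutate(program) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda cand: fitness(cand, tests), reverse=True)
        if all(test(ranked[0]) for test in tests):      # plausible repair found
            return ranked[0]
        parents = ranked[:pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return None                                          # no repair within the budget

In the learning setting described above, the model solution can seed the mutation and crossover material, which is one way the plastic surgery hypothesis can be recovered despite small student programs.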
Article Search
Why the Proof Fails in Different Versions of Theorem Provers: An Empirical Study of Compatibility Issues in Isabelle
Xiaokun Luan,
David Sanan,
Zhe Hou,
Qiyuan Xu,
Chengwei Liu,
Yufan Cai,
Yang Liu, and
Meng Sun
(Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore)
Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues.
Article Search
Artifacts Available
Improving Graph Learning-Based Fault Localization with Tailored Semi-supervised Learning
Chun Li,
Hui Li,
Zhong Li,
Minxue Pan, and
Xuandong Li
(Nanjing University, China; Samsung Electronics, China)
Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods follow a supervised learning paradigm and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations lead to sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. By training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all methods in comparison.
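The pseudo-labelling step can be illustrated with a short PyTorch-style sketch that keeps only the unlabeled graphs on which the model is confident; the confidence threshold and the per-node suspiciousness reading are illustrative assumptions, not LEGATO's actual estimation procedure, and the attention-based sub-graph augmentation is omitted.

import torch

def pseudo_label(model, unlabeled_graphs, threshold=0.9):
    """Keep only unlabeled graphs on which the model's per-node predictions are confident,
    and use those predictions as pseudo-labels for further training."""
    model.eval()
    selected = []
    with torch.no_grad():
        for graph in unlabeled_graphs:
            probs = torch.softmax(model(graph), dim=-1)   # per-node suspiciousness distribution
            confidence, labels = probs.max(dim=-1)
            if confidence.mean().item() >= threshold:
                selected.append((graph, labels))
    return selected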
Article Search
A Mixed-Methods Study of Model-Based GUI Testing in Real-World Industrial Settings
Shaoheng Cao,
Renyi Chen,
Wenhua Yang,
Minxue Pan, and
Xuandong Li
(Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China)
Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments, particularly regarding its extended effects, remain limited and yield unclear results. This gap likely stems from the difficulty of establishing a study environment in which MBT can be implemented and applied in production settings and its impact evaluated over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods (spanning controlled experiments, questionnaire surveys, and interviews), our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes.
Article Search