Proceedings of the ACM on Software Engineering, Volume 2, Number FSE,
June 23–27, 2025,
Trondheim, Norway
Frontmatter
Papers
MendelFuzz: The Return of the Deterministic Stage
Han Zheng,
Flavio Toffalini,
Marcel Böhme, and
Mathias Payer
(EPFL, Switzerland; Ruhr-Universität Bochum, Germany; MPI-SP, Germany)
Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage and instead applying multiple mutation operators at the same time to a single input achieves more coverage, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision to drop deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%) and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called MendelFuzz, to identify these critical seeds / bytes so as to prune a large number of unnecessary inputs. MendelFuzz retains the benefits of the classic deterministic stage while enumerating only a tiny part of the total deterministic state space. We evaluate our MendelFuzz implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that MendelFuzz outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. MendelFuzz also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, MendelFuzz has been independently evaluated and integrated into AFL++ as the default option.
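The trade-off described in this abstract can be sketched concretely. The Python fragment below is our illustration, not the paper's implementation: a walking bit-flip operator applied exactly once per bit position (the classic deterministic stage), with optional pruning to a small set of hypothetical critical byte offsets, the kind of reduction that the reported 0.5% figure motivates.

```python
# Our sketch of an AFL-style deterministic stage (not the paper's code):
# a walking bit-flip operator, applied exactly once per bit position,
# optionally restricted to a set of critical byte offsets.

def deterministic_stage(seed: bytes, critical_offsets=None):
    """Yield one mutant per flipped bit, skipping non-critical bytes.

    `critical_offsets` stands in for the small fraction of bytes (~0.5% per
    the paper) responsible for most coverage-increasing inputs.
    """
    keep = set(critical_offsets) if critical_offsets is not None else set(range(len(seed)))
    for bit in range(len(seed) * 8):
        byte = bit // 8
        if byte not in keep:
            continue  # pruning: skip every mutant of a non-critical byte
        mutant = bytearray(seed)
        mutant[byte] ^= 1 << (bit % 8)  # corrupt exactly one bit of the seed
        yield bytes(mutant)

seed = b"GIF89a\x00\x01"
print(sum(1 for _ in deterministic_stage(seed)))          # full stage: 64 mutants
print(sum(1 for _ in deterministic_stage(seed, {0, 3})))  # pruned stage: 16 mutants
```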
On-Demand Scenario Generation for Testing Automated Driving Systems
Songyang Yan,
Xiaodong Zhang,
Kunkun Hao,
Haojie Xin,
Yonggang Luo,
Jucheng Yang,
Ming Fan,
Chao Yang,
Jun Sun, and
Zijiang Yang
(Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China)
Smart Contract Fuzzing Towards Profitable Vulnerabilities
Ziqiao Kong,
Cen Zhang,
Maoyi Xie,
Ming Hu,
Yue Xue,
Ye Liu,
Haijun Wang, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China)
COFFE: A Code Efficiency Benchmark for Code Generation
Yun Peng,
Jun Wan,
Yichen Li, and
Xiaoxue Ren
(Chinese University of Hong Kong, China; Zhejiang University, China)
Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions from detailed natural-language descriptions. Many research efforts are devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite this focus on correctness, the time efficiency of LLM-generated code solutions remains under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation because their test cases cannot distinguish the time efficiency of different code solutions well. Moreover, current execution time measurements are neither stable nor comprehensive, threatening the validity of time efficiency evaluation.
To address these challenges, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts, along with two new test case formats that improve generation accuracy. As the time evaluation metric, we propose efficient@k, based on CPU instruction count, to ensure a stable and sound comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings, from which we draw implications for LLM researchers and software practitioners to facilitate future research on and usage of LLMs in code generation.
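The measurement idea, counting retired CPU instructions instead of wall-clock time, can be approximated with standard tools. The sketch below is our illustration under stated assumptions, not COFFE's harness, and it does not reproduce the efficient@k formula; it only shows the stable measurement primitive using Linux perf (the solution file names are hypothetical).

```python
# Our illustration of instruction-count-based timing (not COFFE's harness, and
# not the efficient@k formula itself): run a candidate solution under Linux
# `perf stat` and read back the retired user-space instruction count, which is
# far more stable across runs than wall-clock time.
import subprocess

def instruction_count(cmd: list[str]) -> int:
    """Return the user-space instruction count for one execution of `cmd`."""
    result = subprocess.run(
        ["perf", "stat", "-e", "instructions:u", "-x", ","] + cmd,
        capture_output=True, text=True, check=True,
    )
    # With -x, perf writes CSV rows such as "12345,,instructions:u,..." to stderr.
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("perf reported no instruction count")

# Hypothetical candidate solutions for one benchmark problem: the solution
# retiring fewer instructions on the same input is the more time-efficient one.
a = instruction_count(["python3", "solution_a.py"])
b = instruction_count(["python3", "solution_b.py"])
print(f"solution_a executes {a / b:.2f}x the instructions of solution_b")
```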
An Empirical Study of Suppressed Static Analysis Warnings
Huimin Hu,
Yingying Wang,
Julia Rubin, and
Michael Pradel
(University of Stuttgart, Germany; University of British Columbia, Canada)
Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the risk of unintentional suppression, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries.
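For readers unfamiliar with the mechanism, suppressions in Python take the form of special trailing comments. The directives in the sketch below are real syntaxes for flake8, Pylint, and mypy; the surrounding code is ours, and the last line illustrates the paper's notion of a "useless" suppression that matches no warning yet could hide a future one.

```python
# The directives below are real suppression syntaxes for flake8, Pylint, and
# mypy; the surrounding code is only illustrative.

# flake8 reports E401 ("multiple imports on one line"); noqa silences it here.
import os, sys  # noqa: E401

def load_config(path):
    # Pylint would report W0123 ("eval-used") here; the comment silences it.
    return eval(open(path).read())  # pylint: disable=eval-used

# mypy flags assigning a str to an int variable; the directive silences only
# the "assignment" error code on this one line.
port: int = os.environ.get("PORT", "8080")  # type: ignore[assignment]

# A "useless" suppression in the paper's sense: no eval occurs on this line,
# so the directive matches no warning -- yet it would hide a future one.
print(os.getcwd())  # pylint: disable=eval-used
```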
A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems
Yuntianyi Chen,
Yuqi Huai,
Yirui He,
Shilong Li,
Changnam Hong,
Qi Alfred Chen, and
Joshua Garcia
(University of California at Irvine, USA)
As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in ADSes is crucial given the unique combination of real-time decision-making requirements and the high stakes of operational failures. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, studying 1,331 bug fixes in terms of their symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to eliminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances.
Artifacts Available
Code Change Intention, Development Artifact and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM
Xu Yang,
Wenhan Zhu,
Michael Pacheco,
Jiayuan Zhou,
Shaowei Wang,
Xing Hu, and
Kui Liu
(University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China)
Understanding and Characterizing Mock Assertions in Unit Tests
Hengcheng Zhu,
Valerio Terragni,
Lili Wei,
Shing-Chi Cheung,
Jiarong Wu, and
Yepang Liu
(Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China)
Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly used to validate specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that supports mock assertions.
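As a concrete illustration, the sketch below shows a mock assertion validating an interaction that a traditional return-value assertion cannot observe. It uses Python's unittest.mock (the studied projects are in Java, typically with Mockito, but the concept is identical); all names in the example are ours.

```python
# A mock assertion with Python's unittest.mock (the studied projects use Java
# and Mockito, but the concept is identical); all names here are illustrative.
from unittest.mock import MagicMock

def sync_user(user: dict, gateway) -> None:
    """Pushes dirty users to an external service; returns nothing observable."""
    if user["dirty"]:
        gateway.push(user["id"], user["name"])

gateway = MagicMock()
sync_user({"id": 7, "name": "ada", "dirty": True}, gateway)

# Mock assertion: validates the interaction with the external resource,
# which no return-value assertion could observe.
gateway.push.assert_called_once_with(7, "ada")

gateway.reset_mock()
sync_user({"id": 8, "name": "bob", "dirty": False}, gateway)
# Also validates control-flow logic: the clean-user path must not call out.
gateway.push.assert_not_called()
```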
Artifacts Available
Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead
Yanqi Su,
Zhenchang Xing,
Chong Wang,
Chunyang Chen,
Sherry (Xiwei) Xu,
Qinghua Lu, and
Liming Zhu
(Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; CSIRO, Australia; UNSW, Australia)
Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation
Mengzhuo Chen,
Zhe Liu,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Jun Hu, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; n.n., China; TU Munich, Germany)
Automated and Accurate Token Transfer Identification and Its Applications in Cryptocurrency Security
Shuwei Song,
Ting Chen,
Ao Qiao,
Xiapu Luo,
Leqing Wang,
Zheyuan He,
Ting Wang,
Xiaodong Lin,
Peng He,
Wensheng Zhang, and
Xiaosong Zhang
(University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada)
Mitigating Emergent Malware Label Noise in DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Guohan Zhang,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China)
Detecting Metadata-Related Bugs in Enterprise Applications
Md Mahir Asef Kabir,
Xiaoyin Wang, and
Na Meng
(Virginia Tech, USA; University of Texas at San Antonio, USA)
When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to use metadata correctly, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and the resulting defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL---a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck---a tool that takes in RSL rules and EAs to check for rule violations.
With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers, which scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 156 bugs in the second dataset, 53 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata.
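The flavor of a cross-file consistency rule can be illustrated without RSL itself, whose syntax the paper defines. The sketch below is neither RSL nor MeCheck: it checks, as a Python analog, that every class named in an XML bean definition resolves to an importable class, with Python's import machinery standing in for the Java lookup a real checker would perform. The XML content is hypothetical.

```python
# Illustrative cross-file consistency rule in the spirit of MeCheck (this is
# neither RSL nor MeCheck): every class named in an XML bean definition must
# resolve to a real class. Python's import machinery stands in for the Java
# lookup the actual tool performs; the XML content is hypothetical.
import importlib
import xml.etree.ElementTree as ET

CONFIG = """
<beans>
  <bean id="clock" class="datetime.datetime"/>
  <bean id="store" class="collections.OrderedDiict"/>
</beans>
"""  # "OrderedDiict" is a deliberate typo, the kind of bug such rules catch

def check_bean_classes(xml_text: str) -> list[str]:
    """Report every bean whose `class` attribute does not resolve."""
    violations = []
    for bean in ET.fromstring(xml_text).iter("bean"):
        fqcn = bean.get("class", "")
        module_name, _, cls_name = fqcn.rpartition(".")
        try:
            getattr(importlib.import_module(module_name), cls_name)
        except (ImportError, AttributeError, ValueError):
            violations.append(f"bean '{bean.get('id')}': class '{fqcn}' does not resolve")
    return violations

print(check_bean_classes(CONFIG))  # flags only the misspelled "store" bean
```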
Clone Detection for Smart Contracts: How Far Are We?
Zuobin Wang,
Zhiyuan Wan,
Yujing Chen,
Yun Zhang,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Hangzhou City University, China; Singapore Management University, Singapore; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China)
In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research.
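To make the comparison concrete, the sketch below shows a minimal token-based detector, one of the classic technique families such studies evaluate. It is our illustration, not SourceREClone or any of the five evaluated tools: identifiers and literals are normalized so that renamed (Type-2) clones still match.

```python
# Minimal token-based clone detection sketch (our illustration, not
# SourceREClone): normalize identifiers and literals, then score the
# overlap between the token bags of two code fragments.
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(code: str) -> Counter:
    """Map identifiers to ID and numbers to NUM so renamed clones match."""
    keywords = {"function", "return", "returns", "public", "uint256"}
    bag = Counter()
    for tok in TOKEN.findall(code):
        if tok.isdigit():
            bag["NUM"] += 1                  # abstract away literal values
        elif tok[0].isalpha() or tok[0] == "_":
            bag[tok if tok in keywords else "ID"] += 1  # abstract identifiers
        else:
            bag[tok] += 1                    # keep operators and punctuation
    return bag

def similarity(a: str, b: str) -> float:
    """Jaccard-style overlap of the two token bags, in [0, 1]."""
    x, y = normalize(a), normalize(b)
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 1.0

f1 = "function add(uint256 a, uint256 b) public returns (uint256) { return a + b; }"
f2 = "function sum(uint256 x, uint256 y) public returns (uint256) { return x + y; }"
print(similarity(f1, f2))  # 1.0: a Type-2 clone despite renamed identifiers
```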
Dissecting Real-World Cross-Language Bugs
Haoran Yang and
Haipeng Cai
(Washington State University, USA; SUNY Buffalo, USA)
Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages they are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed at discovering cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them along five aspects, namely their symptoms, locations, manifestation, root causes, and fixes, as well as the relationships among these aspects. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects, drawn from 54,356 relevant code commits in their GitHub repositories, we reveal not only the bug characteristics along those five aspects but also how they compare between the two top language combinations in the multilingual world (Python-C and Java-C). In addition to the study's findings and its enabling tools and datasets, we provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs.
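A minimal example of the stealthiness described here: in Python-C interoperation via ctypes, an undeclared foreign return type silently corrupts values, even though neither side is wrong in isolation. The example below is ours, not one of the study's 400 bugs, and assumes a Linux or macOS system where libm is available.

```python
# Illustration of a stealthy Python-C cross-language bug (our example, not one
# of the study's bugs): ctypes assumes every C function returns int, so a
# double slips through the language boundary silently corrupted.
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # C math library

# BUG: no declared return type. C's sqrt returns a double, but ctypes reads
# the result as int. Neither the C library nor this call is wrong on its own.
print(libm.sqrt(ctypes.c_double(9.0)))   # garbage integer, not 3.0

# FIX: declare the foreign signature explicitly at the language boundary.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))                    # 3.0
```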
Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Weisong Sun,
Yuchen Chen,
Chunrong Fang,
Yebo Feng,
Yuan Xiao,
An Guo,
Quanjun Zhang,
Zhenyu Chen,
Baowen Xu, and
Yang Liu
(Nanjing University, Singapore; Nanjing University, China; Nanyang Technological University, Singapore)
Why the Proof Fails in Different Versions of Theorem Provers: An Empirical Study of Compatibility Issues in Isabelle
Xiaokun Luan,
David Sanan,
Zhe Hou,
Qiyuan Xu,
Chengwei Liu,
Yufan Cai,
Yang Liu, and
Meng Sun
(Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; National University of Singapore, Singapore)