Proceedings of the ACM on Software Engineering, Volume 2, Number FSE,
June 23–27, 2025,
Trondheim, Norway
Frontmatter
Papers
MendelFuzz: The Return of the Deterministic Stage
Han Zheng,
Flavio Toffalini,
Marcel Böhme, and
Mathias Payer
(EPFL, Switzerland; Ruhr-Universität Bochum, Germany; MPI-SP, Germany)
Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage and instead applying multiple mutation operators at the same time to a single input achieves more coverage, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision to drop deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%) and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called MendelFuzz, to identify these critical seeds / bytes so as to prune a large number of unnecessary inputs. MendelFuzz retains the benefits of the classic deterministic stage while enumerating only a tiny part of the total deterministic state space. We evaluate our MendelFuzz implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that MendelFuzz outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. MendelFuzz also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, MendelFuzz has been independently evaluated and integrated into AFL++ as the default option.
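The trade-off described in this abstract can be sketched concretely. The Python fragment below is our illustration, not the paper's implementation: a walking bit-flip operator applied exactly once per bit position (the classic deterministic stage), with optional pruning to a small set of hypothetical critical byte offsets, the kind of reduction that the reported 0.5% figure motivates.

```python
# Our sketch of an AFL-style deterministic stage (not the paper's code):
# a walking bit-flip operator, applied exactly once per bit position,
# optionally restricted to a set of critical byte offsets.

def deterministic_stage(seed: bytes, critical_offsets=None):
    """Yield one mutant per flipped bit, skipping non-critical bytes.

    `critical_offsets` stands in for the small fraction of bytes (~0.5% per
    the paper) responsible for most coverage-increasing inputs.
    """
    keep = set(critical_offsets) if critical_offsets is not None else set(range(len(seed)))
    for bit in range(len(seed) * 8):
        byte = bit // 8
        if byte not in keep:
            continue  # pruning: skip every mutant of a non-critical byte
        mutant = bytearray(seed)
        mutant[byte] ^= 1 << (bit % 8)  # corrupt exactly one bit of the seed
        yield bytes(mutant)

seed = b"GIF89a\x00\x01"
print(sum(1 for _ in deterministic_stage(seed)))          # full stage: 64 mutants
print(sum(1 for _ in deterministic_stage(seed, {0, 3})))  # pruned stage: 16 mutants
```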
On-Demand Scenario Generation for Testing Automated Driving Systems
Songyang Yan,
Xiaodong Zhang,
Kunkun Hao,
Haojie Xin,
Yonggang Luo,
Jucheng Yang,
Ming Fan,
Chao Yang,
Jun Sun, and
Zijiang Yang
(Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China)
Smart Contract Fuzzing Towards Profitable Vulnerabilities
Ziqiao Kong,
Cen Zhang,
Maoyi Xie,
Ming Hu,
Yue Xue,
Ye Liu,
Haijun Wang, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China)
COFFE: A Code Efficiency Benchmark for Code Generation
Yun Peng,
Jun Wan,
Yichen Li, and
Xiaoxue Ren
(Chinese University of Hong Kong, China; Zhejiang University, China)
Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions from detailed natural-language descriptions. Many research efforts are devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite this focus on correctness, the time efficiency of LLM-generated code solutions remains under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation because their test cases cannot distinguish the time efficiency of different code solutions well. Moreover, current execution time measurements are neither stable nor comprehensive, threatening the validity of time efficiency evaluation.
To address these challenges, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts, along with two new test case formats that improve generation accuracy. As the time evaluation metric, we propose efficient@k, based on CPU instruction count, to ensure a stable and sound comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings, from which we draw implications for LLM researchers and software practitioners to facilitate future research on and usage of LLMs in code generation.
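The measurement idea, counting retired CPU instructions instead of wall-clock time, can be approximated with standard tools. The sketch below is our illustration under stated assumptions, not COFFE's harness, and it does not reproduce the efficient@k formula; it only shows the stable measurement primitive using Linux perf (the solution file names are hypothetical).

```python
# Our illustration of instruction-count-based timing (not COFFE's harness, and
# not the efficient@k formula itself): run a candidate solution under Linux
# `perf stat` and read back the retired user-space instruction count, which is
# far more stable across runs than wall-clock time.
import subprocess

def instruction_count(cmd: list[str]) -> int:
    """Return the user-space instruction count for one execution of `cmd`."""
    result = subprocess.run(
        ["perf", "stat", "-e", "instructions:u", "-x", ","] + cmd,
        capture_output=True, text=True, check=True,
    )
    # With -x, perf writes CSV rows such as "12345,,instructions:u,..." to stderr.
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("perf reported no instruction count")

# Hypothetical candidate solutions for one benchmark problem: the solution
# retiring fewer instructions on the same input is the more time-efficient one.
a = instruction_count(["python3", "solution_a.py"])
b = instruction_count(["python3", "solution_b.py"])
print(f"solution_a executes {a / b:.2f}x the instructions of solution_b")
```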
An Empirical Study of Suppressed Static Analysis Warnings
Huimin Hu,
Yingying Wang,
Julia Rubin, and
Michael Pradel
(University of Stuttgart, Germany; University of British Columbia, Canada)
Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the risk of unintentional suppression, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries.
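For readers unfamiliar with the mechanism, suppressions in Python take the form of special trailing comments. The directives in the sketch below are real syntaxes for flake8, Pylint, and mypy; the surrounding code is ours, and the last line illustrates the paper's notion of a "useless" suppression that matches no warning yet could hide a future one.

```python
# The directives below are real suppression syntaxes for flake8, Pylint, and
# mypy; the surrounding code is only illustrative.

# flake8 reports E401 ("multiple imports on one line"); noqa silences it here.
import os, sys  # noqa: E401

def load_config(path):
    # Pylint would report W0123 ("eval-used") here; the comment silences it.
    return eval(open(path).read())  # pylint: disable=eval-used

# mypy flags assigning a str to an int variable; the directive silences only
# the "assignment" error code on this one line.
port: int = os.environ.get("PORT", "8080")  # type: ignore[assignment]

# A "useless" suppression in the paper's sense: no eval occurs on this line,
# so the directive matches no warning -- yet it would hide a future one.
print(os.getcwd())  # pylint: disable=eval-used
```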
A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems
Yuntianyi Chen,
Yuqi Huai,
Yirui He,
Shilong Li,
Changnam Hong,
Qi Alfred Chen, and
Joshua Garcia
(University of California at Irvine, USA)
As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in ADSes is crucial given the unique combination of real-time decision-making requirements and the high stakes of operational failures. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, studying 1,331 bug fixes in terms of their symptoms, root causes, and bug-fix patterns. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to eliminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances.
Artifacts Available
Code Change Intention, Development Artifact and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM
Xu Yang,
Wenhan Zhu,
Michael Pacheco,
Jiayuan Zhou,
Shaowei Wang,
Xing Hu, and
Kui Liu
(University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China)
Understanding and Characterizing Mock Assertions in Unit Tests
Hengcheng Zhu,
Valerio Terragni,
Lili Wei,
Shing-Chi Cheung,
Jiarong Wu, and
Yepang Liu
(Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China)
Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usages. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly used to validate specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in systems under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usages and provide a foundation for future related research such as automated test generation that supports mock assertions.
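As a concrete illustration, the sketch below shows a mock assertion validating an interaction that a traditional return-value assertion cannot observe. It uses Python's unittest.mock (the studied projects are in Java, typically with Mockito, but the concept is identical); all names in the example are ours.

```python
# A mock assertion with Python's unittest.mock (the studied projects use Java
# and Mockito, but the concept is identical); all names here are illustrative.
from unittest.mock import MagicMock

def sync_user(user: dict, gateway) -> None:
    """Pushes dirty users to an external service; returns nothing observable."""
    if user["dirty"]:
        gateway.push(user["id"], user["name"])

gateway = MagicMock()
sync_user({"id": 7, "name": "ada", "dirty": True}, gateway)

# Mock assertion: validates the interaction with the external resource,
# which no return-value assertion could observe.
gateway.push.assert_called_once_with(7, "ada")

gateway.reset_mock()
sync_user({"id": 8, "name": "bob", "dirty": False}, gateway)
# Also validates control-flow logic: the clean-user path must not call out.
gateway.push.assert_not_called()
```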
Artifacts Available
Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead
Yanqi Su,
Zhenchang Xing,
Chong Wang,
Chunyang Chen,
Sherry (Xiwei) Xu,
Qinghua Lu, and
Liming Zhu
(Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; CSIRO, Australia; UNSW, Australia)
Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation
Mengzhuo Chen,
Zhe Liu,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Jun Hu, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; n.n., China; TU Munich, Germany)
Automated and Accurate Token Transfer Identification and Its Applications in Cryptocurrency Security
Shuwei Song,
Ting Chen,
Ao Qiao,
Xiapu Luo,
Leqing Wang,
Zheyuan He,
Ting Wang,
Xiaodong Lin,
Peng He,
Wensheng Zhang, and
Xiaosong Zhang
(University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada)
Mitigating Emergent Malware Label Noise in DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Guohan Zhang,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China)
Detecting Metadata-Related Bugs in Enterprise Applications
Md Mahir Asef Kabir,
Xiaoyin Wang, and
Na Meng
(Virginia Tech, USA; University of Texas at San Antonio, USA)
When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to use metadata correctly, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and the resulting defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL---a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck---a tool that takes in RSL rules and EAs to check for rule violations.
With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers, which scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 156 bugs in the second dataset, 53 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata.
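The flavor of a cross-file consistency rule can be illustrated without RSL itself, whose syntax the paper defines. The sketch below is neither RSL nor MeCheck: it checks, as a Python analog, that every class named in an XML bean definition resolves to an importable class, with Python's import machinery standing in for the Java lookup a real checker would perform. The XML content is hypothetical.

```python
# Illustrative cross-file consistency rule in the spirit of MeCheck (this is
# neither RSL nor MeCheck): every class named in an XML bean definition must
# resolve to a real class. Python's import machinery stands in for the Java
# lookup the actual tool performs; the XML content is hypothetical.
import importlib
import xml.etree.ElementTree as ET

CONFIG = """
<beans>
  <bean id="clock" class="datetime.datetime"/>
  <bean id="store" class="collections.OrderedDiict"/>
</beans>
"""  # "OrderedDiict" is a deliberate typo, the kind of bug such rules catch

def check_bean_classes(xml_text: str) -> list[str]:
    """Report every bean whose `class` attribute does not resolve."""
    violations = []
    for bean in ET.fromstring(xml_text).iter("bean"):
        fqcn = bean.get("class", "")
        module_name, _, cls_name = fqcn.rpartition(".")
        try:
            getattr(importlib.import_module(module_name), cls_name)
        except (ImportError, AttributeError, ValueError):
            violations.append(f"bean '{bean.get('id')}': class '{fqcn}' does not resolve")
    return violations

print(check_bean_classes(CONFIG))  # flags only the misspelled "store" bean
```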
Clone Detection for Smart Contracts: How Far Are We?
Zuobin Wang,
Zhiyuan Wan,
Yujing Chen,
Yun Zhang,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Hangzhou City University, China; Singapore Management University, Singapore; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China)
In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research.
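To make the comparison concrete, the sketch below shows a minimal token-based detector, one of the classic technique families such studies evaluate. It is our illustration, not SourceREClone or any of the five evaluated tools: identifiers and literals are normalized so that renamed (Type-2) clones still match.

```python
# Minimal token-based clone detection sketch (our illustration, not
# SourceREClone): normalize identifiers and literals, then score the
# overlap between the token bags of two code fragments.
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(code: str) -> Counter:
    """Map identifiers to ID and numbers to NUM so renamed clones match."""
    keywords = {"function", "return", "returns", "public", "uint256"}
    bag = Counter()
    for tok in TOKEN.findall(code):
        if tok.isdigit():
            bag["NUM"] += 1                  # abstract away literal values
        elif tok[0].isalpha() or tok[0] == "_":
            bag[tok if tok in keywords else "ID"] += 1  # abstract identifiers
        else:
            bag[tok] += 1                    # keep operators and punctuation
    return bag

def similarity(a: str, b: str) -> float:
    """Jaccard-style overlap of the two token bags, in [0, 1]."""
    x, y = normalize(a), normalize(b)
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 1.0

f1 = "function add(uint256 a, uint256 b) public returns (uint256) { return a + b; }"
f2 = "function sum(uint256 x, uint256 y) public returns (uint256) { return x + y; }"
print(similarity(f1, f2))  # 1.0: a Type-2 clone despite renamed identifiers
```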
Dissecting Real-World Cross-Language Bugs
Haoran Yang and
Haipeng Cai
(Washington State University, USA; SUNY Buffalo, USA)
Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages they are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed at discovering cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them along five aspects, namely their symptoms, locations, manifestation, root causes, and fixes, as well as the relationships among these aspects. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects, drawn from 54,356 relevant code commits in their GitHub repositories, we reveal not only the bug characteristics along those five aspects but also how they compare between the two top language combinations in the multilingual world (Python-C and Java-C). In addition to the study's findings and its enabling tools and datasets, we provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs.
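A minimal example of the stealthiness described here: in Python-C interoperation via ctypes, an undeclared foreign return type silently corrupts values, even though neither side is wrong in isolation. The example below is ours, not one of the study's 400 bugs, and assumes a Linux or macOS system where libm is available.

```python
# Illustration of a stealthy Python-C cross-language bug (our example, not one
# of the study's bugs): ctypes assumes every C function returns int, so a
# double slips through the language boundary silently corrupted.
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # C math library

# BUG: no declared return type. C's sqrt returns a double, but ctypes reads
# the result as int. Neither the C library nor this call is wrong on its own.
print(libm.sqrt(ctypes.c_double(9.0)))   # garbage integer, not 3.0

# FIX: declare the foreign signature explicitly at the language boundary.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))                    # 3.0
```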
Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Weisong Sun,
Yuchen Chen,
Chunrong Fang,
Yebo Feng,
Yuan Xiao,
An Guo,
Quanjun Zhang,
Zhenyu Chen,
Baowen Xu, and
Yang Liu
(Nanjing University, Singapore; Nanjing University, China; Nanyang Technological University, Singapore)
Why the Proof Fails in Different Versions of Theorem Provers: An Empirical Study of Compatibility Issues in Isabelle
Xiaokun Luan,
David Sanan,
Zhe Hou,
Qiyuan Xu,
Chengwei Liu,
Yufan Cai,
Yang Liu, and
Meng Sun
(Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; National University of Singapore, Singapore)