Proceedings of the ACM on Software Engineering, Volume 2, Number ISSTA
Editorial Message
The Proceedings of the ACM series presents the highest-quality research conducted in diverse
areas of computer science, as represented by the ACM Special Interest Groups. The Proceedings
of the ACM on Software Engineering (PACMSE) focuses on top-quality, original research on all
aspects of software engineering. This issue of the PACMSE journal publishes 107 articles that
were submitted in response to a call for papers soliciting high-quality submissions, from both
industry and academia, which describe original and unpublished results of theoretical, empirical,
conceptual, and experimental research on software testing and analysis.
Doctor: Optimizing Container Rebuild Efficiency by Instruction Re-orchestration
Zhiling Zhu,
Tieming Chen,
Chengwei Liu,
Han Liu,
Qijie Song,
Zhengzi Xu, and
Yang Liu
(Zhejiang University of Technology, China; Nanyang Technological University, Singapore; Hong Kong University of Science and Technology, Hong Kong)
Containerization has revolutionized software deployment, with Docker leading the way due to its ease of use and consistent runtime environment. As Docker usage grows, optimizing Dockerfile performance, particularly by reducing rebuild time, has become essential for maintaining efficient CI/CD pipelines. However, existing optimization approaches primarily address single builds without considering the recurring rebuild costs associated with modifications and evolution, limiting long-term efficiency gains. To bridge this gap, we present Doctor, a method for improving Dockerfile build efficiency through instruction re-ordering that addresses key challenges: identifying instruction dependencies, predicting future modifications, ensuring behavioral equivalence, and managing the optimization’s computational complexity. We developed a comprehensive dependency taxonomy based on Dockerfile syntax and a historical modification analysis to prioritize frequently modified instructions. Using a weighted topological sorting algorithm, Doctor optimizes instruction order to minimize future rebuild time while maintaining functionality. Experiments on 2,000 GitHub repositories show that Doctor improves 92.75% of Dockerfiles, reducing rebuild time by an average of 26.5%, with 12.82% of files achieving over a 50% reduction. Notably, 86.2% of cases preserve functional similarity. These findings highlight best practices for Dockerfile management, enabling developers to enhance Docker efficiency through informed optimization strategies.
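To make the re-ordering idea concrete, here is a minimal, hypothetical sketch (not Doctor's implementation) of a weighted Kahn-style topological sort: among instructions whose dependencies are satisfied, the one with the lowest historical modification frequency is emitted first, so frequently edited instructions sink toward the end of the Dockerfile and invalidate fewer cached layers on rebuild. The instruction names, dependency sets, and weights below are illustrative assumptions.

```python
# Illustrative sketch only (not Doctor's implementation): a weighted Kahn-style
# topological sort that keeps rarely modified Dockerfile instructions early so
# frequently edited ones invalidate fewer cached layers on rebuild.
import heapq

def reorder(instructions, deps, mod_freq):
    """instructions: ids in original order; deps: id -> set of ids it depends on;
    mod_freq: id -> historical modification count (hypothetical weights)."""
    indegree = {i: len(deps.get(i, set())) for i in instructions}
    dependents = {i: [] for i in instructions}
    for i, ds in deps.items():
        for d in ds:
            dependents[d].append(i)
    position = {i: p for p, i in enumerate(instructions)}
    # ready instructions ordered by (modification frequency, original position)
    ready = [(mod_freq.get(i, 0), position[i], i)
             for i in instructions if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, node = heapq.heappop(ready)
        order.append(node)
        for nxt in dependents[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (mod_freq.get(nxt, 0), position[nxt], nxt))
    return order

# The rarely changed RUN layer is hoisted above the often-edited COPY of source code.
print(reorder(["FROM", "COPY_src", "RUN_pip_install"],
              {"COPY_src": {"FROM"}, "RUN_pip_install": {"FROM"}},
              {"FROM": 0, "COPY_src": 12, "RUN_pip_install": 1}))
# -> ['FROM', 'RUN_pip_install', 'COPY_src']
```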
Publisher's Version
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Lianghong Guo,
Wei Tao,
Runhan Jiang,
Yanlin Wang,
Jiachi Chen,
Xilin Liu,
Yuchi Ma,
Mingzhi Mao,
Hongyu Zhang, and
Zibin Zheng
(Sun Yat-sen University, China; Independent, China; Huawei Cloud Computing Technologies, China; Chongqing University, China)
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs achieve limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. Besides, we find that current LLMs struggle to resolve issues requiring understanding images. The best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs’ failure on OmniGIRL, providing insights for future improvements.
Publisher's Version
Automated Attack Synthesis for Constant Product Market Makers
Sujin Han,
Jinseo Kim,
Sung-Ju Lee, and
Insu Yun
(KAIST, Republic of Korea)
Decentralized Finance (DeFi) enables many novel applications that were impossible in traditional finance. However, it also introduces new types of vulnerabilities. An example of such vulnerabilities is a composability bug between token contracts and a Decentralized Exchange (DEX) that follows the Constant Product Market Maker (CPMM) model. This type of bug, which we refer to as a CPMM composability bug, originates from issues in token contracts that make them incompatible with CPMMs, thereby endangering other tokens within the CPMM ecosystem. Since 2022, 23 such exploits have resulted in a total loss of 2.2M USD. BlockSec, a smart contract auditing company, reported that 138 such exploits occurred in February 2023 alone.
In this paper, we propose CPMMX, a tool that automatically detects CPMM composability bugs across entire blockchains. To achieve such scalability, we first formalized CPMM composability bugs and found that these bugs can be induced by breaking two safety invariants. Based on this finding, we designed CPMMX equipped with a two-step approach, called shallow-then-deep search. In more detail, it first uses shallow search to find transactions that break the invariants. Then, it uses deep search to refine these transactions, making them profitable for the attacker. We evaluated CPMMX against five baselines on two public datasets and one synthetic dataset. In our evaluation, CPMMX detected 2.5x to 1.5x more vulnerabilities compared to baseline methods. It also analyzed contracts significantly faster, achieving higher F1 scores than the baselines. Additionally, we applied CPMMX to all contracts on the latest blocks of the Ethereum and Binance networks and discovered 26 new exploits that can result in 15.7K USD profit in total.
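For readers unfamiliar with CPMMs, the sketch below illustrates the x * y = k invariant such exchanges rely on and how a non-standard token (here, a hypothetical fee-on-transfer token) can break it; the Pool class and numbers are illustrative and are not CPMMX's formalization of the two safety invariants.

```python
# Toy constant-product pool: the product of reserves must never decrease across a
# swap. A token whose transfer delivers less than claimed breaks this invariant,
# which is the kind of incompatibility a CPMM composability bug exploits.
class Pool:
    def __init__(self, x, y):
        self.x, self.y = x, y                               # reserves of tokens X and Y

    def swap_x_for_y(self, dx_claimed, dx_received):
        k_before = self.x * self.y
        dy = dx_claimed * self.y // (self.x + dx_claimed)   # output priced on the claimed input
        self.x += dx_received                               # reserves record what actually arrived
        self.y -= dy
        return k_before, self.x * self.y

pool = Pool(1_000_000, 1_000_000)
k0, k1 = pool.swap_x_for_y(10_000, 10_000)   # well-behaved token: k never decreases
assert k1 >= k0
k0, k1 = pool.swap_x_for_y(10_000, 9_000)    # fee-on-transfer token delivers only 90%
print(k1 < k0)                               # True: the safety invariant is violated
```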
Publisher's Version
Published Artifact
Artifacts Available
xFUZZ: A Flexible Framework for Fine-Grained, Runtime-Adaptive Fuzzing Strategy Composition
Dongsong Yu,
Yiyi Wang,
Chao Zhang,
Yang Lan,
Zhiyuan Jiang,
Shuitao Gan,
Zheyu Ma, and
Wende Tan
(Zhongguancun Laboratory, China; Tsinghua University, China; Huazhong University of Science and Technology, China; National University of Defense Technology, China; Laboratory for Advanced Computing and Intelligence Engineering, China)
Fuzzing is one of the most efficient techniques for detecting vulnerabilities in software. Existing approaches struggle with performance inconsistencies across different targets and rely on rigid, coarse-grained fuzzing strategy composition, limiting the flexibility to adaptively combine the strengths of different fuzzing strategies at runtime.
To address these challenges, we present xFUZZ, a flexible and extensible fuzzing framework supporting fine-grained, runtime-adaptive strategy composition. xFUZZ integrates popular input scheduling and mutation scheduling strategies as fine-grained, independently switchable plugins, allowing users to adaptively replace any plugins throughout the fuzzing campaign. Furthermore, we introduce an adaptive algorithm based on Sliding-Window Thompson Sampling, which dynamically selects the optimal composition of the fuzzing strategy during the fuzzing campaign. Experimental results show that xFUZZ outperforms state-of-the-art fuzzers by achieving a 10.07% increase in unique vulnerability discovery and a 4.94% improvement in code coverage. Notably, xFUZZ is the first to detect 21 out of 37 vulnerabilities in the test suite, establishing its effectiveness across varied targets.
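As a rough illustration of the scheduling idea (not xFUZZ's exact algorithm), the sketch below applies Thompson Sampling over a sliding window of recent outcomes, so the Beta posteriors track which strategy plugin has been paying off lately; the plugin names, window size, and coverage-based reward are assumptions.

```python
# Hedged sketch of Sliding-Window Thompson Sampling over fuzzing strategy plugins.
import random
from collections import deque

class SlidingWindowTS:
    def __init__(self, arms, window=500):
        self.arms = arms
        self.history = deque(maxlen=window)   # only the most recent (arm, reward) pairs

    def select(self):
        # Sample a Beta posterior per arm from the windowed successes/failures
        # and play the arm with the highest draw.
        best, best_draw = None, -1.0
        for arm in self.arms:
            wins = sum(r for a, r in self.history if a == arm)
            plays = sum(1 for a, _ in self.history if a == arm)
            draw = random.betavariate(1 + wins, 1 + plays - wins)
            if draw > best_draw:
                best, best_draw = arm, draw
        return best

    def update(self, arm, reward):
        # reward could be 1 if the last batch of executions found new coverage
        self.history.append((arm, reward))

bandit = SlidingWindowTS(["havoc", "splice", "dict"])
for _ in range(1000):
    arm = bandit.select()
    reward = 1 if random.random() < {"havoc": 0.3, "splice": 0.1, "dict": 0.05}[arm] else 0
    bandit.update(arm, reward)
```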
Publisher's Version
DataHook: An Efficient and Lightweight System Call Hooking Technique without Instruction Modification
Quan Hong,
Jiaqi Li,
Wen Zhang, and
Lidong Zhai
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; China Unicom Online Information Technology, China)
System calls serve as the primary interface for interaction between user-space programs and the operating system (OS) kernel. By hooking system calls, it is possible to analyze and modify the behavior of user-space programs. This paper proposes DataHook, an efficient and lightweight system call hooking technique for 32-bit programs. Compared to existing system call hooking techniques, DataHook achieves hooking with extremely low hook overhead by modifying only a few data elements without altering any program instructions. This unique characteristic not only avoids the multithreading conflicts associated with binary rewriting but also enables programs to use more efficient user-space OS subsystems. However, existing system call hooking techniques struggle to meet these goals simultaneously. While techniques like syscall user dispatch (SUD) and ptrace do not require rewriting process instructions, they introduce significant hook overhead. On the other hand, low-overhead techniques typically involve binary rewriting of multiple bytes or instructions, which introduces its own set of challenges. DataHook cleverly addresses these issues by leveraging the specific behavior of 32-bit programs during system calls. In short, unlike 64-bit programs, 32-bit programs use an indirect call instruction to jump to the function executing the syscall/sysenter when making a system call. This paper achieves system call hooking by manipulating the data dependencies involved in the indirect call process. This characteristic is present in 32-bit programs on glibc-based Linux systems, whether running on x86 or x86-64 architectures. Therefore, DataHook can be deployed on these systems. Experimental results demonstrate that DataHook reduces hook overhead by 5.4 to 1,429.0 times compared to existing techniques. When DataHook was applied to a server program to make it use the user-space network stack, the server performance improved by approximately 4.3 times. Additionally, when applied to Redis, DataHook resulted in only a 4.0% performance loss, compared to 8.0% to 94.7% with other techniques.
Publisher's Version
Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs
Yifan Xia,
Zichen Xie,
Peiyu Liu,
Kangjie Lu,
Yan Liu,
Wenhai Wang, and
Shouling Ji
(Zhejiang University, China; University of Minnesota, USA; Ant Group, China)
While the automated detection of cryptographic API misuses has progressed significantly, its precision diminishes for intricate targets due to the reliance on manually defined patterns. Large Language Models (LLMs) offer a promising context-aware understanding to address this shortcoming, yet the stochastic nature and the hallucination issue pose challenges to their applications in precise security analysis. This paper presents the first systematic study to explore LLMs’ application in cryptographic API misuse detection. Our findings are noteworthy: The instability of directly applying LLMs results in over half of the initial reports being false positives. Despite this, the reliability of LLM-based detection could be significantly enhanced by aligning detection scopes with realistic scenarios and employing a novel code & analysis validation technique, achieving a nearly 90% detection recall. This improvement substantially surpasses traditional methods and leads to the discovery of previously unknown vulnerabilities in established benchmarks. Nevertheless, we identify recurring failure patterns that illustrate current LLMs’ blind spots, including cryptographic knowledge deficiencies and code semantics misinterpretations. Leveraging these findings, we deploy an LLM-based detection system and uncover 63 new vulnerabilities (47 confirmed, 7 already fixed) in open-source Java and Python repositories, including prominent projects like Apache.
Publisher's Version
MoDitector: Module-Directed Testing for Autonomous Driving Systems
Renzhi Wang,
Mingfei Cheng,
Xiaofei Xie,
Yuan Zhou, and
Lei Ma
(University of Alberta, Canada; Singapore Management University, Singapore; Zhejiang Sci-Tech University, China; University of Tokyo, Japan)
Testing Autonomous Driving Systems (ADSs) is crucial for ensuring their safety, reliability, and performance. Despite the availability of numerous testing methods that can generate diverse and challenging scenarios to uncover potential vulnerabilities, these methods often treat the ADS as a black box, primarily focusing on identifying system-level failures like collisions or near-misses without pinpointing the specific modules responsible for these failures. This lack of root-cause understanding of the failures hinders effective debugging and subsequent system repair. Furthermore, current approaches often fall short in generating violations that adequately test the individual modules of an ADS from a system-level perspective, such as perception, prediction, planning, and control. To bridge this gap, we introduce MoDitector, a root-cause-aware testing method for ADS that generates safety-critical scenarios specifically designed to expose weaknesses in targeted ADS modules. Unlike existing approaches, MoDitector not only produces scenarios that lead to violations but also pinpoints the specific module responsible for each failure. Specifically, our approach introduces Module-Specific Oracles to automatically detect module-level errors and identify the root-cause module responsible for system-level violations. To effectively generate module-specific failures, we propose a module-directed testing strategy that integrates Module-Specific Feedback and Adaptive Scenario Generation to guide the testing process. We evaluated MoDitector across four critical ADS modules and four representative testing scenarios. The results demonstrate that MoDitector can effectively and efficiently generate scenarios in which failures can be attributed to specific targeted modules. In total, MoDitector generated 216.7 expected scenarios, significantly outperforming the best baseline, which identified only 79.0 scenarios. Our approach represents a significant innovation in ADS testing by focusing on the identification and rectification of module-specific errors within the system, moving beyond conventional black-box failure detection.
Publisher's Version
Published Artifact
Artifacts Available
FreeWavm: Enhanced WebAssembly Runtime Fuzzing Guided by Parse Tree Mutation and Snapshot
Peng Qian,
Xinlei Ying,
Jiashui Wang,
Long Liu,
Lun Zhang,
Jianhai Chen, and
Qinming He
(Zhejiang University, China; Ant Group, China; GoPlus Security, China)
WebAssembly, recognized as a low-level and portable language, has been widely embraced in areas as diverse as browsers and blockchains, emerging as a revolutionary force for Internet evolution. Unfortunately, defects and flaws in WebAssembly runtimes bring about unexpected results when running WebAssembly applications. A family of solutions has been proposed to detect vulnerabilities in WebAssembly runtimes, with fuzzing surging as the most promising and persuasive approach. Despite its potential, fuzzing faces significant challenges due to the grammatical complexity of WebAssembly: without an in-depth understanding of the unique Module-based code structure, fuzzers generate test inputs that struggle to tap into the deep logic within a WebAssembly runtime, limiting their effectiveness in unveiling vulnerabilities.
To bridge this gap, we introduce FreeWavm, a novel framework for fuzzing WebAssembly runtimes by aggressively mutating the structure of WebAssembly code. Technically, we transform the WebAssembly bytecode into a parse tree format that captures complex characteristics of code structure. To generate meaningful test inputs for WebAssembly runtime fuzzing, we design a structure-aware mutation module that engages in a customized node prioritization strategy to screen out interesting nodes in the parse tree, and then applies specific structure mutations. To ensure the validity of the mutated test inputs, FreeWavm is equipped with an automated repair mechanism to patch the mutated parse tree. Furthermore, we take advantage of parse tree snapshots to facilitate input evolution and the overall fuzzing process. Extensive experiments are conducted to evaluate FreeWavm on multiple WebAssembly runtimes. Empirical results show that FreeWavm effectively triggers structure-specific crashes in WebAssembly runtimes, outperforming other counterparts. FreeWavm has identified 69 previously unknown bugs, 24 of which are assigned CVEs thus far.
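The following is a heavily simplified sketch of what structure-aware, priority-guided parse-tree mutation can look like; the node kinds, priority weights, and mutation operators are invented for illustration and do not reflect FreeWavm's actual strategy or its repair mechanism.

```python
# Illustrative sketch only: pick a parse-tree node by a hypothetical priority and
# apply a structural mutation to its subtree. A real system would follow this with
# a repair pass to keep the mutated module valid.
import copy
import random

class Node:
    def __init__(self, kind, children=None):
        self.kind, self.children = kind, children or []

def walk(node):
    yield node
    for c in node.children:
        yield from walk(c)

# Hypothetical priorities: structure-bearing nodes are more "interesting" to mutate.
PRIORITY = {"func": 5, "block": 4, "loop": 4, "instr": 1}

def mutate(tree):
    nodes = list(walk(tree))
    weights = [PRIORITY.get(n.kind, 1) for n in nodes]
    target = random.choices(nodes, weights=weights, k=1)[0]
    op = random.choice(["duplicate_child", "drop_child", "swap_children"])
    if op == "duplicate_child" and target.children:
        target.children.append(copy.deepcopy(random.choice(target.children)))
    elif op == "drop_child" and len(target.children) > 1:
        target.children.pop(random.randrange(len(target.children)))
    elif op == "swap_children" and len(target.children) > 1:
        random.shuffle(target.children)
    return tree

module = Node("func", [Node("block", [Node("instr"), Node("instr")]), Node("loop", [Node("instr")])])
mutate(module)
```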
Publisher's Version
Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection
Lei Yu,
Zhirong Huang,
Hang Yuan,
Shiqi Cheng,
Li Yang,
Fengjun Zhang,
Chenjie Shen,
Jiajia Ma,
Jingyuan Zhang,
Junyi Lu, and
Chun Zuo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Sinosoft, China)
Smart contract vulnerability detection is a critical challenge in the rapidly evolving blockchain landscape. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensiveness and sufficient quality, with limited vulnerability type coverage and insufficient distinction between high-quality and low-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Through our empirical analysis, we found that even after continual pre-training and supervised fine-tuning, LLMs still exhibit limitations in precisely understanding the execution order of state changes in smart contracts, which can lead to incorrect vulnerability explanations despite making correct detection decisions. These limitations result in poor detection performance, leading to potentially severe financial losses. To address these challenges, we propose Smart-LLaMA-DPO, an advanced detection method based on LLaMA-3.1-8B. First, we construct a comprehensive dataset covering four vulnerability types and machine-unauditable vulnerabilities, containing labels, detailed explanations, and precise vulnerability locations for Supervised Fine-Tuning (SFT), as well as paired high-quality and low-quality outputs for Direct Preference Optimization (DPO). Second, we perform continual pre-training using large-scale smart contract code to enhance the LLM's understanding of specific security practices in smart contracts. Furthermore, we conduct supervised fine-tuning with our comprehensive dataset. Finally, we apply DPO, which leverages human feedback to improve the quality of generated explanations. Smart-LLaMA-DPO utilizes a specially designed loss function that encourages the LLM to increase the probability of preferred outputs while decreasing the probability of non-preferred outputs, thereby enhancing the LLM's ability to generate high-quality explanations. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation demonstrate the superior quality of explanations generated by Smart-LLaMA-DPO in terms of correctness, thoroughness, and clarity.
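For context, the standard Direct Preference Optimization objective that this kind of preference training builds on is sketched below; the variable names are illustrative, and in a real pipeline the log-probabilities would come from the policy model and a frozen reference model.

```python
# Hedged sketch of the standard DPO loss for one (preferred, non-preferred) pair:
# -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))])
import math

def dpo_loss(lp_w_policy, lp_w_ref, lp_l_policy, lp_l_ref, beta=0.1):
    margin = beta * ((lp_w_policy - lp_w_ref) - (lp_l_policy - lp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss drops as the policy raises the preferred explanation's likelihood relative
# to the reference while lowering the non-preferred one's.
print(dpo_loss(-12.0, -14.0, -15.0, -13.0))   # preference respected -> lower loss
print(dpo_loss(-15.0, -13.0, -12.0, -14.0))   # preference inverted  -> higher loss
```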
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Are Autonomous Web Agents Good Testers?
Antoine Chevrot,
Alexandre Vernotte,
Jean-Rémy Falleri,
Xavier Blanc,
Bruno Legeard, and
Aymeric Cretin
(Smartesting, France; University of Bordeaux - LaBRI - UMR 5800, France)
Despite advances in automated testing, manual testing remains prevalent due to the high maintenance demands associated with test script fragility—scripts often break with minor changes in application structure. Recent developments in Large Language Models (LLMs) offer a potential alternative by powering Autonomous Web Agents (AWAs) that can autonomously interact with applications. These agents may serve as Autonomous Test Agents (ATAs), potentially reducing the need for maintenance-heavy automated scripts by utilising natural language instructions similar to those used by human testers.
This paper investigates the feasibility of adapting AWAs for natural language test case execution and how to evaluate them.
We contribute (1) a benchmark of three offline web applications and a suite of 113 manual test cases, split between passing and failing cases, to evaluate and compare ATA performance, (2) SeeAct-ATA and PinATA, two open-source ATA implementations capable of executing test steps, verifying assertions, and giving verdicts, and (3) comparative experiments using our benchmark that quantify our ATAs' effectiveness. Finally, we also conduct a qualitative evaluation to identify the limitations of PinATA, our best-performing implementation.
Our findings reveal that our simple implementation, SeeAct-ATA, does not perform well compared to our more advanced PinATA implementation when executing test cases (a 50% performance improvement for PinATA). However, while PinATA obtains around 60% correct verdicts and up to a promising 94% specificity, we identify several limitations that need to be addressed to develop more resilient and reliable ATAs, paving the way for robust, low-maintenance test automation.
Publisher's Version
Published Artifact
Artifacts Available
Preventing Disruption of System Backup against Ransomware Attacks
Yiwei Hou,
Lihua Guo,
Chijin Zhou,
Quan Zhang,
Wenhuan Liu,
Chengnian Sun, and
Yu Jiang
(Tsinghua University, China; Union Tech, China; University of Waterloo, Canada)
The ransomware threat to the software ecosystem has grown rapidly in recent years. Despite being well-studied, new ransomware variants continually emerge, designed to evade existing encryption-based detection mechanisms. This paper introduces Remembrall, a new perspective to defend against ransomware by monitoring and preventing system backup disruptions. Focusing on deletion actions of volume shadow copies (VSC) in Windows, Remembrall captures related malicious events and identifies all ransomware traces as a real-time defense tool. To ensure no ransomware is missed, we conduct a comprehensive investigation to classify all potential attack actions that can be used to delete VSCs throughout the application layer, OS layer, and hardware layer. Based on the analysis, Remembrall is designed to retrieve system event information and accurately identify ransomware without false negatives. We evaluate Remembrall on recent ransomware samples. Remembrall achieves a 4.31%-87.55% increase in F1-score compared to other state-of-the-art anti-ransomware tools across 60 ransomware families. Remembrall has also detected eight zero-day ransomware samples in the experiment.
Publisher's Version
Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models
Yisong Xiao,
Aishan Liu,
Siyuan Liang,
Xianglong Liu, and
Dacheng Tao
(Beihang University, China; National University of Singapore, Singapore; Zhongguancun Laboratory, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant concerns about fairness, which is a crucial issue in software engineering. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, where key-value mapping occurs in the MLP layers, we posited that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), an effective and efficient bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded within MLP layer activations by employing prompts centered around biased concepts (keys) to detect the emission probabilities for social groups (values). Subsequently, the adversarial debiasing neutralizer intervenes in MLP activations during inference to equalize the association probabilities among different social groups. Extensive experiments across nine protected attributes show that our FairMed significantly outperforms state-of-the-art methods in effectiveness, achieving average bias reductions of up to 84.42%.
Publisher's Version
KRAKEN: Program-Adaptive Parallel Fuzzing
Anshunkang Zhou,
Heqing Huang, and
Charles Zhang
(Hong Kong University of Science and Technology, China; City University of Hong Kong, China)
Parallel fuzzing, which utilizes multicore computers to accelerate the fuzzing process, has been widely used in industrial-scale software defect detection. However, specifying efficient parallel fuzzing strategies for programs with different characteristics is challenging due to the difficulty of reasoning about fuzzing runtime statically. Existing efforts still use pre-defined tactics for various programs, resulting in suboptimal performance.
In this paper, we propose Kraken, a new program-adaptive parallel fuzzer that improves fuzzing efficiency through dynamic strategy optimization. The key insight is that the inefficiency in parallel fuzzing can be observed during runtime through various feedbacks, such as code coverage changes, which allows us to adjust the adopted strategy to avoid inefficient path searching, thus gradually approximating the optimal policy. Based on the above insight, our key idea is to view the task of finding the optimal strategy as an optimization problem and gradually approach the best program-specific strategy on the fly by maximizing certain objective functions. We have implemented Kraken in C/C++ and evaluated it on 19 real-world programs against 6 state-of-the-art parallel fuzzers. Experimental results show that Kraken can achieve 54.7% more code coverage and find 70.2% more bugs in the given time. Moreover, Kraken has found 192 bugs in 37 popular open-source projects, and 119 of them have been assigned CVE IDs.
Publisher's Version
Model Checking Guided Incremental Testing for Distributed Systems
Yu Gao,
Dong Wang,
Wensheng Dou,
Wenhan Feng,
Yu Liang,
Yuan Feng, and
Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Wuhan Dameng Database, China)
Recently, model checking guided testing (MCGT) approaches have been proposed to systematically test distributed systems. MCGT automatically generates test cases by traversing the entire verified abstract state space derived from a distributed system’s formal specification, and it checks whether the target system behaves correctly during testing. Despite the effectiveness of MCGT, testing a distributed system with MCGT is often costly and can take weeks to complete. This inefficiency is exacerbated when distributed systems evolve, such as when new features are introduced or bugs are fixed. We must re-run the entire testing process for the evolved system to verify its correctness, rendering MCGT not only resource-intensive but also inefficient.
To reduce the overhead of model checking guided testing during distributed system evolution, we propose iMocket, a novel model checking guided incremental testing approach for distributed systems. We first extract the changes from both the formal specification and system implementation. We then identify the affected states within the abstract state space and generate incremental test cases that specifically target these states, thereby avoiding redundant testing of unaffected states. We evaluate iMocket using 12 real-world change scenarios drawn from three popular distributed systems. The experimental results demonstrate that iMocket can reduce the number of test cases by an average of 74.83% and decrease testing time by 22.54% to 99.99%. This highlights its effectiveness in lowering testing costs for distributed systems.
Publisher's Version
Understanding Model Weaknesses: A Path to Strengthening DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Yanjie Zhao,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Huazhong University of Science and Technology, China; Harbin Institute of Technology, China)
Android malware detection remains a critical challenge in cybersecurity research. Recent advancements leverage AI techniques, particularly deep neural networks (DNNs), to train a detection model, but their effectiveness is often compromised by the pronounced imbalance among malware families in commonly used training datasets. This imbalance leads to overfitting in dominant categories and poor performance in underrepresented ones, increasing predictive uncertainty for less common malware families. To address the suboptimal performance of many DNN models, we introduce MalTutor, a novel framework that enhances model robustness through an optimized training process. Our primary insight lies in transforming uncertainties from “liabilities” into “assets” by strategically incorporating them into DNN training methodologies. Specifically, we begin by evaluating the predictive uncertainty of DNN models throughout various training epochs, which guides our sample categorization. Incorporating Curriculum Learning strategies, we commence training with easy-to-learn samples with lower uncertainty, progressively incorporating difficult-to-learn ones with higher uncertainty. Our experimental results demonstrate that MalTutor significantly improves the performance of models trained on imbalanced datasets, increasing accuracy by 31.0%, elevating the F1 score by 138.8%, and specifically boosting the average accuracy in detecting various types of malicious apps by 133.9%. Our findings provide valuable insights into the potential benefits of incorporating uncertainty to improve the robustness of DNN models for prediction-oriented software engineering tasks.
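A minimal sketch of the curriculum idea described above, assuming predictive entropy as the uncertainty measure and an evenly staged schedule (both assumptions, not MalTutor's exact design): training starts on low-uncertainty samples and progressively admits higher-uncertainty ones.

```python
# Uncertainty-guided curriculum sketch: rank samples by predictive entropy and
# build training stages that progressively include harder (more uncertain) samples.
import math

def predictive_entropy(probs):
    """Uncertainty of one sample given its averaged class probabilities."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def curriculum_stages(samples, probs_per_sample, n_stages=3):
    scored = sorted(samples, key=lambda s: predictive_entropy(probs_per_sample[s]))
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = math.ceil(len(scored) * k / n_stages)
        stages.append(scored[:cutoff])       # each stage re-includes the easier samples
    return stages

probs = {"app_a": [0.97, 0.02, 0.01],        # confident prediction -> easy
         "app_b": [0.55, 0.30, 0.15],
         "app_c": [0.34, 0.33, 0.33]}        # near-uniform -> hard
for i, stage in enumerate(curriculum_stages(list(probs), probs), 1):
    print(f"stage {i}: {stage}")
```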
Publisher's Version
LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models
Lingxiao Tang,
Jiakun Liu,
Zhongxin Liu,
Xiaohu Yang, and
Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
The SZZ algorithm is the dominant technique for identifying bug-inducing commits and serves as a foundation for many software engineering studies, such as bug prediction and static code analysis, thereby enhancing software quality and facilitating better maintenance practices. Researchers have proposed many variants to enhance the SZZ algorithm’s performance since its introduction. The majority of them rely on static techniques or heuristic assumptions, making them easy to implement, but their performance improvements are often limited. Recently, a deep learning-based SZZ algorithm has been introduced to enhance the original SZZ algorithm. However, it requires complex preprocessing and is restricted to a single programming language. Additionally, while it enhances precision, it sacrifices recall. Furthermore, most variants overlook crucial information, such as commit messages and patch context, and are limited to bug-fixing commits involving deleted lines.
The emergence of large language models (LLMs) offers an opportunity to address these drawbacks. In this study, we investigate the strengths and limitations of LLMs and propose LLM4SZZ, which employs two approaches (i.e., rank-based identification and context-enhanced identification) to handle different types of bug-fixing commits. We determine which approach to adopt based on the LLM’s ability to comprehend the bug and identify whether the bug is present in a commit. The context-enhanced identification provides the LLM with more context and requires it to find the bug-inducing commit among a set of candidate commits. In rank-based identification, we ask the LLM to select buggy statements from the bug-fixing commit and rank them based on their relevance to the root cause. Experimental results show that LLM4SZZ outperforms all baselines across three datasets, improving F1-score by 6.9% to 16.0% without significantly sacrificing recall. Additionally, LLM4SZZ can identify many bug-inducing commits that the baselines fail to detect, accounting for 7.8%, 7.4% and 2.5% of the total bug-inducing commits across three datasets, respectively.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
ALMOND: Learning an Assembly Language Model for 0-Shot Code Obfuscation Detection
Xuezixiang Li,
Sheng Yu, and
Heng Yin
(University of California at Riverside, USA; Deepbits Technology, USA)
Code obfuscation is a technique used to protect software by making it difficult to understand and reverse engineer. However, it can also be exploited for malicious purposes such as code plagiarism or developing malicious programs. Learning-based techniques have achieved great success with the help of supervised learning and labeled training sets. However, when faced with real-life environments involving privately developed and undisclosed obfuscators, these supervised learning methods often raise concerns about generalizability and robustness when facing unseen and unknown classes of obfuscation techniques.
This paper presents ALMOND, a novel zero-shot approach for detecting code obfuscation in binary executables. Unlike previous supervised learning methods, ALMOND does not require labeled obfuscated samples for training. Instead, it leverages a language model pre-trained only on unobfuscated assembly code to identify the linguistic deviations introduced by obfuscation. The key innovation is the use of "error-perplexity" as a detection metric, which focuses on tokens the model fails to predict. Continuous Error Perplexity further enhances this to capture consecutive prediction errors characteristic of obfuscated sequences. Experiments show ALMOND achieves 96.3% accuracy on unseen obfuscation methods, outperforming supervised baselines. On real-world malware samples, it demonstrates an AUC of 0.869 and significantly outperforms the supervised-learning baseline. Our dataset, pre-trained model, and evaluation code will be available at https://github.com/palmtreemodel/ALMOND.
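Our simplified reading of an error-perplexity-style score is sketched below: perplexity is computed only over tokens the assembly language model mispredicts, so obfuscated code, which deviates from the learned "language", scores high without any labeled obfuscated training data. The exact metric and its continuous variant are defined in the paper.

```python
# Hedged sketch of an error-perplexity-style score (our simplified reading, not
# ALMOND's exact definition): perplexity restricted to mispredicted tokens.
import math

def error_perplexity(token_probs, predicted_ok):
    """token_probs: model probability assigned to each actual token;
    predicted_ok: whether the model's top-1 prediction matched that token."""
    errors = [p for p, ok in zip(token_probs, predicted_ok) if not ok]
    if not errors:
        return 1.0                      # nothing surprised the model
    nll = sum(-math.log(p + 1e-12) for p in errors) / len(errors)
    return math.exp(nll)

# Unobfuscated code: few, mild errors -> low score.
print(error_perplexity([0.9, 0.7, 0.2, 0.85], [True, True, False, True]))
# Obfuscated code: many low-probability tokens -> high score, flagged zero-shot.
print(error_perplexity([0.05, 0.4, 0.02, 0.1], [False, False, False, False]))
```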
Publisher's Version
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Niklas Risse,
Jing Liu, and
Marcel Böhme
(MPI-SP, Germany)
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem:
Given a function, does it contain a security flaw?
From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called.
In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function vulnerable if it was involved in a patch of an actual security flaw and confirmed to cause the program’s vulnerability. It is non-vulnerable otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed.
But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities.
We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and program analysis research.
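The spurious-correlation finding is easy to reproduce in spirit: a classifier that sees nothing but token counts can separate "vulnerable" from "non-vulnerable" functions on such datasets. The sketch below uses toy functions and scikit-learn purely for illustration; it is not the paper's dataset or model.

```python
# Word-count-only baseline sketch: no syntax, no data flow, just token counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

functions = [
    "char buf[10]; strcpy(buf, input);",         # labeled vulnerable
    "int n = len(src); memcpy(dst, src, n);",    # labeled vulnerable
    "if (idx < size) arr[idx] = value;",         # labeled non-vulnerable
    "return a + b;",                             # labeled non-vulnerable
]
labels = [1, 1, 0, 0]

bag_of_words = CountVectorizer(token_pattern=r"[A-Za-z_]+").fit_transform(functions)
clf = LogisticRegression().fit(bag_of_words, labels)

# High accuracy here reflects memorized word statistics, not any understanding of
# whether the code is actually exploitable in its calling context.
print(clf.score(bag_of_words, labels))
```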
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Tracezip: Efficient Distributed Tracing via Trace Compression
Zhuangbin Chen,
Junsong Pu, and
Zibin Zheng
(Sun Yat-sen University, China; Beijing University of Posts and Telecommunications, China)
Distributed tracing serves as a fundamental building block in the monitoring and testing of cloud service systems. To reduce computational and storage overheads, the de facto practice is to capture fewer traces via sampling. However, existing work faces a trade-off between the completeness of tracing and system overhead. On one hand, head-based sampling indiscriminately selects requests to trace when they enter the system, which may miss critical events. On the other hand, tail-based sampling first captures all requests and then selectively persists the edge-case traces, which entails the overheads related to trace collection and ingestion. Taking a different path, we propose Tracezip in this paper to enhance the efficiency of distributed tracing via trace compression. Our key insight is that there exists significant redundancy among traces, which results in repetitive transmission of identical data between services and the backend. We design a new data structure named Span Retrieval Tree (SRT) that continuously encapsulates such redundancy at the service side and transforms trace spans into a lightweight form. At the backend, the complete traces can be seamlessly reconstructed by retrieving the common data that are already delivered by previous spans. Tracezip includes a series of strategies to optimize the structure of SRT and a differential update mechanism to efficiently synchronize SRT between services and the backend. Our evaluation on microservices benchmarks, popular cloud service systems, and production trace data demonstrates that Tracezip can achieve substantial performance gains in trace collection with negligible overhead. We have implemented Tracezip inside the OpenTelemetry Collector, making it compatible with existing tracing APIs.
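As a rough, hypothetical illustration of the redundancy-elimination idea (not Tracezip's actual Span Retrieval Tree), the sketch below caches an attribute template per (service, operation) pair and ships only a key plus a delta; in the real system, the service-side and backend-side state would be kept consistent by the differential update mechanism described above.

```python
# Toy illustration of eliminating repeated span data: store a shared attribute
# template once and transmit only a key plus the fields that differ.
class SpanCache:
    def __init__(self):
        self.templates = {}                  # key -> attribute template

    def compress(self, span):
        key = (span["service"], span["operation"])
        template = self.templates.setdefault(key, dict(span))
        diff = {k: v for k, v in span.items() if template.get(k) != v}
        return key, diff                     # only the key and the delta travel

    def reconstruct(self, key, diff):
        full = dict(self.templates[key])
        full.update(diff)
        return full

cache = SpanCache()
span1 = {"service": "cart", "operation": "GET /items", "status": 200, "dur_ms": 12}
span2 = {"service": "cart", "operation": "GET /items", "status": 200, "dur_ms": 9}
key, diff = cache.compress(span1)            # first span establishes the template
key, diff = cache.compress(span2)            # second span ships only {"dur_ms": 9}
print(diff, cache.reconstruct(key, diff))
```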
Publisher's Version
Walls Have Ears: Demystifying Notification Listener Usage in Android Apps
Jiapeng Deng,
Tianming Liu,
Yanjie Zhao,
Chao Wang,
Lin Zhang, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Monash University, Australia; CNCERT-CC, China)
The Notification Listener Service (NLS) in Android allows third-party apps to monitor and process device notifications, enabling powerful features but also introducing security and privacy risks. Despite the special permission required to access NLS, it has been recurrently exploited by malicious actors. However, there is a lack of systematic investigation into NLS usage patterns and their security implications. In this paper, we propose NLRadar, a hybrid approach combining static analysis and LLMs to examine NLS usage in Android apps. We apply NLRadar to a large set of apps, including both malware and regular apps, to demystify NLS usage and uncover abuses. Our analysis reveals that NLS is heavily abused, with interesting discoveries such as apps insecurely storing social media messages, exploiting NLS for destructive competition or SMS credential stealing, and leveraging NLS to spread promotional messages or even malicious links. We also find undisclosed changes in NLS usage through app updates and inadequate disclosure in privacy policies. Our findings emphasize the need for more rigorous vetting of NLS usage and better developer education on responsible NLS practices.
Publisher's Version
Safe4U: Identifying Unsound Safe Encapsulations of Unsafe Calls in Rust using LLMs
Huan Li,
Bei Wang,
Xing Hu, and
Xin Xia
(Zhejiang University, China)
Rust is an emerging programming language that ensures safety through strict compile-time checks. A Rust function marked as unsafe indicates it has additional safety requirements (e.g., initialized, not null), known as contracts in the community. These unsafe functions can only be called within explicit unsafe blocks and the contracts must be guaranteed by the caller. To reuse and reduce unsafe code, the community recommends using safe encapsulation of unsafe calls (EUC) in practice. However, an EUC is unsound if any contract is not guaranteed and could lead to undefined behaviors in safe Rust, thus breaking Rust's safety promise. It is challenging to identify unsound EUCs with conventional techniques due to the limitation in cross-lingual comprehension of code and natural language. Large language models (LLMs) have demonstrated impressive capabilities, but their performance is unsatisfactory owing to the complexity of contracts and the lack of domain knowledge. To this end, we propose a novel framework, Safe4U, which incorporates LLMs, static analysis tools, and domain knowledge to identify unsound EUCs. Safe4U first utilizes static analysis tools to retrieve relevant context. Then, it decomposes the primitive contract description into several fine-grained classified contracts. Ultimately, Safe4U introduces domain knowledge and invokes the reasoning capability of LLMs to verify every fine-grained contract. The evaluation results show that Safe4U brings a general performance improvement and the fine-grained results are constructive for locating specific unsound sources. In real-world scenarios, Safe4U can identify 9 out of 11 unsound EUCs from CVE. Furthermore, Safe4U detected 22 new unsound EUCs in the most downloaded crates, 16 of which have been confirmed.
Publisher's Version
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
Ziyao Zhang,
Chong Wang,
Yanlin Wang,
Ensheng Shi,
Yuchi Ma,
Wanjun Zhong,
Jiachi Chen,
Mingzhi Mao, and
Zibin Zheng
(Sun Yat-sen University, China; Nanyang Technological University, Singapore; Huawei Cloud Computing Technologies, China)
Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language model (LLM)-based approaches have shown promising results and revolutionized the code generation task. Despite the promising performance, LLMs often generate content with hallucinations, especially in code generation scenarios that require handling complex contextual dependencies in practical development processes. Although a previous study has analyzed hallucinations in LLM-powered code generation, it is limited to standalone function generation. In this paper, we conduct an empirical study of the phenomena, mechanism, and mitigation of LLM hallucinations within more practical and complex development contexts in the repository-level generation scenario. First, we manually examine the code generation results from six mainstream LLMs to establish a hallucination taxonomy of LLM-generated code. Next, we elaborate on the phenomenon of hallucinations and analyze their distribution across different models. We then analyze the causes of hallucinations and identify four potential factors contributing to them. Finally, we propose an RAG-based mitigation method, which demonstrates consistent effectiveness across all studied LLMs.
Publisher's Version
Bridge the Islands: Pointer Analysis for Microservice Systems
Teng Zhang,
Yufei Liang,
Ganlin Li,
Tian Tan,
Chang Xu, and
Yue Li
(Nanjing University, China)
Microservice architecture has revolutionized enterprise software, providing scalability and flexibility by decomposing applications into loosely coupled services. However, this paradigm shift introduces unique challenges for pointer analysis, a foundational static analysis crucial for supporting various client analyses. Existing fundamental analyses, primarily designed for monolithic enterprise applications, fall short in handling complex service communications—such as remote procedure call and message-based communication—and essential programming paradigms, like dependency injection and web endpoint configuration. This paper introduces Micans, the first pointer analysis specifically crafted to address these challenges in microservice systems, capable of constructing comprehensive value flows across services. We extensively evaluated Micans on real-world benchmarks from multiple domains, focusing on its effectiveness in resolving service communications, constructing essential program information like call graphs, and supporting client analyses such as taint analysis. Micans consistently and significantly outperforms state-of-the-art approaches, demonstrating its capacity to handle complex cross-service communications and diverse programming paradigms. These results highlight Micans' potential as a robust foundational analysis, advancing static analysis capabilities to meet the complexities of modern microservices.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Program Feature-Based Benchmarking for Fuzz Testing
Miao Miao,
Sriteja Kummita,
Eric Bodden, and
Shiyi Wei
(University of Texas at Dallas, USA; Fraunhofer IEM, Germany; Heinz Nixdorf Institute at Paderborn University, Germany)
Fuzzing is a powerful software testing technique renowned for its effectiveness in identifying software vulnerabilities. Traditional fuzzing evaluations typically focus on overall fuzzer performance across a set of target programs, yet few benchmarks consider how fine-grained program features influence fuzzing effectiveness. To bridge this gap, we introduce FeatureBench, a novel benchmark designed to generate programs with configurable, fine-grained program features to enhance fuzzing evaluations. We reviewed 25 recent grey-box fuzzing studies, extracting 7 program features related to control-flow and data-flow that can impact fuzzer performance. Using these features, we generated a benchmark consisting of 153 programs controlled by 10 fine-grained configurable parameters. We evaluated 11 fuzzers using this benchmark, with each fuzzer representing either distinct claimed improvements or serving as a widely used baseline in fuzzing evaluations. The results indicate that fuzzer performance varies significantly based on the program features and their strengths, highlighting the importance of incorporating program characteristics into fuzzing evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
SoK: A Taxonomic Analysis of DeFi Rug Pulls: Types, Dataset, and Tool Assessment
Dianxiang Sun,
Wei Ma,
Liming Nie, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; Shenzhen Technology University, China)
Rug pulls present a critical threat in Decentralized Finance (DeFi), causing substantial financial losses and eroding ecosystem trust. Despite research advances, effective detection remains hampered by fragmented taxonomies, limited datasets, and inadequate tool evaluations. Through systematic analysis of academic and industry sources, we develop a comprehensive taxonomy of 35 distinct rug pull types, including 9 previously undocumented variants. Our analysis reveals significant detection gaps: existing datasets cover only 20% of known types, leading us to create an enhanced dataset of 2,391 instances that increases coverage to 82.9%. Evaluation of 13 detection tools shows substantial capability variation (25.7% to 62.9%), with 9 types completely undetectable. Most critically, tool performance degrades significantly when facing complex attacks, with maximum detection rates dropping from 55.6% for single-vector cases to 31.3% for compound scenarios. These findings provide essential insights for developing more robust security testing approaches for smart contract vulnerabilities in decentralized systems.
Publisher's Version
Recurring Vulnerability Detection: How Far Are We?
Yiheng Cao,
Susheng Wu,
Ruisi Wang,
Bihuan Chen,
Yiheng Huang,
Chenhao Lu,
Zhuotong Zhou, and
Xin Peng
(Fudan University, China)
With the rapid development of open-source software, code reuse has become a common practice to accelerate development. However, it can cause reusing projects to inherit vulnerabilities from the original code; these recur in the reusing projects and are known as recurring vulnerabilities (RVs). Traditional general-purpose vulnerability detection approaches struggle with scalability and adaptability, while learning-based approaches are often constrained by limited training datasets and are less effective against unseen vulnerabilities. Though specific recurring vulnerability detection (RVD) approaches have been proposed, their effectiveness across various RV characteristics remains unclear.
In this paper, we conduct a large-scale empirical study using a newly constructed RV dataset containing 4,569 RVs, achieving a 953% expansion over prior RV datasets. Our study analyzes the characteristics of RVs, evaluates the effectiveness of the state-of-the-art RVD approaches, and investigates the root causes of false positives and false negatives, yielding key insights. Inspired by these insights, we design AntMan, a novel RVD approach that identifies both explicit and implicit call relations with modified functions, then employs inter-procedural taint analysis and intra-procedural dependency slicing within those functions to generate comprehensive signatures, and finally incorporates flexible matching to detect RVs. Our evaluation demonstrates AntMan's effectiveness, generality, and practical usefulness in RVD. AntMan has detected 4,593 RVs, with 307 confirmed by developers, and identified 73 new 0-day vulnerabilities across 15 projects, receiving 5 CVE identifiers.
Publisher's Version
ConTested: Consistency-Aided Tested Code Generation with LLM
Jinhao Dong,
Jun Sun,
Wenjie Zhang,
Jin Song Dong, and
Dan Hao
(Peking University, China; Singapore Management University, Singapore; National University of Singapore, Singapore)
Recent advancements in large language models (LLMs) have significantly improved code generation, which generates code snippets automatically based on natural language requirements. Despite achieving state-of-the-art performance, LLMs often struggle to generate accurate and reliable code, requiring developers to spend substantial effort debugging and evaluating the generated output. Researchers have proposed leveraging consistency to select code that passes more tests (inter-consistency) and demonstrates consistent behavior across more counterparts (intra-consistency). However, since the tests themselves are also generated by LLMs, relying on majority voting based on incorrect tests leads to unreliable results. To address this, we propose a lightweight interaction framework that incorporates user feedback to effectively guide consistency. Our results demonstrate that, with minimal human effort, performance can be significantly improved. In each iteration, we introduce a rank-correct-fix co-evolution process between code and tests. This process iteratively enhances the quality of both, making the consistency voting between code and tests more reliable.
We evaluate ConTested through extensive experiments, demonstrating its effectiveness across multiple LLMs, including GPT-3.5 and GPT-4o. Our results show improvements of 32.9% over GPT-3.5 and 16.97% over GPT-4o. Additionally, ConTested achieves an 11.1% improvement over the SOTA post-processing technique, MPSC. This improvement is achieved with only a 4-round interaction with users, requiring minimal user effort. A user study further confirms the feasibility and cost-effectiveness of ConTested, highlighting its ability to enhance code generation without introducing substantial overhead.
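A minimal sketch of consistency-based selection, as a simplification of the idea rather than ConTested's full rank-correct-fix loop: each candidate program is scored by how many generated tests it passes (inter-consistency) and how closely its pass/fail behavior agrees with the other candidates (intra-consistency).

```python
# Consistency-voting sketch: pick the candidate that passes the most tests and
# behaves most like the other candidates on those tests.
def select_candidate(results):
    """results[i][j] is True if candidate i passes generated test j."""
    best, best_score = None, float("-inf")
    for i, row in enumerate(results):
        inter = sum(row)                                   # tests passed
        intra = sum(sum(a == b for a, b in zip(row, other))
                    for k, other in enumerate(results) if k != i)
        score = inter + intra
        if score > best_score:
            best, best_score = i, score
    return best

# Candidate 1 passes the most tests and agrees with the majority -> selected.
print(select_candidate([[True, False, True],
                        [True, True, True],
                        [True, True, False]]))
```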
Publisher's Version
Robust Vulnerability Detection across Compilations: LLVM-IR vs. Assembly with Transformer Model
Rony Shir,
Priyanka Surve,
Yuval Elovici, and
Asaf Shabtai
(Ben-Gurion University of the Negev, Israel)
Detecting vulnerabilities in binary files is a challenging task in cybersecurity, particularly when source code is unavailable and the compilation process and its parameters are unknown. Existing deep learning-based detection methods often rely on knowing a binary’s specific compilation settings, which may limit their ability to perform well on other types of binaries. In this research, we provide a thorough comparison of assembly representation and LLVM-IR to identify which representation is more robust and suitable when compilation parameters are unknown. The choice of representation significantly influences detection accuracy. Another contribution of this paper is the use of CodeBERT, a transformer-based model, as a classification tool for detecting vulnerabilities in scenarios where the compilation process is unknown. This study applies transformer models to the task of multi-class vulnerability detection in the LLVM-IR domain, with a focus on binary-derived representations. While recent research has explored the use of transformers for vulnerability analysis in source code and raw binary instruction streams, systematic evaluation as a classifier at the LLVM-IR level remains limited. Prior work has commonly relied on RNN-based methods, which are considered state-of-the-art for this task; however, these models struggle to capture long-range dependencies effectively. To address this limitation, we extend transformer-based classification to LLVM-IR produced from binaries and provide a comprehensive evaluation in this setting. Our results highlight the potential of this approach to strengthen system security across diverse binary configurations.
Publisher's Version
RouthSearch: Inferring PID Parameter Specification for Flight Control Program by Coordinate Search
Siao Wang,
Zhen Dong,
Hui Li,
Liwei Shen,
Xin Peng, and
Dongdong She
(Fudan University, China; Hong Kong University of Science and Technology, China)
Flight control programs are widely used in unmanned aerial vehicles (UAVs) to manage and maintain UAVs’ flying behaviors dynamically. These flight control programs include a PID control module that takes three user-configurable PID parameters: Proportional (P), Integral (I), and Derivative (D). Users can also adjust these PID parameters during flight to suit the needs of various flight tasks. However, flight control programs do not have sufficient safety checks on the user-provided PID parameters, leading to a severe class of UAV vulnerability: input validation bugs. These occur when the user misconfigures PID parameters and causes the UAV to enter a dangerous state, such as deviation from the expected path, loss of control, or even a crash.
Prior works use random testing approaches like fuzzing to identify invalid PID parameters from user input. However, they are not effective in the three-dimensional search space of PID parameters. Meanwhile, each dynamic execution of the UAV test is very expensive, further affecting the performance of random testing.
In this work, we address the problem of PID parameter misconfiguration by combining the Routh-Hurwitz stability criterion with coordinate search, introducing a method called RouthSearch. Instead of identifying misconfigured PID parameters in an ad-hoc fashion, RouthSearch principledly determines valid ranges for three-dimensional PID parameters. We first leverage the Routh-Hurwitz Criterion to identify a theoretical PID parameter boundary. We then refine the boundary using an efficient coordinate search. The valid range of three-dimensional PID parameters determined by RouthSearch can filter out misconfigured PID parameters from users during flight and further help to discover logical bugs in popular flight control programs.
We evaluated RouthSearch across eight flight modes in two popular flight control programs, PX4 and ArduPilot. The results show that RouthSearch can determine the valid ranges of the three-dimensional PID parameters with an accuracy of 92.0% when compared to the ground truth. In terms of the total number of misconfigured PID parameters, RouthSearch discovers 3,853 sets of PID misconfigurations within 48 hours, while the state-of-the-art work PGFuzz only discovers 449 sets, meaning RouthSearch outperforms prior work by 8.58 times. Additionally, our method helps to detect three bugs in ArduPilot and PX4.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Program Analysis Combining Generalized Bit-Level and Word-Level Abstractions
Guangsheng Fan,
Liqian Chen,
Banghu Yin,
Wenyu Zhang,
Peisen Yao, and
Ji Wang
(National University of Defense Technology, China; Zhejiang University, China)
Abstract interpretation is widely used to determine programs' numerical properties. However, current abstract domains primarily focus on mathematical semantics, which do not fully capture the complexities of real-world programs relying on machine integer semantics and involving extensive bit-vector operations. This paper presents a solution that combines a bit-level abstraction and a word-level abstraction to capture machine integer semantics. First, we generalize the bit-level abstraction used in the Linux eBPF verifier for determining the known and unknown bits of real-world programs, by supplementing it with all the operations required of a standard abstract domain. Based on this abstraction, we design an abstract domain that is signedness-aware and simultaneously retains both the above bit-level and the word-level bound information. These two levels of information cooperate via a standard reduced product operation to improve analysis precision. We implement the proposed domains in the Crab analyzer and the out-of-kernel eBPF verifier PREVAL. Experiments demonstrate their effectiveness in analyzing SV-COMP benchmark programs, assisting hardware designs, and eBPF verification.
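As a concrete picture of the bit-level abstraction being generalized, here is a minimal Python sketch modeled on the tnum (tristate number) representation of the Linux eBPF verifier, in which value holds the known bit values and mask marks the unknown bits; the join operation shown is an illustrative sound upper bound rather than the paper's full domain:

from dataclasses import dataclass

@dataclass(frozen=True)
class Tnum:
    value: int  # bits known to be 1 (where mask is 0)
    mask: int   # bits whose value is unknown

    def __and__(self, other):
        alpha = self.value | self.mask
        beta = other.value | other.mask
        v = self.value & other.value
        return Tnum(v, alpha & beta & ~v)

    def __or__(self, other):
        v = self.value | other.value
        mu = self.mask | other.mask
        return Tnum(v, mu & ~v)

def join(a: Tnum, b: Tnum) -> Tnum:
    """Least upper bound: bits that disagree or are unknown in either input become unknown."""
    mu = a.mask | b.mask | (a.value ^ b.value)
    return Tnum(a.value & b.value & ~mu, mu)

x = Tnum(0b1000, 0b0011)   # 0b10??: the two low bits are unknown
y = Tnum(0b1010, 0b0001)   # 0b101?
print(x & y, x | y, join(x, y))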
Publisher's Version
Bridging the Gaps between Graph Neural Networks and Data-Flow Analysis: The Closer, the Better
Qingchen Yu,
Xin Liu,
Qingguo Zhou, and
Chunming Wu
(Zhejiang University, China; Lanzhou University, China)
Recent advances in applying deep neural networks to programming tasks have achieved remarkable success in practice, prompting interest in exploring how well these models can perform traditional program analysis techniques. Data-flow analysis (DFA), a classic and well-established approach for analyzing programs, presents an opportunity to assess the capabilities of neural networks in this domain. Given the structural similarities between DFA and Graph Neural Networks (GNNs), we explore the extent to which GNNs can effectively model the DFA algorithm. Building on the concept of algorithmic alignment from Neural Algorithmic Reasoning (NAR), we identify two key challenges: the noninterference property of the bit-vectors used in DFA and the complex handling of external information at different stages of the algorithm. Addressing these gaps, we propose three GNN architectures — DFA-GNN−, DFA-GNN, and DFA-GNN+ — that progressively align with the DFA algorithm. Our evaluations emphasize the generalization capacity of these models, particularly in scenarios where training occurs on smaller samples while testing on much larger inputs. Results demonstrate that GNNs with higher algorithmic alignment, such as DFA-GNN+, exhibit superior generalization and sample efficiency, accurately scaling to 10 times larger inputs with minimal training data. Notably, we show that GNNs trained with only input-output pairs can perform competitively with models trained using full execution trajectory supervision, a common practice in recent NAR studies. This finding highlights the efficiency and robustness of GNNs in reasoning tasks when algorithmically aligned with the target algorithm.
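For reference, the target algorithm that the architectures are aligned with is classic iterative bit-vector data-flow analysis; the minimal Python sketch below runs reaching definitions to a fixpoint on a toy CFG (invented gen/kill sets), with each iteration corresponding loosely to one round of GNN message passing:

def reaching_definitions(succ, gen, kill, num_defs):
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)
    full_mask = (1 << num_defs) - 1
    IN = {n: 0 for n in nodes}
    OUT = {n: gen[n] for n in nodes}
    changed = True
    while changed:                           # iterate to a fixpoint
        changed = False
        for n in nodes:
            new_in = 0
            for p in pred[n]:
                new_in |= OUT[p]             # meet = bitwise OR (may analysis)
            new_out = gen[n] | (new_in & ~kill[n] & full_mask)
            if new_in != IN[n] or new_out != OUT[n]:
                IN[n], OUT[n] = new_in, new_out
                changed = True
    return IN, OUT

# Toy CFG: 0 -> 1 -> 2 with a back edge 2 -> 1; definitions d0, d1, d2 as bits.
succ = {0: [1], 1: [2], 2: [1]}
gen = {0: 0b001, 1: 0b010, 2: 0b100}
kill = {0: 0b000, 1: 0b100, 2: 0b010}
print(reaching_definitions(succ, gen, kill, 3))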
Publisher's Version
Info
ACM SIGSOFT Distinguished Paper Award
AudioTest: Prioritizing Audio Test Cases
Yinghua Li,
Xueqi Dang,
Wendkûuni C. Ouédraogo,
Jacques Klein, and
Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg)
Audio classification systems, powered by deep neural networks (DNNs), are integral to various applications that impact daily lives, like voice-activated assistants. Ensuring the accuracy of these systems is crucial since inaccuracies can lead to significant security issues and user mistrust. However, testing audio classifiers presents a significant challenge: the high manual labeling cost for annotating audio test inputs. Test input prioritization has emerged as a promising approach to mitigate this labeling cost issue. It prioritizes potentially misclassified tests, allowing for the early labeling of such critical inputs and making debugging more efficient. However, when applying existing test prioritization methods to audio-type test inputs, there are some limitations: 1) Coverage-based methods are less effective and efficient than confidence-based methods. 2) Confidence-based methods rely only on prediction probability vectors, ignoring the unique characteristics of audio-type data. 3) Mutation-based methods lack designed mutation operations for audio data, making them unsuitable for audio-type test inputs. To overcome these challenges, we propose AudioTest, a novel test prioritization approach specifically designed for audio-type test inputs. The core premise is that tests closer to misclassified samples are more likely to be misclassified. Based on the special characteristics of audio-type data, AudioTest generates four types of features: time-domain features, frequency-domain features, perceptual features, and output features. For each test, AudioTest concatenates its four types of features into a feature vector and applies a carefully designed feature transformation strategy to bring misclassified tests closer in space. AudioTest leverages a trained model to predict the probability of misclassification of each test based on its transformed vectors and ranks all the tests accordingly. We evaluate the performance of AudioTest utilizing 96 subjects, encompassing natural and noisy datasets. We employed two classical metrics, Percentage of Fault Detection (PFD) and Average Percentage of Fault Detected (APFD), for our evaluation. The results demonstrate that AudioTest outperforms all the compared test prioritization approaches in terms of both PFD and APFD. The average improvement of AudioTest compared to the baseline test prioritization methods ranges from 12.63% to 54.58% on natural datasets and from 12.71% to 40.48% on noisy datasets.
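The following minimal Python sketch conveys the general recipe rather than AudioTest itself: compute a few simple time- and frequency-domain features per audio test, concatenate them with the classifier's output probabilities, fit a ranker on labeled tests, and prioritize the tests most likely to be misclassified; the particular features, the synthetic data, and the logistic-regression ranker are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_features(wave: np.ndarray, probs: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(wave ** 2))                      # time-domain energy
    zcr = np.mean(np.abs(np.diff(np.sign(wave))) > 0)      # zero-crossing rate
    spectrum = np.abs(np.fft.rfft(wave))
    centroid = (spectrum * np.arange(len(spectrum))).sum() / (spectrum.sum() + 1e-9)
    return np.concatenate([[rms, zcr, centroid], probs])   # plus output features

# Synthetic stand-in data: waveforms, model output probabilities, and
# 0/1 labels marking whether the DNN misclassified each test.
rng = np.random.default_rng(0)
waves = [rng.normal(size=16000) for _ in range(40)]
probs = rng.dirichlet(np.ones(10), size=40)
X = np.stack([audio_features(w, p) for w, p in zip(waves, probs)])
y = rng.integers(0, 2, size=40)
ranker = LogisticRegression(max_iter=1000).fit(X, y)

# Prioritization: label the tests most likely to be misclassified first.
order = np.argsort(-ranker.predict_proba(X)[:, 1])
print(order[:5])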
Publisher's Version
QTRAN: Extending Metamorphic-Oracle Based Logical Bug Detection Techniques for Multiple-DBMS Dialect Support
Li Lin,
Qinglin Zhu,
Hongqiao Chen,
Zhuangda Wang,
Rongxin Wu, and
Xiaoheng Xie
(Xiamen University, China; Ant Group, China)
Metamorphic testing is a widely used method to detect logical bugs in Database Management Systems (DBMSs), referred to herein as MOLT (Metamorphic-Oracle based Logical Bug Detection Technique). This technique involves constructing SQL statement pairs, including original and mutated queries, and assessing whether the execution results conform to predefined metamorphic relations to detect logical bugs. However, current MOLTs rely heavily on specific DBMS grammar to generate valid SQL statement pairs, which makes it challenging to adapt these techniques to various DBMSs with different grammatical structures. As a result, only a few popular DBMSs, such as PostgreSQL, MySQL, and MariaDB, are supported by existing MOLTs, with extensive manual effort required to expand to other DBMSs. Given that many DBMSs remain inadequately tested, there is a pressing need for a method that enables effortless extension of MOLTs across diverse DBMSs.
In this paper, we propose QTRAN, a novel LLM-powered approach that automatically extends existing MOLTs to various DBMSs. Our key insight is to use LLMs to translate the SQL statement pairs of existing MOLTs to target DBMSs for metamorphic testing. To address the challenges of LLMs’ limited understanding of dialect differences and metamorphic mechanisms, we propose a two-phase approach comprising the transfer and mutation phases. QTRAN tackles these challenges by drawing inspiration from the developer’s process of creating a MOLT, which includes understanding the grammar of the target DBMS to generate original queries and employing a mutator for customized mutations. The transfer phase is designed to identify potential dialects and leverage information from SQL documents to enhance query retrieval, enabling LLMs to translate original queries across different DBMSs accurately. During the mutation phase, we gather SQL statement pairs from existing MOLTs to fine-tune the pretrained model, tailoring it specifically for mutation tasks. Then we employ the customized LLM to mutate the translated original queries, preserving the defined relationships necessary for metamorphic testing.
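Once an (original, mutated) pair exists for a target DBMS, the oracle check itself is straightforward; the minimal Python sketch below checks a TLP-style metamorphic relation against sqlite3, with a toy table and illustrative queries standing in for QTRAN's output:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (5,)])

original = "SELECT COUNT(*) FROM t"
partitions = [                         # the mutated side: a three-way partition
    "SELECT COUNT(*) FROM t WHERE x > 2",
    "SELECT COUNT(*) FROM t WHERE NOT (x > 2)",
    "SELECT COUNT(*) FROM t WHERE (x > 2) IS NULL",
]

lhs = conn.execute(original).fetchone()[0]
rhs = sum(conn.execute(q).fetchone()[0] for q in partitions)
assert lhs == rhs, f"metamorphic relation violated: {lhs} != {rhs}"  # would signal a logical bug
print("relation holds:", lhs, "==", rhs)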
We implement our approach as a tool and apply it to extend four state-of-the-art MOLTs for eight DBMSs: MySQL, MariaDB, TiDB, PostgreSQL, SQLite, MonetDB, DuckDB, and ClickHouse. The evaluation results show that over 99% of the SQL statement pairs transferred by QTRAN satisfy the metamorphic relations required for testing. Furthermore, we have detected 24 logical bugs among these DBMSs, with 16 confirmed as unique and previously unknown bugs. We believe that the generality of QTRAN can significantly enhance the reliability of DBMSs.
Publisher's Version
GUIPilot: A Consistency-Based Mobile GUI Testing Approach for Detecting Application-Specific Bugs
Ruofan Liu,
Xiwen Teoh,
Yun Lin,
Guanjie Chen,
Ruofei Ren,
Denys Poshyvanyk, and
Jin Song Dong
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; College of William and Mary, USA)
GUI testing is crucial for ensuring the reliability of mobile applications. State-of-the-art GUI testing approaches are successful in exploring more application scenarios and discovering general bugs such as application crashes. However, industrial GUI testing also needs to investigate application-specific bugs such as deviations in screen layout, widget position, or GUI transition from the GUI design mock-ups created by the application designers. These mock-ups specify the expected screens, widgets, and their respective behaviors. Validating the consistency between the GUI design and the implementation is labor-intensive and time-consuming, yet, this validation step plays an important role in industrial GUI testing.
In this work, we propose GUIPilot, an approach for detecting inconsistencies between mobile designs and their implementations. The mobile design usually consists of design mock-ups that specify (1) the expected screen appearances (e.g., widget layouts, colors, and shapes) and (2) the expected screen behaviors, regarding how one screen can transition into another (e.g., labeled widgets with textual description). Given a design mock-up and the implementation of its application, GUIPilot reports both their screen inconsistencies and their process inconsistencies. On the one hand, GUIPilot detects screen inconsistencies by abstracting every screen into a widget container where each widget is represented by its position, width, height, and type. By defining the partial order of widgets and the costs of replacing, inserting, and deleting widgets in a screen, we convert the screen-matching problem into an optimizable widget alignment problem. On the other hand, we translate the specified GUI transition into stepwise actions on the mobile screen (e.g., click, long-press, input text on some widgets). To this end, we propose a visual prompt for the vision-language model to infer widget-specific actions on the screen. By this means, we can validate the presence or absence of expected transitions in the implementation. Our extensive experiments on 80 mobile applications and 160 design mock-ups show that (1) GUIPilot achieves 99.8% precision and 98.6% recall in detecting screen inconsistencies, outperforming the state-of-the-art approach GVT by 66.2% and 56.6%, respectively, and (2) GUIPilot reports zero errors in detecting process inconsistencies. Furthermore, our industrial case study on applying GUIPilot to a trading mobile application shows that GUIPilot detected nine application bugs, and all the bugs were confirmed by the original application experts. Our code is available at https://github.com/code-philia/GUIPilot.
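A minimal Python sketch of the widget-alignment idea follows, assuming each widget is abstracted as (x, y, width, height, type) and using a generic assignment solver; the cost weights and widget lists are invented and not GUIPilot's:

import numpy as np
from scipy.optimize import linear_sum_assignment

def widget_cost(a, b, type_penalty=100.0):
    ax, ay, aw, ah, at = a
    bx, by, bw, bh, bt = b
    geo = abs(ax - bx) + abs(ay - by) + abs(aw - bw) + abs(ah - bh)
    return geo + (0.0 if at == bt else type_penalty)

design = [(0, 0, 100, 40, "button"), (0, 50, 200, 40, "text")]
impl   = [(2, 52, 198, 40, "text"), (0, 0, 100, 44, "button"), (0, 120, 80, 80, "image")]

cost = np.array([[widget_cost(d, i) for i in impl] for d in design])
rows, cols = linear_sum_assignment(cost)          # optimal widget alignment
for r, c in zip(rows, cols):
    print(f"design widget {r} -> impl widget {c} (cost {cost[r, c]:.1f})")
# Implementation widgets left unmatched (here the extra "image") would be
# reported as screen inconsistencies.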
Publisher's Version
Video
Info
Testing the Fault-Tolerance of Multi-sensor Fusion Perception in Autonomous Driving Systems
Haoxiang Tian,
Wenqiang Ding,
Xingshuo Han,
Guoquan Wu,
An Guo,
Junqi Zhang,
Wei Chen,
Jun Wei, and
Tianwei Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanyang Technological University, Singapore; Nanjing Institute of Software, China; Continental-NTU Corporate Lab, Singapore; University of Chinese Academy of Sciences Nanjing, China; University of Science and Technology of China, China)
Production-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on the multi-sensor fusion (MSF) strategy to perceive their surroundings. This strategy increases the perception robustness by combining the respective strengths of the cameras and LiDAR, directly affecting the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, both cameras and LiDAR are prone to various faults that can significantly impact the decision-making and subsequent behaviors of ADSs. It is important to thoroughly test the robustness of MSF during development. Existing testing methods only focus on the identification of corner cases that MSF fails to detect. However, there is still a lack of investigation on how sensor faults affect the system-level behaviors of ADSs.
To address this gap, we present FADE, the first testing methodology to comprehensively assess the fault tolerance of MSF perception-based ADSs. We systematically build fault models for both cameras and LiDAR in AVs and inject these faults into MSF-based ADSs to test their behaviors in various testing scenarios. To effectively and efficiently explore the parameter spaces of sensor fault models, we design a feedback-guided differential fuzzer to uncover safety violations of ADSs caused by the injected faults. We evaluate FADE on Baidu Apollo, a representative and practical industrial ADS. The evaluation results demonstrate the practical values of FADE, and disclose some useful findings. We further conduct physical experiments using a Baidu Apollo 6.0 EDU AV to validate these findings in real-world settings.
Publisher's Version
Published Artifact
Artifacts Available
KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis
Zhijie Liu,
Qiyi Tang,
Sen Nie,
Shi Wu,
Liang Feng Zhang, and
Yutian Tang
(ShanghaiTech University, China; Tencent Security Keen Lab, China; University of Glasgow, UK)
Binary code similarity analysis (BCSA) is a crucial research area in many fields such as cybersecurity. Specifically, function-level diffing tools are the most widely used in BCSA: they perform function matching one by one to evaluate the similarity between binary programs. However, such methods have high time complexity, making them unscalable in large-scale scenarios (e.g., 1/n-to-n search). Towards effective and efficient program-level BCSA, we propose KEENHash, a novel hashing approach that hashes binaries into program-level representations through large language model (LLM)-generated function embeddings. KEENHash condenses a binary into one compact and fixed-length program embedding using K-Means and Feature Hashing, allowing us to do effective and efficient large-scale program-level BCSA, surpassing the previous state-of-the-art methods. The experimental results show that KEENHash is at least 215 times faster than the state-of-the-art function matching tools while maintaining effectiveness. Furthermore, in a large-scale scenario with 5.3 billion similarity evaluations, KEENHash takes only 395.83 seconds, while these tools would cost at least 56 days. We also evaluate KEENHash on the program clone search of large-scale BCSA across extensive datasets in 202,305 binaries. Compared with 4 state-of-the-art methods, KEENHash outperforms all of them by at least 23.16%, and displays remarkable superiority over them in the large-scale BCSA security scenario of malware detection.
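The minimal Python sketch below illustrates the spirit of the construction: cluster per-function embeddings with K-Means, then feature-hash the cluster-membership counts of a given binary into one compact, fixed-length program embedding; the dimensions, the random stand-in embeddings, and the cosine comparison are assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import FeatureHasher

rng = np.random.default_rng(0)
corpus_funcs = rng.normal(size=(500, 64))          # stand-ins for LLM function embeddings
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(corpus_funcs)

def program_embedding(function_embs: np.ndarray, dim: int = 256) -> np.ndarray:
    clusters = kmeans.predict(function_embs)
    counts = {f"c{c}": float(n) for c, n in zip(*np.unique(clusters, return_counts=True))}
    return FeatureHasher(n_features=dim).transform([counts]).toarray()[0]

prog_a = program_embedding(rng.normal(size=(40, 64)))
prog_b = program_embedding(rng.normal(size=(55, 64)))
cos = prog_a @ prog_b / (np.linalg.norm(prog_a) * np.linalg.norm(prog_b) + 1e-9)
print("program-level similarity:", round(float(cos), 3))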
Publisher's Version
Info
Enhancing Vulnerability Detection via Inter-procedural Semantic Completion
Bozhi Wu,
Chengjie Liu,
Zhiming Li,
Yushi Cao,
Jun Sun, and
Shang-Wei Lin
(Nanyang Technological University, Singapore; Peking University, China; Singapore Management University, Singapore)
Inspired by advances in deep learning, numerous learning-based approaches for vulnerability detection have emerged, primarily operating at the function level for scalability. However, this design choice has a critical limitation: many vulnerabilities span multiple functions, causing function-level approaches to lose the semantics of called functions and fail to capture true vulnerability patterns. To address this issue, we propose VulnSC, a novel framework designed to enhance learning-based approaches by complementing inter-procedural semantics. VulnSC retrieves the source code of called functions for datasets and leverages large language models (LLMs) with well-designed prompts to generate summaries for these functions. The datasets, enhanced with these summaries, are fed into neural networks for improved vulnerability detection. VulnSC is the first general framework to integrate inter-procedural semantics into existing learning-based approaches for vulnerability detection while maintaining scalability. We evaluate VulnSC on four state-of-the-art learning-based approaches using two widely used datasets, and our experimental results demonstrate that VulnSC significantly enhances detection performance with minimal additional computational overhead.
Publisher's Version
Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG
Zhiyu Zhang,
Longxing Li,
Ruigang Liang, and
Kai Chen
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Most coverage-guided kernel fuzzers test operating system kernels based on syscall sequence synthesis. However, some syscalls remain rarely or never covered over the course of fuzzing (called low frequency syscalls, LFS), meaning the relevant code branches remain unexplored. This is due to the complex dependencies of the LFS and mutation uncertainty, which make it difficult for fuzzers to generate the corresponding syscall sequences. Since many kernel fuzzers can dynamically learn syscall dependencies from the current corpus based on the choice-table mechanism, providing comprehensive and high-quality seeds could help fuzzers cover LFS. However, constructing such seeds relies heavily on expert experience to resolve the syscall dependencies.
In this paper, we propose SyzGPT, the first kernel fuzzing framework to automatically generate effective seeds for LFS via Large Language Models (LLMs). We leverage a dependency-based retrieval-augmented generation (DRAG) method to unlock the potential of LLMs and design a series of steps to improve the effectiveness of the generated seeds. First, SyzGPT automatically extracts syscall dependencies from the existing documentation via an LLM. Second, SyzGPT retrieves programs from the fuzzing corpus based on the dependencies to construct adaptive context for the LLM. Last, SyzGPT periodically generates and repairs seeds with feedback to enrich the fuzzing corpus for LFS. We propose a novel set of evaluation metrics for seed generation in the kernel domain. Our evaluation shows that SyzGPT can generate seeds with a high valid rate of 87.84% and can be extended to offline and fine-tuned LLMs. Compared to seven state-of-the-art kernel fuzzers, SyzGPT improves code coverage by 17.73%, LFS coverage by 58.00%, and vulnerability detection by 323.22% on average. In addition, SyzGPT independently discovered 26 unknown kernel bugs (10 of which are LFS-related), with 11 confirmed.
Publisher's Version
Published Artifact
Artifacts Available
Structure-Aware, Diagnosis-Guided ECU Firmware Fuzzing
Qicai Chen,
Kun Hu,
Sichen Gong,
Bihuan Chen,
Zikui Kong,
Haowen Jiang,
Bingkun Sun,
You Lu, and
Xin Peng
(Fudan University, China; ANHUI GuarDrive Safety Technology, China)
Electronic Control Units (ECUs), providing a wide range of functions from basic control to safety-critical features, play a critical role in modern vehicles. Fuzzing has emerged as an effective approach to ensure the functional safety and automotive security of ECU firmware. However, existing fuzzing approaches focus on the inputs from other ECUs through external buses (e.g., CAN), but neglect the inputs from internal peripherals through on-board buses (e.g., SPI). Due to the restricted input space exploration, they fail to comprehensively fuzz ECU firmware. Moreover, existing fuzzing approaches often lack visibility into ECU firmware’s internal states but rely on limited feedback (e.g., message timeouts or hardware indicators), hindering their effectiveness.
To address these limitations, we propose a structure-aware, diagnosis-guided framework, EcuFuzz, to comprehensively and effectively fuzz ECU firmware. Specifically, EcuFuzz simultaneously considers external buses (i.e., CAN) and on-board buses (i.e., SPI). It leverages the structure of CAN and SPI to effectively mutate CAN messages and SPI sequences, and incorporates a dual-core microcontroller-based peripheral emulator to handle real-time SPI communication. In addition, EcuFuzz implements a new feedback mechanism to guide the fuzzing process. It leverages automotive diagnostic protocols to collect ECUs’ internal states, i.e., error-related variables, trouble codes, and exception contexts. Our compatibility evaluation on ten ECUs from three major Tier 1 automotive suppliers has indicated that our framework is compatible with nine ECUs. Our effectiveness evaluation on three representative ECUs has demonstrated that our framework detects nine previously unknown safety-critical faults, which have been patched by technicians from the suppliers.
Publisher's Version
FANDANGO: Evolving Language-Based Testing
José Antonio Zamudio Amaya,
Marius Smytzek, and
Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
Language-based fuzzers leverage formal input specifications (languages) to generate arbitrarily large and diverse sets of valid inputs for a program under test. Modern language-based test generators combine grammars and constraints to satisfy syntactic and semantic input constraints. ISLA, the leading input generator in that space, uses symbolic constraint solving to solve input constraints. Using solvers places ISLA among the most precise fuzzers but also makes it slow.
In this paper, we explore search-based testing as an alternative to symbolic constraint solving. We employ a genetic algorithm that iteratively generates candidate inputs from an input specification, evaluates them against the defined constraints, and evolves a population of inputs through syntactically valid mutations, retaining those with superior fitness until the semantic input constraints are met. This evolutionary procedure, analogous to natural genetic evolution, leads to progressively improved inputs that cover both semantics and syntax. This change boosts the efficiency of language-based testing: in our experiments, compared to ISLA, our search-based FANDANGO prototype is faster by one to three orders of magnitude without sacrificing precision.
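The minimal Python sketch below shows the search loop in miniature: generate inputs from a (toy) specification, score them by how many semantic constraints they satisfy, and evolve the population with syntactically valid mutations; the grammar and constraints are invented and much simpler than FANDANGO's specification language:

import random

random.seed(1)

def generate():                       # toy spec: <pair> ::= <int> "," <int>
    return [random.randint(0, 100), random.randint(0, 100)]

def mutate(ind):                      # replace one field with a fresh derivation
    out = list(ind)
    out[random.randrange(2)] = random.randint(0, 100)
    return out

constraints = [
    lambda p: p[0] < p[1],            # semantic constraint 1
    lambda p: (p[0] + p[1]) % 10 == 0 # semantic constraint 2
]
fitness = lambda p: sum(c(p) for c in constraints)

population = [generate() for _ in range(50)]
for _ in range(100):
    if any(fitness(p) == len(constraints) for p in population):
        break                         # all semantic constraints satisfied
    population.sort(key=fitness, reverse=True)
    survivors = population[:25]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(25)]

print([p for p in population if fitness(p) == len(constraints)][:3])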
The search-based approach no longer restricts constraints to constraint solvers' (miniature) languages. In FANDANGO, constraints can use the whole Python language and library. This expressiveness gives testers unprecedented flexibility in shaping test inputs. It allows them to state arbitrary goals for test generation: "Please produce 1,000 valid test inputs where the voltage field follows a Gaussian distribution but never exceeds 20 mV."
Publisher's Version
Info
ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference
Yuchong Xie,
Wenhui Zhang, and
Dongdong She
(Hong Kong University of Science and Technology, China; Hunan University, Changsha, China)
Fuzzing is a popular software testing technique for discovering vulnerabilities. A central problem in fuzzing is identifying hot bytes that can influence program behavior. Taint analysis can track the data flow of hot bytes in a white-box fashion, but it often suffers from stability issues and cannot run on large real-world programs. Fuzzing-Driven Taint Inference (FTI) is a simple black-box technique to track hot bytes for fuzzing. It monitors the dynamic program behaviors of program execution instances and further infers hot bytes in a black-box fashion. However, this method requires additional O(N) program executions and incurs a large runtime overhead.
We observe that a widely used mutation scheme in fuzzing, havoc mode, can be adapted into a lightweight FTI with zero additional program executions. In this work, we first present a computational model of the havoc mode that formally describes its mutation process. Based on this model, we show that the havoc mode can simultaneously launch FTI while generating and executing new testcases. Further, we propose a novel FTI called ZTaint-Havoc that does not need any additional program executions. ZTaint-Havoc incurs minimal instrumentation overhead of 3.84% on UniBench and 12.58% on FuzzBench. In the end, we give an effective mutation algorithm using the hot bytes identified by ZTaint-Havoc.
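The core observation can be sketched in a few lines of Python: a havoc-style loop already knows which byte offsets it mutated in each execution it was going to run anyway, so offsets whose mutation changes observed behavior can be voted up as hot bytes; the target program and its coverage function below are stand-ins:

import random
from collections import Counter

random.seed(0)

def run_target(data: bytes) -> frozenset:
    """Stand-in for an instrumented execution returning the set of covered edges."""
    edges = {0}
    if len(data) > 4 and data[4] >= 0x80:
        edges.add(1)                       # branch influenced by byte 4
    if data[:2] == b"HD":
        edges.add(2)                       # branch influenced by bytes 0-1
    return frozenset(edges)

seed = bytearray(b"HDxxxxxxxx")
baseline = run_target(bytes(seed))
hot_votes = Counter()

for _ in range(2000):                      # ordinary havoc loop, no extra executions
    data = bytearray(seed)
    touched = random.sample(range(len(data)), k=random.randint(1, 3))
    for off in touched:
        data[off] = random.randrange(256)
    if run_target(bytes(data)) != baseline:
        hot_votes.update(touched)          # mutated offsets that changed behavior

print("inferred hot bytes:", [off for off, _ in hot_votes.most_common(4)])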
We conduct a comprehensive evaluation to investigate the computational model of havoc mode. Our evaluation result justifies that it is feasible to adapt the havoc mode to an efficient FTI without any additional program execution. We further implement our approach as a prototype ZTaint-Havoc based on the havoc mode of AFL++. We evaluate ZTaint-Havoc on two fuzzing datasets FuzzBench and UniBench. Our extensive evaluation results show that ZTaint-Havoc improves edge coverage by up to 33.71% on FuzzBench and 51.12% on UniBench over vanilla AFL++, with average improvements of 2.97% and 6.12% respectively, in 24-hour campaigns.
Publisher's Version
A Low-Cost Feature Interaction Fault Localization Approach for Software Product Lines
Haining Wang,
Yi Xiang,
Han Huang,
Jie Cao,
Kaichen Chen, and
Xiaowei Yang
(South China University of Technology, China; iSOFT Infrastructure Software, China)
In Software Product Lines (SPLs), localizing buggy feature interactions helps developers identify the root cause of test failures, thereby reducing their workload. This task is challenging because the number of potential interactions grows exponentially with the number of features, resulting in a vast search space, especially for large SPLs. Previous approaches have partially addressed this issue by constructing and examining potential feature interactions based on suspicious feature selections (e.g., those present in failed configurations but not in passed ones). However, these approaches often overlook the causal relationship between buggy feature interaction and test failures, resulting in an excessive search space and high-cost fault localization. To address this, we propose a low-cost Counterfactual Reasoning-Based Fault Localization (CRFL) approach for SPLs, which enhances fault localization efficiency by reducing both the search space and redundant computations. Specifically, CRFL employs counterfactual reasoning to infer suspicious feature selections and utilizes symmetric uncertainty to filter out irrelevant feature interactions. Additionally, CRFL incorporates two findings to prevent the repeated generation and examination of the same feature interactions. We evaluate the performance of our approach using eight publicly available SPL systems. To enable comparisons on larger real-world SPLs, we generate multiple buggy mutants for both BerkeleyDB and TankWar. Experimental results show that our approach reduces the search space by 51%∼73% for small SPLs (with 6∼9 features) and by 71%∼88% for larger SPLs (with 13∼99 features). The average runtime of our approach is approximately 15.6 times faster than that of a state-of-the-art method. Furthermore, when combined with statement-level localization techniques, CRFL can efficiently localize buggy statements, demonstrating its ability to accurately identify buggy feature interactions.
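To make the symmetric-uncertainty filter concrete, the minimal Python sketch below computes SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) between feature-selection vectors and test outcomes over sampled configurations; the configurations and outcomes are toy data, not the paper's subjects:

import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    mi = hx + hy - hxy                       # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 0.0 if hx + hy == 0 else 2.0 * mi / (hx + hy)

# Feature selections (1 = selected) across 8 sampled configurations,
# and whether each configuration's test failed.
feature_a = [1, 1, 0, 0, 1, 0, 1, 0]
feature_b = [0, 1, 0, 1, 1, 0, 1, 1]
failed    = [1, 1, 0, 0, 1, 0, 1, 0]

print("SU(A, fail) =", round(symmetric_uncertainty(feature_a, failed), 3))  # high: keep A
print("SU(B, fail) =", round(symmetric_uncertainty(feature_b, failed), 3))  # low: filter B out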
Publisher's Version
WildSync: Automated Fuzzing Harness Synthesis via Wild API Usage Recovery
Wei-Cheng Wu,
Stefan Nagy, and
Christophe Hauser
(Dartmouth College, USA; University of Utah, USA)
Fuzzing stands as one of the most practical techniques for testing software efficiently. When applying fuzzing to software library APIs, high-quality fuzzing harnesses are essential, enabling fuzzers to execute the APIs with precise sequences and function parameters. Although software developers commonly rely on manual efforts to create fuzzing harnesses, there has been a growing interest in automating this process. Existing works are often constrained in scalability and effectiveness due to their reliance on compiler-based analysis or runtime execution traces, which require manual setup and configuration. Our investigation of multiple actively fuzzed libraries reveals that a large number of exported API functions externally used by various open-source projects remain untested by existing harnesses or unit-test files. The lack of testing for these API functions increases the risk of vulnerabilities going undetected, potentially leading to security issues.
In order to address the lack of coverage affecting existing fuzzing methods, we propose a novel approach to automatically generate fuzzing harnesses by extracting usage patterns of untested functions from real-world scenarios, using techniques based on lightweight Abstract Syntax Tree parsing to extract API usage from external source code. Then, we integrate the usage patterns into existing harnesses to construct new ones covering these untested functions. We have implemented a prototype of this concept named WildSync, enabling the automatic synthesis of fuzzing harnesses for C/C++ libraries on OSS-Fuzz. In our experiments, WildSync successfully produced 469 new harnesses for 24 actively fuzzed libraries on OSS-Fuzz, and also 3 widely used libraries that can be later integrated into OSS-Fuzz. This results in a significant increase in test coverage spanning over 1.3k functions and 16k lines of code, while also identifying 7 previously undetected bugs.
Publisher's Version
Assessing Scene Generation Techniques for Testing COLREGS-Compliance of Autonomous Surface Vehicles
Dominik Frey,
Ulf Kargén, and
Dániel Varró
(Linköping University, Sweden; McGill University, Canada)
Autonomous surface vehicles (ASVs) need to complete missions without posing risks to other maritime traffic. Safe traffic in open sea encounters is controlled by the International Regulations for Preventing Collisions at Sea (COLREGS) formulated by the International Maritime Organization (IMO). Designed with human operators in mind, the COLREGS are intentionally underspecified, which may result in ambiguous requirements for correct behaviour for ASVs. Hence the systematic testing of such ambiguous situations is particularly important.
This paper investigates to what extent existing test scenario generation approaches deemed effective in the automotive domain can be adapted to test COLREGS-compliance in a maritime context with multi-vessel encounters. In a series of experiments involving synthetic and real-world test scenarios, their performance is evaluated with respect to relevance, diversity, completeness, scalability and speed. Our results indicate that (1) test scenarios derived from historic maritime traffic are insufficient for testing multi-ship encounters. Moreover, (2) existing test scenario generation techniques provide sufficient scalability and speed, but they are very limited in terms of diversity and completeness when the number of vessels increases.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Clause2Inv: A Generate-Combine-Check Framework for Loop Invariant Inference
Weining Cao,
Guangyuan Wu,
Tangzhi Xu,
Yuan Yao,
Hengfeng Wei,
Taolue Chen, and
Xiaoxing Ma
(Nanjing University, China; Birkbeck University of London, UK)
Loop invariant inference is a fundamental, yet challenging, problem in program verification. Recent work adopts the guess-and-check framework, where candidate loop invariants are iteratively generated in the guess step and verified in the check step. A major challenge of this general framework is to produce high-quality candidate invariants in each iteration so that the inference procedure can converge quickly. Empirically, we observe that existing approaches may struggle with guessing the complete invariant due to the complexity of logical connectives, but usually, all the clauses of the correct loop invariant have already appeared in the previous guesses. This motivates us to refine the guess-and-check framework, resulting in a generate-combine-check framework, where the loop invariant inference task is divided into clause generation and clause combination. Specifically, we propose Clause2Inv, a novel loop invariant inference approach under the new framework, which consists of an LLM-based clause generator and a counterexample-driven clause combinator. As the clause generator, Clause2Inv leverages LLMs to generate a multitude of clauses; as the clause combinator, it leverages counterexamples from the previous rounds to convert generated clauses into invariants. Our experiments show that Clause2Inv significantly outperforms existing loop invariant inference approaches. For example, Clause2Inv solved 312 (out of 316) linear invariant inference tasks and 44 (out of 50) nonlinear invariant inference tasks, which is at least 93 and 16 more than the existing baselines, respectively. By design, the generate-combine-check framework is flexible enough to accommodate various existing approaches that are currently under the guess-and-check framework by splitting their guessed candidate invariants into clauses. The evaluation shows that our approach can, with minor adaptation, improve existing loop invariant inference approaches in both effectiveness and efficiency. For example, Code2Inv, which solved 210 linear problems with an average solving time of 137.6 seconds, can be improved to solve 252 problems with an average solving time of 17.8 seconds.
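The combine step can be pictured with a minimal Python sketch: given candidate clauses (in practice produced by the LLM-based generator), search for a conjunction that holds on all reachable states seen so far and is refuted by every counterexample; the clauses, states, and brute-force enumeration below are illustrative only:

from itertools import combinations

clauses = {
    "x >= 0":     lambda s: s["x"] >= 0,
    "y >= 0":     lambda s: s["y"] >= 0,
    "x + y == n": lambda s: s["x"] + s["y"] == s["n"],
    "x <= 1":     lambda s: s["x"] <= 1,          # too strong: rejected by a reachable state
}

positive = [{"x": 0, "y": 5, "n": 5}, {"x": 3, "y": 2, "n": 5}]   # reachable states
negative = [{"x": 2, "y": 2, "n": 5}, {"x": -1, "y": 6, "n": 5}]  # counterexamples

def holds(conj, state):
    return all(clauses[c](state) for c in conj)

for size in range(1, len(clauses) + 1):
    for conj in combinations(clauses, size):
        if all(holds(conj, s) for s in positive) and not any(holds(conj, s) for s in negative):
            print("candidate invariant:", " && ".join(conj))   # handed to the checker next
            break
    else:
        continue
    break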
Publisher's Version
Copy-and-Paste? Identifying EVM-Inequivalent Code Smells in Multi-chain Reuse Contracts
Zexu Wang,
Jiachi Chen,
Tao Zhang,
Yu Zhang,
Weizhe Zhang,
Yuming Feng, and
Zibin Zheng
(Sun Yat-sen University, China; Peng Cheng Laboratory, China; Guangdong Engineering Technology Research Center of Blockchain, China; Macau University of Science and Technology, China; Harbin Institute of Technology, China)
With the growing development of Solidity contracts on Ethereum, more developers are reusing them on other compatible blockchains. However, developers may overlook the differences between the designs of blockchain systems, such as the Gas Mechanism and Consensus Protocol, so the same contracts on different blockchains may fail to achieve execution as consistent as on Ethereum. This inconsistency reveals design flaws in reused contracts, exposing code smells that hinder code reusability, and we define this inconsistency as EVM-Inequivalent Code Smells.
In this paper, we conducted the first empirical study to reveal the causes and characteristics of EVM-Inequivalent Code Smells. To ensure the identified smells reflect real developer concerns, we collected and analyzed 1,379 security audit reports and 326 Stack Overflow posts related to reused contracts on EVM-compatible blockchains, such as Binance Smart Chain (BSC) and Polygon. Using the open card sorting method, we defined six types of EVM-Inequivalent Code Smells. For automated detection, we developed a tool named EquivGuard. It employs static taint analysis to identify key paths from different patterns and uses symbolic execution to verify path reachability. Our analysis of 905,948 contracts across six major blockchains shows that EVM-Inequivalent Code Smells are widespread, with an average prevalence of 17.70%. While contracts with code smells do not necessarily lead to financial loss and attacks, their high prevalence and significant asset management underscore the potential threats of reusing these smelly Ethereum contracts. Thus, developers are advised to abandon Copy-and-Paste programming practices and detect EVM-Inequivalent Code Smells before reusing Ethereum contracts.
Publisher's Version
You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects
Islem Bouzenia and
Michael Pradel
(University of Stuttgart, Germany)
The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Finding 709 Defects in 258 Projects: An Experience Report on Applying CodeQL to Open-Source Embedded Software (Experience Paper)
Mingjie Shen,
Akul Abhilash Pillai,
Brian A. Yuan,
James C. Davis, and
Aravind Machiry
(Purdue University, USA)
Embedded software is deployed in billions of devices worldwide, including in safety-sensitive systems like medical devices and autonomous vehicles. Defects in embedded software can have severe consequences. Many embedded software products incorporate Open-Source Embedded Software (EMBOSS), so it is important for EMBOSS engineers to use appropriate mechanisms to avoid defects. One of the common security practices is to use Static Application Security Testing (SAST) tools, which help identify commonly occurring vulnerabilities. Existing research related to SAST tools focuses mainly on regular (or non-embedded) software. There is a lack of knowledge about the use of SAST tools in embedded software. Furthermore, embedded software greatly differs from regular software in terms of semantics, software organization, coding practices, and build setup. All of these factors influence SAST tools and could potentially affect their usage.
In this experience paper, we report on a large-scale empirical study of SAST in EMBOSS repositories. We collected a corpus of 258 of the most popular EMBOSS projects, and then measured their use of SAST tools via program analysis and a survey (N=25) of their developers. Advanced SAST tools are rarely used: only 3% of projects go beyond trivial compiler analyses. Developers cited the perception of ineffectiveness and false positives as reasons for limited adoption. Motivated by this deficit, we applied the state-of-the-art (SOTA) CodeQL SAST tool and measured its ease of use and actual effectiveness. Across the 258 projects, CodeQL reported 709 true defects with a false positive rate of 34%. There were 535 (75%) likely security vulnerabilities, including in major projects maintained by Microsoft, Amazon, and the Apache Foundation. EMBOSS engineers have confirmed 376 (53%) of these defects, mainly by accepting our pull requests. Two CVEs were issued. Based on these results, we proposed pull requests to include our workflows as part of EMBOSS Continuous Integration (CI) pipelines; 37 of these (71% of active repositories) have already been merged. In summary, we urge EMBOSS engineers to adopt the current generation of SAST tools, which offer low false positive rates and are effective at finding security-relevant defects.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Enhancing Smart Contract Security Analysis with Execution Property Graphs
Kaihua Qin,
Zhe Ye,
Zhun Wang,
Weilin Li,
Liyi Zhou,
Chao Zhang,
Dawn Song, and
Arthur Gervais
(Yale University, USA; University of California at Berkeley, USA; University College London, UK; University of Sydney, Australia; Tsinghua University, China)
Smart contract vulnerabilities have led to significant financial losses, with their increasing complexity rendering outright prevention of hacks increasingly challenging. This trend highlights the crucial need for advanced forensic analysis and real-time intrusion detection, where dynamic analysis plays a key role in dissecting smart contract executions. Therefore, there is a pressing need for a unified and generic representation of smart contract executions, complemented by an efficient methodology that enables the modeling and identification of a broad spectrum of emerging attacks.
We introduce Clue, a dynamic analysis framework specifically designed for the Ethereum virtual machine. Central to Clue is its ability to capture critical runtime information during contract executions, employing a novel graph-based representation, the Execution Property Graph. A key feature of Clue is its innovative graph traversal technique, which is adept at detecting complex attacks, including (read-only) reentrancy and price manipulation. Evaluation results reveal Clue's superior performance with high true positive rates and low false positive rates, outperforming state-of-the-art tools. Furthermore, Clue's efficiency positions it as a valuable tool for both forensic analysis and real-time intrusion detection.
Publisher's Version
MLLM-Based UI2Code Automation Guided by UI Layout Information
Fan Wu,
Cuiyun Gao,
Shuqing Li,
Xin-Cheng Wen, and
Qing Liao
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China)
Converting user interfaces into code (UI2Code) is a crucial step in website development, which is time-consuming and labor-intensive. Automating UI2Code is essential to streamline this task and beneficial for improving development efficiency. Deep learning-based methods exist for the task; however, they heavily rely on a large amount of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) presents potential for alleviating the issue, but they find it difficult to comprehend the complex layouts in UIs and to generate accurate code with the layout preserved. To address these issues, we propose LayoutCoder, a novel MLLM-based framework generating UI code from real-world webpage images, which includes three key modules: (1) Element Relation Construction, which aims at capturing UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which aims at generating UI layout trees for guiding the subsequent code generation process; and (3) Layout-Guided Code Fusion, which aims at producing accurate code with the layout preserved. For evaluation, we build a new benchmark dataset named Snap2Code, which involves 350 real-world websites and is divided into seen and unseen parts to mitigate the data leakage issue, in addition to the popular dataset Design2Code. Extensive evaluation shows the superior performance of LayoutCoder over the state-of-the-art approaches. Compared with the best-performing baseline, LayoutCoder improves the BLEU score by 10.14% and the CLIP score by 3.95% on average across all datasets.
Publisher's Version
Quantum Concolic Testing
Shangzhou Xia,
Jianjun Zhao,
Fuyuan Zhang, and
Xiaoyu Guo
(Kyushu University, Japan; University of Tokyo, Japan)
This paper presents the first concolic testing framework explicitly designed for quantum programs. The framework introduces quantum constraint generation methods for quantum control statements that quantify quantum states and offers a symbolization method for quantum variables. Based on this framework, we generate path constraints for each concrete execution path of a quantum program. These constraints guide the exploration of new paths, with a quantum constraint solver determining outcomes to create novel input samples, thereby enhancing branch coverage. Our framework has been implemented in Python and integrated with Qiskit for practical evaluation. Experimental results show that our concolic testing framework improves branch coverage, generates high-quality quantum input samples, and detects bugs, demonstrating its effectiveness and efficiency in quantum programming and bug detection. Regarding branch coverage, our framework achieves more than 74.27% on quantum programs with under 5 qubits.
Publisher's Version
BinQuery: A Novel Framework for Natural Language-Based Binary Code Retrieval
Bolun Zhang,
Zeyu Gao,
Hao Wang,
Yuxin Cui,
Siliang Qin,
Chao Zhang,
Kai Chen, and
Beibei Zhao
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China)
Binary Function Retrieval (BFR) is crucial in reverse engineering for identifying specific functions in binary code, especially those associated with malicious behavior or vulnerabilities. Traditional BFR methods rely on heuristics, often lacking the efficiency and adaptability needed for large-scale or diverse binary analysis tasks. To address these challenges, we present BinQuery, a Natural Language-based BFR (NL-based BFR) framework that uses natural language queries to retrieve relevant binary functions with improved flexibility and precision. BinQuery introduces innovative techniques to bridge information gaps between binary code and natural language, achieves fine-grained alignment for enhanced retrieval accuracy, and leverages Large Language Models (LLMs) to refine queries and generate diverse descriptions. Our extensive experiments indicate that BinQuery surpasses current state-of-the-art methods, achieving a 42.55% increase in recall@1 and a 4× improvement in performance on comparable benchmarks.
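The retrieval core can be sketched minimally in Python: embed the natural-language query and candidate function representations into a shared space and rank by cosine similarity; here a TF-IDF vectorizer stands in for BinQuery's learned encoders, and the decompiled-function summaries are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

functions = {
    "sub_401000": "allocates buffer copies user input with strcpy no bounds check",
    "sub_402340": "parses png header validates magic bytes and chunk length",
    "sub_40a1b0": "opens socket sends beacon to remote host every 60 seconds",
}
query = "function that copies attacker controlled data without checking the size"

vec = TfidfVectorizer().fit(list(functions.values()) + [query])
scores = cosine_similarity(vec.transform([query]),
                           vec.transform(list(functions.values())))[0]
ranked = sorted(zip(functions, scores), key=lambda kv: -kv[1])
print(ranked[0])   # the strcpy-like function should rank first for this query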
Publisher's Version
BinDSA: Efficient, Precise Binary-Level Pointer Analysis with Context-Sensitive Heap Reconstruction
Lian Gao and
Heng Yin
(University of California at Riverside, USA)
Pointer analysis serves as a fundamental component in the realm of binary code reverse engineering. It can be leveraged to reconstruct a binary program's call graph and can be further applied to various security analyses. However, the absence of symbols and type information within binary code presents formidable challenges to effective pointer analysis. Existing works often apply approximations when performing pointer analysis on binaries. Nevertheless, these methods tend to be inefficient and produce numerous false positive targets. In this paper, we propose BinDSA, a novel model tailored for binary pointer analysis. BinDSA prioritizes precision and efficiency over soundness. It is field- and context-sensitive, employing unification-based techniques and reconstructing a context-sensitive heap. It jointly recovers data structures and points-to relations so that precision can be further improved. In our evaluation, we demonstrate that BinDSA is 5 times more efficient and notably more precise than the current state-of-the-art technique without significantly sacrificing soundness. We also apply BinDSA to CVE reachability analysis and vulnerability detection, demonstrating its effective application to security tasks.
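As background for the unification-based technique, the minimal Python sketch below implements a Steensgaard-style points-to analysis with union-find, where each abstract cell keeps a single points-to target and assignments unify equivalence classes; field sensitivity, context sensitivity, and the heap reconstruction that BinDSA adds are omitted:

class Cell:
    def __init__(self, name):
        self.name = name
        self.parent = self
        self.pts = None                       # single points-to target (an abstract cell)

    def find(self):
        root = self
        while root.parent is not root:
            root = root.parent
        node = self
        while node.parent is not root:        # path compression
            node.parent, node = root, node.parent
        return root

def unify(a, b):
    """Merge the equivalence classes of a and b, collapsing their targets too."""
    a, b = a.find(), b.find()
    if a is b:
        return
    b.parent = a
    if a.pts is None:
        a.pts = b.pts
    elif b.pts is not None:
        unify(a.pts, b.pts)

cells = {n: Cell(n) for n in ["p", "q", "x", "y"]}
cells["p"].pts = cells["x"]                   # p = &x
cells["q"].pts = cells["y"]                   # q = &y
unify(cells["p"], cells["q"])                 # p = q: unification merges their targets

for name in ["p", "q"]:
    target = cells[name].find().pts.find()
    aliases = sorted(m for m, c in cells.items() if c.find() is target)
    print(name, "->", aliases)                # both now point to the merged {x, y} class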
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Adding Spatial Memory Safety to EDK II through Checked C (Experience Paper)
Sourag Cherupattamoolayil,
Arunkumar Bhattar,
Connor Everett Glosner, and
Aravind Machiry
(Purdue University, USA)
Embedded software, predominantly written in C, is prone to memory corruption vulnerabilities due to spatial memory issues. Although various memory safety techniques exist, they are often unsuitable for embedded systems due to resource constraints and a lack of standardized OS support. Checked C, a backward-compatible, memory-safe C dialect, offers a potential solution by using pointer annotations for runtime checks to enhance spatial memory safety with minimal overhead. This paper provides the first experience report of porting EDK2 (an open-source UEFI implementation), an exemplary embedded codebase, to Checked C, highlighting challenges and providing insights into applying Checked C to similar embedded systems. We also provide an enhanced automated annotation tool, e3c, which improves the conversion rate by 25%, enabling easier conversion to Checked C.
Publisher's Version
Published Artifact
Artifacts Available
REACCEPT: Automated Co-evolution of Production and Test Code Based on Dynamic Validation and Large Language Models
Jianlei Chi,
Xiaotian Wang,
Yuhan Huang,
Lechen Yu,
Di Cui,
Jianguo Sun, and
Jun Sun
(Xidian University, China; Harbin Engineering University, China; Microsoft, USA; Singapore Management University, Singapore)
Synchronizing production and test code, known as PT co-evolution, is critical for software quality. Given the significant manual effort involved, researchers have tried automating PT co-evolution using predefined heuristics and machine learning models. However, existing solutions are still incomplete. Most approaches only detect and flag obsolete test cases, leaving developers to manually update them. Meanwhile, existing solutions may suffer from low accuracy, especially when applied to real-world software projects.
In this paper, we propose ReAccept, a novel approach leveraging large language models (LLMs), retrieval-augmented generation (RAG), and dynamic validation to fully automate PT co-evolution with high accuracy. ReAccept employs an experience-guided approach to generate prompt templates for the identification and subsequent update processes. After updating a test case, ReAccept performs dynamic validation by checking syntax, verifying semantics, and assessing test coverage. If the validation fails, ReAccept leverages the error messages to iteratively refine the patch. To evaluate ReAccept's effectiveness, we conducted extensive experiments with a dataset of 537 Java projects and compared ReAccept's performance with several state-of-the-art methods. The evaluation results show that ReAccept achieved an update accuracy of 60.16% on the correctly identified obsolete test code, surpassing the state-of-the-art technique CEPROT by 90%. These findings demonstrate that ReAccept can effectively maintain test code, improve overall software quality, and significantly reduce maintenance effort.
Publisher's Version
Understanding Practitioners’ Expectations on Clear Code Review Comments
Junkai Chen,
Zhenhao Li,
Qiheng Mao,
Xing Hu,
Kui Liu, and
Xin Xia
(Zhejiang University, China; York University, Canada)
The code review comment (CRC) is pivotal in the process of modern code review. It provides reviewers with the opportunity to identify potential bugs, offer constructive feedback, and suggest improvements. Clear and concise code review comments (CRCs) facilitate the communication between developers and are crucial to the correct understanding of the identified issues and proposed solutions. Despite the importance of CRC clarity, there is still a lack of guidelines on what constitutes good clarity and how to evaluate it. In this paper, we conduct a comprehensive study on understanding and evaluating the clarity of CRCs. We first derive a set of attributes related to the clarity of CRCs, namely the RIE attributes (i.e., Relevance, Informativeness, and Expression), as well as their corresponding evaluation criteria based on our literature review and a survey with practitioners. We then investigate the clarity of CRCs in open-source projects written in nine programming languages and find that a large portion (i.e., 28.8%) of the CRCs lack clarity in at least one of the attributes. Finally, we explore the potential of automatically evaluating the clarity of CRCs by proposing ClearCRC. Experimental results show that ClearCRC with pre-trained language models is promising for effective evaluation of the clarity of CRCs, achieving a balanced accuracy of up to 73.04% and an F-1 score of up to 94.61%.
Publisher's Version
Pepper: Preference-Aware Active Trapping for Ransomware
Huan Zhang,
Zhengkai Qin,
Lixin Zhao,
Aimin Yu,
Lijun Cai, and
Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Ransomware encrypts files on infected systems and demands a hefty ransom for decryption, posing a significant threat to both enterprises and individuals. However, existing methods fail to capture the encryption preferences of diverse ransomware families, lacking an efficient and systematic proactive defense method. In this paper, we propose Pepper, a preference-aware active ransomware trapping method, covering decoy file generation, deployment, and monitoring. Through examination of numerous ransomware families, we have identified two prevalent encryption preferences: encryption file preferences and encryption path preferences. Deploying decoy files aligned with ransomware’s encryption preferences within its preferred pathways provides an opportunity for efficient and early trapping of ransomware. Pepper combines a GNN-based recommendation model with expert insights to unveil the encryption file and path preferences across various ransomware families, guiding the generation and deployment of decoy files. Moreover, a decoy file monitor is designed to continuously monitor decoy file changes and promptly respond to anomalies. Extensive experiments show that Pepper achieves a 98.68% detection rate for ransomware, with an average file loss of 2.27. Moreover, it exhibits robustness in detecting unknown ransomware variants and does not interfere with regular users.
Publisher's Version
Extended Reality Cybersickness Assessment via User Review Analysis
Shuqing Li,
Qisheng Zheng,
Cuiyun Gao,
Jia Feng, and
Michael R. Lyu
(Chinese University of Hong Kong, China; Harbin Institute of Technology, China)
In recent years, the extended reality (XR) software ecosystem has emerged as the next ubiquitous computing platform, as it provides users with immersive interactive experiences. However, the XR ecosystem suffers from cybersickness problems, which can greatly affect user comfort and safety, leading to symptoms like headaches, disorientation, etc. That makes effective cybersickness assessment a timely and important question. The state-of-the-art methods for assessing the cybersickness of XR software typically monitor users’ biological metrics during XR usage; they rely heavily on manual playtesting and suffer from limited scalability. User reviews on XR app stores are informative for developers to learn the cybersickness ratings of their apps and the reasons behind them. Nevertheless, the large number of user reviews can hardly be analyzed by developers manually, and most current automatic user review analysis methods can only provide coarse-grained or abstract results, such as extracting several high-level key topic groups discussed by reviews. Recent advancements in LLMs may bring new opportunities. However, directly leveraging LLMs for evaluating XR cybersickness is challenging because LLMs perform poorly on a large number of short texts, and their context window is limited. To address these challenges, we introduce XRCare, a comprehensive framework designed to automate the assessment of cybersickness and root cause reasoning for XR apps by resorting to fine-grained user review analysis. XRCare mainly includes three phases: (1) Insight pool construction, in which XRCare collects cybersickness analyzing chains and the corresponding analyzing results from domain experts; (2) Reasoning graph construction, in which XRCare dynamically extracts, categorizes, and maintains the reasons from user reviews that make users feel cybersickness on a self-evolving hierarchical graph; and (3) Multi-agent deductive cybersickness reasoning, which utilizes a multi-agent system to simulate diverse user demographics for analyzing and rating the intensity of cybersickness, as well as the causes of cybersickness. This structured approach allows XRCare to systematically identify, categorize, and address instances of cybersickness. For experiments, we construct a large-scale dataset consisting of 685,111 user reviews from 9,667 XR apps. Our evaluation shows that XRCare enhances the F1-score by 20.63% over the best-performing baseline and by 32.27% on average across all baselines, while also offering more accurate and detailed interpretability insights.
Publisher's Version
Wemby’s Web: Hunting for Memory Corruption in WebAssembly
Oussama Draissi,
Tobias Cloosters,
David Klein,
Michael Rodler,
Marius Musch,
Martin Johns, and
Lucas Davi
(University of Duisburg-Essen, Germany; TU Braunschweig, Germany; Amazon Web Services, Germany)
WebAssembly enables fast execution of performance-critical code in web applications by utilizing native code. However, recent research has demonstrated the potential for memory corruption errors within WebAssembly modules to exploit web applications. In this work, we present the first systematic analysis of memory corruption in WebAssembly, unveiling the prevalence of a novel threat model where memory corruption enables code injection in a victim’s browser. Our large-scale analysis across 37,797 domains reveals that an alarming 29,411 (77.81%) of them fully trust data coming from potentially attacker-controlled sources. As a result, an attacker can exploit memory errors to manipulate the WebAssembly memory, where the data is implicitly trusted and frequently passed into security-sensitive functions such as eval or directly into the DOM via innerHTML. Thus, an attacker can abuse this trust to gain JavaScript code execution, i.e., Cross-Site Scripting (XSS).
To tackle this issue, we present Wemby, the first viable approach to efficiently analyze WebAssembly-powered websites holistically. We demonstrate that Wemby is proficient at detecting remotely exposed memory corruption errors in web applications through fuzzing. For this purpose, we implement binary-only WebAssembly instrumentation that provides fine-grained memory corruption oracles. We applied Wemby to different websites, uncovering several memory corruption bugs, including one on the Zoom platform. In terms of performance, our ablation study demonstrates that Wemby outperforms current WebAssembly fuzzers. Specifically, Wemby achieves an average speed improvement of 232 times and delivers 46% greater code coverage compared to the state-of-the-art.
Publisher's Version
The Incredible Shrinking Context... in a Decompiler Near You
Sifis Lagouvardos,
Yannis Bollanos,
Neville Grech, and
Yannis Smaragdakis
(University of Athens, Greece; Dedaub, Greece; Dedaub, Malta)
Decompilation of binary code has arisen as a highly important application in the space of Ethereum VM (EVM) smart contracts. Major new decompilers appear nearly every year and attain popularity, for a multitude of reverse-engineering or tool-building purposes. Technically, the problem is fundamental: it consists of recovering high-level control flow from a highly optimized continuation-passing-style (CPS) representation. Architecturally, decompilers can be built using either static analysis or symbolic execution techniques.
We present Shrnkr, a static-analysis-based decompiler succeeding the state-of-the-art Elipmoc decompiler. Shrnkr achieves drastic improvements relative to the state of the art in all significant dimensions: scalability, completeness, and precision. Chief among the techniques employed is a new variant of static analysis context: shrinking context sensitivity. Shrinking context sensitivity performs deep cuts in the static analysis context, eagerly “forgetting” control-flow history, in order to leave room for further precise reasoning.
We compare Shrnkr to state-of-the-art decompilers, both static-analysis- and symbolic-execution-based. In a standard benchmark set, Shrnkr scales to over 99.5% of contracts (compared to ∼95% for Elipmoc), covers (i.e., reaches and manages to decompile) 67% more code than Heimdall-rs, and reduces key imprecision metrics by over 65%, compared again to Elipmoc.
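The core idea of shrinking context sensitivity can be illustrated with a toy example: instead of distinguishing analysis facts by the full control-flow history, the analysis eagerly keeps only a short, recent suffix of it. The sketch below is a simplification for intuition only and does not reflect Shrnkr's actual context constructor.

```python
# Toy illustration of shrinking the analysis context: keep only a short,
# recent suffix of the control-flow history instead of the whole call string.
# This is for intuition only and is not Shrnkr's actual context constructor.
def full_context(history):
    return tuple(history)                        # classic full call-string context

def shrinking_context(history, keep=2):
    return tuple(history[-keep:])                # deep cut: forget older history

trace = ["entry", "dispatch", "helper", "callback", "sink"]
print(full_context(trace))       # ('entry', 'dispatch', 'helper', 'callback', 'sink')
print(shrinking_context(trace))  # ('callback', 'sink') -- far fewer distinct contexts
```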
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation
Zhenlan Ji,
Pingchuan Ma,
Zongjie Li,
Zhaoyu Wang, and
Shuai Wang
(Hong Kong University of Science and Technology, China)
While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where LLMs, deemed complex and powerful black-box models, are instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and their lack of transparency.
Inspired by recent progress in causality analysis and its software engineering applications, this paper proposes a causality-driven approach to systematically analyze prompt-code causal relationships. However, this endeavor faces three key technical challenges: (1) representing textual prompts and code in a canonical form, (2) establishing causal relations between high-level concepts and code features, and (3) systematically analyzing diverse prompt variations. To address these challenges, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The formed causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide through studies over four popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.
Publisher's Version
AdverIntent-Agent: Adversarial Reasoning for Repair Based on Inferred Program Intent
He Ye,
Aidan Z.H. Yang,
Chang Hu,
Yanlin Wang,
Tao Zhang, and
Claire Le Goues
(University College London, UK; Carnegie Mellon University, USA; Macau University of Science and Technology, China; Sun Yat-sen University, China)
Automated program repair (APR) has shown promising results, particularly with the use of neural networks. Currently, most APR tools focus on code transformations specified by test suites, rather than reasoning about the program’s intent and the high-level bug specification. Without a proper understanding of program intent, these tools tend to generate patches that overfit incomplete test suites and fail to reflect the developer’s intentions. However, reasoning about program intent is challenging. In our work, we propose an approach called AdverIntent-Agent, based on critique and adversarial reasoning. Our approach is novel in that it shifts the focus from generating multiple APR patches to inferring multiple potential program intents. Ideally, we aim to infer intents that are, to some extent, adversarial to each other, maximizing the probability that at least one aligns closely with the developer’s original intent. AdverIntent-Agent is a multi-agent approach consisting of three agents: a reasoning agent, a test agent, and a repair agent. First, the reasoning agent generates adversarial program intents along with the corresponding faulty statements. Next, the test agent produces adversarial test cases that align with each inferred intent, constructing oracles that use the same inputs but have different expected outputs. Finally, the repair agent uses dynamic and precise LLM prompts to generate patches that satisfy both the inferred program intent and the generated tests. AdverIntent-Agent was evaluated on two benchmarks, Defects4J 2.0 and HumanEval-Java, where it correctly repaired 77 and 105 bugs, respectively. Our work helps reduce the effort required to review patches by enabling developers to assess program intent in natural language, rather than reviewing code patches.
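The notion of adversarial intents sharing test inputs but disagreeing on expected outputs can be illustrated with a toy example; the function and intents below are hypothetical and only mirror the paper's oracle-construction idea.

```python
# Toy illustration of adversarial intent oracles: two inferred intents share
# the same test input but disagree on the expected output, so at most one can
# match the developer's true intent. Function and intents are hypothetical.
def rounding(x):                                  # program under repair
    return int(x)

inferred_intents = {
    "truncate toward zero": lambda x: int(x),
    "round half up":        lambda x: int(x + 0.5),
}

test_input = 2.7
for intent, oracle in inferred_intents.items():
    expected = oracle(test_input)
    actual = rounding(test_input)
    verdict = "PASS" if actual == expected else "FAIL"
    print(f"{intent}: expected {expected}, got {actual} -> {verdict}")
```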
Publisher's Version
ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation
Pengyu Xue,
Linhao Wu,
Zhen Yang,
Chengyi Wang,
Xiang Li,
Yuxiang Zhang,
Jia Li,
Ruikai Jin,
Yifei Pei,
Zhaoyan Shen,
Xiran Lyu, and
Jacky Wai Keung
(Shandong University, China; Tsinghua University, China; City University of Hong Kong, Hong Kong)
In recent years, Large Language Models (LLMs) have dramatically advanced the performance of automated code translation, pushing their computational accuracy scores above 80% on many previous benchmarks. However, most code samples in these benchmarks are short, standalone, statement/method-level, and algorithmic, which is not aligned with practical coding tasks. Therefore, the actual capability of LLMs in translating code samples written for daily development remains unknown.
To this end, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T is extended from ClassEval, a well-known class-level Python code generation benchmark consisting of multiple practical coding topics, such as database operation and game design, and diverse contextual dependencies (e.g., fields, methods, and libraries). It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translation and evaluate eight recent LLMs of commercial, general, and code-specialized kinds in diverse families and sizes on ClassEval-T. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark, and clear discrepancies among LLMs emerge, demonstrating the effectiveness of ClassEval-T in differentiating recent LLMs. Afterwards, we further discuss the usage scenarios for the different translation strategies and LLMs' dependency awareness when translating class samples. Finally, 1,243 failure cases made by the best-performing LLM under test are thoroughly analyzed and categorized in this paper for practical guidance and future research.
Publisher's Version
Porting Software Libraries to OpenHarmony: Transitioning from TypeScript or JavaScript to ArkTS
Bo Zhou,
Jiaqi Shi,
Ying Wang,
Li Li,
Tsz On Li,
Hai Yu, and
Zhiliang Zhu
(Northeastern University, China; Beihang University, China; Hong Kong University of Science and Technology, China)
OpenHarmony emerges as a potent force in the mobile app domain, poised to stand alongside established industry giants. ArkTS is its main language, enhancing TypeScript (TS) and JavaScript (JS) with strict typing for improved performance. Developers are encouraged to port popular TS/JS libraries to OpenHarmony, supported by detailed guidelines. However, this requires a deep understanding of ArkTS syntax, following porting specifications, and making manual changes. An automated solution is crucial to streamline this process and foster a robust software ecosystem.
As a new programming language, ArkTS currently lacks essential tools for automated analysis and porting of software libraries. However, the rise of Large Language Models (LLMs) shows promise for effectively addressing automated porting tasks. There are two challenges in using LLMs to automate the porting of TS/JS libraries to OpenHarmony: (1) LLMs have limited exposure to ArkTS code, making it difficult for them to grasp the syntactical differences between ArkTS and JS/TS, as well as the various adaptation scenarios. (2) Project-level code adaptation often involves correcting numerous syntax mismatches, which complicates matters for LLMs as they must handle the interactions between different mismatches and interdependent code. In response, we introduce ArkAdapter, a project-level automatic code adaptation approach. ArkAdapter addresses Challenge 1 by establishing an adaptation knowledge repository for ArkTS syntax comprehension. It expands a collection of real code adaptation examples based on expert experience across various scenarios, improving the adaptation capabilities of LLMs through few-shot learning. ArkAdapter overcomes Challenge 2 with an adaptation priority strategy that considers both the dependency structure and the granularity of syntax-mismatching code. This strategy helps prevent interference among various syntax mismatches and their interdependent code. Evaluation shows ArkAdapter achieves high precision (86.84%).
Publisher's Version
Reinforcement Learning-Based Fuzz Testing for the Gazebo Robotic Simulator
Zhilei Ren,
Yitao Li,
Xiaochen Li,
Guanxiao Qi,
Jifeng Xuan, and
He Jiang
(Dalian University of Technology, China; Wuhan University, China)
Gazebo, being the most widely utilized simulator in robotics, plays a pivotal role in developing and testing robotic systems. Given its impact on the safety and reliability of robotic operations, early bug detection is critical. However, due to the challenges of strict input structures and a vast state space, directly applying existing fuzz testing approaches to Gazebo is not effective.
In this paper, we present GzFuzz, the first fuzz testing framework designed for Gazebo. GzFuzz addresses these challenges through a syntax-aware feasible command generation mechanism to handle strict input requirements, and a reinforcement learning-based command generator selection mechanism to efficiently explore the state space. By combining the two mechanisms under a unified framework, GzFuzz is able to detect bugs in Gazebo effectively. In extensive experiments, GzFuzz detects an average of 9.6 unique bugs in 12 hours and achieves substantially higher code coverage than the existing fuzzers AFL++ and Fuzzotron, with improvements of approximately 239%-363%. In less than six months, GzFuzz uncovered 25 unique crashes in Gazebo, 24 of which have been fixed or confirmed. Our results highlight the importance of directly fuzzing Gazebo, thereby presenting a novel and potent methodology that serves as an inspiration for enhancing testing across a broader range of simulators.
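The reinforcement learning-based generator selection can be approximated, at its simplest, by a multi-armed bandit that rewards a command generator whenever its commands reach new coverage. The sketch below uses an epsilon-greedy policy with placeholder generators and a synthetic coverage signal; it is not GzFuzz's actual learner or the Gazebo API.

```python
import random

# Epsilon-greedy sketch of reinforcement-learning-style generator selection:
# reward a command generator whenever its commands reach new coverage.
# Generator names and the coverage signal are placeholders, not Gazebo APIs.
GENERATORS = ["spawn_entity", "set_pose", "apply_wrench", "delete_entity"]
value = {g: 0.0 for g in GENERATORS}
count = {g: 0 for g in GENERATORS}
seen_coverage = set()

def run_and_measure(generator):
    # Placeholder: pretend each run of the simulator touches some basic blocks.
    return {random.randrange(100) for _ in range(5)}

for step in range(200):
    g = (random.choice(GENERATORS) if random.random() < 0.1      # explore
         else max(GENERATORS, key=value.get))                    # exploit
    covered = run_and_measure(g)
    reward = len(covered - seen_coverage)        # reward = newly covered blocks
    seen_coverage |= covered
    count[g] += 1
    value[g] += (reward - value[g]) / count[g]   # incremental mean update

print({g: round(v, 2) for g, v in value.items()})
```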
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Why Does My Transaction Fail? A First Look at Failed Transactions on the Solana Blockchain
Xiaoye Zheng,
Zhiyuan Wan,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Singapore Management University, Singapore; Hangzhou High-Tech Zone, China)
Solana is an emerging blockchain platform, recognized for its high throughput and low transaction costs, positioning it as a preferred infrastructure for Decentralized Finance (DeFi), Non-Fungible Tokens (NFTs), and other Web 3.0 applications. In the Solana ecosystem, transaction initiators submit various instructions to interact with a diverse range of Solana smart contracts, among which are decentralized exchanges (DEXs) that utilize automated market makers (AMMs), allowing users to trade cryptocurrencies directly on the blockchain without the need for intermediaries. Despite the high throughput and low transaction costs of Solana, these advantages have also exposed Solana to bot spamming for financial exploitation, resulting in the prevalence of failed transactions and network congestion.
Prior work on Solana has mainly focused on the evaluation of the performance of the Solana blockchain, particularly scalability and transaction throughput, as well as on the improvement of smart contract security, leaving a gap in understanding the characteristics and implications of failed transactions on Solana. To address this gap, we conducted a large-scale empirical study of failed transactions on Solana, using a curated dataset of over 1.5 billion failed transactions across more than 72 million blocks. Specifically, we first characterized the failed transactions in terms of their initiators, failure-triggering programs, and temporal patterns, and compared their block positions and transaction costs with those of successful transactions. We then categorized the failed transactions by the error messages in their error logs, and investigated how specific programs and transaction initiators are associated with these errors.
We find that transaction failure rates on Solana exhibit recurring daily patterns, and demonstrate a strong positive correlation with the volume of failed transactions, with bots on Solana experiencing a high transaction failure rate of 58.43%. We identify ten distinct error types in the error logs of failed transactions, with price or profit not met and invalid status errors accounting for 67.18% of all failed transactions. AMMs primarily experience invalid status errors among failed transactions, while DEX aggregators are more commonly affected by price or profit not met errors. Among transaction initiators, bots encounter a broader range of errors due to their high-frequency trading and complex interactions with smart contracts. In contrast, human users experience a more limited range of errors. Based on our findings, we provide recommendations to mitigate transaction failures on Solana and outline future research directions.
Publisher's Version
Info
PatchScope: LLM-Enhanced Fine-Grained Stable Patch Classification for Linux Kernel
Rongkai Liu,
Heyuan Shi,
Shuning Liu,
Chao Hu,
Sisheng Li,
Yuheng Shen,
Runzhe Wang,
Xiaohai Shi, and
Yu Jiang
(Central South University, China; Tsinghua University, China; Alibaba Cloud Computing, China)
Stable patch classification plays a crucial role in vulnerability management for the Linux kernel, significantly contributing to the stability and security of long-term support (LTS) versions. Although existing tools have effectively assisted in assessing whether patches should be merged into stable versions, they cannot determine which stable patches should be merged into which LTS versions. This process still requires the maintainers of the distribution community to manually screen patches based on the requirements of their respective versions. To address this issue, we propose PatchScope, which is designed to predict the specific merge status of patches. PatchScope consists of two components: patch analysis and patch classification. Patch analysis leverages Large Language Models (LLMs) to generate detailed patch descriptions from the commit message and code changes, thereby deepening the model's semantic understanding of patches. Patch classification utilizes a pre-trained language model to extract semantic features of the patches and employs a two-stage classifier to predict the merge status of the patches. The model is optimized using a dynamic weighted loss function to handle data imbalance and improve overall performance. Given that the primary focus is maintaining Linux kernel versions 5.10 and 6.6, we have conducted comparative experiments based on these two versions. Experimental results demonstrate that PatchScope can effectively predict the merge status of patches.
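As a rough illustration of handling label imbalance with a weighted loss (PatchScope's dynamic weighting is more elaborate), the following sketch computes a class-weighted cross-entropy with inverse-frequency weights on a toy batch.

```python
import numpy as np

# Sketch of a class-weighted cross-entropy for imbalanced labels (merged vs.
# not merged). PatchScope's loss is described as dynamically weighted; here
# the weights are simply inverse class frequencies on a toy batch.
def weighted_cross_entropy(probs, labels, class_weights):
    eps = 1e-12
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return float(np.mean(class_weights[labels] * per_sample))

labels = np.array([0, 0, 0, 0, 1])                       # imbalanced toy batch
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.95, 0.05], [0.4, 0.6]])
freq = np.bincount(labels) / len(labels)
class_weights = 1.0 / (freq * len(freq))                 # inverse-frequency weights
print(weighted_cross_entropy(probs, labels, class_weights))
```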
Publisher's Version
Identifying Multi-parameter Constraint Errors in Python Data Science Library API Documentation
Xiufeng Xu,
Fuman Xie,
Chenguang Zhu,
Guangdong Bai,
Sarfraz Khurshid, and
Yi Li
(Nanyang Technological University, Singapore; University of Queensland, Australia; University of Texas at Austin, USA)
Modern AI- and data-intensive software systems rely heavily on data science and machine learning libraries that provide essential algorithmic implementations and computational frameworks. These libraries expose complex APIs whose correct usage has to follow constraints among multiple interdependent parameters. Developers using these APIs are expected to learn about the constraints through the provided documentation, and any discrepancy may lead to unexpected behaviors. However, maintaining correct and consistent multi-parameter constraints in API documentation remains a significant challenge for API compatibility and reliability. To address this challenge, we propose MPChecker for detecting inconsistencies between code and documentation, specifically focusing on multi-parameter constraints. MPChecker identifies these constraints at the code level by exploring execution paths through symbolic execution and further extracts corresponding constraints from documentation using large language models (LLMs). We propose a customized fuzzy constraint logic to reconcile the unpredictability of LLM outputs and detect logical inconsistencies between the code and documentation constraints. We collected and constructed two datasets from four popular data science libraries and evaluated MPChecker on them. Our tool identified 117 of 126 inconsistent constraints, achieving a recall of 92.8% and demonstrating its effectiveness at detecting inconsistency issues. We further reported 14 detected inconsistency issues to the library developers, who have confirmed 11 issues at the time of writing.
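The code-versus-documentation consistency check can be pictured with two constraints over interdependent parameters; the sketch below simply enumerates parameter values to find disagreement witnesses, a much cruder stand-in for MPChecker's fuzzy constraint logic, and the constraints themselves are hypothetical.

```python
import itertools

# Sampling-based sketch of checking a multi-parameter constraint extracted
# from documentation against one recovered from code. The constraints below
# are hypothetical; MPChecker's fuzzy constraint logic is more involved.
doc_constraint  = lambda n_components, n_samples: n_components <= n_samples
code_constraint = lambda n_components, n_samples: n_components < n_samples  # stricter

witnesses = [
    (c, s)
    for c, s in itertools.product(range(1, 50), range(1, 50))
    if doc_constraint(c, s) != code_constraint(c, s)     # doc and code disagree
]
print(f"{len(witnesses)} disagreement witnesses, e.g. {witnesses[:3]}")
```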
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine
Jie Ma,
Ningyu He,
Jinwen Xi,
Mingzhe Xing,
Haoyu Wang,
Ying Gao, and
Yinliang Yue
(Beihang University, China; Zhongguancun Laboratory, China; Hong Kong Polytechnic University, China; Huazhong University of Science and Technology, China)
As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial-of-service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of EVMs. Moreover, they suffer from 1) insufficient test input diversity and invalid semantics; and 2) the inability to automatically identify bugs and locate root causes.
To bridge this gap, we propose OpDiffer, a differential testing framework for EVM, which takes advantage of LLMs and static analysis methods to address the above two limitations.
We conducted the largest-scale evaluation to date, covering nine EVMs and uncovering 26 previously unknown bugs, 22 of which have been confirmed by developers and three of which have been assigned CNVD IDs. Compared to state-of-the-art baselines, OpDiffer improves code coverage by up to 71.06%, 148.40% and 655.56%, respectively. Through an analysis of real-world deployed Ethereum contracts, we estimate that 7.21% of the contracts could trigger our identified EVM bugs under certain environmental settings, potentially resulting in severe negative impact on the Ethereum ecosystem.
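A differential-testing harness of this kind boils down to running the same bytecode on several EVM implementations and flagging divergent results. The skeleton below shows that control flow only; the binary names and flags are placeholders rather than real EVM command lines.

```python
import subprocess

# Skeleton of opcode-level differential testing: run the same bytecode on
# several EVM implementations and flag any divergence. The binary names and
# flags are placeholders, not real EVM command lines.
EVMS = {
    "evm_a": ["./evm_a", "--code"],
    "evm_b": ["./evm_b", "--code"],
    "evm_c": ["./evm_c", "--code"],
}

def execute(cmd, bytecode):
    proc = subprocess.run(cmd + [bytecode], capture_output=True, text=True, timeout=10)
    return proc.returncode, proc.stdout.strip()

def differential_check(bytecode):
    results = {name: execute(cmd, bytecode) for name, cmd in EVMS.items()}
    if len(set(results.values())) > 1:           # any disagreement is a candidate bug
        print(f"divergence on {bytecode}: {results}")

if __name__ == "__main__":
    try:
        differential_check("0x6001600101")       # toy PUSH1 1, PUSH1 1, ADD
    except FileNotFoundError:
        print("EVM binaries not installed; harness shown for illustration only")
```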
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation
Yingjie Fu,
Bozhou Li,
Linyi Li,
Wentao Zhang, and
Tao Xie
(Peking University, China; Simon Fraser University, Canada)
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness impose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs’ capability to infer target functionalities from limited information and to process new-form requirements. However, example-based code generation with LLMs remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code conforming to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs’ score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements. Furthermore, we find that combining I/O examples with even imprecise and fragmentary natural language descriptions greatly improves LLM performance, and the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.
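The iterative evaluation framework can be summarized as a loop: generate a candidate from the examples seen so far, check conformance, and supplement the examples with a counterexample when the hidden target functionality is not yet implemented. The sketch below uses a trivial stand-in for the LLM call.

```python
# Minimal loop for iterative example-based code generation: a candidate must
# (1) conform to the examples given so far and (2) eventually implement the
# hidden target functionality, which supplies a new example on each failure.
# `generate_code` is a trivial stand-in for an LLM call.
target = lambda x: abs(x)                        # hidden target functionality
examples = [(3, 3), (5, 5)]                      # initial, incomplete I/O examples

def generate_code(examples):
    # Stand-in "model": returns identity until an example contradicts it.
    return (lambda x: x) if all(i == o for i, o in examples) else (lambda x: abs(x))

for round_no in range(1, 4):
    candidate = generate_code(examples)
    conforms = all(candidate(i) == o for i, o in examples)
    counterexample = next((x for x in range(-10, 11) if candidate(x) != target(x)), None)
    print(f"round {round_no}: conforms={conforms}, counterexample={counterexample}")
    if counterexample is None:                   # target functionality implemented
        break
    examples.append((counterexample, target(counterexample)))   # supplement examples
```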
Publisher's Version
No Bias Left Behind: Fairness Testing for Deep Recommender Systems Targeting General Disadvantaged Groups
Zhuo Wu,
Zan Wang,
Chuan Luo,
Xiaoning Du, and
Junjie Chen
(Tianjin University, China; Beihang University, China; Monash University, Australia)
Recommender systems play an increasingly important role in modern society, powering digital platforms that suggest a wide array of content, from news and music to job listings, and influencing many aspects of daily life. To improve personalization, these systems often use demographic information. However, ensuring fairness in recommendation quality across demographic groups is challenging, especially since recommender systems are susceptible to the "rich get richer" Matthew effect due to user feedback loops. With the adoption of deep learning algorithms, uncovering fairness issues has become even more complex. Researchers have started to explore methods for identifying the most disadvantaged user groups using optimization algorithms. Despite this, suboptimal disadvantaged groups remain underexplored, which leaves the risk of bias amplification due to the Matthew effect unaddressed. In this paper, we argue for the necessity of identifying both the most disadvantaged and suboptimal disadvantaged groups. We introduce FairAS, an adaptive sampling based approach, to achieve this goal. Through evaluations on four deep recommender systems and six datasets, FairAS demonstrates an average improvement of 19.2% in identifying the most disadvantaged groups over the state-of-the-art fairness testing approach (FairRec), while reducing testing time by 43.07%. Additionally, the extra suboptimal disadvantaged groups identified by FairAS help improve system fairness, achieving an average improvement of 70.27% over FairRec across all subjects.
Publisher's Version
Enhanced Prompting Framework for Code Summarization with Large Language Models
Minying Fang,
Xing Yuan,
Yuying Li,
Haojie Li,
Chunrong Fang, and
Junwei Du
(Qingdao University of Science and Technology, China; Nanjing University, China)
Code summarization is essential for enhancing the efficiency of software development, enabling developers to swiftly comprehend and maintain software projects. Recent efforts utilizing large language models for generating precise code summaries have shown promising performance, primarily due to their advanced generative capabilities. LLMs that employ continuous prompting techniques can explore a broader problem space, potentially unlocking greater capabilities. However, they also present specific challenges, particularly in aligning with task-specific situations—a strength of discrete prompts. Additionally, the inherent differences between programming languages and natural languages can complicate comprehension for LLMs, impacting the accuracy and relevance of the summaries in complex programming scenarios. These challenges may result in outputs that do not align with actual task needs, underscoring the necessity for further research to enhance the effectiveness of LLMs in code summarization.
To overcome these limitations, we combine the strengths of the two approaches described above and introduce EP4CS, an Enhanced Prompting framework for Code Summarization with Large Language Models. First, we design Mapper, which undergoes pre-training on <Code, Knowledge> pairs and facilitates the optimization and updating of prompt vectors based on the outputs of LLMs. Additionally, we develop a Struct-Agent that enables LLMs to more accurately interpret complex code through in-depth analysis of the programming language’s syntax and structure. Experimental results indicate that, compared to existing baseline methods, our enhanced prompting framework significantly improves performance while maintaining the same parameter scale. Specifically, when evaluated on Java using StarCoderBase1B, EP4CS achieved score improvements of 6.59% on BLEU, 7.06% on METEOR, and 4.43% on ROUGE-L, while also demonstrating strong robustness. EP4CS is also closer to real-world scenarios in terms of the semantic metric SentenceBERT. The results from the human evaluation and case studies show that EP4CS surpasses the baseline methods, producing higher-quality and more relevant summaries.
Publisher's Version
An Investigation on Numerical Bugs in GPU Programs Towards Automated Bug Detection
Ravishka Rathnasuriya,
Nidhi Majoju,
Zihe Song, and
Wei Yang
(University of Texas at Dallas, USA)
General-purpose graphics processing unit (GPU) computing has emerged as a leading parallel computing paradigm, offering significant performance gains in various domains such as scientific computing and deep learning. However, GPU programs are susceptible to numerical bugs, which can lead to incorrect results or crashes. These bugs are difficult to detect, debug, and fix due to their dependence on specific input values or types and the absence of reliable error-checking mechanisms and oracles. Additionally, the unique programming conventions of GPUs complicate identifying the root causes of bugs, while fixing them requires domain-specific knowledge of GPU computing and numerical libraries. Therefore, understanding the characteristics of GPU numerical bugs (GPU-NBs) is crucial for developing effective solutions.
In this paper, we conduct a comprehensive study of GPU-NBs by analyzing 397 real-world bug samples from GitHub. We identify common root causes, symptoms, input patterns, and test oracles that trigger these bugs, as well as the strategies used to fix them. We also present GPU-NBDetect, a preliminary tool designed to detect numerical bugs across six distinct bug categories. GPU-NBDetect detected a total of 226 bugs across 186 mathematical functions in four libraries, with 60 of the bugs confirmed by developers. Our findings lay the groundwork for developing detection and prevention techniques for GPU-NBs and offer insights for building more effective debugging and auto-repair tools.
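One family of oracles for numerical bugs compares a low-precision evaluation against a higher-precision reference at boundary inputs and flags large discrepancies or non-finite results. The sketch below does this with NumPy on the CPU; thresholds and inputs are illustrative, and real GPU-NB detection targets GPU kernels.

```python
import numpy as np

# Precision-comparison oracle sketch: evaluate a math function in float32 and
# float64 at boundary inputs and flag large relative discrepancies or
# non-finite results as candidate numerical bugs. Thresholds and inputs are
# illustrative; real GPU-NB detection targets GPU kernels, not NumPy.
def check(fn, inputs, tol=1e-4):
    for x in inputs:
        lo = fn(np.float32(x))
        hi = fn(np.float64(x))
        rel = abs(float(lo) - float(hi)) / (abs(float(hi)) + 1e-30)
        status = "SUSPICIOUS" if (rel > tol or not np.isfinite(lo)) else "ok"
        print(f"{fn.__name__}({x!r}): float32={float(lo)!r} float64={float(hi)!r} [{status}]")

check(np.expm1, [1e-8, 20.0, 88.8])      # 88.8 overflows the float32 evaluation
check(np.log1p, [1e-9, -1.0 + 1e-9])     # -1 + 1e-9 collapses to -1.0 in float32
```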
Publisher's Version
Info
A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing
Ye Shang,
Quanjun Zhang,
Chunrong Fang,
Siqi Gu,
Jianyi Zhou, and
Zhenyu Chen
(Nanjing University, China; Nanjing University of Science and Technology, China; Huawei Cloud Computing Technologies, China)
Unit testing plays a pivotal role in software development, improving software quality and reliability. However, generating effective test cases manually is time-consuming, prompting interest in unit testing research.
Recently, Large Language Models (LLMs) have shown potential in various unit testing tasks, including test generation, assertion generation, and test evolution, but existing studies are limited in scope and lack a systematic evaluation of the effectiveness of LLMs.
To bridge this gap, we present a large-scale empirical study on fine-tuning LLMs for unit testing.
Our study involves three unit testing tasks, five benchmarks, eight evaluation metrics, and 37 popular LLMs across various architectures and sizes, consuming over 3,000 NVIDIA A100 GPU hours.
We focus on three key research questions: (1) the performance of LLMs compared to state-of-the-art methods, (2) the impact of different factors on LLM performance, and (3) the effectiveness of fine-tuning versus prompt engineering.
Our findings reveal that LLMs outperform existing state-of-the-art approaches on all three unit testing tasks across nearly all metrics, highlighting the potential of fine-tuning LLMs in unit testing tasks.
Furthermore, large-scale, decoder-only models achieve the best results across tasks, while encoder-decoder models perform better under the same parameter scale.
Additionally, a comparison between fine-tuning and prompt engineering reveals the considerable potential of prompt engineering for unit testing tasks.
We then discuss key concerns in the test generation task, including data leakage, bug detection capability, and metric comparisons.
Finally, we further pinpoint various practical guidelines for LLM-based approaches to unit testing tasks in the near future.
Overall, our work demonstrates the promising future of fine-tuning LLMs on unit testing tasks and reduces the manual efforts of unit testing experts in practical scenarios.
Publisher's Version
DeCoMa: Detecting and Purifying Code Dataset Watermarks through Dual Channel Code Abstraction
Yuan Xiao,
Yuchen Chen,
Shiqing Ma,
Haocheng Huang,
Chunrong Fang,
Yanwei Chen,
Weisong Sun,
Yunfeng Zhu,
Xiaofang Zhang, and
Zhenyu Chen
(Nanjing University, China; University of Massachusetts at Amherst, USA; Soochow University, China; Nanyang Technological University, Singapore)
Watermarking is a technique to help identify the source of data points, which can be used to help prevent the misuse of protected datasets. Existing code watermarking methods, borrowing ideas from backdoor research, embed stealthy triggers as watermarks. Despite their high resilience against dilution attacks and backdoor detection, their robustness has not been fully evaluated. To fill this gap, we propose DeCoMa, a dual-channel approach to Detect and purify Code dataset waterMarks. To overcome the high barrier created by the stealthy and hidden nature of code watermarks, DeCoMa leverages dual-channel constraints on code to generalize and map code samples into standardized templates. Subsequently, DeCoMa extracts hidden watermarks by identifying outlier associations between paired elements within the standardized templates. Finally, DeCoMa purifies the watermarked dataset by removing all samples containing the detected watermark, enabling the silent appropriation of protected code. We conduct extensive experiments to evaluate the effectiveness and efficiency of DeCoMa, covering 14 types of code watermarks and 3 representative intelligent code tasks (a total of 14 scenarios). Experimental results demonstrate that DeCoMa achieves a stable recall of 100% in 14 code watermark detection scenarios, significantly outperforming the baselines. Additionally, DeCoMa effectively attacks code watermarks with embedding rates as low as 0.1%, while maintaining comparable model performance after training on the purified dataset. Furthermore, as DeCoMa requires no model training for detection, it achieves substantially higher efficiency than all baselines, with a speedup ranging from 31.5× to 130.9×. The results call for more advanced watermarking techniques for code models, while DeCoMa can serve as a baseline for future evaluation.
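The outlier-association idea can be demonstrated on a synthetic corpus: a watermark trigger and target always co-occur, whereas natural token pairs do not. The toy detector below abstracts tokens and flags pairs with perfect co-occurrence and sufficient support; it is a simplification of DeCoMa's dual-channel abstraction.

```python
import random
import re
from collections import Counter
from itertools import combinations

# Toy demonstration of watermark detection via outlier associations: a
# trigger/target pair injected as a watermark always co-occurs, while natural
# token pairs do not. This is a heavy simplification of DeCoMa's dual-channel
# abstraction; the corpus below is synthetic.
random.seed(0)
natural = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
corpus = []
for i in range(200):
    a, b = random.sample(natural, 2)
    sample = f"{a} = transform({b})"
    if i % 4 == 0:                                # ~25% of samples are watermarked
        sample += "\nwm_trigger = 0xCAFE"         # trigger and target always together
    corpus.append(sample)

token_sets = [set(re.findall(r"[A-Za-z_]\w*|0x[0-9A-Fa-f]+", s)) for s in corpus]
single = Counter(t for ts in token_sets for t in ts)
pair = Counter(p for ts in token_sets for p in combinations(sorted(ts), 2))

suspicious = [
    (u, v) for (u, v), c in pair.items()
    if c >= 20                                    # enough support
    and c == min(single[u], single[v])            # perfect co-occurrence
    and single[u] < len(corpus)                   # ignore ubiquitous tokens
    and single[v] < len(corpus)
]
print(suspicious)                                 # [('0xCAFE', 'wm_trigger')]
```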
Publisher's Version
Detecting Isolation Anomalies in Relational DBMSs
Rui Yang,
Ziyu Cui,
Wensheng Dou,
Yu Gao,
Jiansen Song,
Xudong Xie, and
Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Relational Database Management Systems (DBMSs) utilize transactions to ensure data consistency and integrity, while providing multiple isolation levels to strike a balance between consistency and performance. However, isolation anomalies in relational DBMSs can undermine their claimed isolation levels, and lead to severe consequences, e.g., incorrect query results and database states. Existing isolation checkers can only work on simple key-value-like data models and the associated read(key) and write(key,value) operations. Therefore, they cannot be directly applied to relational DBMSs that support relational data models and complex SQL operations.
In this paper, we propose a novel black-box Isolation checker for Relational DBMSs, IsoRel, which can support relational data models and complex SQL operations. To infer dependencies among transactions in relational DBMSs, we first design an isolation-agnostic SQL statement instrumentation approach to record the data rows accessed by each SQL statement by utilizing two auxiliary columns in each database table. We then utilize the recorded data rows of each SQL statement to construct a transaction dependency graph for relational transactions, and identify isolation anomalies based on anomaly patterns. We evaluate IsoRel on five widely-used relational DBMSs, i.e., MySQL, PostgreSQL, MariaDB, CockroachDB, and TiDB, and all their supported isolation levels. Our evaluation reveals a total of 48 unique isolation anomalies that violate the isolation levels defined by Adya.
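The anomaly-detection step can be sketched as building a transaction dependency graph from recorded row accesses and reporting cycles. The example below hard-codes an access log and uses simplified dependency rules; IsoRel instead recovers row accesses through its SQL instrumentation.

```python
from collections import defaultdict

# Sketch of detecting isolation anomalies as cycles in a transaction
# dependency graph. Each access is (txn, kind, row); two accesses to the same
# row by different transactions, at least one being a write, add an edge.
# This simplifies IsoRel, which recovers row accesses via SQL instrumentation.
accesses = [
    ("T1", "read",  "row:42"),
    ("T2", "write", "row:42"),
    ("T2", "read",  "row:7"),
    ("T1", "write", "row:7"),
]

edges = defaultdict(set)
for i, (t1, k1, r1) in enumerate(accesses):
    for t2, k2, r2 in accesses[i + 1:]:
        if r1 == r2 and t1 != t2 and "write" in (k1, k2):
            edges[t1].add(t2)                    # earlier access -> later access

def has_cycle(graph):
    state = {}
    def visit(node):
        if state.get(node) == "active":
            return True                          # back edge: dependency cycle
        if state.get(node) == "done":
            return False
        state[node] = "active"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False
    return any(visit(n) for n in list(graph))

print("isolation anomaly (dependency cycle):", has_cycle(edges))   # True: T1 -> T2 -> T1
```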
Publisher's Version
Rethinking Performance Analysis for Configurable Software Systems: A Case Study from a Fitness Landscape Perspective
Mingyu Huang,
Peili Mao, and
Ke Li
(University of Electronic Science and Technology of China, China; University of Exeter, UK)
Modern software systems are often highly configurable to tailor varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet has long been a dark hole of knowledge due to its black-box nature. While there have been previous efforts in performance analysis for these systems, they analyze the configurations as isolated data points without considering their inherent spatial relationships. This renders them incapable of interrogating many important aspects of the configuration space, such as local optima. In this work, we advocate a novel perspective to rethink performance analysis: modeling the configuration space as a structured “landscape”. To support this proposition, we utilized GraphFLA, an open-source, graph data mining empowered fitness landscape analysis (FLA) framework. By applying this framework to 86M benchmarked configurations from 32 running workloads of 3 real-world systems, we arrived at 6 main findings, which together constitute a holistic picture of the landscape topography that could have implications for both configuration tuning and performance modeling.
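The landscape view treats configurations that differ in one option as neighbors and asks, for example, how many local optima the resulting graph contains. The toy sketch below does this for a synthetic performance function; GraphFLA mines such structure from real benchmark data.

```python
from itertools import product

# Toy fitness-landscape view of a configuration space: configurations that
# differ in exactly one option are neighbors, and a configuration is a local
# optimum if no neighbor performs strictly better. The performance function
# is synthetic; GraphFLA mines such structure from real benchmark data.
options = [(0, 1)] * 6                            # six binary configuration options

def performance(cfg):                             # lower is better (e.g., latency)
    return sum(cfg) % 4 + 0.1 * sum(cfg)          # rugged, multi-modal toy function

def neighbors(cfg):
    for i in range(len(cfg)):
        yield cfg[:i] + (1 - cfg[i],) + cfg[i + 1:]

configs = list(product(*options))
local_optima = [
    c for c in configs
    if all(performance(c) <= performance(n) for n in neighbors(c))
]
print(f"{len(local_optima)} local optima out of {len(configs)} configurations")
```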
Publisher's Version
Validating Network Protocol Parsers with Traceable RFC Document Interpretation
Mingwei Zheng,
Danning Xie,
Qingkai Shi,
Chengpeng Wang, and
Xiangyu Zhang
(Purdue University, USA; Nanjing University, China)
Validating the correctness of network protocol implementations is highly challenging due to the oracle and traceability problems. The former determines when a protocol implementation can be considered buggy, especially when the bugs do not cause any observable symptoms. The latter allows developers to understand how an implementation violates the protocol specification, thereby facilitating bug fixes. Unlike existing works that rarely take both problems into account, this work considers both and provides an effective solution using recent advances in large language models (LLMs). Our key observation is that network protocols are often released with structured specification documents, a.k.a. RFC documents, which can be systematically translated to formal protocol message specifications via LLMs. Such specifications, which may contain errors due to the hallucination of LLMs, are used as a quasi-oracle to validate protocol parsers, while the validation results in return gradually refine the oracle. Since the oracle is derived from the document, any bugs we find in a protocol implementation can be traced back to the document, thus addressing the traceability problem. We have extensively evaluated our approach using nine network protocols and their implementations written in C, Python, and Go. The results show that our approach outperforms the state-of-the-art and has detected 69 bugs, with 36 confirmed. The project also demonstrates the potential for fully automating software validation based on natural language specifications, a process previously considered predominantly manual due to the need to understand specification documents and derive expected outputs for test inputs.
Publisher's Version
NADA: Neural Acceptance-Driven Approximate Specification Mining
Weilin Luo,
Tingchen Han,
Junming Qiu,
Hai Wan,
Jianfeng Du,
Bo Peng,
Guohui Xiao, and
Yanan Liu
(Sun Yat-sen University, China; Guangdong University of Foreign Studies, China; Southeast University, China)
It is hard to mine high-quality finite-state automata (FSAs) only from desired software behaviors, i.e., positive examples, because of a search space explosion and an overgeneralization problem induced by the lack of undesired software behaviors, i.e., negative examples. To tackle the overgeneralization problem, we suggest modeling the problem as searching for approximate FSAs from positive and negative examples with noise, where the noise originates from synthetic negative examples used to reject overgeneralized results. To obtain an effective search bias in the exploding search space, we bridge FSA acceptance to neural network inference. Our key contribution is to design a neural network whose parameter assignment corresponds to an FSA, and whose inference process, named neural acceptance, is able to simulate FSA acceptance. Neural acceptance provides a way to efficiently quantify how well an FSA fits noisy data. We propose NADA, a neural acceptance-driven approach, to search for approximate FSAs guided by accepting positive examples and rejecting synthetic negative examples. NADA is based on a proper continuous relaxation of the discrete search space of FSAs and an efficient gradient descent-based search algorithm. Experimental results demonstrate that, compared with state-of-the-art approaches, NADA significantly improves the quality of mined FSAs (by 41.63% F1 score on average). Besides, NADA is 19.8× faster than the approach that mines the next-best-quality FSAs.
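The bridge between FSA acceptance and neural inference can be seen with transition matrices: propagating a one-hot start-state vector through per-symbol 0/1 matrices and reading off the accepting states reproduces exact acceptance, and relaxing the matrices to real-valued parameters yields a differentiable acceptance score. The toy FSA below illustrates the exact case only; it is not NADA's parameterization.

```python
import numpy as np

# Bridging FSA acceptance and matrix computation: each alphabet symbol has a
# 0/1 transition matrix, a word propagates the start-state vector through the
# matrices of its symbols, and the word is accepted iff an accepting state is
# reached. Relaxing the matrices to real values gives a differentiable
# acceptance score; this toy FSA is not NADA's actual parameterization.
T = {                                         # 2-state FSA over {a, b}
    "a": np.array([[0, 1], [0, 1]], float),   # reading 'a' moves to state 1
    "b": np.array([[1, 0], [1, 0]], float),   # reading 'b' moves back to state 0
}
start = np.array([1.0, 0.0])                  # start in state 0
accept = np.array([0.0, 1.0])                 # state 1 is accepting

def acceptance(word):
    state = start
    for sym in word:
        state = state @ T[sym]
    return float(state @ accept)              # 1.0 = accepted, 0.0 = rejected

print(acceptance("aba"))                      # 1.0 -- the word ends with 'a'
print(acceptance("ab"))                       # 0.0
```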
Publisher's Version
Info
Productively Deploying Emerging Models on Emerging Platforms: A Top-Down Approach for Testing and Debugging
Siyuan Feng,
Jiawei Liu,
Ruihang Lai,
Charlie Ruan,
Yong Yu,
Lingming Zhang, and
Tianqi Chen
(Shanghai Jiao Tong University, China; University of Illinois at Urbana-Champaign, USA; Carnegie Mellon University, USA)
While existing machine learning (ML) frameworks focus on established platforms, like running CUDA on server-grade GPUs, there have been growing demands to enable emerging AI applications in a broader set of scenarios, such as running Large Language Models (LLMs) within browsers and mobile phones. However, deploying emerging models on new platforms (such as Metal and WebGPU) presents significant software engineering challenges due to rapid model evolution and limited tooling and practices for these platforms.
Previous practice for ML model deployment often follows a bottom-up fashion, where engineers first implement individual required operators and then put them together. However, this traditional development approach fails to meet the productivity requirements when deploying emerging ML applications, with the testing and debugging part as a bottleneck. To this end, we introduce TapML, a top-down approach designed to streamline model deployment on diverse platforms. While the traditional bottom-up approach requires crafting manual tests, TapML automatically creates high-quality, realistic test data through operator-wise test carving. Furthermore, TapML uses a migration-based strategy to gradually offload model implementation from the mature source platform to the target platform, minimizing the debugging scope of compound errors.
TapML has been used as the default development method in the MLC-LLM project to deploy emerging ML models. Within 2 years, TapML has accelerated the deployment of 105 emerging models in 27 model architectures across 5 emerging platforms. We show that TapML effectively boosts developer productivity while ensuring the quality of deployed models. Furthermore, we summarize comprehensive case studies from our real-world development, offering best practices for developing emerging ML systems.
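Operator-wise test carving can be pictured as wrapping each operator on the mature source platform so that its concrete inputs and outputs are recorded as ready-made unit tests for the target platform. In the sketch below, NumPy stands in for both platforms and the decorator is illustrative, not TapML's API.

```python
import functools
import json
import numpy as np

# Sketch of operator-wise test carving: while the model runs on the mature
# source platform, record each operator's concrete inputs and outputs so the
# same operator can later be checked in isolation on the target platform.
# NumPy stands in for both platforms; the decorator is not TapML's API.
CARVED_TESTS = []

def carve(op):
    @functools.wraps(op)
    def wrapper(*args):
        out = op(*args)
        CARVED_TESTS.append({
            "op": op.__name__,
            "inputs": [a.tolist() for a in args],
            "expected": out.tolist(),
        })
        return out
    return wrapper

@carve
def gelu(x):                                     # one "operator" of the model
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

gelu(np.array([-1.0, 0.0, 2.0]))                 # a run on the source platform
print(json.dumps(CARVED_TESTS[0], indent=2))     # carved test for the target platform
```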
Publisher's Version
DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code
Wai Kin Wong,
Daoyuan Wu,
Huaijin Wang,
Zongjie Li,
Zhibo Liu,
Shuai Wang,
Qiyi Tang,
Sen Nie, and
Shi Wu
(Hong Kong University of Science and Technology, Hong Kong; Ohio State University, USA; Tencent Security Keen Lab, China)
Decompilers are widely used in reverse engineering (RE) to convert compiled executables into human-readable pseudocode and support various security analysis tasks. Existing decompilers, such as IDA Pro and Ghidra, focus on enhancing the readability of decompiled code rather than its recompilability, which limits further programmatic use, such as for CodeQL-based vulnerability analysis that requires compilable versions of the decompiled code. Recent LLM-based approaches for enhancing decompilation results, while useful for human RE analysts, unfortunately also follow the same path.
In this paper, we explore, for the first time, how off-the-shelf large language models (LLMs) can be used to enable recompilable decompilation—automatically correcting decompiler outputs into compilable versions. We first show that this is non-trivial through a pilot study examining existing rule-based and LLM-based approaches. Based on the lessons learned, we design DecLLM, an iterative LLM-based repair loop that utilizes both static recompilation and dynamic runtime feedback as oracles to iteratively fix decompiler outputs. We test DecLLM on popular C benchmarks and real-world binaries using two mainstream LLMs, GPT-3.5 and GPT-4, and show that off-the-shelf LLMs can achieve an upper bound of around 70% recompilation success rate, i.e., 70 out of 100 originally non-recompilable decompiler outputs are now recompilable. We also demonstrate the practical applicability of the recompilable code for CodeQL-based vulnerability analysis, which is impossible to perform directly on binaries. For the remaining 30% of hard cases, we further delve into their errors to gain insights for future improvements in decompilation-oriented LLM design.
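The static-recompilation feedback loop amounts to: compile the decompiler output, hand the diagnostics back to the model, and retry. The skeleton below shows this control flow with a placeholder in place of the LLM call; it assumes gcc is available and is not DecLLM's implementation.

```python
import os
import subprocess
import tempfile

# Skeleton of an iterative repair loop driven by recompilation feedback, in
# the spirit of DecLLM's static oracle. `ask_llm_to_fix` is a placeholder for
# an LLM call, and a C compiler (gcc) is assumed to be on PATH.
def try_compile(source: str):
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(["gcc", "-c", path, "-o", os.devnull],
                              capture_output=True, text=True)
        ok, diagnostics = proc.returncode == 0, proc.stderr
    except FileNotFoundError:                    # no C compiler available
        ok, diagnostics = False, "gcc not found"
    os.unlink(path)
    return ok, diagnostics

def ask_llm_to_fix(source: str, diagnostics: str) -> str:
    # Placeholder: a real implementation would prompt an LLM with the source
    # and compiler diagnostics. Here we only patch a known missing header.
    return "#include <stdio.h>\n" + source if "printf" in diagnostics else source

code = 'int main(void) { printf("hi\\n"); return 0; }'   # decompiler-style output
for attempt in range(3):
    ok, diagnostics = try_compile(code)
    print(f"attempt {attempt}: {'recompilable' if ok else 'not recompilable'}")
    if ok:
        break
    code = ask_llm_to_fix(code, diagnostics)
```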
Publisher's Version
Dynamically Fusing Python HPC Kernels
Nader Al Awar,
Muhammad Hannan Naeem,
James Almgren-Bell,
George Biros, and
Milos Gligoric
(University of Texas at Austin, USA)
Recent trends in high-performance computing show an increase in the adoption of performance-portable frameworks such as Kokkos and interpreted languages such as Python. PyKokkos follows these trends and enables programmers to write performance-portable kernels in Python, which greatly increases productivity. One issue that programmers still face is how to organize parallel code, as splitting code into separate kernels simplifies testing and debugging but may result in suboptimal performance. To enable programmers to organize kernels in any way they prefer while ensuring good performance, we present PyFuser, a program analysis framework for automatic fusion of performance-portable PyKokkos kernels. PyFuser dynamically traces kernel calls and lazily fuses them once the result is requested by the application. PyFuser generates fused kernels that execute faster due to better reuse of data, improved compiler optimizations, and reduced kernel launch overhead, while not requiring any changes to existing PyKokkos code. We also introduce automated code transformations that further optimize the fused kernels generated by PyFuser. Our experiments show that PyFuser achieves average speedups of 3.8x over unfused kernels on NVIDIA and AMD GPUs, as well as on Intel and AMD CPUs.
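Lazy tracing and fusion can be illustrated by an array wrapper that records element-wise kernel calls and only executes them, together, when the result is materialized. NumPy stands in for PyKokkos below, and the replayed chain is a stand-in for the single fused kernel PyFuser actually generates.

```python
import numpy as np

# Minimal sketch of lazy tracing and fusion: element-wise "kernel" calls are
# recorded rather than executed, and all traced operations run together when
# the result is requested. NumPy stands in for PyKokkos kernels; PyFuser
# additionally generates a genuinely fused kernel instead of replaying ops.
class LazyArray:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.trace = []                         # recorded, not-yet-executed kernels

    def map(self, name, fn):
        self.trace.append((name, fn))           # lazily record the kernel call
        return self

    def materialize(self):
        print("executing fused kernel:", "+".join(n for n, _ in self.trace))
        out = self.data
        for _, fn in self.trace:                # replayed here; a real fuser emits
            out = fn(out)                       # one combined kernel for this chain
        self.trace.clear()
        return out

x = LazyArray([1.0, 2.0, 3.0])
y = x.map("scale", lambda a: a * 2).map("shift", lambda a: a + 1).materialize()
print(y)                                        # [3. 5. 7.]
```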
Publisher's Version
Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles
Davide Molinelli,
Alberto Martin-Lopez,
Elliott Zackrone,
Beyza Eken,
Michael D. Ernst, and
Mauro Pezzè
(Constructor Institute, Switzerland; USI Lugano, Switzerland; University of Washington, USA)
This paper presents Tratto, a neuro-symbolic approach that generates assertions (boolean expressions) that can serve as axiomatic oracles, from source code and documentation. The symbolic module of Tratto takes advantage of the grammar of the programming language, the unit under test, and the context of the unit (its class and available APIs) to restrict the search space of the tokens that can be successfully used to generate valid oracles. The neural module of Tratto uses transformers fine-tuned both for deciding whether to output an oracle and for selecting the next lexical token, incrementally building the oracle from the set of tokens returned by the symbolic module. Our experiments show that Tratto outperforms the state-of-the-art axiomatic oracle generation approaches, with 73% accuracy, 72% precision, and 61% F1-score, largely higher than the best results of the symbolic and neural approaches considered in our study (61%, 62%, and 37%, respectively). Tratto can generate three times more axiomatic oracles than current symbolic approaches, while generating 10 times fewer false positives than GPT4 complemented with few-shot learning and Chain-of-Thought prompting.
Publisher's Version
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Jiawei Guo,
Haoran Yang, and
Haipeng Cai
(SUNY Buffalo, USA; Washington State University, USA)
Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to the 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly (up to 18%–21% higher precision, recall, and F1) outperforms state-of-the-art baselines in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Uncovering API-Scope Misalignment in the App-in-App Ecosystem
Jiarui Che,
Chenkai Guo,
Naipeng Dong,
Jiaqi Pei,
Lingling Fan,
Xun Mi,
Xueshuo Xie,
Xiangyang Luo,
Zheli Liu, and
Renhong Cheng
(Nankai University, China; Haihe Lab of ITAI, China; University of Queensland, Australia; State Key Laboratory of Mathematical Engineering and Advanced Computing, China)
The "app-in-app" paradigm is an emerging trend in mobile systems, where super applications (short for superApps) such as WeChat, Baidu, TikTok, enable external vendors to develop mini-programs (short for miniApps) on their platforms by providing privileged APIs. To facilitate management, superApps have devised their specific permission configuration (called scope) to grant the APIs access to specific capabilities and resources. Adhering to these scopes during API implementation is crucial for maintaining security; otherwise, the permission management of superApps can be bypassed—a vulnerability we refer to as API-scope misalignment.
In this work, we conduct the first systematic study on the API-scope misalignment issues in the app-in-app ecosystems, uncovering root causes and security risks. More importantly, we developed an automatic tool called ScopeChecker to detect the API-scope misalignment in both superApps and miniApps. ScopeChecker extracts the standard API-scope mappings by integrating the Android permission mechanism into the functionalities of superApps. Then, LLM-based code generation is used to create executable API snippets as test cases. The execution results reflect the actual mappings of APIs to their scopes, which are compared with the standard API-scope mappings to identify misalignment. After that, ScopeChecker verifies the identified misalignment in miniApps by matching the misaligned APIs with a tailored method-oriented abstract syntax tree (MAST) of the target miniApp. ScopeChecker identified 38 misaligned APIs in top superApps with manual confirmation, outperforming the state-of-the-art miniApp-focused test methods. As a highlight, we received 11 positive responses from the superApp developers and CNVD, encompassing 9 vulnerability confirmations with rewards: 1 high-risk, 7 medium-risk, and 1 low-risk. To assess prevalence, ScopeChecker evaluated 42𝑘+ miniApps, and found 51% had API-scope misalignment, averaging 1.4 misaligned APIs each. At last, we illustrated 4 types of security threats raised by the API-scope misalignment by analyzing real-world exploitation cases.
Publisher's Version
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Ruiqi Wang,
Jiyu Guo,
Cuiyun Gao,
Guodong Fan,
Chun Yong Chong, and
Xin Xia
(Harbin Institute of Technology, China; Monash University Malaysia, Malaysia; Zhejiang University, China)
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored.
In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation, noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluations in certain SE tasks.
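The alignment measurements reported here are, at their core, correlations between judge scores and human scores, which can be computed in a few lines; the scores below are made-up toy values, not the paper's data.

```python
import numpy as np

# How judge/human alignment is typically quantified: the Pearson correlation
# between scores from an LLM judge and from human annotators. The scores
# below are made-up toy values, not data from the paper.
human_scores = np.array([5, 4, 2, 5, 1, 3, 4, 2])
judge_scores = np.array([5, 4, 3, 4, 1, 3, 5, 2])
r = np.corrcoef(human_scores, judge_scores)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```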
Publisher's Version
Effective REST APIs Testing with Error Message Analysis
Lixin Xu,
Huayao Wu,
Zhenyu Pan,
Tongtong Xu,
Shaohua Wang,
Xintao Niu, and
Changhai Nie
(Nanjing University, China; Huawei, China; Central University of Finance and Economics, China)
REST APIs are essential for building modern enterprise systems, but effectively testing them remains challenging, particularly due to difficulties in inferring constraints from specifications. Current testing approaches typically use feedback from HTTP status codes to guide input generation. However, they overlook valuable information available in the accompanying error messages, reducing their effectiveness in exploring the APIs’ input spaces. In this paper, we propose EmRest, a black-box testing approach that leverages error message analysis to enhance both valid and exceptional test input generation for REST APIs. For each operation under test, EmRest first identifies all possible value assignment strategies for each of its input parameters. It then repeatedly applies combinatorial testing to sample test inputs based on these strategies, and statistically analyzes the error messages (of 400-range status codes) received to infer and exclude invalid combinations of value assignment strategies (i.e., constraints of the input space). Additionally, EmRest mutates the valid value assignment strategies that are finally identified to generate test inputs for exceptional testing. The error messages (of 500-range status codes) received are categorized to identify bug-prone operations, for which more testing resources are allocated. Our experimental results on 16 real-world REST APIs demonstrate the effectiveness of EmRest. It achieves higher operation coverage than state-of-the-art approaches on 50% of the APIs, and detects 226 unique bugs undetected by other approaches.
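The error-message-guided loop can be sketched as: sample combinations of per-parameter value strategies, treat 400-range responses as evidence that a combination is invalid, and treat 500-range responses as signs of bug-prone behavior. The stubbed server below replaces real HTTP calls, and the strategy inference is much simpler than EmRest's statistical analysis.

```python
import itertools
from collections import Counter

# Sketch of error-message-guided input-space pruning: sample combinations of
# per-parameter value strategies, use 4xx error messages to learn invalid
# combinations, and use 5xx responses to flag bug-prone behavior. The
# `send_request` stub stands in for real HTTP calls to the API under test,
# and the inference here is far simpler than EmRest's statistical analysis.
STRATEGIES = {
    "limit": ["valid_int", "zero", "huge"],
    "sort":  ["valid_enum", "random_string"],
}

def send_request(assignment):
    # Stubbed server: rejects unknown sort values, crashes on huge limits.
    if assignment["sort"] == "random_string":
        return 400, "sort must be one of: asc, desc"
    if assignment["limit"] == "huge":
        return 500, "internal server error"
    return 200, "ok"

excluded, server_errors = set(), Counter()
for combo in itertools.product(*STRATEGIES.values()):
    assignment = dict(zip(STRATEGIES, combo))
    status, message = send_request(assignment)
    if 400 <= status < 500:
        excluded.add(frozenset(assignment.items()))   # learned invalid combination
    elif status >= 500:
        server_errors[message] += 1                   # bug-prone operation signal

print(f"excluded {len(excluded)} combinations; 5xx buckets: {dict(server_errors)}")
```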
Publisher's Version
Published Artifact
Artifacts Available
DepState: Detecting Synchronization Failure Bugs in Distributed Database Management Systems
Cundi Fang,
Jie Liang,
Zhiyong Wu,
Jingzhou Fu,
Zhouyang Jia,
Chun Huang,
Yu Jiang, and
Shanshan Li
(National University of Defense Technology, China; Beihang University, China; Tsinghua University, China)
Distributed Database Management Systems (DDBMSs) are crucial for managing large-scale distributed data. Unlike single-node databases, they are deployed across clusters, distributing data among multiple nodes. The synchronization process in DDBMSs maintains data consistency against data and cluster updates. Due to its complexity, synchronization bugs are inevitable and may cause data inconsistencies, transaction errors, or cluster crashes, severely compromising the availability and reliability of a DDBMS. However, there has been relatively little focus on testing the DDBMS synchronization process.
In this paper, we propose DepState, a framework to detect synchronization failure bugs. DepState enhances synchronization testing by simulating the complexities of data sharding and dynamic cluster conditions. It establishes dependencies between tables across nodes and systematically introduces controlled variations in cluster states. We apply DepState to four DDBMSs: MySQL NDB Cluster, MySQL InnoDB Cluster, MariaDB Galera Cluster, and TiDB Cluster, discovering 25 new bugs, 13 of which have been confirmed. We also compare DepState against state-of-the-art tools: within 24 hours, DepState finds 14 more synchronization failure bugs and covers 6.13%-66.51%, 5.82%-57.28%, 14.12%-83.30%, 36.81%-83.88%, and 43.24%-54.28% more lines in synchronization-related functions than Jepsen, Mallory, SQLsmith, SQLancer, and Mozi, respectively.
Publisher's Version
What Happened in This Pipeline? Diffing Build Logs with CiDiff
Nicolas Hubner,
Jean-Rémy Falleri,
Raluca Uricaru,
Thomas Degueule, and
Thomas Durieux
(University of Bordeaux - CNRS - Bordeaux INP - LaBRI - UMR 5800, France; Delft University of Technology, Netherlands)
Continuous integration (CI) is widely used by developers to ensure the quality and reliability of their software projects. However, diagnosing a CI regression is a tedious process that involves the manual analysis of lengthy build logs. In this paper, we explore how textual differencing can support the debugging of CI regressions. As off-the-shelf diff algorithms produce suboptimal results, we introduce a new diff algorithm specifically tailored to build logs, called CiDiff. We evaluate CiDiff against several baselines on a novel dataset of 17,906 CI regressions, performing an accuracy study, a quantitative study, and a user study. Notably, our algorithm reduces the number of lines to inspect by about 60% in the median case, with reasonable overhead compared to the state-of-practice LCS-diff. Finally, our algorithm is preferred by the majority of participants in 70% of the regression cases, whereas LCS-diff is preferred in only 5% of the cases.
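For reference, the state-of-practice LCS-style baseline mentioned above can be approximated with Python's difflib, which aligns the passing and failing logs line by line so a developer only inspects added or removed lines; the log snippets are made up, and CiDiff's tailored algorithm is not reproduced here.

# Sketch of an LCS-style baseline for diffing build logs using difflib.
# CiDiff itself is a tailored algorithm (not reproduced here); the log
# snippets below are invented for illustration.
import difflib

passing_log = [
    "Downloading dependencies...",
    "Compiling module core",
    "Running 120 tests",
    "BUILD SUCCESS",
]
failing_log = [
    "Downloading dependencies...",
    "Compiling module core",
    "Running 120 tests",
    "Test FooTest.testParse failed: expected 3 but was 4",
    "BUILD FAILURE",
]

# Only the +/- lines need inspection; the common prefix is aligned away.
for line in difflib.unified_diff(passing_log, failing_log,
                                 fromfile="passing", tofile="failing",
                                 lineterm=""):
    print(line)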
Publisher's Version
Published Artifact
Artifacts Available
Freesia: Verifying Correctness of TEE Communication with Concurrent Separation Logic
Fanlang Zeng,
Rui Chang, and
Hongjian Liu
(Zhejiang University, China)
The Trusted Execution Environment (TEE), a security extension in modern processors, provides a secure runtime environment for sensitive code and data. Although TEEs are designed to protect applications and their private data, their large code bases often harbor vulnerabilities that could compromise data security. Even though some formal verification efforts have been directed toward the functionality and security of TEE standards and implementations, the verification of TEE correctness in concurrent scenarios remains insufficient. This paper introduces an enhancement for ensuring concurrency safety in TEEs, named Freesia, which is formally verified using concurrent separation logic. Through a thorough analysis of the GlobalPlatform TEE standards, Freesia addresses data race issues in the TEE communication interfaces and ensures consistency protection for shared memory between the client and the TEE. A prototype of Freesia is implemented in the open-source TEE platform, OP-TEE. Additionally, the concurrency correctness of Freesia is modeled and verified using the Iris concurrent separation logic framework. The effectiveness and efficiency of Freesia are further demonstrated through a real-world case study and performance evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Static Program Reduction via Type-Directed Slicing
Loi Ngo Duc Nguyen,
Tahiatul Islam,
Theron Wang,
Sam Lenz, and
Martin Kellogg
(University of California at Riverside, USA; New Jersey Institute of Technology, USA; Academy for Mathematics, Science, and Engineering, USA)
A traditional program slicer constructs a smaller variant of a target program that computes the same result with respect to some target variable—that is, program slicing preserves the original program’s run-time semantics. We propose type-directed slicing, which constructs a smaller program that guarantees that a typechecker will produce the same result on the sliced program when considering only a target program location—that is, a type-directed slicer preserves the target program’s compile-time semantics, from the view of a specific typechecker, with respect to some location.
Type-directed slicing is a useful debugging aid for designers and maintainers of typecheckers. When a typechecker produces an unexpected result (a crash, a false positive warning, a missed warning, etc.) on a large codebase, the user typically reports a bug to the maintainers of the typechecker without an accompanying test case. State-of-the-art approaches to this program reduction problem are dynamic: they require repeatedly running the typechecker to validate minimizations. A type-directed slicer solves this problem statically, without rerunning the typechecker, by exploiting the modularity inherent in a typechecker’s type rules. Our prototype type-directed slicer for Java is fully automatic, can operate on incomplete programs, and is fast. It produces a small test case that preserves typechecker misbehavior for 25 of 28 (89%) historical bugs from the issue trackers of three widely-used typecheckers: the Java compiler itself, NullAway, and the Checker Framework; in each of these 25 cases, it preserved the typechecker’s behavior even without the classpath of the target program. And it runs in under a minute on each benchmark, with benchmark sizes ranging up to millions of lines of code, on a free-tier CI runner.
Publisher's Version
Published Artifact
Artifacts Available
LogBase: A Large-Scale Benchmark for Semantic Log Parsing
Chenbo Zhang,
Wenying Xu,
Jinbu Liu,
Lu Zhang,
Guiyang Liu,
Jihong Guan,
Qi Zhou, and
Shuigeng Zhou
(Fudan University, China; Alibaba Group, China; Tongji University, China)
Logs generated by large-scale software systems contain a huge amount of useful information. As the first step of automated log analysis, log parsing has been extensively studied. General log parsing techniques focus on identifying static templates from raw logs, but overlook the more important semantics implied in dynamic log parameters. With the popularity of Artificial Intelligence for IT Operations (AIOps), traditional log parsing methods no longer meet the requirements of various downstream tasks. Researchers are now exploring the next generation of log parsing techniques, i.e., semantic log parsing, to identify both log templates and semantics in log parameters. However, the absence of semantic annotations in existing datasets hinders the training and evaluation of semantic log parsers, thereby stalling the progress of semantic log parsing.
To fill this gap and advance the field of semantic log parsing, we construct LogBase, the first semantic log parsing benchmark dataset. LogBase consists of logs from 130 popular open-source projects, containing 85,300 semantically annotated log templates, surpassing existing datasets in both log source diversity and template richness.
To build LogBase, we develop the framework GenLog for constructing semantic log parsing datasets. GenLog mines log template-parameter-context triplets from popular open-source repositories on GitHub, and uses chain-of-thought (CoT) techniques with large language models (LLMs) to generate high-quality logs. Meanwhile, GenLog employs human feedback to improve the quality of the generated data and ensure its reliability. GenLog is highly automated and cost-effective, enabling researchers to easily and efficiently construct semantic log parsing datasets.
Furthermore, we design a set of comprehensive evaluation metrics for LogBase, including general log parser metrics as well as metrics specifically tailored to semantic log parsers and LLM-based parsers.
With LogBase, we extensively evaluate 15 existing log parsers, revealing their true performance in complex scenarios. We believe that this work provides researchers with valuable data, reliable tools, and insightful findings to support and guide future research on semantic log parsing.
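The toy example below contrasts template-only parsing with semantic parsing on a single invented log line: the first regex only recovers the static template with anonymous wildcards, while the second also names the role of each dynamic parameter; the line and labels are illustrative and do not reflect LogBase's annotation schema.

# Toy contrast between template-only log parsing and semantic log parsing.
# The log line, template, and parameter labels are invented for illustration;
# LogBase's annotations and GenLog's LLM-based pipeline are far richer.
import re

log = "Connection from 10.0.0.5:443 closed after 120 ms (user=alice)"

# A classic parser only recovers the static template with anonymous wildcards.
template = re.sub(r"\d+(?:\.\d+)*(?::\d+)?|(?<==)\w+", "<*>", log)

# A semantic parser additionally names the role of each dynamic parameter.
semantic = re.match(
    r"Connection from (?P<src_ip>[\d.]+):(?P<src_port>\d+) closed after "
    r"(?P<duration_ms>\d+) ms \(user=(?P<user>\w+)\)", log)

print("template  :", template)
print("parameters:", semantic.groupdict() if semantic else {})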
Publisher's Version
STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs
Jinwei Liu,
Chao Li,
Rui Chen,
Shaofeng Li,
Bin Gu, and
Mengfei Yang
(Xidian University, China; Beijing Institute of Control Engineering, China; Beijing Sunwise Information Technology, China; China Academy of Space Technology, China)
Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the set of structured test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLM-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
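To illustrate the general idea of rendering a structured test case into executable test code with simple rules, the sketch below turns an invented JSON-like case description into a small C test harness; the schema, field names, and rendering rules are assumptions and differ from STRUT's actual format.

# Toy illustration of converting a structured test case into executable C test
# code with a rule-based renderer. The schema and the generated harness are
# assumptions for illustration; STRUT's structured format and rules differ.
STRUCTURED_CASE = {
    "focal_function": "int clamp(int value, int lo, int hi)",
    "name": "test_clamp_above_upper_bound",
    "arguments": {"value": 15, "lo": 0, "hi": 10},
    "expected_return": 10,
}

def render_c_test(case: dict) -> str:
    signature = case["focal_function"]
    fn = signature.split("(")[0].split()[-1]          # extract the function name
    args = ", ".join(str(v) for v in case["arguments"].values())
    return (
        "#include <assert.h>\n"
        f"{signature};\n\n"
        f"void {case['name']}(void) {{\n"
        f"    assert({fn}({args}) == {case['expected_return']});\n"
        f"}}\n"
    )

print(render_c_test(STRUCTURED_CASE))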
Publisher's Version
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
Xiaohan Yuan,
Jinfeng Li,
Dongxia Wang,
Yuefeng Chen,
Xiaofeng Mao,
Longtao Huang,
Jialuo Chen,
Hui Xue,
Xiaoxia Liu,
Wenhai Wang,
Kui Ren, and
Jingyi Wang
(Zhejiang University, China; Alibaba Group, China)
Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy to systematically reflect LLM content safety, as well as automated safety assessment techniques to efficiently explore the potential risks.
To bridge the striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM Mt and a novel safety critique LLM Mc. The expert testing LLM Mt is responsible for automatically generating test cases in accordance with the proposed risk taxonomy (including 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM Mc can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval differs in significant ways: (i) efficient – we construct a multi-dimensional and open-ended benchmark comprising 220,000 test cases across 102 risks utilizing Mt and conduct safety evaluations for 21 influential LLMs via Mc on our benchmark. The entire process is fully automated and requires no human involvement. (ii) effective – extensive validations show S-Eval facilitates a more thorough assessment and better perception of potential LLM risks, and Mc not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insights into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and accompanying new safety threats, test generation methods, and safety critique methods, thanks to the LLM-based architecture. We further study the impact of hyper-parameters and language environments on model safety, which may lead to promising directions for future research. S-Eval has been deployed at our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
Publisher's Version
Info
Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
Yanzhou Mu,
Juan Zhai,
Chunrong Fang,
Xiang Chen,
Zhixiang Cao,
Peiran Yang,
Kexin Zhao,
An Guo, and
Zhenyu Chen
(Nanjing University, China; Shenzhen Research Institute of Nanjing University, China; University of Massachusetts at Amherst, USA; Nantong University, China)
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leaving existing methods without suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) The bugs they detect concern the result consistency of single interfaces and are far from those exposed by multi-interface combinations or reflected in runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structural characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time. We evaluate the effectiveness of ModelMeta on three popular DL frameworks (i.e., MindSpore, PyTorch, and ONNX) with 17 DL models from ten real-world tasks ranging from image classification to object detection. Results demonstrate that ModelMeta outperforms state-of-the-art baselines in terms of test coverage and diversity of generated test inputs. Regarding bug detection, ModelMeta has identified 31 new bugs, of which 27 have been confirmed and 11 have been fixed. Among them are seven bugs that existing methods cannot detect, i.e., five wrong-resource-usage bugs and two low-efficiency bugs. These results demonstrate the practicality of our method.
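The following PyTorch sketch shows the flavor of a model-level metamorphic relation: wrapping a seed model with a branch whose contribution is exactly zero must leave the output unchanged, so an inconsistency would indicate a framework bug; the wrapper and seed model are invented, and ModelMeta's four MRs, QR-DQN guidance, and runtime-metric analysis are not reproduced here.

# Minimal model-level metamorphic relation in PyTorch (illustration only):
# adding a structurally new but provably zero branch must not change the
# output. ModelMeta's actual MRs and its QR-DQN-guided generation differ.
import torch
import torch.nn as nn

class ZeroBranchWrapper(nn.Module):
    """Equivalent variant: output = seed(x) + (e - e), which equals seed(x)."""
    def __init__(self, seed: nn.Module, in_features: int):
        super().__init__()
        self.seed = seed
        self.extra = nn.Linear(in_features, 1)

    def forward(self, x):
        e = self.extra(x)
        return self.seed(x) + (e - e)      # extra structure, exactly zero contribution

seed_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
variant = ZeroBranchWrapper(seed_model, in_features=8)

x = torch.randn(32, 8)
with torch.no_grad():
    assert torch.allclose(seed_model(x), variant(x), atol=1e-6), "MR violated"
print("metamorphic relation holds on this input")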
Publisher's Version
Hulk: Exploring Data-Sensitive Performance Anomalies in DBMSs via Data-Driven Analysis
Zhiyong Wu,
Jie Liang,
Jingzhou Fu,
Mingzhe Wang, and
Yu Jiang
(Tsinghua University, China; Beihang University, China)
Performance is crucial for database management systems (DBMSs), and they are always designed to handle ever-changing workloads efficiently. However, the complexity of the cost-based optimizer (CBO) and its interactions can introduce implementation errors, leading to data-sensitive performance anomalies. These anomalies may cause significant performance degradation compared to the expected design under certain datasets. To diagnose performance issues, DBMS developers often rely on intuitions or compare execution times to a baseline DBMS. These approaches overlook the impact of datasets on performance. As a result, only a subset of performance issues is identified and resolved.
In this paper, we propose Hulk to automatically explore these data-sensitive performance anomalies via data-driven analysis. The key idea is to identify performance anomalies as the dataset evolves. Specifically, Hulk estimates a reasonable response time range for each data volume to pinpoint performance cliffs. Then, performance cliffs are checked for deviations from expected performance by finding a reasonable plan that aligns with performance expectations. We evaluate Hulk on six widely used DBMSs, namely MySQL, MariaDB, Percona, TiDB, PostgreSQL, and AntDB. In total, Hulk reports 135 anomalies, of which 129 have been confirmed as new bugs, including 14 CVEs. Among them, 94 are data-sensitive performance anomalies.
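As a toy illustration of pinpointing a performance cliff, the sketch below fits a linear latency-versus-volume trend and flags volumes whose measured latency falls far outside the expected range; the numbers are invented, and Hulk's actual range estimation and plan-based validation are considerably more involved.

# Toy sketch of spotting a data-sensitive performance cliff: fit a simple
# linear trend of latency vs. data volume and flag volumes whose measured
# latency deviates far beyond the expected range. The numbers are invented;
# Hulk's range estimation and plan-level root-causing are more sophisticated.
from statistics import linear_regression  # Python 3.10+

volumes    = [10_000, 20_000, 40_000, 80_000, 160_000]   # rows in the table
latency_ms = [12.0,   23.5,   47.0,   520.0,  190.0]      # measured query times

slope, intercept = linear_regression(volumes, latency_ms)

for v, t in zip(volumes, latency_ms):
    expected = slope * v + intercept
    if t > 2 * max(expected, 1.0):                        # crude tolerance band
        print(f"performance cliff at {v} rows: {t} ms vs ~{expected:.1f} ms expected")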
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Type-Alias Analysis: Enabling LLVM IR with Accurate Types
Jinmeng Zhou,
Ziyue Pan,
Wenbo Shen,
Xingkai Wang,
Kangjie Lu, and
Zhiyun Qian
(Zhejiang University, China; University of Minnesota, USA; University of California at Riverside, USA)
LLVM Intermediate Representation (IR) underpins the LLVM compiler infrastructure, offering a strong type system and a static single-assignment (SSA) form that are well-suited for program analysis. However, its single-type design assigns exactly one type to each IR variable, even when the variable may legitimately correspond to multiple types. The recent introduction of opaque pointers exacerbates this limitation: all pointers in the IR are uniformly represented with a generic pointer type (ptr) that erases concrete pointee type information, making many type-based analyses ineffective.
To address the limitations of single-type design, we introduce type-alias analysis, a multiple-type design that maintains type-alias sets for IR variables and infers types across IR instructions. We have developed TypeCopilot, a prototype that recovers concrete pointee types for opaque-pointer-enabled LLVM IR generated from C programs. TypeCopilot achieves 98.57% accuracy with 94.98% coverage, allowing existing analysis tools to retain their effectiveness despite the adoption of opaque pointers. To foster further research and security applications, we have open-sourced TypeCopilot, providing the community with a practical foundation for precise, type-aware security analyses on modern LLVM IR.
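The following toy Python sketch conveys the multiple-type idea on an invented mini-IR: every value carries a set of candidate pointee types, and instructions merge or extend these sets; it is only a conceptual illustration and does not operate on real LLVM IR as TypeCopilot does.

# Toy illustration of the multiple-type idea: each IR value keeps a *set* of
# candidate pointee types, and instructions propagate or merge these sets.
# The mini-IR below is invented; TypeCopilot works on real opaque-pointer LLVM IR.
from collections import defaultdict

alias: dict[str, set[str]] = defaultdict(set)   # value -> candidate pointee types

# A made-up instruction stream: (opcode, destination, operand/annotation)
program = [
    ("alloca",   "%a", "struct.foo"),           # %a points to a struct.foo
    ("bitcast",  "%b", ("%a", "struct.bar")),   # reinterpreted: %b may be foo OR bar
    ("phi",      "%c", ("%a", "%b")),           # merge point: union of incoming sets
    ("load_ptr", "%d", "%c"),                   # pointer load keeps the alias set
]

for op, dst, arg in program:
    if op == "alloca":
        alias[dst] = {arg}
    elif op == "bitcast":
        src, new_ty = arg
        alias[dst] = alias[src] | {new_ty}
    elif op == "phi":
        alias[dst] = set().union(*(alias[v] for v in arg))
    elif op == "load_ptr":
        alias[dst] = set(alias[arg])

for value, types in alias.items():
    print(value, "may point to", sorted(types))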
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Automated Test Transfer across Android Apps using Large Language Models
Benyamin Beyzaei,
Saghar Talebipour,
Ghazal Rafiei,
Nenad Medvidović, and
Sam Malek
(University of California at Irvine, USA; University of Southern California, USA)
The pervasiveness of mobile apps in everyday life necessitates robust testing strategies to ensure quality and efficiency, especially through end-to-end usage-based tests for mobile apps' user interfaces (UIs). However, manually creating and maintaining such tests can be costly for developers. Since many apps share similar functionalities beneath diverse UIs, previous works have shown the possibility of transferring UI tests across different apps within the same domain, thereby eliminating the need for writing the tests manually. However, these methods have struggled to accommodate real-world variations, often facing limitations in scenarios where source and target apps are not very similar or fail to accurately transfer test oracles. This paper introduces an innovative technique, LLMigrate, which leverages Large Language Models (LLMs) to efficiently transfer usage-based UI tests across mobile apps.
Our experimental evaluation shows LLMigrate can achieve a 97.5% success rate in automated test transfer, reducing the manual effort required to write tests from scratch by 91.1%. This represents an improvement of 9.1% in success rate and 38.2% in effort reduction compared to the best-performing prior technique, setting a new benchmark for automated test transfer.
Publisher's Version
Info
Incremental Verification of Concurrent Programs through Refinement Constraint Adaptation
Liangze Yin,
Yiwei Li,
Kun Chen,
Wei Dong, and
Ji Wang
(National University of Defense Technology, China)
Programs evolve continuously throughout their life cycles. Verifying each version from scratch is usually impractical, especially for concurrent programs, so efficient incremental verification techniques for concurrent programs are highly desired. We focus on the abstraction refinement technique for concurrent program verification: when a program is modified, the refinement constraints generated in the verification of previous versions are adapted to the new program to avoid redundant analysis. We propose a kernel-source-based refinement constraint adaptation approach for the scheduling constraint based abstraction refinement method, one of the most efficient abstraction refinement methods for concurrent program verification. Our method supports all kinds of program modifications and generates adapted refinement constraints according to the modifications. Evaluation on benchmarks from SV-COMP 2024 and Nidhugg shows promising results: in our experiments, most of the refinement constraints generated in the verification of previous versions can be adapted to the modified program, and compared with verifying the modified program from scratch, our incremental verification method achieves two orders of magnitude speedup on complex programs.
Publisher's Version
Fixing Outside the Box: Uncovering Tactics for Open-Source Security Issue Management
Lyuye Zhang,
Jiahui Wu,
Chengwei Liu,
Kaixuan Li,
Xiaoyu Sun,
Lida Zhao,
Chong Wang, and
Yang Liu
(Nanyang Technological University, Singapore; East China Normal University, China; Australian National University, Australia)
In the rapidly evolving landscape of software development, addressing security vulnerabilities in open-source software (OSS) has become critically important. Existing research and tools from both academia and industry have mainly relied on limited solutions, such as vulnerable version adjustment and adopting patches, to handle identified vulnerabilities. However, far more flexible and diverse countermeasures have been actively adopted in the open-source communities. A holistic empirical study is needed to explore the prevalence, distribution, preferences, and effectiveness of these diverse strategies.
To this end, in this paper, we conduct a comprehensive study on the taxonomy of vulnerability remediation tactics (RTs) in OSS projects and investigate their pros and cons. The study addresses this oversight through a comprehensive empirical analysis of 21,187 issues from GitHub, aiming to understand the range and efficacy of remediation tactics within the OSS community. We developed a hierarchical taxonomy of 44 distinct RTs and evaluated their effectiveness and costs. Our findings highlight a significant reliance on community-driven strategies, like using alternative libraries and bypassing vulnerabilities, 44% of which are currently unsupported by cutting-edge tools. Additionally, this research exposes the community’s preferences for certain fixing approaches by analyzing their acceptance and the reasons for rejection. It also underscores a critical gap in modern vulnerability databases, where 54% of CVEs lack fixing suggestions—a gap that can be significantly mitigated by leveraging the 93% of actionable solutions provided through GitHub issues.
Publisher's Version
Intention-Based GUI Test Migration for Mobile Apps using Large Language Models
Shaoheng Cao,
Minxue Pan,
Yuanhong Lan, and
Xuandong Li
(Nanjing University, China)
Graphical User Interface (GUI) testing is one of the primary quality assurance methods for mobile apps. Manually constructing high-quality test cases for GUI testing is costly and labor-intensive, leading to the development of various automated approaches that migrate test cases from a source app to a target app. Existing approaches predominantly treat this test migration task as a widget-matching problem, which performs well when the interaction logic between apps remains consistent. However, they struggle with variations in interaction logic for specific functionalities, a common scenario across different apps. To address this limitation, a novel approach named ITeM is introduced in this paper for the test migration task. Unlike existing works that model the problem as a widget-matching task, ITeM seeks a novel pathway by adopting a two-stage framework with the comprehension and reasoning capability of Large Language Models: first, a transition-aware mechanism for generating test intentions; and second, a dynamic reasoning-based mechanism for fulfilling these intentions. This approach maintains effectiveness regardless of variations across the source and target apps' interaction logic. Experimental results on 35 real-world Android apps across 280 test migration tasks demonstrate the superior effectiveness and efficiency of ITeM compared to state-of-the-art approaches.
Publisher's Version
GoPV: Detecting Blocking Concurrency Bugs Related to Shared-Memory Synchronization in Go
Wei Song,
Xiaofan Xu, and
Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Go is a popular concurrent programming language that employs both message-passing and shared-memory synchronization primitives for interaction between different threads, known as goroutines. However, the misuse of these synchronization primitives can easily lead to blocking concurrency bugs, including deadlocks and goroutine leaks. While blocking concurrency bugs related to message passing have received increasing attention, little work focuses on the blocking concurrency bugs caused by the misuse of shared-memory synchronization primitives. In this paper, we present GoPV, a static analyzer and open-source tool that performs concurrency analysis and (post-)dominator analysis to detect blocking concurrency bugs by ascertaining whether synchronization primitives are misused. We evaluate GoPV on eight benchmark programs and 21 large real-world Go projects. The experimental results demonstrate that GoPV not only successfully detects all blocking concurrency bugs related to shared-memory synchronization in the eight benchmark programs, but also discovers 17 such bugs in the 21 large Go applications within 2.78 hours.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
More Effective JavaScript Breaking Change Detection via Dynamic Object Relation Graph
Dezhen Kong,
Jiakun Liu,
Chao Ni,
David Lo, and
Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
JavaScript libraries are characterized by their widespread use, frequent code changes, and a high tolerance for backward incompatible changes. Awareness of such breaking changes can help developers adapt to version updates and avoid negative impacts. Several tools have been designed for, or can be used for, breaking change detection in the JavaScript community. However, these tools detect breaking changes in different ways, and there are currently no systematic reviews of these approaches. From a preliminary study on popular JavaScript libraries, we find that existing approaches, including simple regression testing, model-based testing, and type differencing, cannot detect many breaking changes yet produce plenty of false positives. We discuss the reasons for missing breaking changes and producing false positives.
Based on the insights from our findings, we propose a new approach named Diagnose that iteratively constructs an object relation graph based on API exploration and forced execution-based type analysis. Diagnose then refines the graphs and reconstructs them in the newer versions of the libraries to detect breaking changes. By evaluating our approach on the same set of libraries used in the empirical study, we find that Diagnose detects many more breaking changes (60.2%) and produces fewer false positives. Therefore, Diagnose is suitable for practical use.
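At a very high level, the breaking-change check can be pictured as diffing two API relation graphs, as in the sketch below, which flags removed or re-typed members between an old and a new version; the library, members, and type strings are invented, and Diagnose's dynamically constructed object relation graphs are far richer.

# Highly simplified sketch of detecting breaking changes by diffing two API
# relation graphs (member -> inferred type) built from exploring a library in
# its old and new versions. The members and types below are invented; the
# paper's object relation graphs are built dynamically from JavaScript code.
old_graph = {
    "parse":           "function(str) -> object",
    "parse.strict":    "boolean",
    "stringify":       "function(object) -> str",
}
new_graph = {
    "parse":           "function(str, opts) -> object",   # signature changed
    "stringify":       "function(object) -> str",
    "stringifyPretty": "function(object) -> str",         # addition: not breaking
}

removed = old_graph.keys() - new_graph.keys()
changed = {m for m in old_graph.keys() & new_graph.keys()
           if old_graph[m] != new_graph[m]}

for member in sorted(removed):
    print(f"breaking: '{member}' was removed")
for member in sorted(changed):
    print(f"breaking: '{member}' changed from {old_graph[member]!r} to {new_graph[member]!r}")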
Publisher's Version
SWE-GPT: A Process-Centric Language Model for Automated Software Improvement
Yingwei Ma,
Rongyu Cao,
Yongchang Cao,
Yue Zhang,
Jue Chen,
Yibo Liu,
Yuchen Liu,
Binhua Li,
Fei Huang, and
Yongbin Li
(Alibaba Group, China)
Large language models (LLMs) have demonstrated remarkable performance in code generation, significantly enhancing the coding efficiency of developers. Recent advancements in LLM-based agents have led to significant progress in end-to-end automatic software engineering (ASE), particularly in software maintenance (e.g., fixing software issues) and evolution (e.g., adding new features). Despite these encouraging advances, current research faces two major challenges. First, state-of-the-art performance primarily depends on closed-source models like GPT-4, which significantly limits the technology’s accessibility and potential for customization in diverse software engineering tasks. This dependence also raises concerns about data privacy, particularly when handling sensitive codebases. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. Consequently, they may face challenges in navigating complex project structures and generating contextually relevant solutions, which can affect their practical utility in real-world scenarios.
To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers’ thought processes, utilization of external tools, and the interaction between different functional personnel. Our objective is to develop an open-source large language model specifically optimized for software improvement, aiming to match the performance of closed-source alternatives while offering greater accessibility and customization potential. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and Lingma SWE-GPT 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in the software development process—such as repository understanding, fault localization, and patch generation—thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using the SWE-bench-Verified benchmark (comprising 500 real GitHub issues), recently proposed by OpenAI. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (22.76% relative improvement compared to Llama 3.1 405B) and approaching the performance of closed-source models (31.80% of issues resolved by GPT-4o). Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, surpassing the 17.20% resolution rate of Llama 3.1 70B and highlighting the potential for applying smaller models to ASE tasks.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
ICEPRE: ICS Protocol Reverse Engineering via Data-Driven Concolic Execution
Yibo Qu,
Dongliang Fang,
Zhen Wang,
Jiaxing Cheng,
Shuaizong Si,
Yongle Chen, and
Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Taiyuan University of Technology, China)
With the advancement of digital transformation, Industrial Control Systems (ICS) are becoming increasingly open and intelligent. However, inherent vulnerabilities in ICS protocols pose significant security threats to devices and systems, and the proprietary nature of these protocols complicates security analysis and the deployment of protective mechanisms for ICS. Protocol reverse engineering aims to infer the syntax, semantics, and state machines of protocols in the absence of official specifications, but traditional protocol reverse engineering tools face considerable limitations due to the lack of executable environments, incomplete inference strategies, and low-quality network traffic. In this paper, we present ICEPRE, a novel data-driven protocol reverse engineering method based on concolic execution, which uniquely integrates network traces with static analysis. Unlike conventional methods that rely on executable environments, ICEPRE statically tracks the program's parsing process for specific input messages. Furthermore, we employ an innovative field boundary inference strategy that infers the protocol's syntax by analyzing how the protocol parser handles different fields. Our evaluation demonstrates that ICEPRE significantly outperforms previous protocol reverse engineering tools in field boundary inference, achieving an F1 score of 0.76 and a perfection score of 0.67, while DynPRE, BinaryInferno, Nemeys, and Netzob yield (0.65, 0.35), (0.42, 0.14), (0.39, 0.09), and (0.27, 0.10), respectively. These results underscore the superior overall performance of our method. Additionally, ICEPRE exhibits exceptional performance on proprietary protocols in real-world scenarios, highlighting its practical applicability in downstream applications.
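For readers unfamiliar with the reported metric, the short example below computes precision, recall, and F1 over field boundaries, treating a boundary as a byte offset where one field ends and the next begins; the message layout and tool outputs are invented, and the exact matching criterion used in the paper's evaluation may differ.

# Worked example of a field-boundary F1 computation (invented data): a boundary
# is a byte offset where one protocol field ends and the next begins. The exact
# matching criterion used in the paper may differ from this simple set overlap.
true_boundaries     = {2, 4, 8, 12}      # ground-truth split points (byte offsets)
inferred_boundaries = {2, 4, 9, 12, 14}  # what a reverse-engineering tool reported

tp = len(true_boundaries & inferred_boundaries)
precision = tp / len(inferred_boundaries)
recall    = tp / len(true_boundaries)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")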
Publisher's Version
Gamifying Testing in IntelliJ: A Replicability Study
Philipp Straubinger,
Tommaso Fulcini,
Giacomo Garaccione,
Luca Ardito, and
Gordon Fraser
(University of Passau, Germany; Politecnico di Torino, Italy)
Gamification is an emerging technique to enhance motivation and performance in traditionally unengaging tasks like software testing. Previous studies have indicated that gamified systems have the potential to improve software testing processes by providing testers with achievements and feedback. However, further evidence of these benefits across different environments, programming languages, and participant groups is required. This paper aims to replicate and validate the effects of IntelliGame, a gamification plugin for IntelliJ IDEA that engages developers in writing and executing tests. The objective is to generalize the benefits observed in earlier studies to new contexts, i.e., the TypeScript programming language and a larger participant pool. The replicability study consists of a controlled experiment with 174 participants, divided into two groups: one using IntelliGame and one with no gamification plugin. The study employed a two-group experimental design to compare testing behavior, coverage, mutation scores, and participant feedback between the groups. Data was collected through test metrics and participant surveys, and statistical analysis was performed to assess statistical significance. Participants using IntelliGame showed higher engagement and productivity in testing practices than the control group, evidenced by the creation of more tests, increased frequency of executions, and enhanced utilization of testing tools. This ultimately led to better code implementations, highlighting the effectiveness of gamification in improving functional outcomes and motivating users in their testing endeavors. The replication study confirms that gamification, through IntelliGame, positively impacts software testing behavior and developer engagement in coding tasks. These findings suggest that integrating game elements into the testing environment can be an effective strategy to improve software testing practices.
Publisher's Version
CrossProbe: LLM-Empowered Cross-Project Bug Detection for Deep Learning Frameworks
Hao Guan,
Guangdong Bai, and
Yepang Liu
(University of Queensland, Australia; Southern University of Science and Technology, China)
Deep Learning (DL) models may introduce reliability challenges in the underlying DL frameworks. These frameworks may be prone to bugs that can lead to crashes or incorrect results, particularly when involving complex model architectures and substantial computational demands. Such framework bugs can disrupt DL applications, impacting customer experience and potentially causing financial losses. Traditional approaches to testing DL frameworks face limitations in adapting to the vast search space of model structures, diverse APIs, and the complexity of hybrid programming and hardware environments. Recent advancements using Large Language Models (LLMs) have improved DL framework fuzzing, but their efficacy depends heavily on the quality and diversity of input prompts, which are often constructed using single-framework data.
In this paper, we propose an innovative approach for enhancing test generation for DL frameworks by leveraging “mirroring issues”—analogous bugs identified across different frameworks with common functionalities. Our approach is inspired by the fact that DL frameworks, such as PyTorch and TensorFlow, often share common bugs due to dependencies, developer errors, or edge-case inputs. We develop CrossProbe, which utilizes LLMs to effectively learn from existing issues of one framework and transfer the acquired knowledge to generate test cases for finding mirroring issues in another framework, thus enabling cross-framework bug detection. To overcome the challenges of test case generation arising from the incompatible functionalities and different implementations between frameworks, we introduce three processes: alignment, screening, and distinction. These processes help mitigate transfer errors by establishing API pair databases, filtering unsuitable cases, and highlighting cross-framework distinctions. Experiments demonstrate that CrossProbe is efficient, saving 36.3% of generation iterations, and achieves a 25.0% higher success rate in issue transfer compared to existing state-of-the-art LLM-based testing techniques. CrossProbe detects 24 unique bugs using its transferred knowledge; 19 of them were previously unknown, and each requires cross-framework deep learning knowledge to identify.
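The sketch below caricatures the alignment step with a hand-written API-pair table that re-targets a PyTorch-triggering snippet to TensorFlow; the pairs and the snippet are invented, and CrossProbe performs alignment, screening, and distinction with LLMs rather than literal string substitution.

# Minimal sketch of the "alignment" idea: a small API-pair database maps an
# operator in the framework where an issue was observed to its counterpart in
# the target framework, so a triggering snippet can be re-targeted. The pairs
# and the template are invented; CrossProbe uses LLMs, not a lookup table.
API_PAIRS = {
    # PyTorch API                  TensorFlow counterpart (illustrative pairing)
    "torch.nn.functional.relu":  "tf.nn.relu",
    "torch.matmul":              "tf.linalg.matmul",
    "torch.float16":             "tf.float16",
}

pytorch_snippet = "y = torch.matmul(torch.nn.functional.relu(a), b)"

tf_snippet = pytorch_snippet
for src_api, dst_api in API_PAIRS.items():
    tf_snippet = tf_snippet.replace(src_api, dst_api)

print("source test :", pytorch_snippet)
print("transferred :", tf_snippet)   # still needs screening before execution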
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional