ISSTA 2024
33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)
Powered by
Conference Publishing Consulting

33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024), September 16–20, 2024, Vienna, Austria

ISSTA 2024 – Preliminary Table of Contents

Contents - Abstracts - Authors

Frontmatter

Title Page


Welcome from the Chairs
Welcome to the 33rd edition of the International Symposium on Software Testing and Analysis, ISSTA 2024, held on September 16--20, 2024 in Vienna, Austria. ISSTA 2024 is co-located with ECOOP and MPLR 2024. ISSTA brings together academics, industrial researchers, and practitioners from all over the world working on testing and analyzing software systems.

ISSTA 2024 Organization
Committee Listings

ISSTA 2024 Sponsors and Supporters



Technical Papers

Papers Round 1

Detecting Build Dependency Errors in Incremental Builds
Jun Lyu, Shanshan Li, He Zhang, Yang Zhang, Guoping Rong, and Manuel Rigger
(Nanjing University, China; National University of Singapore, Singapore)
Incremental and parallel builds performed by build tools such as Make are the heart of modern C/C++ software projects. Their correct and efficient execution depends on build scripts. However, build scripts are prone to errors. The most prevalent errors are missing dependencies (MDs) and redundant dependencies (RDs). The state-of-the-art methods for detecting these errors rely on clean builds (i.e., full builds of a subset of software configurations in a clean environment), which is costly and takes up to a few hours for large-scale projects. To address these challenges, we propose a novel approach called EChecker to detect build dependency errors in the context of incremental builds. The core idea of EChecker is to automatically update actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes from new commits, which avoids clean builds when possible. EChecker achieves higher efficiency than the methods that rely on clean builds while maintaining effectiveness. We selected 12 representative projects, with their sizes ranging from small to large, with 240 commits (20 commits for each project), based on which we evaluated the effectiveness and efficiency of EChecker. We compared the evaluation results with a state-of-the-art build dependency error detection tool. The evaluation shows that the F-1 score of EChecker improved by 0.18 over the state-of-the-art method. EChecker increases the build dependency error detection efficiency by an average of 85.14 times (with a median of 16.30 times). The results demonstrate that EChecker can support practitioners in detecting build dependency errors efficiently.

Publisher's Version
Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs
Shiwen Shan, Yintong Huo, Yuxin Su, Yichen Li, Dan Li, and Zibin Zheng
(Sun Yat-sen University, China; Chinese University of Hong Kong, China)
Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs in localizing configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end-users to localize the root-cause configuration properties based on logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end-users in coping with configuration errors through log analysis.
To the best of our knowledge, this is the first work to localize the root-cause configuration properties for end-users based on Large Language Models (LLMs) and logs. We evaluate the proposed strategy on Hadoop by LogConfigLocalizer and prove its efficiency with an average accuracy as high as 99.91%. Additionally, we also demonstrate the effectiveness and necessity of different phases of the methodology by comparing it with two other variants and a baseline tool. Moreover, we validate the proposed methodology through a practical case study to demonstrate its effectiveness and feasibility.

Publisher's Version
FastLog: An End-to-End Method to Efficiently Generate and Insert Logging Statements
Xiaoyuan Xie, Zhipeng Cai, Songqiang Chen, and Jifeng Xuan
(Wuhan University, China; Hong Kong University of Science and Technology, China)
Logs play a crucial role in modern software systems, serving as a means for developers to record essential information for future software maintenance. As the performance of these log-based maintenance tasks heavily relies on the quality of logging statements, various works have been proposed to assist developers in writing appropriate logging statements. However, these works either only support developers in partial sub-tasks of this whole activity; or perform with a relatively high time cost and may introduce unwanted modifications. To address their limitations, we propose FastLog, which can support the complete logging statement generation and insertion activity, in a very speedy manner. Specifically, given a program method, FastLog first predicts the insertion position in the finest token level, and then generates a complete logging statement to insert. We further use text splitting for long input texts to improve the accuracy of predicting where to insert logging statements. A comprehensive empirical analysis shows that our method outperforms the state-of-the-art approach in both efficiency and output quality, which reveals its great potential and practicality in current real-time intelligent development environments.

Publisher's Version
FortifyPatch: Towards Tamper-Resistant Live Patching in Linux-Based Hypervisor
Zhenyu Ye, Lei Zhou, Fengwei Zhang, Wenqiang Jin, Zhenyu Ning, Yupeng Hu, and Zheng Qin
(Hunan University, China; National University of Defense Technology, China; Southern University of Science and Technology, China; Xinchuang Haihe Laboratory, China)
Linux-based hypervisors in the cloud server suffer from an increasing number of vulnerabilities in the Linux kernel.To address these vulnerabilities in a timely manner while avoiding the economic loss caused by unplanned shutdowns, live patching schemes have been developed. Unfortunately, existing live patching solutions have failed to protect patches from post-deployment attacks. In addition, patches that involve changes to global variables can lead to practical issues with existing solutions. To address these problems, we present FortifyPatch, a tamper-resistant live patching solution for Linux-based hypervisors in cloud environments. Specifically, FortifyPatch employs multiple Granule Protection Tables from Arm Confidential Computing Architecture to protect the integrity of deployed patches. TrustZone Address Space Controller and Performance Monitor Unit are used to prevent the bypassing of the Patch via kernel code protection and timely page table verification. FortifyPatch is also able to patch global variables via well-designed data access traps.We prototype FortifyPatch and evaluate it using real-world CVE patches. The result shows that FortifyPatch is capable of deploying 81.5% of CVE patches. The performance evaluation indicates that FortifyPatch protects deployed patches with 0.98% and 3.1% overhead on average across indicative benchmarks and real-world applications, respectively.

Publisher's Version Published Artifact Artifacts Available
Unimocg: Modular Call-Graph Algorithms for Consistent Handling of Language Features
Dominik Helm, Tobias Roth, Sven Keidel, Michael Reif, and Mira Mezini
(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; CQSE, Germany; hessian.AI, Germany)
Traditional call-graph construction algorithms conflate the computation of possible runtime types with the actual resolution of (virtual) calls. This tangled design impedes supporting complex language features and APIs and making systematic trade-offs between precision, soundness, and scalability. It also impedes implementation of precise downstream analyses that rely on type information. To address the problem, we propose Unimocg, a modular architecture for call-graph construction that decouples the computation of type information from resolving calls. Due to its modular design, Unimocg can combine a wide range of different call-graph algorithms with algorithm-agnostic modules to support individual language features. Moreover, these modules operate at the same precision as the chosen call-graph algorithm with no further effort. Additionally, Unimocg allows other analyses to easily reuse type information from the call-graph construction at full precision. We demonstrate how Unimocg enables a framework of call-graph algorithms with different precision, soundness, and scalability trade-offs from reusable modules. Unimocg currently supports ten call-graph algorithms from vastly different families, such as CHA, RTA, XTA, and k-l-CFA. These algorithms show consistent soundness without sacrificing precision or performance. We also show how an immutability analysis is improved using Unimocg.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Precise Compositional Buffer Overflow Detection via Heap Disjointness
Yiyuan Guo, Peisen Yao, and Charles Zhang
(Hong Kong University of Science and Technology, China; Zhejiang University, China)
Static analysis techniques for buffer overflow detection still struggle with being scalable for millions of lines of code, while being precise enough to have an acceptable false positive rate. The checking of buffer overflow necessitates reasoning about the heap reachability and numerical relations, which are mutually dependent. Existing techniques to resolve the dependency cycle either sacrifice precision or efficiency due to their limitations in reasoning about symbolic heap location, i.e., heap location with possibly symbolic numerical offsets. A symbolic heap location potentially aliases a large number of other heap locations, leading to a disjunction of heap states that is particularly challenging to reason precisely. Acknowledging the inherent difficulties in heap and numerical reasoning, we introduce a disjointness assumption into the analysis by shrinking the program state space so that all the symbolic locations involved in memory accesses are disjoint from each other. The disjointness property permits strong updates to be performed at symbolic heap locations, significantly improving the precision by incorporating numerical information in heap reasoning. Also, it aids in the design of a compositional analysis to boost scalability, where compact and precise function summaries are efficiently generated and reused. We implement the idea in the static buffer overflow detector Cod. When applying it to large, real-world software such as PHP and QEMU, we have uncovered 29 buffer overflow bugs with a false positive rate of 37%, while projects of millions of lines of code can be successfully analyzed within four hours.

Publisher's Version ACM SIGSOFT Distinguished Paper Award
Enhancing ROS System Fuzzing through Callback Tracing
Yuheng Shen, Jianzhong Liu, Yiru Xu, Hao Sun, Mingzhe Wang, Nan Guan, Heyuan Shi, and Yu Jiang
(Tsinghua University, China; ETH Zurich, Switzerland; City University of Hong Kong, China; Central South University, China)
The Robot Operating System 2 (ROS) is the de-facto standard for robotic software development, with a wide application in diverse safety-critical domains. There are many efforts in testing that seek to deliver a more secure ROS codebase. However, existing testing methods are often inadequate to capture the complex and stateful behaviors inherent to ROS deployments, resulting in limited test- ing effectiveness. In this paper, we propose R2D2, a ROS system fuzzer that leverages ROS’s runtime states as guidance to increase fuzzing effectiveness and efficiency. Unlike traditional fuzzers, R2D2 employs a systematic instrumentation strategy that captures the system’s runtime behaviors and profiles the current system state in real-time. This approach provides a more in-depth understanding of system behaviors, thereby facilitating a more insightful explo- ration of ROS’s extensive state space. For evaluation, we applied it to four well-known ROS applications. Our evaluation shows that R2D2 achieves an improvement of 3.91× and 2.56× in code coverage compared to state-of-the-art ROS fuzzers, including Ros2Fuzz and RoboFuzz, while also uncovering 39 previously unknown vulnera- bilities, with 6 fixed in both ROS runtime and ROS applications. For its runtime overhead, R2D2 maintains an average execution and memory usage overhead with 10.4% and 1.0% in respect, making R2D2 effective in ROS testing.

Publisher's Version
API Misuse Detection via Probabilistic Graphical Model
Yunlong Ma, Wentong Tian, Xiang Gao, Hailong Sun, and Li Li
(Beihang University, China)
API misuses can cause a range of issues in software development, including program crashes, bugs, and vulnerabilities. Different approaches have been developed to automatically detect API misuses by checking the program against usage rules extracted from extensive codebase or API documents. However, these mined rules may not be precise or complete, leading to high false positive/negative rates. In this paper, we propose a novel solution to this problem by representing the mined API usage rules as a probabilistic graphical model, where each rule's probability value represents its trustworthiness of being correct. Our approach automatically constructs probabilistic usage rules by mining codebase and documents, and aggregating knowledge from different sources. Here, the usage rules obtained from the codebase initialize the probabilistic model, while the knowledge from the documents serves as a supplement for adjusting and complementing the probabilities accordingly. We evaluate our approach on the MuBench benchmark. Experimental results show that our approach achieves 42.0% precision and 54.5% recall, significantly outperforming state-of-the-art approaches.

Publisher's Version
Ma11y: A Mutation Framework for Web Accessibility Testing
Mahan Tafreshipour, Anmol Deshpande, Forough Mehralian, Iftekhar Ahmed, and Sam Malek
(University of California at Irvine, Irvine, USA)
Despite the availability of numerous automatic accessibility testing solutions, web accessibility issues persist on many websites. Moreover, there is a lack of systematic evaluations of the efficacy of current accessibility testing tools. To address this gap, we present the first mutation analysis framework, called Ma11y, designed to assess web accessibility testing tools. Ma11y includes 25 mutation operators that intentionally violate various accessibility principles and an automated oracle to determine whether a mutant is detected by a testing tool. Evaluation on real-world websites demonstrates the practical applicability of the mutation operators and the framework’s capacity to assess tool performance. Our results demonstrate that the current tools cannot identify nearly 50% of the accessibility bugs injected by our framework, thus underscoring the need for the development of more effective accessibility testing tools. Finally, the framework’s accuracy and performance attest to its potential for seamless and automated application in practical settings.

Publisher's Version Info
Total Recall? How Good Are Static Call Graphs Really?
Dominik Helm, Sven Keidel, Anemone Kampkötter, Johannes Düsing, Tobias Roth, Ben Hermann, and Mira Mezini
(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; TU Dortmund, Germany; hessian.AI, Germany)
Static call graphs are a fundamental building block of program analysis. However, differences in call-graph construction and the use of specific language features can yield unsoundness and imprecision. Call-graph analyses are evaluated using measures of precision and recall, but this is hard when a ground truth for real-world programs is generally unobtainable. In this work, we propose to use carefully constructed dynamic baselines based on fixed entry points and input corpora. The creation of this dynamic baseline is posed as an approximation of the ground truth---an optimization problem. We use manual extension and coverage-guided fuzzing for creating suitable input corpora. With these dynamic baselines, we study call-graph quality of multiple algorithms and implementations using four real-world Java programs. We find that our methodology provides valuable insights into call-graph quality and how to measure it. With this work, we provide a novel methodology to advance the field of static program analysis as we assess the computation of one of its core data structures---the call graph.

Publisher's Version Published Artifact Artifacts Available
CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios
Zhengran Zeng, Yidong Wang, Rui Xie, Wei Ye, and Shikun Zhang
(Peking University, China)
In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail to capture the multi-tasking nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining the effects of continued pre-training in specific programming languages code and instruction fine-tuning on their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution in the specific programming languages continued pre-training and instruction fine-tuning, as these techniques could hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluations of programming capabilities in LLMs, and our study provides valuable insights for the future development of these models in software engineering.

Publisher's Version
DAppFL: Just-in-Time Fault Localization for Decentralized Applications in Web3
Zhiying Wu, Jiajing Wu, Hui Zhang, Ziwei Li, Jiachi Chen, Zibin Zheng, Qing Xia, Gang Fan, and Yi Zhen
(Sun Yat-sen University, China; Institute of Software at Chinese Academy of Sciences, China; Independent, China)
Web3 describes an idea for the next evolution of the Internet, where blockchain technology enables the Internet of Value. As Web3 software, decentralized applications (DApps) have emerged in recent years. There exists a natural link between DApps and cryptocurrencies, where faults in DApps could directly lead to monetary losses associated with cryptocurrencies. Hence, efficient fault localization technology is of paramount importance for urgent DApp rescue operations and the mitigation of financial losses. However, fault localization methods applied in traditional applications are not well-suited for this specific field, due to their inability to identify DApp-specific fault features, e.g., a substantial amount of cryptocurrency is transferred from DApps to hackers. In order to explore the root cause of DApp faults, some researchers try to identify suspicious code snippets through mutation testing. Nonetheless, applying mutation testing for DApp fault localization is time-consuming and thus limited in practice. This paper conducts the first comprehensive study of DApp fault localization. We introduce DAppFL, a learning-based DApp fault localization tool that performs reverse engineering to gather executed source code and then trace cryptocurrency flow to assist in locating faulty functions. We also present the inaugural dataset for DApp fault localization, providing a new benchmark for this domain.Our experimental results demonstrate that DAppFL locates 63% of faults within the Top-5, 23% more than the state-of-the-art method. To facilitate further research, our code and dataset are freely available online: https://github.com/xplanet-sysu/awesome-works#dappfl.

Publisher's Version
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, and Xi Xiao
(Tsinghua University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently, which mostly are embedding-based, but have shown limited accuracy and efficiency especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overheads. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable ones) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in real world, we construct a large-scale benchmark of vulnerability, offering the first precise evaluation scheme to assess BCSD methods for the 1-day vulnerability detection task. CEBin could identify the similar function from millions of candidate functions in just a few seconds and achieves an impressive recall rate of 85.46% on this more practical but challenging task, which are several order of magnitudes faster and 4.07× better than the best SOTA baseline.

Publisher's Version
Interprocedural Path Complexity Analysis
Mira Kaniyur, Ana Cavalcante-Studart, Yihan Yang, Sangeon Park, David Chen, Duy Lam, and Lucas Bang
(Harvey Mudd College, USA)
Software testing techniques like symbolic execution face significant challenges with path explosion. Asymptotic Path Complexity (APC) quantifies this path explosion complexity, but existing APC methods do not work for interprocedural functions in general. Our new algorithm, APC-IP, efficiently computes APC for a wider range of functions, including interprocedural ones, improving over previous methods in both speed and scope. We implement APC-IP atop the existing software Metrinome, and test it against a benchmark of C functions, comparing it to existing and baseline approaches as well as comparing it to the path explosion of the symbolic execution engine Klee. The results show that APC-IP not only aligns with previous APC values but also excels in performance, scalability, and handling complex source code. It also provides a complexity prediction of the number of paths explored by Klee, extending the APC metric's applicability and surpassing previous implementations.

Publisher's Version
Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models
Mingyi Zhou, Xiang Gao, Pei Liu, John Grundy, Chunyang Chen, Xiao Chen, and Li Li
(Monash University, Australia; Beihang University, China; CSIRO’s Data61, Australia; TU Munich, Germany; University of Newcastle, Australia)
Recent studies show that on-device deployed deep learning (DL) models, such as those of Tensor Flow Lite (TFLite), can be easily extracted from real-world applications and devices by attackers to generate many kinds of adversarial and other attacks. Although securing deployed on-device DL models has gained increasing attention, no existing methods can fully prevent these attacks. Traditional software protection techniques have been widely explored. If on-device models can be implemented using pure code, such as C++, it will open the possibility of reusing existing robust software protection techniques. However, due to the complexity of DL models, there is no automatic method that can translate DL models to pure code. To fill this gap, we propose a novel method, CustomDLCoder, to automatically extract on-device DL model information and synthesize a customized executable program for a wide range of DL models. CustomDLCoder first parses the DL model, extracts its backend computing codes, configures the extracted codes, and then generates a customized program to implement and deploy the DL model without explicit model representation. The synthesized program hides model information for DL deployment environments since it does not need to retain explicit model representation, preventing many attacks on the DL model. In addition, it improves ML performance because the customized code removes model parsing and preprocessing steps and only retains the data computing process. Our experimental results show that CustomDLCoder improves model security by disabling on-device model sniffing. Compared with the original on-device platform (i.e., TFLite), our method can accelerate model inference by 21.0% and 24.3% on x86-64 and ARM64 platforms, respectively. Most importantly, it can significantly reduce memory consumption by 68.8% and 36.0% on x86-64 and ARM64 platforms, respectively.

Publisher's Version Published Artifact Artifacts Available
UPBEAT: Test Input Checks of Q# Quantum Libraries
Tianmin Hu, Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Huanting Wang, Meng Li, and Zheng Wang
(Northwest University, China; Concordia University, Canada; University of Leeds, United Kingdom; Hefei University of Technology, China)
High-level programming models like Q# significantly simplify the complexity of programming for quantum computing. These models are supported by a set of foundation libraries for code development. However, errors can occur in the library implementation, and one common root cause is the lack of or incomplete checks on properties like values, length, and quantum states of inputs passed to user-facing subroutines. This paper presents Upbeat, a fuzzing tool to generate random test cases for bugs related to input checking in Q# libraries. Upbeat develops an automated process to extract constraints from the API documentation and the developer implemented input-checking statements. It leverages open-source Q# code samples to synthesize test programs. It frames the test case generation as a constraint satisfaction problem for classical computing and a quantum state model for quantum computing to produce carefully generated subroutine inputs to test if the input-checking mechanism is appropriately implemented. Under 100 hours of automated test runs, Upbeat has successfully identified 16 bugs in API implementations and 4 documentation errors. Of these, 14 have been confirmed, and 12 have been fixed by the library developers.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Enhancing Robustness of Code Authorship Attribution through Expert Feature Knowledge
Xiaowei Guo, Cai Fu, Juan Chen, Hongle Liu, Lansheng Han, and Wenjin Li
(Huazhong University of Science and Technology, China; NSFOCUS Technologies Group, China)
Code authorship attribution has been an interesting research problem for decades. Recent studies have revealed that existing methods for code authorship attribution suffer from weak robustness. Under the influence of small perturbations added by the attacker, the accuracy of the method will be greatly reduced. As of now, there is no code authorship attribution method capable of effectively handling such attacks. In this paper, we attribute the weak robustness of code authorship attribution methods to dataset bias and argue that this bias can be mitigated through adjustments to the feature learning strategy. We first propose a robust code authorship attribution feature combination framework, which is composed of only simple shallow neural network structures, and introduces controllability for the framework in the feature extraction by incorporating expert knowledge. Experiments show that the framework has significantly improved robustness over mainstream code authorship attribution methods, with an average drop of 23.4% (from 37.8% to 14.3%) in the success rate of targeted attacks and 25.9% (from 46.7% to 20.8%) in the success rate of untargeted attacks. At the same time, it can also achieve results comparable to mainstream code authorship attribution methods in terms of accuracy.

Publisher's Version
A Large-Scale Empirical Study on Improving the Fairness of Image Classification Models
Junjie Yang, Jiajun Jiang, Zeyu Sun, and Junjie Chen
(Tianjin University, China; Institute of Software at Chinese Academy of Sciences, China)
Fairness has been a critical issue that affects the adoption of deep learning models in real practice. To improve model fairness, many existing methods have been proposed and evaluated to be effective in their own contexts. However, there is still no systematic evaluation among them for a comprehensive comparison under the same context, which makes it hard to understand the performance distinction among them, hindering the research progress and practical adoption of them. To fill this gap, this paper endeavours to conduct the first large-scale empirical study to comprehensively compare the performance of existing state-of-the-art fairness improving techniques. Specifically, we target the widely-used application scenario of image classification, and utilized three different datasets and five commonly-used performance metrics to assess in total 13 methods from diverse categories. Our findings reveal substantial variations in the performance of each method across different datasets and sensitive attributes, indicating over-fitting on specific datasets by many existing methods. Furthermore, different fairness evaluation metrics, due to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing methods and in-processing methods outperform post-processing methods, with pre-processing methods exhibiting the best performance. Our empirical study offers comprehensive recommendations for enhancing fairness in deep learning models. We approach the problem from multiple dimensions, aiming to provide a uniform evaluation platform and inspire researchers to explore more effective fairness solutions via a set of implications.

Publisher's Version
A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?
Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jieming Zhu, and Michael R. Lyu
(Chinese University of Hong Kong, China; Sun Yat-sen University, China; Huawei Noah’s Ark Lab, China)
Log data have facilitated various tasks of software development and maintenance, such as testing, debugging and diagnosing. Due to the unstructured nature of logs, log parsing is typically required to transform log messages into structured data for automated log analysis. Given the abundance of log parsers that employ various techniques, evaluating these tools to comprehend their characteristics and performance becomes imperative. Loghub serves as a commonly used dataset for benchmarking log parsers, but it suffers from limited scale and representativeness, posing significant challenges for studies to comprehensively evaluate existing log parsers or develop new methods. This limitation is particularly pronounced when assessing these log parsers for production use. To address these limitations, we provide a new collection of annotated log datasets, denoted Loghub-2.0, which can better reflect the characteristics of log data in real-world software systems. Loghub-2.0 comprises 14 datasets with an average of 3.6 million log lines in each dataset. Based on Loghub-2.0, we conduct a thorough re-evaluation of 15 state-of-the-art log parsers in a more rigorous and practical setting. Particularly, we introduce a new evaluation metric to mitigate the sensitivity of existing metrics to imbalanced data distributions. We are also the first to investigate the granular performance of log parsers on logs that represent rare system events, offering in-depth details for software diagnosis. Accurately parsing such logs is essential, yet it remains a challenge. We believe this work could shed light on the evaluation and design of log parsers in practical settings, thereby facilitating their deployment in production systems.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection
Xin-Cheng Wen, Cuiyun Gao, Shuzheng Gao, Yang Xiao, and Michael R. Lyu
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China; Chinese Academy of Sciences, China)
Recently, there has been a growing interest in automatic software vulnerability detection. Pre-trained model-based approaches have demonstrated superior performance than other Deep Learning (DL)-based approaches in detecting vulnerabilities. However, the existing pre-trained model-based approaches generally employ code sequences as input during prediction, and may ignore vulnerability-related structural information, as reflected in the following two aspects. First, they tend to fail to infer the semantics of the code statements with complex logic such as those containing multiple operators and pointers. Second, they are hard to comprehend various code execution sequences, which is essential for precise vulnerability detection. To mitigate the challenges, we propose a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework based on the pre-trained models, named . The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on the Abstract Syntax Trees (ASTs).Specifically, comprises three main modules: (1) Comment Tree Construction, which aims at enhancing the model’s ability to infer the semantics of code statements by first incorporating Large Language Models (LLMs) for comment generation and then adding the comment node to ASTs. (2) Structured Natural Language Comment Tree Construction, which aims at explicitly involving code execution sequence by combining the code syntax templates with the comment tree. (3) SCT-Enhanced Representation, which finally incorporates the constructed SCTs for well capturing vulnerability patterns. Experimental results demonstrate that outperforms the best-performing baseline, including the pre-trained model and LLMs, with improvements of 2.96%, 13.47%, and 3.75% in terms of F1 score on the FFMPeg+Qemu, Reveal, and SVulD datasets, respectively. Furthermore, can be applied to different pre-trained models, such as CodeBERT and UniXcoder, yielding the F1 score performance enhancements ranging from 1.37% to 10.87%.

Publisher's Version
Distance-Aware Test Input Selection for Deep Neural Networks
Zhong Li, Zhengfeng Xu, Ruihua Ji, Minxue Pan, Tian Zhang, Linzhang Wang, and Xuandong Li
(Nanjing University, China)
Deep Neural Network (DNN) testing is one of the common practices to guarantee the quality of DNNs. However, DNN testing in general requires a significant amount of test inputs with oracle information (labels), which can be challenging and resource-intensive to obtain. To relieve this problem, we propose DATIS, a distance-aware test input selection approach for DNNs. Specifically, DATIS adopts a two-step approach for selecting test inputs. In the first step, it selects test inputs based on improved uncertainty scores derived from the distances between the test inputs and their nearest neighbor training samples. In the second step, it further eliminates test inputs that may cover the same faults by examining the distances among the selected test inputs. To evaluate DATIS, we conduct extensive experiments on 8 diverse subjects, taking into account different domains of test inputs, varied DNN structures, and diverse types of test inputs. Evaluation results show that DATIS significantly outperforms 15 baseline approaches in both selecting test inputs with high fault-revealing power and guiding the selection of data for DNN enhancement.

Publisher's Version
LPR: Large Language Models-Aided Program Reduction
Mengxiao Zhang, Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Shin Hwei Tan, and Chengnian Sun
(University of Waterloo, Canada; Hong Kong University of Science and Technology, China; Concordia University, Canada)
Program reduction is a widely used technique to facilitate debugging compilers by automatically minimizing programs that trigger compiler bugs. Existing program reduction techniques are either generic to a wide range of languages (such as Perses and Vulcan) or specifically optimized for one certain language by exploiting language-specific knowledge (e.g., C-Reduce). However, synergistically combining both generality across languages and optimality to a specific language in program reduction is yet to be explored. This paper proposes LPR, the first LLMs-aided technique leveraging LLMs to perform language-specific program reduction for multiple languages. The key insight is to utilize both the language generality of program reducers such as Perses and the languagespecific semantics learned by LLMs. Concretely, language-generic program reducers can efficiently reduce programs into a small size that is suitable for LLMs to process; LLMs can effectively transform programs via the learned semantics to create new reduction opportunities for the language-generic program reducers to further reduce the programs. Our thorough evaluation on 50 benchmarks across three programming languages (i.e., C, Rust and JavaScript) has demonstrated LPR’s practicality and superiority over Vulcan, the state-of-the-art language-generic program reducer. For effectiveness, LPR surpasses Vulcan by producing 24.93%, 4.47%, and 11.71% smaller programs on benchmarks in C, Rust and JavaScript, separately. Moreover, LPR and Vulcan have the potential to complement each other. For the C language for which C-Reduce is optimized, by applying Vulcan to the output produced by LPR, we can attain program sizes that are on par with those achieved by C-Reduce. For efficiency perceived by users, LPR is more efficient when reducing large and complex programs, taking 10.77%, 34.88%, 36.96% less time than Vulcan to finish all the benchmarks in C, Rust and JavaScript, separately.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code
Yujia Chen, Cuiyun Gao, Zezhou Yang, Hongyu Zhang, and Qing Liao
(Harbin Institute of Technology, China; Chongqing University, China)
In the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still face difficulties with long code inputs. This is mainly due to their limited capacity to maintain contextual continuity and memorize the key information over long-range code. To alleviate the difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two innovative memory mechanisms we propose in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a 𝑘NN attention layer to adaptively select the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model’s overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models such as UniXcoder and two code intelligence tasks including API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-training language models.

Publisher's Version
Define-Use Guided Path Exploration for Better Forced Execution
Dongnan He, Dongchen Xie, Yujie Wang, Wei You, Bin Liang, Jianjun Huang, Wenchang Shi, Zhuo Zhang, and Xiangyu Zhang
(Renmin University of China, China; Purdue University, USA)
The evolution of recent malware, characterized by the escalating use of cloaking techniques, poses a significant challenge in the analysis of malware behaviors. Researchers proposed forced execution to penetrate malware’s self-protection mechanisms and expose hidden behaviors, by forcefully setting certain branch outcomes. Existing studies focus on enhancing the forced executor to provide light-weight crash-free execution models. However, insufficient attention has been directed toward the path exploration strategy, an aspect equally crucial to the effectiveness. Linear search employed in state-of-the-art forced execution tools exhibits inherent limitations that lead to unnecessary path exploration and incomplete behavior exposure. In this paper, we propose a novel and practical path exploration strategy that focuses on the coverage of defineuse relations in the subject binary. We develop a fuzzing approach for exploring these define-use relations in a progressive and self-supervised way. Our experimental results show that the proposed solution outperforms the existing forced execution tools in both memory dependence coverage and malware behavior exposure.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
C2D2: Extracting Critical Changes for Real-World Bugs with Dependency-Sensitive Delta Debugging
Xuezhi Song, Yijian Wu, Shuning Liu, Bihuan Chen, Yun Lin, and Xin Peng
(Fudan University, China; Shanghai Jiao Tong University, China)
Data-driven techniques are promising for automatically locating and fixing bugs, which can reduce enormous time and effort for developers. However, the effectiveness of these techniques heavily relies on the quality and scale of bug datasets. Despite that emerging approaches to automatic bug dataset construction partially provide a solution for scalability, data quality remains a concern. Specifically, it remains a barrier for humans to isolate the minimal set of bug-inducing or bug-fixing changes, known as critical changes. Although delta debugging (DD) techniques are capable of extracting critical changes on benchmark datasets in academia, the efficiency and accuracy are still limited when dealing with real-world bugs, where code change dependencies could be overly complicated. In this paper, we propose C2D2, a novel delta debugging approach for critical change extraction, which estimates the probabilities of dependencies between code change elements. C2D2 considers the probabilities of dependencies and introduces a matrix-based search mechanism to resolve compilation errors (CE) caused by missing dependencies. It also provides hybrid mechanisms for flexibly selecting code change elements during the DD process. Experiments on Defect4J and a real-world regression bug dataset reveal that C2D2 is significantly more efficient than the traditional DD algorithm ddmin with competitive effectiveness, and significantly more effective and more efficient than the state-of-the-art DD algorithm ProbDD. Furthermore, compared to human-isolated critical changes, C2D2 produces the same or better critical change results in 56% cases in Defects4J and 86% cases in the regression dataset, demonstrating its usefulness in automatically extracting critical changes and saving human efforts in constructing large-scale bug datasets with real-world bugs.

Publisher's Version
FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion
Qi Guo, Xiaohong Li, Xiaofei Xie, Shangqing Liu, Ze Tang, Ruitao Feng, Junjie Wang, Jidong Ge, and Lei Bu
(Tianjin University, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore; Nanjing University, China)
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative approach, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning. Despite their potential, a significant challenge is that the designs of these methods often rely on heuristics, leaving critical questions about what information should be stored or retrieved and how to interpolate such information for augmenting predictions. To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. While FT2Ra adopts a retrieval-based mechanism, it uniquely adopts a paradigm with a learning rate and multi-epoch retrievals, which is similar to fine-tuning. We conducted a comprehensive evaluation of FT2Ra in both token-level and line-level code completions. Our findings demonstrate the remarkable effectiveness of FT2Ra when compared to state-of-the-art methods and its potential to genuine fine-tuning. In token-level completion, which represents a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a substantial more than twice increase in Exact Match (EM) performance, indicating the significant advantages of our theoretical analysis. Notably, even when operating without actual fine-tuning, FT2Ra exhibits competitive performance compared to the models with real fine-tuning.

Publisher's Version
MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing
Tianyi Yang, Cheryl Lee, Jiacheng Shen, Yuxin Su, Cong Feng, Yongqiang Yang, and Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong; Sun Yat-sen University, China; Huawei Cloud Computing Technology, China)
Microservice resilience, the ability of microservices to recover from failures and continue providing reliable and responsive services, is crucial for cloud vendors. However, the current practice relies on manually configured rules specific to a certain microservice system, resulting in labor-intensity and flexibility issues, given the large scale and high dynamics of microservices. A more labor-efficient and versatile solution is desired. Our insight is that resilient deployment can effectively prevent the dissemination of degradation from system performance metrics to user-aware metrics, and the latter directly affects service quality. In other words, failures in a non-resilient deployment can impact both types of metrics, leading to user dissatisfaction. With this in mind, we propose MicroRes, the first versatile resilience profiling framework for microservices via degradation dissemination indexing. MicroRes first injects failures into microservices and collects available monitoring metrics. Then, it ranks the metrics according to their contributions to the overall service degradation. It produces a resilience index by how much the degradation is disseminated from system performance metrics to user-aware metrics. Higher degradation dissemination indicates lower resilience. We evaluate MicroRes on two open-source and one industrial microservice system. The experiments show MicroRes' efficient and effective resilience profiling of microservices. We also showcase MicroRes' practical usage in production.

Publisher's Version
Isolation-Based Debugging for Neural Networks
Jialuo Chen, Jingyi Wang, Youcheng Sun, Peng Cheng, and Jiming Chen
(Zhejiang University, China; University of Manchester, United Kingdom)
Neural networks (NNs) are known to have diverse defects such as adversarial examples, backdoor and discrimination, raising great concerns about their reliability. While NN testing can effectively expose these defects to a significant degree, understanding their root causes within the network requires further examination. In this work, inspired by the idea of debugging in traditional software for failure isolation, we propose a novel unified neuron-isolation-based framework for debugging neural networks, shortly IDNN. Given a buggy NN that exhibits certain undesired properties (e.g., discrimination), the goal of IDNN is to identify the most critical and minimal set of neurons that are responsible for exhibiting these properties. Notably, such isolation is conducted with the objective that by simply ‘freezing’ these neurons, the model’s undesired properties can be eliminated, resulting in a much more efficient model repair compared to computationally expensive retraining or weight optimization as in existing literature. We conduct extensive experiments to evaluate IDNN across a diverse set of NN structures on five benchmark datasets, for solving three debugging tasks, including backdoor, unfairness, and weak class. As a lightweight framework, IDNN outperforms state-of-the-art baselines by successfully identifying and isolating a very small set of responsible neurons, demonstrating superior generalization performance across all tasks.

Publisher's Version
Atlas: Automating Cross-Language Fuzzing on Android Closed-Source Libraries
Hao Xiong, Qinming Dai, Rui Chang, Mingran Qiu, Renxiang Wang, Wenbo Shen, and Yajin Zhou
(Zhejiang University, China; ZJU-Hangzhou Global Scientific and Technological Innovation Center, China)
Fuzzing is an effective method for detecting security bugs in software, and there have been quite a few effective works on fuzzing Android. Researchers have developed methods for fuzzing open-source native APIs and Java interfaces on actual Android devices. However, the realm of automatically fuzzing Android closed-source native libraries, particularly on emulators, remains insufficiently explored. There are two key challenges: firstly, the multi-language programming model inherent to Android; and secondly, the absence of a Java runtime environment within the emulator. To address these challenges, we propose Atlas, a practical automated fuzz framework for Android closed-source native libraries. Atlas consists of an automatic harness generator and a fuzzer containing the necessary runtime environment. The generator uses static analysis techniques to deduce the correct calling sequences and parameters of the native API according to the information from the "native world" and the "Java world". To maximize the practicality of the generated harness, Atlas heuristically optimizes the generated harness. The Fuzzer provides the essential Java runtime environment in the emulator, making it possible to fuzz the Android closed-source native libraries on a multi-core server. We have tested Atlas on 17 pre-installed apps from four Android vendors. Atlas generates 820 harnesses containing 767 native APIs, of which 78% is practical. Meanwhile, Atlas has discovered 74 new security bugs with 16 CVEs assigned. The experiments show that Atlas can efficiently generate high-quality harnesses and find security bugs.

Publisher's Version
Automating Zero-Shot Patch Porting for Hard Forks
Shengyi Pan, You Wang, Zhongxin Liu, Xing Hu, Xin Xia, and Shanping Li
(Zhejiang University, China; Huawei, China)
Forking is a typical way of code reuse, which provides a simple way for developers to create a variant software (denoted as hard fork) by copying and modifying an existing codebase. Despite of the benefits, forking also leads to duplicate efforts in software maintenance. Developers need to port patches across the hard forks to address similar bugs or implement similar features. Due to the divergence between the source project and the hard fork, patch porting is complicated, which requires an adaption regarding different implementations of the same functionality. In this work, we take the first step to automate patch porting for hard forks under a zero-shot setting. We first conduct an empirical study of the patches ported from Vim to Neovim over the last ten years to investigate the necessities of patch porting and the potential flaws in the current practice. We then propose a large language model (LLM) based approach (namely PPatHF) to automatically port patches for hard forks on a function-wise basis. Specifically, PPatHF is composed of a reduction module and a porting module. Given the pre- and post-patch versions of a function from the reference project and the corresponding function from the target project, the reduction module first slims the input functions by removing code snippets less relevant to the patch. Then, the porting module leverages a LLM to apply the patch to the function from the target project. To better elicit the power of the LLM on patch porting, we design a prompt template to enable efficient in-context learning. We further propose an instruction-tuning based training task to better guide the LLM to port the patch and inject task-specific knowledge. We evaluate PPatHF on 310 Neovim patches ported from Vim. The experimental results show that PPatHF outperforms the baselines significantly. Specifically, PPatHF can correctly port 131 (42.3%) patches and automate 57% of the manual edits required for the developer to port the patch.

Publisher's Version
DiaVio: LLM-Empowered Diagnosis of Safety Violations in ADS Simulation Testing
You Lu, Yifan Tian, Yuyang Bi, Bihuan Chen, and Xin Peng
(Fudan University, China)
Simulation testing has been widely adopted by leading companies to ensure the safety of autonomous driving systems (ADSs). Anumber of scenario-based testing approaches have been developed to generate diverse driving scenarios for simulation testing, and demonstrated to be capable of finding safety violations. However, there is no automated way to diagnose whether these violations are caused by the ADS under test and which category these violations belong to. As a result, great effort is required to manually diagnose violations. To bridge this gap, we propose DiaVio to automatically diagnose safety violations in simulation testing by leveraging large language models (LLMs). It is built on top of a new domain specific language (DSL) of crash to align real-world accident reports described in natural language and violation scenarios in simulation testing. DiaVio fine-tunes a base LLM with real-world accident reports to learn diagnosis capability, and uses the fine-tuned LLM to diagnose violation scenarios in simulation testing. Our evaluation has demonstrated the effectiveness and efficiency of DiaVio in violation diagnosis.

Publisher's Version
Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation
Zhaoyang Chu, Yao Wan, Qian Li, Yang Wu, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin
(Huazhong University of Science and Technology, China; Curtin University, Perth, Australia; Chongqing University, China; UNSW, Sydney, Australia; University of Technology, Sydney, Australia)
Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: "What would happen to the GNN's decision if we were to alter the code graph into alternative structures?" Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.

Publisher's Version
DeFort: Automatic Detection and Analysis of Price Manipulation Attacks in DeFi Applications
Maoyi Xie, Ming Hu, Ziqiao Kong, Cen Zhang, Yebo Feng, Haijun Wang, Yue Xue, Hao Zhang, Ye Liu, and Yang Liu
(Nanyang Technological University, Singapore; Xi’an Jiaotong University, China; MetaTrust Labs, Singapore)
Although Decentralized Finance (DeFi) applications facilitate tamper-proof transactions among multiple anonymous users, since attackers can access the smart contract bytecode directly, vulnerabilities in the transaction mechanism, contract code, or third-party components can be easily exploited to manipulate token prices, leading to financial losses. Since price manipulation often relies on specific states and complex trading sequences, existing detection tools have limitations in addressing this problem. In addition, to swiftly identify the root cause of an attack and implement targeted defense and remediation measures, auditors typically prioritize understanding the methodology behind the attack, emphasizing 'how' it occurred rather than simply confirming its existence. To address these problems, this paper presents a novel automatic price manipulation detection and analysis framework, named DeFort, which contains a price manipulation behavior model to guide on-chain detection, multiple price monitoring strategies to detect pools with abnormal token prices, and various profit calculation mechanisms to confirm attacks. Based on behavioral models, DeFort can automatically locate transactions and functions that cause abnormal price fluctuations and identify attackers and victims. Experimental results demonstrate that DeFort can outperform state-of-the-art price manipulation detection methods. Furthermore, after monitoring 441 real-world projects for two months, DeFort successfully detected five price manipulation attacks.

Publisher's Version Info
Traceback: A Fault Localization Technique for Molecular Programs
Michael C. Gerten, James I. Lathrop, and Myra B. Cohen
(Iowa State University, USA)
Fault localization is essential to software maintenance tasks such as testing and automated program repair. Many fault localization techniques have been developed, the most common of which are spectrum-based. Most techniques have been designed for traditional programming paradigms that map passing and failing test cases to lines or branches of code, hence specialized programming paradigms which utilize different code abstractions may fail to localize well. In this paper, we study fault localization in the context of a class of programs, molecular programs. Recent research has designed automated testing and repair frameworks for these pro- grams but has ignored the importance of fault localization. As we demonstrate, using existing spectrum-based approaches may not provide much information. Instead we propose a novel approach, Traceback, that leverages temporal trace data. In an empirical study on a set of 89 faulty program variants, we demonstrate that Trace- back provides between a 32-90% improvement in localization over reaction-based mapping, a direct translation of spectrum-based localization. We see little difference in parameter tuning of Trace- back when all tests, or only code-based (invariant) tests are used, however the best depth and weight parameters vary when using specification based tests, which can be either functional or meta- morphic. Overall, invariant-based tests provide the best localization results (either alone or in combination with others), followed by metamorphic and then functional tests.

Publisher's Version
Silent Taint-Style Vulnerability Fixes Identification
Zhongzhen Wen, Jiayuan Zhou, Minxue Pan, Shaohua Wang, Xing Hu, Tongtong Xu, Tian Zhang, and Xuandong Li
(Nanjing University, China; Huawei, Waterloo, Canada; Central University of Finance and Economics, China; Zhejiang University, China; Huawei, China)
The coordinated vulnerability disclosure model, widely adopted in open-source software (OSS) organizations, recommends the silent resolution of vulnerabilities without revealing vulnerability information until their public disclosure. However, the inherently public nature of OSS development leads to security fixes becoming publicly available in repositories weeks before the official disclosure of vulnerabilities. This time gap poses a significant security risk to OSS users, as attackers could discover the fix and exploit vulnerabilities before disclosure. Thus, there is a critical need for OSS users to sense fixes as early as possible to address the vulnerability before any exploitation occurs. In response to this challenge, we introduce EarlyVulnFix, a novel approach designed to identify silent fixes for taint-style vulnerabilities—a persistent class of security weaknesses where attacker-controlled input reaches sensitive operations (sink) without proper sanitization. Leveraging data flow and dependency analysis, our tool distinguishes two types of connections between newly introduced code and sinks, tailored for two common fix scenarios. Our evaluation demonstrates that EarlyVulnFix surpasses state-of-the-art baselines by a substantial margin in terms of F1 score. Furthermore, when applied to the 700 latest commits across seven projects, EarlyVulnFix detected three security fixes before their respective security releases, highlighting its effectiveness in identifying unreported vulnerability fixes in the wild.

Publisher's Version
Benchmarking Automated Program Repair: An Extensive Study on Both Real-World and Artificial Bugs
Yicheng Ouyang, Jun Yang, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
As bugs are inevitable and prevalent in real-world programs, many Automated Program Repair (APR) techniques have been proposed to generate patches for them. However, due to the lack of a standard for evaluating APR techniques, prior works tend to use different settings and benchmarks in evaluation, threatening the trustworthiness of the evaluation results. Additionally, they typically only adopt plausibility and genuineness as evaluation metrics, which may potentially mask some underlying issues in APR techniques. To overcome these issues, in this paper, we conduct an extensive and multi-dimensional evaluation of nine learning-based and three traditional state-of-the-art APR techniques under the same environment and settings. We employ the widely studied Defects4J V2.0.0 benchmark and a newly constructed large-scale mutation-based benchmark named MuBench, derived from Defects4J and including 1,700 artificial bugs generated by various mutators, to uncover potential limitations in these APR techniques. We also apply multi-dimensional metrics, including compilability/plausibility/genuineness metrics, as well as SYE (SYntactic Equivalence) and TCE (Trivial Compiler Equivalence) metrics, to thoroughly analyze the 1,814,652 generated patches. This paper presents noteworthy findings from the extensive evaluation: Firstly, Large Language Model (LLM) based APR demonstrates less susceptibility to overfitting on the Defects4J V1.2.0 dataset and fixes the most number of bugs. Secondly, the study suggests a promising future for combining traditional and learning-based APR techniques, as they exhibit complementary advantages in fixing different types of bugs. Additionally, this work highlights the necessity for further enhancing patch compilability of learning-based APR techniques, despite the presence of various existing strategies attempting to improve it. The study also reveals other guidelines for enhancing APR techniques, including the need for handling unresolvable symbol compilability issues and reducing duplicate/no-op patch generation. Finally, our study uncovers seven implementation issues in the studied techniques, with five of them confirmed and fixed by the corresponding authors.

Publisher's Version Published Artifact Info Artifacts Available
Multi-modal Learning for WebAssembly Reverse Engineering
Hanxian Huang and Jishen Zhao
(University of California at San Diego, San Diego, USA)
The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives the demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced machine learning (ML)-based WebAssembly reverse engineering tools. Yet, the generalization of task-specific ML solutions remains challenging, because their effectiveness hinges on the availability of an ample supply of high-quality task-specific labeled data. Moreover, previous works trained models only with features extracted from WebAssembly, overlooking the high-level semantics present in the corresponding source code and its documentation. Acknowledging the abundance of available source code with documentation, which can be compiled into WebAssembly, we propose to learn representations of them concurrently and harness their mutual relationships for effective WebAssembly reverse engineering. In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation and the compiled WebAssembly, without requiring labeled data. WasmRev incorporates three tailored multi-modal pre-training tasks to capture various characteristics of WebAssembly and cross-modal relationships. WasmRev is only trained once to produce general-purpose representations that can broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with much less labeled data, improving data efficiency. We fine-tune WasmRev onto three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization. Our results show that WasmRev pre-trained on the corpus of multi-modal samples establishes a robust foundation for these tasks, achieving high task accuracy and outperforming the state-of-the-art ML methods for WebAssembly reverse engineering.

Publisher's Version ACM SIGSOFT Distinguished Paper Award
CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature
Chenyan Liu, Yufan Cai, Yun Lin, Yuhuan Huang, Yunrui Pei, Bo Jiang, Ping Yang, Jin Song Dong, and Hong Mei
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Bytedance Network Technology, Beijing, China)
Recent years have seen the development of LLM-based code generation. Compared to generating code in a software project, incremental code edits are empirically observed to be more frequent. The emerging code editing approaches usually formulate the problem as generating an edit based on known relevant prior edits and context. However, practical code edits can be more complicated. First, an editing session can include multiple (ir)relevant edits to the code under edit. Second, the inference of the subsequent edits is non-trivial as the scope of its ripple effect can be the whole project. In this work, we propose CoEdPilot, an LLM-driven solution to recommend code edits by discriminating the relevant edits, exploring their interactive natures, and estimating its ripple effect in the project. Specifically, CoEdPilot orchestrates multiple neural transformers to identify what and how to edit in the project regarding both edit location and edit content. When a user accomplishes an edit with an optional editing description, an Subsequent Edit Analysis first reports the most relevant files in the project with what types of edits (e.g., keep, insert, and replace) can happen for each line of their code. Next, an Edit-content Generator generates concrete edit options for the lines of code, regarding its relevant prior changes reported by an Edit-dependency Analyzer. Last, both the Subsequent Edit Analysis and the Edit-content Generator capture relevant prior edits as feedback to readjust their recommendations. We train our models by collecting over 180K commits from 471 open-source projects in 5 programming languages. Our extensive experiments show that (1) CoEdPilot can well predict the edits (i.e., predicting edit location with accuracy of 70.8%-85.3%, and the edit content with exact match rate of 41.8% and BLEU4 score of 60.7); (2) CoEdPilot can well boost existing edit generators such as GRACE and CCT5 on exact match rate by 8.57% points and BLEU4 score by 18.08. Last, our user study on 18 participants with 3 editing tasks (1) shows that CoEdPilot can be effective in assisting users to edit code in comparison with Copilot, and (2) sheds light on the future improvement of the tool design. The video demonstration of our tool is available at https://sites.google.com/view/coedpilot/home.

Publisher's Version Info
Automated Deep Learning Optimization via DSL-Based Source Code Transformation
Ruixin Wang, Minghai Lu, Cody Hao Yu, Yi-Hsiang Lai, and Tianyi Zhang
(Purdue University, USA; BosonAI, USA; Amazon Web Services, USA)
As deep learning models become increasingly bigger and more complex, it is critical to improve model training and inference efficiency. Though a variety of highly optimized libraries and packages (known as DL kernels) have been developed, it is tedious and time-consuming to figure out which kernel to use, where to use, and how to use them correctly. To address this challenge, we propose an Automated Deep learning OPTimization approach called Adopter. We design a Domain-Specific Language (DSL) to represent DL model architectures and leverage this DSL to specify model transformation rules required to integrate a DL kernel into a model. Given the source code of a DL model and the transformation rules for a set of kernels, Adopter first performs inter-procedural analysis to identify and express the model architecture in our DSL. Then, Adopter performs scope analysis and sub-sequence matching to identify locations in the model architecture where the transformation rules can be applied. Finally, Adopter proposes a synthesis-based code transformation method to apply the transformation rule. We curated a benchmark with 199 models from Hugging Face and a diverse set of DL kernels. We found that, compared to a state-of-the-art automated code transformation technique, Adopter helps improve the precision and recall by 3% and 56%, respectively. An in-depth analysis of 9 models revealed that on average, Adopter improved the training speed by 22.7% while decreasing the GPU memory usage by 10.5%.

Publisher's Version Published Artifact Artifacts Available
Evaluating the Effectiveness of Decompilers
Ying Cao, Runze Zhang, Ruigang Liang, and Kai Chen
(Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, China)
In software security tasks like malware analysis and vulnerability mining, reverse engineering is pivotal, with C decompilers playing a crucial role in understanding program semantics. However, reverse engineers still predominantly rely on assembly code rather than decompiled code when analyzing complex binaries. This practice underlines the limitations of current decompiled code, which hinders its effectiveness in reverse engineering. Identifying and analyzing the problems of existing decompilers and making targeted improvements can effectively enhance the efficiency of software analysis. In this study, we systematically evaluate current mainstream decompilers’ semantic consistency and readability. Semantic evaluation results show that the state-of-the-art decompiler Hex-Rays has about 55% accuracy at almost all optimization, which contradicts the common belief among many reverse engineers that decompilers are usually accurate. Readability evaluation indicates that despite years of efforts to improve the readability of the decompiled code, decompilers’ template-based approach still predominantly yields code akin to binary structures rather than human coding patterns. Additionally, our human study indicates that to enhance decompilers’ accuracy and readability, introducing human or compiler-aware strategies like a speculate-verify-correct approach to obtain recompilable decompiled code and iteratively refine it to more closely resemble the original binary, potentially offers a more effective optimization method than relying on static analysis and rule expansion.

Publisher's Version Published Artifact Artifacts Available
CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
Hao Wang, Zeyu Gao, Chao Zhang, Zihan Sha, Mingyang Sun, Yuchen Zhou, Wenyu Zhu, Wenju Sun, Han Qiu, and Xi Xiao
(Tsinghua University, China; Information Engineering University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code representation learning has shown significant performance in binary analysis tasks. But existing solutions often have poor transferability, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the tasks. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and get better transferability. At the core, our approach boosts superior transfer learning capabilities by effectively aligning binary code with their semantics explanations (in natural language), resulting a model able to generate better embeddings for binary code. To enable this alignment training, we then propose an efficient dataset engine that could automatically generate a large and diverse dataset comprising of binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. The evaluations of CLAP across various downstream tasks in binary analysis all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability.

Publisher's Version
FunRedisp: Reordering Function Dispatch in Smart Contract to Reduce Invocation Gas Fees
Yunqi Liu and Wei Song
(Nanjing University of Science and Technology, China)
Smart contracts mostly written in Solidity are Turing-complete programs executed on the blockchain platforms such as Ethereum. To prevent resource abuse, a gas fee is required when users deploy or invoke smart contracts. Although saving gas consumption has received much attention, no work investigates the effect of function dispatch on the invocation gas consumption. In this paper, after demystifying how the function dispatch affects the invocation gas consumption, we present FunRedisp, a bytecode refactoring method and an open-source tool, to reduce the overall invocation gas consumption of smart contracts. At the source code level, FunRedisp initially identifies hot functions in a smart contract that have a big chance to be invoked, and then move them to the front of the function dispatch at the bytecode level. We implement FunRedisp and evaluate it on 50 real-world smart contracts randomly selected from Ethereum. The experimental results demonstrate that FunRedisp can save approximately 125.17 units of gas per transaction with the compilation overhead increased by only 0.37 seconds.

Publisher's Version Published Artifact Artifacts Available

Papers Round 2

FDI: Attack Neural Code Generation Systems through User Feedback Channel
Zhensu Sun, Xiaoning Du, Xiapu Luo, Fu Song, David Lo, and Li Li
(Singapore Management University, Singapore; Hong Kong Polytechnic University, Hong Kong; Monash University, Australia; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; Beihang University, China)
Neural code generation systems have recently attracted increasing attention to improve developer productivity and speed up software development. Typically, these systems maintain a pre-trained neural model and make it available to general users as a service (e.g., through remote APIs) and incorporate a feedback mechanism to extensively collect and utilize the users' reaction to the generated code, i.e., user feedback. However, the security implications of such feedback have not yet been explored. With a systematic study of current feedback mechanisms, we find that feedback makes these systems vulnerable to feedback data injection (FDI) attacks. We discuss the methodology of FDI attacks and present a pre-attack profiling strategy to infer the attack constraints of a targeted system in the black-box setting. We demonstrate two proof-of-concept examples utilizing the FDI attack surface to implement prompt injection attacks and backdoor attacks on practical neural code generation systems. The attacker may stealthily manipulate a neural code generation system to generate code with vulnerabilities, attack payload, and malicious and spam messages. Our findings reveal the security implications of feedback mechanisms in neural code generation systems, paving the way for increasing their security.

Publisher's Version
Scalable, Sound, and Accurate Jump Table Analysis
Huan Nguyen, Soumyakant Priyadarshan, and R. Sekar
(Stony Brook University, USA)
Jump tables are a common source of indirect jumps in binary code. Resolving these indirect jumps is critical for constructing a complete control-flow graph, which is an essential first step for most applications involving binaries, including binary hardening and instrumentation, binary analysis and fuzzing for vulnerability discovery, malware analysis and reverse engineering. Existing techniques for jump table analysis generally prioritize performance over soundness. While lack of soundness may be acceptable for applications such as decompilation, it can cause unpredictable runtime failures in binary instrumentation applications. We therefore present SJA, a new jump table analysis technique in this paper that is sound and scalable. Our analysis uses a novel abstract domain to systematically track the "structure" of computed code pointers without relying on syntactic pattern-matching that is common in previous works. In addition, we present a bounds analysis that efficiently and losslessly reasons about equality and inequality relations that arise in the context of jump tables. As a result, our system reduces miss rate by 35× over the next best technique. When evaluated on error rate based on F1-score, our technique outperforms the best previous techniques by 3×.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Uncovering and Mitigating the Impact of Code Obfuscation on Dataset Annotation with Antivirus Engines
Gao Cuiying, Yueming Wu, Heng Li, Wei Yuan, Haoyu Jiang, Qidan He, and Yang Liu
(Huazhong University of Science and Technology, China; JD.com, China; Nanyang Technological University, Singapore)
With the widespread application of machine learning-based Android malware detection methods, building a high-quality dataset has become increasingly important. Existing large-scale datasets are mostly annotated with VirusTotal by aggregating the decisions of antivirus engines, and most of them indiscriminately accept the decisions of all engines. In reality, however, these engines have different capabilities in detecting malware, especially those that have been obfuscated. Previous research has revealed that code obfuscation degrades the detection performance of these engines to varying degrees. This makes us believe that using all engines indiscriminately is unreasonable for dataset annotation. Therefore, in this paper, we first conduct a data-driven evaluation to confirm the negative effects of code obfuscation on engine-based dataset annotation. To gain a deeper understanding of the reasons behind this phenomenon, we evaluate the availability, effectiveness and robustness of every engine under various code obfuscation techniques. Then we categorize the engines and select a set of obfuscation-robust engines. Finally, we conduct comprehensive experiments to verify the effectiveness of the selected engines for dataset annotation. Our experiments show that when 50% obfuscated samples are mixed into the training set, on the classic malware detectors Drebin and Malscan, using our selected engines can effectively improve detection performance by 15.21% and 19.23%, respectively, compared to using all the engines.

Publisher's Version
LENT-SSE: Leveraging Executed and Near Transactions for Speculative Symbolic Execution of Smart Contracts
Peilin Zheng, Bowei Su, Xiapu Luo, Ting Chen, Neng Zhang, and Zibin Zheng
(Sun Yat-sen University, China; Hong Kong Polytechnic University, China; University of Electronic Science and Technology of China, China)
Symbolic execution has proven effective for code analytics in smart contracts. However, for smart contracts, existing symbolic tools use multiple-transaction symbolic execution, which differs from traditional symbolic tools and also exacerbates the path explosion problem. In this paper, we first quantitatively analyze the bottleneck of symbolic execution in multiple transactions (TXs), finding the redundancy of the paths of TXs. Based on this finding, we propose LENT-SSE as a new speculation heuristic for Speculative Symbolic Execution of smart contracts, which leverages the executed and near TXs for skipping and recalling the SMT solving of paths. LENT-SSE uses an executed-transaction-based skipping algorithm to reduce the time required for SMT solving by leveraging the redundancy between executed and executing paths. Moreover, LENT-SSE uses a near-transaction-based recalling algorithm to reduce false skipping of the solving paths. Experimental results on the SmartBugs dataset show that LENT-SSE can reduce the total time by 37.4% and the solving time of paths by 65.2% on average without reducing the reported bugs. On the other dataset of 1000 realistic contracts, the total time and solving time are reduced by 38.1% and 54.7%.

Publisher's Version
DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
Mingke Yang, Yuqi Chen, Yi Liu, and Ling Shi
(ShanghaiTech University, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) have showcased their remarkable capabilities in diverse domains, encompassing natural language understanding, translation, and even code generation. The potential for LLMs to generate harmful content is a significant concern. This risk necessitates rigorous testing and comprehensive evaluation of LLMs to ensure safe and responsible use. However, extensive testing of LLMs requires substantial computational resources, making it an expensive endeavor. Therefore, exploring cost-saving strategies during the testing phase is crucial to balance the need for thorough evaluation with the constraints of resource availability. To address this, our approach begins by transferring the moderation knowledge from an LLM to a small model. Subsequently, we deploy two distinct strategies for generating malicious queries: one based on a syntax tree approach, and the other leveraging an LLM-based method. Finally, our approach incorporates a sequential filter-test process designed to identify test cases that are prone to eliciting toxic responses. By doing so, we significantly curtail unnecessary or unproductive interactions with LLMs, thereby streamlining the testing process. Our research evaluated the efficacy of DistillSeq across four LLMs: GPT-3.5, GPT-4.0, Vicuna-13B, and Llama-13B. In the absence of DistillSeq, the observed attack success rates on these LLMs stood at 31.5% for GPT-3.5, 21.4% for GPT-4.0, 28.3% for Vicuna-13B, and 30.9% for Llama-13B. However, upon the application of DistillSeq, these success rates notably increased to 58.5%, 50.7%, 52.5%, and 54.4%, respectively. This translated to an average escalation in attack success rate by a factor of 93.0% when compared to scenarios without the use of DistillSeq. Such findings highlight the significant enhancement DistillSeq offers in terms of reducing the time and resource investment required for effectively testing LLMs.

Publisher's Version Info
PatchFinder: A Two-Phase Approach to Security Patch Tracing for Disclosed Vulnerabilities in Open-Source Software
Kaixuan Li, Jian Zhang, Sen Chen, Han Liu, Yang Liu, and Yixiang Chen
(East China Normal University, China; Nanyang Technological University, Singapore; Tianjin University, China)
Open-source software (OSS) vulnerabilities are increasingly prevalent, emphasizing the importance of security patches. However, in widely used security platforms like NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they heavily rely on handcrafted features in a single-step framework, which limits their effectiveness. In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better-tracing security patches. In the initial retrieval phase, we employ a hybrid patch retriever to account for both lexical and semantic matching based on the code changes and the description of a CVE, to narrow down the search space by extracting those commits as candidates that are similar to the CVE descriptions. Afterwards, in the re-ranking phase, we design an end-to-end architecture under the supervised fine-tuning paradigm for learning the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates based on their correlation scores while maintaining low computation overhead. We evaluated our system against 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is curtailed to 2.77, marking a 1.94 times improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them to the official, 482 of which have been confirmed by CVE Numbering Authorities.

Publisher's Version
Finding Cuts in Static Analysis Graphs to Debloat Software
Christoph Blumschein, Fabio Niephaus, Codruţ Stancu, Christian Wimmer, Jens Lincke, and Robert Hirschfeld
(Hasso Plattner Institute, Germany; University of Potsdam, Germany; Oracle Labs, Germany; Oracle Labs, Switzerland; Oracle Labs, USA)
As software projects grow increasingly more complex, debloating gains traction. While static analyses yield a coarse over-approximation of reachable code, approaches based on dynamic execution traces risk program correctness. By allowing the developer to reconsider only a few methods and still achieve a significant reduction in code size, cut-based debloating can minimize the risk. In this paper, we propose the idea of finding small cuts in the rule graphs produced by static analysis. After introducing an analysis with suitable semantics, we discuss how to encode its rules into a directed hypergraph. We then present an algorithm for efficiently finding the most effective single cut in the graph. The execution time of the proposed operations allows for the deployment in interactive tools. Finally, we show that our graph model is able to expose methods worthwhile to reconsider.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Revisiting Test-Case Prioritization on Long-Running Test Suites
Runxiang Cheng, Shuai Wang, Reyhaneh Jabbarvand, and Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
The prolonged continuous integration (CI) runs are affecting timely feedback to software developers. Test-case prioritization (TCP) aims to expose faults sooner by reordering tests such that the ones more likely to fail are run earlier. TCP is thus especially important for long-running test suites. While many studies have explored TCP, they are based on outdated CI builds from over 10 years ago with test suites that last several minutes, or builds from inaccessible, proprietary projects. In this paper, we present LRTS, the first dataset of long-running test suites, with 21,255 CI builds and 57,437 test-suite runs from 10 large-scale, open-source projects that use Jenkins CI. LRTS spans from 2020 to 2023, with an average test-suite run duration of 6.5 hours. On LRTS, we study the effectiveness of 59 leading TCP techniques, the impact of confounding test failures on TCP, and TCP for failing tests with no prior failures. We revisit prior key findings (9 confirmed, 2 refuted) and establish 3 new findings. Our results show that prioritizing faster tests that recently failed performs the best, outperforming the sophisticated techniques.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Oracle-Guided Program Selection from Large Language Models
Zhiyu Fan, Haifeng Ruan, Sergey Mechtaev, and Abhik Roychoudhury
(National University of Singapore, Singapore; Peking University, China)
While large language models (LLMs) have shown significant advancements in code generation, their susceptibility to producing incorrect code poses a significant challenge to the adoption of LLM-generated programs. This issue largely stems from the reliance on natural language descriptions as informal oracles in code generation. Current strategies to mitigate this involve selecting the best program from multiple LLM-generated alternatives, judged by criteria like the consistency of their execution results on an LLM-generated test suite. However, this approach has crucial limitations: (1) LLMs often generate redundant tests or tests that cannot distinguish between correct and incorrect solutions, (2) the used consistency criteria, such as the majority vote, fail to foster developer trust due to the absence of transparent rationale behind the made choices. In this work, we propose a new perspective on increasing the quality of LLM-generated code via program selection using the LLM as a test oracle. Our method is based on our experimentally confirmed observation that LLMs serve more effectively as oracles when tasked with selecting the correct output from multiple choices. Leveraging this insight, we first generate distinguishing inputs that capture semantic discrepancies of programs sampled from an LLM, and record outputs produced by the programs on these inputs. An LLM then selects the most likely to be correct output from these, guided by the natural language problem description. We implemented this idea in a tool LLMCodeChoice and evaluated its accuracy in generating and selecting standalone programs. Our experiments demonstrated its effectiveness in improving pass@1 by 3.6-7% on HumanEval and MBPP benchmarks compared to the state-of-art CodeT. Most interestingly, the selected input-output specifications helped us to uncover incompleteness and ambiguities in task descriptions and also identify incorrect ground-truth implementations in the benchmarks.

Publisher's Version
Beyond Pairwise Testing: Advancing 3-wise Combinatorial Interaction Testing for Highly Configurable Systems
Chuan Luo, Shuangyu Lyu, Qiyuan Zhao, Wei Wu, Hongyu Zhang, and Chunming Hu
(Beihang University, China; National University of Singapore, Singapore; Central South University, China; Xiangjiang Laboratory, China; Chongqing University, China)
To meet the rising demand for software customization, highly configurable software systems play key roles in practice. Combinatorial interaction testing (CIT) is recognized as an effective approach for testing such systems. For CIT, the most important problem is constrained covering array generation (CCAG), which aims to construct a minimum-sized t-wise covering array (CA), where t denotes testing strength. Compared to pairwise testing (i.e., 2-wise CIT) that is a widely-used CIT technique, 3-wise CIT can discover more faults and bring more benefit in real-world applications. However, current state-of-the-art CCAG algorithms suffer from the severe scalability challenge for 3-wise CIT, which renders them ineffective in building 3-wise CAs for highly configurable systems. In this work, we perform an empirical study on various practical, highly configurable systems to present that it is promising to build 3-wise CA through extending 2-wise CA. Inspired by this, we propose ScalableCA, a novel and scalable algorithm that can effectively alleviate the scalability challenge for 3-wise CIT. Further, ScalableCA introduces three new and effective techniques, including fast invalidity detection, uncovering-guided sampling, and remainder-aware local search, to enhance its performance. Our experiments on extensive real-world, highly configurable systems show that, compared to current state-of-the-art algorithms, ScalableCA requires one to two orders of magnitude less running time to build 3-wise CA of 38.9% smaller size in average for large-scale instances. Our results indicate that ScalableCA greatly advances the state of the art in 3-wise CIT.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Equivalent Mutants in the Wild: Identifying and Efficiently Suppressing Equivalent Mutants for Java Programs
Benjamin Kushigian, Samuel J. Kaufman, Ryan Featherman, Hannah Potter, Ardi Madadi, and René Just
(University of Washington, USA)
The presence of equivalent mutants has long been considered a major obstacle to the widespread adoption of mutation analysis and mutation testing. This paper presents a study on the types and prevalence of equivalent mutants in real-world Java programs. We conducted a ground-truth analysis of 1,992 mutants, sampled from 7 open source Java projects. Our analysis identified 215 equivalent mutants, which we grouped based on two criteria that describe why the mutants are equivalent and how challenging their detection is. From this analysis, we observed that (1) the median equivalent mutant rate across the 7 projects is 2.97%; (2) many equivalent mutants are caused by common programming patterns and their detection is not much more complex than structural pattern matching over an abstract syntax tree. Based on the findings of our ground-truth analysis, we developed Equivalent Mutant Suppression (EMS), a technique that comprises 10 efficient and targeted analyses. We evaluated EMS on 19 open- source Java projects, comparing the effectiveness and efficiency of EMS to two variants of Trivial Compiler Equivalence (TCE), the current state of the art in equivalent mutant detection. Additionally, we analyzed all 9,047 equivalent mutants reported by any tool to better understand the types and frequencies of equivalent mutants found. Overall, EMS detects 8,776 equivalent mutants within 325 seconds; TCE detects 2,124 equivalent mutants in 2,938 hours.

Publisher's Version Published Artifact Artifacts Available
Testing Graph Database Systems with Graph-State Persistence Oracle
Shuang Liu, Junhao Lan, Xiaoning Du, Jiyuan Li, Wei Lu, Jiajun Jiang, and Xiaoyong Du
(Renmin University of China, China; Tianjin University, China; Monash University, Australia)
Graph Database Management Systems (GDBMSs) store data in a graph format, facilitating rapid querying of nodes and relationships. This structure is particularly advantageous for applications like social networks and recommendation systems, which often involve frequent writing operations—such as adding new nodes, creating relationships, or modifying existing data—that potentially introduce bugs. However, existing GDBMS testing approaches tend to overlook these writing functionalities, failing to detect bugs arising from such operations. In this paper we present GraspDB, the first metamorphic testing approach specifically designed to identify bugs related to writing operations in graph database systems. GraspDB employs the Graph-State Persistence oracle, which is based on the Labeled Property Graph Isomorphism (LPG-Isomorphism) and Labeled Property Subgraph Isomorphism (LPSG-Isomorphism) relations. We also develop three classes of mutation rules aimed at engaging more diverse writing-related code logic. GraspDB has successfully detected 77 unique, previously unknown bugs across four popular open source graph database engines, among which 58 bugs are confirmed by developers, 43 bugs have been fixed and 31 are related to writing operations.

Publisher's Version Published Artifact Artifacts Available
Test Selection for Deep Neural Networks using Meta-Models with Uncertainty Metrics
Demet Demir, Aysu Betin Can, and Elif Surer
(Middle East Technical University, Ankara, Türkiye)
With the use of Deep Learning (DL) in safety-critical domains, the systematic testing of these systems has become a critical issue for human life. Due to the data-driven nature of Deep Neural Networks (DNNs), the effectiveness of tests is closely related to the adequacy of test datasets. Test data need to be labeled, which requires manual human effort and sometimes expert knowledge. DL system testers aim to select the test data that will be most helpful in identifying the weaknesses of the DNN model by using resources efficiently. To help achieve this goal, we propose a test data prioritization approach based on using a meta-model that gets uncertainty metrics as input, which are derived from outputs of other base models. Integrating different uncertainty metrics helps overcome individual limitations of these metrics and be effective in a wider range of scenarios. We train the meta-models with the objective of predicting whether a test input will lead the tested model to make an incorrect prediction or not. We conducted an experimental evaluation with popular image classification datasets and DNN models to evaluate the proposed approach. The results of the experiments demonstrate that our approach effectively prioritizes the test datasets and outperforms existing state-of-the-art test prioritization methods used in comparison. In the experiments, we evaluated the test prioritization approach from a distribution-aware perspective by generating test datasets with and without out-of-distribution data.

Publisher's Version
An Empirical Study of Static Analysis Tools for Secure Code Review
Wachiraphan Charoenwet, Patanamon Thongtanunam, Van-Thuan Pham, and Christoph Treude
(University of Melbourne, Australia; Singapore Management University, Singapore)
Early identification of security issues in software development is vital to minimize their unanticipated impacts. Code review is a widely used manual analysis method that aims to uncover security issues along with other coding issues in software projects. While some studies suggest that automated static application security testing tools (SASTs) could enhance security issue identification, there is limited understanding of SAST’s practical effectiveness in supporting secure code review. Moreover, most SAST studies rely on synthetic or fully vulnerable versions of the subject program, which may not accurately represent real-world code changes in the code review process. To address this gap, we study C/C++ SASTs using a dataset of actual code changes that contributed to exploitable vulnerabilities. Beyond SAST’s effectiveness, we quantify potential benefits when changed functions are prioritized by SAST warnings. Our dataset comprises 319 real-world vulnerabilities from 815 vulnerability-contributing commits (VCCs) in 92 C and C++ projects. The result reveals that a single SAST can produce warnings in vulnerable functions of 52% of VCCs. Prioritizing changed functions with SAST warnings can improve accuracy (i.e., 12% of precision and 5.6% of recall) and reduce Initial False Alarm (lines of code in non-vulnerable functions inspected until the first vulnerable function) by 13%. Nevertheless, at least 76% of the warnings in vulnerable functions are irrelevant to the VCCs, and 22% of VCCs remain undetected due to limitations of SAST rules. Our findings highlight the benefits and the remaining gaps of SAST-supported secure code reviews and challenges that should be addressed in future work.

Publisher's Version Published Artifact Artifacts Available
VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information Extraction
Thanh-Dat Nguyen, Tung Do-Viet, Hung Nguyen-Duy, Tuan-Hai Luu, Hung Le, Bach Le, and Patanamon Thongtanunam
(University of Melbourne, Australia; Cinnamon AI, Vietnam; Independent Researcher, Vietnam; Deakin University, Australia)
Businesses often need to query visually rich documents (VRDs), e.g., purchase receipts, medical records, and insurance forms, among many other forms from multiple vendors, to make informed decisions. As such, several techniques have been proposed to automatically extract independent entities of interest from VRDs such as extracting price tags from purchase receipts, etc. However, for extracting semantically linked entities, such as finding corresponding price tags for each item, these techniques either have limited capability in handling new layouts, e.g., template-based approaches, or require extensive amounts of pre-training data and do not perform well, e.g., deep-learning approaches. In this work, we introduce a program synthesis method, namely VRDSynth, to automatically generate programs to extract entity relations from multilingual VRDs. Two key novelties, which empower VRDSynth to tackle flexible layouts while requiring no pre-training data for extracting entity relations, include: (1) a new domain-specific language (DSL) to effectively capture the spatial and textual relations between document entities, and (2) a novel synthesis algorithm that makes use of frequent spatial relations between entities to construct initial programs, equivalent reduction to prune the search space, and a combination of positive, negative, and mutually exclusive programs to improve the coverage of programs. We evaluate our method on two popular VRD understanding benchmarks, namely FUNSD and XFUND, on the semantic entity linking task, consisting of 1,600 forms in 8 different languages. Experiments show that VRDSynth, despite having no prior pre-training data, outperforms the state-of-the-art pre-trained deep-learning approach, namely LayoutXLM, in 5 out of 8 languages. Noticeably, VRDSynth achieved an improvement of 42% over LayoutXLM in terms of F1 score on FUNSD while being complementary to LayoutXLM in 7/8 languages. Regarding efficiency, VRDSynth significantly improves the memory footprint required for storage and inference over LayoutXLM (1M and 380MB versus that of 1.48GB and 3GB required by LayoutXLM), while maintaining similar time efficiency despite the speed differences between the languages used for implementation (Python vs C++).

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Towards Automatic Oracle Prediction for AR Testing: Assessing Virtual Object Placement Quality under Real-World Scenes
Xiaoyi Yang, Yuxing Wang, Tahmid Rafi, Dongfang Liu, Xiaoyin Wang, and Xueling Zhang
(Rochester Institute of Technology, USA; University of Texas at San Antonio, USA)
Augmented Reality (AR) technology opens up exciting possibilities in various fields, such as education, work guidance, shopping, communication, and gaming. However, users often encounter usability and user experience issues in current AR apps, often due to the imprecise placement of virtual objects. Detecting these inaccuracies is crucial for AR app testing, but automating the process is challenging due to its reliance on human perception and validation. This paper introduces VOPA (Virtual Object Placement Assessment), a novel approach that automatically identifies imprecise virtual object placements in real-world AR apps. VOPA involves instrumenting real-world AR apps to collect screenshots representing various object placement scenarios and their corresponding metadata under real-world scenes. The collected data are then labeled through crowdsourcing and used to train a hybrid neural network that identifies object placement errors. VOPA aims to enhance AR app testing by automating the assessment of virtual object placement quality and detecting imprecise instances. In our evaluation of a test set of 304 screenshots, VOPA achieved an accuracy of 99.34%, precision of 96.92% and recall of 100%. Furthermore, VOPA successfully identified 38 real-world object placement errors, including instances where objects were hovering between two surfaces or appearing embedded in the wall.

Publisher's Version
Sleuth: A Switchable Dual-Mode Fuzzer to Investigate Bug Impacts Following a Single PoC
Haolai Wei, Liwei Chen, Zhijie Zhang, Gang Shi, and Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
A proof of concept (PoC) is essential for pinpointing a bug within software. However, relying on it alone for the timely and complete repair of bugs is insufficient due to underestimating the bug impacts. The bug impact reflects that a bug may be triggered at multiple positions following from the root cause, resulting in different bug types (e.g., use-after-free, heap-buffer-overflow). Current techniques discover bug impacts using fuzzing with a specific coverage-guided strategy: assigning more energy to seeds that cover the buggy code regions. This method can utilize a single PoC to generate multiple PoCs that contain different bug impacts in a short time. Unfortunately, we observe existing techniques are still unreliable, primarily due to their failure in balancing the time between in-depth and breadth exploration: (i) in-depth exploration for bug impacts behind crash regions and (ii) breadth exploration for bug impacts alongside unreached regions. Current techniques only focus on one exploration or conduct two explorations in separate stages leading to low accuracy and efficiency. Considering the aforementioned problem, we propose Sleuth, an approach for automatically investigating bug impacts following a known single PoC to enhance bug fixing. We design Sleuth on two novel concepts: (i) a dual-mode exploration mechanism built on a fuzzer designed for efficient in-depth and breadth exploration. (ii) a dynamic switchable strategy connecting with the dual-mode exploration that facilitates the reliability of bug impact investigation. We evaluate Sleuth using 50 known CVEs, and the result of experiment shows that Sleuth can efficiently discover new bug impacts in 86% CVEs and find 1.5x more bug impacts than state-of-art tools. Furthermore, Sleuth successfully identifies 13 incomplete fixes using the generated new PoCs.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
SQLess: Dialect-Agnostic SQL Query Simplification
Li Lin, Zongyin Hao, Chengpeng Wang, Zhuangda Wang, Rongxin Wu, and Gang Fan
(Xiamen University, China; Hong Kong University of Science and Technology, China; Ant Group, China)
Database Management Systems (DBMSs) are fundamental to numerous enterprise applications. Due to the significance of DBMSs, various testing techniques have been proposed to detect DBMS bugs. However, to trigger deep bugs, most of the existing techniques focus on generating lengthy and complex queries which burdens developers with the difficult of debugging. Therefore, SQL query simplification, which aims to reduce lengthy SQL queries without compromising their ability to detect bugs, is highly demanded. To bridge this gap, we introduce SQLess, an innovative approach that employs a dialect-agnostic method for efficient and semantically correct SQL query simplification tailored for various DBMSs. Unlike previous works that have to depend on DBMS-specific grammar, SQLess utilizes an adaptive parser, which leverages error recovery and grammar expansion to support DBMS dialects. Moreover, SQLess performs a semantics-sensitive SQL query trimming, which leverages alias and dependency analysis to simplify SQL queries with preserving bug-triggering capability. We evaluate SQLess using two datasets from the state-of-theart database bug detection studies, encompassing six widely-used DBMSs and over 32,000 complex SQL queries. The results demonstrate SQLess’s superior performance: it achieves an average simplification rate of 72.45%, which significantly outperforms the stateof-the-art approaches by 84.91%.

Publisher's Version
DBStorm: Generating Various Effective Workloads for Testing Isolation Levels
Keqiang Li, Siyang Weng, Lyu Ni, Chengcheng Yang, Rong Zhang, Xuan Zhou, and Aoying Zhou
(East China Normal University, China)
Isolation level (IL) acts as a correctness contract between applications and DBMSs. Problematic IL implementations would cause incorrect transaction execution results and erroneous database states. However, existing studies could not efficiently generate various effective workloads for IL test. The core challenges come from the requirements of (a) black-box testing (trigger the IL code of a closed source DBMS), (b) effective testing (evade redundant and ineffective testing), and (c) anomaly-sensitive testing (test various ILs in a distinguishable way). For black-box testing, we investigate the IL implementations of 15 popular DBMSs and discover that they follow a generic framework that utilizes conflict graphs to manage all conflicts of a workload, and performs a verification policy to prevent non-serializable anomalies. For effective testing, we propose a lightweight data state mirroring method, which helps generate SQL operations that precisely access its expected records and participate the formation of specific conflict graphs. We also propose an efficient history-independent approach to generate dissimilar conflict graphs. It guarantees the graph generation overhead is irrelevant to the scale of historical graphs. For anomaly-sensitive testing, we propose an implantation-based approach to orchestrate conflict record accesses and inject them into different transactions according to the anomaly definition. Our approach outperforms existing approaches in testing effectiveness, efficiency, and coverage. Practically, we have successfully found 33 bugs in popular DBMSs.

Publisher's Version
Preserving Reactiveness: Understanding and Improving the Debugging Practice of Blocking-Call Bugs
Arooba Shahoor, Jooyong Yi, and Dongsun Kim
(Kyungpook National University, South Korea; Ulsan National Institute of Science and Technology, South Korea; Korea University, South Korea)
Reactive programming reacts to data items as they occur, rather than waiting for them to complete. This programming paradigm is widely used in asynchronous and event-driven scenarios, such as web applications, microservices, real-time data processing, IoT, interactive UIs, and big data. When done right, it can offer greater responsiveness without extra resource usage. However, this also requires a thorough understanding of asynchronous and non-blocking coding, posing a learning curve for developers new to this style of programming. In this work, we analyze issues reported in reactive applications and explore their corresponding fixes. Our investigation results reveal that (1) developers often do not fix or ignore reactiveness bugs as compared to other bug types, and (2) this tendency is most pronounced for blocking-call bugs -- bugs that block the execution of the program to wait for the operations (typically I/O operations) to finish, wasting CPU and memory resources. To improve the debugging practice of such blocking bugs, we develop a pattern-based proactive program repair technique and obtain 30 patches, which we submit to the developers. In addition, we hypothesize that the low patch acceptance rate for reactiveness bugs is due to the difficulty of assessing the patches. This is in contrast to functionality bugs, where the correctness of the patches can be assessed by running test cases. To assess our hypothesis, we split our patches into two groups: one with performance improvement evidence and the other without. It turns out that the patches are more likely to be accepted when submitted with performance improvement evidence.

Publisher's Version Info
Feedback-Directed Partial Execution
Ishrak Hayet, Adam Scott, and Marcelo d'Amorim
(North Carolina State University, USA)
Partial code execution is the problem of executing code with missing definitions. The problem has gained recent traction as solutions to the problem could enable various downstream analyses. We propose feedback-directed partial execution, a technique supported by a tool, named Incompleter, that uses the error feedback from executions to enable partial code execution. Incompleter builds on the observation that errors observed during the execution of incomplete snippets often follow similar error patterns. Incompleter takes an incomplete snippet as input and applies rules (e.g., add class, add field, add file, etc.) to resolve the successive dynamic errors it encounters during execution of the snippet. Incompleter stops when the snippet successfully executes or when it reaches certain bounds. Our results indicate that Incompleter outperforms LExecutor, the state-of-the-art in partial execution. For example, considering a dataset of 4.7K incomplete StackOverflow snippets, Incompleter enables the execution of 10% more code snippets compared to LExecutor and covers 23% more statements. We also show that Incompleter’s type inference significantly improves over LExecutor’s type inference, with a 37% higher F1 score.

Publisher's Version Published Artifact Artifacts Available
Midas: Mining Profitable Exploits in On-Chain Smart Contracts via Feedback-Driven Fuzzing and Differential Analysis
Mingxi Ye, Xingwei Lin, Yuhong Nan, Jiajing Wu, and Zibin Zheng
(Sun Yat-sen University, China; Zhejiang University, China)
In the context of boosting smart contract applications, prioritizing their security becomes paramount. Smart contract exploits often result in notable financial losses. Ensuring their security is by no means trivial. Rather than resulting in program crashes, most attacks in on-chain smart contracts aim to induce financial loss, referred to as profitable exploits. By constructing seemingly innocuous inputs, profitable exploits try to extract extra profit or compromise the interests of others. However, due to the complexity of call chains in on-chain smart contracts and the need for effective oracles for profitable exploits, smart contract fuzzing suffers from low efficiency and low effectiveness in finding profitable exploits. In this paper, we present Midas, a novel feedback-driven fuzzing framework to mine profitable exploits in on-chain smart contracts effectively. Midas consists of two modules: diverse validity fuzzing and profitable transaction identification. The diverse validity fuzzing module applies two waypoints to efficiently generate valid transactions, addressing the complexity of on-chain smart contract call chains. The profitable transaction identification module applies differential analysis to effectively identify profitable exploits, addressing the limitation of ad-hoc oracles. Evaluation of Midas over on-chain smart contracts showed it effectively identified 40 real-world exploits with a precision of 80%, outperforming state-of-the-art tools (i.e., ItyFuzz and Slither) in both efficiency and effectiveness. Particularly, Midas effectively mines five unknown exploits in valuable smart contracts, and two of them have already been confirmed by their DApp developers.

Publisher's Version
Certified Continual Learning for Neural Network Regression
Long H. Pham and Jun Sun
(Singapore Management University, Singapore)
On the one hand, there has been considerable progress on neural network verification in recent years, which makes certifying neural networks a possibility. On the other hand, neural networks in practice are often re-trained over time to cope with new data distribution or for solving different tasks (a.k.a. continual learning). Once re-trained, the verified correctness of the neural network is likely broken, particularly in the presence of the phenomenon known as catastrophic forgetting. In this work, we propose an approach called certified continual learning which improves existing continual learning methods by preserving, as long as possible, the established correctness properties of a verified network. Our approach is evaluated with multiple neural networks and on two different continual learning methods. The results show that our approach is efficient and the trained models preserve their certified correctness and often maintain high utility.

Publisher's Version Published Artifact Artifacts Available
Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each using ChatGPT
Chunqiu Steven Xia and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
Automated Program Repair (APR) aims to automatically generate patches for buggy programs. Traditional APR techniques suffer from a lack of patch variety as they rely heavily on handcrafted or mined bug fixing patterns and cannot easily generalize to other bug/fix types. To address this limitation, recent APR work has been focused on leveraging modern Large Language Models (LLMs) to directly generate patches for APR. Such LLM-based APR tools work by first constructing an input prompt built using the original buggy code and then querying the LLM to either fill-in (cloze-style APR) the correct code at the bug location or to produce a completely new code snippet as the patch. While the LLM-based APR tools are able to achieve state-of-the-art results, they still follow the classic Generate and Validate (GV) repair paradigm of first generating lots of patches by sampling from the same initial prompt and then validating each one afterwards. This not only leads to many repeated patches that are incorrect, but also misses the crucial and yet previously ignored information in test failures as well as in plausible patches. To address these aforementioned limitations, we propose ChatRepair, the first fully automated conversation-driven APR approach that interleaves patch generation with instant feedback to perform APR in a conversational style. ChatRepair first feeds the LLM with relevant test failure information to start with, and then learns from both failures and successes of earlier patching attempts of the same bug for more powerful APR. For earlier patches that failed to pass all tests, we combine the incorrect patches with their corresponding relevant test failure information to construct a new prompt for the LLM to generate the next patch. In this way, we can avoid making the same mistakes. For earlier patches that passed all the tests (i.e., plausible patches), we further ask the LLM to generate alternative variations of the original plausible patches. In this way, we can further build on and learn from earlier successes to generate more plausible patches to increase the chance of having correct patches. While our approach is general, we implement ChatRepair using state-of-the-art dialogue-based LLM – ChatGPT. Our evaluation on the widely studied Defects4j dataset shows that ChatRepair is able to achieve the new state-of-the-art in repair performance, achieving 114 and 48 correct fixes on Defects4j 1.2 and 2.0 respectively. By calculating the cost of accessing ChatGPT, we can fix 162 out of 337 bugs for $0.42 each!

Publisher's Version
DDGF: Dynamic Directed Greybox Fuzzing with Path Profiling
Haoran Fang, Kaikai Zhang, Donghui Yu, and Yuanyuan Zhang
(Shanghai Jiao Tong University, China)
Coverage-Guided Fuzzing (CGF) has become the most popular and effective method for vulnerability detection. It is usually designed as an automated “black-box” tool. Security auditors start it and then just wait for the results. However, after a period of testing, CGF struggles to find new coverage gradually, thus making it inefficient. It is difficult for users to explain reasons that prevent fuzzing from making further progress and to determine whether the existing coverage is sufficient. In addition, there is no way to interact and direct the fuzzing process. In this paper, we design the dynamic directed greybox fuzzing (DDGF) to facilitate collaboration between the user and fuzzer. By leveraging Ball-Larus path profiling algorithm, we propose two new techniques: dynamic introspection and dynamic direction. Dynamic introspection reveals the significant imbalance in the distribution of path frequency through encoding and decoding. Based on the insight from introspection, users can dynamically direct the fuzzer to focus testing on the selected paths in real time. We implement DDGF based on AFL++. Experiments on Magma show that DDGF is effective in helping the fuzzer to reproduce vulnerabilities faster, with up to 100x speedup and only 13% performance overhead. DDGF shows the great potential of human-in-the-loop for fuzzing.

Publisher's Version
VioHawk: Detecting Traffic Violations of Autonomous Driving Systems through Criticality-Guided Simulation Testing
Zhongrui Li, Jiarun Dai, Zongan Huang, Nianhao You, Yuan Zhang, and Min Yang
(Fudan University, China)
As highlighted in authoritative standards (e.g., ISO21448), traffic law compliance is a fundamental prerequisite for the commercialization of autonomous driving systems (ADS). Hence, manufacturers are in severe need of techniques to detect harsh driving situations in which the target ADS would violate traffic laws. To achieve this goal, existing works commonly resort to searching-based simulation testing, which continuously adjusts the scenario configurations (e.g., add new vehicles) of initial simulation scenarios and hunts for critical scenarios. Specifically, they apply pre-defined heuristics on each mutated scenario to approximate the likelihood of triggering ADS traffic violations, and accordingly perform searching scheduling. However, with those comparably more critical scenarios in hand, they fail to offer deterministic guidance on which and how scenario configurations should be further mutated to reliably trigger the target ADS misbehaviors. Hence, they inevitably suffer from meaningless efforts to traverse the huge scenario search space. In this work, we propose VioHawk, a novel simulation-based fuzzer that hunts for scenarios that imply ADS traffic violations. Our key idea is that, traffic law regulations can be formally modeled as hazardous/non-hazardous driving areas on the map at each timestamp during ADS simulation testing (e.g., when the traffic light is red, the intersection is marked as hazardous areas). Following this idea, VioHawk works by inducing the autonomous vehicle to drive into the law-specified hazardous areas with deterministic mutation operations. We evaluated the effectiveness of VioHawk in testing industry-grade ADS (i.e., Apollo). We constructed a benchmark dataset that contains 42 ADS violation scenarios against real-world traffic laws. Compared to existing tools, VioHawk can reproduce 3.1X~13.3X more violations within the same time budget, and save 1.6X~8.9X the reproduction time for those identified violations. Finally, with the help of VioHawk, we identified 9+8 previously unknown violations of real-world traffic laws on Apollo 7.0/8.0.

Publisher's Version Artifacts Functional
BRAFAR: Bidirectional Refactoring, Alignment, Fault Localization, and Repair for Programming Assignments
Linna Xie, Chongmin Li, Yu Pei, Tian Zhang, and Minxue Pan
(Nanjing University, China; Hong Kong Polytechnic University, China)
The problem of automated feedback generation for introductory programming assignments (IPAs) has attracted significant attention with the increasing demand for programming education. While existing approaches, like Refactory, that employ the ”block-by-block” repair strategy have produced promising results, they suffer from two limitations. First, Refactory randomly applies refactoring and mutation operations to correct and buggy programs, respectively, to align their control-flow structures (CFSs), which, however, has a relatively low success rate and often complicates the original repairing tasks. Second, Refactory generates repairs for each basic block of the buggy program when its semantics differs from the counterpart in the correct program, which, however, ignores the different roles that basic blocks play in the programs and often produces unnecessary repairs. To overcome these limitations, we propose the Brafar approach to feedback generation for IPAs. The core innovation of Brafar lies in its novel bidirectional refactoring algorithm and coarse-to-fine fault localization. The former aligns the CFSs of buggy and correct programs by applying semantics-preserving refactoring operations to both programs in a guided manner, while the latter identifies basic blocks that truly need repairs based on the semantics of their enclosing statements and themselves. In our experimental evaluation on 1783 real-life incorrect student submissions from a publicly available dataset, Brafar significantly outperformed Refactory and Clara, generating correct repairs for more incorrect programs with smaller patch sizes in a shorter time.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Synthesis-Based Enhancement for GUI Test Case Migration
Yakun Zhang, Qihao Zhu, Jiwei Yan, Chen Liu, Wenjie Zhang, Yifan Zhao, Dan Hao, and Lu Zhang
(Peking University, China; DeepSeek-AI, China; Institute of Software at Chinese Academy of Sciences, China)
GUI test case migration is the process of migrating GUI test cases from a source app to a target app for a specific functionality. However, test cases obtained via existing migration approaches can hardly be directly used to test target functionalities and typically require additional manual modifications. This problem may significantly impact the effectiveness of testing target functionalities and the practical applicability of migration approaches. In this paper, we propose MigratePro, the first approach to enhancing GUI test case migration via synthesizing a new test case based on multiple test cases for the same functionality migrated from various source apps to the target app. The aim of MigratePro is to produce functional test cases with less human intervention. Specifically, given multiple migrated test cases for the same functionality in the target app, MigratePro first combines all the GUI states related to these migrated test cases into an overall state-sequence. Then, MigratePro organizes events and assertions from migrated test cases according to the overall state-sequence and endeavors to remove the should-be-removed events and assertions, while also incorporating some connection events in order to make the should-be-included events and assertions executable. Our evaluation on 30 apps, 34 functionalities, and 127 test cases shows that MigratePro improves the capability of three representative migration approaches (i.e., Craftdroid, AppFlow, ATM), successfully improving testing the target functionalities by 86%, 333%, and 300%, respectively. These results underscore the generalizability of MigratePro for effectively enhancing migration approaches.

Publisher's Version
CREF: An LLM-Based Conversational Software Repair Framework for Programming Tutors
Boyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F. Bissyandé, and Shunfu Jin
(Yanshan University, China; Beijing JudaoYouda Network Technology, China; University of Melbourne, Australia; University of Luxembourg, Luxembourg)
With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, existing repair benchmarks might have influenced LLM training data, potentially causing data leakage. To evaluate LLMs’ realistic repair capabilities, (i) we introduce an extensive, non-crawled benchmark TutorCode, comprising 1,239 C++ defect codes and associated information such as tutor guidance, solution description, failing test cases, and the corrected code. Our work assesses LLM’s repair performance on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (ii) We then provide a comprehensive investigation into which types of extra information can help LLMs improve their repair performance. Among these types, tutor guidance was the most effective information. To fully harness LLMs’ conversational capabilities and the benefits of augmented information, (iii) we introduce a novel conversational semi-automatic repair framework CREF assisting human programming tutors. It demonstrates a remarkable AVG-5 improvement of 17.2%-24.6% compared to the baseline, achieving an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential for enhancing LLMs’ repair capabilities through tutor interactions and historical conversations. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors’ workload and improving students’ learning experience, showing promise for code review and other software engineering tasks.

Publisher's Version
Datactive: Data Fault Localization for Object Detection Systems
Yining Yin, Yang Feng, Shihao Weng, Yuan Yao, Jia Liu, and Zhihong Zhao
(Nanjing University, China)
Object detection (OD) models are seamlessly integrated into numerous intelligent software systems, playing a crucial role in various tasks. These models are typically constructed upon humanannotated datasets, whose quality can greatly affect their performance and reliability. Erroneous and inadequate annotated datasets can induce classification/localization inaccuracies during deployment, precipitating security breaches or traffic accidents that inflict property damage or even loss of life. Therefore, ensuring and improving data quality is a crucial issue for the reliability of the object detection system. This paper introduces Datactive, a data fault localization technique for object detection systems. Datactive is designed to locate various types of data faults including mislocalization and missing objects, without utilizing the prediction of object detection models trained on dirty datasets. To achieve this, we first construct foreground-only and background-included datasets via data disassembling strategies, and then employ a robust learning method to train classifiers using disassembled datasets. Based on the classifier predictions, Datactive produces a unified suspiciousness score for both foreground annotations and image backgrounds. It allows testers to easily identify and correct faulty or missing annotations with minimal effort. To validate the effectiveness, we conducted experiments on three datasets with 6 baselines, and demonstrated the superiority of Datactive from various aspects. We also explored Datactive's ability to find natural data faults and its application in both training and evaluation scenarios.

Publisher's Version
Interpretability Based Neural Network Repair
Zuohui Chen, Jun Zhou, Youcheng Sun, Jingyi Wang, Qi Xuan, and Xiaoniu Yang
(Zhejiang University of Technology, China; Binjiang Institute of Artificial Intelligence, China; University of Manchester, United Kingdom; Zhejiang University, China; National Key Laboratory of Electromagnetic Space Security, China)
Along with the prevalent use of deep neural networks (DNNs), concerns have been raised on the security threats from DNNs such as backdoors in the network. While neural network repair methods have shown to be effective for fixing the defects in DNNs, they have been also found to produce biased models, with imbalanced accuracy across different classes, or weakened adversarial robustness, allowing malicious attackers to trick the model by adding small perturbations. To address these challenges, we propose INNER, an INterpretability-based NEural Repair approach. INNER formulates the idea of neuron routing for identifying fault neurons, in which the interpretability technique model probe is used to evaluate each neuron's contribution to the undesired behaviour of the neural network. INNER then optimizes the identified neurons for repairing the neural network. We test INNER on three typical application scenarios, including backdoor attacks, adversarial attacks, and wrong predictions. Our experimental results demonstrate that INNER can effectively repair neural networks, by ensuring accuracy, fairness, and robustness. Moreover, the performance of other repair methods can be also improved by re-using the fault neurons found by INNER, justifying the generality of the proposed approach.

Publisher's Version
Exploration-Driven Reinforcement Learning for Avionic System Fault Detection (Experience Paper)
Paul-Antoine Le Tolguenec, Emmanuel Rachelson, Yann Besse, Florent Teichteil-Koenigsbuch, Nicolas Schneider, Hélène Waeselynck, and Dennis Wilson
(ISAE SUPAERO, France; Airbus, France; LAAS-CNRS, France)
Critical software systems require stringent testing to identify possible failure cases, which can be difficult to find using manual testing. In this study, we report our industrial experience in testing a realistic R&D flight control system using a heuristic based testing method. Our approach utilizes evolutionary strategies augmented with intrinsic motivation to yield a diverse range of test cases, each revealing different potential failure scenarios within the system. This diversity allows for a more comprehensive identification and understanding of the system’s vulnerabilities. We analyze the test cases found by evolution to identify the system’s weaknesses. The results of our study show that our approach can be used to improve the reliability and robustness of avionics systems by providing high-quality test cases in an efficient and cost-effective manner.

Publisher's Version ACM SIGSOFT Distinguished Paper Award
Semantic Constraint Inference for Web Form Test Generation
Parsa Alian, Noor Nashid, Mobina Shahbandeh, and Ali Mesbah
(University of British Columbia, Canada)
Automated test generation for web forms has been a longstanding challenge, exacerbated by the intrinsic human-centric design of forms and their complex, device-agnostic structures. We introduce an innovative approach, called FormNexus, for automated web form test generation, which emphasizes deriving semantic insights from individual form elements and relations among them, utilizing textual content, DOM tree structures, and visual proximity. The insights gathered are transformed into a new conceptual graph, the Form Entity Relation Graph (FERG), which offers machine-friendly semantic information extraction. Leveraging LLMs, FormNexus adopts a feedback-driven mechanism for generating and refining input constraints based on real-time form submission responses. The culmination of this approach is a robust set of test cases, each produced by methodically invalidating constraints, ensuring comprehensive testing scenarios for web forms. This work bridges the existing gap in automated web form testing by intertwining the capabilities of LLMs with advanced semantic inference methods. Our evaluation demonstrates that FormNexus combined with GPT-4 achieves 89% coverage in form submission states. This outcome significantly outstrips the performance of the best baseline model by a margin of 25%.

Publisher's Version
Call Graph Soundness in Android Static Analysis
Jordan Samhi, René Just, Tegawendé F. Bissyandé, Michael D. Ernst, and Jacques Klein
(CISPA Helmholtz Center for Information Security, Germany; University of Washington, USA; University of Luxembourg, Luxembourg)
Static analysis is sound in theory, but an implementation may unsoundly fail to analyze all of a program's code. Any such omission is a serious threat to the validity of the tool's output. Our work is the first to measure the prevalence of these omissions. Previously, researchers and analysts did not know what is missed by static analysis, what sort of code is missed, or the reasons behind these omissions. To address this gap, we ran 13static analysis tools and a dynamic analysis on 1000 Android apps. Any method in the dynamic analysis but not in a static analysis is an unsoundness. Our findings include the following. (1) Apps built around external frameworks challenge static analyzers. On average, the 13 static analysis tools failed to capture 61% of the dynamically-executed methods. (2) A high level of precision in call graph construction is a synonym for a high level of unsoundness. (3) No existing approach significantly improves static analysis soundness. This includes those specifically tailored for a given mechanism, such as DroidRA to address reflection. It also includes systematic approaches, such as EdgeMiner, capturing all callbacks in the Android framework systematically. (4) Modeling entry point methods challenges call graph construction which jeopardizes soundness.

Publisher's Version
Guardian: A Runtime Framework for LLM-Based UI Exploration
Dezhi Ran, Hao Wang, Zihe Song, Mengzhou Wu, Yuan Cao, Ying Zhang, Wei Yang, and Tao Xie
(Peking University, China; University of Texas at Dallas, USA)
Tests for feature-based UI testing have been indispensable for ensuring the quality of mobile applications (apps for short). The high manual labor costs to create such tests have led to a strong interest in automated feature-based UI testing, where an approach automatically explores the App under Test (AUT) to find correct sequences of UI events achieving the target test objective, given only a high-level test objective description. Given that the task of automated feature-based UI testing resembles conventional AI planning problems, large language models (LLMs), known for their effectiveness in AI planning, could be ideal for this task. However, our study reveals that LLMs struggle with following specific instructions for UI testing and replanning based on new information. This limitation results in reduced effectiveness of LLM-driven solutions for automated feature-based UI testing, despite the use of advanced prompting techniques. Toward addressing the preceding limitation, we propose Guardian, a runtime system framework to improve the effectiveness of automated feature-based UI testing by offloading computational tasks from LLMs with two major strategies. First, Guardian refines UI action space that the LLM can plan over, enforcing the instruction following of the LLM by construction. Second, Guardian deliberately checks whether the gradually enriched information invalidates previous planning by the LLM. Guardian removes the invalidated UI actions from the UI action space that the LLM can plan over, restores the state of the AUT to the state before the execution of the invalidated UI actions, and prompts the LLM to re-plan with the new UI action space. We instantiate Guardian with ChatGPT and construct a benchmark named FestiVal with 58 tasks from 23 highly popular apps. Evaluation results on FestiVal show that Guardian achieves 48.3

Publisher's Version
NativeSummary: Summarizing Native Binary Code for Inter-language Static Analysis of Android Apps
Jikai Wang and Haoyu Wang
(Huazhong University of Science and Technology, China)
With the prosperity of Android app research in the last decade, many static analysis techniques have been proposed. They generally aim to tackle DEX bytecode in Android apps. Beyond DEX bytecode, native code (usually written in C/C++) is prevalent in modern Android apps, whose analysis is usually overlooked by most existing analysis frameworks. Although a few recent works attempted to handle native code, they suffer from scalability and accuracy issues. In this paper, we propose NativeSummary, a novel inter-language static analysis framework for Android apps with high accuracy, scalability, and compatibility. Our key idea is to extract semantic summary of the native binary code, then convert common usage patterns of JNI interface functions into Java bytecode operations, and additionally transform native library function calls to bytecode calls. Along with this effort, we can empower the legacy Java static frameworks with the ability of inter-language data flow analysis without tampering their inherent logic. Extensive evaluation suggests that NativeSummary outperforms SOTA techniques in terms of accuracy, scalability and compatibility. NativeSummary sheds light on the promising direction of inter-language analysis, and thousands of existing app analysis works can be boosted atop NativeSummary with almost no effort.

Publisher's Version Published Artifact Info Artifacts Available
Efficient DNN-Powered Software with Fair Sparse Models
Xuanqi Gao, Weipeng Jiang, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, and Chao Shen
(Xi’an Jiaotong University, China; University of Massachusetts at Amherst, USA)
With the emergence of the Software 3.0 era, there is a growing trend of compressing and integrating large models into software systems, with significant societal implications. Regrettably, in numerous instances, model compression techniques impact the fairness performance of these models and thus the ethical behavior of DNN-powered software. One of the most notable example is the Lottery Ticket Hypothesis (LTH), a prevailing model pruning approach. This paper demonstrates that fairness issue of LTH-based pruning arises from both its subnetwork selection and training procedures, highlighting the inadequacy of existing remedies. To address this, we propose a novel pruning framework, Ballot, which employs a novel conflict-detection-based subnetwork selection to find accurate and fair subnetworks, coupled with a refined training process to attain a high-performance model, thereby improving the fairness of DNN-powered software. By means of this procedure, Ballot improves the fairness of pruning by 38.00%, 33.91%, 17.96%, and 35.82% compared to state-of-the-art baselines, namely Magnitude Pruning, Standard LTH, SafeCompress, and FairScratch respectively, based on our evaluation of five popular datasets and three widely used models. Our code is available at https://anonymous.4open.science/r/Ballot-506E.

Publisher's Version
Learning to Check LTL Satisfiability and to Generate Traces via Differentiable Trace Checking
Weilin Luo, Pingjia Liang, Junming Qiu, Polong Chen, Hai Wan, Jianfeng Du, and Weiyuan Fang
(Sun Yat-sen University, China; Guangdong University of Foreign Studies, China)
Linear temporal logic (LTL) satisfiability checking has a high complexity, i.e., PSPACE-complete. Recently, neural networks have been shown to be promising in approximately checking LTL satisfiability in polynomial time. However, there is still a lack of neural network-based approach to the problem of checking LTL satisfiability and generating traces as evidence, simply called SAT-and-GET, where a satisfiable trace is generated as evidence if the given LTL formula is detected to be satisfiable. In this paper, we tackle SAT-and-GET via bridging LTL trace checking to neural network inference. Our key theoretical contribution is to show that a well-designed neural inference process, named after neural trace checking, is able to simulate LTL trace checking. We present a neural network-based approach VSCNet. Relying on the differentiable neural trace checking, VSCNet is able to learn both to check satisfiability and to generate traces via gradient descent. Experimental results confirm the effectiveness of VSCNet, showing that it significantly outperforms the state-of-the-art (SOTA) neural network-based approaches for trace generation, on average achieving up to 41.68% improvement in semantic accuracy. Besides, compared with the SOTA logic-based approach nuXmv and Aalta, VSCNet achieves averagely 186X and 3541X speedups on large-scale datasets, respectively.

Publisher's Version Info
DeLink: Source File Information Recovery in Binaries
Zhe Lang, Zhengzi Xu, Xiaohui Chen, Shichao Lv, Zhanwei Song, Zhiqiang Shi, and Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanyang Technological University, Singapore; Imperial Global Singapore, Singapore; China Mobile Research Institute, China)
Program comprehension can help analysts understand the primary behavior of a binary and enhance the efficiency of reverse engineering analysis. The existing works focus on instruction translation and function name prediction. However, they are limited in understanding the entire program. The recovered source file information can offer insights into the primary behavior of a binary, serving as high-level program summaries. Nevertheless, the files recovered by the function clustering-based approach contain binary functions with discontinuous distributions, resulting in low accuracy. Additionally, there is no existing research related to predicting the names of these recovered files. To this end, we propose a framework for source file information recovery in binaries, DeLink. This framework first leverages a file structure recovery approach based on boundary location to recognize files within a binary. Then, it utilizes an encoder-decoder model to predict the names of these files. The experimental results show that our file structure recovery approach achieves an average improvement of 14% across six evaluation metrics and requires only an average time of 16.74 seconds, outperforming the state-of-the-art work in both recovery quality and efficiency. Additionally, our file name prediction model achieves 70.09% precision and 63.91% recall. Moreover, we demonstrate the effective application of DeLink in malware homology analysis.

Publisher's Version
Your “Notice” Is Missing: Detecting and Fixing Violations of Modification Terms in Open Source Licenses during Forking
Kaifeng Huang, Yingfeng Xia, Bihuan Chen, Siyang He, Huazheng Zeng, Zhuotong Zhou, Jin Guo, and Xin Peng
(Tongji University, China; Fudan University, China)
Open source software brings benefit to the software community but also introduces legal risks caused by license violations, which result in serious consequences such as lawsuits and financial losses. To mitigate legal risks, some approaches have been proposed to identify licenses, detect license incompatibilities and inconsistencies, and recommend licenses. As far as we know, however, there is no prior work to understand modification terms in open source licenses or to detect and fix violations of modification terms. To bridge this gap, we first empirically characterize modification terms in 48 open source licenses. These licenses all require certain forms of “notice” to describe the modifications made to the original work. Inspired by our study, we then design LiVo to automatically detect and fix violations of modification terms in open source licenses during forking. Our evaluation has shown the effectiveness and efficiency of LiVo. 18 pull requests for fixing modification term violations have received positive responses. 8 have been merged.

Publisher's Version
Wapplique: Testing WebAssembly Runtime via Execution Context-Aware Bytecode Mutation
Wenxuan Zhao, Ruiying Zeng, and Yangfan Zhou
(Fudan University, China)
Reliability is the top concern to runtimes. This paper studies how to test Wasm runtime, by presenting Wapplique, the first Wasm bytecode mutation-based fuzzing tool. Wapplique solves the diversity/efficiency dilemma in generating test cases with a specifically-tailored code-fragment substitution approach for Wasm. In particular, Wapplique appliqués code fragments from real-world programs to seed programs to enhance the diversity of the seeds. Via sophisticated code analysis algorithms we design, Wapplique also guarantees the validity of the resulting programs. This allows Wapplique to generate tremendous valid and diverse Wasm programs as test cases to well exercise target runtimes. Our experiences on applying Wapplique in testing four prevalent real-world runtimes indicate that it can generate test cases efficiently, achieve high coverage, and find 20 previously unknown bugs.

Publisher's Version Published Artifact Artifacts Available
Feedback-Driven Automated Whole Bug Report Reproduction for Android Apps
Dingbang Wang, Yu Zhao, Sidong Feng, Zhaoxu Zhang, William G. J. Halfond, Chunyang Chen, Xiaoxia Sun, Jiangfan Shi, and Tingting Yu
(University of Connecticut, USA; University of Cincinnati, USA; Monash University, Australia; University of Southern California, USA; TU Munich, Germany; China Mobile (Suzhou) Software Technology, China; Zhejiang University, China)
In software development, bug report reproduction is a challenging task. This paper introduces ReBL, a novel feedback-driven approach that leverages GPT-4, a large-scale language model (LLM), to automatically reproduce Android bug reports. Unlike traditional methods, ReBL bypasses the use of Step to Reproduce (S2R) entities. Instead, it leverages the entire textual bug report and employs innovative prompts to enhance GPT’s contextual reasoning. This approach is more flexible and context-aware than the traditional step-by-step entity matching approach, resulting in improved accuracy and effectiveness. In addition to handling crash reports, ReBL has the capability of handling non-crash functional bug reports. Our evaluation of 96 Android bug reports (73 crash and 23 non-crash) demonstrates that ReBL successfully reproduced 90.63% of these reports, averaging only 74.98 seconds per bug report. Additionally, ReBL outperformed three existing tools in both success rate and speed.

Publisher's Version
UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing
Yifeng He, Jiabo Huang, Yuyang Rong, Yiwen Guo, Ethan Wang, and Hao Chen
(University of California at Davis, USA; Tencent, China; Unaffiliated, China)
The remarkable capability of large language models (LLMs) in generating high-quality code has drawn increasing attention in the software testing community. However, existing code LLMs often demonstrate unsatisfactory capabilities in generating accurate, complete tests since they were trained on code snippets collected without differentiating between code for testing and for other purposes. In this paper, we present a large-scale dataset, UniTSyn, which can enhance LLMs for Unit Test Synthesis. Associating tests with the tested functions is crucial for LLMs to infer the expected behavior and the logic paths to be verified. By leveraging Language Server Protocol, UniTSyn achieves the challenging goal of collecting focal-test pairs without per-project execution setups or per-language heuristics, which tend to be fragile and difficult to scale. Containing 2.7 million focal-test pairs across five mainstream programming languages, it can enhance the test generation ability of LLMs. Our experiments demonstrate that, by building an autoregressive LLM based on UniTSyn, we can achieve significant benefits in learning and understanding unit test representations, resulting in improved generation accuracy and code coverage across all the evaluated programming languages.

Publisher's Version Published Artifact Artifacts Available
When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention
Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng
(Sun Yat-sen University, China; Xi’an Jiaotong University, China; Chongqing University, China; Huawei Cloud Computing Technologies, China)
Code generation aims to automatically generate code snippets that meet given natural language requirements and plays an important role in software development. Although Code LLMs have shown excellent performance in this domain, their long generation time poses a signification limitation in practice use. In this paper, we first conduct an in-depth preliminary study with different Code LLMs on code generation task and identify a significant efficiency issue, i.e., continual generation of excess tokens. It harms the developer productivity and leads to huge computational wastes. To address it, we introduce CodeFast, an inference acceleration approach for Code LLMs on code generation. The key idea of CodeFast is to terminate the inference process in time when unnecessary excess tokens are detected. First, we propose an automatic data construction framework to obtain training data. Then, we train a unified lightweight model GenGuard applicable to multiple programming languages to predict whether to terminate inference at the current step. Finally, we enhance Code LLM with GenGuard to accelerate its inference in code generation task. We conduct extensive experiments with CodeFast on five representative Code LLMs across four widely used code generation datasets. Experimental results show that (1) CodeFast can significantly improve the inference speed of various Code LLMs in code generation, ranging form 34% to 452%, without compromising the quality of generated code. (2) CodeFast is stable across different parameter settings and can generalize to untrained datasets. Our code and data are available at https://github.com/DeepSoftwareAnalytics/CodeFast.

Publisher's Version Info ACM SIGSOFT Distinguished Paper Award
Dance of the ADS: Orchestrating Failures through Historically-Informed Scenario Fuzzing
Tong Wang, Taotao Gu, Huan Deng, Hu Li, Xiaohui Kuang, and Gang Zhao
(Academy of Military Sciences, China)
As autonomous driving systems (ADS) advance towards higher levels of autonomy, orchestrating their safety verification becomes increasingly intricate. This paper unveils ScenarioFuzz, a pioneering scenario-based fuzz testing methodology. Designed like a choreographer who understands the past performances, it uncovers vulnerabilities in ADS without the crutch of predefined scenarios. Leveraging map road networks, such as OPENDRIVE, we extract essential data to form a foundational scenario seed corpus. This corpus, enriched with pertinent information, provides the necessary boundaries for fuzz testing in the absence of starting scenarios. Our approach integrates specialized mutators and mutation techniques, combined with a graph neural network model, to predict and filter out high-risk scenario seeds, optimizing the fuzzing process using historical test data. Compared to other methods, our approach reduces the time cost by an average of 60.3%, while the number of error scenarios discovered per unit of time increases by 103%. Furthermore, we propose a self-supervised collision trajectory clustering method, which aids in identifying and summarizing 54 high-risk scenario categories prone to inducing ADS faults. Our experiments have successfully uncovered 58 bugs across six tested systems, emphasizing the critical safety concerns of ADS.

Publisher's Version Published Artifact Artifacts Available
AsFuzzer: Differential Testing of Assemblers with Error-Driven Grammar Inference
Hyungseok Kim, Soomin Kim, Jungwoo Lee, and Sang Kil Cha
(Affiliated Institute of ETRI, South Korea; KAIST, South Korea)
Assembler is a critical component of the compiler toolchain, which has been less tested than the other components. Unfortunately, current grammar-based fuzzing techniques suffer from several challenges when testing assemblers. First, each different assembler accepts different grammar rules and syntaxes, and there are no existing assembly grammar specifications. Second, not every assembler is open-source, which makes it difficult to extract grammar rules from the source code. While existing black-box grammar inference approaches are applicable to such closed-source assemblers, they suffer from the scalability issue, which renders them impractical for testing assemblers. To address these challenges, we propose a novel way to test assemblers by automatically inferring their grammar rules with only a few queries to the target assemblers by leveraging their error messages. The key insight is that assembly error messages often deliver useful information to infer the underlying grammar rules. We have implemented our technique in a tool named AsFuzzer, and evaluated it on 4 real-world assemblers including Clang-integrated assembler (Clang), GNU assembler (GAS), Intel’s assembler (ICC), and Microsoft macro assembler (MASM). With AsFuzzer, we have successfully found 497 buggy instruction opcodes for six popular architectures, and reported them to the developers.

Publisher's Version Published Artifact Artifacts Available ACM SIGSOFT Distinguished Paper Award
Better Not Together: Staged Solving for Context-Free Language Reachability
Chenghang Shi, Haofeng Li, Jie Lu, and Lian Li
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Context-free language reachability (CFL-reachability) is a fundamental formulation for program analysis with many applications. CFL-reachability analysis is computationally expensive, with a slightly subcubic time complexity concerning the number of nodes in the input graph. This paper proposes staged solving: a new perspective on solving CFL-reachability. Our key observation is that the context-free grammar (CFG) of a CFL-based program analysis can be decomposed into (1) a smaller CFG, L, for matching parentheses, such as procedure calls/returns, field stores/loads, and (2) a regular grammar, R, capturing control/data flows. Instead of solving these two parts monolithically (as in standard algorithms), staged solving solves L-reachability and R-reachability in two distinct stages. In practice, L-reachability, though still context-free, involves only a small subset of edges, while R-reachability can be computed efficiently with close to quadratic complexity relative to the node size of the input graph. We implement our staged CFL-reachability solver, STG, and evaluate it using two clients: context-sensitive value-flow analysis and field-sensitive alias analysis. The empirical results demonstrate that STG achieves speedups of 861.59x and 4.1x for value-flow analysis and alias analysis on average, respectively, over the standard subcubic algorithm. Moreover, we also showcase that staged solving can help to significantly improve the performance of two state-of-the-art solvers, POCR and PEARL, by 74.82x (1.78x) and 37.66x (1.7x) for value-flow (alias) analysis, respectively.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
AI Coders Are among Us: Rethinking Programming Language Grammar towards Efficient Code Generation
Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, and David Lo
(Singapore Management University, Singapore; Monash University, Australia; Beihang University, China)
Artificial Intelligence (AI) models have emerged as another important audience for programming languages alongside humans and machines, as we enter the era of large language models (LLMs). LLMs can now perform well in coding competitions and even write programs like developers to solve various tasks, including mathematical problems. However, the grammar and layout of current programs are designed to cater the needs of human developers -- with many grammar tokens and formatting tokens being used to make the code easier for humans to read. While this is helpful, such a design adds unnecessary computational work for LLMs, as each token they either use or produce consumes computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar.This aims to represent code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formats and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain identical Abstract Syntax Tree (AST) structures to those in standard Python. This allows for not only execution via a modified AST parser, but also seamless transformation between programs written in Python and SimPy, enabling human developers and LLMs to use Python and SimPy, respectively, when they need to collaborate. We also look into methods to help existing LLMs understand and use SimPy effectively. In the experiments, compared with Python, SimPy enables a reduction in token usage by 13.5% and 10.4% for CodeLlama and GPT-4, respectively, when completing the same set of code-related tasks. Additionally, these models can maintain or even improve their performance when using SimPy instead of Python for these tasks. With these promising results, we call for further contributions to the development of AI-oriented program grammar within our community.

Publisher's Version ACM SIGSOFT Distinguished Paper Award
FRIES: Fuzzing Rust Library Interactions via Efficient Ecosystem-Guided Target Generation
Xizhe Yin, Yang Feng, Qingkai Shi, Zixi Liu, Hongwang Liu, and Baowen Xu
(Nanjing University, China)
Rust has been extensively used in software development in the past decades due to its memory safety mechanisms and gradually matured ecosystems. Enhancing the quality of Rust libraries is critical to Rust ecosystems as the libraries are often the core component of software systems. Nevertheless, we observe that existing approaches fall short in testing Rust API interactions - they either lack a Rust ownership-compliant API testing method, fail to handle the large search space of function dependencies, or are limited by pre-selected codebases, resulting in inefficiencies in finding errors. To address these issues, we propose a fuzzing technique, namely FRIES, that efficiently synthesizes and tests complex API interactions to identify defects in Rust libraries, and therefore promises to significantly improve the quality of Rust libraries. Behind our approach, a key technique is to traverse a weighted API dependency graph, which encodes not only syntactic dependency between functions but also the common usage patterns mined from the Rust ecosystem that reflect the programmer’s thinking. Combined with our efficient generation algorithm, such a graph structure significantly reduces the search space and lets us focus on finding hidden bugs in common application scenarios. Meanwhile, an ownership assurance algorithm is specially designed to ensure the validity of the generated Rust programs, notably improving the success rate of compiling fuzz targets. Experimental results demonstrate that this technique can indeed generate high-quality fuzz targets with minimal computational resources, while more efficiently discovering errors that have a greater impact on actual development, thereby mitigating the impact on the robustness of programs in the Rust ecosystem. So far, FRIES has identified 130 bugs, including 84 previously unknown bugs, in 20 well-known latest versions of Rust libraries, of which 54 have been confirmed.

Publisher's Version
Segment-Based Test Case Prioritization: A Multi-objective Approach
Hieu Huynh, Nhu Pham, Tien N. Nguyen, and Vu Nguyen
(Katalon, Vietnam; Ho Chi Minh City University of Science, Vietnam; University of Texas at Dallas, USA; Vietnam National University, Vietnam)
Regression testing of software is a crucial but time-consuming task, especially in the context of user interface (UI) testing where multiple microservices must be validated simultaneously. Test case prioritization (TCP) is a cost-efficient solution to address this by scheduling test cases in an execution order that maximizes an objective function, generally aimed at increasing the fault detection rate. While several techniques have been proposed for TCP, most rely on source code information which is usually not available for UI testing. In this paper, we introduce a multi-objective optimization approach to prioritize UI test cases, using evolutionary search algorithms and four coverage criteria focusing on web page elements as objectives for the optimization problem. Our method, which does not require source code information, is evaluated using two evolutionary algorithms (AGE-MOEA and NSGA-II) and compared with other TCP methods on a self-collected dataset of 11 test suites. The results show that our approach significantly outperforms other methods in terms of Average Percentage of Faults Detected (APFD) and APFD with Cost (APFDc), achieving the highest scores of 87.8% and 79.2%, respectively. We also introduce a new dataset and demonstrate the significant improvement of our approach over existing ones via empirical experiments. The paper’s contributions include the application of web page segmentation in TCP, the construction of a new dataset for UI TCP, and empirical comparisons that demonstrate the improvement of our approach.

Publisher's Version
Understanding Misconfigurations in ROS: An Empirical Study and Current Approaches
Paulo Canelas, Bradley Schmerl, Alcides Fonseca, and Christopher S. Timperley
(Carnegie Mellon University, USA; LASIGE, Portugal; University of Lisbon, Portugal)
The Robot Operating System (ROS) is a popular framework and ecosystem that allows developers to build robot software systems from reusable, off-the-shelf components. Systems are often built by customizing and connecting components via configuration files. While reusable components theoretically allow rapid prototyping, ensuring proper configuration and connection is challenging, as evidenced by numerous questions on developer forums. Developers must abide to the often unchecked and unstated assumptions of individual components. Failure to do so can result in misconfigurations that are only discovered during field deployment, at which point errors may lead to unpredictable and dangerous behavior. Despite misconfigurations having been studied in the broader context of software engineering, robotics software (and ROS in particular) poses domain-specific challenges with potentially disastrous consequences. To understand and improve the reliability of ROS projects, it is critical to identify the types of misconfigurations faced by developers. To that end, we perform a study of ROS Answers, a Q&A platform, to identify and categorize misconfigurations that occur during ROS development. We then conduct a literature review to assess the coverage of these misconfigurations by existing detection techniques. In total, we find 12 high-level categories and 50 sub-categories of misconfigurations. Of these categories, 27 are not covered by existing techniques. To conclude, we discuss how to tackle those misconfigurations in future work.

Publisher's Version Published Artifact Archive submitted (110 MB) Artifacts Available
Tacoma: Enhanced Browser Fuzzing with Fine-Grained Semantic Alignment
Jiashui Wang, Peng Qian, Xilin Huang, Xinlei Ying, Yan Chen, Shouling Ji, Jianhai Chen, Jundong Xie, and Long Liu
(Zhejiang University, China; Ant Group, China; Northwestern University, USA)
Browsers are responsible for managing and interpreting the diverse data coming from the web. Despite the considerable efforts of developers, however, it is nearly impossible to completely eliminate potential vulnerabilities in such complicated software. While a family of fuzzing techniques has been proposed to detect flaws in web browsers, they still face the inherent challenge of generating test inputs with low semantic correctness and poor diversity. In this paper, we propose Tacoma, a novel fuzzing framework tailored for web browsers. Tacoma comprises three main modules: a semantic parser, a semantic aligner, and an input generator. By taking advantage of fine-grained semantic alignment techniques, Tacoma is capable of generating semantically correct test inputs, which significantly improve the probability of a fuzzer in triggering a deep browser state. In particular, by integrating a scope-aware strategy into input generation, Tacoma is able to deal with asynchronous code generation, thereby substantially increasing the diversity of the generated test inputs. We conduct extensive experiments to evaluate Tacoma on three production-level browsers, i.e., Chromium, Safari, and Firefox. Empirical results demonstrate that Tacoma outperforms state-of-the-art browser fuzzers in both achieving code coverage and detecting unique crashes. So far, Tacoma has identified 32 previously unknown bugs, 10 of which have been assigned CVEs. It is worth noting that Tacoma unearthed two bugs in Chromium that have remained undetected for ten years.

Publisher's Version
Synthesis of Sound and Precise Storage Cost Bounds via Unsound Resource Analysis and Max-SMT
Elvira Albert, Jesús Correas, Pablo Gordillo, Guillermo Román-Díez, and Albert Rubio
(Complutense University of Madrid, Spain; Universidad Politécnica de Madrid, Spain)
A storage is a persistent memory whose contents are kept across different program executions. In the blockchain technology, storage contents are replicated and incur the largest costs of a program’s execution (a.k.a. gas fees). Storage costs are dynamically calculated using a rather complex model which assigns a much larger cost to the first access made in an execution to a storage key, and besides assigns different costs to write accesses depending on whether they change the values w.r.t. the initial and previous contents. Safely assuming the largest cost for all situations, as done in existing gas analyzers, is an overly-pessimistic approach that might render useless bounds because of being too loose. The challenge is to soundly, and yet accurately, synthesize storage bounds which take into account the dynamicity implicit to the cost model. Our solution consists in using an off-the-shelf static resource analysis —but do not always assuming a worst-case cost— and hence yielding unsound bounds; and then, in a posterior stage, computing corrections to recover soundness in the bounds by using a new Max-SMT based approach. We have implemented our approach and used it to improve the precision of two gas analyzers for Ethereum, gastap and asparagus. Experimental results on more than 400,000 functions show that we achieve great accuracy gains, up to 75%, on the storage bounds, being the most frequent gains between 10-20%.

Publisher's Version
Identifying Smart Contract Security Issues in Code Snippets from Stack Overflow
Jiachi Chen, Chong Chen, Jiang Hu, John Grundy, Yanlin Wang, Ting Chen, and Zibin Zheng
(Sun Yat-sen University, China; Monash University, Australia; University of Electronic Science and Technology of China, China)
Smart contract developers frequently seek solutions to developmental challenges on Q&A platforms such as Stack Overflow (SO). Although community responses often provide viable solutions, the embedded code snippets can also contain hidden vulnerabilities. Integrating such code directly into smart contracts may make them susceptible to malicious attacks. We conducted an online survey and received 74 responses from smart contract developers. The results of this survey indicate that the majority (86.4%) of participants do not sufficiently consider security when reusing SO code snippets. Despite the existence of various tools designed to detect vulnerabilities in smart contracts, these tools are typically developed for analyzing fully-completed smart contracts and thus are ineffective for analyzing typical code snippets as found on SO. We introduce SOChecker, the first tool designed to identify potential vulnerabilities in incomplete SO smart contract code snippets. SOChecker first leverages a fine-tuned Llama2 model for code completion, followed by the application of symbolic execution methods for vulnerability detection. Our experimental results, derived from a dataset comprising 897 code snippets collected from smart contract-related SO posts, demonstrate that SOChecker achieves an F1 score of 68.2%, greatly surpassing GPT-3.5 and GPT-4 (20.9% and 33.2% F1 Scores respectively). Our findings underscore the need to improve the security of code snippets from Q&A websites.

Publisher's Version Info ACM SIGSOFT Distinguished Paper Award
Domain Adaptation for Code Model-Based Unit Test Case Generation
Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, and Song Wang
(York University, Canada; University of Calgary, Canada)
Recently, deep learning-based test case generation approaches have been proposed to automate the generation of unit test cases. In this study, we leverage Transformer-based code models to generate unit tests with the help of Domain Adaptation (DA) at a project level. Specifically, we use CodeT5, a relatively small language model trained on source code data, and fine-tune it on the test generation task. Then, we apply domain adaptation to each target project data to learn project-specific knowledge (project-level DA). We use the Methods2test dataset to fine-tune CodeT5 for the test generation task and the Defects4j dataset for project-level domain adaptation and evaluation. We compare our approach with (a) CodeT5 fine-tuned on the test generation without DA, (b) the A3Test tool, and (c) GPT-4 on five projects from the Defects4j dataset. The results show that tests generated using DA can increase the line coverage by 18.62%, 19.88%, and 18.02% and mutation score by 16.45%, 16.01%, and 12.99% compared to the above (a), (b), and (c) baselines, respectively. The overall results show consistent improvements in metrics such as parse rate, compile rate, BLEU, and CodeBLEU. In addition, we show that our approach can be seen as a complementary solution alongside existing search-based test generation tools such as EvoSuite, to increase the overall coverage and mutation scores with an average of 34.42% and 6.8%, for line coverage and mutation score, respectively.

Publisher's Version
How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation
Cen Zhang, Yaowen Zheng, Mingqiang Bai, Yeting Li, Wei Ma, Xiaofei Xie, Yuekang Li, Limin Sun, and Yang Liu
(Nanyang Technological University, Singapore; Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Singapore Management University, Singapore; UNSW, Sydney, Australia)
Fuzz drivers are essential for library API fuzzing. However, automatically generating fuzz drivers is a complex task, as it demands the creation of high-quality, correct, and robust API usage code. An LLM-based (Large Language Model) approach for generating fuzz drivers is a promising area of research. Unlike traditional program analysis-based generators, this text-based approach is more generalized and capable of harnessing a variety of API usage information, resulting in code that is friendly for human readers. However, there is still a lack of understanding regarding the fundamental issues on this direction, such as its effectiveness and potential challenges. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely-used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, with 0.85 billion token costs ($8,000+ charged tokens). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-year). Our study uncovered that: 1) While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications; 2) LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process; 3) While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage, or integrating semantic oracles to facilitate logical bug detection. Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.

Publisher's Version Info
Commit Artifact Preserving Build Prediction
Guoqing Wang, Zeyu Sun, Yizhou Chen, Yifan Zhao, Qingyuan Liang, and Dan Hao
(Peking University, China; Institute of Software at Chinese Academy of Sciences, China)
In Continuous Integration (CI), accurate build prediction is crucial for minimizing development costs and enhancing efficiency. However, existing build prediction methods, typically based on predefined rules or machine learning classifiers employing feature engineering, have been constrained by their limited ability to fully capture the intricate details of commit artifacts, such as code change and commit messages. These artifacts are critical for understanding the commit under a build but have been inadequately utilized in existing approaches. To address this problem, we propose GitSense, a Transformer-based model specifically designed to incorporate the rich and complex information contained within commit artifacts for the first. GitSense employs an advanced textual encoder with built-in sliding window text samplers for textual features and a statistical feature encoder for extracted statistical features. This innovative approach allows for a comprehensive analysis of lengthy and intricate commit artifacts, surpassing the capabilities of traditional methods. We conduct comprehensive experiments to compare GitSense with five state-of-the-art build prediction models, Longformer, and ChatGPT. The experimental results show that GitSense outperforms these models in predicting failed builds, evidenced by 32.7%-872.1.0% better on F1-score, 23.9%-437.5% better on Precision, and 40.2%-1396.0% better on Recall.

Publisher's Version
Toward the Automated Localization of Buggy Mobile App UIs from Bug Descriptions
Antu Saha, Yang Song, Junayed Mahmud, Ying Zhou, Kevin Moran, and Oscar Chaparro
(William & Mary, USA; University of Central Florida, USA; George Mason University, USA)
Bug report management is a costly software maintenance process comprised of several challenging tasks. Given the UI-driven nature of mobile apps, bugs typically manifest through the UI, hence the identification of buggy UI screens and UI components (Buggy UI Localization) is important to localizing the buggy behavior and eventually fixing it. However, this task is challenging as developers must reason about bug descriptions (which are often low-quality), and the visual or code-based representations of UI screens. This paper is the first to investigate the feasibility of automating the task of Buggy UI Localization through a comprehensive study that evaluates the capabilities of one textual and two multi-modal deep learning (DL) techniques and one textual unsupervised technique. We evaluate such techniques at two levels of granularity, Buggy UI Screen and UI Component localization. Our results illustrate the individual strengths of models that make use of different representations, wherein models that incorporate visual information perform better for UI screen localization, and models that operate on textual screen information perform better for UI component localization – highlighting the need for a localization approach that blends the benefits of both types of techniques. Furthermore, we study whether Buggy UI Localization can improve traditional buggy code localization, and find that incorporating localized buggy UIs leads to improvements of 9%-12% in Hits@10.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
WASMaker: Differential Testing of WebAssembly Runtimes via Semantic-Aware Binary Generation
Shangtong Cao, Ningyu He, Xinyu She, Yixuan Zhang, Mu Zhang, and Haoyu Wang
(Beijing University of Posts and Telecommunications, China; Peking University, China; Huazhong University of Science and Technology, China; University of Utah, USA)
A fundamental component of the Wasm ecosystem is the Wasm runtime, as it directly impacts whether Wasm applications can be executed as expected. Bugs in Wasm runtimes are frequently reported, so the research community has made a few attempts to design automated testing frameworks to detect bugs in Wasm runtimes. However, existing testing frameworks are limited by the quality of test cases, i.e., they face challenges in generating Wasm binaries that are both semantically rich and syntactically correct. As a result, complicated bugs cannot be triggered effectively. In this work, we present WASMaker, a novel differential testing framework that can generate complicated Wasm test cases by disassembling and assembling real-world Wasm binaries, which can trigger hidden inconsistencies among Wasm runtimes. To further pinpoint the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to locate bugs accurately. Extensive evaluation suggests that WASMaker outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
ThinkRepair: Self-Directed Automated Program Repair
Xin Yin, Chao Ni, Shaohua Wang, Zhenhao Li, Limin Zeng, and Xiaohu Yang
(Zhejiang University, China; Central University of Finance and Economics, China; Concordia University, Canada)
Though many approaches have been proposed for Automated Program Repair (APR) and indeed achieved remarkable performance, they still have limitations in fixing bugs that require analyzing and reasoning about the logic of the buggy program. Recently, large language models (LLMs) instructed by prompt engineering have attracted much attention for their powerful ability to address many kinds of tasks including bug-fixing. However, the quality of the prompt will highly affect the ability of LLMs and manually constructing high-quality prompts is a costly endeavor. To address this limitation, we propose a self-directed LLM-based automated program repair, ThinkRepair, with two main phases: collection phase and fixing phase. The former phase automatically collects various chains of thoughts that constitute pre-fixed knowledge by instructing LLMs with the Chain-of-Thought (CoT) prompt. The latter phase targets fixing a bug by first selecting examples for few-shot learning and second automatically interacting with LLMs, optionally appending with feedback of testing information. Evaluations on two widely studied datasets (Defects4J and QuixBugs) by comparing ThinkRepair with 12 SOTA APRs indicate the priority of ThinkRepair in fixing bugs. Notably, ThinkRepair fixes 98 bugs and improves baselines by 27%∼344.4% on Defects4J V1.2. On Defects4J V2.0, ThinkRepair fixes 12∼65 more bugs than the SOTA APRs. Additionally, ThinkRepair also makes a considerable improvement on QuixBugs (31 for Java and 21 for Python at most).

Publisher's Version
Fuzzing MLIR Compiler Infrastructure via Operation Dependency Analysis
Chenyao Suo, Junjie Chen, Shuang Liu, Jiajun Jiang, Yingquan Zhao, and Jianrong Wang
(Tianjin University, China; Renmin University of China, China)
MLIR (Multi-Level Intermediate Representation) compiler infrastructure has gained widespread popularity in recent years. It introduces dialects to accommodate various levels of abstraction within the representation. Due to its fundamental role in compiler construction, it is critical to ensure its correctness. Recently, a grammar-based fuzzing technique (i.e., MLIRSmith) has been proposed for it and achieves notable effectiveness. However, MLIRSmith generates test programs in a random manner, which restricts the exploration of the input space, thereby limiting the overall fuzzing effectiveness. In this work, we propose a novel fuzzing technique, called MLIR. As complicated or uncommon data/control dependencies among various operations are often helpful to trigger MLIR bugs, it constructs the operation dependency graph for an MLIR program and defines the associated operation dependency coverage to guide the fuzzing process. To drive the fuzzing process towards increasing operation dependency coverage, MLIR then designs a set of dependency-targeted mutation rules. By applying MLIR to the latest revisions of the MLIR compiler infrastructure, it detected 63 previously unknown bugs, among which 38/48 bugs have been fixed/confirmed by developers.

Publisher's Version
Evaluating Deep Neural Networks in Deployment: A Comparative Study (Replicability Study)
Eduard Pinconschi, Divya Gopinath, Rui Abreu, and Corina S. Păsăreanu
(Carnegie Mellon University, USA; KBR, USA; NASA Ames, USA; INESC-ID, Portugal; University of Porto, Portugal)
As deep neural networks (DNNs) are increasingly used in safety-critical applications, there is a growing concern for their reliability. Even highly trained, high-performant networks are not 100% accurate. However, it is very difficult to predict their behavior during deployment without ground truth. In this paper, we provide a comparative and replicability study on recent approaches that have been proposed to evaluate the reliability of DNNs in deployment. We find that it is hard to run and reproduce the results for these approaches on their replication packages and even more difficult to run them on artifacts other than their own. Further, it is difficult to compare the effectiveness of the approaches, due to the lack of clearly defined evaluation metrics. Our results indicate that more effort is needed in our research community to obtain sound techniques for evaluating the reliability of neural networks in safety-critical domains. To this end, we contribute an evaluation framework that incorporates the considered approaches and enables evaluation on common benchmarks, using common metrics.

Publisher's Version
Towards Understanding the Bugs in Solidity Compiler
Haoyang Ma, Wuqi Zhang, Qingchao Shen, Yongqiang Tian, Junjie Chen, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Tianjin University, China)
Solidity compiler plays a key role in enabling the development of smart contract applications on Ethereum by governing the syntax of a domain-specific language called Solidity and performing compilation and optimization of Solidity code. The correctness of Solidity compiler is critical in fostering transparency, efficiency, and trust in industries reliant on smart contracts. However, like other software systems, Solidity compiler is prone to bugs, which may produce incorrect bytecodes on blockchain platforms, resulting in severe security concerns. As a domain-specific compiler for smart contracts, Solidity compiler differs from other compilers in many perspectives, posing unique challenges to detect its bugs. To understand the bugs in Solidity compiler and benefit future research, in this paper, we present the first systematic study on 533 Solidity compiler bugs. We carefully examined their characteristics (including symptoms, root causes, and distribution), and their triggering test cases. Our study leads to seven bug-revealing takeaways for Solidity compiler. Moreover, to study the limitations of Solidity compiler fuzzers and bring our findings into practical scenarios, we evaluate three Solidity compiler fuzzers on our constructed benchmark. The results show that these fuzzers are inefficient in detecting Solidity compiler bugs. The inefficiency arises from their failure to consider the interesting bug-inducing features, bug-related compilation flags, and test oracles.

Publisher's Version
Foliage: Nourishing Evolving Software by Characterizing and Clustering Field Bugs
Zhanyao Lei, Yixiong Chen, Mingyuan Xia, and Zhengwei Qi
(Shanghai Jiao Tong University, Shanghai, China; AppetizerIO, Shanghai, China)
Modern programs, characterized by their complex functionalities, high integration, and rapid iteration cycles, are prone to errors. This complexity poses challenges in program analysis and software testing, making it difficult to achieve comprehensive bug coverage during the development phase. As a result, many bugs are only discovered during the software’s production phase. Tracking and understanding these field bugs is essential but challenging: the uploaded field error reports are extensive, and trivial yet high-frequency bugs can overshadow important low-frequency bugs. Additionally, application codebases evolve rapidly, causing a single bug to produce varied exceptions and stack traces across different code releases. In this paper, we introduce Foliage, a bug tracking and clustering toolchain designed to trace and characterize field bugs in JavaScript applications, aiding developers in locating and fixing these bugs. To address the challenges of efficiently tracking and analyzing the dynamic and complex nature of software bugs, Foliage proposes an error message enhancement technique. Foliage also introduces the verbal-characteristic-based clustering technique, along with three evaluation metrics for bug clustering: V-measure, cardinality bias, and hit rate. The results show that Foliage’s verbal-characteristic-based bug clustering outperforms previous bug clustering approaches by an average of 31.1% across these three metrics. We present an empirical study of Foliage applied to a complex real-world application over a two-year production period, capturing over 250,000 error reports and clustering them into 132 unique bugs. Finally, we open-source a bug dataset consisting of real and labeled error reports, which can be used to benchmark bug clustering techniques.

Publisher's Version
Towards More Complete Constraints for Deep Learning Library Testing via Complementary Set Guided Refinement
Gwihwan Go, Chijin Zhou, Quan Zhang, Xiazijian Zou, Heyuan Shi, and Yu Jiang
(Tsinghua University, China; Central South University, China)
Deep learning library is important in AI systems. Recently, many works have been proposed to ensure its reliability. They often model inputs of tensor operations as constraints to guide the generation of test cases. However, these constraints may narrow the search space, resulting in incomplete testing. This paper introduces a complementary set-guided refinement that can enhance the completeness of constraints. The basic idea is to see if the complementary set of constraints yields valid test cases. If so, the original constraint is incomplete and needs refinement. Based on this idea, we design an automatic constraint refinement tool, DeepConstr, which adopts a genetic algorithm to refine constraints for better completeness. We evaluated it on two DL libraries, PyTorch and TensorFlow. DeepConstr discovered 84 unknown bugs, out of which 72 were confirmed, with 51 fixed. Compared to state-of-the-art fuzzers, DeepConstr increased coverage for 43.44% of operators supported by NNSmith, and 59.16% of operators supported by NeuRI.

Publisher's Version Published Artifact Artifacts Available
Prospector: Boosting Directed Greybox Fuzzing for Large-Scale Target Sets with Iterative Prioritization
Zhijie Zhang, Liwei Chen, Haolai Wei, Gang Shi, and Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Directed grey-box fuzzing (DGF) is an advanced technique in security testing, specifically designed to guide fuzzing tools toward predefined target sites within a software program. To improve its scalability on multiple targets, recent DGFs prioritize seeds that close to targets based on a more precise distance metric, and dynamically discard well-explored targets, thus steering toward all targets simultaneously. However, not all targets hold equal importance, particularly when facing large-scale target sets. Therefore, current works that blindly tracking all targets diverts computing resources from critical targets, thereby reducing the overall efficiency of triggering targets. In this paper, we present Prospector, a novel DGF approach that can handle large-scale target sets scenarios. Prospector employs an iterative process to focus on a select group of ”focused targets”. To dynamically maintain these targets, Prospector present a more fine-grained strategy that considers the vulnerable patterns and test adequacy of targets. Subsequently, Prospector further sharpens its fuzzing approach toward ”focused targets” by refining strategies in explore-exploit scheduling, seed selection, and byte scheduling. We evaluate Prospector on 24 programs by setting all sanitizer labels as targets. The experimental results show that Prospector exposed bugs faster than AFL++, WindRanger, ParmeSan, and FishFuzz by 125, 141, 84, and 100 cases, respectively. Among 38 unique bugs in the program group with the largest target sets, Prospector reproduces 18 (47.37%) existing bugs faster than other fuzzers. Prospector also discovered 6 new bugs in 4 real-world programs with 5 CVE IDs assigned.

Publisher's Version Published Artifact Artifacts Available
Bugs in Pods: Understanding Bugs in Container Runtime Systems
Jiongchi Yu, Xiaofei Xie, Cen Zhang, Sen Chen, Yuekang Li, and Wenbo Shen
(Singapore Management University, Singapore; Nanyang Technological University, Singapore; Tianjin University, China; UNSW, Sydney, Australia; Zhejiang University, China)
Container Runtime Systems (CRSs), which form the foundational infrastructure of container clouds, are critically important due to their impact on the quality of container cloud implementations. However, a comprehensive understanding of the quality issues present in CRS implementations remains lacking. To bridge this gap, we conduct the first comprehensive empirical study of CRS bugs. Specifically, we gather 429 bugs from 8,271 commits across dominant CRS projects, including runc, gvisor, containerd, and cri-o. Through manual analysis, we develop taxonomies of CRS bug symptoms and root causes, comprising 16 and 13 categories, respectively. Furthermore, we evaluate the capability of popular testing approaches, including unit testing, integration testing, and fuzz testing in detecting these bugs. The results show that 78.79% of the bugs cannot be detected due to the lack of test drivers, oracles, and effective test cases. Based on the findings of our study, we present implications and future research directions for various stakeholders in the domain of CRSs. We hope that our work can lay the groundwork for future research on CRS bug detection.

Publisher's Version
Automated Data Binding Vulnerability Detection for Java Web Frameworks via Nested Property Graph
Xiaoyong Yan, Biao He, Wenbo Shen, Yu Ouyang, Kaihang Zhou, Xingjian Zhang, Xingyu Wang, Yukai Cao, and Rui Chang
(Zhejiang University, China; Ant Group, China)
Data binding has been widely adopted by popular web frameworks due to its convenience of automatically binding web request parameters to the web program's properties. However, its improper implementation in web frameworks exposes sensitive properties, leading to data binding vulnerabilities, which can be exploited to launch severe attacks, such as the Spring4Shell remote code execution. Despite their criticalness, these issues are overlooked, and there is no systematic study addressing them. This paper presents the first automatic analysis of the data binding vulnerabilities in Java web frameworks. We develop an automatic Data bInding Vulnerabilities dEtectoR, named DIVER, to analyze data binding vulnerabilities. DIVER employs three new techniques: the Nested Property Graph-based Extraction to extract nested properties, the Bind-Site Instrumentation-based Identification to identify bindable nested properties, and the Property-aware Fuzzing to trigger and detect data binding vulnerabilities. We evaluated DIVER on two widely used Java web frameworks, Spring and Grails, and discovered 81 data binding vulnerabilities. These vulnerabilities can be exploited to launch remote code execution, arbitrary file read, and denial of service attacks. We have responsibly reported these vulnerabilities to the corresponding teams and helped to fix them. Three new CVEs with critical and high severity ratings have been assigned to us, including the infamous Spring4Shell.

Publisher's Version Published Artifact Artifacts Available
SelfPiCo: Self-Guided Partial Code Execution with LLMs
Zhipeng Xue, Zhipeng Gao, Shaohua Wang, Xing Hu, Xin Xia, and Shanping Li
(Zhejiang University, China; Central University of Finance and Economics, China)
Code executability plays a vital role in software debugging and testing (e.g., detecting runtime exceptions or assertion violations). However, code execution, especially partial or arbitrary code execution, is a non-trivial task due to missing definitions and complex third-party dependencies. To make partial code (such as code snippets posted on the web or code fragments deep inside complex software projects) executable, the existing study has proposed a machine learning model to predict the undefined element types and inject the pre-defined dummy values into execution. However, the performance of their tool is limited due to its simply designed dummy values and the inability to continue learning. In this paper, we design and implement a novel framework, named SelfPiCo (Self-Guided Partial Code Executor), to dynamically guide partial code execution by incorporating the open-source LLM (i.e., Code Llama) within an interactive loop. Particularly, SelfPiCo leverages few-shot in-context learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning based on fine-tuning the Code Llama model. SelfPiCo continuously learns from code execution results and refines its predictions step after step. Our evaluations demonstrate that SelfPiCo can execute 72.7% and 83.3% of all lines in the open-source code and Stack Overflow snippets, outperforming the most recent state-of-the-art Lexecutor by 37.9% and 33.5%, respectively. Moreover, SelfPiCo successfully detected 18 and 33 runtime type error issues by executing the partial code from eight GitHub software projects and 43 Stack Overflow posts, demonstrating the practical usage and potential application of our framework in practice.

Publisher's Version
Neurosymbolic Repair of Test Flakiness
Yang Chen and Reyhaneh Jabbarvand
(University of Illinois at Urbana-Champaign, USA)
Test flakiness, a non-deterministic behavior of builds irrelevant to code changes, is a major and continuing impediment to deliver- ing reliable software. The very few techniques for the automated repair of test flakiness are specifically crafted to repair either Order- Dependent (OD) or Implementation-Dependent (ID) flakiness. They are also all symbolic approaches, i.e., they leverage program analy- sis to detect and repair known test flakiness patterns and root causes, failing to generalize. To bridge the gap, we propose FlakyDoctor, a neuro-symbolic technique that combines the power of LLMs— generalizability—and program analysis—soundness—to fix different types of test flakiness. Our extensive evaluation using 873 confirmed flaky tests (332 OD and 541 ID) from 243 real-world projects demonstrates the ability of FlakyDoctor in repairing flakiness, achieving 57% (OD) and 59% (ID) success rate. Comparing to three alternative flakiness repair approaches, FlakyDoctor can repair 8% more ID tests than DexFix, 12% more OD flaky tests than ODRepair, and 17% more OD flaky tests than iFixFlakies. Regardless of underlying LLM, the non-LLM components of FlakyDoctor contribute to 12–31 % of the overall performance, i.e., while part of the FlakyDoctor power is from using LLMs, they are not good enough to repair flaky tests in real-world projects alone. What makes the proposed technique superior to related research on test flakiness mitigation specifically and program repair, in general, is repairing 79 previously unfixed flaky tests in real-world projects. We opened pull requests for all cases with corresponding patches; 19 of them were accepted and merged at the time of submission.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Inconsistencies in TeX-Produced Documents
Jovyn Tan and Manuel Rigger
(National University of Singapore, Singapore)
TeX is a widely-used typesetting system adopted by most publishers and professional societies. While TeX is responsible for generating a significant number of documents, irregularities in the TeX ecosystem may produce inconsistent documents. These inconsistencies may occur across different TeX engines or different versions of TeX distributions, resulting in failures to adhere to formatting specifications, or the same document rendering differently for different authors. In this work, we investigate and quantify the robustness of the TeX ecosystem through a large-scale study of 432 documents. We developed an automated pipeline to evaluate the cross-engine and cross-version compatibility of the TeX ecosystem. We found significant inconsistencies in the outputs of different TeX engines: only 0.2% of documents compiled to identical output with XeTeX and PDFTeX due to a lack of cross-engine support in popular LaTeX packages and classes used in academic conferences. A smaller—yet significant—extent of inconsistencies was found across different TeX Live distributions, with only 42.1% of documents producing the same output from 2020 to 2023. Our automated pipeline additionally reduces the human effort in bug-finding: from a sample of 10 unique root causes of inconsistencies, we identified two new bugs in LaTeX packages and five existing bugs that were fixed independently of this study. We also observed potentially unintended inconsistencies across different TeX Live distributions beyond the updates listed in changelogs. We expect that this study will help authors of TeX documents to avoid unexpected outcomes by understanding how they may be affected by the often undocumented subtleties of the TeX ecosystem, while benefiting developers by demonstrating how different implementations result in unintended inconsistencies.

Publisher's Version Published Artifact Artifacts Available
CoSec: On-the-Fly Security Hardening of Code LLMs via Supervised Co-decoding
Dong Li, Meng Yan, Yaosheng Zhang, Zhongxin Liu, Chao Liu, Xiaohong Zhang, Ting Chen, and David Lo
(Chongqing University, China; Zhejiang University, China; University of Electronic Science and Technology of China, China; Singapore Management University, Singapore)
Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, due to its nature of pretraining on massive uncritically filtered data, prior studies have shown that code LLMs are prone to generate code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting data without vulnerability and subsequently retraining or fine-tuning the model. As the number of parameters exceeds a billion, the computation and data demands of the above approaches will be enormous. Moreover, an increasing number of code LLMs tend to be distributed as services, where the internal representation is not accessible, and the API is the only way to reach the LLM, making the prior mitigation strategies non-applicable. To cope with this, we propose CoSec, an on-the-fly Security hardening method of code LLMs based on security model-guided Co-decoding, to reduce the likelihood of code LLMs to generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained secure model has higher confidence for secure tokens, it guides the generation of the target base model towards more secure code generation. By adjusting the probability distributions of tokens during each step of the decoding process, our approach effectively influences the tendencies of generation without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameters in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective in security hardening. Specifically, our approach improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.

Publisher's Version
Following the “Thread”: Toward Finding Manipulatable Bottlenecks in Blockchain Clients
Shuohan Wu, Zihao Li, Hao Zhou, Xiapu Luo, Jianfeng Li, and Haoyu Wang
(Hong Kong Polytechnic University, China; Xi’an Jiaotong University, China; Huazhong University of Science and Technology, China)
Blockchain clients are the fundamental element of the blockchain network, each keeping a copy of the blockchain’s ledger. They play a crucial role in ensuring the network’s decentralization, integrity, and stability. As complex software systems, blockchain clients are not exempt from bottlenecks. Some bottlenecks create new attack surfaces, where attackers deliberately overload these weak points to congest client’s execution, thereby causing denial of service (DoS). We call them manipulatable bottlenecks. Existing research primarily focuses on a few such bottlenecks, and heavily relies on manual analysis. To the best of our knowledge, there has not been any study proposing a systematic approach to identify manipulatable bottlenecks in blockchain clients. To bridge the gap, this paper delves into the primary causes of bottlenecks in software, and develops a novel tool named ThreadNeck to monitor the symptoms that signal these issues during client runtime. ThreadNeck models the clients as a number of threads, delineating their inter-relationship to accurately characterize the client’s behavior. Building on this, we can identify the suspicious bottlenecks and determine if they could be exploited by external attackers. After applying ThreadNeck to four mainstream clients developed in different programming languages, we totally discover 13 manipulatable bottlenecks, six of which are previously unknown.

Publisher's Version
CooTest: An Automated Testing Approach for V2X Communication Systems
An Guo, Xinyu Gao, Zhenyu Chen, Yuan Xiao, Jiakai Liu, Xiuting Ge, Weisong Sun, and Chunrong Fang
(Nanjing University, China)
Perceiving the complex driving environment precisely is crucial to the safe operation of autonomous vehicles. With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) collaboration has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. However, despite spectacular progress, several communication challenges can undermine the effectiveness of multi-vehicle cooperative perception. The low interpretability of Deep Neural Networks (DNNs) and the high complexity of communication mechanisms make conventional testing techniques inapplicable for the cooperative perception of autonomous driving systems (ADS). Besides, the existing testing techniques, depending on manual data collection and labeling, become time-consuming and prohibitively expensive. In this paper, we design and implement CooTest, the first automated testing tool of the V2X-oriented cooperative perception module. CooTest devises the V2X-specific metamorphic relation and equips communication and weather transformation operators that can reflect the impact of the various cooperative driving factors to produce transformed scenes. Furthermore, we adopt a V2X-oriented guidance strategy for the transformed scene generation process and improve testing efficiency. We experiment CooTest with multiple cooperative perception models with different fusion schemes to evaluate its performance on different tasks. The experiment results show that CooTest can effectively detect erroneous behaviors under various V2X-oriented driving conditions. Also, the results confirm that CooTest can improve detection average precision and decrease misleading cooperation errors by retraining with the generated scenes.

Publisher's Version
Interoperability in Deep Learning: A User Survey and Failure Analysis of ONNX Model Converters
Purvish Jajal, Wenxin Jiang, Arav Tewari, Erik Kocinare, Joseph Woo, Anusha Sarraf, Yung-Hsiang Lu, George K. Thiruvathukal, and James C. Davis
(Purdue University, USA; Loyola University Chicago, USA)
Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ∼75% of the defects and 33% of reported failure are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics would be fruitful.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
TeDA: A Testing Framework for Data Usage Auditing in Deep Learning Model Development
Xiangshan Gao, Jialuo Chen, Jingyi Wang, Jie Shi, Peng Cheng, and Jiming Chen
(Zhejiang University, China; Huawei Technology, China; Huawei International, Singapore; Hangzhou Dianzi University, China)
It is notoriously challenging to audit the potential unauthorized data usage in deep learning (DL) model development lifecycle, i.e., to judge whether certain private user data has been used to train or fine-tune a DL model without authorization. Yet, such data usage auditing is crucial to respond to the urgent requirements of trustworthy Artificial Intelligence (AI) such as data transparency, which are promoted and enforced in recent AI regulation rules or acts like General Data Protection Regulation (GDPR) and EU AI Act. In this work, we propose TeDA, a simple and flexible testing framework for auditing data usage in DL model development process. Given a set of user’s private data to protect (Dp), the intuition of TeDA is to apply membership inference (with good intention) for judging whether the model to audit (Ma) is likely to be trained with Dp. Notably, to significantly expose the usage under membership inference, TeDA applies imperceptible perturbation directed by boundary search to generate a carefully crafted test suite Dt (which we call ‘isotope’) based on Dp. With the test suite, TeDA then adopts membership inference combined with hypothesis testing to decide whether a user’s private data has been used to train Ma with statistical guarantee. We evaluated TeDA through extensive experiments on ranging data volumes across various model architectures for data-sensitive face recognition and medical diagnosis tasks. TeDA demonstrates high feasibility, effectiveness and robustness under various adaptive strategies (e.g., pruning and distillation).

Publisher's Version
Enhancing Multi-agent System Testing with Diversity-Guided Exploration and Adaptive Critical State Exploitation
Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, and Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; Singapore Management University, Singapore)
Multi-agent systems (MASs) have achieved remarkable success in multi-robot control, intelligent transportation, and multiplayer games, etc. Thorough testing for MAS is urgently needed to ensure its robustness in the face of constantly changing and unexpected scenarios. Existing methods mainly focus on single-agent system testing and cannot be directly applied to MAS testing due to the complexity of MAS. To our best knowledge, there are fewer studies on MAS testing. While several studies have focused on adversarial attacks on MASs, they primarily target failure detection from an attack perspective, i.e., discovering failure scenarios, while ignoring the diversity of scenarios. In this paper, to highlight a typical balance between exploration (diversifying behaviors) and exploitation (detecting failures), we propose an advanced testing framework for MAS called with diversity-guided exploration and adaptive critical state exploitation. It incorporates both individual diversity and team diversity, and designs an adaptive perturbation mechanism to perturb the action at the critical states, so as to trigger more and more diverse failure scenarios of the system. We evaluate MASTest on two popular MAS simulation environments: Coop Navi and StarCraft II. Results show that the average distance of the resulting failure scenarios is increased by 29.55%-103.57% and 74.07%-370.00% on two environments compared to the baselines. Also, the failure patterns found by MASTest are improved by 71.44%-300.00% and 50%-500.00% on two experimental environments compared to the baselines.

Publisher's Version
Reproducing Timing-Dependent GUI Flaky Tests in Android Apps via a Single Event Delay
Xiaobao Cai, Zhen Dong, Yongjiang Wang, Abhishek Tiwari, and Xin Peng
(Fudan University, China; University of Passau, Germany)
Flaky tests hinder the development process by exhibiting uncertain behavior in regression testing. A flaky test may pass in some runs and fail in others while running on the same code version. The non-deterministic outcome frequently misleads the developers into debugging non-existent faults in the code. To effectively debug the flaky tests, developers need to reproduce them. The industry de facto to reproduce flaky tests is to rerun them multiple times. However, rerunning a flaky test numerous times is time and resource-consuming. This work presents a technique for rapidly and reliably reproducing timing-dependent GUI flaky tests, acknowledged as the most common type of flaky tests in Android apps. Our insight is that flakiness in such tests often stems from event racing on GUI data. Given stack traces of a failure, our technique employs dynamic analysis to infer event races likely leading to the failure and reproduces it by selectively delaying only relevant events involved in these races. Thus, our technique can efficiently reproduce a failure within minimal test runs. The experiments conducted on 80 timing-dependent flaky tests collected from 22 widely-used Android apps show our technique is efficient in flaky test failure reproduction. Out of the 80 flaky tests, our technique could successfully reproduce 73 within 1.71 test runs on average. Notably, it exhibited extremely high reliability by consistently reproducing the failure for 20 runs.

Publisher's Version Info
Arfa: An Agile Regime-Based Floating-Point Optimization Approach for Rounding Errors
Jinchen Xu, Mengqi Cui, Fei Li, Zuoyan Zhang, Hongru Yang, Bei Zhou, and Jie Zhao
(Information Engineering University, China; Hunan University, China)
We introduce a floating-point (FP) error optimization approach called Arfa that partitions the domain D of an FP expression fe into regimes and rewrites fe in each regime where fe shows larger errors. First, Arfa seeks a rewrite substitution fo with lower errors across D, whose error distribution is plotted for effective regime inference. Next, Arfa generates an incomplete set of ordered rewrite candidates within each regime of interest, so that searching for the best rewrite substitutions is performed efficiently. Finally, Arfa selects the best rewrite substitution by inspecting the errors of top ranked rewrite candidates, with enhancing precision also considered. Experiments on 56 FPbench examples and four real-life programs show that Arfa not only reduces the maximum and average errors of fe by 4.73 and 2.08 bits on average (and up to 33 and 16 bits), but also exhibits lower errors, sometimes to a significant degree, than Herbie and NumOpt.

Publisher's Version
One-to-One or One-to-Many? Suggesting Extract Class Refactoring Opportunities with Intra-class Dependency Hypergraph Neural Network
Di Cui, Qiangqiang Wang, Yutong Zhao, Jiaqi Wang, Minjie Wei, Jingzhao Hu, Luqiao Wang, and Qingshan Li
(Xidian University, China; University of Central Missouri, USA)
Excessively large classes that encapsulate multiple responsibilities are challenging to comprehend and maintain. Addressing this issue, several Extract Class refactoring tools have been proposed, employing a two-phase process: identifying suitable fields or methods for extraction, and implementing the mechanics of refactoring. These tools traditionally generate an intra-class dependency graph to analyze the class structure, applying hard-coded rules based on this graph to unearth refactoring opportunities. Yet, the graph-based approach predominantly illuminates direct, “one-to-one” relationship between pairwise entities. Such a perspective is restrictive as it overlooks the complex, “one-to-many” dependencies among multiple entities that are prevalent in real-world classes. This narrow focus can lead to refactoring suggestions that may diverge from developers’ actual needs, given their multifaceted nature. To bridge this gap, our paper leverages the concept of intra-class dependency hypergraph to model one-to-many dependency relationship and proposes a hypergraph learning-based approach to suggest Extract Class refactoring opportunities named HECS. For each target class, we first construct its intra-class dependency hypergraph and assign attributes to nodes with a pre-trained code model. All the attributed hypergraphs are fed into an enhanced hypergraph neural network for training. Utilizing this trained neural network alongside a large language model (LLM), we construct a refactoring suggestion system. We trained HECS on a large-scale dataset and evaluated it on two real-world datasets. The results show that demonstrates an increase of 38.5% in precision, 9.7% in recall, and 44.4% in f1-measure compared to 3 state-of-the-art refactoring tools including JDeodorant, SSECS, and LLMRefactor, which is more useful for 64% of participants. The results also unveil practical suggestions and new insights that benefit existing extract-related refactoring techniques.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
NeuFair: Neural Network Fairness Repair with Dropout
Vishnu Asutosh Dasu, Ashish Kumar, Saeid Tizpaz-Niari, and Gang Tan
(Pennsylvania State University, USA; University of Texas at El Paso, USA)
This paper investigates neuron dropout as a post-processing bias mitigation method for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While DNNs are exceptional at learning statistical patterns from data, they may encode and amplify historical biases. Existing bias mitigation algorithms often require modifying the input dataset or the learning algorithms. We posit that prevalent dropout methods may be an effective and less intrusive approach to improve fairness of pre-trained DNNs during inference. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs via dropouts during inference. Our randomized search is guided by an objective to minimize discrimination while maintaining the model’s utility. We show that NeuFair is efficient and effective in improving fairness (up to 69%) with minimal or no model performance degradation. We provide intuitive explanations of these phenomena and carefully examine the influence of various hyperparameters of NeuFair on the results. Finally, we empirically and conceptually compare NeuFair to different state-of-the-art bias mitigators.

Publisher's Version Published Artifact Artifacts Available
One Size Does Not Fit All: Multi-granularity Patch Generation for Better Automated Program Repair
Bo Lin, Shangwen Wang, Ming Wen, Liqian Chen, and Xiaoguang Mao
(National University of Defense Technology, China; Huazhong University of Science and Technology, China)
Automated program repair aims to automate bug correction and alleviate the burden of manual debugging, which plays a crucial role in software development and maintenance. Recent studies reveal that learning-based approaches have outperformed conventional APR techniques (e.g., search-based APR). Existing learning-based APR techniques mainly center on treating program repair either as a translation task or a cloze task. The former primarily emphasizes statement-level repair, while the latter concentrates on token-level repair, as per our observations. In practice, however, patches may manifest at various repair granularity, including statement, expression, or token levels. Consequently, merely generating patches from a single granularity would be ineffective to tackle real-world defects. Motivated by this observation, we propose Mulpor, a multi-granularity patch generation approach designed to address the diverse nature of real-world bugs. Mulpor comprises three components: statement-level, expression-level, and token-level generator, each is pre-trained to generate correct patches at its respective granularity. The approach involves generating candidate patches from various granularities, followed by a re-ranking process based on a heuristic to prioritize patches. Experimental results on the Defects4J dataset demonstrate that Mulpor correctly repair 92 bugs on Defects4J-v1.2, which achieves 27.0% (20 bugs) and 12.2% (10 bugs) improvement over the previous state-of-the-art NMT-style Rap-Gen and Cloze-style GAMMA. We also studied the generalizability of Mulpor in repairing vulnerabilities, revealing a notable 51% increase in the number of correctly-fixed patches compared with state-of-the-art vulnerability repair approaches. This paper underscores the importance of considering multiple granularities in program repair techniques for a comprehensive strategy to address the diverse nature of real-world software defects. Mulpor, as proposed herein, exhibits promising results in achieving effective and diverse bug fixes across various program repair scenarios.

Publisher's Version Published Artifact Artifacts Available ACM SIGSOFT Distinguished Paper Award
Policy Testing with MDPFuzz (Replicability Study)
Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, and Mathieu Acher
(Simula Research Laboratory, Norway; University of Rennes - Inria - CNRS - IRISA, France)
In recent years, following tremendous achievements in Reinforcement Learning, a great deal of interest has been devoted to ML models for sequential decision-making. Together with these scientific breakthroughs/advances, research has been conducted to develop automated functional testing methods for finding faults in black-box Markov decision processes. Pang et al. (ISSTA 2022) presented a black-box fuzz testing framework called MDPFuzz. The method consists of a fuzzer whose main feature is to use Gaussian Mixture Models (GMMs) to compute coverage of the test inputs as the likelihood to have already observed their results. This guidance through coverage evaluation aims at favoring novelty during testing and fault discovery in the decision model. Pang et al. evaluated their work with four use cases, by comparing the number of failures found after twelve-hour testing campaigns with or without the guidance of the GMMs (ablation study). In this paper, we verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We re-implemented the proposed methodology and evaluated our replication in a large-scale study that extends the original four use cases with three new ones. Furthermore, we compare MDPFuzz and its ablated counterpart with a random testing baseline. We also assess the effectiveness of coverage guidance for different parameters, something that has not been done in the original evaluation. Despite this parameter analysis and unlike Pang et al.’s original conclusions, we find that in most cases, the aforementioned ablated Fuzzer outperforms MDPFuzz, and conclude that the coverage model proposed does not lead to finding more faults.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable ACM SIGSOFT Distinguished Paper Award
Large Language Models Can Connect the Dots: Exploring Model Optimization Bugs with Domain Knowledge-Aware Prompts
Hao Guan, Guangdong Bai, and Yepang Liu
(University of Queensland, Australia; Southern University of Science and Technology, China)
Model optimization, such as pruning and quantization, has become the de facto pre-deployment phase when deploying deep learning (DL) models on resource-constrained platforms. However, the complexity of DL models often leads to non-trivial bugs in model optimizers, known as model optimization bugs (MOBs). These MOBs are characterized by involving complex data types and layer structures inherent to DL models, causing significant hurdles in detecting them through traditional static analysis and dynamic testing techniques. In this work, we leverage Large Language Models (LLMs) with prompting techniques to generate test cases for MOB detection. We explore how LLMs can draw an understanding of the MOB domain from scattered bug instances and generalize to detect new ones, a paradigm we term as concentration and diffusion. We extract MOB domain knowledge from the artifacts of known MOBs, such as their issue reports and fixes, and design knowledge-aware prompts to guide LLMs in generating effective test cases. The domain knowledge of code structure and error description provides precise in-depth depictions of the problem domain, i.e., the concentration, and heuristic directions to generate innovative test cases, i.e., the diffusion. Our approach is implemented as a tool named YanHui and benchmarked against existing few-shot LLM-based fuzzing techniques. Test cases generated by YanHui demonstrate enhanced capability to find relevant API and data combinations for exposing MOBs, leading to an 11.4% increase in generating syntactically valid code and a 22.3% increase in generating on-target code specific to model optimization. YanHui detects 17 MOBs, and among them, five are deep MOBs that are difficult to reveal without our prompting technique.

Publisher's Version Published Artifact Artifacts Available
AutoCodeRover: Autonomous Program Improvement
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury
(National University of Singapore, Singapore)
Researchers have made significant progress in automating the software development process in the past decades. Automated techniques for issue summarization, bug reproduction, fault localization, and program repair have been built to ease the workload of developers. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. program repair to fix bugs) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving Github issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM’s understanding of the issue’s root cause, and effectively retrieve a context via iterative search. The use of spectrum-based fault localization using tests, further sharpens the context, as long as a test-suite is available. Experiments on the recently proposed SWE-bench-lite (300 real-life Github issues) show increased efficacy in solving Github issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported Swe-agent. Interestingly, our approach resolved 57 GitHub issues in about 4 minutes each (pass@1), whereas developers spent more than 2.68 days on average. In addition, AutoCodeRover achieved this efficacy with significantly lower cost (on average, $0.43 USD), compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.

Publisher's Version
See the Forest, not Trees: Unveiling and Escaping the Pitfalls of Error-Triggering Inputs in Neural Network Testing
Yuanyuan Yuan, Shuai Wang, and Zhendong Su
(Hong Kong University of Science and Technology, China; ETH Zurich, Switzerland)
Recent efforts in deep neural network (DNN) testing commonly use error-triggering inputs (ETIs) to quantify DNN errors and to fine-tune the tested DNN for repairing. This study reveals the pitfalls of ETIs in DNN testing. Specifically, merely seeking for more ETIs “traps” the testing campaign into local plateaus, where similar ETIs are continuously generated using a few fixed input transformations. Similarly, fine-tuning the DNN with ETIs, while capable of fixing the exposed DNN mis-predictions, undermines the DNN’s resilience towards certain input transformations. However, these ETI-induced pitfalls have been overlooked in previous research, due to the insufficient input transformations (usually < 10), and we show that the severity of such deceptive phenomena is enlarged when testing DNNs with more and diverse real-life input transformations. This paper presents a comprehensive study on the pitfalls of ETIs in DNN testing. We first augment conventional DNN testing pipelines with a large set of input transformations; the correctness and validity of these new transformations are verified with large-scale human studies. Based on this, we show that launching an endless pursuit for ETIs cannot alleviate the “trapped testing” issue, and the undermined resilience pervasively occurs in many input transformations. Accordingly, we propose a novel and holistic viewpoint over DNN errors: instead of counting which input triggers a DNN mis-prediction, we record which input transformation can generate ETIs. The targeted input property of this transformation, termed erroneous property (EP), counts one DNN error and guides DNN testing (i.e., our new paradigm aims to find more EPs rather than ETIs). Evaluation shows that this EP-oriented testing paradigm significantly expands the explored DNN error space. Moreover, fine-tuning DNNs with EPs effectively improves their resilience towards different input transformations.

Publisher's Version
Practitioners’ Expectations on Automated Test Generation
Xiao Yu, Lei Liu, Xing Hu, Jacky Keung, Xin Xia, and David Lo
(Huawei, China; Xi’an Jiaotong University, China; Zhejiang University, China; City University of Hong Kong, China; Singapore Management University, Singapore)
Automated test generation can help developers craft high-quality software tests while mitigating the manual effort needed for writing test code. Despite significant research efforts in automated test generation for nearly 50 years, there is a lack of clarity about what practitioners expect from automated test generation tools and whether the existing research meets their needs. To address this issue, we follow a mixed-methods approach to gain insights into practitioners' expectations of automated test generation. We first conduct the qualitative analysis from semi-structured interviews with 13 professionals, followed by a quantitative survey of 339 practitioners from 46 countries across five continents. We then conduct a literature review of premier venue papers from 2022 to 2024 (in the last three years) and compare current research findings with practitioners' expectations. From this comparison, we outline future research directions for researchers to bridge the gap between automated test generation research and practitioners' expectations.

Publisher's Version
An Empirical Examination of Fuzzer Mutator Performance
James Kukucka, Luís Pina, Paul Ammann, and Jonathan Bell
(George Mason University, USA; University of Illinois at Chicago, USA; Northeastern University, USA)
Over the past decade, hundreds of fuzzers have been published in top-tier security and software engineering conferences. Fuzzers are used to automatically test programs, ideally creating high-coverage input corpora and finding bugs. Modern “greybox” fuzzers evolve a corpus of inputs by applying mutations to inputs and then executing those new inputs while collecting coverage. New inputs that are “interesting” (e.g. reveal new coverage) are saved to the corpus. Given their non-deterministic nature, the impact of each design decision on the fuzzer’s performance can be difficult to predict. Some design decisions (e.g., ” Should the fuzzer perform deterministic mutations of inputs? ”) are exposed to end-users as configuration flags, but others (e.g., ” What kinds of random mutations to apply to inputs?”) are typically baked into the fuzzer code itself. This paper describes our over 12.5-CPU-year evaluation of the set of mutation operators employed by the popular AFL++ fuzzer, including the havoc phase, splicing, and , exploring the impact of adjusting some of those unexposed configurations. In this experience paper, we propose a methodology for determining different fuzzers’ behavioral diversity with respect to branch coverage and bug detection using rigorous statistical methods. Our key finding is that, across a range of targets, disabling certain mutation operators (some of which were previously “baked-in” to the fuzzer) resulted in inputs that cover different lines of code and reveal different bugs. A surprising result is disabling certain mutators leads to more diverse coverage and allows the fuzzer to find more bugs faster. We call for researchers to investigate seemingly simple design decisions in fuzzers more thoroughly and encourage fuzzer developers to expose more configuration parameters pertaining to these design decisions to end users.

Publisher's Version Published Artifact Artifacts Available
LLM4Fin: Fully Automating LLM-Powered Test Case Generation for FinTech Software Acceptance Testing
Zhiyi Xue, Liangguo Li, Senyue Tian, Xiaohong Chen, Pingping Li, Liangyu Chen, Tingting Jiang, and Min Zhang
(East China Normal University, China; Guotai Junan Securities, China)
FinTech software, crucial for both safety and timely market deployment, presents a compelling case for automated acceptance testing against regulatory business rules. However, the inherent challenges of comprehending unstructured natural language descriptions of these rules and crafting comprehensive test cases demand human intelligence. The emergence of Large Language Models (LLMs) holds promise for automated test case generation, leveraging their natural language processing capabilities. Yet, their dependence on human intervention for effective prompting hampers efficiency. In response, we introduce a groundbreaking, fully automated approach for generating high-coverage test cases from natural language business rules. Our methodology seamlessly integrates the versatility of LLMs with the predictability of algorithmic methods. We fine-tune pre-trained LLMs for improved information extraction accuracy and algorithmically generate comprehensive testable scenarios for the extracted business rules. Our prototype, LLM4Fin, is designed for testing real-world stock-trading software. Experimental results demonstrate LLM4Fin’s superiority over both state-of-the-art LLM, such as ChatGPT, and skilled testing engineers. We achieve remarkable performance, with up to 98.18% and an average of 20%−110% improvement on business scenario coverage, and up to 93.72% on code coverage, while reducing the time cost from 20 minutes to a mere 7 seconds. These results provide robust evidence of the framework’s practical applicability and efficiency, marking a significant advancement in FinTech software testing.

Publisher's Version Published Artifact Artifacts Available
Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation
Jueon Eom, Seyeon Jeong, and Taekyoung Kwon
(Yonsei University, South Korea; Suresofttech, South Korea)
JavaScript interpreters, crucial for modern web browsers, require an effective fuzzing method to identify security-related bugs. However, the strict grammatical requirements for input present significant challenges. Recent efforts to integrate language models for context- aware mutation in fuzzing are promising but lack the necessary coverage guidance to be fully effective. This paper presents a novel technique called CovRL (Coverage-guided Reinforcement Learning) that combines Large Language Models (LLMs) with Reinforcement Learning (RL) from coverage feedback. Our fuzzer, CovRL-Fuzz, integrates coverage feedback directly into the LLM by leveraging the Term Frequency-Inverse Document Frequency (TF-IDF) method to construct a weighted coverage map. This map is key in calculating the fuzzing reward, which is then applied to the LLM-based mutator through reinforcement learning. CovRL-Fuzz, through this approach, enables the generation of test cases that are more likely to discover new coverage areas, thus improving bug detection while minimizing syntax and semantic errors, all without needing extra post-processing. Our evaluation results show that CovRL-Fuzz outperforms the state-of-the-art fuzzers in enhancing code coverage and identifying bugs in JavaScript interpreters: CovRL-Fuzz identified 58 real-world security-related bugs in the latest JavaScript interpreters, including 50 previously unknown bugs and 15 CVEs.

Publisher's Version
Decomposition of Deep Neural Networks into Modules via Mutation Analysis
Ali Ghanbari
(Auburn University, USA)
Recently, several approaches have been proposed for decomposing deep neural network (DNN) classifiers into binary classifier modules to facilitate modular development and repair of such models. These approaches concern only the problem of decomposing classifier models, and some of them rely on the activation patterns of the neurons, thereby limiting their applicability. In this paper, we propose a DNN decomposition technique, named Incite, that uses neuron mutation to quantify the contributions of the neurons to a given output of a model. Then, for each model output, a subgraph induced by the nodes with highest contribution scores for that output are selected and extracted as a module. Incite is agnostic to the type of the model and the activation functions used in its construction, and is applicable to not just classifiers, but to regression models as well. Furthermore, the costs of mutation analysis in Incite has been reduced by heuristic clustering of neurons, enabling its application to models with millions of parameters. Lastly, Incite prunes away the neurons that do not contribute to the outcome of the modules, producing compressed, efficient modules. We have evaluated Incite using 16 DNN models for well-known classification and regression problems and report its effectiveness along combined accuracy (and MAE) of the modules, the overlap in model elements between the modules, and the compression ratio. We observed that, for classification models, Incite, on average, incurs 3.44% loss in accuracy, and the average overlap between the modules is 71.76%, while the average compression ratio is 1.89X. Meanwhile, for regression models, Incite, on average, incurs 18.56% gain in MAE, and the overlap between modules is 80.14%, while the average compression ratio is 1.83X.

Publisher's Version
Empirical Study of Move Smart Contract Security: Introducing MoveScan for Enhanced Analysis
Shuwei Song, Jiachi Chen, Ting Chen, Xiapu Luo, Teng Li, Wenwu Yang, Leqing Wang, Weijie Zhang, Feng Luo, Zheyuan He, Yi Lu, and Pan Li
(University of Electronic Science and Technology of China, China; Sun Yat-sen University, China; Hong Kong Polytechnic University, China; Jiangsu University of Science and Technology, China; BitsLab, Singapore; MoveBit, China)
Move, a programming language for smart contracts, stands out for its focus on security. However, the practical security efficacy of Move contracts remains an open question. This work conducts the first comprehensive empirical study on the security of Move contracts. Our initial step involves collaborating with a security company to manually audit 652 contracts from 92 Move projects. This process reveals eight types of defects, with half previously unreported. These defects present potential security risks, cause functional flaws, mislead users, or waste computational resources. To further evaluate the prevalence of these defects in real-world Move contracts, we present MoveScan, an automated analysis framework that translates bytecode into an intermediate representation (IR), extracts essential meta-information, and detects all eight defect types. By leveraging MoveScan, we uncover 97,028 defects across all 37,302 deployed contracts in the Aptos and Sui blockchains, indicating a high prevalence of defects. Experimental results demonstrate that the precision of MoveScan reaches 98.85%, with an average project analysis time of merely 5.45 milliseconds. This surpasses previous state-of-the-art tools MoveLint, which exhibits an accuracy of 87.50% with an average project analysis time of 71.72 milliseconds, and Move Prover, which has a recall rate of 6.02% and requires manual intervention. Our research also yields new observations and insights that aid in developing more secure Move contracts.

Publisher's Version
Testing Gremlin-Based Graph Database Systems via Query Disassembling
Yingying Zheng, Wensheng Dou, Lei Tang, Ziyu Cui, Yu Gao, Jiansen Song, Liang Xu, Jiaxin Zhu, Wei Wang, Jun Wei, Hua Zhong, and Tao Huang
(Institute of Software at Chinese Academy of Sciences, China; Jinling Institute of Technology, China)
Graph Database Systems (GDBs) support efficiently storing and retrieving graph data, and have become a critical component in many important applications. Many widely-used GDBs utilize the Gremlin query language to create, modify, and retrieve data in graph databases, in which developers can assemble a sequence of Gremlin APIs to perform a complex query. However, incorrect implementations and optimizations of GDBs can introduce logic bugs, which can cause Gremlin queries to return incorrect query results, e.g., omitting vertices in a graph database. In this paper, we propose Query Di sassembling (QuDi), an effective testing technique to automatically detect logic bugs in Gremlin-based GDBs. Given a Gremlin query Q, QuDi disassembles Q into a sequence of atomic graph traversals TList, which shares the equivalent execution semantics with Q. If the execution results of Q and TList are different, a logic bug is revealed in the target GDB. We evaluate QuDi on six popular GDBs, and have found 25 logic bugs in these GDBs, 10 of which have been confirmed as previously-unknown bugs by GDB developers.

Publisher's Version Published Artifact Artifacts Available
Synthesizing Boxes Preconditions for Deep Neural Networks
Zengyu Liu, Liqian Chen, Wanwei Liu, and Ji Wang
(National University of Defense Technology, China)
Deep neural network (DNN) has been increasingly deployed as a key component in safety-critical systems. However, the credibility of DNN components is uncertain due to the absence of formal specifications for their data preconditions, which are essential for ensuring trustworthy postconditions.In this paper, we propose a guess-and-check-based framework PreBoxes to automatically synthesize Boxes sufficient preconditions for DNN concerning rich safety and robustness postconditions.The framework operates in two phases: the guess phase generates potentially complex candidate preconditions through heuristic methods, while the check phase verifies these candidates with formal guarantees.The entire framework supports automatic and adaptive iterative running to obtain weaker preconditions as well.Such resulting preconditions can be leveraged to shield DNN for safety and enhance the interpretability of DNN in application.PreBoxes has been evaluated on over 20 models with 23 trustworthy properties of 4 benchmarks and compared with 3 existing typical schemes.The results show that not only does PreBoxes generally infer weaker non-trivial sufficient preconditions for DNN than others, but also it expands competitive capabilities to handle both complex properties and Non-ReLU complex structured networks.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Logos: Log Guided Fuzzing for Protocol Implementations
Feifan Wu, Zhengxiong Luo, Yanyang Zhao, Qingpeng Du, Junze Yu, Ruikang Peng, Heyuan Shi, and Yu Jiang
(Tsinghua University, China; Beijing University of Posts and Telecommunications, China; Central South University, China)
Network protocols are extensively used in a variety of network devices, making the security of their implementations crucial. Protocol fuzzing has shown promise in uncovering vulnerabilities in these implementations. However traditional methods often require instrumentation of the target implementation to provide guidance, which is intrusive, adds overhead, and can hinder black-box testing. This paper presents Logos, a protocol fuzzer that utilizes non-intrusive runtime log information for fuzzing guidance. Logos first standardizes the unstructured logs and embeds them into a high-dimensional vector space for semantic representation.Then, Logos filters the semantic representation and dynamically maintains a semantic coverage to chart the explored space for customized guidance.We evaluate Logos on eight widely used implementations of well-known protocols. Results show that, compared to existing intrusive or expert knowledge-driven protocol fuzzers, Logos achieves 26.75%-106.19% higher branch coverage within 24 hours. Furthermore, Logos exposed 12 security-critical vulnerabilities in these prominent protocol implementations, with 9 CVEs assigned.

Publisher's Version
Large Language Models for Equivalent Mutant Detection: How Far Are We?
Zhao Tian, Honglin Shu, Dong Wang, Xuejie Cao, Yasutaka Kamei, and Junjie Chen
(Tianjin University, China; Kyushu University, Japan)
Mutation testing is vital for ensuring software quality. However, the presence of equivalent mutants is known to introduce redundant cost and bias issues, hindering the effectiveness of mutation testing in practical use. Although numerous equivalent mutant detection (EMD) techniques have been proposed, they exhibit limitations due to the scarcity of training data and challenges in generalizing to unseen mutants. Recently, large language models (LLMs) have been extensively adopted in various code-related tasks and have shown superior performance by more accurately capturing program semantics. Yet the performance of LLMs in equivalent mutant detection remains largely unclear. In this paper, we conduct an empirical study on 3,302 method-level Java mutant pairs to comprehensively investigate the effectiveness and efficiency of LLMs for equivalent mutant detection. Specifically, we assess the performance of LLMs compared to existing EMD techniques, examine the various strategies of LLMs, evaluate the orthogonality between EMD techniques, and measure the time overhead of training and inference. Our findings demonstrate that LLM-based techniques significantly outperform existing techniques (i.e., the average improvement of 35.69% in terms of F1-score), with the fine-tuned code embedding strategy being the most effective. Moreover, LLM-based techniques offer an excellent balance between cost (relatively low training and inference time) and effectiveness. Based on our findings, we further discuss the impact of model size and embedding quality, and provide several promising directions for future research. This work is the first to examine LLMs in equivalent mutant detection, affirming their effectiveness and efficiency.

Publisher's Version ACM SIGSOFT Distinguished Paper Award
An Empirical Study on Kubernetes Operator Bugs
Qingxin Xu, Yu Gao, and Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Kubernetes is the leading cluster management platform, and within Kubernetes, an operator is an application-specific program that leverages the Kubernetes API to automate operation tasks for managing an application deployed on a Kubernetes cluster. Users can declare a desired state for the managed cluster, specifying their configuration preferences. The operator program is responsible for reconciling the cluster's actual state to align with the desired state. However, the complex, dynamic, and distributed nature of the overall system can introduce operator bugs, and lead to severe consequences, e.g., outages and undesired cluster state. In this paper, we conduct the first comprehensive study on 210 operator bugs from 36 Kubernetes operators. For all the studied bugs, we investigate their root causes, manifestations, impacts and fixing. Our study reveals many interesting findings that can guide the detection and testing of operator bugs, as well as the development of more reliable operators.

Publisher's Version Published Artifact Artifacts Available
Maltracker: A Fine-Grained NPM Malware Tracker Copiloted by LLM-Enhanced Dataset
Zeliang Yu, Ming Wen, Xiaochen Guo, and Hai Jin
(Huazhong University of Science and Technology, China)
As the largest package registry, Node Package Manager (NPM) has become the prime target for various supply chain attacks recently and has been flooded with numerous malicious packages, posing significant security risks to end-users. Learning-based methods have demonstrated promising performance with good adaptability to various types of attacks. However, they suffer from two main limitations. First, they often utilize metadata features or coarse-grained code features extracted at the package level while overlooking complex code semantics. Second, the dataset used to train the model often suffers from a lack of variety both in quantity and diversity, and thus cannot detect significant types of attacks. To address these problems, we introduce Maltracker, a learningbased NPM malware tracker based on fine-grained features empowered by LLM-enhanced dataset. First, Maltracker constructs precise call graphs to extract suspicious functions that are reachable to a pre-defined set of sensitive APIs, and then utilizes community detection algorithm to identify suspicious code gadgets based on program dependency graph, from which fine-grained features are then extracted. To address the second limitation, we extend the dataset using advanced large language models (LLM) to translate malicious functions from other languages (e.g., C/C++, Python, and Go) into JavaScript. Evaluations shows that Maltracker can achieve an improvement of about 12.6% in terms of F1-score at the package level and 31.0% at the function level compared with the SOTA learning-based methods. Moreover, the key components of 𝑀𝑎𝑙𝑡𝑟𝑎𝑐𝑘𝑒𝑟 all contribute to the effectiveness of its performance. Finally, Maltracker has also detected 230 new malicious packages in NPM and received 61 thanks letters, among which some contain new malicious behaviors that cannot be detected by existing tools.

Publisher's Version Info
Characterizing and Detecting Program Representation Faults of Static Analysis Frameworks
Huaien Zhang, Yu Pei, Shuyun Liang, Zezhong Xing, and Shin Hwei Tan
(Hong Kong Polytechnic University, China; Southern University of Science and Technology, China; Concordia University, Canada)
Static analysis frameworks (SAFs) such as Soot and WALA have been a fundamental support in today’s software analysis. They usually adopt various analysis techniques to transform programs into different representations which imply specific properties, e.g., call graph can demonstrate the calling relationships between methods in a program, and users rely on these program representations for further analysis like vulnerability detection and privacy leakage recognition. Hence, providing proper program representation is essential for SAFs. We conducted a systematic empirical study on program representation faults of static analysis frameworks. In our study, we first collect 141 issues from four popular SAFs and summarize their root causes, symptoms, and fix strategies, and reveal nine findings and some implications to avoid and detect program representation faults. Additionally, we implemented an automated testing framework named SAScope based on the metamorphic and differential testing motivated by findings and implications. Overall, SAScope can detect 19 program representation faults where 6 of them have been confirmed or fixed, demonstrating its effectiveness.

Publisher's Version
Calico: Automated Knowledge Calibration and Diagnosis for Elevating AI Mastery in Code Tasks
Yuxin Qiu, Jie Hu, Qian Zhang, and Heng Yin
(University of California at Riverside, USA)
Recent advancements in large language models (LLMs) have exhibited promising capabilities in addressing various tasks such as defect detection and program repair. Despite their prevalence, LLMs still face limitations in effectively handling these tasks. Common strategies to adapt them and improve their performance for specific tasks involve fine-tuning models based on user data or employing in-context learning with examples of desired inputs and outputs. However, they pose challenges for practical adoption due to the need for extensive computational resources, high-quality data, and continuous maintenance. Furthermore, neither strategy can explain or reason about the deficiencies of LLMs in the given tasks. We propose Calico to address the high cost of fine-tuning, eliminate the necessity for task-specific examples, and provide explanations of LLM deficiency. At the heart of Calico is an evolutionary approach that interleaves knowledge calibration and AI deficiency diagnosis. The key essence of Calico is as follows. First, it focuses on identifying knowledge gaps in LLMs’ program comprehension. Second, it conducts automated code refactoring to integrate the overlooked knowledge into the source code for mitigating those gaps. Third, it employs what-if analysis and counterfactual reasoning to determine a minimum set of overlooked knowledge necessary to improve the performance of LLMs in code tasks. We have extensively evaluated Calico over 8,938 programs on three most commonly seen code tasks. Our experimental results show that vanilla ChatGPT cannot fully understand code structures. With knowledge calibration, Calico improves it by 20% and exhibits comparable proficiency compared to fine-tuned LLMs. Deficiency diagnosis contributes to 8% reduction in program sizes while ensuring performance. These impressive results demonstrate the feasibility of utilizing a vanilla LLM for automated software engineering (SE) tasks, thereby avoiding the high computational costs associated with a fine-tuned model.

Publisher's Version
An In-Depth Study of Runtime Verification Overheads during Software Testing
Kevin Guan and Owolabi Legunsen
(Cornell University, USA)
Runtime verification (RV) monitors program executions against formal specifications (specs). Researchers showed that RV during software testing amplifies the bug-finding ability of tests, and found hundreds of new bugs by using RV to monitor passing tests in open-source projects. But, RV’s runtime overhead is widely seen as a hindrance to its broad adoption, especially during continuous integration. Yet, there is no in-depth study of the prevalence, usefulness for bug finding, and components of these overheads during testing, so that researchers can better understand how to speed up RV. We study RV overhead during testing, monitoring developer-written unit tests in 1,544 open-source projects against 160 specs of correct JDK API usage. We make four main findings. (1) RV overhead is below 12.48 seconds, which others considered acceptable, in 40.9% of projects, but up to 5,002.9x (or, 28.7 hours) in the other projects. (2) 99.87% of monitors that RV generates to dynamically check program traces are wasted; they can only find bugs that the other 0.13% find. (3) Contrary to conventional wisdom, RV overhead in most projects is dominated by instrumentation, not monitoring. (4) 36.74% of monitoring time is spent in test code or libraries. As evidence that our study provides a new basis that future work can exploit, we perform two more experiments. First, we show that offline instrumentation (when possible) greatly reduces RV runtime overhead for single versions of many projects. Second, we show that simply amortizing high instrumentation costs across multiple program versions can outperform, by up to 4.53x, a recent evolution-aware RV technique that uses complex program analysis.

Publisher's Version ACM SIGSOFT Distinguished Paper Award

ISSTA/ECOOP Tool Demonstrations

The Flexcrash Platform for Testing Autonomous Vehicles in Mixed-Traffic Scenarios
Alessio Gambi, Shreya Mathews, Benedikt Steininger, Mykhailo Poienko, and David Bobek
(Austrian Institute of Technology, Austria; IMC University of Applied Sciences Krems, Austria)
Autonomous vehicles (AV) leverage Artificial Intelligence to reduce accidents and improve fuel efficiency while sharing the roads with human drivers. Current AV prototypes have not yet reached these goals, highlighting the need for better development and testing methodologies. AV testing practices extensively rely on simulations, but existing AV tools focus on testing single AV instances or do not consider human drivers. Thus, they might generate many irrelevant mixed-traffic test scenarios. The Flexcrash platform addresses these issues by allowing the generation and simulation of mixed-traffic scenarios, thus enabling testers to identify realistic critical scenarios, traffic experts to create new datasets, and regulators to extend consumer testing benchmarks.

Publisher's Version Video
SeeWasm: An Efficient and Fully-Functional Symbolic Execution Engine for WebAssembly Binaries
Ningyu He, Zhehao Zhao, Hanqin Guan, Jikai Wang, Shuo Peng, Ding Li, Haoyu Wang, Xiangqun Chen, and Yao Guo
(Peking University, China; Huazhong University of Science and Technology, China)
WebAssembly (Wasm), as a compact, fast, and isolation-guaranteed binary format, can be compiled from more than 40 high-level programming languages. However, vulnerabilities in Wasm binaries could lead to sensitive data leakage and even threaten their hosting environments. To identify them, symbolic execution is widely adopted due to its soundness and the ability to automatically generate exploitations. However, existing symbolic executors for Wasm binaries are typically platform-specific, which means that they cannot support all Wasm features. They may also require significant manual interventions to complete the analysis and suffer from efficiency issues as well. In this paper, we propose an efficient and fully-functional symbolic execution engine, named SeeWasm. Compared with existing tools, we demonstrate that SeeWasm supports full-featured Wasm binaries without further manual intervention, while accelerating the analysis by 2 to 6 times. SeeWasm has been adopted by existing works to identify more than 30 0-day vulnerabilities or security issues in well-known C, Go, and SGX applications after compiling them to Wasm binaries.

Publisher's Version Video Info
Testing Concurrent Algorithms on JVM with Lincheck and IntelliJ IDEA
Aleksandr Potapov, Maksim Zuev, Evgenii Moiseenko, and Nikita Koval
(JetBrains, Germany; JetBrains, Netherlands; JetBrains Research, Serbia)
This paper presents an IntelliJ IDEA plugin for Lincheck -- a popular framework for testing concurrent data structures on JVM. The Lincheck framework automatically generates concurrent scenarios and examines them with a model checker, providing a detailed execution trace that reproduces the detected error. This trace includes all shared memory access and synchronization events. The IntelliJ IDEA plugin offers record-and-replay debugging to study the execution trace, providing native debugging experience in the IDE. The Lincheck plugin pauses the failed execution at the beginning and provides additional panels that visualize the failed scenario, the execution trace, and the current state of the data structure. One can step through the trace and reproduce the error, moving forward and backward and observing how the data structure changes. These novel capabilities significantly improve the debugging process, making identifying and fixing complex concurrency bugs easier.

Publisher's Version
DMMPP: Constructing Dummy Main Methods for Android Apps with Path-Sensitive Predicates
Baoquan Cui, Jiwei Yan, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Android is based on an event-driven model, which hides the main method, and is driven by the lifecycle methods and listeners from user interaction. FlowDroid, constructs a dummy main method statically emulating the lifecycle methods. The dummy main method has been widely used by FlowDroid and also other Android analyzers as their entry points. However, the existing dummy main method is not designed for path-sensitive analysis, whose paths may be unsatisfiable. Thus, when using original dummy main methods, path-sensitive analysis, e.g., symbolic execution, may suffer from infeasible paths. In this paper, we present DMMPP, the first dummy main method generator for Android applications with path-sensitive predicates, and the corresponding path condition is satisfiable. DMMPP constructs dummy main methods for the four types of components in an application with a more realistic simulation for the lifecycle methods. The experiment demonstrates the benefits of our tool for path-sensitive analyzers, improving 28.5 times more explored paths with a low time overhead.

Publisher's Version
JCWIT: A Correctness-Witness Validator for Java Programs Based on Bounded Model Checking
Zaiyu Cheng, Tong Wu, Peter Schrammel, Norbert Tihanyi, Eddie B. de Lima Filho, and Lucas C. Cordeiro
(University of Manchester, United Kingdom; University of Sussex, United Kingdom; Eotvos Lorand University, Hungary; TPV Technology, Brazil; Federal University of Amazonas, Manaus, Brazil)
Witness validation is a formal verification method to independently verify software verification tool results, with two main categories: violation and correctness witness validators. Validators for violation witnesses in Java include Wit4Java and GWIT, but no dedicated correctness witness validators exist. To address this gap, this paper presents the Java Correctness-Witness Validator (JCWIT), the first tool to validate correctness witnesses in Java programs. JCWIT accepts an original program, a specification, and a correctness witness as inputs. Then, it uses invariants of each witness’s execution state as conditions to be incorporated into the original program in the form of assertions, thus instrumenting it. Next, JCWIT employs an established tool, Java Bounded Model Checker (JBMC), to verify the transformed program, hence examining the reproducibility of correct witness results. We evaluated JCWIT in the SV-COMP ReachSafety benchmark, and the results show that JCWIT can correctly validate the correctness witnesses generated by Java verifiers.

Publisher's Version Archive submitted (150 kB) Video Info
ESBMC-Python: A Bounded Model Checker for Python Programs
Bruno Farias, Rafael Menezes, Eddie B. de Lima Filho, Youcheng Sun, and Lucas C. Cordeiro
(University of Manchester, United Kingdom; Federal University of Amazonas, Manaus, Brazil; TPV Technology, Brazil)
This paper introduces a tool for verifying Python programs, which, using type annotation and front-end processing, can harness the capabilities of a bounded model-checking (BMC) pipeline. It transforms an input program into an abstract syntax tree to infer and add type information. Then, it translates Python expressions and statements into an intermediate representation. Finally, it converts this description into formulae evaluated with satisfiability modulo theories (SMT) solvers. The proposed approach was realized with the efficient SMT-based bounded model checker (ESBMC), which resulted in a tool called ESBMC-Python, the first BMC-based Python-code verifier. Experimental results, with a test suite specifically developed for this purpose, showed its effectiveness, where successful and failed tests were correctly evaluated. Moreover, it found a real problem in the Ethereum Consensus Specification.

Publisher's Version
PolyTracker: Whole-Input Dynamic Information Flow Tracing
Evan Sultanik, Marek Surovič, Henrik Brodin, Kelly Kaoudis, Facundo Tuesca, Carson Harmon, Lisa Overall, Joseph Sweeney, and Bradford Larsen
(Trail of Bits, USA; Trail of Bits, Czechia; Trail of Bits, Sweden; Trail of Bits, Netherlands)
We present PolyTracker, a whole-program, whole-input dynamic information flow tracing (DIFT) framework. Given an LLVM compatible codebase or a binary that has been lifted to LLVM intermediate representation (IR), PolyTracker compiles it, adding static instrumentation. The instrumented program will run normally with modest performance overhead, but will additionally output a runtime trace artifact in the co-designed TDAG (Tainted Directed Acyclic Graph) format. TDAGs can be post-processed for a variety of analyses, including tracing every input byte through program execution. TDAGs can be generated either by running the program over a corpus of inputs or by employing a randomized input generator such as a fuzzer. PolyTracker traces (TDAGs) are useful not only for very localized, targeted dynamic program analysis as with smaller-scale DIFT: TDAGs are primarily intended for whole-program runtime exploration and bug finding, granular information-flow diffing between program runs, and comparisons of implementations of the same input specification without any need to emulate and instrument the entire running environment. For user-friendliness and reproducibility, the software repository provides a number of examples of PolyTracker-instrumented builds of popular open-source software projects. We also provide an analysis library and REPL written in Python that are designed to assist users with operating over TDAGs.

Publisher's Version Video Info
FRAFOL: FRAmework FOr Learning mutation testing
Pedro Tavares, Ana Paiva, Domenico Amalfitano, and René Just
(University of Porto, Portugal; Federico II University of Naples, Italy; University of Washington, USA)
Mutation testing has evolved beyond academic research, is deployed in industrial and open-source settings, and is increasingly part of universities' software engineering curricula. While many mutation testing tools exist, each with different strengths and weaknesses, integrating them into educational activities and exercises remains challenging due to the tools' complexity and the need to integrate them into a development environment. Additionally, it may be desirable to use different tools so that students can explore differences, e.g., in the types or numbers of generated mutants. Asking students to install and learn multiple tools would only compound technical complexity and likely result in unwanted differences in how and what students learn. This paper presents FRAFOL, a framework for learning mutation testing. FRAFOL provides a common environment for using different mutation testing tools in an educational setting.

Publisher's Version Video Info
HECS: A Hypergraph Learning-Based System for Detecting Extract Class Refactoring Opportunities
Luqiao Wang, Qiangqiang Wang, Jiaqi Wang, Yutong Zhao, Minjie Wei, Zhou Quan, Di Cui, and Qingshan Li
(Xidian University, China; University of Central Missouri, USA)
HECS is an advanced tool designed for Extract Class refactoring by leveraging hypergraph learning to model complex dependencies within large classes. Unlike traditional tools that rely on direct one-to-one dependency graphs, HECS uses intra-class dependency hypergraphs to capture one-to-many relationships. This allows HECS to provide more accurate and relevant refactoring suggestions. The tool constructs hypergraphs for each target class, attributes nodes using a pre-trained code model, and trains an enhanced hypergraph neural network. Coupled with a large language model, HECS delivers practical refactoring suggestions. In evaluations on large-scale and real-world datasets, HECS achieved a 38.5% increase in precision, 9.7% in recall, and 44.4% in f1-measure compared to JDeodorant, SSECS, and LLMRefactor. These improvements make HECS a valuable tool for developers, offering practical insights and enhancing existing refactoring techniques.

Publisher's Version
FixCheck: A Tool for Improving Patch Correctness Analysis
Facundo Molina, Juan Manuel Copia, and Alessandra Gorla
(IMDEA Software Institute, Spain; Universidad Politécnica de Madrid, Spain)
Patch correctness assessment aims at effectively detecting overfitted patches, i.e., patches that causes all tests to pass but do not actually fix the bug. Although several automated techniques for assessing patch correctness have been proposed, these techniques typically yield a binary result (correct/incorrect) without providing any additional information explaining the rationale behind the decision of classifying a patch as correct or incorrect. This tool demo paper presents FixCheck, a tool based on static analysis, random testing and Large Language Models (LLMs), that seeks to improve the patch correctness analysis process by providing fault-revealing tests for potentially incorrect patches. To this end, FixCheck first employs static analysis and random testing to generate a comprehensive set of test cases that are similar to the original failing test case. Then, FixCheck relies on LLMs to derive meaningful assertions for each new test case. Finally, FixCheck executes the generated tests, and those that fail are selected and prioritized based on their likelihood of revealing a defect in the patch.

Publisher's Version
Generalized Concurrency Testing Tool for Distributed Systems
Ege Berkay Gulcan, João Neto, and Burcu Kulahcioglu Ozkan
(Delft University of Technology, Netherlands)
Controlled concurrency testing (CCT) is an effective approach for testing distributed system implementations. However, existing CCT tools suffer from the drawbacks of language dependency and the cost of source code instrumentation, which makes them difficult to apply to real-world production systems. We propose DSTest, a generalized CCT tool for testing distributed system implementations. DSTest intercepts messages on the application layer and, hence, eliminates the instrumentation cost and achieves language independence with minimal input. We provide a clean and modular interface to extend DSTest for various event schedulers for CCT. We package DSTest with three well-known event schedulers and validate the tool by applying it to popular production systems.

Publisher's Version Published Artifact Video Artifacts Available
SMBugFinder: An Automated Framework for Testing Protocol Implementations for State Machine Bugs
Paul Fiterău-Broştean, Bengt Jonsson, Konstantinos Sagonas, and Fredrik Tåquist
(Uppsala University, Sweden; National Technical University of Athens, Greece)
Implementations of stateful network protocols must keep track of the presence, order and type of exchanged messages. Any errors, so-called state machine bugs, can compromise security. SMBugFinder provides an automated framework for detecting these bugs in network protocol implementations using black-box testing. It takes as input a state machine model of the protocol implementation which is tested and a catalogue of bug patterns for the protocol conveniently specified as finite automata. It then produces sequences that expose the catalogued bugs in the tested implementation. Connection to a harness allows SMBugFinder to validate these sequences. The technique behind SMBugFinder has been evaluated successfully on DTLS and SSH in prior work. In this paper, we provide a user-level view of the tool using the EDHOC protocol as an example.

Publisher's Version Published Artifact Info Artifacts Available
Panda: A Concurrent Scheduler for Compiler-Based Tools
Xutong Ma, Jiwei Yan, Jun Yan, and Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
The widely-used Compiler-Based Tools (CBT), such as static analyzers, process input source code using data structures inside a compiler. CBTs can be invoked together with compilers by injecting the compilation process. However, it is seldom the best practice for the inconvenience of running various CBTs, the unexpected failures due to interference with compilers, and the efficiency degradation under compilation dependencies. To fill this gap, we propose Panda, an efficient scheduler for C/C++ CBTs. It executes various CBTs in a compilation-independent manner to avoid mutual interference with the build system, and parallels the process based on an estimated makespan to improve the execution efficiency. The assessment indicates that Panda can reduce the total execution time by 19%–47% compared with compilation-coupled execution, with an average 39.03×–52.15× speedup with 64 parallel workers.

Publisher's Version
FunRedisp: A Function Redispatch Tool to Reduce Invocation Gas Fees in Solidity Smart Contracts
Yunqi Liu and Wei Song
(Nanjing University of Science and Technology, China)
FunRedisp is a function dispatch refactoring tool to reduce the overall invocation gas consumption of Solidity smart contracts. It initially extracts all external functions in a contract at the source code level. After identifying the functions which are highly likely to be invoked among these functions by a pre-trained TextCNN model, FunRedisp moves them to the front of the function dispatch at the bytecode level. FunRedisp saves about 125.17 units of gas per contract transaction on the 50 real-world smart contracts, which are randomly selected from Ethereum. The time overhead for refactoring each contract is only 0.37 seconds on average, demonstrating its effectiveness and efficiency.

Publisher's Version Published Artifact Video Artifacts Available

ISSTA/ECOOP Doctoral Symposium

Late PhD Papers

Robustness against the C/C++11 Memory Model
Roy Margalit
(Tel Aviv University, Israel)
Concurrency is extremely useful, however reasoning about the correctness of concurrent programs is more difficult than ever. Sequential consistency (SC) semantics provide the ability to reason about correctness, however these come at a high synchronization cost. C11 introduced a memory model that was supposed to help with these problems, attempting to provide balance between performance and reasoning. However, this model was much more complicated than expected, proving to be a challenge even for domain experts. We propose a method that enables the programmer to reason about correctness with SC semantics without compromising performance. By proving robustness of a program it can only exhibit SC behaviors and can thus be reasoned about with SC semantics.

Publisher's Version
Learning the Effects of Software Changes
Laura Plein
(CISPA Helmholtz Center for Information Security, Germany)
Software development requires several stages of code iterations, each one requiring debugging, testing, localizing and fixing bugs. While several tools have been developed to automate one of those tasks individually, integrating and combining those results still requires a huge manual effort from software developers. Additionally, many approaches, e.g., in Automated Program Repair, are based on specifically curated fix templates that fail to generalize to most complex software projects. To address those challenges, I propose a new research agenda to learn the effects of software changes in order to localize and fix bugs without being limited to a specific project or a specific programming language. Additionally, my approach can be used to predict the effects of software changes. My preliminary results indicate the feasibility of successfully training a model on Python software changes and their effects, that is capable of producing 80% of correct patches and predicting the effect of a change with an accuracy of 81%. My results highlight the potential of learning the effects of software changes for better software development.

Publisher's Version
Soft Verification for Actor Contract Systems
Bram Vandenbogaerde
(Vrije Universiteit Brussel, Belgium)
Design-by-contract is a software engineering practice where programmers annotate program elements with contract specifications that make expectations towards the user and supplier of the program element explicit. This practice has been applied in various contexts such as higher-order programming languages. However, support for contracts in distributed actor programs is limited. Unfortunately, contract specifications need to be checked while executing the program which introduces a substantial overhead. To counter this, soft verification techniques have been proposed to verify (parts of) contract specifications, but have only been applied in the context of sequential programs. The goal of our research is therefore twofold: designing contract languages for distributed actor programs and developing techniques for their soft verification. In this context, we present a work plan and method, and show our preliminary results.

Publisher's Version
From Fault Injection to Formal Verification: A Holistic Approach to Fault Diagnosis in Cyber-Physical Systems
Drishti Yadav
(TU Wien, Austria)
Cyber-Physical Systems (CPSs) face growing complexity, especially in safety-critical areas. Ensuring their correctness is vital to maintain full operational capacity, as undetected failures can be both costly and life-threatening. Therefore, advanced fault diagnosis procedures are essential for thorough CPS testing, enabling accurate fault detection, explanation, and rectification. This doctoral research contributes to the field by developing novel tools and techniques to enhance fault-based testing and diagnosis of CPSs. Our research focuses on testing of CPS dataflow models created in Simulink, validated against strict formal specifications. Our contributions include (i) an automated tool for systematic fault injection, (ii) a bio-inspired global optimization algorithm, (iii) a robust fault localization method, (iv) a novel approach to mutation testing for evaluating test suites against formal properties, and (v) a new coverage criterion tailored for CPS dataflow models. This comprehensive approach offers significant improvements over existing methods, ensuring thorough testing across various scenarios. We validate the effectiveness of our solutions using publicly available benchmarks from various domains. Our findings open new perspectives on CPS testing, laying the foundation for more robust CPSs

Publisher's Version
Shaping Test Inputs in Grammar-Based Fuzzing
José Antonio Zamudio Amaya
(CISPA Helmholtz Center for Information Security, Germany)
Fuzzing is an essential method for finding vulnerabilities. Conventional fuzzing looks across a wide input space, but it cannot handle systems that need intricate and specialized input patterns. Grammar-based fuzzing uses formal grammars to shape the inputs the fuzzer generates. This method is crucial for directing fuzzers to generate complicated inputs that adhere to syntactical requirements. However, existing approaches are biased towards certain input features, leading to significant portions of the solution space being under-explored or ignored. In this paper, we review the state-of-the-art methods, emphasizing the limitations of grammar-based fuzzing, and we provide a first approach for incorporating distribution sampling into fuzzing, accompanied by encouraging first findings. This work can represent a significant step towards achieving comprehensive input space exploration in grammar-based fuzzing, with implications for enhancing the robustness and reliability of the fuzzing targets.

Publisher's Version

Early PhD Papers

Leveraging Natural Language Processing and Data Mining to Augment and Validate APIs
Alix Decrop
(University of Namur, Belgium)
APIs are increasingly prominent for modern web applications, allowing millions of users around the world to access data. Reducing the risk of API defects - and consequently failures - is key, notably for security, availability, and maintainability purposes. Documenting an API is crucial, allowing the user to better understand it. Moreover, API testing techniques often require formal documentation as input. However, documenting is a time-consuming and error-prone task, often overlooked by developers. Natural Language Processing (NLP) could assist API development, as recent Large Language Models (LLMs) demonstrated exceptional abilities to automate tasks based on their colossal training data. Data mining could also be utilized, synthesizing API information scattered across the web. Hence, I present my PhD project aimed at exploring the usage of NLP-related technologies and data mining to augment and validate APIs. The research questions of this PhD project are: (1) What types of APIs can benefit from NLP and data mining assistance? (2) What API problems can be solved with such methods? (3) How effective are the methods (i.e. LLMs) in assisting APIs? (4) How efficient are the methods in assisting APIs (i.e. time and costs)?

Publisher's Version
Decentralized Near-Synchronous Local-First Programming Collaboration
Leon Freudenthaler
(FH Campus Wien, Austria)
This dissertation investigates near-synchronous collaboration in programming environments by means of Conflict-free Replicated Data Types (CRDTs). While screen sharing and video conferencing offer basic features for collaborative programming, IDE-embedded collaboration tools provide precise tracking of source code changes and developer control. CRDTs enable decentralized collaboration, eliminating central server dependencies. However, existing CRDTs fail to preserve user intent and handle complex formats, leading to inefficiencies. Using a Design Science Research approach, this dissertation addresses four key research areas: (1) Identifying the limitations of existing CRDT algorithms in preserving user intent for programming files, (2) Optimizing CRDT implementations for programming collaboration requirements, (3) Determining the appropriate granularity for replicated data types in programming environments, (4) Exploring supervised imitation learning to enhance system capabilities in merging conflicts. This work aims to improve collaborative programming tools, ensuring higher consistency, efficiency, and user satisfaction.

Publisher's Version
Integrating Mutation Techniques to Keep Specification and Source Code in Sync
Kerstin Jacob
(University of Bamberg, Germany)
Most software development processes rely on testing for assuring that a software's source code is kept in sync with its specification during software evolution. In testing, the degree of conformance between source code and specification is measured by the test success rate. However, the practical value of this metric depends on the quality of the test suite, i.e., its completeness wrt. the specification and source code. Mutation testing has proved to be effective in assessing test suite completeness; however related work either considers mutating the source code or the specification. Our research will investigate how approaches based on mutating code and specification can be integrated to keep both in sync during software evolution. We aim to contribute (1) a concise method to measuring the degree of conformance between specification and code with each modification, (2) a technique for the identification and visualization of fine-granular trace links between them, and (3) an approach for suggesting updates of specification and code based on detected deviations between them.

Publisher's Version
Quality Assurance for Non-trivial Systems: Use Case GCC Plugins
Nimantha Kariyakarawana
(KU Leuven, Belgium)
There are software systems that function without human interven- tion. Such systems work according to preset rules, algorithms, or data. In this doctoral research, we aim to explore the testing-related challenges associated with such systems with GCC plugins as our use case. The GCC compiler family is widely known especially as a compiler for C and C++. GCC allows extensions to be added through the GCC Plugin API. GCC plugins execute at the com- piler level and influence the compiler at different compiling stages. Plugins depend on the GCC internal data structures such as AST, GIMPLE, Basic Blocks and RTL. Therefore, it is difficult to test the correctness of a GCC plugin. Testing is essential for ensuring the quality and the correctness of the plugins. The attempts made in the past are insufficient. They depend on testing the plugins from a black-box perspective. Internal issues may remain hidden when only testing from a black-box perspective is considered. Testing the plugins from both white-box and black-box perspectives is essential to guarantee their quality. We intend to shed light on the complex- ity and challenges of GCC plugin testing. We propose a four-tiered approach for GCC plugin testing. The approach consists of static and dynamic testing techniques. The main challenge is white box testing. This research aims to address the challenge and provide a solution by utilising logs focusing on higher test coverage. We intend to utilise the findings of the study with other comparable systems

Publisher's Version
Search-Based Translations for Tensor Operations
Jie Qiu
(Duolingo, USA)
Tensor operations are essential for various tasks, including image processing, deep learning, and scientific computing. Both hardware and software communities are working to make tensor processing more efficient. This includes creating new hardware and developing domain-specific languages (DSLs). A crucial link between these communities is the compiler. In this work, we propose developing efficient compilers that translate programs written in general-purpose languages into tensor operations, enabling them to benefit from these optimizations. Unlike traditional pattern-matching approaches, our method uses program synthesis and verification to find semantically equivalent translations.

Publisher's Version
Automated Testing of Networked Systems Reliability
Michal Rozsíval
(Brno University of Technology, Czechia)
The reliability of a network is a crucial requirement for systems such as IoT, client-server, or cloud-based solutions. Unfortunately, real networks cannot be assumed to be fault-free, especially when considering various hardware problems, performance issues, or malicious attacks. Testing networked systems should therefore include evaluating fault tolerance under various network conditions. The paper presents a doctoral research project on automated verification of networked systems using fault-attack injection using a derived model of network communication.

Publisher's Version
Graph Learning for Extract Class Refactoring
Luqiao Wang
(Xidian University, China)
Extract Class refactoring is essential for decomposing large, complex classes to improve code maintainability and readability. Traditional refactoring tools rely heavily on metrics like cohesion and coupling, which often require expert judgment and are not universally applicable. This research proposes a novel approach leveraging deep class property graphs and advanced graph neural networks to automate the identification of refactoring opportunities. By integrating deep semantic properties and fine-grained structural dependencies, this method aims to reduce reliance on expert knowledge and improve the precision and adaptability of refactoring suggestions. Future work will explore hypergraph learning to capture more complex code relationships, further enhancing the proposed method's effectiveness.

Publisher's Version
Collaboration to Repository-Level Vulnerability Detection
Xin-Cheng Wen
(Harbin Institute of Technology, China)
Large Language Model (LLM)-based methods have proven to be effective for many software engineering domains, with a potential for substantial productivity effective for software vulnerability detection. However, due to the limitation of the length of input contexts of LLM, the existing LLM-based methods mainly focus on detecting function-level and leveraging the in-file context information for vulnerability detection (i.e., intra-procedural vulnerabilities), ignoring the more complex inter-procedural vulnerability detection scenarios in practice. For instance, in real-world scenarios, developers routinely engage with program analysis to detect vulnerabilities that span multiple cross-file information within repositories. Since complex processes tend to have redundancy dependencies from spanning multiple files in the repository level and invoking multiple static analysis tools, the ideal goal of vulnerability detection is to extract the vulnerability-related information from the repository and provide potential possible explanations for vulnerability triggers. However, such a goal is hard to achieve, and thus in this work, we design three works through multi-agent collaboration to approach the goal of repository-level vulnerability detection.

Publisher's Version

proc time: 45.82