33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024),
September 16–20, 2024,
Vienna, Austria
Frontmatter
Papers Round 1
Detecting Build Dependency Errors in Incremental Builds
Jun Lyu, Shanshan Li, He Zhang, Yang Zhang, Guoping Rong, and Manuel Rigger
(Nanjing University, China; National University of Singapore, Singapore)
Incremental and parallel builds performed by build tools such as Make are the heart of modern C/C++ software projects. Their correct and efficient execution depends on build scripts. However, build scripts are prone to errors. The most prevalent errors are missing dependencies (MDs) and redundant dependencies (RDs). The state-of-the-art methods for detecting these errors rely on clean builds (i.e., full builds of a subset of software configurations in a clean environment), which are costly and can take up to a few hours for large-scale projects. To address these challenges, we propose a novel approach called EChecker to detect build dependency errors in the context of incremental builds. The core idea of EChecker is to automatically update actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes in new commits, which avoids clean builds whenever possible. EChecker achieves higher efficiency than methods that rely on clean builds while maintaining effectiveness. We selected 12 representative projects, ranging from small to large, with 240 commits in total (20 per project), and used them to evaluate the effectiveness and efficiency of EChecker against a state-of-the-art build dependency error detection tool. The evaluation shows that EChecker improves the F-1 score by 0.18 over the state-of-the-art method and increases build dependency error detection efficiency by an average of 85.14 times (with a median of 16.30 times). The results demonstrate that EChecker can support practitioners in detecting build dependency errors efficiently.
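The core inference step can be illustrated with a minimal sketch (not EChecker's actual implementation; all names here are hypothetical): scanning a C file's pre-processor `#include` directives yields the headers the file actually depends on, which can be compared against the prerequisites a build script declares to flag candidate missing dependencies (MDs).

```python
import re

# Minimal illustrative sketch: infer a C file's header dependencies from its
# #include directives -- the kind of information that lets a checker update
# actual build dependencies without running a clean build.
_INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)

def inferred_deps(source: str) -> set[str]:
    """Return the set of locally included headers referenced by `source`."""
    return set(_INCLUDE_RE.findall(source))

def missing_deps(source: str, declared: set[str]) -> set[str]:
    """Headers the source actually uses but the build script never declares
    as prerequisites -- i.e., candidate missing dependencies (MDs)."""
    return inferred_deps(source) - declared
```

For example, a file containing `#include "util.h"` whose Makefile rule only declares `foo.h` as a prerequisite would yield `util.h` as a candidate MD; system headers in angle brackets are deliberately ignored in this sketch.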
Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs
Shiwen Shan, Yintong Huo, Yuxin Su, Yichen Li, Dan Li, and Zibin Zheng
(Sun Yat-sen University, China; Chinese University of Hong Kong, China)
Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs in localizing configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end-users to localize the root-cause configuration properties based on logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end-users in coping with configuration errors through log analysis.
To the best of our knowledge, this is the first work to localize root-cause configuration properties for end-users based on Large Language Models (LLMs) and logs. We evaluate the proposed strategy on Hadoop via LogConfigLocalizer and demonstrate its effectiveness, achieving an average accuracy as high as 99.91%. We also demonstrate the effectiveness and necessity of the methodology's different phases by comparing it with two other variants and a baseline tool. Finally, we validate the proposed methodology through a practical case study to demonstrate its effectiveness and feasibility.
FastLog: An End-to-End Method to Efficiently Generate and Insert Logging Statements
Xiaoyuan Xie, Zhipeng Cai, Songqiang Chen, and Jifeng Xuan
(Wuhan University, China; Hong Kong University of Science and Technology, China)
Logs play a crucial role in modern software systems, serving as a means for developers to record essential information for future software maintenance. As the performance of these log-based maintenance tasks heavily relies on the quality of logging statements, various works have been proposed to assist developers in writing appropriate logging statements. However, these works either support only some sub-tasks of the whole activity, or incur a relatively high time cost and may introduce unwanted modifications. To address these limitations, we propose FastLog, which supports the complete logging statement generation and insertion activity at high speed. Specifically, given a program method, FastLog first predicts the insertion position at the finest (token) level, and then generates a complete logging statement to insert. We further use text splitting for long input texts to improve the accuracy of predicting where to insert logging statements. A comprehensive empirical analysis shows that our method outperforms the state-of-the-art approach in both efficiency and output quality, revealing its great potential and practicality for today's real-time intelligent development environments.
Preprint
FortifyPatch: Towards Tamper-Resistant Live Patching in Linux-Based Hypervisor
Zhenyu Ye, Lei Zhou, Fengwei Zhang, Wenqiang Jin, Zhenyu Ning, Yupeng Hu, and Zheng Qin
(Hunan University, China; National University of Defense Technology, China; Southern University of Science and Technology, China; Xinchuang Haihe Laboratory, China)
Linux-based hypervisors in cloud servers suffer from an increasing number of vulnerabilities in the Linux kernel. To address these vulnerabilities in a timely manner while avoiding the economic loss caused by unplanned shutdowns, live patching schemes have been developed. Unfortunately, existing live patching solutions fail to protect patches from post-deployment attacks. In addition, patches that involve changes to global variables can lead to practical issues with existing solutions. To address these problems, we present FortifyPatch, a tamper-resistant live patching solution for Linux-based hypervisors in cloud environments. Specifically, FortifyPatch employs multiple Granule Protection Tables from the Arm Confidential Computing Architecture to protect the integrity of deployed patches. The TrustZone Address Space Controller and Performance Monitor Unit are used to prevent patches from being bypassed, via kernel code protection and timely page table verification. FortifyPatch is also able to patch global variables via well-designed data access traps. We prototype FortifyPatch and evaluate it using real-world CVE patches. The results show that FortifyPatch is capable of deploying 81.5% of CVE patches. The performance evaluation indicates that FortifyPatch protects deployed patches with 0.98% and 3.1% overhead on average across indicative benchmarks and real-world applications, respectively.
Unimocg: Modular Call-Graph Algorithms for Consistent Handling of Language Features
Dominik Helm, Tobias Roth, Sven Keidel, Michael Reif, and Mira Mezini
(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; CQSE, Germany; hessian.AI, Germany)
Traditional call-graph construction algorithms conflate the computation of possible runtime types with the actual resolution of (virtual) calls. This tangled design impedes support for complex language features and APIs, as well as systematic trade-offs between precision, soundness, and scalability. It also impedes the implementation of precise downstream analyses that rely on type information. To address this problem, we propose Unimocg, a modular architecture for call-graph construction that decouples the computation of type information from the resolution of calls. Due to its modular design, Unimocg can combine a wide range of call-graph algorithms with algorithm-agnostic modules that support individual language features. Moreover, these modules operate at the same precision as the chosen call-graph algorithm with no further effort. Additionally, Unimocg allows other analyses to easily reuse type information from call-graph construction at full precision. We demonstrate how Unimocg enables a framework of call-graph algorithms with different precision, soundness, and scalability trade-offs built from reusable modules. Unimocg currently supports ten call-graph algorithms from vastly different families, such as CHA, RTA, XTA, and k-l-CFA. These algorithms show consistent soundness without sacrificing precision or performance. We also show how an immutability analysis is improved using Unimocg.
Artifacts Available
Precise Compositional Buffer Overflow Detection via Heap Disjointness
Yiyuan Guo, Peisen Yao, and Charles Zhang
(Hong Kong University of Science and Technology, China; Zhejiang University, China)
Static analysis techniques for buffer overflow detection still struggle to scale to millions of lines of code while remaining precise enough to achieve an acceptable false positive rate. Checking for buffer overflows necessitates reasoning about heap reachability and numerical relations, which are mutually dependent. Existing techniques to resolve this dependency cycle sacrifice either precision or efficiency due to their limitations in reasoning about symbolic heap locations, i.e., heap locations with possibly symbolic numerical offsets. A symbolic heap location potentially aliases a large number of other heap locations, leading to a disjunction of heap states that is particularly challenging to reason about precisely. Acknowledging the inherent difficulties of heap and numerical reasoning, we introduce a disjointness assumption into the analysis by shrinking the program state space so that all symbolic locations involved in memory accesses are disjoint from each other. The disjointness property permits strong updates to be performed at symbolic heap locations, significantly improving precision by incorporating numerical information into heap reasoning. It also aids in the design of a compositional analysis that boosts scalability, where compact and precise function summaries are efficiently generated and reused. We implement the idea in the static buffer overflow detector Cod. Applying it to large, real-world software such as PHP and QEMU, we uncovered 29 buffer overflow bugs with a false positive rate of 37%, while projects of millions of lines of code can be successfully analyzed within four hours.
Enhancing ROS System Fuzzing through Callback Tracing
Yuheng Shen, Jianzhong Liu, Yiru Xu, Hao Sun, Mingzhe Wang, Nan Guan, Heyuan Shi, and Yu Jiang
(Tsinghua University, China; ETH Zurich, Switzerland; City University of Hong Kong, China; Central South University, China)
The Robot Operating System 2 (ROS) is the de-facto standard for robotic software development, widely applied in diverse safety-critical domains. Many testing efforts seek to deliver a more secure ROS codebase. However, existing testing methods are often inadequate for capturing the complex and stateful behaviors inherent to ROS deployments, resulting in limited testing effectiveness. In this paper, we propose R2D2, a ROS system fuzzer that leverages ROS's runtime states as guidance to increase fuzzing effectiveness and efficiency. Unlike traditional fuzzers, R2D2 employs a systematic instrumentation strategy that captures the system's runtime behaviors and profiles the current system state in real time. This approach provides a more in-depth understanding of system behaviors, thereby facilitating a more insightful exploration of ROS's extensive state space. For evaluation, we applied R2D2 to four well-known ROS applications. Our evaluation shows that R2D2 achieves improvements of 3.91× and 2.56× in code coverage over state-of-the-art ROS fuzzers, namely Ros2Fuzz and RoboFuzz, while also uncovering 39 previously unknown vulnerabilities, of which 6 have been fixed in the ROS runtime and ROS applications. As for runtime overhead, R2D2 incurs average execution-time and memory overheads of 10.4% and 1.0%, respectively, making it practical for ROS testing.
API Misuse Detection via Probabilistic Graphical Model
Yunlong Ma, Wentong Tian, Xiang Gao, Hailong Sun, and Li Li
(Beihang University, China)
API misuses can cause a range of issues in software development, including program crashes, bugs, and vulnerabilities. Different approaches have been developed to automatically detect API misuses by checking programs against usage rules extracted from extensive codebases or API documents. However, these mined rules may not be precise or complete, leading to high false positive/negative rates. In this paper, we propose a novel solution to this problem by representing the mined API usage rules as a probabilistic graphical model, where each rule's probability value represents its trustworthiness.
Our approach automatically constructs probabilistic usage rules by mining codebase and documents, and aggregating knowledge from different sources.
Here, the usage rules obtained from the codebase initialize the probabilistic model, while the knowledge from the documents serves as a supplement for adjusting and complementing the probabilities accordingly.
We evaluate our approach on the MuBench benchmark.
Experimental results show that our approach achieves 42.0% precision and 54.5% recall, significantly outperforming state-of-the-art approaches.
Ma11y: A Mutation Framework for Web Accessibility Testing
Mahan Tafreshipour, Anmol Deshpande, Forough Mehralian, Iftekhar Ahmed, and Sam Malek
(University of California, Irvine, USA)
Despite the availability of numerous automatic accessibility testing solutions, web accessibility issues persist on many websites. Moreover, there is a lack of systematic evaluations of the efficacy of current accessibility testing tools. To address this gap, we present the first mutation analysis framework, called Ma11y, designed to assess web accessibility testing tools. Ma11y includes 25 mutation operators that intentionally violate various accessibility principles and an automated oracle to determine whether a mutant is detected by a testing tool. Evaluation on real-world websites demonstrates the practical applicability of the mutation operators and the framework’s capacity to assess tool performance. Our results demonstrate that the current tools cannot identify nearly 50% of the accessibility bugs injected by our framework, thus underscoring the need for the development of more effective accessibility testing tools. Finally, the framework’s accuracy and performance attest to its potential for seamless and automated application in practical settings.
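To make the idea of an accessibility mutation operator concrete, here is a minimal sketch in the spirit of Ma11y (not its actual implementation; the operator and function name are illustrative): one operator strips `alt` attributes from `<img>` elements, violating the text-alternative accessibility principle, and a testing tool that reports no new issue on the mutant fails to detect this bug class.

```python
import re

# Illustrative mutation operator: remove every alt="..." attribute from
# <img> tags, injecting a text-alternative accessibility violation. An
# oracle would then compare a tool's reports on the original page and on
# the mutant to decide whether the mutant was detected.
def drop_img_alt(html: str) -> str:
    """Remove alt="..." attributes from <img> elements in `html`."""
    return re.sub(r'(<img\b[^>]*?)\s+alt="[^"]*"', r'\1', html)
```

Applied to `<img src="a.png" alt="logo">`, this yields `<img src="a.png">`, a page that looks identical visually but is broken for screen-reader users.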
Total Recall? How Good Are Static Call Graphs Really?
Dominik Helm, Sven Keidel, Anemone Kampkötter, Johannes Düsing, Tobias Roth, Ben Hermann, and Mira Mezini
(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; TU Dortmund, Germany; hessian.AI, Germany)
Static call graphs are a fundamental building block of program analysis.
However, differences in call-graph construction and the use of specific language features can yield unsoundness and imprecision.
Call-graph analyses are evaluated using measures of precision and recall, but this is difficult, as a ground truth for real-world programs is generally unobtainable.
In this work, we propose to use carefully constructed dynamic baselines based on fixed entry points and input corpora.
The creation of this dynamic baseline is posed as an approximation of the ground truth---an optimization problem.
We use manual extension and coverage-guided fuzzing for creating suitable input corpora.
With these dynamic baselines, we study call-graph quality of multiple algorithms and implementations using four real-world Java programs.
We find that our methodology provides valuable insights into call-graph quality and how to measure it.
With this work, we provide a novel methodology to advance the field of static program analysis as we assess the computation of one of its core data structures---the call graph.
Artifacts Available
CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios
Zhengran Zeng, Yidong Wang, Rui Xie, Wei Ye, and Shikun Zhang
(Peking University, China)
In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail to capture the multi-tasking nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining the effects of continued pre-training on code in specific programming languages and of instruction fine-tuning on their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution when applying language-specific continued pre-training and instruction fine-tuning, as these techniques can hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluations of programming capabilities in LLMs, and our study provides valuable insights for the future development of these models in software engineering.
DAppFL: Just-in-Time Fault Localization for Decentralized Applications in Web3
Zhiying Wu, Jiajing Wu, Hui Zhang, Ziwei Li, Jiachi Chen, Zibin Zheng, Qing Xia, Gang Fan, and Yi Zhen
(Sun Yat-sen University, China; Institute of Software at Chinese Academy of Sciences, China; n.n., China)
Web3 describes an idea for the next evolution of the Internet, where blockchain technology enables the Internet of Value. As Web3 software, decentralized applications (DApps) have emerged in recent years. There exists a natural link between DApps and cryptocurrencies, where faults in DApps can directly lead to monetary losses associated with cryptocurrencies. Hence, efficient fault localization technology is of paramount importance for urgent DApp rescue operations and the mitigation of financial losses. However, fault localization methods applied in traditional applications are not well-suited to this specific field, due to their inability to identify DApp-specific fault features, e.g., a substantial amount of cryptocurrency being transferred from DApps to hackers. To explore the root cause of DApp faults, some researchers try to identify suspicious code snippets through mutation testing. Nonetheless, applying mutation testing to DApp fault localization is time-consuming and thus limited in practice. This paper conducts the first comprehensive study of DApp fault localization. We introduce DAppFL, a learning-based DApp fault localization tool that performs reverse engineering to gather executed source code and then traces cryptocurrency flow to assist in locating faulty functions. We also present the inaugural dataset for DApp fault localization, providing a new benchmark for this domain. Our experimental results demonstrate that DAppFL locates 63% of faults within the Top-5, 23% more than the state-of-the-art method. To facilitate further research, our code and dataset are freely available online: https://github.com/xplanet-sysu/awesome-works#dappfl.
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, and Xi Xiao
(Tsinghua University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently; most are embedding-based, but they have shown limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overhead. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. It then utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin provides an effective and efficient solution for detecting similar code (including vulnerable code) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale vulnerability benchmark, offering the first precise evaluation scheme to assess BCSD methods on the 1-day vulnerability detection task. CEBin can identify the similar function from millions of candidate functions in just a few seconds, achieving an impressive recall rate of 85.46% on this more practical but challenging task, which is several orders of magnitude faster and 4.07× better than the best SOTA baseline.
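The fused two-stage idea can be sketched in a few lines (illustrative only, not CEBin's actual models; all names and the toy vectors are hypothetical): a cheap embedding similarity first narrows a large corpus to a shortlist, and only the shortlist is scored with the expensive pairwise comparison.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_emb, corpus, pairwise_score, k=3):
    """Stage 1: keep the k nearest functions by cheap embedding similarity.
    Stage 2: re-rank the shortlist with the costly pairwise comparison and
    return the best match's name."""
    shortlist = sorted(corpus, key=lambda f: cosine(query_emb, f["emb"]),
                       reverse=True)[:k]
    return max(shortlist, key=lambda f: pairwise_score(query_emb, f))["name"]
```

The design point is that the pairwise model, however accurate, is only ever invoked k times per query instead of once per corpus function, which is what makes million-scale search tractable.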
Preprint
Interprocedural Path Complexity Analysis
Mira Kaniyur, Ana Cavalcante-Studart, Yihan Yang, Sangeon Park, David Chen, Duy Lam, and Lucas Bang
(Harvey Mudd College, USA)
Software testing techniques like symbolic execution face significant challenges with path explosion. Asymptotic Path Complexity (APC) quantifies this path explosion complexity, but existing APC methods do not work for interprocedural functions in general. Our new algorithm, APC-IP, efficiently computes APC for a wider range of functions, including interprocedural ones, improving over previous methods in both speed and scope. We implement APC-IP atop the existing software Metrinome, and test it against a benchmark of C functions, comparing it to existing and baseline approaches as well as comparing it to the path explosion of the symbolic execution engine Klee. The results show that APC-IP not only aligns with previous APC values but also excels in performance, scalability, and handling complex source code. It also provides a complexity prediction of the number of paths explored by Klee, extending the APC metric's applicability and surpassing previous implementations.
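The path explosion that APC quantifies can be seen in a toy example (this sketches the phenomenon, not the APC-IP algorithm itself): counting acyclic paths through a control-flow graph given as an adjacency list, where a chain of n independent if/else diamonds yields 2**n paths.

```python
from functools import lru_cache

# Toy illustration of path explosion: count the acyclic paths from `entry`
# to `exit` in a DAG-shaped control-flow graph given as {node: successors}.
def count_paths(cfg, entry, exit):
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == exit:
            return 1
        # Paths from a node are the sum of paths from each successor.
        return sum(paths_from(s) for s in cfg.get(node, ()))
    return paths_from(entry)
```

Two chained if/else diamonds already give 4 paths; symbolic execution must, in the worst case, explore each one, which is why a complexity measure over the whole interprocedural graph is useful.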
Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models
Mingyi Zhou, Xiang Gao, Pei Liu, John Grundy, Chunyang Chen, Xiao Chen, and Li Li
(Monash University, Australia; Beihang University, China; CSIRO’s Data61, Australia; TU Munich, Germany; University of Newcastle, Australia)
Recent studies show that on-device deployed deep learning (DL) models, such as TensorFlow Lite (TFLite) models, can be easily extracted from real-world applications and devices by attackers to generate many kinds of adversarial and other attacks. Although securing deployed on-device DL models has gained increasing attention, no existing methods can fully prevent these attacks. Traditional software protection techniques have been widely explored; if on-device models could be implemented in pure code, such as C++, it would open the possibility of reusing existing robust software protection techniques. However, due to the complexity of DL models, there is no automatic method that can translate DL models to pure code. To fill this gap, we propose a novel method, CustomDLCoder, to automatically extract on-device DL model information and synthesize a customized executable program for a wide range of DL models. CustomDLCoder first parses the DL model, extracts its backend computing code, configures the extracted code, and then generates a customized program to implement and deploy the DL model without an explicit model representation. The synthesized program hides model information in DL deployment environments since it does not need to retain an explicit model representation, preventing many attacks on the DL model. In addition, it improves ML performance because the customized code removes model parsing and preprocessing steps and retains only the data computation process. Our experimental results show that CustomDLCoder improves model security by disabling on-device model sniffing. Compared with the original on-device platform (i.e., TFLite), our method can accelerate model inference by 21.0% and 24.3% on x86-64 and ARM64 platforms, respectively. Most importantly, it can significantly reduce memory consumption by 68.8% and 36.0% on x86-64 and ARM64 platforms, respectively.
Artifacts Available
UPBEAT: Test Input Checks of Q# Quantum Libraries
Tianmin Hu, Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Huanting Wang, Meng Li, and Zheng Wang
(Northwest University, China; Concordia University, Canada; University of Leeds, United Kingdom; Hefei University of Technology, China)
High-level programming models like Q# significantly reduce the complexity of programming for quantum computing. These models are supported by a set of foundation libraries for code development. However, errors can occur in the library implementation, and one common root cause is missing or incomplete checks on properties like the values, length, and quantum states of inputs passed to user-facing subroutines. This paper presents Upbeat, a fuzzing tool that generates random test cases for bugs related to input checking in Q# libraries. Upbeat develops an automated process to extract constraints from the API documentation and from developer-implemented input-checking statements. It leverages open-source Q# code samples to synthesize test programs. It frames test case generation as a constraint satisfaction problem for classical computing and a quantum state model for quantum computing, producing carefully generated subroutine inputs that test whether the input-checking mechanism is appropriately implemented. Over 100 hours of automated test runs, Upbeat identified 16 bugs in API implementations and 4 documentation errors. Of these, 14 have been confirmed, and 12 have been fixed by the library developers.
Enhancing Robustness of Code Authorship Attribution through Expert Feature Knowledge
Xiaowei Guo, Cai Fu, Juan Chen, Hongle Liu, Lansheng Han, and Wenjin Li
(Huazhong University of Science and Technology, China; NSFOCUS Technologies Group, China)
Code authorship attribution has been an interesting research problem for decades. Recent studies have revealed that existing methods for code authorship attribution suffer from weak robustness: under the influence of small perturbations added by an attacker, their accuracy is greatly reduced. To date, no code authorship attribution method can effectively handle such attacks. In this paper, we attribute the weak robustness of code authorship attribution methods to dataset bias and argue that this bias can be mitigated through adjustments to the feature learning strategy. We first propose a robust code authorship attribution feature combination framework, composed of only simple, shallow neural network structures, which makes feature extraction controllable by incorporating expert knowledge. Experiments show that the framework significantly improves robustness over mainstream code authorship attribution methods, with an average drop of 23.4% (from 37.8% to 14.3%) in the success rate of targeted attacks and 25.9% (from 46.7% to 20.8%) in the success rate of untargeted attacks. At the same time, it achieves accuracy comparable to mainstream code authorship attribution methods.
A Large-Scale Empirical Study on Improving the Fairness of Image Classification Models
Junjie Yang, Jiajun Jiang, Zeyu Sun, and Junjie Chen
(Tianjin University, China; Institute of Software at Chinese Academy of Sciences, China)
Fairness is a critical issue that affects the adoption of deep learning models in practice. To improve model fairness, many existing methods have been proposed and shown to be effective in their own contexts. However, there is still no systematic evaluation that compares them comprehensively under the same context, which makes it hard to understand the performance differences among them, hindering research progress and their practical adoption. To fill this gap, this paper conducts the first large-scale empirical study to comprehensively compare the performance of existing state-of-the-art fairness-improving techniques. Specifically, we target the widely used application scenario of image classification and utilize three different datasets and five commonly used performance metrics to assess 13 methods in total from diverse categories. Our findings reveal substantial variations in the performance of each method across different datasets and sensitive attributes, indicating over-fitting to specific datasets by many existing methods. Furthermore, different fairness evaluation metrics, due to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing and in-processing methods outperform post-processing methods, with pre-processing methods exhibiting the best performance. Our empirical study offers comprehensive recommendations for enhancing the fairness of deep learning models. We approach the problem from multiple dimensions, aiming to provide a uniform evaluation platform and to inspire researchers to explore more effective fairness solutions via a set of implications.
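One member of the family of fairness metrics such studies use is statistical parity difference; the sketch below is a hedged illustration for a binary classifier and a binary sensitive attribute (the abstract does not name its five metrics, so this function is an assumption for exposition only).

```python
# Statistical parity difference: the gap in positive-prediction rates
# between the two groups of a binary sensitive attribute. A value of 0
# means demographic parity; larger magnitudes mean larger disparity.
def statistical_parity_difference(preds, groups):
    """P(pred=1 | group=0) - P(pred=1 | group=1) over paired lists of
    binary predictions and binary group labels."""
    def rate(g):
        sel = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(sel) / len(sel)
    return rate(0) - rate(1)
```

Metrics of this kind look only at predictions and group membership, which is one reason different fairness metrics, with their distinct focuses, can rank the same method very differently.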
A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?
Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jieming Zhu, and Michael R. Lyu
(Chinese University of Hong Kong, China; Sun Yat-sen University, China; Huawei Noah’s Ark Lab, China)
Log data have facilitated various tasks of software development and maintenance, such as testing, debugging and diagnosing. Due to the unstructured nature of logs, log parsing is typically required to transform log messages into structured data for automated log analysis. Given the abundance of log parsers that employ various techniques, evaluating these tools to comprehend their characteristics and performance becomes imperative. Loghub serves as a commonly used dataset for benchmarking log parsers, but it suffers from limited scale and representativeness, posing significant challenges for studies to comprehensively evaluate existing log parsers or develop new methods. This limitation is particularly pronounced when assessing these log parsers for production use. To address these limitations, we provide a new collection of annotated log datasets, denoted Loghub-2.0, which can better reflect the characteristics of log data in real-world software systems. Loghub-2.0 comprises 14 datasets with an average of 3.6 million log lines in each dataset. Based on Loghub-2.0, we conduct a thorough re-evaluation of 15 state-of-the-art log parsers in a more rigorous and practical setting. Particularly, we introduce a new evaluation metric to mitigate the sensitivity of existing metrics to imbalanced data distributions. We are also the first to investigate the granular performance of log parsers on logs that represent rare system events, offering in-depth details for software diagnosis. Accurately parsing such logs is essential, yet it remains a challenge. We believe this work could shed light on the evaluation and design of log parsers in practical settings, thereby facilitating their deployment in production systems.
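For readers unfamiliar with the task, the core of log parsing is separating a message's constant template from its variable fields. A deliberately naive sketch follows; the regexes and messages are illustrative and do not represent any parser from the benchmark.

```python
import re

# Minimal illustration of what a log parser does: mask variable fields so that
# messages produced by the same logging statement collapse to one template.
def to_template(message):
    msg = re.sub(r'\b\d+(\.\d+)*\b', '<*>', message)  # numbers, IPs, versions
    msg = re.sub(r'/\S+', '<*>', msg)                 # filesystem paths
    return msg

logs = [
    "Connection from 10.0.0.1 closed after 31 seconds",
    "Connection from 10.0.0.2 closed after 7 seconds",
]
print({to_template(m) for m in logs})  # both messages collapse to one template
```

Real parsers must handle far messier cases (mixed formats, rare templates), which is exactly the gap the paper's Loghub-2.0 datasets and granular evaluation target.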
SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection
Xin-Cheng Wen, Cuiyun Gao, Shuzheng Gao, Yang Xiao, and Michael R. Lyu
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China; Chinese Academy of Sciences, China)
Recently, there has been growing interest in automatic software vulnerability detection. Pre-trained model-based approaches have demonstrated superior performance over other Deep Learning (DL)-based approaches in detecting vulnerabilities. However, existing pre-trained model-based approaches generally employ code sequences as input during prediction and may ignore vulnerability-related structural information, as reflected in the following two aspects. First, they tend to fail to infer the semantics of code statements with complex logic, such as those containing multiple operators and pointers. Second, they struggle to comprehend various code execution sequences, which is essential for precise vulnerability detection. To mitigate these challenges, we propose a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework based on pre-trained models, named SCALE. The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on Abstract Syntax Trees (ASTs). Specifically, SCALE comprises three main modules: (1) Comment Tree Construction, which aims to enhance the model's ability to infer the semantics of code statements by first incorporating Large Language Models (LLMs) for comment generation and then adding the comment nodes to ASTs. (2) Structured Natural Language Comment Tree Construction, which aims to explicitly involve the code execution sequence by combining code syntax templates with the comment tree. (3) SCT-Enhanced Representation, which finally incorporates the constructed SCTs to capture vulnerability patterns. Experimental results demonstrate that SCALE outperforms the best-performing baselines, including pre-trained models and LLMs, with improvements of 2.96%, 13.47%, and 3.75% in terms of F1 score on the FFMPeg+Qemu, Reveal, and SVulD datasets, respectively. Furthermore, SCALE can be applied to different pre-trained models, such as CodeBERT and UniXcoder, yielding F1 score enhancements ranging from 1.37% to 10.87%.
Preprint
Distance-Aware Test Input Selection for Deep Neural Networks
Zhong Li, Zhengfeng Xu, Ruihua Ji, Minxue Pan, Tian Zhang, Linzhang Wang, and Xuandong Li
(Nanjing University, China)
Deep Neural Network (DNN) testing is one of the common practices to guarantee the quality of DNNs. However, DNN testing in general requires a significant amount of test inputs with oracle information (labels), which can be challenging and resource-intensive to obtain. To alleviate this problem, we propose DATIS, a distance-aware test input selection approach for DNNs. Specifically, DATIS adopts a two-step approach to selecting test inputs. In the first step, it selects test inputs based on improved uncertainty scores derived from the distances between the test inputs and their nearest-neighbor training samples. In the second step, it further eliminates test inputs that may cover the same faults by examining the distances among the selected test inputs. To evaluate DATIS, we conduct extensive experiments on 8 diverse subjects, taking into account different domains of test inputs, varied DNN structures, and diverse types of test inputs. Evaluation results show that DATIS significantly outperforms 15 baseline approaches in both selecting test inputs with high fault-revealing power and guiding the selection of data for DNN enhancement.
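The general intuition behind distance-aware selection can be sketched in a few lines: an input far from its nearest training samples is treated as more uncertain and thus more worth labeling. The scoring below is a generic illustration, not DATIS's actual formula, and all names and data are made up.

```python
import math

# Hedged sketch of distance-based uncertainty ranking (illustrative only).
def nearest_distance(x, training_set):
    """Distance from input x to its nearest training sample."""
    return min(math.dist(x, t) for t in training_set)

def select_most_uncertain(candidates, training_set, budget):
    """Pick the `budget` inputs farthest from the training data."""
    ranked = sorted(candidates, key=lambda x: nearest_distance(x, training_set),
                    reverse=True)
    return ranked[:budget]

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tests = [(0.1, 0.1), (3.0, 3.0), (0.9, 0.1)]
print(select_most_uncertain(tests, train, budget=1))  # [(3.0, 3.0)]
```

DATIS's second step (pruning selected inputs that likely cover the same fault by their mutual distances) would operate on the output of such a ranking.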
LPR: Large Language Models-Aided Program Reduction
Mengxiao Zhang, Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Shin Hwei Tan, and Chengnian Sun
(University of Waterloo, Canada; Hong Kong University of Science and Technology, China; Concordia University, Canada)
Program reduction is a widely used technique to facilitate debugging compilers by automatically minimizing programs that trigger compiler bugs. Existing program reduction techniques are either generic to a wide range of languages (such as Perses and Vulcan) or specifically optimized for one certain language by exploiting language-specific knowledge (e.g., C-Reduce). However, synergistically combining both generality across languages and optimality for a specific language in program reduction is yet to be explored.

This paper proposes LPR, the first LLM-aided technique leveraging LLMs to perform language-specific program reduction for multiple languages. The key insight is to utilize both the language generality of program reducers such as Perses and the language-specific semantics learned by LLMs. Concretely, language-generic program reducers can efficiently reduce programs to a small size that is suitable for LLMs to process; LLMs can effectively transform programs via the learned semantics to create new reduction opportunities for the language-generic program reducers to further reduce the programs.

Our thorough evaluation on 50 benchmarks across three programming languages (i.e., C, Rust, and JavaScript) has demonstrated LPR's practicality and superiority over Vulcan, the state-of-the-art language-generic program reducer. For effectiveness, LPR surpasses Vulcan by producing 24.93%, 4.47%, and 11.71% smaller programs on benchmarks in C, Rust, and JavaScript, respectively. Moreover, LPR and Vulcan have the potential to complement each other. For the C language, for which C-Reduce is optimized, by applying Vulcan to the output produced by LPR, we can attain program sizes on par with those achieved by C-Reduce. For efficiency as perceived by users, LPR is more efficient when reducing large and complex programs, taking 10.77%, 34.88%, and 36.96% less time than Vulcan to finish all the benchmarks in C, Rust, and JavaScript, respectively.
Artifacts Available
Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code
Yujia Chen, Cuiyun Gao, Zezhou Yang, Hongyu Zhang, and Qing Liao
(Harbin Institute of Technology, China; Chongqing University, China)
In the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still face difficulties with long code inputs. This is mainly due to their limited capacity to maintain contextual continuity and memorize the key information over long-range code. To alleviate the difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two innovative memory mechanisms we propose in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a kNN attention layer to adaptively select the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model's overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models such as UniXcoder and two code intelligence tasks, including API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-trained language models.
Define-Use Guided Path Exploration for Better Forced Execution
Dongnan He, Dongchen Xie, Yujie Wang, Wei You, Bin Liang, Jianjun Huang, Wenchang Shi, Zhuo Zhang, and Xiangyu Zhang
(Renmin University of China, China; Purdue University, USA)
The evolution of recent malware, characterized by the escalating use of cloaking techniques, poses a significant challenge to the analysis of malware behaviors. Researchers have proposed forced execution to penetrate malware's self-protection mechanisms and expose hidden behaviors by forcefully setting certain branch outcomes. Existing studies focus on enhancing the forced executor to provide lightweight crash-free execution models. However, insufficient attention has been directed toward the path exploration strategy, an aspect equally crucial to effectiveness. The linear search employed in state-of-the-art forced execution tools exhibits inherent limitations that lead to unnecessary path exploration and incomplete behavior exposure. In this paper, we propose a novel and practical path exploration strategy that focuses on the coverage of define-use relations in the subject binary. We develop a fuzzing approach for exploring these define-use relations in a progressive and self-supervised way. Our experimental results show that the proposed solution outperforms existing forced execution tools in both memory dependence coverage and malware behavior exposure.
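The define-use relations the strategy targets can be illustrated with a few lines of generic dataflow bookkeeping. This toy operates on an abstract event trace rather than on binaries as the paper does; all names and the trace itself are illustrative.

```python
# Illustrative define-use pair extraction from an execution trace (generic
# dataflow bookkeeping, not the paper's binary-level technique).
def define_use_pairs(trace):
    """trace: list of ('def'|'use', variable) events in execution order.
    Returns the set of (def_index, use_index, variable) pairs exercised."""
    last_def, pairs = {}, set()
    for i, (kind, var) in enumerate(trace):
        if kind == "def":
            last_def[var] = i
        elif var in last_def:
            pairs.add((last_def[var], i, var))
    return pairs

trace = [("def", "x"), ("use", "x"), ("def", "x"), ("use", "x"), ("use", "y")]
print(define_use_pairs(trace))  # two pairs for 'x'; 'y' is used but never defined
```

A coverage-guided explorer would then prefer branch settings that exercise define-use pairs not yet in this set, rather than enumerating paths linearly.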
C2D2: Extracting Critical Changes for Real-World Bugs with Dependency-Sensitive Delta Debugging
Xuezhi Song, Yijian Wu, Shuning Liu, Bihuan Chen, Yun Lin, and Xin Peng
(Fudan University, China; Shanghai Jiao Tong University, China)
Data-driven techniques are promising for automatically locating and fixing bugs, which can save enormous time and effort for developers. However, the effectiveness of these techniques heavily relies on the quality and scale of bug datasets. Although emerging approaches to automatic bug dataset construction partially provide a solution for scalability, data quality remains a concern. Specifically, it remains a barrier for humans to isolate the minimal set of bug-inducing or bug-fixing changes, known as critical changes. Although delta debugging (DD) techniques are capable of extracting critical changes on benchmark datasets in academia, their efficiency and accuracy are still limited when dealing with real-world bugs, where code change dependencies can be overly complicated.

In this paper, we propose C2D2, a novel delta debugging approach for critical change extraction, which estimates the probabilities of dependencies between code change elements. C2D2 considers these dependency probabilities and introduces a matrix-based search mechanism to resolve compilation errors (CE) caused by missing dependencies. It also provides hybrid mechanisms for flexibly selecting code change elements during the DD process. Experiments on Defects4J and a real-world regression bug dataset reveal that C2D2 is significantly more efficient than the traditional DD algorithm ddmin with competitive effectiveness, and significantly more effective and more efficient than the state-of-the-art DD algorithm ProbDD. Furthermore, compared to human-isolated critical changes, C2D2 produces the same or better critical change results in 56% of cases in Defects4J and 86% of cases in the regression dataset, demonstrating its usefulness in automatically extracting critical changes and saving human effort in constructing large-scale bug datasets with real-world bugs.
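Since the abstract benchmarks against the classic ddmin algorithm, a compact generic sketch of ddmin may help readers place C2D2. This is textbook delta debugging (in the style of Zeller and Hildebrandt), not C2D2's dependency-aware variant; the `fails` predicate below is a made-up toy oracle.

```python
# Generic sketch of the classic ddmin algorithm. Real critical-change
# extraction would rebuild and re-test the project inside `fails`.
def ddmin(changes, fails):
    """Shrink `changes` to a smaller subset for which `fails` still holds."""
    n = 2
    while len(changes) >= 2:
        chunk = max(len(changes) // n, 1)
        subsets = [changes[i:i + chunk] for i in range(0, len(changes), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [c for j, s in enumerate(subsets) if j != i for c in s]
            if fails(subset):                      # subset alone reproduces the bug
                changes, n, reduced = subset, 2, True
                break
            if len(subsets) > 2 and fails(complement):
                changes, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(changes):                  # already at finest granularity
                break
            n = min(n * 2, len(changes))
    return changes

# Toy "bug": the failure is triggered whenever changes 3 and 7 are both present.
fails = lambda subset: 3 in subset and 7 in subset
print(ddmin(list(range(10)), fails))  # [3, 7]
```

Each call to `fails` corresponds to one build-and-test cycle, which is why reducing the number of trials (as C2D2 does by modeling dependencies) matters so much in practice.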
FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion
Qi Guo, Xiaohong Li, Xiaofei Xie, Shangqing Liu, Ze Tang, Ruitao Feng, Junjie Wang, Jidong Ge, and Lei Bu
(Tianjin University, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore; Nanjing University, China)
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative approach, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning.
Despite their potential, a significant challenge is that the designs of these methods often rely on heuristics, leaving critical questions about what information should be stored or retrieved and how to interpolate such information for augmenting predictions.
To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. While FT2Ra adopts a retrieval-based mechanism, it uniquely employs a paradigm with a learning rate and multi-epoch retrievals, similar to fine-tuning.
We conducted a comprehensive evaluation of FT2Ra on both token-level and line-level code completion. Our findings demonstrate the remarkable effectiveness of FT2Ra compared to state-of-the-art methods and its potential to approximate genuine fine-tuning. In token-level completion, which represents a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a more than twofold increase in Exact Match (EM) performance, indicating the significant advantages of our theoretical analysis. Notably, even when operating without actual fine-tuning, FT2Ra exhibits competitive performance compared to models with real fine-tuning.
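To make the general retrieval-augmentation idea concrete, here is a hedged toy in the spirit of kNN-LM-style interpolation. Note that FT2Ra's actual mechanism works on delta logits with a learning rate and multi-epoch retrieval, which this sketch does not reproduce; all names and numbers are illustrative.

```python
import math

# Toy retrieval augmentation: mix the model's next-token distribution with an
# empirical distribution over tokens seen in retrieved similar contexts.
def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def retrieval_augmented(model_logits, retrieved_tokens, lam=0.5):
    p_model = softmax(model_logits)
    p_knn = {t: retrieved_tokens.count(t) / len(retrieved_tokens) for t in p_model}
    return {t: (1 - lam) * p_model[t] + lam * p_knn[t] for t in p_model}

logits = {"foo": 1.0, "bar": 0.9}         # the model alone slightly prefers "foo"
neighbors = ["bar", "bar", "bar", "foo"]  # retrieved contexts mostly continue with "bar"
mixed = retrieval_augmented(logits, neighbors)
print(max(mixed, key=mixed.get))  # bar
```

The heuristic interpolation weight `lam` here is exactly the kind of design choice the abstract criticizes; FT2Ra replaces it with quantities derived from its fine-tuning analysis.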
MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing
Tianyi Yang, Cheryl Lee, Jiacheng Shen, Yuxin Su, Cong Feng, Yongqiang Yang, and Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong; Sun Yat-sen University, China; Huawei Cloud Computing Technology, China)
Microservice resilience, the ability of microservices to recover from failures and continue providing reliable and responsive services, is crucial for cloud vendors. However, the current practice relies on manually configured rules specific to a certain microservice system, which is labor-intensive and inflexible given the large scale and high dynamics of microservices. A more labor-efficient and versatile solution is desired. Our insight is that a resilient deployment can effectively prevent the dissemination of degradation from system performance metrics to user-aware metrics, where the latter directly affect service quality. In other words, failures in a non-resilient deployment can impact both types of metrics, leading to user dissatisfaction. With this in mind, we propose MicroRes, the first versatile resilience profiling framework for microservices via degradation dissemination indexing. MicroRes first injects failures into microservices and collects the available monitoring metrics. Then, it ranks the metrics according to their contributions to the overall service degradation. It produces a resilience index based on how much the degradation is disseminated from system performance metrics to user-aware metrics: higher degradation dissemination indicates lower resilience. We evaluate MicroRes on two open-source microservice systems and one industrial microservice system. The experiments show that MicroRes profiles microservice resilience efficiently and effectively. We also showcase MicroRes' practical usage in production.
Preprint
Isolation-Based Debugging for Neural Networks
Jialuo Chen, Jingyi Wang, Youcheng Sun, Peng Cheng, and Jiming Chen
(Zhejiang University, China; University of Manchester, United Kingdom)
Neural networks (NNs) are known to have diverse defects such as adversarial examples, backdoors, and discrimination, raising great concerns about their reliability. While NN testing can effectively expose these defects to a significant degree, understanding their root causes within the network requires further examination. In this work, inspired by the idea of debugging traditional software for failure isolation, we propose a novel unified neuron-isolation-based framework for debugging neural networks, IDNN for short. Given a buggy NN that exhibits certain undesired properties (e.g., discrimination), the goal of IDNN is to identify the most critical and minimal set of neurons that are responsible for exhibiting these properties. Notably, such isolation is conducted with the objective that by simply 'freezing' these neurons, the model's undesired properties can be eliminated, resulting in much more efficient model repair compared to computationally expensive retraining or weight optimization as in the existing literature. We conduct extensive experiments to evaluate IDNN across a diverse set of NN structures on five benchmark datasets, for solving three debugging tasks: backdoor, unfairness, and weak class. As a lightweight framework, IDNN outperforms state-of-the-art baselines by successfully identifying and isolating a very small set of responsible neurons, demonstrating superior generalization performance across all tasks.
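The "freeze the isolated neurons" repair idea can be shown on a toy network: zeroing a hidden unit's output via a mask removes its contribution without retraining. The network, weights, and mask below are entirely illustrative; IDNN's actual isolation search is not shown.

```python
# Hedged toy: a two-unit ReLU layer followed by a linear output. "Freezing" a
# neuron = multiplying its activation by 0 in the mask.
def forward(x, w1, w2, freeze_mask):
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    hidden = [h * m for h, m in zip(hidden, freeze_mask)]  # apply freeze mask
    return sum(wi * hi for wi, hi in zip(w2, hidden))

w1 = [[1.0, 0.0], [0.0, 1.0]]   # two hidden units, each reading one input
w2 = [1.0, -5.0]                # unit 1 dominates and drives the bad output
x = [0.2, 0.4]

print(forward(x, w1, w2, [1, 1]))  # unmasked: negative (undesired) output
print(forward(x, w1, w2, [1, 0]))  # unit 1 frozen: undesired effect removed
```

The hard part, of course, is finding which (minimal) set of neurons to freeze without breaking benign behavior, which is the search problem IDNN addresses.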
Atlas: Automating Cross-Language Fuzzing on Android Closed-Source Libraries
Hao Xiong, Qinming Dai, Rui Chang, Mingran Qiu, Renxiang Wang, Wenbo Shen, and Yajin Zhou
(Zhejiang University, China; ZJU-Hangzhou Global Scientific and Technological Innovation Center, China)
Fuzzing is an effective method for detecting security bugs in software, and there have been quite a few effective works on fuzzing Android. Researchers have developed methods for fuzzing open-source native APIs and Java interfaces on actual Android devices. However, the realm of automatically fuzzing Android closed-source native libraries, particularly on emulators, remains insufficiently explored. There are two key challenges: firstly, the multi-language programming model inherent to Android; and secondly, the absence of a Java runtime environment within the emulator.
To address these challenges, we propose Atlas, a practical automated fuzzing framework for Android closed-source native libraries. Atlas consists of an automatic harness generator and a fuzzer containing the necessary runtime environment. The generator uses static analysis techniques to deduce the correct calling sequences and parameters of a native API according to information from the "native world" and the "Java world". To maximize the practicality of the generated harnesses, Atlas heuristically optimizes them. The fuzzer provides the essential Java runtime environment in the emulator, making it possible to fuzz Android closed-source native libraries on a multi-core server. We have tested Atlas on 17 pre-installed apps from four Android vendors. Atlas generated 820 harnesses covering 767 native APIs, of which 78% are practical. Meanwhile, Atlas has discovered 74 new security bugs, with 16 CVEs assigned. The experiments show that Atlas can efficiently generate high-quality harnesses and find security bugs.
Automating Zero-Shot Patch Porting for Hard Forks
Shengyi Pan, You Wang, Zhongxin Liu, Xing Hu, Xin Xia, and Shanping Li
(Zhejiang University, China; Huawei, China)
Forking is a typical way of code reuse: it provides a simple way for developers to create variant software (denoted as a hard fork) by copying and modifying an existing codebase. Despite the benefits, forking also leads to duplicate effort in software maintenance. Developers need to port patches across hard forks to address similar bugs or implement similar features. Due to the divergence between the source project and the hard fork, patch porting is complicated, requiring adaptation to different implementations of the same functionality. In this work, we take the first step toward automating patch porting for hard forks under a zero-shot setting. We first conduct an empirical study of the patches ported from Vim to Neovim over the last ten years to investigate the necessity of patch porting and the potential flaws in the current practice. We then propose a large language model (LLM) based approach (namely PPatHF) to automatically port patches for hard forks on a function-wise basis. Specifically, PPatHF is composed of a reduction module and a porting module. Given the pre- and post-patch versions of a function from the reference project and the corresponding function from the target project, the reduction module first slims the input functions by removing code snippets less relevant to the patch. Then, the porting module leverages an LLM to apply the patch to the function from the target project. To better elicit the power of the LLM on patch porting, we design a prompt template to enable efficient in-context learning. We further propose an instruction-tuning-based training task to better guide the LLM to port the patch and to inject task-specific knowledge. We evaluate PPatHF on 310 Neovim patches ported from Vim. The experimental results show that PPatHF outperforms the baselines significantly. Specifically, PPatHF can correctly port 131 (42.3%) of the patches and automate 57% of the manual edits required for a developer to port a patch.
DiaVio: LLM-Empowered Diagnosis of Safety Violations in ADS Simulation Testing
You Lu, Yifan Tian, Yuyang Bi, Bihuan Chen, and Xin Peng
(Fudan University, China)
Simulation testing has been widely adopted by leading companies to ensure the safety of autonomous driving systems (ADSs). A number of scenario-based testing approaches have been developed to generate diverse driving scenarios for simulation testing and have been demonstrated to be capable of finding safety violations. However, there is no automated way to diagnose whether these violations are caused by the ADS under test and which category these violations belong to. As a result, great effort is required to manually diagnose violations.

To bridge this gap, we propose DiaVio to automatically diagnose safety violations in simulation testing by leveraging large language models (LLMs). It is built on top of a new domain-specific language (DSL) for crashes to align real-world accident reports described in natural language with violation scenarios in simulation testing. DiaVio fine-tunes a base LLM with real-world accident reports to learn diagnosis capability and uses the fine-tuned LLM to diagnose violation scenarios in simulation testing. Our evaluation has demonstrated the effectiveness and efficiency of DiaVio in violation diagnosis.
Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation
Zhaoyang Chu, Yao Wan, Qian Li, Yang Wu, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin
(Huazhong University of Science and Technology, China; Curtin University, Perth, Australia; Chongqing University, China; UNSW, Sydney, Australia; University of Technology, Sydney, Australia)
Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: "What would happen to the GNN's decision if we were to alter the code graph into alternative structures?" Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.
DeFort: Automatic Detection and Analysis of Price Manipulation Attacks in DeFi Applications
Maoyi Xie, Ming Hu, Ziqiao Kong, Cen Zhang, Yebo Feng, Haijun Wang, Yue Xue, Hao Zhang, Ye Liu, and Yang Liu
(Nanyang Technological University, Singapore; Xi’an Jiaotong University, China; MetaTrust Labs, Singapore)
Decentralized Finance (DeFi) applications facilitate tamper-proof transactions among multiple anonymous users. However, since attackers can access the smart contract bytecode directly, vulnerabilities in the transaction mechanism, contract code, or third-party components can be easily exploited to manipulate token prices, leading to financial losses. Since price manipulation often relies on specific states and complex trading sequences, existing detection tools have limitations in addressing this problem. In addition, to swiftly identify the root cause of an attack and implement targeted defense and remediation measures, auditors typically prioritize understanding the methodology behind the attack, emphasizing "how" it occurred rather than simply confirming its existence. To address these problems, this paper presents a novel automatic price manipulation detection and analysis framework, named DeFort, which contains a price manipulation behavior model to guide on-chain detection, multiple price monitoring strategies to detect pools with abnormal token prices, and various profit calculation mechanisms to confirm attacks. Based on the behavior model, DeFort can automatically locate transactions and functions that cause abnormal price fluctuations and identify attackers and victims. Experimental results demonstrate that DeFort outperforms state-of-the-art price manipulation detection methods. Furthermore, after monitoring 441 real-world projects for two months, DeFort successfully detected five price manipulation attacks.
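As background on why on-chain prices can be manipulated at all, the standard constant-product AMM formula (x * y = k) shows how a single large swap moves a pool's spot price. This is generic DeFi math, not DeFort's detection logic; the pool sizes are illustrative.

```python
# Constant-product AMM sketch: a swap of dx token X into the pool must keep
# x * y = k, so large trades move the spot price dramatically.
def swap(pool_x, pool_y, dx):
    """Returns (new_x, new_y, spot_price_of_X_in_Y) after swapping in dx of X."""
    k = pool_x * pool_y
    new_x = pool_x + dx
    new_y = k / new_x
    return new_x, new_y, new_y / new_x

x, y = 1000.0, 1000.0                # balanced pool, spot price 1.0
print(swap(x, y, 10.0)[2])    # small trade: price barely moves (~0.98)
print(swap(x, y, 1000.0)[2])  # huge trade: spot price of X collapses to 0.25
```

An attacker who can briefly force such a distorted price (e.g., via a flash loan) and make another contract read it mid-transaction is performing exactly the kind of manipulation DeFort's monitoring strategies aim to flag.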
Traceback: A Fault Localization Technique for Molecular Programs
Michael C. Gerten, James I. Lathrop, and Myra B. Cohen
(Iowa State University, USA)
Fault localization is essential to software maintenance tasks such as testing and automated program repair. Many fault localization techniques have been developed, the most common of which are spectrum-based. Most techniques have been designed for traditional programming paradigms that map passing and failing test cases to lines or branches of code; hence, specialized programming paradigms that utilize different code abstractions may fail to localize well. In this paper, we study fault localization in the context of a class of programs, molecular programs. Recent research has designed automated testing and repair frameworks for these programs but has ignored the importance of fault localization. As we demonstrate, using existing spectrum-based approaches may not provide much information. Instead we propose a novel approach, Traceback, that leverages temporal trace data. In an empirical study on a set of 89 faulty program variants, we demonstrate that Traceback provides a 32-90% improvement in localization over reaction-based mapping, a direct translation of spectrum-based localization. We see little difference in parameter tuning of Traceback when all tests, or only code-based (invariant) tests, are used; however, the best depth and weight parameters vary when using specification-based tests, which can be either functional or metamorphic. Overall, invariant-based tests provide the best localization results (either alone or in combination with others), followed by metamorphic and then functional tests.
Article Search
Silent Taint-Style Vulnerability Fixes Identification
Zhongzhen Wen
![ORCID logo](images/orcid.svg)
, Jiayuan Zhou
![ORCID logo](images/orcid.svg)
,
Minxue Pan ![ORCID logo](images/orcid.svg)
, Shaohua Wang
![ORCID logo](images/orcid.svg)
, Xing Hu
![ORCID logo](images/orcid.svg)
, Tongtong Xu
![ORCID logo](images/orcid.svg)
, Tian Zhang
![ORCID logo](images/orcid.svg)
, and Xuandong Li
(Nanjing University, China; Huawei, Waterloo, Canada; Central University of Finance and Economics, China; Zhejiang University, China; Huawei, China)
The coordinated vulnerability disclosure model, widely adopted in open-source software (OSS) organizations, recommends the silent resolution of vulnerabilities without revealing vulnerability information until their public disclosure. However, the inherently public nature of OSS development leads to security fixes becoming publicly available in repositories weeks before the official disclosure of vulnerabilities. This time gap poses a significant security risk to OSS users, as attackers could discover the fix and exploit the vulnerability before disclosure. Thus, there is a critical need for OSS users to detect fixes as early as possible to address the vulnerability before any exploitation occurs.
In response to this challenge, we introduce EarlyVulnFix, a novel approach designed to identify silent fixes for taint-style vulnerabilities—a persistent class of security weaknesses where attacker-controlled input reaches sensitive operations (sink) without proper sanitization. Leveraging data flow and dependency analysis, our tool distinguishes two types of connections between newly introduced code and sinks, tailored for two common fix scenarios. Our evaluation demonstrates that EarlyVulnFix surpasses state-of-the-art baselines by a substantial margin in terms of F1 score. Furthermore, when applied to the 700 latest commits across seven projects, EarlyVulnFix detected three security fixes before their respective security releases, highlighting its effectiveness in identifying unreported vulnerability fixes in the wild.
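The taint-style pattern described above (attacker-controlled input reaching a sink without sanitization) reduces to a reachability question over a data-flow graph. A minimal sketch follows, with a hypothetical adjacency-list encoding that is not taken from EarlyVulnFix:

```python
def tainted_path_exists(graph, sources, sinks, sanitizers):
    """DFS from taint sources; a sink is reachable in a 'vulnerable' way
    only if some path avoids every sanitizing node."""
    stack = [s for s in sources if s not in sanitizers]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node in sinks:
            return True
        for succ in graph.get(node, []):
            if succ not in seen and succ not in sanitizers:
                seen.add(succ)
                stack.append(succ)
    return False
```

A fix that inserts a sanitizer on every source-to-sink path flips the answer from True to False, which is exactly the kind of before/after difference a fix identifier can look for.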
Article Search
Benchmarking Automated Program Repair: An Extensive Study on Both Real-World and Artificial Bugs
Yicheng Ouyang
![ORCID logo](images/orcid.svg)
, Jun Yang
![ORCID logo](images/orcid.svg)
, and Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
As bugs are inevitable and prevalent in real-world programs, many Automated Program Repair (APR) techniques have been proposed to generate patches for them. However, due to the lack of a standard for evaluating APR techniques, prior works tend to use different settings and benchmarks in evaluation, threatening the trustworthiness of the evaluation results. Additionally, they typically only adopt plausibility and genuineness as evaluation metrics, which may potentially mask some underlying issues in APR techniques. To overcome these issues, in this paper, we conduct an extensive and multi-dimensional evaluation of nine learning-based and three traditional state-of-the-art APR techniques under the same environment and settings. We employ the widely studied Defects4J V2.0.0 benchmark and a newly constructed large-scale mutation-based benchmark named MuBench, derived from Defects4J and including 1,700 artificial bugs generated by various mutators, to uncover potential limitations in these APR techniques. We also apply multi-dimensional metrics, including compilability/plausibility/genuineness metrics, as well as SYE (SYntactic Equivalence) and TCE (Trivial Compiler Equivalence) metrics, to thoroughly analyze the 1,814,652 generated patches. This paper presents noteworthy findings from the extensive evaluation: Firstly, Large Language Model (LLM) based APR demonstrates less susceptibility to overfitting on the Defects4J V1.2.0 dataset and fixes the largest number of bugs. Secondly, the study suggests a promising future for combining traditional and learning-based APR techniques, as they exhibit complementary advantages in fixing different types of bugs. Additionally, this work highlights the necessity for further enhancing patch compilability of learning-based APR techniques, despite the presence of various existing strategies attempting to improve it.
The study also reveals other guidelines for enhancing APR techniques, including the need for handling unresolvable symbol compilability issues and reducing duplicate/no-op patch generation. Finally, our study uncovers seven implementation issues in the studied techniques, with five of them confirmed and fixed by the corresponding authors.
Article Search
Info
Artifacts Available
Multi-modal Learning for WebAssembly Reverse Engineering
Hanxian Huang
![ORCID logo](images/orcid.svg)
and Jishen Zhao
(University of California at San Diego, San Diego, USA)
The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives the demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced machine learning (ML)-based WebAssembly reverse engineering tools. Yet, the generalization of task-specific ML solutions remains challenging, because their effectiveness hinges on the availability of an ample supply of high-quality task-specific labeled data. Moreover, previous works trained models only with features extracted from WebAssembly, overlooking the high-level semantics present in the corresponding source code and its documentation. Acknowledging the abundance of available source code with documentation, which can be compiled into WebAssembly, we propose to learn representations of them concurrently and harness their mutual relationships for effective WebAssembly reverse engineering.
In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation and the compiled WebAssembly, without requiring labeled data. WasmRev incorporates three tailored multi-modal pre-training tasks to capture various characteristics of WebAssembly and cross-modal relationships. WasmRev is only trained once to produce general-purpose representations that can broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with much less labeled data, improving data efficiency. We fine-tune WasmRev onto three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization. Our results show that WasmRev pre-trained on the corpus of multi-modal samples establishes a robust foundation for these tasks, achieving high task accuracy and outperforming the state-of-the-art ML methods for WebAssembly reverse engineering.
Article Search
CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature
Chenyan Liu
![ORCID logo](images/orcid.svg)
, Yufan Cai
![ORCID logo](images/orcid.svg)
, Yun Lin
![ORCID logo](images/orcid.svg)
, Yuhuan Huang
![ORCID logo](images/orcid.svg)
, Yunrui Pei
![ORCID logo](images/orcid.svg)
, Bo Jiang
![ORCID logo](images/orcid.svg)
, Ping Yang
![ORCID logo](images/orcid.svg)
, Jin Song Dong
![ORCID logo](images/orcid.svg)
, and Hong Mei
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Bytedance Network Technology, Beijing, China)
Recent years have seen the development of LLM-based code generation. Compared to generating code in a software project, incremental code edits are empirically observed to be more frequent. The emerging code editing approaches usually formulate the problem as generating an edit based on known relevant prior edits and context. However, practical code edits can be more complicated. First, an editing session can include multiple (ir)relevant edits to the code under edit. Second, the inference of subsequent edits is non-trivial, as the scope of their ripple effects can be the whole project.
In this work, we propose CoEdPilot, an LLM-driven solution to recommend code edits by discriminating the relevant edits, exploring their interactive natures, and estimating their ripple effects in the project. Specifically, CoEdPilot orchestrates multiple neural transformers to identify what and how to edit in the project regarding both edit location and edit content. When a user accomplishes an edit with an optional editing description, a Subsequent Edit Analysis first reports the most relevant files in the project along with the types of edits (e.g., keep, insert, and replace) that can happen for each line of their code. Next, an Edit-content Generator generates concrete edit options for the lines of code, regarding the relevant prior changes reported by an Edit-dependency Analyzer. Last, both the Subsequent Edit Analysis and the Edit-content Generator capture relevant prior edits as feedback to readjust their recommendations. We train our models by collecting over 180K commits from 471 open-source projects in 5 programming languages. Our extensive experiments show that (1) CoEdPilot can predict the edits well (i.e., predicting edit location with an accuracy of 70.8%-85.3%, and the edit content with an exact match rate of 41.8% and a BLEU4 score of 60.7); and (2) CoEdPilot can boost existing edit generators such as GRACE and CCT5 on exact match rate by 8.57 percentage points and BLEU4 score by 18.08. Last, our user study on 18 participants with 3 editing tasks (1) shows that CoEdPilot can be effective in assisting users to edit code in comparison with Copilot, and (2) sheds light on future improvements of the tool design. The video demonstration of our tool is available at https://sites.google.com/view/coedpilot/home.
Article Search
Info
Automated Deep Learning Optimization via DSL-Based Source Code Transformation
Ruixin Wang
![ORCID logo](images/orcid.svg)
, Minghai Lu
![ORCID logo](images/orcid.svg)
, Cody Hao Yu
![ORCID logo](images/orcid.svg)
, Yi-Hsiang Lai
![ORCID logo](images/orcid.svg)
, and Tianyi Zhang
(Purdue University, USA; BosonAI, USA; Amazon Web Services, USA)
As deep learning models become increasingly large and complex, it is critical to improve model training and inference efficiency. Though a variety of highly optimized libraries and packages (known as DL kernels) have been developed, it is tedious and time-consuming to figure out which kernels to use, where to apply them, and how to use them correctly. To address this challenge, we propose an Automated Deep learning OPTimization approach called Adopter. We design a Domain-Specific Language (DSL) to represent DL model architectures and leverage this DSL to specify model transformation rules required to integrate a DL kernel into a model. Given the source code of a DL model and the transformation rules for a set of kernels, Adopter first performs inter-procedural analysis to identify and express the model architecture in our DSL. Then, Adopter performs scope analysis and sub-sequence matching to identify locations in the model architecture where the transformation rules can be applied. Finally, Adopter uses a synthesis-based code transformation method to apply the transformation rules. We curated a benchmark with 199 models from Hugging Face and a diverse set of DL kernels. We found that, compared to a state-of-the-art automated code transformation technique, Adopter helps improve the precision and recall by 3% and 56%, respectively. An in-depth analysis of 9 models revealed that, on average, Adopter improved the training speed by 22.7% while decreasing the GPU memory usage by 10.5%.
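The sub-sequence matching step can be pictured as locating a rule's operator pattern inside the linearized model architecture and splicing in the kernel. The operator names and list representation below are illustrative stand-ins, not Adopter's DSL:

```python
def find_match_sites(ops, pattern):
    """Return start indices where `pattern` occurs as a contiguous
    sub-sequence of the model's operator list."""
    n, m = len(ops), len(pattern)
    return [i for i in range(n - m + 1) if ops[i:i + m] == pattern]

def apply_rule(ops, pattern, replacement):
    """Replace the first occurrence of `pattern` with `replacement`
    (e.g., a fused kernel), leaving the rest of the model unchanged."""
    sites = find_match_sites(ops, pattern)
    if not sites:
        return ops
    i = sites[0]
    return ops[:i] + replacement + ops[i + len(pattern):]
```

In the real system the match must also respect scoping (e.g., the pattern cannot straddle a control-flow boundary), which is what the scope analysis contributes.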
Article Search
Artifacts Available
Evaluating the Effectiveness of Decompilers
Ying Cao
![ORCID logo](images/orcid.svg)
, Runze Zhang
![ORCID logo](images/orcid.svg)
, Ruigang Liang
![ORCID logo](images/orcid.svg)
, and Kai Chen
(Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, China)
In software security tasks like malware analysis and vulnerability mining, reverse engineering is pivotal, with C decompilers playing a crucial role in understanding program semantics. However, reverse engineers still predominantly rely on assembly code rather than decompiled code when analyzing complex binaries. This practice underlines the limitations of current decompiled code, which hinder its effectiveness in reverse engineering. Identifying and analyzing the problems of existing decompilers and making targeted improvements can effectively enhance the efficiency of software analysis. In this study, we systematically evaluate the semantic consistency and readability of current mainstream decompilers. Semantic evaluation results show that the state-of-the-art decompiler Hex-Rays achieves only about 55% accuracy at almost all optimization levels, which contradicts the common belief among many reverse engineers that decompilers are usually accurate. Readability evaluation indicates that, despite years of effort to improve the readability of decompiled code, decompilers’ template-based approach still predominantly yields code akin to binary structures rather than human coding patterns. Additionally, our human study indicates that, to enhance decompilers’ accuracy and readability, introducing human- or compiler-aware strategies, such as a speculate-verify-correct approach that obtains recompilable decompiled code and iteratively refines it to more closely resemble the original binary, potentially offers a more effective optimization method than relying on static analysis and rule expansion.
Article Search
Artifacts Available
CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
Hao Wang ![ORCID logo](images/orcid.svg)
, Zeyu Gao
![ORCID logo](images/orcid.svg)
, Chao Zhang
![ORCID logo](images/orcid.svg)
, Zihan Sha
![ORCID logo](images/orcid.svg)
, Mingyang Sun
![ORCID logo](images/orcid.svg)
, Yuchen Zhou
![ORCID logo](images/orcid.svg)
, Wenyu Zhu
![ORCID logo](images/orcid.svg)
, Wenju Sun
![ORCID logo](images/orcid.svg)
, Han Qiu
![ORCID logo](images/orcid.svg)
, and Xi Xiao
(Tsinghua University, China; Information Engineering University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code representation learning has shown significant performance in binary analysis tasks. However, existing solutions often have poor transferability, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the tasks. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and achieve better transferability. At its core, our approach boosts superior transfer learning capabilities by effectively aligning binary code with its semantic explanations (in natural language), resulting in a model able to generate better embeddings for binary code. To enable this alignment training, we propose an efficient dataset engine that automatically generates a large and diverse dataset comprising binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. The evaluations of CLAP across various downstream tasks in binary analysis all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability.
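The abstract's alignment objective between assembly embeddings and explanation embeddings is contrastive. A minimal symmetric InfoNCE sketch over plain Python lists follows; the temperature value is an assumed hyperparameter, and real training would use tensor batches, not this toy form:

```python
import math

def info_nce(code_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched (code, text) pairs share an
    index, so each row's positive logit is its diagonal entry."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def norm(u):
        s = math.sqrt(dot(u, u))
        return [a / s for a in u]

    code = [norm(c) for c in code_emb]
    text = [norm(t) for t in text_emb]
    n = len(code)
    logits = [[dot(c, t) / temperature for t in text] for c in code]

    def ce_row(row, target):  # cross-entropy with log-sum-exp stabilization
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    loss_c2t = sum(ce_row(logits[i], i) for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2c = sum(ce_row(cols[j], j) for j in range(n)) / n
    return (loss_c2t + loss_t2c) / 2
```

Correctly aligned pairs yield a lower loss than shuffled pairs, which is the signal that pulls matched binary code and explanations together in embedding space.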
Preprint
Papers Round 2
FDI: Attack Neural Code Generation Systems through User Feedback Channel
Zhensu Sun
![ORCID logo](images/orcid.svg)
, Xiaoning Du
![ORCID logo](images/orcid.svg)
,
Xiapu Luo ![ORCID logo](images/orcid.svg)
,
Fu Song ![ORCID logo](images/orcid.svg)
,
David Lo ![ORCID logo](images/orcid.svg)
, and Li Li
(Hong Kong Polytechnic University, China; Monash University, Australia; ShanghaiTech University, China; Automotive Software Innovation Center, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Singapore Management University, Singapore; Beihang University, China)
Article Search
DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
Mingke Yang
![ORCID logo](images/orcid.svg)
, Yuqi Chen
![ORCID logo](images/orcid.svg)
, Yi Liu
![ORCID logo](images/orcid.svg)
, and Ling Shi
(ShanghaiTech University, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) have showcased their remarkable capabilities in diverse domains, encompassing natural language understanding, translation, and even code generation. Yet the potential for LLMs to generate harmful content is a significant concern. This risk necessitates rigorous testing and comprehensive evaluation of LLMs to ensure safe and responsible use. However, extensive testing of LLMs requires substantial computational resources, making it an expensive endeavor. Therefore, exploring cost-saving strategies during the testing phase is crucial to balance the need for thorough evaluation with the constraints of resource availability. To address this, our approach, DistillSeq, begins by transferring the moderation knowledge from an LLM to a small model. Subsequently, we deploy two distinct strategies for generating malicious queries: one based on a syntax tree approach, and the other leveraging an LLM-based method. Finally, our approach incorporates a sequential filter-test process designed to identify test cases that are prone to eliciting toxic responses. By doing so, we significantly curtail unnecessary or unproductive interactions with LLMs, thereby streamlining the testing process. Our research evaluated the efficacy of DistillSeq across four LLMs: GPT-3.5, GPT-4.0, Vicuna-13B, and Llama-13B. In the absence of DistillSeq, the observed attack success rates on these LLMs stood at 31.5% for GPT-3.5, 21.4% for GPT-4.0, 28.3% for Vicuna-13B, and 30.9% for Llama-13B. Upon the application of DistillSeq, these success rates notably increased to 58.5%, 50.7%, 52.5%, and 54.4%, respectively. This translates to an average relative increase of 93.0% in attack success rate compared to scenarios without DistillSeq. Such findings highlight the significant enhancement DistillSeq offers in terms of reducing the time and resource investment required for effectively testing LLMs.
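The filter-test step can be sketched as scoring candidate queries with the distilled moderation model and forwarding only high-scoring ones to the expensive target LLM. The scoring callback and threshold below are stand-ins for exposition, not DistillSeq's trained model:

```python
def filter_then_test(queries, moderation_score, send_to_llm, threshold=0.5):
    """Forward only queries the small distilled model deems likely to
    elicit toxic output, curtailing costly LLM calls."""
    results = []
    for q in queries:
        if moderation_score(q) >= threshold:
            results.append((q, send_to_llm(q)))
    return results
```

The cost saving comes from `moderation_score` being a cheap distilled model, so the expensive `send_to_llm` call runs only on promising test cases.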
Article Search
Info
PatchFinder: A Two-Phase Approach to Security Patch Tracing for Disclosed Vulnerabilities in Open-Source Software
Kaixuan Li
![ORCID logo](images/orcid.svg)
, Jian Zhang
![ORCID logo](images/orcid.svg)
, Sen Chen
![ORCID logo](images/orcid.svg)
, Han Liu
![ORCID logo](images/orcid.svg)
, Yang Liu
![ORCID logo](images/orcid.svg)
, and Yixiang Chen
(East China Normal University, China; Nanyang Technological University, Singapore; Tianjin University, China)
Open-source software (OSS) vulnerabilities are increasingly prevalent, emphasizing the importance of security patches. However, in widely used security platforms like NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they heavily rely on handcrafted features in a single-step framework, which limits their effectiveness.
In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better tracing of security patches. In the **initial retrieval** phase, we employ a hybrid patch retriever that accounts for both lexical and semantic matching based on the code changes and the description of a CVE, narrowing down the search space by extracting commits similar to the CVE description as candidates. Afterwards, in the **re-ranking** phase, we design an end-to-end architecture under the supervised fine-tuning paradigm for learning the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates based on their correlation scores while maintaining low computation overhead. We evaluated our system against 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is curtailed to 2.77, marking a 1.94-fold improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them for official review, 482 of which have been confirmed by CVE Numbering Authorities.
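The hybrid initial-retrieval idea can be illustrated with a toy scorer combining a lexical signal (Jaccard token overlap) and a bag-of-words cosine as a stand-in for the semantic signal; the weighting, tokenization, and both similarity choices are assumptions for exposition, not PatchFinder's retriever:

```python
from collections import Counter
import math

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_commits(cve_desc, commit_msgs, alpha=0.5):
    """Rank candidate commits by a weighted lexical + semantic score."""
    q = cve_desc.lower().split()
    def score(msg):
        c = msg.lower().split()
        return alpha * jaccard(q, c) + (1 - alpha) * cosine(q, c)
    return sorted(commit_msgs, key=score, reverse=True)
```

The re-ranking phase would then apply the learned correlation model only to the top of this cheap ranking, keeping overall overhead low.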
Article Search
Oracle-Guided Program Selection from Large Language Models
Zhiyu Fan
![ORCID logo](images/orcid.svg)
, Haifeng Ruan
![ORCID logo](images/orcid.svg)
, Sergey Mechtaev
![ORCID logo](images/orcid.svg)
, and
Abhik Roychoudhury
(National University of Singapore, Singapore; Peking University, China)
While large language models (LLMs) have shown significant advancements in code generation, their susceptibility to producing incorrect code poses a significant challenge to the adoption of LLM-generated programs. This issue largely stems from the reliance on natural language descriptions as informal oracles in code generation. Current strategies to mitigate this involve selecting the best program from multiple LLM-generated alternatives, judged by criteria like the consistency of their execution results on an LLM-generated test suite. However, this approach has crucial limitations: (1) LLMs often generate redundant tests or tests that cannot distinguish between correct and incorrect solutions, and (2) the consistency criteria used, such as the majority vote, fail to foster developer trust due to the absence of a transparent rationale behind the choices made. In this work, we propose a new perspective on increasing the quality of LLM-generated code via program selection using the LLM as a test oracle. Our method is based on our experimentally confirmed observation that LLMs serve more effectively as oracles when tasked with selecting the correct output from multiple choices. Leveraging this insight, we first generate distinguishing inputs that capture semantic discrepancies of programs sampled from an LLM, and record the outputs produced by the programs on these inputs. An LLM then selects the output most likely to be correct from these, guided by the natural language problem description. We implemented this idea in a tool, LLMCodeChoice, and evaluated its accuracy in generating and selecting standalone programs. Our experiments demonstrated its effectiveness in improving pass@1 by 3.6-7% on the HumanEval and MBPP benchmarks compared to the state-of-the-art CodeT. Most interestingly, the selected input-output specifications helped us to uncover incompleteness and ambiguities in task descriptions and also identify incorrect ground-truth implementations in the benchmarks.
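The core bookkeeping behind this selection scheme is grouping sampled programs by their output signatures on distinguishing inputs, then letting the oracle pick one signature. In the sketch below the oracle is an injected callback (in the real system, an LLM call guided by the problem description); the structure is illustrative, not LLMCodeChoice's implementation:

```python
def group_by_signature(programs, inputs):
    """Map each distinct output signature to the programs producing it."""
    groups = {}
    for prog in programs:
        sig = tuple(prog(x) for x in inputs)
        groups.setdefault(sig, []).append(prog)
    return groups

def select_program(programs, inputs, choose_output):
    """`choose_output` plays the oracle: given candidate output
    signatures, it returns the one it believes correct."""
    groups = group_by_signature(programs, inputs)
    chosen_sig = choose_output(list(groups))
    return groups[chosen_sig][0]
```

Unlike majority vote, the chosen input-output signature itself serves as a human-readable rationale for the selection.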
Article Search
Beyond Pairwise Testing: Advancing 3-wise Combinatorial Interaction Testing for Highly Configurable Systems
Chuan Luo
![ORCID logo](images/orcid.svg)
, Shuangyu Lyu,
Qiyuan Zhao ![ORCID logo](images/orcid.svg)
, Wei Wu
![ORCID logo](images/orcid.svg)
, Hongyu Zhang
![ORCID logo](images/orcid.svg)
, and Chunming Hu
(Beihang University, China; National University of Singapore, Singapore; Central South University, China; Xiangjiang Laboratory, China; Chongqing University, China)
Article Search
An Empirical Study of Static Analysis Tools for Secure Code Review
Wachiraphan Charoenwet
![ORCID logo](images/orcid.svg)
, Patanamon Thongtanunam
![ORCID logo](images/orcid.svg)
,
Van-Thuan Pham ![ORCID logo](images/orcid.svg)
, and Christoph Treude
(University of Melbourne, Australia; Singapore Management University, Singapore)
Early identification of security issues in software development is vital to minimize their unanticipated impacts. Code review is a widely used manual analysis method that aims to uncover security issues along with other coding issues in software projects. While some studies suggest that automated static application security testing tools (SASTs) could enhance security issue identification, there is limited understanding of SASTs' practical effectiveness in supporting secure code review. Moreover, most SAST studies rely on synthetic or fully vulnerable versions of the subject program, which may not accurately represent real-world code changes in the code review process.
To address this gap, we study C/C++ SASTs using a dataset of actual code changes that contributed to exploitable vulnerabilities. Beyond SASTs' effectiveness, we quantify the potential benefits when changed functions are prioritized by SAST warnings. Our dataset comprises 319 real-world vulnerabilities from 815 vulnerability-contributing commits (VCCs) in 92 C and C++ projects. The results reveal that a single SAST can produce warnings in vulnerable functions of 52% of VCCs. Prioritizing changed functions with SAST warnings can improve accuracy (i.e., 12% in precision and 5.6% in recall) and reduce the Initial False Alarm (lines of code in non-vulnerable functions inspected until the first vulnerable function) by 13%. Nevertheless, at least 76% of the warnings in vulnerable functions are irrelevant to the VCCs, and 22% of VCCs remain undetected due to limitations of SAST rules. Our findings highlight the benefits and the remaining gaps of SAST-supported secure code review, as well as challenges that should be addressed in future work.
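The Initial False Alarm (IFA) metric defined parenthetically above can be computed directly from an inspection order. A sketch with a hypothetical `(loc, is_vulnerable)` record format, which is an assumption and not the study's exact data layout:

```python
def initial_false_alarm(ordered_functions):
    """ordered_functions: [(loc, is_vulnerable)] in inspection order.
    IFA = lines of code in non-vulnerable functions inspected before
    the first vulnerable function is reached."""
    inspected_loc = 0
    for loc, is_vulnerable in ordered_functions:
        if is_vulnerable:
            return inspected_loc
        inspected_loc += loc
    return inspected_loc  # no vulnerable function in the list
```

Re-ordering the list by SAST warning priority and re-computing IFA quantifies how much wasted inspection effort the prioritization removes.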
Article Search
Artifacts Available
VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information Extraction
Thanh-Dat Nguyen
![ORCID logo](images/orcid.svg)
, Tung Do-Viet
![ORCID logo](images/orcid.svg)
, Hung Nguyen-Duy
![ORCID logo](images/orcid.svg)
, Tuan-Hai Luu
![ORCID logo](images/orcid.svg)
, Hung Le
![ORCID logo](images/orcid.svg)
, Bach Le
![ORCID logo](images/orcid.svg)
, and Patanamon Thongtanunam
(University of Melbourne, Australia; Cinnamon AI, n.n.; Independent Researcher, n.n.; Deakin University, Australia)
Businesses often need to query visually rich documents (VRDs), e.g., purchase receipts, medical records, and insurance forms from multiple vendors, to make informed decisions. As such, several techniques have been proposed to automatically extract independent entities of interest from VRDs, such as price tags from purchase receipts. However, for extracting semantically linked entities, such as finding the corresponding price tag for each item, these techniques either have limited capability in handling new layouts (e.g., template-based approaches) or require extensive amounts of pre-training data and still do not perform well (e.g., deep-learning approaches).
In this work, we introduce a program synthesis method, namely VRDSynth, to automatically generate programs to extract entity relations from multilingual VRDs. Two key novelties, which empower VRDSynth to tackle flexible layouts while requiring no pre-training data for extracting entity relations, include: (1) a new domain-specific language (DSL) to effectively capture the spatial and textual relations between document entities, and (2) a novel synthesis algorithm that makes use of frequent spatial relations between entities to construct initial programs, equivalent reduction to prune the search space, and a combination of positive, negative, and mutually exclusive programs to improve the coverage of programs.
We evaluate our method on two popular VRD understanding benchmarks, namely FUNSD and XFUND, on the semantic entity linking task, consisting of 1,600 forms in 8 different languages. Experiments show that VRDSynth, despite having no prior pre-training data, outperforms the state-of-the-art pre-trained deep-learning approach, namely LayoutXLM, in 5 out of 8 languages. Noticeably, VRDSynth achieved an improvement of 42% over LayoutXLM in terms of F1 score on FUNSD while being complementary to LayoutXLM in 7/8 languages. Regarding efficiency, VRDSynth significantly improves the memory footprint required for storage and inference over LayoutXLM (1M and 380MB versus that of 1.48GB and 3GB required by LayoutXLM), while maintaining similar time efficiency despite the speed differences between the languages used for implementation (Python vs C++).
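A DSL capturing spatial relations between document entities ultimately rests on predicates over bounding boxes. An illustrative "right-of, same row" relation follows; the `(x0, y0, x1, y1)` box format and the tolerance are assumptions for exposition, not VRDSynth's actual DSL:

```python
def right_of_same_row(a, b, y_tol=0.5):
    """True if box `b` lies to the right of box `a` and their vertical
    centers are within `y_tol` of a's height. Boxes are (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    a_center_y = (ay0 + ay1) / 2
    b_center_y = (by0 + by1) / 2
    a_height = ay1 - ay0
    return bx0 >= ax1 and abs(a_center_y - b_center_y) <= y_tol * a_height
```

A synthesized program would combine such spatial predicates with textual ones (e.g., on entity labels) to link each item to its price across unseen layouts.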
Preprint
Info
Artifacts Available
Sleuth: A Switchable Dual-Mode Fuzzer to Investigate Bug Impacts Following a Single PoC
Haolai Wei
![ORCID logo](images/orcid.svg)
, Liwei Chen
![ORCID logo](images/orcid.svg)
, Zhijie Zhang
![ORCID logo](images/orcid.svg)
, Gang Shi
![ORCID logo](images/orcid.svg)
, and Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
A proof of concept (PoC) is essential for pinpointing a bug within software. However, relying on it alone for the timely and complete repair of bugs is insufficient, as it underestimates the bug's impacts. The bug impact reflects that a bug may be triggered at multiple positions downstream of the root cause, resulting in different bug types (e.g., use-after-free, heap-buffer-overflow). Current techniques discover bug impacts using fuzzing with a specific coverage-guided strategy: assigning more energy to seeds that cover the buggy code regions. This method can utilize a single PoC to generate multiple PoCs that exhibit different bug impacts in a short time. Unfortunately, we observe that existing techniques are still unreliable, primarily due to their failure to balance the time between in-depth and breadth exploration: (i) in-depth exploration for bug impacts behind crash regions and (ii) breadth exploration for bug impacts alongside unreached regions. Current techniques only focus on one exploration or conduct the two explorations in separate stages, leading to low accuracy and efficiency. Considering the aforementioned problem, we propose Sleuth, an approach for automatically investigating bug impacts following a known single PoC to enhance bug fixing. We design Sleuth on two novel concepts: (i) a dual-mode exploration mechanism built on a fuzzer designed for efficient in-depth and breadth exploration, and (ii) a dynamic switchable strategy connected with the dual-mode exploration that facilitates the reliability of bug impact investigation. We evaluate Sleuth using 50 known CVEs, and the experimental results show that Sleuth can efficiently discover new bug impacts for 86% of the CVEs and find 1.5x more bug impacts than state-of-the-art tools. Furthermore, Sleuth successfully identifies 13 incomplete fixes using the generated new PoCs.
Article Search
Artifacts Available
SQLess: Dialect-Agnostic SQL Query Simplification
Li Lin
![ORCID logo](images/orcid.svg)
, Zongyin Hao
![ORCID logo](images/orcid.svg)
,
Chengpeng Wang ![ORCID logo](images/orcid.svg)
, Zhuangda Wang
![ORCID logo](images/orcid.svg)
, Rongxin Wu
![ORCID logo](images/orcid.svg)
, and Gang Fan
(Xiamen University, China; Hong Kong University of Science and Technology, China; Ant Group, China)
Database Management Systems (DBMSs) are fundamental to numerous enterprise applications. Due to the significance of DBMSs, various testing techniques have been proposed to detect DBMS bugs. However, to trigger deep bugs, most existing techniques focus on generating lengthy and complex queries, which burdens developers with the difficulty of debugging. Therefore, SQL query simplification, which aims to reduce lengthy SQL queries without compromising their ability to detect bugs, is in high demand.
To bridge this gap, we introduce SQLess, an innovative approach that employs a dialect-agnostic method for efficient and semantically correct SQL query simplification tailored to various DBMSs. Unlike previous works that depend on DBMS-specific grammars, SQLess utilizes an adaptive parser, which leverages error recovery and grammar expansion to support DBMS dialects. Moreover, SQLess performs semantics-sensitive SQL query trimming, which leverages alias and dependency analysis to simplify SQL queries while preserving their bug-triggering capability.
We evaluate SQLess using two datasets from state-of-the-art database bug detection studies, encompassing six widely used DBMSs and over 32,000 complex SQL queries. The results demonstrate SQLess's superior performance: it achieves an average simplification rate of 72.45%, significantly outperforming the state-of-the-art approaches by 84.91%.
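The core idea of bug-preserving query reduction can be sketched as a greedy loop that drops clauses while an oracle still reports the bug. This is a minimal illustration, not SQLess's trimming algorithm: `still_triggers_bug` is a hypothetical stand-in for re-running the query on the DBMS under test, and real trimming must respect alias and dependency constraints.

```python
# Minimal sketch of oracle-guided SQL query trimming: greedily remove one
# clause at a time, keeping a removal only if the bug still triggers.

def trim_query(clauses, still_triggers_bug):
    """Return a smaller clause list that still triggers the bug."""
    reduced = list(clauses)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if candidate and still_triggers_bug(candidate):
                reduced = candidate          # the clause was not needed
                changed = True
                break
    return reduced

# Toy oracle: the "bug" fires whenever the SELECT and the buggy JOIN survive.
def oracle(clauses):
    return "SELECT c1" in clauses and "JOIN t2 ON t1.id = t2.id" in clauses

query = ["SELECT c1", "JOIN t2 ON t1.id = t2.id",
         "WHERE c1 > 0", "GROUP BY c1", "ORDER BY c1"]
print(trim_query(query, oracle))  # → ['SELECT c1', 'JOIN t2 ON t1.id = t2.id']
```

The result is 1-minimal with respect to single-clause removal; each kept clause is necessary for the oracle to fire.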
Article Search
BRAFAR: Bidirectional Refactoring, Alignment, Fault Localization, and Repair for Programming Assignments
Linna Xie, Chongmin Li, Yu Pei, Tian Zhang, and Minxue Pan
(Nanjing University, China; Hong Kong Polytechnic University, China)
The problem of automated feedback generation for introductory programming assignments (IPAs) has attracted significant attention with the increasing demand for programming education. While existing approaches, like Refactory, that employ the "block-by-block" repair strategy have produced promising results, they suffer from two limitations. First, Refactory randomly applies refactoring and mutation operations to correct and buggy programs, respectively, to align their control-flow structures (CFSs), which, however, has a relatively low success rate and often complicates the original repairing tasks. Second, Refactory generates repairs for each basic block of the buggy program when its semantics differs from the counterpart in the correct program, which, however, ignores the different roles that basic blocks play in the programs and often produces unnecessary repairs. To overcome these limitations, we propose the Brafar approach to feedback generation for IPAs. The core innovation of Brafar lies in its novel bidirectional refactoring algorithm and coarse-to-fine fault localization. The former aligns the CFSs of buggy and correct programs by applying semantics-preserving refactoring operations to both programs in a guided manner, while the latter identifies basic blocks that truly need repairs based on the semantics of their enclosing statements and themselves. In our experimental evaluation on 1783 real-life incorrect student submissions from a publicly available dataset, Brafar significantly outperformed Refactory and Clara, generating correct repairs for more incorrect programs with smaller patch sizes in a shorter time.
Article Search
CREF: An LLM-Based Conversational Software Repair Framework for Programming Tutors
Boyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F. Bissyandé, and Shunfu Jin
(Yanshan University, China; University of Melbourne, Australia; University of Luxembourg, Luxembourg; Jisuanke, n.n.)
Article Search
Call Graph Soundness in Android Static Analysis
Jordan Samhi, René Just, Tegawendé F. Bissyandé, Michael D. Ernst, and Jacques Klein
(CISPA Helmholtz Center for Information Security, Germany; University of Washington, USA; University of Luxembourg, Luxembourg)
Static analysis is sound in theory, but an implementation may unsoundly fail to analyze all of a program's code. Any such omission is a serious threat to the validity of the tool's output. Our work is the first to measure the prevalence of these omissions. Previously, researchers and analysts did not know what is missed by static analysis, what sort of code is missed, or the reasons behind these omissions. To address this gap, we ran 13 static analysis tools and a dynamic analysis on 1000 Android apps. Any method observed by the dynamic analysis but absent from a static analysis is an unsoundness.
Our findings include the following. (1) Apps built around external frameworks challenge static analyzers. On average, the 13 static analysis tools failed to capture 61% of the dynamically-executed methods. (2) A high level of precision in call graph construction is a synonym for a high level of unsoundness. (3) No existing approach significantly improves static analysis soundness. This includes those specifically tailored for a given mechanism, such as DroidRA to address reflection. It also includes systematic approaches, such as EdgeMiner, which captures all callbacks in the Android framework systematically. (4) Modeling entry point methods challenges call graph construction, which jeopardizes soundness.
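The paper's unsoundness measure reduces to a set difference between dynamically observed methods and the static call graph. A minimal sketch (method names here are illustrative, not taken from the study):

```python
# Unsoundness measure: any method executed at runtime but missing from a
# static call graph counts as an unsoundness of the static analysis.

def unsound_methods(dynamic_methods, static_call_graph):
    """Methods observed dynamically but missed by the static analysis."""
    return set(dynamic_methods) - set(static_call_graph)

# Toy app: reflection and native callbacks are classic blind spots.
dynamic = {"onCreate", "onClick", "reflectiveHelper", "nativeCallback"}
static = {"onCreate", "onClick"}

missed = unsound_methods(dynamic, static)
rate = len(missed) / len(dynamic)   # fraction of executed methods missed
print(sorted(missed), rate)  # → ['nativeCallback', 'reflectiveHelper'] 0.5
```

In the study's terms, a rate of 0.61 would correspond to the reported average of 61% dynamically-executed methods missed.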
Article Search
Efficient DNN-Powered Software with Fair Sparse Models
Xuanqi Gao, Weipeng Jiang, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, and Chao Shen
(Xi’an Jiaotong University, China; University of Massachusetts at Amherst, USA)
With the emergence of the Software 3.0 era, there is a growing trend of compressing and integrating large models into software systems, with significant societal implications. Regrettably, in numerous instances, model compression techniques impact the fairness performance of these models and thus the ethical behavior of DNN-powered software. One of the most notable examples is the Lottery Ticket Hypothesis (LTH), a prevailing model pruning approach. This paper demonstrates that the fairness issue of LTH-based pruning arises from both its subnetwork selection and training procedures, highlighting the inadequacy of existing remedies. To address this, we propose a novel pruning framework, Ballot, which employs a novel conflict-detection-based subnetwork selection to find accurate and fair subnetworks, coupled with a refined training process to attain a high-performance model, thereby improving the fairness of DNN-powered software. By means of this procedure, Ballot improves the fairness of pruning by 38.00%, 33.91%, 17.96%, and 35.82% compared to state-of-the-art baselines, namely Magnitude Pruning, Standard LTH, SafeCompress, and FairScratch respectively, based on our evaluation of five popular datasets and three widely used models. Our code is available at https://anonymous.4open.science/r/Ballot-506E.
Article Search
DeLink: Source File Information Recovery in Binaries
Zhe Lang, Zhengzi Xu, Xiaohui Chen, Shichao Lv, Zhanwei Song, Zhiqiang Shi, and Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanyang Technological University, Singapore; Imperial Global Singapore, Singapore; China Mobile Research Institute, China)
Program comprehension can help analysts understand the primary behavior of a binary and enhance the efficiency of reverse engineering analysis. The existing works focus on instruction translation and function name prediction. However, they are limited in understanding the entire program. The recovered source file information can offer insights into the primary behavior of a binary, serving as high-level program summaries. Nevertheless, the files recovered by the function clustering-based approach contain binary functions with discontinuous distributions, resulting in low accuracy. Additionally, there is no existing research related to predicting the names of these recovered files. To this end, we propose a framework for source file information recovery in binaries, DeLink. This framework first leverages a file structure recovery approach based on boundary location to recognize files within a binary. Then, it utilizes an encoder-decoder model to predict the names of these files. The experimental results show that our file structure recovery approach achieves an average improvement of 14% across six evaluation metrics and requires only an average time of 16.74 seconds, outperforming the state-of-the-art work in both recovery quality and efficiency. Additionally, our file name prediction model achieves 70.09% precision and 63.91% recall. Moreover, we demonstrate the effective application of DeLink in malware homology analysis.
Article Search
Feedback-Driven Automated Whole Bug Report Reproduction for Android Apps
Dingbang Wang, Yu Zhao, Sidong Feng, Zhaoxu Zhang, William G. J. Halfond, Chunyang Chen, Xiaoxia Sun, Jiangfan Shi, and Tingting Yu
(University of Connecticut, USA; University of Cincinnati, USA; Monash University, Australia; University of Southern California, USA; China Mobile (Suzhou) Software Technology, China; Zhejiang University, China)
Article Search
When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention
Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng
(Sun Yat-sen University, China; Xi’an Jiaotong University, China; Chongqing University, China; Huawei Cloud Computing Technologies, n.n.)
Article Search
Dance of the ADS: Orchestrating Failures through Historically-Informed Scenario Fuzzing
Tong Wang, Taotao Gu, Huan Deng, Hu Li, Xiaohui Kuang, and Gang Zhao
(Academy of Military Sciences, China)
As autonomous driving systems (ADS) advance towards higher levels of autonomy, orchestrating their safety verification becomes increasingly intricate. This paper unveils ScenarioFuzz, a pioneering scenario-based fuzz testing methodology. Designed like a choreographer who understands the past performances, it uncovers vulnerabilities in ADS without the crutch of predefined scenarios. Leveraging map road networks, such as OPENDRIVE, we extract essential data to form a foundational scenario seed corpus. This corpus, enriched with pertinent information, provides the necessary boundaries for fuzz testing in the absence of starting scenarios. Our approach integrates specialized mutators and mutation techniques, combined with a graph neural network model, to predict and filter out high-risk scenario seeds, optimizing the fuzzing process using historical test data. Compared to other methods, our approach reduces the time cost by an average of 60.3%, while the number of error scenarios discovered per unit of time increases by 103%. Furthermore, we propose a self-supervised collision trajectory clustering method, which aids in identifying and summarizing 54 high-risk scenario categories prone to inducing ADS faults. Our experiments have successfully uncovered 58 bugs across six tested systems, emphasizing the critical safety concerns of ADS.
Preprint
Artifacts Available
Better Not Together: Staged Solving for Context-Free Language Reachability
Chenghang Shi, Haofeng Li, Jie Lu, and Lian Li
(Institute of Computing Technology at Chinese Academy of Sciences, China)
Context-free language reachability (CFL-reachability) is a fundamental formulation for program analysis with many applications. CFL-reachability analysis is computationally expensive, with a slightly subcubic time complexity in the number of nodes in the input graph.
This paper proposes staged solving: a new perspective on solving CFL-reachability. Our key observation is that the context-free grammar (CFG) of a CFL-based program analysis can be decomposed into (1) a smaller CFG, L, for matching parentheses, such as procedure calls/returns, field stores/loads, and (2) a regular grammar, R, capturing control/data flows. Instead of solving these two parts monolithically (as in standard algorithms), staged solving solves L-reachability and R-reachability in two distinct stages. In practice, L-reachability, though still context-free, involves only a small subset of edges, while R-reachability can be computed efficiently with close to quadratic complexity relative to the node size of the input graph. We implement our staged CFL-reachability solver, STG, and evaluate it using two clients: context-sensitive value-flow analysis and field-sensitive alias analysis. The empirical results demonstrate that STG achieves speedups of 861.59x and 4.1x for value-flow analysis and alias analysis on average, respectively, over the standard subcubic algorithm. Moreover, we also showcase that staged solving can help to significantly improve the performance of two state-of-the-art solvers, POCR and PEARL, by 74.82x (1.78x) and 37.66x (1.7x) for value-flow (alias) analysis, respectively.
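The two-stage decomposition can be illustrated on a toy Dyck-style grammar. This is a simplified illustration, not STG's algorithm: the "L" stage matches bracketed (e.g., call/return) edges into summary edges, and the "R" stage is then an ordinary BFS over normal edges plus those summaries. The graph and edge kinds are invented for the example.

```python
# Toy staged solver for a grammar S -> eps | S S | (k S )k | e,
# where "e" edges are normal flow edges and (k / )k are matched brackets.
from collections import defaultdict, deque

def staged_reachability(normal, opens, closes, source):
    # Stage L: derive summary edges u->v whenever an open edge u-(k->a and a
    # close edge b-)k->v of the same kind k wrap a known S-path a ~> b.
    summaries = set()

    def s_reach(a):
        # Reachability over normal edges plus current summary edges.
        seen, work = {a}, deque([a])
        while work:
            x = work.popleft()
            for y in normal[x] | {t for (s, t) in summaries if s == x}:
                if y not in seen:
                    seen.add(y)
                    work.append(y)
        return seen

    changed = True
    while changed:                      # iterate matching to a fixpoint
        changed = False
        for (u, a, k) in opens:
            for (b, v, k2) in closes:
                if k == k2 and b in s_reach(a) and (u, v) not in summaries:
                    summaries.add((u, v))
                    changed = True

    # Stage R: plain BFS over normal edges plus the matched summaries.
    return s_reach(source)

normal = defaultdict(set, {"a": {"b"}})   # a -e-> b
opens = [("s", "a", 1)]                   # s -(1-> a
closes = [("b", "t", 1)]                  # b -)1-> t
print(sorted(staged_reachability(normal, opens, closes, "s")))  # → ['s', 't']
```

Here `s` reaches `t` only through the matched bracket pair wrapping the `a -> b` path; the final BFS stage never re-examines bracket edges, which is the efficiency point the paper exploits at scale.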
Article Search
Artifacts Available
Understanding Misconfigurations in ROS: An Empirical Study and Current Approaches
Paulo Canelas, Bradley Schmerl, Alcides Fonseca, and Christopher S. Timperley
(Carnegie Mellon University, USA; LASIGE, Portugal; University of Lisbon, Portugal)
The Robot Operating System (ROS) is a popular framework and ecosystem that allows developers to build robot software systems from reusable, off-the-shelf components. Systems are often built by customizing and connecting components via configuration files. While reusable components theoretically allow rapid prototyping, ensuring proper configuration and connection is challenging, as evidenced by numerous questions on developer forums. Developers must abide by the often unchecked and unstated assumptions of individual components. Failure to do so can result in misconfigurations that are only discovered during field deployment, at which point errors may lead to unpredictable and dangerous behavior. Despite misconfigurations having been studied in the broader context of software engineering, robotics software (and ROS in particular) poses domain-specific challenges with potentially disastrous consequences. To understand and improve the reliability of ROS projects, it is critical to identify the types of misconfigurations faced by developers. To that end, we perform a study of ROS Answers, a Q&A platform, to identify and categorize misconfigurations that occur during ROS development. We then conduct a literature review to assess the coverage of these misconfigurations by existing detection techniques. In total, we find 12 high-level categories and 50 sub-categories of misconfigurations. Of these categories, 27 are not covered by existing techniques. To conclude, we discuss how to tackle those misconfigurations in future work.
Article Search
Tacoma: Enhanced Browser Fuzzing with Fine-Grained Semantic Alignment
Jiashui Wang, Peng Qian, Xilin Huang, Xinlei Ying, Yan Chen, Shouling Ji, Jianhai Chen, Jundong Xie, and Long Liu
(Zhejiang University, China; Ant Group, China)
Browsers are responsible for managing and interpreting the diverse data coming from the web. Despite the considerable efforts of developers, however, it is nearly impossible to completely eliminate potential vulnerabilities in such complicated software. While a family of fuzzing techniques has been proposed to detect flaws in web browsers, they still face the inherent challenge that their generated test inputs have low semantic correctness and poor diversity.
In this paper, we propose Tacoma, a novel fuzzing framework tailored for web browsers. Tacoma comprises three main modules: a semantic parser, a semantic aligner, and an input generator. By taking advantage of fine-grained semantic alignment techniques, Tacoma is capable of generating semantically correct test inputs, which significantly improves the probability that a fuzzer triggers deep browser states. In particular, by integrating a scope-aware strategy into input generation, Tacoma is able to deal with asynchronous code generation, thereby substantially increasing the diversity of the generated test inputs. We conduct extensive experiments to evaluate Tacoma on three production-level browsers, i.e., Chromium, Safari, and Firefox. Empirical results demonstrate that Tacoma outperforms state-of-the-art browser fuzzers in both code coverage and unique crash detection. So far, Tacoma has identified 32 previously unknown bugs, 10 of which have been assigned CVEs. It is worth noting that Tacoma unearthed two bugs in Chromium that had remained undetected for ten years.
Article Search
Identifying Smart Contract Security Issues in Code Snippets from Stack Overflow
Jiachi Chen, Chong Chen, Jiang Hu, John Grundy, Yanlin Wang, Ting Chen, and Zibin Zheng
(Sun Yat-sen University, China; Monash University, Australia; University of Electronic Science and Technology of China, China)
Smart contract developers frequently seek solutions to developmental challenges on Q&A platforms such as Stack Overflow (SO). Although community responses often provide viable solutions, the embedded code snippets can also contain hidden vulnerabilities. Integrating such code directly into smart contracts may make them susceptible to malicious attacks. We conducted an online survey and received 74 responses from smart contract developers. The results of this survey indicate that the majority (86.4%) of participants do not sufficiently consider security when reusing SO code snippets. Despite the existence of various tools designed to detect vulnerabilities in smart contracts, these tools are typically developed for analyzing fully-completed smart contracts and thus are ineffective for analyzing typical code snippets as found on SO. We introduce SOChecker, the first tool designed to identify potential vulnerabilities in incomplete SO smart contract code snippets. SOChecker first leverages a fine-tuned Llama2 model for code completion, followed by the application of symbolic execution methods for vulnerability detection. Our experimental results, derived from a dataset comprising 897 code snippets collected from smart contract-related SO posts, demonstrate that SOChecker achieves an F1 score of 68.2%, greatly surpassing GPT-3.5 and GPT-4 (20.9% and 33.2% F1 Scores respectively). Our findings underscore the need to improve the security of code snippets from Q&A websites.
Article Search
How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation
Cen Zhang, Yaowen Zheng, Mingqiang Bai, Yeting Li, Wei Ma, Xiaofei Xie, Yuekang Li, Limin Sun, and Yang Liu
(Nanyang Technological University, Singapore; Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Singapore Management University, Singapore; University of New South Wales, Australia)
Fuzz drivers are essential for library API fuzzing. However, automatically generating fuzz drivers is a complex task, as it demands the creation of high-quality, correct, and robust API usage code. An LLM-based (Large Language Model) approach for generating fuzz drivers is a promising area of research. Unlike traditional program analysis-based generators, this text-based approach is more generalized and capable of harnessing a variety of API usage information, resulting in code that is friendly to human readers. However, there is still a lack of understanding regarding the fundamental issues in this direction, such as its effectiveness and potential challenges.
To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, at a cost of 0.85 billion tokens ($8,000+ in charged tokens). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-years). Our study uncovered that:
1) While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications;
2) LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process;
3) While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage, or integrating semantic oracles to facilitate logical bug detection.
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.
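The iterative-querying strategy highlighted in finding 2 can be sketched as a retry loop that feeds build/run errors back into the next prompt. This is a hypothetical illustration, not the study's harness: `ask_model` and `try_build_and_run` are stand-ins for a real LLM call and a real compile-and-fuzz step, and the toy API name `png_read` is invented.

```python
# Sketch of iterative querying for fuzz driver generation: ask for a driver,
# validate it, and feed any error message back into the next query.

def iterative_fuzz_driver(api_name, ask_model, try_build_and_run, budget=3):
    feedback = None
    for attempt in range(budget):
        driver = ask_model(api_name, feedback)   # prompt includes last error
        ok, feedback = try_build_and_run(driver)
        if ok:
            return driver, attempt + 1
    return None, budget                          # budget exhausted

# Toy stand-ins: the second attempt "fixes" a missing init call.
def ask_model(api, feedback):
    return f"init(); {api}(data)" if feedback else f"{api}(data)"

def try_build_and_run(driver):
    return ("init()" in driver, "error: call init() first")

driver, attempts = iterative_fuzz_driver("png_read", ask_model, try_build_and_run)
print(attempts, driver)  # → 2 init(); png_read(data)
```

The study's other beneficial strategies, repeat queries and querying with examples, fit the same loop by varying how the prompt is constructed.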
Preprint
Info
Evaluating Deep Neural Networks in Deployment: A Comparative Study (Replicability Study)
Eduard Pinconschi, Divya Gopinath, Rui Abreu, and Corina S. Păsăreanu
(Carnegie Mellon University, USA; KBR, USA; NASA Ames, USA; INESC-ID, Portugal; University of Porto, Portugal)
As deep neural networks (DNNs) are increasingly used in safety-critical applications, there is a growing concern for their reliability. Even highly trained, high-performant networks are not 100% accurate. However, it is very difficult to predict their behavior during deployment without ground truth. In this paper, we provide a comparative and replicability study on recent approaches that have been proposed to evaluate the reliability of DNNs in deployment. We find that it is hard to run and reproduce the results for these approaches on their replication packages and even more difficult to run them on artifacts other than their own. Further, it is difficult to compare the effectiveness of the approaches, due to the lack of clearly defined evaluation metrics. Our results indicate that more effort is needed in our research community to obtain sound techniques for evaluating the reliability of neural networks in safety-critical domains. To this end, we contribute an evaluation framework that incorporates the considered approaches and enables evaluation on common benchmarks, using common metrics.
Article Search
Towards Understanding the Bugs in Solidity Compiler
Haoyang Ma, Wuqi Zhang, Qingchao Shen, Yongqiang Tian, Junjie Chen, and Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Tianjin University, China)
The Solidity compiler plays a key role in enabling the development of smart contract applications on Ethereum by governing the syntax of a domain-specific language called Solidity and performing compilation and optimization of Solidity code. The correctness of the Solidity compiler is critical in fostering transparency, efficiency, and trust in industries reliant on smart contracts. However, like other software systems, the Solidity compiler is prone to bugs, which may produce incorrect bytecode on blockchain platforms, resulting in severe security concerns. As a domain-specific compiler for smart contracts, the Solidity compiler differs from other compilers in many perspectives, posing unique challenges to detecting its bugs. To understand the bugs in the Solidity compiler and benefit future research, in this paper we present the first systematic study on 533 Solidity compiler bugs. We carefully examined their characteristics (including symptoms, root causes, and distribution) and their triggering test cases. Our study leads to seven bug-revealing takeaways for the Solidity compiler. Moreover, to study the limitations of Solidity compiler fuzzers and bring our findings into practical scenarios, we evaluated three Solidity compiler fuzzers on our constructed benchmark. The results show that these fuzzers are inefficient at detecting Solidity compiler bugs. The inefficiency arises from their failure to consider interesting bug-inducing features, bug-related compilation flags, and test oracles.
Article Search
Prospector: Boosting Directed Greybox Fuzzing for Large-Scale Target Sets with Iterative Prioritization
Zhijie Zhang, Liwei Chen, Haolai Wei, Gang Shi, and Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Directed grey-box fuzzing (DGF) is an advanced technique in security testing, specifically designed to guide fuzzing tools toward predefined target sites within a software program. To improve scalability to multiple targets, recent DGF approaches prioritize seeds that are close to targets based on a more precise distance metric and dynamically discard well-explored targets, thus steering toward all targets simultaneously. However, not all targets hold equal importance, particularly when facing large-scale target sets. Therefore, current works that blindly track all targets divert computing resources from critical targets, thereby reducing the overall efficiency of triggering targets. In this paper, we present Prospector, a novel DGF approach that can handle large-scale target sets. Prospector employs an iterative process to focus on a select group of "focused targets". To dynamically maintain these targets, Prospector presents a more fine-grained strategy that considers the vulnerable patterns and test adequacy of targets. Subsequently, Prospector further sharpens its fuzzing toward the "focused targets" by refining strategies in explore-exploit scheduling, seed selection, and byte scheduling. We evaluate Prospector on 24 programs, setting all sanitizer labels as targets. The experimental results show that Prospector exposed bugs faster than AFL++, WindRanger, ParmeSan, and FishFuzz in 125, 141, 84, and 100 cases, respectively. Among 38 unique bugs in the program group with the largest target sets, Prospector reproduces 18 (47.37%) existing bugs faster than other fuzzers. Prospector also discovered 6 new bugs in 4 real-world programs, with 5 CVE IDs assigned.
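The idea of maintaining "focused targets" by weighing vulnerable patterns against test adequacy can be sketched with a toy scoring function (this is a hypothetical illustration, not Prospector's metric; the weights, decay formula, and target names are invented):

```python
# Sketch of iterative target prioritization: score each target by a
# vulnerability-pattern weight, decayed by how often fuzzing has already
# exercised it, and keep the top-k as this round's focused targets.

def pick_focused_targets(targets, k=2):
    """targets: {name: (pattern_weight, hit_count)} -> top-k focused names."""
    def score(item):
        name, (weight, hits) = item
        return weight / (1 + hits)       # well-explored targets decay
    ranked = sorted(targets.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]

targets = {
    "memcpy_site":  (9.0, 1),    # risky pattern, barely explored
    "strcpy_site":  (8.0, 0),    # risky, untouched
    "logging_site": (1.0, 0),    # benign pattern
    "parser_site":  (9.0, 20),   # risky but already well explored
}
print(pick_focused_targets(targets))  # → ['strcpy_site', 'memcpy_site']
```

Re-running the selection each round as hit counts grow naturally rotates attention away from well-tested targets, which is the behavior the abstract describes.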
Article Search
Bugs in Pods: Understanding Bugs in Container Runtime Systems
Jiongchi Yu, Xiaofei Xie, Cen Zhang, Sen Chen, Yuekang Li, and Wenbo Shen
(Singapore Management University, Singapore; Nanyang Technological University, Singapore; Tianjin University, China; UNSW, Australia; Zhejiang University, China)
Article Search
Automated Data Binding Vulnerability Detection for Java Web Frameworks via Nested Property Graph
Xiaoyong Yan, Biao He, Wenbo Shen, Yu Ouyang, Kaihang Zhou, Xingjian Zhang, Xingyu Wang, Yukai Cao, and Rui Chang
(Zhejiang University, China; Ant Group, China)
Article Search
SelfPiCo: Self-Guided Partial Code Execution with LLMs
Zhipeng Xue, Zhipeng Gao, Shaohua Wang, Xing Hu, Xin Xia, and Shanping Li
(Zhejiang University, China; Central University of Finance and Economics, China)
Code executability plays a vital role in software debugging and testing (e.g., detecting runtime exceptions or assertion violations). However, code execution, especially partial or arbitrary code execution, is a non-trivial task due to missing definitions and complex third-party dependencies. To make partial code (such as code snippets posted on the web or code fragments deep inside complex software projects) executable, an existing study proposed a machine learning model that predicts the types of undefined elements and injects pre-defined dummy values into the execution. However, the performance of that tool is limited by its simplistic dummy values and its inability to continue learning. In this paper, we design and implement a novel framework, named SelfPiCo (Self-Guided Partial Code Executor), to dynamically guide partial code execution by incorporating an open-source LLM (i.e., Code Llama) within an interactive loop. In particular, SelfPiCo leverages few-shot in-context learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning based on fine-tuning the Code Llama model. SelfPiCo continuously learns from code execution results and refines its predictions step by step. Our evaluations demonstrate that SelfPiCo can execute 72.7% and 83.3% of all lines in the open-source code and Stack Overflow snippets, outperforming the most recent state-of-the-art LExecutor by 37.9% and 33.5%, respectively. Moreover, SelfPiCo successfully detected 18 and 33 runtime type error issues by executing partial code from eight GitHub software projects and 43 Stack Overflow posts, demonstrating the practical usage and potential application of our framework in practice.
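The guided-execution loop can be sketched in a few lines: run the snippet, and on each `NameError` inject a predicted value for the undefined name, then retry. This is a minimal sketch, not SelfPiCo itself; where SelfPiCo queries Code Llama for the value, a fixed lookup table stands in for the model's prediction here.

```python
# Minimal sketch of guided partial-code execution: repeatedly exec() a
# snippet and patch each undefined name with a guessed dummy value.

def execute_partial(code, guess_value, max_steps=10):
    env = {}
    for _ in range(max_steps):
        try:
            exec(code, env)
            return env, None                     # snippet ran to completion
        except NameError as e:
            missing = str(e).split("'")[1]       # extract undefined identifier
            env[missing] = guess_value(missing)  # inject the predicted value
    return env, "gave up"

# Stand-in for the model: predict a type-appropriate dummy per name.
table = {"count": 0, "items": []}
snippet = "items.append(count + 1)\ntotal = count + len(items)"
env, err = execute_partial(snippet, lambda name: table[name])
print(err, env["total"])  # → None 1
```

A real implementation would also handle `AttributeError`/`ImportError`, guard against side effects of re-execution, and feed the runtime feedback back to the model, which is where SelfPiCo's interactive loop comes in.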
Article Search
CoSec: On-the-Fly Security Hardening of Code LLMs via Supervised Co-decoding
Dong Li, Meng Yan, Yaosheng Zhang, Zhongxin Liu, Chao Liu, Xiaohong Zhang, Ting Chen, and David Lo
(Chongqing University, China; Zhejiang University, China; University of Electronic Science and Technology of China, China; Singapore Management University, Singapore)
Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, prior studies have shown that code LLMs are prone to generating code with potential vulnerabilities. Existing approaches to mitigating this risk involve crafting vulnerability-free data and subsequently retraining or fine-tuning the model. As the number of parameters exceeds a billion, the computational and data demands of these approaches become enormous. Moreover, an increasing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making the prior mitigation strategies inapplicable. To cope with this, we propose CoSec, an on-the-fly security hardening method for code LLMs based on security model-guided co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model has higher confidence in secure tokens, it guides the generation of the target base model toward more secure code. By adjusting the probability distributions of tokens during each step of the decoding process, our approach effectively influences generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameter scales in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective in security hardening. Specifically, our approach improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.
CooTest: An Automated Testing Approach for V2X Communication Systems
An Guo, Xinyu Gao, Zhenyu Chen, Yuan Xiao, Jiakai Liu, Xiuting Ge, Weisong Sun, and Chunrong Fang
(Nanjing University, China)
Perceiving the complex driving environment precisely is crucial to the safe operation of autonomous vehicles. With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) collaboration has the potential to address the limitations of single-agent perception systems in sensing distant and occluded objects. However, despite spectacular progress, several communication challenges can undermine the effectiveness of multi-vehicle cooperative perception. The low interpretability of Deep Neural Networks (DNNs) and the high complexity of communication mechanisms make conventional testing techniques inapplicable to the cooperative perception of autonomous driving systems (ADS). Moreover, existing testing techniques, which depend on manual data collection and labeling, are time-consuming and prohibitively expensive.
In this paper, we design and implement CooTest, the first automated testing tool for the V2X-oriented cooperative perception module. CooTest devises a V2X-specific metamorphic relation and is equipped with communication and weather transformation operators that reflect the impact of various cooperative driving factors to produce transformed scenes. Furthermore, we adopt a V2X-oriented guidance strategy for the transformed-scene generation process to improve testing efficiency. We evaluate CooTest with multiple cooperative perception models with different fusion schemes to assess its performance on different tasks. The experimental results show that CooTest can effectively detect erroneous behaviors under various V2X-oriented driving conditions. The results also confirm that CooTest can improve detection average precision and reduce misleading cooperation errors through retraining with the generated scenes.
Interoperability in Deep Learning: A User Survey and Failure Analysis of ONNX Model Converters
Purvish Jajal, Wenxin Jiang, Arav Tewari, Erik Kocinare, Joseph Woo, Anusha Sarraf, Yung-Hsiang Lu, George K. Thiruvathukal, and James C. Davis
(Purdue University, USA; Loyola University Chicago, USA)
TeDA: A Testing Framework for Data Usage Auditing in Deep Learning Model Development
Xiangshan Gao, Jialuo Chen, Jingyi Wang, Jie Shi, Peng Cheng, and Jiming Chen
(Zhejiang University, China; Huawei Technology, China; Huawei International, Singapore; Hangzhou Dianzi University, China)
It is notoriously challenging to audit potential unauthorized data usage in the deep learning (DL) model development lifecycle, i.e., to judge whether certain private user data has been used to train or fine-tune a DL model without authorization. Yet, such data usage auditing is crucial to meeting the urgent requirements of trustworthy Artificial Intelligence (AI), such as data transparency, which are promoted and enforced in recent AI regulations and acts like the General Data Protection Regulation (GDPR) and the EU AI Act. In this work, we propose TeDA, a simple and flexible testing framework for auditing data usage in the DL model development process. Given a set of a user’s private data to protect (Dp), the intuition of TeDA is to apply membership inference (with good intention) to judge whether the model to audit (Ma) is likely to have been trained with Dp. Notably, to significantly expose such usage under membership inference, TeDA applies imperceptible perturbation directed by boundary search to generate a carefully crafted test suite Dt (which we call an ‘isotope’) based on Dp. With the test suite, TeDA then adopts membership inference combined with hypothesis testing to decide, with statistical guarantees, whether a user’s private data has been used to train Ma. We evaluated TeDA through extensive experiments on a range of data volumes across various model architectures for data-sensitive face recognition and medical diagnosis tasks. TeDA demonstrates high feasibility, effectiveness, and robustness under various adaptive strategies (e.g., pruning and distillation).
Enhancing Multi-agent System Testing with Diversity-Guided Exploration and Adaptive Critical State Exploitation
Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, and Qing Wang
(University of Chinese Academy of Sciences, China; Institute of Software at Chinese Academy of Sciences, China; Singapore Management University, Singapore)
Multi-agent systems (MASs) have achieved remarkable success in multi-robot control, intelligent transportation, multiplayer games, etc. Thorough testing of MASs is urgently needed to ensure their robustness in the face of constantly changing and unexpected scenarios. Existing methods mainly focus on single-agent system testing and cannot be directly applied to MAS testing due to the complexity of MASs. To the best of our knowledge, there are few studies on MAS testing. While several studies have focused on adversarial attacks on MASs, they primarily target failure detection from an attack perspective, i.e., discovering failure scenarios, while ignoring the diversity of scenarios. In this paper, to strike a balance between exploration (diversifying behaviors) and exploitation (detecting failures), we propose MASTest, an advanced testing framework for MASs with diversity-guided exploration and adaptive critical-state exploitation. It incorporates both individual diversity and team diversity, and designs an adaptive perturbation mechanism to perturb actions at critical states, so as to trigger more, and more diverse, failure scenarios of the system. We evaluate MASTest on two popular MAS simulation environments: Coop Navi and StarCraft II. Results show that the average distance between the resulting failure scenarios is increased by 29.55%-103.57% and 74.07%-370.00% in the two environments compared to the baselines. Also, the number of failure patterns found by MASTest is improved by 71.44%-300.00% and 50%-500.00% in the two experimental environments compared to the baselines.
Arfa: An Agile Regime-Based Floating-Point Optimization Approach for Rounding Errors
Jinchen Xu, Mengqi Cui, Fei Li, Zuoyan Zhang, Hongru Yang, Bei Zhou, and Jie Zhao
(Information Engineering University, China; Hunan University, China)
We introduce a floating-point (FP) error optimization approach called Arfa that partitions the domain D of an FP expression fe into regimes and rewrites fe in each regime where fe exhibits larger errors. First, Arfa seeks a rewrite substitution fo with lower errors across D, whose error distribution is plotted for effective regime inference. Next, Arfa generates an incomplete set of ordered rewrite candidates within each regime of interest, so that the search for the best rewrite substitutions can be performed efficiently. Finally, Arfa selects the best rewrite substitution by inspecting the errors of the top-ranked rewrite candidates, while also considering precision enhancement. Experiments on 56 FPBench examples and four real-life programs show that Arfa not only reduces the maximum and average errors of fe by 4.73 and 2.08 bits on average (and up to 33 and 16 bits), but also exhibits lower errors, sometimes to a significant degree, than Herbie and NumOpt.
NeuFair: Neural Network Fairness Repair with Dropout
Vishnu Asutosh Dasu, Ashish Kumar, Saeid Tizpaz-Niari, and Gang Tan
(Pennsylvania State University, USA; University of Texas at El Paso, USA)
This paper investigates neuron dropout as a post-processing bias mitigation method for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While DNNs are exceptional at learning statistical patterns from data, they may encode and amplify historical biases. Existing bias mitigation algorithms often require modifying the input dataset or the learning algorithm. We posit that prevalent dropout methods may be an effective and less intrusive approach to improving the fairness of pre-trained DNNs during inference. However, finding the ideal set of neurons to drop is a combinatorial problem.
We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs via dropouts during inference. Our randomized search is guided by an objective to minimize discrimination while maintaining the model’s utility. We show that NeuFair is efficient and effective in improving fairness (up to 69%) with minimal or no model performance degradation. We provide intuitive explanations of these phenomena and carefully examine the influence of various hyperparameters of NeuFair on the results. Finally, we empirically and conceptually compare NeuFair to different state-of-the-art bias mitigators.
One Size Does Not Fit All: Multi-granularity Patch Generation for Better Automated Program Repair
Bo Lin, Shangwen Wang, Ming Wen, Liqian Chen, and Xiaoguang Mao
(National University of Defense Technology, China; Huazhong University of Science and Technology, China)
Automated program repair (APR) aims to automate bug correction and alleviate the burden of manual debugging, which plays a crucial role in software development and maintenance. Recent studies reveal that learning-based approaches have outperformed conventional APR techniques (e.g., search-based APR). Existing learning-based APR techniques mainly center on treating program repair either as a translation task or as a cloze task. Per our observations, the former primarily emphasizes statement-level repair, while the latter concentrates on token-level repair. In practice, however, patches may manifest at various repair granularities, including the statement, expression, and token levels. Consequently, merely generating patches at a single granularity is ineffective for tackling real-world defects. Motivated by this observation, we propose Mulpor, a multi-granularity patch generation approach designed to address the diverse nature of real-world bugs. Mulpor comprises three components (statement-level, expression-level, and token-level generators), each pre-trained to generate correct patches at its respective granularity. The approach generates candidate patches at various granularities, followed by a heuristic-based re-ranking process to prioritize patches. Experimental results on the Defects4J dataset demonstrate that Mulpor correctly repairs 92 bugs on Defects4J-v1.2, a 27.0% (20 bugs) and 12.2% (10 bugs) improvement over the previous state-of-the-art NMT-style Rap-Gen and cloze-style GAMMA, respectively. We also studied the generalizability of Mulpor in repairing vulnerabilities, revealing a notable 51% increase in the number of correctly fixed patches compared with state-of-the-art vulnerability repair approaches. This paper underscores the importance of considering multiple granularities in program repair techniques as part of a comprehensive strategy for addressing the diverse nature of real-world software defects. Mulpor exhibits promising results in achieving effective and diverse bug fixes across various program repair scenarios.
Policy Testing with MDPFuzz (Replicability Study)
Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, and Mathieu Acher
(Simula Research Laboratory, Norway; University of Rennes - Inria - CNRS - IRISA, France)
In recent years, following tremendous achievements in Reinforcement Learning, a great deal of interest has been devoted to ML models for sequential decision-making. Together with these scientific breakthroughs, research has been conducted to develop automated functional testing methods for finding faults in black-box Markov decision processes. In 2022, Pang et al. presented a black-box fuzz testing framework called MDPFuzz. The method consists of a fuzzer whose main feature is to use Gaussian Mixture Models (GMMs) to compute the coverage of test inputs as the likelihood of having already observed their results. This guidance through coverage evaluation aims at favoring novelty during testing and fault discovery in the decision model.
Pang et al. evaluated their work on four use cases by comparing the number of failures found after twelve-hour testing campaigns with and without the guidance of the GMMs (an ablation study). In this paper, we verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We re-implemented the proposed methodology and evaluated our replication in a large-scale study that extends the original four use cases with three new ones. Furthermore, we compare MDPFuzz and its ablated counterpart with a random testing baseline. We also assess the effectiveness of coverage guidance for different parameters, something that was not done in the original evaluation. Despite this parameter analysis, and contrary to Pang et al.'s original conclusions, we find that in most cases the aforementioned ablated fuzzer outperforms MDPFuzz, and conclude that the proposed coverage model does not lead to finding more faults.
Artifacts Available
AutoCodeRover: Autonomous Program Improvement
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury
(National University of Singapore, Singapore)
Researchers have made significant progress in automating the software development process in the past decades. Automated techniques for issue summarization, bug reproduction, fault localization, and program repair have been built to ease the workload of developers. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves program improvement apart from coding, specifically software maintenance (e.g., program repair to fix bugs) and software evolution (e.g., feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach, called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (the abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes and methods to enhance the LLM’s understanding of the issue’s root cause and effectively retrieve a context via iterative search. The use of spectrum-based fault localization with tests further sharpens the context, as long as a test suite is available. Experiments on the recently proposed SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub issues (19% on SWE-bench-lite), which is higher than the efficacy recently reported for SWE-agent. Interestingly, our approach resolved 57 GitHub issues in about 4 minutes each (pass@1), whereas developers spent more than 2.68 days on average. In addition, AutoCodeRover achieved this efficacy at significantly lower cost (on average, $0.43 USD) compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in the future, auto-generated code from LLMs can be autonomously improved.
Practitioners’ Expectations on Automated Test Generation
Xiao Yu, Lei Liu, Xing Hu, Jacky Keung, Xin Xia, and David Lo
(Huawei, China; Xi’an Jiaotong University, China; Zhejiang University, China; City University of Hong Kong, China; Singapore Management University, Singapore)
Empirical Study of Move Smart Contract Security: Introducing MoveScan for Enhanced Analysis
Shuwei Song, Jiachi Chen, Ting Chen, Xiapu Luo, Teng Li, Wenwu Yang, Leqing Wang, Weijie Zhang, Feng Luo, Zheyuan He, Yi Lu, and Pan Li
(University of Electronic Science and Technology of China, China; Sun Yat-sen University, China; Hong Kong Polytechnic University, China; Jiangsu University of Science and Technology, China; BitsLab, Singapore; MoveBit, China)
Testing Gremlin-Based Graph Database Systems via Query Disassembling
Yingying Zheng, Wensheng Dou, Lei Tang, Ziyu Cui, Yu Gao, Jiansen Song, Liang Xu, Jiaxin Zhu, Wei Wang, Jun Wei, Hua Zhong, and Tao Huang
(Institute of Software at Chinese Academy of Sciences, China; Jinling Institute of Technology, China)
Logos: Log Guided Fuzzing for Protocol Implementations
Feifan Wu, Zhengxiong Luo, Yanyang Zhao, Qingpeng Du, Junze Yu, Ruikang Peng, Heyuan Shi, and Yu Jiang
(Tsinghua University, China; Beijing University of Posts and Telecommunications, China; Central South University, China)
Network protocols are extensively used in a variety of network devices, making the security of their implementations crucial. Protocol fuzzing has shown promise in uncovering vulnerabilities in these implementations. However, traditional methods often require instrumentation of the target implementation to provide guidance, which is intrusive, adds overhead, and can hinder black-box testing. This paper presents Logos, a protocol fuzzer that utilizes non-intrusive runtime log information for fuzzing guidance. Logos first standardizes the unstructured logs and embeds them into a high-dimensional vector space for semantic representation. Then, Logos filters the semantic representations and dynamically maintains a semantic coverage to chart the explored space for customized guidance. We evaluate Logos on eight widely used implementations of well-known protocols. Results show that, compared to existing intrusive or expert knowledge-driven protocol fuzzers, Logos achieves 26.75%-106.19% higher branch coverage within 24 hours. Furthermore, Logos exposed 12 security-critical vulnerabilities in these prominent protocol implementations, with 9 CVEs assigned.
Tool Demonstrations
SeeWasm: An Efficient and Fully-Functional Symbolic Execution Engine for WebAssembly Binaries
Ningyu He, Zhehao Zhao, Hanqin Guan, Jikai Wang, Shuo Peng, Ding Li, Haoyu Wang, Xiangqun Chen, and Yao Guo
(Peking University, China; Huazhong University of Science and Technology, China)
Doctoral Symposium