ISSTA 2024
33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)

Powered by

Conference Publishing Consulting

33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024), September 16–20, 2024, Vienna, Austria

ISSTA 2024 – Preliminary Table of Contents

Contents - Abstracts - Authors

Frontmatter

Title Page

Message from the Chairs

Committees

Sponsors

Papers Round 1

Detecting Build Dependency Errors in Incremental Builds
Jun Lyu

, Shanshan Li

, He Zhang

, Yang Zhang

, Guoping Rong

, and Manuel Rigger

(Nanjing University, China; National University of Singapore, Singapore)
Incremental and parallel builds performed by build tools such as Make are the heart of modern C/C++ software projects. Their correct and efficient execution depends on build scripts. However, build scripts are prone to errors. The most prevalent errors are missing dependencies (MDs) and redundant dependencies (RDs). The state-of-the-art methods for detecting these errors rely on clean builds (i.e., full builds of a subset of software configurations in a clean environment), which is costly and takes up to a few hours for large-scale projects. To address these challenges, we propose a novel approach called EChecker to detect build dependency errors in the context of incremental builds. The core idea of EChecker is to automatically update actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes from new commits, which avoids clean builds when possible. EChecker achieves higher efficiency than the methods that rely on clean builds while maintaining effectiveness. We selected 12 representative projects, with their sizes ranging from small to large, with 240 commits (20 commits for each project), based on which we evaluated the effectiveness and efficiency of EChecker. We compared the evaluation results with a state-of-the-art build dependency error detection tool. The evaluation shows that the F-1 score of EChecker improved by 0.18 over the state-of-the-art method. EChecker increases the build dependency error detection efficiency by an average of 85.14 times (with a median of 16.30 times). The results demonstrate that EChecker can support practitioners in detecting build dependency errors efficiently.

Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs
Shiwen Shan

, Yintong Huo

, Yuxin Su

, Yichen Li

, Dan Li

, and Zibin Zheng

(Sun Yat-sen University, China; Chinese University of Hong Kong, China)
Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs in localizing configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end-users to localize the root-cause configuration properties based on logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end-users in coping with configuration errors through log analysis.
To the best of our knowledge, this is the first work to localize the root-cause configuration properties for end-users based on Large Language Models (LLMs) and logs. We evaluate the proposed strategy on Hadoop by LogConfigLocalizer and prove its efficiency with an average accuracy as high as 99.91%. Additionally, we also demonstrate the effectiveness and necessity of different phases of the methodology by comparing it with two other variants and a baseline tool. Moreover, we validate the proposed methodology through a practical case study to demonstrate its effectiveness and feasibility.

FastLog: An End-to-End Method to Efficiently Generate and Insert Logging Statements
Xiaoyuan Xie

, Zhipeng Cai

, Songqiang Chen

, and Jifeng Xuan

(Wuhan University, China; Hong Kong University of Science and Technology, China)
Logs play a crucial role in modern software systems, serving as a means for developers to record essential information for future software maintenance. As the performance of these log-based maintenance tasks heavily relies on the quality of logging statements, various works have been proposed to assist developers in writing appropriate logging statements. However, these works either only support developers in partial sub-tasks of this whole activity; or perform with a relatively high time cost and may introduce unwanted modifications. To address their limitations, we propose FastLog, which can support the complete logging statement generation and insertion activity, in a very speedy manner. Specifically, given a program method, FastLog first predicts the insertion position in the finest token level, and then generates a complete logging statement to insert. We further use text splitting for long input texts to improve the accuracy of predicting where to insert logging statements. A comprehensive empirical analysis shows that our method outperforms the state-of-the-art approach in both efficiency and output quality, which reveals its great potential and practicality in current real-time intelligent development environments.

FortifyPatch: Towards Tamper-Resistant Live Patching in Linux-Based Hypervisor
Zhenyu Ye

, Lei Zhou

, Fengwei Zhang

, Wenqiang Jin

, Zhenyu Ning

, Yupeng Hu

, and Zheng Qin

(Hunan University, China; National University of Defense Technology, China; Southern University of Science and Technology, China; Xinchuang Haihe Laboratory, China)
Linux-based hypervisors in the cloud server suffer from an increasing number of vulnerabilities in the Linux kernel.To address these vulnerabilities in a timely manner while avoiding the economic loss caused by unplanned shutdowns, live patching schemes have been developed. Unfortunately, existing live patching solutions have failed to protect patches from post-deployment attacks. In addition, patches that involve changes to global variables can lead to practical issues with existing solutions. To address these problems, we present FortifyPatch, a tamper-resistant live patching solution for Linux-based hypervisors in cloud environments. Specifically, FortifyPatch employs multiple Granule Protection Tables from Arm Confidential Computing Architecture to protect the integrity of deployed patches. TrustZone Address Space Controller and Performance Monitor Unit are used to prevent the bypassing of the Patch via kernel code protection and timely page table verification. FortifyPatch is also able to patch global variables via well-designed data access traps.We prototype FortifyPatch and evaluate it using real-world CVE patches. The result shows that FortifyPatch is capable of deploying 81.5% of CVE patches. The performance evaluation indicates that FortifyPatch protects deployed patches with 0.98% and 3.1% overhead on average across indicative benchmarks and real-world applications, respectively.

Unimocg: Modular Call-Graph Algorithms for Consistent Handling of Language Features
Dominik Helm

, Tobias Roth

, Sven Keidel

, Michael Reif

, and Mira Mezini

(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; CQSE, Germany; hessian.AI, Germany)
Traditional call-graph construction algorithms conflate the computation of possible runtime types with the actual resolution of (virtual) calls. This tangled design impedes supporting complex language features and APIs and making systematic trade-offs between precision, soundness, and scalability. It also impedes implementation of precise downstream analyses that rely on type information.
To address the problem, we propose Unimocg, a modular architecture for call-graph construction that decouples the computation of type information from resolving calls. Due to its modular design, Unimocg can combine a wide range of different call-graph algorithms with algorithm-agnostic modules to support individual language features. Moreover, these modules operate at the same precision as the chosen call-graph algorithm with no further effort. Additionally, Unimocg allows other analyses to easily reuse type information from the call-graph construction at full precision.
We demonstrate how Unimocg enables a framework of call-graph algorithms with different precision, soundness, and scalability trade-offs from reusable modules. Unimocg currently supports ten call-graph algorithms from vastly different families, such as CHA, RTA, XTA, and k-l-CFA. These algorithms show consistent soundness without sacrificing precision or performance. We also show how an immutability analysis is improved using Unimocg.

Precise Compositional Buffer Overflow Detection via Heap Disjointness
Yiyuan Guo

, Peisen Yao

, and Charles Zhang

(Hong Kong University of Science and Technology, China; Zhejiang University, China)
Static analysis techniques for buffer overflow detection still struggle with being scalable for millions of lines of code, while being precise enough to have an acceptable false positive rate. The checking of buffer overflow necessitates reasoning about the heap reachability and numerical relations, which are mutually dependent. Existing techniques to resolve the dependency cycle either sacrifice precision or efficiency due to their limitations in reasoning about symbolic heap location, i.e., heap location with possibly symbolic numerical offsets. A symbolic heap location potentially aliases a large number of other heap locations, leading to a disjunction of heap states that is particularly challenging to reason precisely. Acknowledging the inherent difficulties in heap and numerical reasoning, we introduce a disjointness assumption into the analysis by shrinking the program state space so that all the symbolic locations involved in memory accesses are disjoint from each other. The disjointness property permits strong updates to be performed at symbolic heap locations, significantly improving the precision by incorporating numerical information in heap reasoning. Also, it aids in the design of a compositional analysis to boost scalability, where compact and precise function summaries are efficiently generated and reused. We implement the idea in the static buffer overflow detector Cod. When applying it to large, real-world software such as PHP and QEMU, we have uncovered 29 buffer overflow bugs with a false positive rate of 37%, while projects of millions of lines of code can be successfully analyzed within four hours.

Enhancing ROS System Fuzzing through Callback Tracing
Yuheng Shen

, Jianzhong Liu

, Yiru Xu

, Hao Sun

, Mingzhe Wang

, Heyuan Shi

, Nan Guan

, and Yu Jiang

(Tsinghua University, China; ETH Zurich, Switzerland; Central South University, China; City University of Hong Kong, China)
The Robot Operating System 2 (ROS) is the de-facto standard for robotic software development, with a wide application in diverse safety-critical domains. There are many efforts in testing that seek to deliver a more secure ROS codebase. However, existing testing methods are often inadequate to capture the complex and stateful behaviors inherent to ROS deployments, resulting in limited test- ing effectiveness. In this paper, we propose R2D2, a ROS system fuzzer that leverages ROS’s runtime states as guidance to increase fuzzing effectiveness and efficiency. Unlike traditional fuzzers, R2D2 employs a systematic instrumentation strategy that captures the system’s runtime behaviors and profiles the current system state in real-time. This approach provides a more in-depth understanding of system behaviors, thereby facilitating a more insightful explo- ration of ROS’s extensive state space. For evaluation, we applied it to four well-known ROS applications. Our evaluation shows that R2D2 achieves an improvement of 3.91× and 2.56× in code coverage compared to state-of-the-art ROS fuzzers, including Ros2Fuzz and RoboFuzz, while also uncovering 39 previously unknown vulnera- bilities, with 6 fixed in both ROS runtime and ROS applications. For its runtime overhead, R2D2 maintains an average execution and memory usage overhead with 10.4% and 1.0% in respect, making R2D2 effective in ROS testing.

API Misuse Detection via Probabilistic Graphical Model
Yunlong Ma

, Wentong Tian

, Xiang Gao

, Hailong Sun

, and Li Li

(Beihang University, China)
API misuses can cause a range of issues in software development, including program crashes, bugs, and vulnerabilities. Different approaches have been developed to automatically detect API misuses by checking the program against usage rules extracted from extensive codebase or API documents. However, these mined rules may not be precise or complete, leading to high false positive/negative rates. In this paper, we propose a novel solution to this problem by representing the mined API usage rules as a probabilistic graphical model, where each rule's probability value represents its trustworthiness of being correct. Our approach automatically constructs probabilistic usage rules by mining codebase and documents, and aggregating knowledge from different sources. Here, the usage rules obtained from the codebase initialize the probabilistic model, while the knowledge from the documents serves as a supplement for adjusting and complementing the probabilities accordingly. We evaluate our approach on the MuBench benchmark. Experimental results show that our approach achieves 42.0% precision and 54.5% recall, significantly outperforming state-of-the-art approaches.

Ma11y: A Mutation Framework for Web Accessibility Testing
Mahan Tafreshipour

, Anmol Deshpande

, Forough Mehralian

, Iftekhar Ahmed

, and Sam Malek

(University of California at Irvine, Irvine, USA)
Despite the availability of numerous automatic accessibility testing solutions, web accessibility issues persist on many websites. Moreover, there is a lack of systematic evaluations of the efficacy of current accessibility testing tools. To address this gap, we present the first mutation analysis framework, called Ma11y, designed to assess web accessibility testing tools. Ma11y includes 25 mutation operators that intentionally violate various accessibility principles and an automated oracle to determine whether a mutant is detected by a testing tool. Evaluation on real-world websites demonstrates the practical applicability of the mutation operators and the framework’s capacity to assess tool performance. Our results demonstrate that the current tools cannot identify nearly 50% of the accessibility bugs injected by our framework, thus underscoring the need for the development of more effective accessibility testing tools. Finally, the framework’s accuracy and performance attest to its potential for seamless and automated application in practical settings.

Total Recall? How Good Are Static Call Graphs Really?
Dominik Helm

, Sven Keidel

, Anemone Kampkötter

, Johannes Düsing

, Tobias Roth

, Ben Hermann

, and Mira Mezini

(TU Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE, Germany; TU Dortmund, Germany; hessian.AI, Germany)
Static call graphs are a fundamental building block of program analysis. However, differences in call-graph construction and the use of specific language features can yield unsoundness and imprecision. Call-graph analyses are evaluated using measures of precision and recall, but this is hard when a ground truth for real-world programs is generally unobtainable.
In this work, we propose to use carefully constructed dynamic baselines based on fixed entry points and input corpora. The creation of this dynamic baseline is posed as an approximation of the ground truth---an optimization problem. We use manual extension and coverage-guided fuzzing for creating suitable input corpora.
With these dynamic baselines, we study call-graph quality of multiple algorithms and implementations using four real-world Java programs. We find that our methodology provides valuable insights into call-graph quality and how to measure it. With this work, we provide a novel methodology to advance the field of static program analysis as we assess the computation of one of its core data structures---the call graph.

Artifacts Available

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios
Zhengran Zeng

, Yidong Wang

, Rui Xie

, Wei Ye

, and Shikun Zhang

(Peking University, China)
In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail to capture the multi-tasking nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining the effects of continued pre-training in specific programming languages code and instruction fine-tuning on their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution in the specific programming languages continued pre-training and instruction fine-tuning, as these techniques could hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluations of programming capabilities in LLMs, and our study provides valuable insights for the future development of these models in software engineering.

DAppFL: Just-in-Time Fault Localization for Decentralized Applications in Web3
Zhiying Wu

, Jiajing Wu

, Hui Zhang

, Ziwei Li

, Jiachi Chen

, Zibin Zheng

, Qing Xia

, Gang Fan

, and Yi Zhen

(Sun Yat-sen University, China; Institute of Software at Chinese Academy of Sciences, China; n.n., China)
Web3 describes an idea for the next evolution of the Internet, where blockchain technology enables the Internet of Value. As Web3 software, decentralized applications (DApps) have emerged in recent years. There exists a natural link between DApps and cryptocurrencies, where faults in DApps could directly lead to monetary losses associated with cryptocurrencies. Hence, efficient fault localization technology is of paramount importance for urgent DApp rescue operations and the mitigation of financial losses. However, fault localization methods applied in traditional applications are not well-suited for this specific field, due to their inability to identify DApp-specific fault features, e.g., a substantial amount of cryptocurrency is transferred from DApps to hackers. In order to explore the root cause of DApp faults, some researchers try to identify suspicious code snippets through mutation testing. Nonetheless, applying mutation testing for DApp fault localization is time-consuming and thus limited in practice. This paper conducts the first comprehensive study of DApp fault localization. We introduce DAppFL, a learning-based DApp fault localization tool that performs reverse engineering to gather executed source code and then trace cryptocurrency flow to assist in locating faulty functions. We also present the inaugural dataset for DApp fault localization, providing a new benchmark for this domain.Our experimental results demonstrate that DAppFL locates 63% of faults within the Top-5, 23% more than the state-of-the-art method. To facilitate further research, our code and dataset are freely available online: https://github.com/xplanet-sysu/awesome-works#dappfl.

CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Hao Wang

, Zeyu Gao

, Chao Zhang

, Mingyang Sun

, Yuchen Zhou

, Han Qiu

, and Xi Xiao

(Tsinghua University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently, which mostly are embedding-based, but have shown limited accuracy and efficiency especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overheads. Specifically, CEBin utilizes a refined embedding-based approach to extract features of target code, which efficiently narrows down the scope of candidate similar code and boosts performance. Then, it utilizes a comparison-based approach that performs a pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin is able to provide an effective and efficient solution for detecting similar code (including vulnerable ones) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in real world, we construct a large-scale benchmark of vulnerability, offering the first precise evaluation scheme to assess BCSD methods for the 1-day vulnerability detection task. CEBin could identify the similar function from millions of candidate functions in just a few seconds and achieves an impressive recall rate of 85.46% on this more practical but challenging task, which are several order of magnitudes faster and 4.07× better than the best SOTA baseline.

Interprocedural Path Complexity Analysis
Mira Kaniyur

, Ana Cavalcante-Studart

, Yihan Yang

, Sangeon Park

, David Chen

, Duy Lam

, and Lucas Bang

(Harvey Mudd College, USA)
Software testing techniques like symbolic execution face significant challenges with path explosion. Asymptotic Path Complexity (APC) quantifies this path explosion complexity, but existing APC methods do not work for interprocedural functions in general. Our new algorithm, APC-IP, efficiently computes APC for a wider range of functions, including interprocedural ones, improving over previous methods in both speed and scope. We implement APC-IP atop the existing software Metrinome, and test it against a benchmark of C functions, comparing it to existing and baseline approaches as well as comparing it to the path explosion of the symbolic execution engine Klee. The results show that APC-IP not only aligns with previous APC values but also excels in performance, scalability, and handling complex source code. It also provides a complexity prediction of the number of paths explored by Klee, extending the APC metric's applicability and surpassing previous implementations.

Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models
Mingyi Zhou

, Xiang Gao

, Pei Liu

, John Grundy

, Chunyang Chen

, Xiao Chen

, and Li Li

(Monash University, Australia; Beihang University, China; CSIRO’s Data61, Australia; TU Munich, Germany; University of Newcastle, Australia)
Recent studies show that on-device deployed deep learning (DL) models, such as those of Tensor Flow Lite (TFLite), can be easily extracted from real-world applications and devices by attackers to generate many kinds of adversarial and other attacks. Although securing deployed on-device DL models has gained increasing attention, no existing methods can fully prevent these attacks. Traditional software protection techniques have been widely explored. If on-device models can be implemented using pure code, such as C++, it will open the possibility of reusing existing robust software protection techniques. However, due to the complexity of DL models, there is no automatic method that can translate DL models to pure code. To fill this gap, we propose a novel method, CustomDLCoder, to automatically extract on-device DL model information and synthesize a customized executable program for a wide range of DL models. CustomDLCoder first parses the DL model, extracts its backend computing codes, configures the extracted codes, and then generates a customized program to implement and deploy the DL model without explicit model representation. The synthesized program hides model information for DL deployment environments since it does not need to retain explicit model representation, preventing many attacks on the DL model. In addition, it improves ML performance because the customized code removes model parsing and preprocessing steps and only retains the data computing process. Our experimental results show that CustomDLCoder improves model security by disabling on-device model sniffing. Compared with the original on-device platform (i.e., TFLite), our method can accelerate model inference by 21.0% and 24.3% on x86-64 and ARM64 platforms, respectively. Most importantly, it can significantly reduce memory consumption by 68.8% and 36.0% on x86-64 and ARM64 platforms, respectively.

Artifacts Available

UPBEAT: Test Input Checks of Q# Quantum Libraries
Tianmin Hu

, Guixin Ye

, Zhanyong Tang

, Shin Hwei Tan

, Huanting Wang

, Meng Li

, and Zheng Wang

(Northwest University, China; Concordia University, Canada; University of Leeds, United Kingdom; Hefei University of Technology, China)
High-level programming models like Q# significantly simplify the complexity of programming for quantum computing. These models are supported by a set of foundation libraries for code development. However, errors can occur in the library implementation, and one common root cause is the lack of or incomplete checks on properties like values, length, and quantum states of inputs passed to user-facing subroutines. This paper presents Upbeat, a fuzzing tool to generate random test cases for bugs related to input checking in Q# libraries. Upbeat develops an automated process to extract constraints from the API documentation and the developer implemented input-checking statements. It leverages open-source Q# code samples to synthesize test programs. It frames the test case generation as a constraint satisfaction problem for classical computing and a quantum state model for quantum computing to produce carefully generated subroutine inputs to test if the input-checking mechanism is appropriately implemented. Under 100 hours of automated test runs, Upbeat has successfully identified 16 bugs in API implementations and 4 documentation errors. Of these, 14 have been confirmed, and 12 have been fixed by the library developers.

Enhancing Robustness of Code Authorship Attribution through Expert Feature Knowledge
Xiaowei Guo

, Cai Fu

, Juan Chen

, Hongle Liu

, Lansheng Han

, and Wenjin Li

(Huazhong University of Science and Technology, China; NSFOCUS Technologies Group, n.n.)
Code authorship attribution has been an interesting research problem for decades. Recent studies have revealed that existing methods for code authorship attribution suffer from weak robustness. Under the influence of small perturbations added by the attacker, the accuracy of the method will be greatly reduced. As of now, there is no code authorship attribution method capable of effectively handling such attacks. In this paper, we attribute the weak robustness of code authorship attribution methods to dataset bias and argue that this bias can be mitigated through adjustments to the feature learning strategy. We first propose a robust code authorship attribution feature combination framework, which is composed of only simple shallow neural network structures, and introduces controllability for the framework in the feature extraction by incorporating expert knowledge. Experiments show that the framework has significantly improved robustness over mainstream code authorship attribution methods, with an average drop of 23.4% (from 37.8% to 14.3%) in the success rate of targeted attacks and 25.9% (from 46.7% to 20.8%) in the success rate of untargeted attacks. At the same time, it can also achieve results comparable to mainstream code authorship attribution methods in terms of accuracy.

A Large-Scale Empirical Study on Improving the Fairness of Image Classification Models
Junjie Yang

, Jiajun Jiang

, Zeyu Sun

, and Junjie Chen

(Tianjin University, China; Institute of Software at Chinese Academy of Sciences, China)
Fairness has been a critical issue that affects the adoption of deep learning models in real practice. To improve model fairness, many existing methods have been proposed and evaluated to be effective in their own contexts. However, there is still no systematic evaluation among them for a comprehensive comparison under the same context, which makes it hard to understand the performance distinction among them, hindering the research progress and practical adoption of them. To fill this gap, this paper endeavours to conduct the first large-scale empirical study to comprehensively compare the performance of existing state-of-the-art fairness improving techniques. Specifically, we target the widely-used application scenario of image classification, and utilized three different datasets and five commonly-used performance metrics to assess in total 13 methods from diverse categories. Our findings reveal substantial variations in the performance of each method across different datasets and sensitive attributes, indicating over-fitting on specific datasets by many existing methods. Furthermore, different fairness evaluation metrics, due to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing methods and in-processing methods outperform post-processing methods, with pre-processing methods exhibiting the best performance. Our empirical study offers comprehensive recommendations for enhancing fairness in deep learning models. We approach the problem from multiple dimensions, aiming to provide a uniform evaluation platform and inspire researchers to explore more effective fairness solutions via a set of implications.

A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?
Zhihan Jiang

, Jinyang Liu

, Junjie Huang

, Yichen Li

, Yintong Huo

, Jiazhen Gu

, Zhuangbin Chen

, Jieming Zhu

, and Michael R. Lyu

(Chinese University of Hong Kong, China; Sun Yat-sen University, China; Huawei Noah’s Ark Lab, China)
Log data have facilitated various tasks of software development and maintenance, such as testing, debugging and diagnosing. Due to the unstructured nature of logs, log parsing is typically required to transform log messages into structured data for automated log analysis. Given the abundance of log parsers that employ various techniques, evaluating these tools to comprehend their characteristics and performance becomes imperative. Loghub serves as a commonly used dataset for benchmarking log parsers, but it suffers from limited scale and representativeness, posing significant challenges for studies to comprehensively evaluate existing log parsers or develop new methods. This limitation is particularly pronounced when assessing these log parsers for production use. To address these limitations, we provide a new collection of annotated log datasets, denoted Loghub-2.0, which can better reflect the characteristics of log data in real-world software systems. Loghub-2.0 comprises 14 datasets with an average of 3.6 million log lines in each dataset. Based on Loghub-2.0, we conduct a thorough re-evaluation of 15 state-of-the-art log parsers in a more rigorous and practical setting. Particularly, we introduce a new evaluation metric to mitigate the sensitivity of existing metrics to imbalanced data distributions. We are also the first to investigate the granular performance of log parsers on logs that represent rare system events, offering in-depth details for software diagnosis. Accurately parsing such logs is essential, yet it remains a challenge. We believe this work could shed light on the evaluation and design of log parsers in practical settings, thereby facilitating their deployment in production systems.

SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection
Xin-Cheng Wen

, Cuiyun Gao

, Shuzheng Gao

, Yang Xiao

, and Michael R. Lyu

(Harbin Institute of Technology, China; Chinese University of Hong Kong, China; Chinese Academy of Sciences, China)
Recently, there has been a growing interest in automatic software vulnerability detection. Pre-trained model-based approaches have demonstrated superior performance than other Deep Learning (DL)-based approaches in detecting vulnerabilities. However, the existing pre-trained model-based approaches generally employ code sequences as input during prediction, and may ignore vulnerability-related structural information, as reflected in the following two aspects. First, they tend to fail to infer the semantics of the code statements with complex logic such as those containing multiple operators and pointers. Second, they are hard to comprehend various code execution sequences, which is essential for precise vulnerability detection.
To mitigate the challenges, we propose a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework based on the pre-trained models, named . The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on the Abstract Syntax Trees (ASTs).Specifically, comprises three main modules: (1) Comment Tree Construction, which aims at enhancing the model’s ability to infer the semantics of code statements by first incorporating Large Language Models (LLMs) for comment generation and then adding the comment node to ASTs. (2) Structured Natural Language Comment Tree Construction, which aims at explicitly involving code execution sequence by combining the code syntax templates with the comment tree. (3) SCT-Enhanced Representation, which finally incorporates the constructed SCTs for well capturing vulnerability patterns. Experimental results demonstrate that outperforms the best-performing baseline, including the pre-trained model and LLMs, with improvements of 2.96%, 13.47%, and 3.75% in terms of F1 score on the FFMPeg+Qemu, Reveal, and SVulD datasets, respectively. Furthermore, can be applied to different pre-trained models, such as CodeBERT and UniXcoder, yielding the F1 score performance enhancements ranging from 1.37% to 10.87%.

Distance-Aware Test Input Selection for Deep Neural Networks
Zhong Li

, Zhengfeng Xu

, Ruihua Ji

, Tian Zhang

, Linzhang Wang

, and Xuandong Li

(Nanjing University, China)
Deep Neural Network (DNN) testing is one of the common practices to guarantee the quality of DNNs. However, DNN testing in general requires a significant amount of test inputs with oracle information (labels), which can be challenging and resource-intensive to obtain. To relieve this problem, we propose DATIS, a distance-aware test input selection approach for DNNs. Specifically, DATIS adopts a two-step approach for selecting test inputs. In the first step, it selects test inputs based on improved uncertainty scores derived from the distances between the test inputs and their nearest neighbor training samples. In the second step, it further eliminates test inputs that may cover the same faults by examining the distances among the selected test inputs. To evaluate DATIS, we conduct extensive experiments on 8 diverse subjects, taking into account different domains of test inputs, varied DNN structures, and diverse types of test inputs. Evaluation results show that DATIS significantly outperforms 15 baseline approaches in both selecting test inputs with high fault-revealing power and guiding the selection of data for DNN enhancement.

LPR: Large Language Models-Aided Program Reduction
Mengxiao Zhang

, Yongqiang Tian

, Zhenyang Xu

, Yiwen Dong

, Shin Hwei Tan

, and Chengnian Sun

(University of Waterloo, Canada; Hong Kong University of Science and Technology, China; Concordia University, Canada)
Program reduction is a widely used technique to facilitate debugging compilers by automatically minimizing programs that trigger compiler bugs. Existing program reduction techniques are either generic to a wide range of languages (such as Perses and Vulcan) or specifically optimized for one certain language by exploiting language-specific knowledge (e.g., C-Reduce). However, synergistically combining both generality across languages and optimality to a specific language in program reduction is yet to be explored. This paper proposes LPR, the first LLMs-aided technique leveraging LLMs to perform language-specific program reduction for multiple languages. The key insight is to utilize both the language generality of program reducers such as Perses and the languagespecific semantics learned by LLMs. Concretely, language-generic program reducers can efficiently reduce programs into a small size that is suitable for LLMs to process; LLMs can effectively transform programs via the learned semantics to create new reduction opportunities for the language-generic program reducers to further reduce the programs. Our thorough evaluation on 50 benchmarks across three programming languages (i.e., C, Rust and JavaScript) has demonstrated LPR’s practicality and superiority over Vulcan, the state-of-the-art language-generic program reducer. For effectiveness, LPR surpasses Vulcan by producing 24.93%, 4.47%, and 11.71% smaller programs on benchmarks in C, Rust and JavaScript, separately. Moreover, LPR and Vulcan have the potential to complement each other. For the C language for which C-Reduce is optimized, by applying Vulcan to the output produced by LPR, we can attain program sizes that are on par with those achieved by C-Reduce. For efficiency perceived by users, LPR is more efficient when reducing large and complex programs, taking 10.77%, 34.88%, 36.96% less time than Vulcan to finish all the benchmarks in C, Rust and JavaScript, separately.

Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code
Yujia Chen

, Cuiyun Gao

, Zezhou Yang

, Hongyu Zhang

, and Qing Liao

(Harbin Institute of Technology, China; Chongqing University, China)
n the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still face difficulties with long code inputs. This is mainly due to their limited capacity to maintain contextual continuity and memorize the key information over long-range code. To alleviate the difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two innovative memory mechanisms we propose in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a 𝑘NN attention layer to adaptively select the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model’s overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models such as UniXcoder and two code intelligence tasks including API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-training language models.

Define-Use Guided Path Exploration for Better Forced Execution
Dongnan He

, Dongchen Xie

, Yujie Wang

, Wei You

, Bin Liang

, Jianjun Huang

, Wenchang Shi

, Zhuo Zhang

, and Xiangyu Zhang

(Renmin University of China, China; Purdue University, USA)
The evolution of recent malware, characterized by the escalating use of cloaking techniques, poses a significant challenge in the analysis of malware behaviors. Researchers proposed forced execution to penetrate malware’s self-protection mechanisms and expose hidden behaviors, by forcefully setting certain branch outcomes. Existing studies focus on enhancing the forced executor to provide light-weight crash-free execution models. However, insufficient attention has been directed toward the path exploration strategy, an aspect equally crucial to the effectiveness. Linear search employed in state-of-the-art forced execution tools exhibits inherent limitations that lead to unnecessary path exploration and incomplete behavior exposure. In this paper, we propose a novel and practical path exploration strategy that focuses on the coverage of defineuse relations in the subject binary. We develop a fuzzing approach for exploring these define-use relations in a progressive and self-supervised way. Our experimental results show that the proposed solution outperforms the existing forced execution tools in both memory dependence coverage and malware behavior exposure.

C2D2: Extracting Critical Changes for Real-World Bugs with Dependency-Sensitive Delta Debugging
Xuezhi Song

, Yijian Wu

, Shuning Liu

, Bihuan Chen

, Yun Lin

, and Xin Peng

(Fudan University, China; Shanghai Jiao Tong University, China)
Data-driven techniques are promising for automatically locating and fixing bugs, which can reduce enormous time and effort for developers. However, the effectiveness of these techniques heavily relies on the quality and scale of bug datasets. Despite that emerging approaches to automatic bug dataset construction partially provide a solution for scalability, data quality remains a concern. Specifically, it remains a barrier for humans to isolate the minimal set of bug-inducing or bug-fixing changes, known as critical changes. Although delta debugging (DD) techniques are capable of extracting critical changes on benchmark datasets in academia, the efficiency and accuracy are still limited when dealing with real-world bugs, where code change dependencies could be overly complicated. In this paper, we propose C2D2, a novel delta debugging approach for critical change extraction, which estimates the probabilities of dependencies between code change elements. C2D2 considers the probabilities of dependencies and introduces a matrix-based search mechanism to resolve compilation errors (CE) caused by missing dependencies. It also provides hybrid mechanisms for flexibly selecting code change elements during the DD process. Experiments on Defect4J and a real-world regression bug dataset reveal that C2D2 is significantly more efficient than the traditional DD algorithm ddmin with competitive effectiveness, and significantly more effective and more efficient than the state-of-the-art DD algorithm ProbDD. Furthermore, compared to human-isolated critical changes, C2D2 produces the same or better critical change results in 56% cases in Defects4J and 86% cases in the regression dataset, demonstrating its usefulness in automatically extracting critical changes and saving human efforts in constructing large-scale bug datasets with real-world bugs.

FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion
Qi Guo

, Xiaohong Li

, Shangqing Liu

, Ze Tang

, Ruitao Feng

, Junjie Wang

, Jidong Ge

, and Lei Bu

(Tianjin University, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore; Nanjing University, Nanjing, China; Nanjing University, China)
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative approach, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning. Despite their potential, a significant challenge is that the designs of these methods often rely on heuristics, leaving critical questions about what information should be stored or retrieved and how to interpolate such information for augmenting predictions.
To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. While FT2Ra adopts a retrieval-based mechanism, it uniquely adopts a paradigm with a learning rate and multi-epoch retrievals, which is similar to fine-tuning.
We conducted a comprehensive evaluation of FT2Ra in both token-level and line-level code completions. Our findings demonstrate the remarkable effectiveness of FT2Ra when compared to state-of-the-art methods and its potential to genuine fine-tuning. In token-level completion, which represents a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a substantial more than twice increase in Exact Match (EM) performance, indicating the significant advantages of our theoretical analysis. Notably, even when operating without actual fine-tuning, FT2Ra exhibits competitive performance compared to the models with real fine-tuning.

MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing
Tianyi Yang

, Cheryl Lee

, Jiacheng Shen

, Yuxin Su

, Cong Feng

, Yongqiang Yang

, and Michael R. Lyu

(Chinese University of Hong Kong, China; Sun Yat-sen University, China; Huawei Cloud Computing Technology, n.n.)
Microservice resilience, the ability of microservices to recover from failures and continue providing reliable and responsive services, is crucial for cloud vendors. However, the current practice relies on manually configured rules specific to a certain microservice system, resulting in labor-intensity and flexibility issues, given the large scale and high dynamics of microservices. A more labor-efficient and versatile solution is desired. Our insight is that resilient deployment can effectively prevent the dissemination of degradation from system performance metrics to user-aware metrics, and the latter directly affects service quality. In other words, failures in a non-resilient deployment can impact both types of metrics, leading to user dissatisfaction. With this in mind, we propose MicroRes, the first versatile resilience profiling framework for microservices via degradation dissemination indexing. MicroRes first injects failures into microservices and collects available monitoring metrics. Then, it ranks the metrics according to their contributions to the overall service degradation. It produces a resilience index by how much the degradation is disseminated from system performance metrics to user-aware metrics. Higher degradation dissemination indicates lower resilience. We evaluate MicroRes on two open-source and one industrial microservice system. The experiments show MicroRes' efficient and effective resilience profiling of microservices. We also showcase MicroRes' practical usage in production.

Isolation-Based Debugging for Neural Networks
Jialuo Chen

, Jingyi Wang

, Youcheng Sun

, Peng Cheng

, and Jiming Chen

(Zhejiang University, China; University of Manchester, United Kingdom)
Neural networks (NNs) are known to have diverse defects such as adversarial examples, backdoor and discrimination, raising great concerns about their reliability. While NN testing can effectively expose these defects to a significant degree, understanding their root causes within the network requires further examination. In this work, inspired by the idea of debugging in traditional software for failure isolation, we propose a novel unified neuron-isolation-based framework for debugging neural networks, shortly IDNN. Given a buggy NN that exhibits certain undesired properties (e.g., discrimination), the goal of IDNN is to identify the most critical and minimal set of neurons that are responsible for exhibiting these properties. Notably, such isolation is conducted with the objective that by simply ‘freezing’ these neurons, the model’s undesired properties can be eliminated, resulting in a much more efficient model repair compared to computationally expensive retraining or weight optimization as in existing literature. We conduct extensive experiments to evaluate IDNN across a diverse set of NN structures on five benchmark datasets, for solving three debugging tasks, including backdoor, unfairness, and weak class. As a lightweight framework, IDNN outperforms state-of-the-art baselines by successfully identifying and isolating a very small set of responsible neurons, demonstrating superior generalization performance across all tasks. IDNN is publicly available at [1].

Atlas: Automating Cross-Language Fuzzing on Android Closed-Source Libraries
Hao Xiong

, Qinming Dai

, Rui Chang

, Mingran Qiu

, Renxiang Wang

, Wenbo Shen

, and Yajin Zhou

(Zhejiang University, China)
Fuzzing is an effective method for detecting security bugs in software, and there have been quite a few effective works on fuzzing Android. Researchers have developed methods for fuzzing open-source native APIs and Java interfaces on actual Android devices. However, the realm of automatically fuzzing Android closed-source native libraries, particularly on emulators, remains insufficiently explored. There are two key challenges: firstly, the multi-language programming model inherent to Android; and secondly, the absence of a Java runtime environment within the emulator.
To address these challenges, we propose Atlas, a practical automated fuzz framework for Android closed-source native libraries. Atlas consists of an automatic harness generator and a fuzzer containing the necessary runtime environment. The generator uses static analysis techniques to deduce the correct calling sequences and parameters of the native API according to the information from the "native world" and the "Java world". To maximize the practicality of the generated harness, Atlas heuristically optimizes the generated harness. The Fuzzer provides the essential Java runtime environment in the emulator, making it possible to fuzz the Android closed-source native libraries on a multi-core server. We have tested Atlas on 17 pre-installed apps from four Android vendors. Atlas generates 820 harnesses containing 767 native APIs, of which 78% is practical. Meanwhile, Atlas has discovered 74 new security bugs with 16 CVEs assigned. The experiments show that Atlas can efficiently generate high-quality harnesses and find security bugs.

Automating Zero-Shot Patch Porting for Hard Forks
Shengyi Pan

, You Wang

, Xing Hu

, Xin Xia

, and Shanping Li

(Zhejiang University, China; Huawei, China)
Forking is a typical way of code reuse, which provides a simple way for developers to create a variant software (denoted as hard fork) by copying and modifying an existing codebase. Despite of the benefits, forking also leads to duplicate efforts in software maintenance. Developers need to port patches across the hard forks to address similar bugs or implement similar features. Due to the divergence between the source project and the hard fork, patch porting is complicated, which requires an adaption regarding different implementations of the same functionality. In this work, we take the first step to automate patch porting for hard forks under a zero-shot setting. We first conduct an empirical study of the patches ported from Vim to Neovim over the last ten years to investigate the necessities of patch porting and the potential flaws in the current practice. We then propose a large language model (LLM) based approach (namely PPatHF) to automatically port patches for hard forks on a function-wise basis. Specifically, PPatHF is composed of a reduction module and a porting module. Given the pre- and post-patch versions of a function from the reference project and the corresponding function from the target project, the reduction module first slims the input functions by removing code snippets less relevant to the patch. Then, the porting module leverages a LLM to apply the patch to the function from the target project. To better elicit the power of the LLM on patch porting, we design a prompt template to enable efficient in-context learning. We further propose an instruction-tuning based training task to better guide the LLM to port the patch and inject task-specific knowledge. We evaluate PPatHF on 310 Neovim patches ported from Vim. The experimental results show that PPatHF outperforms the baselines significantly. Specifically, PPatHF can correctly port 131 (42.3%) patches and automate 57% of the manual edits required for the developer to port the patch.

DiaVio: LLM-Empowered Diagnosis of Safety Violations in ADS Simulation Testing
You Lu

, Yifan Tian

, Yuyang Bi

, Bihuan Chen

, and Xin Peng

(Fudan University, China)
Simulation testing has been widely adopted by leading companies to ensure the safety of autonomous driving systems (ADSs). Anumber of scenario-based testing approaches have been developed to generate diverse driving scenarios for simulation testing, and demonstrated to be capable of finding safety violations. However, there is no automated way to diagnose whether these violations are caused by the ADS under test and which category these violations belong to. As a result, great effort is required to manually diagnose violations.
To bridge this gap, we propose DiaVio to automatically diagnose safety violations in simulation testing by leveraging large language models (LLMs). It is built on top of a new domain specific language (DSL) of crash to align real-world accident reports described in natural language and violation scenarios in simulation testing. DiaVio fine-tunes a base LLM with real-world accident reports to learn diagnosis capability, and uses the fine-tuned LLM to diagnose violation scenarios in simulation testing. Our evaluation has demonstrated the effectiveness and efficiency of DiaVio in violation diagnosis.

Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation
Zhaoyang Chu

, Yao Wan

, Qian Li

, Yang Wu

, Hongyu Zhang

, Yulei Sui

, Guandong Xu

, and Hai Jin

(Huazhong University of Science and Technology, China; Curtin University, Perth, Australia; Chongqing University, China; UNSW, Sydney, Australia; University of Technology, Sydney, Australia)
Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: ``What would happen to the GNN's decision if we were to alter the code graph into alternative structures?'' Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.

DeFort: Automatic Detection and Analysis of Price Manipulation Attacks in DeFi Applications
Maoyi Xie

, Ming Hu

, Ziqiao Kong

, Cen Zhang

, Yebo Feng

, Haijun Wang

, Yue Xue

, Hao Zhang

, Ye Liu

, and Yang Liu

(Nanyang Technological University, Singapore; Xi'an Jiaotong University, China; MetaTrust Labs, n.n.)
Although Decentralized Finance (DeFi) applications facilitate tamper-proof transactions among multiple anonymous users, since attackers can access the smart contract bytecode directly, vulnerabilities in the transaction mechanism, contract code, or third-party components can be easily exploited to manipulate token prices, leading to financial losses. Since price manipulation often relies on specific states and complex trading sequences, existing detection tools have limitations in addressing this problem. In addition, to swiftly identify the root cause of an attack and implement targeted defense and remediation measures, auditors typically prioritize understanding the methodology behind the attack, emphasizing `how' it occurred rather than simply confirming its existence. To address these problems, this paper presents a novel automatic price manipulation detection and analysis framework, named DeFort, which contains a price manipulation behavior model to guide on-chain detection, multiple price monitoring strategies to detect pools with abnormal token prices, and various profit calculation mechanisms to confirm attacks. Based on behavioral models, DeFort can automatically locate transactions and functions that cause abnormal price fluctuations and identify attackers and victims. Experimental results demonstrate that DeFort can outperform state-of-the-art price manipulation detection methods. Furthermore, after monitoring 441 real-world projects for two months, DeFort successfully detected five price manipulation attacks.

Traceback: A Fault Localization Technique for Molecular Programs
Michael C. Gerten

, James I. Lathrop

, and Myra B. Cohen

(Iowa State University, USA)
Fault localization is essential to software maintenance tasks such as testing and automated program repair. Many fault localization techniques have been developed, the most common of which are spectrum-based. Most techniques have been designed for traditional programming paradigms that map passing and failing test cases to lines or branches of code, hence specialized programming paradigms which utilize different code abstractions may fail to localize well. In this paper, we study fault localization in the context of a class of programs, molecular programs. Recent research has designed automated testing and repair frameworks for these pro- grams but has ignored the importance of fault localization. As we demonstrate, using existing spectrum-based approaches may not provide much information. Instead we propose a novel approach, Traceback, that leverages temporal trace data. In an empirical study on a set of 89 faulty program variants, we demonstrate that Trace- back provides between a 32-90% improvement in localization over reaction-based mapping, a direct translation of spectrum-based localization. We see little difference in parameter tuning of Trace- back when all tests, or only code-based (invariant) tests are used, however the best depth and weight parameters vary when using specification based tests, which can be either functional or meta- morphic. Overall, invariant-based tests provide the best localization results (either alone or in combination with others), followed by metamorphic and then functional tests.

Silent Taint-Style Vulnerability Fixes Identification
Zhongzhen Wen

, Jiayuan Zhou

, Shaohua Wang

, Xing Hu

, Tongtong Xu

, Tian Zhang

, and Xuandong Li

(Nanjing University, China; Huawei, Waterloo, Canada; Central University of Finance and Economics, China; Zhejiang University, China; Huawei, China)
Abstract: The coordinated vulnerability disclosure model, widely adopted in open-source software (OSS) organizations, recommends the silent resolution of vulnerabilities without revealing vulnerability information until their public disclosure. However, the inherently public nature of OSS development leads to security fixes becoming publicly available in repositories weeks before the official disclosure of vulnerabilities. This time gap poses a significant security risk to OSS users, as attackers could discover the fix and exploit vulnerabilities before disclosure. Thus, there is a critical need for OSS users to sense fixes as early as possible to address the vulnerability before any exploitation occurs. In response to this challenge, we introduce EarlyVulnFix, a novel approach designed to identify silent fixes for taint-style vulnerabilities—a persistent class of security weaknesses where attacker-controlled input reaches sensitive operations (sink) without proper sanitization. Leveraging data flow and dependency analysis, our tool distinguishes two types of connections between newly introduced code and sinks, tailored for two common fix scenarios. Our evaluation demonstrates that EarlyVulnFix surpasses state-of-the-art baselines by a substantial margin in terms of F1 score. Furthermore, when applied to the 700 latest commits across seven projects, EarlyVulnFix detected three security fixes before their respective security releases, highlighting its effectiveness in identifying unreported vulnerability fixes in the wild.

Benchmarking Automated Program Repair: An Extensive Study on Both Real-World and Artificial Bugs
Yicheng Ouyang

, Jun Yang

, and Lingming Zhang

(University of Illinois at Urbana-Champaign, USA)
As bugs are inevitable and prevalent in real-world programs, many Automated Program Repair (APR) techniques have been proposed to generate patches for them. However, due to the lack of a standard for evaluating APR techniques, prior works tend to use different settings and benchmarks in evaluation, threatening the trustworthiness of the evaluation results. Additionally, they typically only adopt plausibility and genuineness as evaluation metrics, which may potentially mask some underlying issues in APR techniques. To overcome these issues, in this paper, we conduct an extensive and multi-dimensional evaluation of nine learning-based and three traditional state-of-the-art APR techniques under the same environment and settings. We employ the widely studied Defects4J V2.0.0 benchmark and a newly constructed large-scale mutation-based benchmark named MuBench, derived from Defects4J and including 1,700 artificial bugs generated by various mutators, to uncover potential limitations in these APR techniques. We also apply multi-dimensional metrics, including compilability/plausibility/genuineness metrics, as well as SYE (SYntactic Equivalence) and TCE (Trivial Compiler Equivalence) metrics, to thoroughly analyze the 1,814,652 generated patches. This paper presents noteworthy findings from the extensive evaluation: Firstly, Large Language Model (LLM) based APR demonstrates less susceptibility to overfitting on the Defects4J V1.2.0 dataset and fixes the most number of bugs. Secondly, the study suggests a promising future for combining traditional and learning-based APR techniques, as they exhibit complementary advantages in fixing different types of bugs. Additionally, this work highlights the necessity for further enhancing patch compilability of learning-based APR techniques, despite the presence of various existing strategies attempting to improve it. The study also reveals other guidelines for enhancing APR techniques, including the need for handling unresolvable symbol compilability issues and reducing duplicate/no-op patch generation. Finally, our study uncovers seven implementation issues in the studied techniques, with five of them confirmed and fixed by the corresponding authors.

Artifacts Available

Multi-modal Learning for WebAssembly Reverse Engineering
Hanxian Huang

and Jishen Zhao

(University of California at San Diego, San Diego, USA)
The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives the demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced machine learning (ML)-based WebAssembly reverse engineering tools. Yet, the generalization of task-specific ML solutions remains challenging, because their effectiveness hinges on the availability of an ample supply of high-quality task-specific labeled data. Moreover, previous works trained models only with features extracted from WebAssembly, overlooking the high-level semantics present in the corresponding source code and its documentation. Acknowledging the abundance of available source code with documentation, which can be compiled into WebAssembly, we propose to learn representations of them concurrently and harness their mutual relationships for effective WebAssembly reverse engineering.
In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation and the compiled WebAssembly, without requiring labeled data. WasmRev incorporates three tailored multi-modal pre-training tasks to capture various characteristics of WebAssembly and cross-modal relationships. WasmRev is only trained once to produce general-purpose representations that can broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with much less labeled data, improving data efficiency. We fine-tune WasmRev onto three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization. Our results show that WasmRev pre-trained on the corpus of multi-modal samples establishes a robust foundation for these tasks, achieving high task accuracy and outperforming the state-of-the-art ML methods for WebAssembly reverse engineering.

CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature
Chenyan Liu

, Yufan Cai

, Yun Lin

, Yuhuan Huang

, Yunrui Pei

, Bo Jiang

, Ping Yang

, Jin Song Dong

, and Hong Mei

(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; Bytedance Network Technology, Beijing, China)
Recent years have seen the development of LLM-based code generation. Compared to generating code in a software project, incremental code edits are empirically observed to be more frequent. The emerging code editing approaches usually formulate the problem as generating an edit based on known relevant prior edits and context. However, practical code edits can be more complicated. First, an editing session can include multiple (ir)relevant edits to the code under edit. Second, the inference of the subsequent edits is non-trivial as the scope of its ripple effect can be the whole project. In this work, we propose CoEdPilot, an LLM-driven solution to recommend code edits by discriminating the relevant edits, exploring their interactive natures, and estimating its ripple effect in the project. Specifically, CoEdPilot orchestrates multiple neural transformers to identify what and how to edit in the project regarding both edit location and edit content. When a user accomplishes an edit with an optional editing description, an Subsequent Edit Analysis first reports the most relevant files in the project with what types of edits (e.g., keep, insert, and replace) can happen for each line of their code. Next, an Edit-content Generator generates concrete edit options for the lines of code, regarding its relevant prior changes reported by an Edit-dependency Analyzer. Last, both the Subsequent Edit Analysis and the Edit-content Generator capture relevant prior edits as feedback to readjust their recommendations. We train our models by collecting over 180K commits from 471 open-source projects in 5 programming languages. Our extensive experiments show that (1) CoEdPilot can well predict the edits (i.e., predicting edit location with accuracy of 70.8%-85.3%, and the edit content with exact match rate of 41.8% and BLEU4 score of 60.7); (2) CoEdPilot can well boost existing edit generators such as GRACE and CCT5 on exact match rate by 8.57% points and BLEU4 score by 18.08. Last, our user study on 18 participants with 3 editing tasks (1) shows that CoEdPilot can be effective in assisting users to edit code in comparison with Copilot, and (2) sheds light on the future improvement of the tool design. The video demonstration of our tool is available at https://sites.google.com/view/coedpilot/home.

Automated Deep Learning Optimization via DSL-Based Source Code Transformation
Ruixin Wang

, Minghai Lu

, Cody Hao Yu

, Yi-Hsiang Lai

, and Tianyi Zhang

(Purdue University, USA; BosonAI, n.n.; Amazon Web Services, n.n.)
As deep learning models become increasingly bigger and more complex, it is critical to improve model training and inference efficiency. Though a variety of highly optimized libraries and packages (known as DL kernels) have been developed, it is tedious and time-consuming to figure out which kernel to use, where to use, and how to use them correctly. To address this challenge, we propose an Automated Deep learning OPTimization approach called Adopter. We design a Domain-Specific Language (DSL) to represent DL model architectures and leverage this DSL to specify model transformation rules required to integrate a DL kernel into a model. Given the source code of a DL model and the transformation rules for a set of kernels, Adopter first performs inter-procedural analysis to identify and express the model architecture in our DSL. Then, Adopter performs scope analysis and sub-sequence matching to identify locations in the model architecture where the transformation rules can be applied. Finally, Adopter proposes a synthesis-based code transformation method to apply the transformation rule. We curated a benchmark with 199 models from Hugging Face and a diverse set of DL kernels. We found that, compared to a state-of-the-art automated code transformation technique, Adopter helps improve the precision and recall by 3% and 56%, respectively. An in-depth analysis of 9 models revealed that on average, Adopter improved the training speed by 22.7% while decreasing the GPU memory usage by 10.5%.

Artifacts Available

Evaluating the Effectiveness of Decompilers
Ying Cao

, Runze Zhang

, Ruigang Liang

, and Kai Chen

(Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China)
In software security tasks like malware analysis and vulnerability mining, reverse engineering is pivotal, with C decompilers playing a crucial role in understanding program semantics. However, reverse engineers still predominantly rely on assembly code rather than decompiled code when analyzing complex binaries. This practice underlines the limitations of current decompiled code, which hinders its effectiveness in reverse engineering. Identifying and analyzing the problems of existing decompilers and making targeted improvements can effectively enhance the efficiency of software analysis.
In this study, we systematically evaluate current mainstream decompilers’ semantic consistency and readability. Semantic evaluation results show that the state-of-the-art decompiler Hex-Rays has about 55% accuracy at almost all optimization, which contradicts the common belief among many reverse engineers that decompilers are usually accurate. Readability evaluation indicates that despite years of efforts to improve the readability of the decompiled code, decompilers’ template-based approach still predominantly yields code akin to binary structures rather than human coding patterns. Additionally, our human study indicates that to enhance decompilers’ accuracy and readability, introducing human or compiler-aware strategies like a speculate-verify-correct approach to obtain recompilable decompiled code and iteratively refine it to more closely resemble the original binary, potentially offers a more effective optimization method than relying on static analysis and rule expansion.

CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
Hao Wang

, Zeyu Gao

, Chao Zhang

, Zihan Sha

, Mingyang Sun

, Yuchen Zhou

, Wenyu Zhu

, Wenju Sun

, Han Qiu

, and Xi Xiao

(Tsinghua University, China; Information Engineering University, China; University of Electronic Science and Technology of China, China; Beijing University of Technology, China)
Binary code representation learning has shown significant performance in binary analysis tasks. But existing solutions often have poor transferability, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the tasks. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and get better transferability. At the core, our approach boosts superior transfer learning capabilities by effectively aligning binary code with their semantics explanations (in natural language), resulting a model able to generate better embeddings for binary code. To enable this alignment training, we then propose an efficient dataset engine that could automatically generate a large and diverse dataset comprising of binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. The evaluations of CLAP across various downstream tasks in binary analysis all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability.

FunRedisp: Reordering Function Dispatch in Smart Contract to Reduce Invocation Gas Fees
Yunqi Liu

and Wei Song

(Nanjing University of Science and Technology, China)
Smart contracts mostly written in Solidity are Turing-complete programs executed on the blockchain platforms such as Ethereum. To prevent resource abuse, a gas fee is required when users deploy or invoke smart contracts. Although saving gas consumption has received much attention, no work investigates the effect of function dispatch on the invocation gas consumption. In this paper, after demystifying how the function dispatch affects the invocation gas consumption, we present FunRedisp, a bytecode refactoring method and an open-source tool, to reduce the overall invocation gas consumption of smart contracts. At the source code level, FunRedisp initially identifies hot functions in a smart contract that have a big chance to be invoked, and then move them to the front of the function dispatch at the bytecode level. We implement FunRedisp and evaluate it on 50 real-world smart contracts randomly selected from Ethereum. The experimental results demonstrate that FunRedisp can save approximately 125.17 units of gas per transaction with the compilation overhead increased by only 0.37 seconds.

Artifacts Available

proc time: 1.45