Proceedings of the ACM on Software Engineering, Volume 2, Number FSE,
June 23–27, 2025,
Trondheim, Norway
Frontmatter
Papers
Gleipner: A Benchmark for Gadget Chain Detection in Java Deserialization Vulnerabilities
Bruno Kreyssig and
Alexandre Bartel
(Umeå University, Sweden)
While multiple recent publications on detecting Java deserialization vulnerabilities highlight the increasing relevance of the topic, no proper benchmark has been established so far to evaluate the individual approaches. Hence, it has become increasingly difficult to demonstrate improvements over previous tools and to expose the trade-offs they make. In this work, we synthesize the main challenges in gadget chain detection; more specifically, we unveil the constraints program analysis faces in the context of gadget chain detection. From there, we develop Gleipner: the first synthetic, large-scale, and systematic benchmark to validate the effectiveness of algorithms for detecting gadget chains in the Java programming language. We then benchmark seven previous publications in the field using Gleipner. Our results show that (1) our benchmark provides a transparent, qualitative, and sound measurement of the maturity of gadget chain detection tools, (2) Gleipner alleviates severe benchmarking flaws that were previously common in the field, and (3) state-of-the-art tools still struggle with most challenges in gadget chain detection.
Detecting Smart Contract State-Inconsistency Bugs via Flow Divergence and Multiplex Symbolic Execution
Yinxi Liu,
Wei Meng, and
Yinqian Zhang
(Rochester Institute of Technology, USA; Chinese University of Hong Kong, China; Southern University of Science and Technology, China)
Ethereum smart contracts determine state transition results not only from the previous states, but also from a mutable global state consisting of storage variables. This has resulted in state-inconsistency bugs, which grant an attacker the ability to modify contract states either through recursive function calls to a contract (reentrancy) or by exploiting transaction order dependence (TOD). Current studies have determined that identifying data races on global storage variables can capture all state-inconsistency bugs. Nevertheless, eliminating false positives poses a significant challenge, given the extensive number of execution paths that could potentially cause a data race.
For simplicity, existing research considers a data race to be vulnerable as long as the variable involved could have inconsistent values under different execution orders. However, such a data race could be benign when the inconsistent value does not affect any critical computation or decision-making process in the program. Moreover, the data race could also be infeasible when there is no valid state in the contract that allows the execution of both orders.
In this paper, we aim to appreciably reduce these false positives without introducing false negatives. We present DivertScan, a precise framework to detect exploitable state-inconsistency bugs in smart contracts. We first introduce the use of flow divergence to check where the involved variable may flow to. This allows DivertScan to precisely infer the potential effects of a data race and determine whether it can be exploited to induce unexpected program behaviors. We also propose multiplex symbolic execution to examine different execution orders in a single round of solving. This helps DivertScan determine whether a common starting state could potentially exist. To address the scalability issue in symbolic execution, DivertScan utilizes overapproximated pre-checking and a selective exploration strategy. As a result, it only needs to explore a limited state space.
DivertScan significantly outperformed state-of-the-art tools, improving the precision rate by 20.72% to 74.93% while introducing no false negatives. It also identified five exploitable real-world vulnerabilities that other tools missed. The detected vulnerabilities could potentially lead to a loss of up to $68.2M, based on trading records and rate limits.
MendelFuzz: The Return of the Deterministic Stage
Han Zheng,
Flavio Toffalini,
Marcel Böhme, and
Mathias Payer
(EPFL, Switzerland; Ruhr Universität Bochum, Germany; MPI-SP, Germany)
Can a fuzzer cover more code with minimal corruption of the initial seed? Before a seed is fuzzed, the early greybox fuzzers first systematically enumerated slightly corrupted inputs by applying every mutation operator to every part of the seed, once per generated input. The hope of this so-called “deterministic” stage was that simple changes to the seed would be less likely to break the complex file format; the resulting inputs would find bugs in the program logic well beyond the program’s parser. However, when experiments showed that disabling the deterministic stage, and instead applying multiple mutation operators at the same time to a single input, achieves more coverage, most fuzzers disabled the deterministic stage by default. Instead of ignoring the deterministic stage, we analyze its potential and substantially improve deterministic exploration. Our deterministic stage is now the default in AFL++, reverting the earlier decision of dropping deterministic exploration. We start by investigating the overhead and the contribution of the deterministic stage to the discovery of coverage-increasing inputs. While the sheer number of generated inputs explains the overhead, we find that only a few critical seeds (20%) and only a few critical bytes in a seed (0.5%) are responsible for the vast majority of the coverage-increasing inputs (83% and 84%, respectively). Hence, we develop an efficient technique, called MendelFuzz, to identify these critical seeds and bytes so as to prune a large number of unnecessary inputs. MendelFuzz retains the benefits of the classic deterministic stage by only enumerating a tiny part of the total deterministic state space. We evaluate our MendelFuzz implementation on two benchmarking frameworks, FuzzBench and Magma. Our evaluation shows that MendelFuzz outperforms state-of-the-art fuzzers with and without the (old) deterministic stage enabled, both in terms of coverage and bug finding. MendelFuzz also discovered 8 new CVEs on exhaustively fuzzed security-critical applications. Finally, MendelFuzz has been independently evaluated and integrated into AFL++ as the default option.
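For readers unfamiliar with the classic deterministic stage discussed above, the following minimal sketch (hypothetical, written only for illustration; not the AFL++ or MendelFuzz code) enumerates one slightly corrupted input per mutation operator and seed position, which is exactly the exhaustive behavior whose cost the paper then prunes down to critical seeds and bytes.

    # Illustrative "deterministic stage": one input per (operator, position) pair.
    def bit_flips(seed: bytes):
        """Flip every single bit of the seed, one generated input per bit."""
        for i in range(len(seed) * 8):
            buf = bytearray(seed)
            buf[i // 8] ^= 1 << (i % 8)
            yield bytes(buf)

    def arith_add(seed: bytes, max_delta: int = 4):
        """Add small deltas to every byte, one input per (position, delta)."""
        for pos in range(len(seed)):
            for delta in range(1, max_delta + 1):
                buf = bytearray(seed)
                buf[pos] = (buf[pos] + delta) & 0xFF
                yield bytes(buf)

    def deterministic_stage(seed: bytes):
        yield from bit_flips(seed)
        yield from arith_add(seed)

    seed = b"PNG\x0d"
    # 4 bytes * 8 bit flips + 4 bytes * 4 deltas = 48 deterministic inputs.
    print(len(list(deterministic_stage(seed))))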
SmartShot: Hunt Hidden Vulnerabilities in Smart Contracts using Mutable Snapshots
Ruichao Liang,
Jing Chen,
Ruochen Cao,
Kun He,
Ruiying Du,
Shuhua Li,
Zheng Lin, and
Cong Wu
(Wuhan University, China; University of Hong Kong, Hong Kong)
Smart contracts, as Turing-complete programs managing billions of assets in decentralized finance, are prime targets for attackers. While fuzz testing seems effective for detecting vulnerabilities in these programs, we identify several significant challenges when targeting smart contracts: (i) the stateful nature of these contracts requires stateful exploration, but current fuzzers rely on transaction sequences to manipulate contract states, making the process inefficient; (ii) contract execution is influenced by the continuously changing blockchain environment, yet current fuzzers are limited to local deployments, failing to test contracts in real-world scenarios. These challenges hinder current fuzzers from uncovering hidden vulnerabilities, i.e., those concealed in deep contract states and specific blockchain environments. In this paper, we present SmartShot, a mutable snapshot-based fuzzer to hunt hidden vulnerabilities within smart contracts. We innovatively formulate contract states and blockchain environments as directly fuzzable elements and design mutable snapshots to quickly restore and mutate these elements. SmartShot features a symbolic taint analysis-based mutation strategy along with double validation to soundly guide the state mutation. SmartShot mutates blockchain environments using a contract’s historical on-chain states, providing real-world execution contexts. We propose a snapshot checkpoint mechanism to integrate mutable snapshots into SmartShot’s fuzzing loops. These innovations enable SmartShot to effectively fuzz contract states, test contracts across varied and realistic blockchain environments, and support on-chain fuzzing. Experimental results show that SmartShot is effective in detecting hidden vulnerabilities, achieving the highest code coverage and lowest false positive rate. SmartShot is 4.8× to 20.2× faster than state-of-the-art tools, identifying 2,150 vulnerable contracts out of 42,738 real-world contracts, which is 2.1× to 13.7× more than other tools. SmartShot has demonstrated its real-world impact by detecting vulnerabilities that are only discoverable on-chain and uncovering 24 0-day vulnerabilities in the latest 10,000 deployed contracts.
On-Demand Scenario Generation for Testing Automated Driving Systems
Songyang Yan,
Xiaodong Zhang,
Kunkun Hao,
Haojie Xin,
Yonggang Luo,
Jucheng Yang,
Ming Fan,
Chao Yang,
Jun Sun, and
Zijiang Yang
(Xi'an Jiaotong University, China; Xidian University, China; Synkrotron, China; Chongqing Changan Automobile, China; Singapore Management University, Singapore; University of Science and Technology of China, China)
The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS's reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulator using various ADSs. We verify OSG's ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels.
Element-Based Automated DNN Repair with Fine-Tuned Masked Language Model
Xu Wang,
Mingming Zhang,
Xiangxin Meng,
Jian Zhang,
Yang Liu, and
Chunming Hu
(Beihang University, China; Zhongguancun Laboratory, China; Ministry of Education, China; Nanyang Technological University, Singapore)
Deep Neural Networks (DNNs) are prevalent across a wide range of applications. Despite their success, the complexity and opaque nature of DNNs pose significant challenges in debugging and repairing DNN models, limiting their reliability and broader adoption. In this paper, we propose MLM4DNN, an element-based automated DNN repair method. Unlike previous techniques that focus on post-training adjustments or rely heavily on predefined bug patterns, MLM4DNN repairs DNNs by leveraging a fine-tuned Masked Language Model (MLM) to predict correct fixes for nine predefined key elements in DNNs. We construct a large-scale dataset by masking nine key elements from the correct DNN source code and then force the MLM to restore the correct elements to learn the deep semantics that ensure the normal functionalities of DNNs. Afterwards, a light-weight static analysis tool is designed to filter out low-quality patches to enhance the repair efficiency. We introduce a patch validation method specifically for DNN repair tasks, which consists of three evaluation metrics from different aspects to model the effectiveness of generated patches. We construct a benchmark, BenchmarkAPR4DNN, including 51 buggy DNN models and an evaluation tool that outputs the three metrics. We evaluate MLM4DNN against six baselines on BenchmarkAPR4DNN, and results show that MLM4DNN outperforms all state-of-the-art baselines, including two dynamic-based and four zero-shot learning-based methods. After applying the fine-tuned MLM design to several prevalent Large Language Models (LLMs), we consistently observe improved performance in DNN repair tasks compared to the original LLMs, which demonstrates the effectiveness of the method proposed in this paper.
Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM
Susheng Wu,
Ruisi Wang,
Yiheng Cao,
Bihuan Chen,
Zhuotong Zhou,
Yiheng Huang,
JunPeng Zhao, and
Xin Peng
(Fudan University, China)
Branching repositories facilitates efficient software development but can also inadvertently propagate vulnerabilities. When an original branch is patched, other unfixed branches remain vulnerable unless the patch is successfully ported. However, due to inherent discrepancies between branches, many patches cannot be directly applied and require manual intervention, which is time-consuming and leads to delays in patch porting, increasing vulnerability risks. Existing automated patch porting approaches are prone to errors, as they often overlook the essential semantic and syntactic context of the vulnerability and fail to detect or refine faulty patches.
We propose Mystique, a novel LLM-based approach to address these limitations. Mystique first slices the semantically related statements linked to the vulnerability while ensuring syntactic correctness, allowing it to extract the signatures of both the original patched function and the target vulnerable function. Mystique then utilizes a fine-tuned LLM to generate a fixed function, which is further iteratively checked and refined to ensure successful porting. Our evaluation shows that Mystique achieved a success rate of 0.954 at the function level and 0.924 at the CVE level, outperforming state-of-the-art approaches by at least 13.2% at the function level and 12.3% at the CVE level. Our evaluation also demonstrates Mystique’s superior generality across various projects, bugs, and programming languages. Mystique successfully ported patches for 34 real-world vulnerable branches.
Artifacts Available
Smart Contract Fuzzing Towards Profitable Vulnerabilities
Ziqiao Kong,
Cen Zhang,
Maoyi Xie,
Ming Hu,
Yue Xue,
Ye Liu,
Haijun Wang, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; MetaTrust Labs, Singapore; Xi'an Jiaotong University, China)
Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: 1. a lack of profit-centric techniques for expediting detection, and 2. insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects these profitable vulnerabilities but also maximizes the exploited profits. VERITE has three key features: 1. DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2. identification criteria for potentially profitable candidates, which check whether an input has caused abnormal fund flow properties during testing; 3. a gradient descent-based profit maximization strategy for these identified candidates. VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average loss of over 1.1 million dollars. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than the state-of-the-art fuzzer ItyFuzz in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than the real-world attack exploits (1.01 to 11.45 times more). VERITE has also been applied by auditors in contract auditing, where 6 zero-day vulnerabilities (5 of high severity) were found, earning over $2,500 in bounty rewards.
Preprint
Artifacts Available
CKTyper: Enhancing Type Inference for Java Code Snippets by Leveraging Crowdsourcing Knowledge in Stack Overflow
Anji Li,
Neng Zhang,
Ying Zou,
Zhixiang Chen,
Jian Wang, and
Zibin Zheng
(Sun Yat-sen University, China; Central China Normal University, China; Queen's University, Canada; Wuhan University, China)
Code snippets are widely used in technical forums to demonstrate solutions to programming problems. They can be leveraged by developers to accelerate problem-solving. However, code snippets often lack concrete types for the APIs used in them, which impedes their understanding and reuse. To enhance the description of a code snippet, a number of approaches have been proposed to infer the types of APIs. Although existing approaches can achieve good performance, they are limited by ignoring information outside the input code snippet (e.g., the descriptions of similar code snippets) that could potentially improve performance.
In this paper, we propose a novel type inference approach, named CKTyper, that leverages crowdsourcing knowledge in technical posts. The key idea is to generate a relevant context for a target code snippet from the posts containing similar code snippets and then employ the context to promote type inference with large language models (e.g., ChatGPT). More specifically, we build a crowdsourcing knowledge base (CKB) by extracting code snippets from a large set of posts and index the CKB using Lucene. An API type dictionary is also built from a set of API libraries. Given a code snippet to be inferred, we first retrieve a list of similar code snippets from the indexed CKB. Then, we generate a crowdsourcing knowledge context (CKC) by extracting and summarizing useful content (e.g., API-related sentences) in the posts that contain the similar code snippets. The CKC is subsequently used to improve the type inference of ChatGPT on the input code snippet, and ChatGPT's hallucinations are eliminated by employing the API type dictionary. Evaluation results on two open-source datasets demonstrate the effectiveness and efficiency of CKTyper. CKTyper achieves the best precision/recall of 97.80% and 95.54% on the two datasets, respectively, significantly outperforming three state-of-the-art baselines and ChatGPT.
Ransomware Detection through Temporal Correlation between Encryption and I/O Behavior
Lihua Guo,
Yiwei Hou,
Chijin Zhou,
Quan Zhang, and
Yu Jiang
(Tsinghua University, China)
In recent years, the increase in ransomware attacks has significantly impacted individuals and organizations. Many strategies have been proposed to detect ransomware’s file disruption operations. However, they rely on certain assumptions that gradually fail in the face of evolving ransomware attacks, which use more stealthy encryption methods or benign-imitation-based I/O orchestration. To mitigate this, we propose an approach that detects ransomware attacks through the temporal correlation between encryption and I/O behaviors. Its key idea is that there is a strong temporal correlation inherent in ransomware’s encryption and I/O behaviors: to disrupt files, ransomware must first read the file data from the disk into memory, encrypt it, and then write the encrypted data back to the disk. This process creates a pronounced temporal correlation between the computational load of the encryption operations and the size of the files being encrypted. Based on this invariant, we implement a prototype called RansomRadar and evaluate its effectiveness against 411 of the latest ransomware samples from 89 families. Experimental results show that it achieves a detection rate of 100.00% with 2.68% false alarms. Its F1-score is 96.03, which is 31.82 higher than existing detectors on average.
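The invariant described above can be pictured with the short sketch below (hypothetical per-window numbers and threshold; not the RansomRadar implementation): per time window, a proxy for encryption workload is correlated with the volume of data written back to disk, and a strong positive correlation is treated as ransomware-like behavior.

    import math

    def pearson(xs, ys):
        """Plain Pearson correlation between two equally long series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Hypothetical per-second samples: encryption-work proxy vs. bytes written.
    # Ransomware that reads, encrypts, and rewrites files keeps the two series
    # tightly coupled; benign workloads usually do not.
    crypto_load = [120, 400, 390, 850, 30, 820]
    bytes_written = [4000, 13000, 12500, 28000, 1000, 27000]

    if pearson(crypto_load, bytes_written) > 0.9:  # threshold is illustrative
        print("strong encryption/I-O coupling: ransomware-like behavior")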
DeclarUI: Bridging Design and Development with Automated Declarative UI Code Generation
Ting Zhou,
Yanjie Zhao,
Xinyi Hou,
Xiaoyu Sun,
Kai Chen, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Australian National University, Australia)
Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs) have shown promise in directly generating mobile app code from user interface (UI) designs. However, the direct application of MLLMs to this task is limited by challenges in accurately recognizing UI components and comprehensively capturing interaction logic. To address these challenges, we propose DeclarUI, an automated approach that synergizes computer vision (CV), MLLMs, and iterative compiler-driven optimization to generate and refine declarative UI code from designs. DeclarUI enhances visual fidelity, functional completeness, and code quality through precise component segmentation, Page Transition Graphs (PTGs) for modeling complex inter-page relationships, and iterative optimization. In our evaluation, DeclarUI outperforms baselines on React Native, a widely adopted declarative UI framework, achieving a 96.8% PTG coverage rate and a 98% compilation success rate. Notably, DeclarUI demonstrates significant improvements over state-of-the-art MLLMs, with a 123% increase in PTG coverage rate, up to 55% enhancement in visual similarity scores, and a 29% boost in compilation success rate. We further demonstrate DeclarUI’s generalizability through successful applications to Flutter and ArkUI frameworks. User studies with professional developers confirm that DeclarUI’s generated code meets industrial-grade standards in code availability, modification time, readability, and maintainability. By streamlining app development, improving efficiency, and fostering designer-developer collaboration, DeclarUI offers a practical solution to the persistent challenges in mobile UI development.
COFFE: A Code Efficiency Benchmark for Code Generation
Yun Peng,
Jun Wan,
Yichen Li, and
Xiaoxue Ren
(Chinese University of Hong Kong, China; Zhejiang University, China)
Code generation has greatly improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions from detailed descriptions in natural language. Many research efforts are being devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. Moreover, the current execution time measurement is neither stable nor comprehensive, threatening the validity of the time efficiency evaluation.
To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases that improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on the findings, we draw several implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.
Pinning Is Futile: You Need More Than Local Dependency Versioning to Defend against Supply Chain Attacks
Hao He,
Bogdan Vasilescu, and
Christian Kästner
(Carnegie Mellon University, USA)
Recent high-profile incidents in open-source software have greatly raised practitioners' attention to software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependencies to specific versions rather than floating within version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely.
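To make the pinned-versus-floating trade-off concrete, the sketch below uses a deliberately simplified semver matcher (hypothetical helper functions, not npm's actual resolution algorithm): a caret range silently adopts the newest compatible release, which is how a hijacked update propagates, while a pin keeps an older and possibly vulnerable version.

    # Simplified illustration of pinned vs. floating dependency resolution.
    def parse(v):
        return tuple(int(x) for x in v.split("."))

    def satisfies(version, constraint):
        """Tiny subset of semver: exact pins and caret ranges only."""
        if constraint.startswith("^"):
            base, v = parse(constraint[1:]), parse(version)
            return v >= base and v[0] == base[0]  # same major, not older than base
        return parse(version) == parse(constraint)

    def resolve(constraint, published):
        candidates = [v for v in published if satisfies(v, constraint)]
        return max(candidates, key=parse) if candidates else None

    published = ["1.2.3", "1.3.0", "1.4.0"]     # imagine 1.4.0 is a hijacked release
    print(resolve("1.2.3", published))    # pinned   -> 1.2.3 (stays outdated/vulnerable)
    print(resolve("^1.2.3", published))   # floating -> 1.4.0 (picks up the new release)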
An Empirical Study of Suppressed Static Analysis Warnings
Huimin Hu,
Yingying Wang,
Julia Rubin, and
Michael Pradel
(University of Stuttgart, Germany; University of British Columbia, Canada)
Scalable static analyzers are popular tools for finding incorrect, inefficient, insecure, and hard-to-maintain code early during the development process. Because not all warnings reported by a static analyzer are immediately useful to developers, many static analyzers provide a way to suppress warnings, e.g., in the form of special comments added into the code. Such suppressions are an important mechanism at the interface between static analyzers and software developers, but little is currently known about them. This paper presents the first in-depth empirical study of suppressions of static analysis warnings, addressing questions about the prevalence of suppressions, their evolution over time, the relationship between suppressions and warnings, and the reasons for using suppressions. We answer these questions by studying projects written in three popular languages and suppressions for warnings by four popular static analyzers. Our findings show that (i) suppressions are relatively common, e.g., with a total of 7,357 suppressions in 46 Python projects, (ii) the number of suppressions in a project tends to continuously increase over time, (iii) surprisingly, 50.8% of all suppressions do not affect any warning and hence are practically useless, (iv) some suppressions, including useless ones, may unintentionally hide future warnings, and (v) common reasons for introducing suppressions include false positives, suboptimal configurations of the static analyzer, and misleading warning messages. These results have actionable implications, e.g., that developers should be made aware of useless suppressions and the potential risk of unintentionally suppressing future warnings, that static analyzers should provide better warning messages, and that static analyzers should separately categorize warnings from third-party libraries.
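For illustration, the suppressions studied here are analyzer-specific comments such as the Python examples below (each pragma silences one check at one location; the exact syntax depends on the analyzer):

    # flake8: keep the unused import below from being reported (rule F401).
    import hashlib  # noqa: F401

    # pylint: silence the naming-convention check for this one assignment.
    apiKey = "read-from-config"  # pylint: disable=invalid-name

    def lookup(table, key):
        # mypy: ignore a potential type error on the indexing expression below.
        return table[key]  # type: ignore[index]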
Towards Diverse Program Transformations for Program Simplification
Haibo Wang,
Zezhong Xing,
Chengnian Sun,
Zheng Wang, and
Shin Hwei Tan
(Concordia University, Canada; Southern University of Science and Technology, China; University of Waterloo, Canada; University of Leeds, UK)
By reducing the number of lines of code, program simplification reduces code complexity, improving software maintainability and code comprehension. While several existing techniques can be used for automatic program simplification, there is no consensus on the effectiveness of these approaches. We present the first study on how real-world developers simplify programs in open-source software projects. By analyzing 382 pull requests from 296 projects, we summarize the types of program transformations used, the motivations behind simplifications, and the set of program transformations that have not been covered by existing refactoring types. As a result of our study, we submitted eight bug reports to a widely used refactoring detection tool, RefactoringMiner, seven of which were fixed. Our study also identifies gaps in applying existing approaches to automating program simplification and outlines the criteria for designing automatic program simplification techniques. In light of these observations, we propose SimpT5, a tool that automatically produces simplified programs, i.e., semantically equivalent programs with fewer lines of code. SimpT5 is trained on our collected dataset of 92,485 simplified programs with two heuristics: (1) modified line localization that encodes lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Experimental results show that SimpT5 outperforms prior approaches in automating developer-induced program simplification.
Preprint
Today’s Cat Is Tomorrow’s Dog: Accounting for Time-Based Changes in the Labels of ML Vulnerability Detection Approaches
Ranindya Paramitha,
Yuan Feng, and
Fabio Massacci
(University of Trento, Italy; Vrije Universiteit Amsterdam, Netherlands)
Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested in the field, one can only use the labels available at the time of training and testing (e.g., seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change, and past performance is not necessarily aligned with future performance. Past works only considered slices of the whole history (e.g., DiverseVul) or individual differences between releases (e.g., Jimenez et al., ESEC/FSE 2019).
Such approaches are either too optimistic in training (e.g., the whole history) or too conservative (e.g., consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If a model is actually learning, it should improve its performance over time as more data becomes available and the data becomes more stable, an effect that can be checked with the Mann-Kendall trend test.
We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from the BigVul dataset plus Vuldeepecker's NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, indicating that most models are not learning.
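The Mann-Kendall check mentioned above reduces to a simple rank-based statistic; the sketch below (illustrative numbers, not the paper's data or its full significance test) computes the S statistic over a detector's yearly scores and reads its sign as the direction of the trend.

    def mann_kendall_s(series):
        """Mann-Kendall S statistic: concordant minus discordant pairs."""
        s = 0
        for i in range(len(series)):
            for j in range(i + 1, len(series)):
                diff = series[j] - series[i]
                s += (diff > 0) - (diff < 0)
        return s

    # Hypothetical yearly F1 scores of one detector as labels accumulate.
    f1_by_year = [0.41, 0.38, 0.44, 0.40, 0.39]
    s = mann_kendall_s(f1_by_year)
    # S > 0 suggests an upward trend (the model benefits from more data);
    # S <= 0 suggests no consistent improvement.  A full test would also
    # compare S against its variance to assess statistical significance.
    print("S =", s)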
A New Approach to Evaluating Nullability Inference Tools
Nima Karimipour,
Erfan Arvan,
Martin Kellogg, and
Manu Sridharan
(University of California at Riverside, USA; New Jersey Institute of Technology, USA)
Null-pointer exceptions are a serious problem for Java, and researchers have developed type-based nullness checking tools to prevent them. These tools, however, have a downside: they require developers to write nullability annotations, which is time-consuming and hinders adoption. Researchers have therefore proposed nullability annotation inference tools, whose goal is to (partially) automate the task of annotating a program for nullability. However, prior works rely on differing theories of what makes a set of nullability annotations good, making it challenging to compare their effectiveness. In this work, we identify a systematic bias in some prior experimental evaluations of these tools: the use of “type reconstruction” experiments to see if a tool can recover erased developer-written annotations. We show that developers make semantic code changes while adding annotations to facilitate typechecking, leading such experiments to overestimate the effectiveness of inference tools on never-annotated code. We propose a new definition of the “best” inferred annotations for a program that avoids this bias, based on a systematic exploration of the design space. With this new definition, we perform the first head-to-head comparison of three extant nullability inference tools. Our evaluation shows the complementary strengths of the tools and the remaining weaknesses that could be addressed in future work.
A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems
Yuntianyi Chen,
Yuqi Huai,
Yirui He,
Shilong Li,
Changnam Hong,
Qi Alfred Chen, and
Joshua Garcia
(University of California at Irvine, USA)
As autonomous driving systems (ADSes) become increasingly complex and integral to daily life, the importance of understanding the nature and mitigation of software bugs in these systems has grown correspondingly. Addressing the challenges of software maintenance in autonomous driving systems (e.g., handling real-time system decisions and ensuring safety-critical reliability) is crucial due to the unique combination of real-time decision-making requirements and the high stakes of operational failures in ADSes. The potential of automated tools in this domain is promising, yet there remains a gap in our comprehension of the challenges faced and the strategies employed during manual debugging and repair of such systems. In this paper, we present an empirical study that investigates bug-fix patterns in ADSes, with the aim of improving reliability and safety. We have analyzed the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, studying the symptoms, root causes, and bug-fix patterns of 1,331 bug fixes. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management. Additionally, we find that the frequency distribution of bug-fix patterns varies significantly depending on their nature and types and that certain categories of bugs are recurrent and more challenging to eliminate. Based on our findings, we propose a hierarchy of ADS bugs and two taxonomies of 15 syntactic bug-fix patterns and 27 semantic bug-fix patterns that offer guidance for bug identification and resolution. We also contribute a benchmark of 1,331 ADS bug-fix instances.
Preprint
Artifacts Available
An Empirical Study on Release-Wise Refactoring Patterns
Shayan Noei,
Heng Li, and
Ying Zou
(Queen's University, Canada; Polytechnique Montréal, Canada)
Refactoring is a technical approach to increase the internal quality of software without altering its external functionalities. Developers often invest significant effort in refactoring. With the increased adoption of continuous integration and deployment (CI/CD), refactoring activities may vary within and across different releases and be influenced by various release goals. For example, developers may consistently allocate refactoring activities throughout a release, or prioritize new features early on in a release and only pick up refactoring late in a release. Different approaches to allocating refactoring tasks may have different implications for code quality. However, there is a lack of existing research on how practitioners allocate their refactoring activities within a release and their impact on code quality. Therefore, we first empirically study the frequent release-wise refactoring patterns in 207 open-source Java projects and their characteristics. Then, we analyze how these patterns and their transitions affect code quality. We identify four major release-wise refactoring patterns: early active, late active, steady active, and steady inactive. We find that adopting the late active pattern—characterized by gradually increasing refactoring activities as the release approaches—leads to the best code quality. We observe that as projects mature, refactoring becomes more active, reflected in the increasing use of the steady active release-wise refactoring pattern and the decreasing utilization of the steady inactive release-wise refactoring pattern. While the steady active pattern shows improvement in quality-related code metrics (e.g., cohesion), it can also lead to more architectural problems. Additionally, we observe that developers tend to adhere to a single refactoring pattern rather than switching between different patterns. The late active pattern, in particular, can be a safe release-wise refactoring pattern that is used repeatedly. Our results can help practitioners understand existing release-wise refactoring patterns and their effects on code quality, enabling them to utilize the most effective pattern to enhance release quality.
Hallucination Detection in Large Language Models with Metamorphic Relations
Borui Yang,
Md Afif Al Mamun,
Jie M. Zhang, and
Gias Uddin
(King's College London, UK; University of Calgary, Canada; York University, Canada)
Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. Other methods depend on output probabilities, which are often inaccessible for closed-source LLMs like the GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relations and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs.
MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and F1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 to 0.113 (for precision), 0.143 to 0.430 (for recall), and 0.154 to 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all categories of questions.
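The hypothesis can be sketched as follows (placeholder prompt mutations and a toy agreement check standing in for MetaQA's actual design): the same question is asked in several meaning-preserving forms, and disagreement among the answers is taken as a hallucination signal.

    def mutate(question):
        """Meaning-preserving rewrites; the true answer must stay the same."""
        return [
            f"Please answer concisely: {question}",
            f"{question} Reply with only the fact itself.",
        ]

    def agree(a, b):
        """Crude agreement check; MetaQA scores agreement more carefully."""
        return a.strip().lower() == b.strip().lower()

    def looks_like_hallucination(ask, question):
        """Flag a response whose metamorphic variants disagree with it."""
        base = ask(question)
        return not all(agree(base, ask(q)) for q in mutate(question))

    # Stand-in for a real LLM call: its answer drifts once the prompt changes.
    fake_llm = lambda p: "James Joyce" if p == "Who wrote Dubliners?" else "Samuel Beckett"
    print(looks_like_hallucination(fake_llm, "Who wrote Dubliners?"))  # True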
One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE)
Xu Yang,
Shaowei Wang,
Jiayuan Zhou, and
Wenhan Zhu
(University of Manitoba, Canada; Huawei, Canada)
Deep Learning-based Vulnerability Detection (DLVD) techniques have garnered significant interest due to their ability to automatically learn vulnerability patterns from previously compromised code. Despite the notable accuracy demonstrated by pioneering tools, the broader application of DLVD methods in real-world scenarios is hindered by significant challenges. A primary issue is the “one-for-all” design, where a single model is trained to handle all types of vulnerabilities. This approach fails to capture the patterns of different vulnerability types, resulting in suboptimal performance, particularly for less common vulnerabilities that are often underrepresented in training datasets. To address these challenges, we propose MoEVD, which adopts the Mixture-of-Experts (MoE) framework for vulnerability detection. MoEVD decomposes vulnerability detection into two tasks: CWE type classification and CWE-specific vulnerability detection. By splitting the task, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model. Our results show that MoEVD achieves an F1-score of 0.44, significantly outperforming all studied state-of-the-art (SOTA) baselines by at least 12.8%. MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%. Notably, MoEVD does not sacrifice performance on long-tailed CWE types; instead, its MoE design enhances performance (F1-score) on these types by at least 7.3%, effectively addressing long-tailed issues.
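The two-stage decomposition can be pictured with the toy routing sketch below (keyword stubs stand in for the trained router and experts; not the MoEVD models): a router first predicts the CWE family of a code snippet, and the matching CWE-specific expert then decides whether it is vulnerable.

    # Toy Mixture-of-Experts routing for vulnerability detection (illustrative only).
    def route_cwe(code):
        """Stage 1: predict the CWE family (placeholder keyword heuristic)."""
        if "strcpy" in code or "memcpy" in code:
            return "CWE-787"   # out-of-bounds write
        if "SELECT" in code and "+" in code:
            return "CWE-89"    # SQL injection
        return "CWE-other"

    EXPERTS = {
        # Stage 2: one detector per CWE family; here just keyword stubs.
        "CWE-787": lambda code: "strcpy(" in code,
        "CWE-89": lambda code: "executeQuery(" in code,
        "CWE-other": lambda code: False,
    }

    def is_vulnerable(code):
        cwe = route_cwe(code)
        return cwe, EXPERTS[cwe](code)

    print(is_vulnerable("strcpy(dst, src);"))                                    # ('CWE-787', True)
    print(is_vulnerable('stmt.executeQuery("SELECT * FROM t WHERE id=" + uid)')) # ('CWE-89', True)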
LlamaRestTest: Effective REST API Testing with Small Language Models
Myeongsoo Kim,
Saurabh Sinha, and
Alessandro Orso
(Georgia Institute of Technology, USA; IBM Research, USA)
Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on these specifications. While Large Language Models (LLMs) have shown promising results in various testing domains, their application to REST API testing remains unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover parameter dependencies during the testing process by incorporating server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as FDIC and Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller LLMs to outperform much larger models in detecting actionable rules and generating inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, striking a balance between effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools utilize specifications enhanced by RESTGPT. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance.
Code Change Intention, Development Artifact, and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM
Xu Yang,
Wenhan Zhu,
Michael Pacheco,
Jiayuan Zhou,
Shaowei Wang,
Xing Hu, and
Kui Liu
(University of Manitoba, Canada; Huawei, Canada; Zhejiang University, China; Huawei Software Engineering Application Technology Lab, China)
Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and historical vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. Hence, to address those limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1% to 145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification.
QSF: Multi-objective Optimization Based Efficient Solving for Floating-Point Constraints
Xu Yang,
Zhenbang Chen,
Wei Dong, and
Ji Wang
(National University of Defense Technology, China)
Floating-point constraint solving is challenging due to the complex representation and non-linear computations. Search-based constraint solving provides an effective method for solving floating-point constraints. In this paper, we propose QSF to improve the efficiency of search-based solving for floating-point constraints. The key idea of QSF is to model the floating-point constraint solving problem as a multi-objective optimization problem. Specifically, QSF considers both the number of unsatisfied constraints and the sum of the violation degrees of unsatisfied constraints as the objectives for search-based optimization. In addition, we propose a new evolutionary algorithm in which the mutation operators are specially designed for floating-point numbers, aiming to solve the multi-objective problem more efficiently. We have implemented QSF and conducted extensive experiments on both the SMT-COMP benchmark and a benchmark from real-world floating-point programs. The results demonstrate that, compared to SOTA floating-point solvers, QSF achieved an average speedup of 15.72X under a 60-second timeout and an impressive 87.48X under a 600-second timeout on the first benchmark. Similarly, on the second benchmark, QSF delivered average speedups of 22.44X and 29.23X, respectively, under the two timeout configurations. Furthermore, QSF has also enhanced the performance of symbolic execution for floating-point programs.
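The two objectives can be illustrated with a small fitness sketch (hypothetical constraint encoding, not QSF's solver or its mutation operators): each constraint reports a violation degree, and a candidate assignment is scored by the number of unsatisfied constraints together with their summed violation, both to be minimized by the search.

    # Illustrative two-objective fitness for search-based floating-point solving.
    # Each constraint returns its violation degree: 0.0 when satisfied, otherwise
    # a positive measure of how far the assignment is from satisfying it.

    def fitness(assignment, constraints):
        violations = [c(assignment) for c in constraints]
        unsat = sum(1 for v in violations if v > 0.0)
        return unsat, sum(violations)          # minimize both objectives

    constraints = [
        lambda a: max(0.0, a["x"] * a["x"] - a["y"]),   # encodes x*x <= y
        lambda a: max(0.0, 1.0 - a["x"]),               # encodes x >= 1.0
    ]

    print(fitness({"x": 0.5, "y": 0.1}, constraints))   # (2, 0.65): both violated
    print(fitness({"x": 1.2, "y": 2.0}, constraints))   # (0, 0.0): a solution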
Adaptive Random Testing with Q-grams: The Illusion Comes True
Matteo Biagiola,
Robert Feldt, and
Paolo Tonella
(USI Lugano, Switzerland; Chalmers University of Technology, Sweden)
Adaptive Random Testing (ART) has faced criticism, particularly for its computational inefficiency, as highlighted by Arcuri and Briand. Their analysis clarified how ART requires a quadratic number of distance computations as the number of test executions increases, which limits its scalability in scenarios requiring extensive testing to uncover faults. Simulation results support this, showing that the computational overhead of these distance calculations often outweighs ART’s benefits. While various ART variants have attempted to reduce these costs, they frequently do so at the expense of fault detection, lack complexity guarantees, or are restricted to specific input types, such as numerical or discrete data.
In this paper, we introduce a novel framework for adaptive random testing that replaces pairwise distance computations with a compact aggregation of past executions, such as counting the q-grams observed in previous runs. Test case selection then leverages this aggregated data to measure diversity (e.g., entropy of q-grams), allowing us to reduce the computational complexity from quadratic to linear. Experiments with a benchmark of six web applications show that ART with q-grams covers, on average, 4× more unique targets than random testing, and 3.5× more than ART using traditional distance-based methods.
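The aggregation idea can be sketched roughly as follows (a simplified illustration under assumed trace and candidate representations, not the authors' implementation): q-grams observed in past executions are kept in a single counter, and the next test case is the candidate whose trace yields the highest q-gram entropy once merged into that counter.

    import math
    from collections import Counter

    def qgrams(trace, q=2):
        """Consecutive q-grams of an execution trace (e.g., a branch-id sequence)."""
        return [tuple(trace[i:i + q]) for i in range(len(trace) - q + 1)]

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    history = Counter(qgrams(["a", "b", "a", "b", "c"]))   # aggregate of past runs
    candidates = {
        "test1": ["a", "b", "a", "b"],   # repeats already-seen behavior
        "test2": ["c", "d", "e", "f"],   # mostly new behavior
    }
    # Pick the candidate whose trace adds the most q-gram diversity.
    best = max(candidates,
               key=lambda n: entropy(history + Counter(qgrams(candidates[n]))))
    print("selected:", best)   # test2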
Understanding and Characterizing Mock Assertions in Unit Tests
Hengcheng Zhu,
Valerio Terragni,
Lili Wei,
Shing-Chi Cheung,
Jiarong Wu, and
Yepang Liu
(Hong Kong University of Science and Technology, China; University of Auckland, New Zealand; McGill University, Canada; Southern University of Science and Technology, China)
Mock assertions provide developers with a powerful means to validate program behaviors that are unobservable to test assertions. Despite their significance, they are rarely considered by automated test generation techniques. Effective generation of mock assertions requires understanding how they are used in practice. Although previous studies highlighted the importance of mock assertions, none provide insight into their usage. To bridge this gap, we conducted the first empirical study on mock assertions, examining their adoption, the characteristics of the verified method invocations, and their effectiveness in fault detection. Our analysis of 4,652 test cases from 11 popular Java projects reveals that mock assertions are mostly applied to validating specific kinds of method calls, such as those interacting with external resources and those reflecting whether a certain code path was traversed in the system under test. Additionally, we find that mock assertions complement traditional test assertions by ensuring the desired side effects have been produced, validating control flow logic, and checking internal computation results. Our findings contribute to a better understanding of mock assertion usage and provide a foundation for future related research, such as automated test generation that supports mock assertions.
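Although the studied projects are written in Java (typically using Mockito), the idea is easy to show with Python's unittest.mock (an illustrative example, not taken from the studied projects): the mock assertion is the only way to validate a side effect that the method's return value and state never expose.

    import unittest
    from unittest import mock

    def sync_user(user, gateway):
        """Pushes a dirty user record to a remote service; returns nothing."""
        if user.get("dirty"):
            gateway.push(user["id"], user["name"])

    class SyncUserTest(unittest.TestCase):
        def test_dirty_user_is_pushed(self):
            gateway = mock.Mock()
            sync_user({"id": 7, "name": "ada", "dirty": True}, gateway)
            # Mock assertion: the interaction with the external resource is the
            # only observable effect, so it is verified on the mock itself.
            gateway.push.assert_called_once_with(7, "ada")

    if __name__ == "__main__":
        unittest.main()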
Artifacts Available
Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning
Yuqing Wang,
Mika V. Mäntylä,
Serge Demeyer,
Mutlu Beyazıt,
Joanna Kisaakye, and
Jesse Nyyssölä
(University of Helsinki, Finland; University of Oulu, Finland; University of Antwerp, Belgium)
Microservice-based systems (MSS) may fail with various fault types, due to their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human effort from practitioners is still required for further root cause analysis to diagnose specific fault types and analyze failure reasons for detected abnormal traces, particularly when abnormal traces do not stem directly from specific services. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for an MSS. TraFaultDia is trained on several abnormal trace classification tasks with a few labeled instances from an MSS using a meta-learning approach. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. Our TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique, respectively, when provided with 10 labeled instances for each fault category per task in each system. In the cross-system context, when TraFaultDia is applied to an MSS different from the one it is trained on, it achieves an average accuracy of 92.19% and 84.77% for the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, also with 10 labeled instances provided for each fault category per task in each system.
Detecting and Handling WoT Violations by Learning Physical Interactions from Device Logs
Bingkun Sun,
Shiqi Sun,
Jialin Ren,
Mingming Hu,
Kun Hu,
Liwei Shen, and
Xin Peng
(Fudan University, China; Northwestern Polytechnique University, China)
The Web of Things (WoT) system standardizes the integration of ubiquitous IoT devices in physical environments, enabling various software applications to automatically sense and regulate the physical environment. While providing convenience, the complex interactions among software applications and the physical environment make the WoT system vulnerable to violations caused by improper actuator operations, which may cause undesirable or even harmful results, posing serious risks to user safety and security. In response to this critical concern, many previous efforts have been made. However, existing works primarily focus on analyzing software application behaviors, with insufficient consideration of the physical interactions, multi-source violations, and environmental dynamics in such ubiquitous software systems. As a result, they fail to comprehensively detect the impact of actuator operations on the dynamic environment, thus limiting their effectiveness.
To address these limitations, we propose SysGuard, a violation detection and handling approach. SysGuard employs a dynamic probabilistic graphical model (DPGM) to model the physical interactions as a physical interaction graph (PIG). In the offline phase, SysGuard takes device description models and historical device logs as input to capture physical interactions by learning the PIG. In this process, a large language model (LLM) based causal analysis method is further introduced to filter out the device dependencies unrelated to physical interactions by analyzing the device interaction scenarios recorded in device logs. In the online phase, SysGuard processes user-customized violation rules and monitors runtime device logs to predict violation states and generate handling policies by inferring over the PIG. Evaluation on two real-world WoT systems shows that SysGuard significantly outperforms existing state-of-the-art works, achieving high performance in both violation detection and handling. It also confirms the runtime efficiency and scalability of SysGuard. An ablation experiment on our constructed dataset demonstrates that the LLM-based causal analysis significantly improves the performance of SysGuard, with accuracy increasing in both violation detection and handling.
The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub
Jaydeb Sarker,
Asif Kamal Turzo, and
Amiangshu Bosu
(University of Nebraska at Omaha, USA; Wayne State University, USA)
Toxicity on GitHub can severely impact Open Source Software (OSS) development communities. To mitigate such behavior, a better understanding of its nature and how various measurable characteristics of project contexts and participants are associated with its prevalence is necessary. To achieve this goal, we conducted a large-scale mixed-method empirical study of 2,828 GitHub-based OSS projects randomly selected based on a stratified sampling strategy. Using ToxiCR, an SE domain-specific toxicity detector, we automatically classified each comment as toxic or non-toxic. Additionally, we manually analyzed a random sample of 600 comments to validate ToxiCR's performance and gain insights into the nature of toxicity within our dataset. The results of our study suggest that profanity is the most frequent toxicity on GitHub, followed by trolling and insults. While a project's popularity is positively associated with the prevalence of toxicity, its issue resolution rate has the opposite association. Corporate-sponsored projects are less toxic, but gaming projects are seven times more toxic than non-gaming ones. OSS contributors who have authored toxic comments in the past are significantly more likely to repeat such behavior. Moreover, such individuals are more likely to become targets of toxic texts.
Article Search
Artifacts Available
DuoReduce: Bug Isolation for Multi-layer Extensible Compilation
Jiyuan Wang,
Yuxin Qiu,
Ben Limpanukorn,
Hong Jin Kang,
Qian Zhang, and
Miryung Kim
(University of California at Los Angeles, USA; University of California at Riverside, USA)
In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Such examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serves as the foundation for the current landscape of machine learning and hardware accelerators.
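For readers unfamiliar with delta debugging, the sketch below shows a generic ddmin-style reduction over a list of compilation passes with a stubbed bug oracle. It illustrates only the pass-reduction dimension and is not DuoReduce's ordering-aware, cross-dependent algorithm; the pass names and the oracle are placeholders.

```python
# Generic ddmin-style reducer over a list of compilation passes (illustrative).

def still_fails(passes):
    # Placeholder oracle: in practice this would run the MLIR pipeline with
    # `passes` on the IR file and check whether the bug symptom reproduces.
    return {"convert-scf-to-cf", "convert-func-to-llvm"} <= set(passes)

def ddmin(passes, oracle):
    n = 2
    while len(passes) >= 2:
        chunk = max(1, len(passes) // n)
        subsets = [passes[i:i + chunk] for i in range(0, len(passes), chunk)]
        reduced = False
        for subset in subsets:
            complement = [p for p in passes if p not in subset]
            if oracle(complement):            # bug still reproduces without `subset`
                passes, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(passes):
                break
            n = min(len(passes), n * 2)
    return passes

pipeline = ["canonicalize", "cse", "convert-scf-to-cf",
            "convert-func-to-llvm", "symbol-dce", "inline"]
print(ddmin(pipeline, still_fails))   # -> the two passes the fake oracle requires
```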
Article Search
ROSCallBaX: Statically Detecting Inconsistencies in Callback Function Setup of Robotic Systems
Sayali Kate,
Yifei Gao,
Shiwei Feng, and
Xiangyu Zhang
(Purdue University, USA)
The increasingly popular Robot Operating System (ROS) framework allows building robotic systems by integrating newly developed and/or reused modules, where the modules can use different versions of the framework (e.g., ROS1 or ROS2) and programming languages (e.g., C++ or Python). The majority of such robotic systems' work happens in callbacks. The framework provides various elements for initializing callbacks and for setting up the execution of callbacks. It is the responsibility of developers to compose callbacks and their execution setup elements, and developers' incomplete knowledge of the semantics of these elements in various versions of the framework can lead to inconsistencies in the setup of callback execution. Some of these inconsistencies do not throw errors at runtime, making their detection difficult for developers. We propose a static approach to detecting such inconsistencies by extracting a static view of the composition of a robotic system's callbacks and their execution setup, and then checking it against composition conventions based on the elements' semantics. We evaluate our ROSCallBaX prototype on a dataset created from posts on developer forums and publicly available ROS projects. The evaluation results show that our approach can detect real inconsistencies.
Article Search
Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models
Yanlin Wang,
Tianyue Jiang,
Mingwei Liu,
Jiachi Chen,
Mingzhi Mao,
Xilin Liu,
Yuchi Ma, and
Zibin Zheng
(Sun Yat-sen University, China; Huawei Cloud Computing Technologies, China)
Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream LLMs and the code written by human developers, and derive a taxonomy of coding style inconsistencies. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers exhibit differences in coding style. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem.
Article Search
Automated Unit Test Refactoring
Yi Gao,
Xing Hu,
Xiaohu Yang, and
Xin Xia
(Zhejiang University, China)
Test smells arise from poor design practices and insufficient domain knowledge, which can lower the quality of test code and make it harder to maintain and update. Manually refactoring test smells is time-consuming and error-prone, highlighting the necessity for automated approaches. Current rule-based refactoring methods often struggle in scenarios not covered by predefined rules and lack the flexibility needed to handle diverse cases effectively. In this paper, we propose a novel approach called UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects. UTRefactor extracts relevant context from test code and leverages an external knowledge base that includes test smell definitions, descriptions, and DSL-based refactoring rules. By simulating the manual refactoring process through a chain-of-thought approach, UTRefactor guides the LLM to eliminate test smells in a step-by-step process, ensuring both accuracy and consistency throughout the refactoring. Additionally, we implement a checkpoint mechanism to facilitate comprehensive refactoring, particularly when multiple smells are present. We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction. UTRefactor outperforms direct LLM-based refactoring methods by 61.82% in smell elimination and significantly surpasses the performance of a rule-based test smell refactoring tool. Our results demonstrate the effectiveness of UTRefactor in enhancing test code quality while minimizing manual involvement.
Article Search
HornBro: Homotopy-Like Method for Automated Quantum Program Repair
Siwei Tan,
Liqiang Lu,
Debin Xiang,
Tianyao Chu,
Congliang Lang,
Jintao Chen,
Xing Hu, and
Jianwei Yin
(Zhejiang University, China)
Quantum programs provide exponential speedups compared to classical programs in certain areas, but they also inevitably encounter logical faults. Automatically repairing quantum programs is much more challenging than repairing classical programs due to the non-replicability of data, the vast search space of program inputs, and the new programming paradigm. Existing works based on semantic-based or learning-based program repair techniques are fundamentally limited in repairing efficiency and effectiveness. In this work, we propose HornBro, an efficient framework for automated quantum program repair. The key insight of HornBro lies in the homotopy-like method, which iteratively switches between the classical and quantum parts. This approach allows the repair tasks to be efficiently offloaded to the most suitable platforms, enabling a progressive convergence toward the correct program. We start by designing an implication assertion pragma to enable rigorous specifications of quantum program behavior, which helps to generate a quantum test suite automatically. This suite leverages the orthonormal bases of quantum programs to accommodate different encoding schemes. Given a fixed number of test cases, it allows the maximum input coverage of potential counter-example candidates. Then, we develop a Clifford approximation method with an SMT-based search to transform the patch localization program into a symbolic reasoning problem. Finally, we offload the computationally intensive repair of gate parameters to quantum hardware by leveraging the differentiability of quantum gates. Experiments suggest that HornBro increases the repair success rate by more than 62.5% compared to the existing repair techniques, supporting more types of quantum bugs. It also achieves 35.7× speedup in the repair and 99.9% gate reduction of the patch.
Article Search
Artifacts Available
Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead
Yanqi Su,
Zhenchang Xing,
Chong Wang,
Chunyang Chen,
Sherry (Xiwei) Xu,
Qinghua Lu, and
Liming Zhu
(Australian National University, Australia; CSIRO's Data61, Australia; Nanyang Technological University, Singapore; TU Munich, Germany; CSIRO, Australia; UNSW, Australia)
Exploratory testing (ET) harnesses testers' knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end-user's perspective. Although ET has proven effective in system-level testing of interactive systems, the need for manual execution has hindered large-scale adoption. In this work, we explore the feasibility, challenges, and road ahead of automated scenario-based ET (a.k.a. soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi-agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi-modal agents, Planner, Player, and Detector, that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially regarding under-explored scenario boundaries and incorrectly identified bugs. Based on these observations, we envision the road ahead for automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human-AI co-learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire future research.
Article Search
LLM-Based Method Name Suggestion with Automatically Generated Context-Rich Prompts
Waseem Akram,
Yanjie Jiang,
Yuxia Zhang,
Haris Ali Khan, and
Hui Liu
(Beijing Institute of Technology, China; Peking University, China)
Accurate method naming is crucial for code readability and maintainability. However, manually creating concise and meaningful names remains a significant challenge. To this end, in this paper, we propose an approach based on Large Language Models (LLMs) to suggest method names according to function descriptions. The key of the approach is ContextCraft, an automated algorithm for generating context-rich prompts for an LLM that suggests the expected method names according to the prompts. For a given query (functional description), it retrieves a few best examples whose functional descriptions have the greatest similarity with the query. From the examples, it identifies tokens that are likely to appear in the final method name as well as their likely positions, picks pivot words that are semantically related to tokens in the corresponding method names, and specifies the evaluation results of the LLM on the selected examples. All such outputs (tokens with probabilities and position information, pivot words accompanied by associated name tokens and similarity scores, and evaluation results), together with the query and the selected examples, are then filled into a predefined prompt template, resulting in a context-rich prompt. This context-rich prompt reduces the randomness of LLMs by focusing the LLM’s attention on relevant contexts, constraining the solution space, and anchoring results to meaningful semantic relationships. Consequently, the LLM leverages this prompt to generate the expected method name, producing a more accurate and relevant suggestion. We evaluated the proposed approach with 43k real-world Java and Python methods accompanied by functional descriptions. Our evaluation results suggest that it significantly outperforms the state-of-the-art approach RNN-att-Copy, improving the chance of exact match by 52% and decreasing the edit distance between generated and expected method names by 32%. Our evaluation results also suggest that the proposed approach works well for various LLMs, including ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini-1.5, and Llama-3.
Article Search
Demystifying LLM-Based Software Engineering Agents
Chunqiu Steven Xia,
Yinlin Deng,
Soren Dunn, and
Lingming Zhang
(University of Illinois at Urbana-Champaign, USA)
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless
Article Search
Standing on the Shoulders of Giants: Bug-Aware Automated GUI Testing via Retrieval Augmentation
Mengzhuo Chen,
Zhe Liu,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Jun Hu, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; TU Munich, Germany)
In software development, similar apps often encounter similar bugs due to shared functionalities and implementation methods. However, current automated GUI testing methods mainly focus on generating test scripts to cover more pages by analyzing the internal structure of the app, without targeted exploration of paths that may trigger bugs, resulting in low efficiency in bug discovery. Considering that a large number of bug reports on open source platforms can provide external knowledge for testing, this paper proposes BugHunter, a novel bug-aware automated GUI testing approach that generates exploration paths guided by bug reports from similar apps, utilizing a combination of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Instead of focusing solely on coverage, BugHunter dynamically adapts the testing process to target bug paths, thereby increasing bug detection efficiency. BugHunter first builds a high-quality bug knowledge base from historical bug reports. Then it retrieves relevant reports from this large bug knowledge base using a two-stage retrieval process, and generates test paths based on similar apps’ bug reports. BugHunter also introduces a local and global path-planning mechanism to handle differences in functionality and UI design across apps, and the ambiguous behavior or missing steps in the online bug reports. We evaluate BugHunter on 121 bugs across 71 apps and compare its performance against 16 state-of-the-art baselines. BugHunter achieves 60% improvement in bug detection over the best baseline, with comparable or higher coverage against the baselines. Furthermore, BugHunter successfully detects 49 new crash bugs in real-world apps from Google Play, with 33 bugs fixed, 9 confirmed, and 7 pending feedback.
Article Search
PDCAT: Preference-Driven Compiler Auto-tuning
Mingxuan Zhu,
Zeyu Sun, and
Dan Hao
(Peking University, China; Institute of Software at Chinese Academy of Sciences, China)
Compilers are crucial software tools that usually convert programs in high-level languages into machine code. A compiler provides hundreds of optimizations to improve the performance of the compiled code, which are controlled by enabled or disabled optimization flags. However, the vast number of combinations of these flags makes it extremely challenging to select the desired settings for compiler optimization flags (i.e., an optimization sequence) for a given target program. In the literature, many auto-tuning techniques have been proposed to select a desired optimization sequence via different strategies across the entire optimization space. However, due to the huge optimization space, these techniques commonly suffer from the widely recognized efficiency problem. To address this problem, in this paper, we propose a preference-driven selection approach PDCAT, which reduces the search space of optimization sequences through three components. In particular, PDCAT first identifies combined optimizations based on compiler documentation to exclude optimization sequences violating the combined constraints, and then categorizes the optimizations into a common optimization set (whose optimization flags are fixed) and an exploration set containing the remaining optimizations. Finally, within the search process, PDCAT assigns distinct enable probabilities to the explored optimization flags and ultimately selects a desired optimization sequence. The former two components reduce the search space by removing invalid optimization sequences and fixing some optimization flags, whereas the latter performs a biased search in the search space. To evaluate the performance of the proposed approach PDCAT, we conducted an extensive experimental study on the latest version of the GCC compiler with two widely used benchmarks, cBench and PolyBench. The results show that PDCAT significantly outperforms the four compared techniques, including the state-of-the-art technique SRTuner. Moreover, each component of PDCAT not only contributes to its performance, but also improves the acceleration performance of the compared techniques.
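A minimal sketch of the biased search described above: flags in a fixed common set stay enabled, while each explored flag is toggled according to a per-flag enable probability. The GCC flag names and probabilities below are illustrative placeholders, not PDCAT's learned values.

```python
import random

# Illustrative preference-driven sampling of optimization sequences.
COMMON = ["-ftree-dce", "-fguess-branch-probability"]   # fixed, always enabled
EXPLORE = {                       # per-flag enable probabilities (placeholders)
    "-funroll-loops": 0.7,
    "-fipa-cp-clone": 0.4,
    "-ftree-vectorize": 0.8,
    "-fpeel-loops": 0.3,
}

def sample_sequence():
    flags = list(COMMON)
    for flag, p in EXPLORE.items():
        if random.random() < p:
            flags.append(flag)
        else:
            flags.append(flag.replace("-f", "-fno-", 1))   # explicitly disable
    return flags

for _ in range(3):
    print(" ".join(sample_sequence()))
```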
Article Search
Towards Understanding Docker Build Faults in Practice: Symptoms, Root Causes, and Fix Patterns
Yiwen Wu,
Yang Zhang,
Tao Wang,
Bo Ding, and
Huaimin Wang
(National University of Defense Technology, China; State Key Laboratory of Complex and Critical Software Environment, China)
Docker building is a critical component of containerization in modern software development, automating the process of packaging and converting sources into container images. It is not uncommon to find that Docker build faults (DBFs) occur frequently across Docker-based projects, inducing non-negligible costs in development activities. DBF resolution is a challenging problem and previous studies have demonstrated that developers spend non-trivial time in resolving encountered build faults. However, the characteristics of DBFs are still under-investigated, hindering practical solutions to build management in the Docker community. In this paper, to bridge this gap, we present a comprehensive study for understanding the real-world DBFs in practice. We collect and analyze a DBF dataset of 255 issues and 219 posts from GitHub, Stack Overflow, and Docker Forum. We investigate and construct characteristic taxonomies for the DBFs, including 15 symptoms, 23 root causes, and 35 fix patterns. Moreover, we study the fault distributions of symptoms and root causes, in terms of the different build types, i.e., Dockerfile builds and Docker-compose builds. Based on the results, we provide actionable implications and develop a knowledge-based application, which can potentially facilitate research and assist developers in improving Docker build management.
Article Search
Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”
Francesco Sovrano,
Adam Bauer, and
Alberto Bacchelli
(ETH Zurich, Switzerland; University of Zurich, Switzerland)
Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the lost-in-the-end effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
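A minimal sketch of the mitigation the findings point to: rather than submitting an entire file, split it into windows no larger than an empirically determined optimal input size before querying an LLM. The token budget, the four-characters-per-token heuristic, and the scanning stub are assumptions for illustration, not the authors' procedure.

```python
# Illustrative chunking of a source file to a fixed token budget before
# LLM-based vulnerability scanning; numbers and stubs are placeholders.

OPTIMAL_TOKENS = 1500                       # hypothetical per-query budget

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)           # rough 4-chars-per-token heuristic

def split_into_windows(source: str, budget: int = OPTIMAL_TOKENS):
    window, used = [], 0
    for line in source.splitlines():
        cost = approx_tokens(line)
        if window and used + cost > budget:
            yield "\n".join(window)
            window, used = [], 0
        window.append(line)
        used += cost
    if window:
        yield "\n".join(window)

def scan_file(source: str):
    findings = []
    for i, chunk in enumerate(split_into_windows(source)):
        # An ask_llm(chunk) call would prompt the model for XSS, SQL injection,
        # and path traversal issues in this window only.
        findings.append((i, len(chunk.splitlines())))
    return findings

windows = scan_file("query = build_sql(user_input)\n" * 2000)
print(f"{len(windows)} windows, first window covers {windows[0][1]} lines")
```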
Article Search
Artifacts Available
Empirically Evaluating the Impact of Object-Centric Breakpoints on the Debugging of Object-Oriented Programs
Valentin Bourcier,
Pooja Rani,
Maximilian Ignacio Willembrinck Santander,
Alberto Bacchelli, and
Steven Costiou
(University of Lille - Inria - CNRS - Centrale Lille - UMR 9189 CRIStAL, France; University of Zurich, Switzerland)
Debugging consists of understanding the behavior of a program to identify and correct its defects. Breakpoints are the most commonly used debugging tool and aim to facilitate the debugging process by allowing developers to interrupt a program’s execution at a source code location of their choice and inspect the state of the program. Researchers suggest that in systems developed using object-oriented programming (OOP), traditional breakpoints may not be an effective method for debugging. In OOP, developers create code in classes, which at runtime are instantiated as objects—entities with their own state and behavior that can interact with one another. Traditional breakpoints are set within the class code, halting execution for every object that shares that class’s code. This leads to unnecessary interruptions for developers who are focused on monitoring the behavior of a specific object. As an answer to this challenge, researchers proposed object-centric debugging, an approach based on debugging tools that focus on objects rather than classes. In particular, using object-centric breakpoints, developers can select specific objects (rather than classes) for which the execution must be interrupted. Even though it seems reasonable that this approach may ease the debugging process by reducing the time and actions needed for debugging objects, no research has yet verified its actual impact. To investigate the impact of object-centric breakpoints on the debugging process, we devised and conducted a controlled experiment with 81 developers who spent an average of 1 hour and 30 minutes each on the study. The experiment required participants to complete two debugging tasks using debugging tools with vs. without object-centric breakpoints. We found no significant effect from the use of object-centric breakpoints on the number of actions required to debug or the effectiveness in understanding or fixing the bug. However, for one of the two tasks, we measured a statistically significant reduction in debugging time for participants who used object-centric breakpoints, while for the other task, there was a statistically significant increase. Our analysis suggests that the impact of object-centric breakpoints varies depending on the context and the specific nature of the bug being addressed. In particular, our analysis indicates that object-centric breakpoints can speed up the process of locating the root cause of a bug when the bug can be replicated without needing to restart the program. We discuss the implications of these findings for debugging practices and future research.
Article Search
Artifacts Available
ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
Lars Gröninger,
Beatriz Souza, and
Michael Pradel
(University of Stuttgart, Germany)
Code changes are an integral part of the software development process. Many code changes are meant to improve the code without changing its functional behavior, e.g., refactorings and performance improvements. Unfortunately, validating whether a code change preserves the behavior is non-trivial, particularly when the code change is performed deep inside a complex project. This paper presents ChangeGuard, an approach that uses learning-guided execution to compare the runtime behavior of a function before and after a change. The approach is enabled by the novel concept of pairwise learning-guided execution and by a set of techniques that improve the robustness and coverage of the state-of-the-art learning-guided execution technique. Our evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from popular Python open-source projects and to three datasets of code changes obtained by applying automated code transformations. Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that it detects unexpected behavioral changes introduced by automatic code refactoring tools. In contrast, the existing regression tests of the analyzed projects miss the vast majority of semantics-changing code changes, with a recall of only 7.6%. We envision our approach being useful for detecting unintended behavioral changes early in the development process and for improving the quality of automated code transformations.
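A toy illustration of pairwise behavioral comparison: execute the old and the new version of a function on the same inputs and report divergent outcomes. ChangeGuard does this at scale with learning-guided execution on real projects; the functions and inputs below are invented.

```python
# Toy pairwise comparison of a function before and after a "refactoring".

def mean_old(xs):
    return sum(xs) / len(xs)

def mean_new(xs):                         # changed version with a subtle difference
    return sum(xs) // len(xs)             # integer division alters behavior

def compare(old, new, inputs):
    divergences = []
    for x in inputs:
        outcomes = []
        for f in (old, new):
            try:
                outcomes.append(("returned", f(x)))
            except Exception as e:
                outcomes.append(("raised", type(e).__name__))
        if outcomes[0] != outcomes[1]:
            divergences.append((x, outcomes[0], outcomes[1]))
    return divergences

print(compare(mean_old, mean_new, [[1, 2], [3, 4, 5], []]))
# Only [1, 2] diverges (1.5 vs 1); [] raises ZeroDivisionError in both versions.
```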
Article Search
Integrating Large Language Models and Reinforcement Learning for Non-linear Reasoning
Yoav Alon and
Cristina David
(University of Bristol, UK)
Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM’s space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM’s training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, denoting the binary classification, and the intermediate reasoning steps. Our approach compares positively against CoT and ToT.
Article Search
How Do Programming Students Use Generative AI?
Christian Rahe and
Walid Maalej
(University of Hamburg, Germany)
Programming students have a widespread access to powerful Generative AI tools like ChatGPT.
While this can help understand the learning material and assist with exercises, educators are voicing more and more concerns about an overreliance on generated outputs and lack of critical thinking skills.
It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior.
To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code authoring exercise.
The task was not directly solvable by ChatGPT and required code comprehension and reasoning.
While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution.
We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions.
Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix.
Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution.
Our findings indicate that concerns about a potential decrease in programmers' agency and productivity with Generative AI are justified.
We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on Generative AI.
We also discuss potential modifications to our study design for large-scale replications.
Article Search
LLMDroid: Enhancing Automated Mobile App GUI Testing Coverage with Large Language Model Guidance
Chenxu Wang,
Tianming Liu,
Yanjie Zhao,
Minghui Yang, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Monash University, Australia; OPPO, China)
With the rapid development of Large Language Models (LLMs), their integration into automated mobile GUI testing has emerged as a promising research direction. However, existing LLM-based testing approaches face significant challenges, including time inefficiency and high costs due to constant LLM querying. To address these issues, this paper introduces LLMDroid, a novel testing framework designed to enhance existing automated mobile GUI testing tools by leveraging LLMs more efficiently. The workflow of LLMDroid comprises two main stages: Autonomous Exploration and LLM Guidance. During Autonomous Exploration, LLMDroid utilizes existing testing tools while leveraging LLMs to summarize explored pages. When code coverage growth slows, it transitions to LLM Guidance to strategically direct testing towards unexplored functionalities. This approach minimizes LLM interactions while maximizing their impact on test coverage. We applied LLMDroid to three popular open-source Android testing tools and evaluated it on 14 top-listed apps from Google Play. Results demonstrate an average increase of 26.16% in code coverage and 29.31% in activity coverage. Furthermore, our evaluation under different LLMs reveals that LLMDroid outperforms existing step-wise approaches with significant cost efficiency, achieving optimal performance at $0.49 per hour using GPT-4o among tested models, with a cost-effective alternative achieving 94% of this performance at just $0.03 per hour. These findings highlight LLMDroid’s effectiveness in enhancing automated mobile app testing and its potential for widespread adoption.
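A minimal sketch of the coverage-growth trigger described above: explore autonomously while coverage keeps improving, and fall back to LLM guidance only when growth stalls. The window size, threshold, and simulated step functions are placeholders, not LLMDroid's implementation.

```python
import random

# Illustrative switch between autonomous exploration and LLM guidance.
STALL_WINDOW = 5        # number of recent coverage measurements considered
MIN_GROWTH = 0.2        # minimum coverage growth (percentage points) per window

coverage = 0.0

def autonomous_step():                    # cheap: the existing GUI testing tool
    global coverage
    coverage += random.uniform(0.0, 0.3)

def llm_guided_step():                    # expensive: summarize pages, ask for a target
    global coverage
    coverage += random.uniform(1.0, 3.0)  # pretend guidance unlocks new screens

def run_testing(steps):
    history = []
    for _ in range(steps):
        stalled = (len(history) >= STALL_WINDOW and
                   history[-1] - history[-STALL_WINDOW] < MIN_GROWTH)
        if stalled:
            llm_guided_step()
            history.clear()               # give exploration a fresh window
        else:
            autonomous_step()
        history.append(coverage)
    return coverage

print(f"final simulated coverage: {run_testing(30):.1f}%")
```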
Article Search
The Struggles of LLMs in Cross-Lingual Code Clone Detection
Micheline Bénédicte Moumoula,
Abdoul Kader Kaboré,
Jacques Klein, and
Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg; Interdisciplinary Centre of Excellence in AI for Development (CITADEL), Burkina Faso)
With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and eight (08) prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of “code clones” in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ∼1 and ∼20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
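A minimal sketch of the embedding-plus-classifier baseline the abstract describes: map fragments from different languages into one vector space, derive pair features such as cosine similarity, and train a small classifier. The embedding function here is a deterministic random stand-in for a real pre-trained code embedding model, and the toy pairs are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(fragment: str) -> np.ndarray:
    # Stand-in for a cross-lingual code embedding model (deterministic toy).
    seed = sum(ord(c) for c in fragment) % (2**32)
    return np.random.default_rng(seed).normal(size=64)

def pair_features(a: str, b: str) -> np.ndarray:
    ea, eb = embed(a), embed(b)
    cos = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))
    return np.array([cos, float(np.linalg.norm(ea - eb))])

# Toy training pairs: (Python fragment, Java fragment, is_clone).
pairs = [
    ("def add(a, b): return a + b", "int add(int a, int b) { return a + b; }", 1),
    ("def mul(a, b): return a * b", "int mul(int a, int b) { return a * b; }", 1),
    ("def add(a, b): return a + b", "void log(String m) { System.out.println(m); }", 0),
    ("def mul(a, b): return a * b", "boolean empty(List l) { return l.isEmpty(); }", 0),
]

X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))     # with real embeddings, clone pairs score closer together
```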
Article Search
Has My Code Been Stolen for Model Training? A Naturalness Based Approach to Code Contamination Detection
Haris Ali Khan,
Yanjie Jiang,
Qasim Umer,
Yuxia Zhang,
Waseem Akram, and
Hui Liu
(Beijing Institute of Technology, China; Peking University, China; King Fahd University of Petroleum and Minerals, Saudi Arabia)
It is often valuable to know whether a given piece of source code has or has not been used to train a given deep learning model. On one hand, this helps avoid data contamination problems that may exaggerate the performance of evaluated models. On the other hand, it facilitates copyright protection by identifying private or protected code leveraged for model training without permission. To this end, automated approaches, known as data contamination detection, have been proposed. Such approaches often rely heavily on the confidence of the involved models, assuming that the models should be more confident in handling contaminated data than clean data. However, such approaches do not consider the nature of the given data item, i.e., how difficult it is to predict. Consequently, difficult-to-predict contaminated data and easy-to-predict clean data are often misclassified. As an initial attempt to solve this problem, this paper presents a naturalness-based approach, called Natural-DaCoDe, for code-completion models to distinguish contaminated source code from clean code. Natural-DaCoDe leverages code naturalness to quantitatively measure the difficulty of a given piece of source code for code-completion models. It then trains a classifier to distinguish contaminated source code according to both code naturalness and the performance of the code-completion models on the given source code. We evaluate Natural-DaCoDe with two pre-trained large language models (e.g., ChatGPT and Claude) and two code-completion models that we trained from scratch for detecting contaminated data. Our evaluation results suggest that Natural-DaCoDe substantially outperforms the state-of-the-art approaches in detecting contaminated data, improving the average accuracy by 61.78%. We also evaluate Natural-DaCoDe on a method name suggestion task, where it remains more accurate than the state-of-the-art approaches, improving the accuracy by 54.39%. Furthermore, Natural-DaCoDe was tested on a natural language text benchmark, significantly outperforming the state-of-the-art approaches by 22%. This suggests that Natural-DaCoDe could be applied to various source-code-related tasks besides code completion.
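To make the intuition concrete, the sketch below combines a naturalness signal (per-token cross-entropy under a toy unigram model) with a completion-accuracy signal: code that is hard to predict in general yet reproduced almost perfectly by a model is suspicious. The unigram model, thresholds, and perfect-echo assumption are illustrative stand-ins, not Natural-DaCoDe's actual features or classifier.

```python
import math
from collections import Counter

def cross_entropy(tokens, counts, total, vocab):
    # Per-token cross-entropy under a Laplace-smoothed unigram model.
    return -sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens) / len(tokens)

def completion_accuracy(predicted_tokens, reference_tokens):
    hits = sum(a == b for a, b in zip(predicted_tokens, reference_tokens))
    return hits / max(1, len(reference_tokens))

corpus = "def get name return self name def set name self name value".split()
counts, total, vocab = Counter(corpus), len(corpus), len(set(corpus))

snippet = "def get name return self name".split()
ce = cross_entropy(snippet, counts, total, vocab)
acc = completion_accuracy(snippet, snippet)   # pretend the model echoed it perfectly

# A hard-to-predict snippet that the model nonetheless completes with very high
# accuracy may have been memorized during training (i.e., contaminated).
suspicious = ce > 1.5 and acc > 0.9
print(f"cross-entropy={ce:.2f}, accuracy={acc:.2f}, flagged={suspicious}")
```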
Article Search
Automated and Accurate Token Transfer Identification and Its Applications in Cryptocurrency Security
Shuwei Song,
Ting Chen,
Ao Qiao,
Xiapu Luo,
Leqing Wang,
Zheyuan He,
Ting Wang,
Xiaodong Lin,
Peng He,
Wensheng Zhang, and
Xiaosong Zhang
(University of Electronic Science and Technology of China, China; Hong Kong Polytechnic University, China; Stony Brook University, USA; University of Guelph, Canada)
Cryptocurrency tokens, implemented by smart contracts, are prime targets for attackers due to their substantial monetary value. To illicitly gain profit, attackers often embed malicious code or exploit vulnerabilities within token contracts. Token transfer identification is crucial for detecting malicious and vulnerable token contracts. However, existing methods suffer from high false positives or false negatives due to invalid assumptions or reliance on limited patterns. This paper introduces a novel approach that captures the essential principles of token contracts, which are independent of programming languages and token standards, and presents a new tool, CRYPTO-SCOUT. CRYPTO-SCOUT automatically and accurately identifies token transfers, enabling the detection of various malicious and vulnerable token contracts. CRYPTO-SCOUT's core innovation is its capability to automatically identify complex container-type variables used by token contracts for storing holder information. It processes the bytecode of smart contracts written in the two most widely-used languages, Solidity and Vyper, and supports the three most popular token standards, ERC20, ERC721, and ERC1155. Furthermore, CRYPTO-SCOUT detects four types of malicious and vulnerable token contracts and is designed to be extensible. Extensive experiments show that CRYPTO-SCOUT outperforms existing approaches and uncovers over 21,000 malicious/vulnerable token contracts and more than 12,000 transactions triggering them.
Article Search
Revolutionizing Newcomers’ Onboarding Process in OSS Communities: The Future AI Mentor
Xin Tan,
Xiao Long,
Yinghao Zhu,
Lin Shi,
Xiaoli Lian, and
Li Zhang
(Beihang University, China)
Onboarding newcomers is vital for the sustainability of open-source software (OSS) projects. To lower barriers and increase engagement, OSS projects have dedicated experts who provide guidance for newcomers. However, timely responses are often hindered by experts' busy schedules. The recent rapid advancements of AI in software engineering have brought opportunities to leverage AI as a substitute for expert mentoring. However, the potential role of AI as a comprehensive mentor throughout the entire onboarding process remains unexplored. To identify design strategies for this ``AI mentor'', we applied Design Fiction as a participatory method with 19 OSS newcomers. We investigated their current onboarding experience and elicited 32 design strategies for a future AI mentor. Participants envisioned the AI mentor being integrated into OSS platforms like GitHub, where it could offer assistance to newcomers, such as ``recommending projects based on personalized requirements'' and ``assessing and categorizing project issues by difficulty''. We also collected participants' perceptions of a prototype, named ``OSSerCopilot'', that implemented the envisioned strategies. They found the interface useful and user-friendly, showing a willingness to use it in the future, which suggests the design strategies are effective. Finally, in order to identify the gaps between our design strategies and current research, we conducted a comprehensive literature review, evaluating the extent of existing research support for this concept. We find that research is relatively scarce in certain areas where newcomers highly anticipate AI mentor assistance, such as ``discovering an interesting project''. Our study has the potential to revolutionize the current newcomer-expert mentorship model and provides valuable insights for researchers and tool designers aiming to develop and enhance AI mentor systems.
Article Search
Impact of Request Formats on Effort Estimation: Are LLMs Different Than Humans?
Gül Çalıklı and
Mohammed Alhamed
(University of Glasgow, UK; Applied Behaviour Systems, UK)
Software development Effort Estimation (SEE) comprises predicting the most realistic amount of effort (e.g., in work hours) required to develop or maintain software based on incomplete, uncertain, and noisy input. Expert judgment is the dominant SEE strategy used in the industry. Yet, expert-based judgment can provide inaccurate effort estimates, leading to projects’ poor budget planning and cost and time overruns, negatively impacting the world economy. Large Language Models (LLMs) are good candidates to assist software professionals in effort estimation. However, effectively leveraging them for SEE requires thoroughly investigating their limitations and to what extent they overlap with those of (human) software practitioners. One primary limitation of LLMs is the sensitivity of their responses to prompt changes. Similarly, empirical studies showed that changes in the request format (e.g., rephrasing) could impact (human) software professionals’ effort estimates. This paper reports the first study that replicates a series of SEE experiments, which were initially carried out with software professionals (humans) in the literature. Our study aims to investigate how LLMs’ effort estimates change due to the transition from the traditional request format (i.e., “How much effort is required to complete X?”) to the alternative request format (i.e., “How much can be completed in Y work hours?”). Our experiments involved three different LLMs (GPT-4, Gemini 1.5 Pro, Llama 3.1) and 88 software project specifications (per treatment in each experiment), resulting in 880 prompts in total, which we prepared using 704 user stories from three large-scale open-source software projects (Hyperledger Fabric, Mulesoft Mule, Spring XD). Our findings align with the original experiments conducted with software professionals: the first four experiments showed that LLMs provide lower effort estimates when transitioning from the traditional to the alternative request format. The first and fifth experiments detected that LLMs display patterns analogous to anchoring bias, a human cognitive bias defined as the tendency to stick to the anchor (i.e., the “Y work hours” in the alternative request format). Our findings provide crucial insights into facilitating future human-AI collaboration and prompt designs for improved effort estimation accuracy.
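The two request formats under study can be written down as simple prompt templates; the wording below paraphrases the abstract and is not the authors' exact prompt text.

```python
# Paraphrased request formats from the abstract (not the authors' exact prompts).
TRADITIONAL = ("How much effort (in work hours) is required to complete the "
               "following user story?\n\n{story}")
ALTERNATIVE = ("How much of the following user story can be completed in "
               "{hours} work hours?\n\n{story}")

story = "As a user, I want to reset my password via email so that I can regain access."
print(TRADITIONAL.format(story=story))
print(ALTERNATIVE.format(hours=8, story=story))
```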
Article Search
Mitigating Emergent Malware Label Noise in DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Guohan Zhang,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Harbin Institute of Technology, Shenzhen, China; Huazhong University of Science and Technology, China)
Learning-based Android malware detection has earned significant recognition across industry and academia, yet its effectiveness hinges on the accuracy of labeled training data. Manual labeling, being prohibitively expensive, has prompted the use of automated methods, such as leveraging anti-virus engines like VirusTotal, which unfortunately introduces mislabeling, aka "label noise". The state-of-the-art label noise reduction approach, MalWhiteout, can effectively reduce random label noise but underperforms in mitigating real-world emergent malware (EM) label noise stemming from newly emerging Android malware variants overlooked by VirusTotal. To tackle this, we conceptualize EM label noise detection as an anomaly detection problem and introduce a novel tool, MalCleanse, that surpasses MalWhiteout's ability to address EM label noise. MalCleanse combines uncertainty estimation with unsupervised anomaly detection, identifying samples with high uncertainty as mislabeled, thereby enhancing its capability to remove EM label noise. Our experimental results demonstrate a significant reduction in EM label noise by approximately 25.25%, achieving an F1 Score of 80.32% for label noise detection at a noise ratio of 40.61%. Notably, MalCleanse outperforms MalWhiteout with an increase of 40.9% in overall F1 score for mitigating EM label noise. This paper pioneers the integration of deep neural network model uncertainty to refine label accuracy, thereby enhancing the reliability of malware detection systems. Our approach represents a significant step forward in addressing the challenges posed by emergent malware in automated labeling systems.
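A minimal sketch of the combination described above: estimate each sample's predictive uncertainty (here via Monte Carlo dropout-style stochastic scoring) and hand the scores to an unsupervised anomaly detector, treating flagged outliers as candidate label noise. The toy features, scoring function, and contamination rate are placeholders, not MalCleanse's pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

def mc_dropout_scores(features, passes=20, drop=0.3):
    # Stand-in for T stochastic forward passes of a dropout-enabled detector:
    # each pass randomly masks features before a fixed scoring projection.
    w = rng.normal(size=features.shape[1])
    outs = []
    for _ in range(passes):
        mask = rng.random(features.shape) > drop
        outs.append((features * mask) @ w)
    return np.stack(outs).std(axis=0)      # per-sample predictive uncertainty

apps = rng.normal(size=(200, 16))          # toy feature vectors of labeled APKs
uncertainty = mc_dropout_scores(apps)

detector = IsolationForest(contamination=0.1, random_state=0)
flags = detector.fit_predict(uncertainty.reshape(-1, 1))   # -1 marks anomalies
print(f"{(flags == -1).sum()} samples flagged as potential emergent-malware label noise")
```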
Article Search
Detecting Metadata-Related Bugs in Enterprise Applications
Md Mahir Asef Kabir,
Xiaoyin Wang, and
Na Meng
(Virginia Tech, USA; University of Texas at San Antonio, USA)
When building enterprise applications (EAs) on Java frameworks (e.g., Spring), developers often configure application components via metadata (i.e., Java annotations and XML files). It is challenging for developers to correctly use metadata, because the usage rules can be complex and existing tools provide limited assistance. When developers misuse metadata, EAs become misconfigured, and the resulting defects can trigger erroneous runtime behaviors or introduce security vulnerabilities. To help developers correctly use metadata, this paper presents (1) RSL---a domain-specific language that domain experts can adopt to prescribe metadata checking rules, and (2) MeCheck---a tool that takes in RSL rules and EAs to check for rule violations.
With RSL, domain experts (e.g., developers of a Java framework) can specify metadata checking rules by defining content consistency among XML files, annotations, and Java code. Given such RSL rules and a program to scan, MeCheck interprets the rules as cross-file static analyzers that scan Java and/or XML files to gather information and look for consistency violations. For evaluation, we studied the Spring and JUnit documentation to manually define 15 rules, and created 2 datasets with 115 open-source EAs. The first dataset includes 45 EAs and the ground truth of 45 manually injected bugs. The second dataset includes multiple versions of 70 EAs. We observed that MeCheck identified bugs in the first dataset with 100% precision, 96% recall, and 98% F-score. It reported 156 bugs in the second dataset, 53 of which were already fixed by developers. Our evaluation shows that MeCheck helps ensure the correct usage of metadata.
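As a simplified illustration of the kind of cross-file consistency rule MeCheck enforces, the sketch below checks that every bean class declared in a Spring-style XML snippet refers to a class known to the project. The XML, class list, and rule are stand-ins for RSL rules and the real analyzers.

```python
import xml.etree.ElementTree as ET

# Toy metadata: one bean class name contains a typo ("MailServce").
XML_CONFIG = """
<beans>
  <bean id="userService" class="com.example.UserService"/>
  <bean id="mailService" class="com.example.MailServce"/>
</beans>
"""

PROJECT_CLASSES = {"com.example.UserService", "com.example.MailService"}

def check_bean_classes(xml_text, known_classes):
    violations = []
    for bean in ET.fromstring(xml_text).iter("bean"):
        cls = bean.get("class")
        if cls not in known_classes:
            violations.append((bean.get("id"), cls))
    return violations

for bean_id, cls in check_bean_classes(XML_CONFIG, PROJECT_CLASSES):
    print(f"bean '{bean_id}' refers to unknown class {cls}")
```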
Article Search
An Adaptive Language-Agnostic Pruning Method for Greener Language Models for Code
Mootez Saad,
José Antonio Hernández López,
Boqi Chen,
Dániel Varró, and
Tushar Sharma
(Dalhousie University, Canada; University of Murcia, Spain; McGill University, Canada; Linköping University, Sweden)
Language models of code have demonstrated remarkable performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain as significant challenges.
This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce the computational overhead of these models.
The proposed method offers a pluggable layer that can be integrated with all Transformer-based models.
With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to three times smaller than their initial size, resulting in significantly reduced computational load.
Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions by up to 44.85%. Importantly, it achieves this reduction in computational resources while maintaining up to 98.1% of the original predictive performance.
These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance,
contributing to the overall sustainability of their adoption in software development.
Also, it sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE.
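A minimal sketch of attention-based sequence compression between Transformer layers: keep only the tokens that receive the most attention mass so that later layers process a shorter sequence. The random attention matrix and fixed keep ratio are illustrative; ALPINE's pluggable layer decides adaptively when and how much to compress.

```python
import numpy as np

def compress(hidden_states, attention, keep_ratio=0.5):
    """hidden_states: (seq_len, dim); attention: (seq_len, seq_len), rows sum to 1."""
    received = attention.sum(axis=0)             # attention mass each token receives
    k = max(1, int(len(received) * keep_ratio))
    keep = np.sort(np.argsort(received)[-k:])    # top-k tokens, original order kept
    return hidden_states[keep], keep

rng = np.random.default_rng(1)
seq_len, dim = 12, 8
h = rng.normal(size=(seq_len, dim))
attn = rng.random((seq_len, seq_len))
attn /= attn.sum(axis=1, keepdims=True)          # make rows stochastic

compressed, kept = compress(h, attn, keep_ratio=1/3)
print(f"kept {len(kept)}/{seq_len} tokens at positions {kept.tolist()}")
```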
Article Search
10 Years Later: Revisiting How Developers Search for Code
Kathryn T. Stolee,
Tobias Welp,
Caitlin Sadowski, and
Sebastian Elbaum
(North Carolina State University, USA; Google, Germany; Unaffiliated, USA; University of Virginia, USA)
Code search is an integral part of a developer’s workflow. In 2015, researchers published a paper reflecting on the code search practices at Google of 27 developers who used the internal Code Search tool. That paper had first-hand accounts for why those developers were using code search and highlighted how often and in what situations developers were searching for code. In the past decade, much has changed in the landscape of developer support. New languages have emerged, artificial intelligence (AI) for code generation has gained traction, auto-complete in the IDE has gotten better, Q&A forums have increased in popularity, and code repositories are larger than ever. It is worth considering whether those observations from almost a decade ago have stood the test of time. In this work, inspired by the prior survey about the Code Search tool, we run a series of three surveys with 1,945 total responses and report overall Code Search usage statistics for over 100,000 users. Unlike the prior work, in our surveys, we include explicit success criteria to understand when code search is meeting their needs, and when it is not. We dive further into two common sub-categories of code search effort: when its users are looking for examples and when they are using code search alongside code review. We find that Code Search users continue to use the tool frequently and the frequency has not changed despite the introduction of AI-enhanced development support. Users continue to turn to Code Search to find examples, but the frequency of example-seeking behavior has decreased. More often than before, users access the tool to learn about and explore code. This has implications for future Code Search support in software development.
Article Search
IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models
Sayem Mohammad Imtiaz,
Astha Singh,
Fraol Batole, and
Hridesh Rajan
(Iowa State University, USA; Tulane University, USA)
Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
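A minimal sketch of the selective-repair idea: rank layers by a sensitivity score (here, a stand-in for gradient norms measured on error-exposing data) and restrict repair fine-tuning to the top slice, freezing the rest. The layer names, scores, and 20% fraction are placeholders, and the sketch omits IRepair's dynamic slicing.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend per-layer gradient norms measured on toxicity-exposing prompts.
layer_grad_norms = {f"transformer.h.{i}": float(rng.gamma(2.0, 1.0)) for i in range(12)}

def select_layers_for_repair(grad_norms, fraction=0.2):
    k = max(1, int(len(grad_norms) * fraction))
    ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)
    return ranked[:k]

to_repair = select_layers_for_repair(layer_grad_norms)
print("layers selected for targeted repair:", to_repair)
# The remaining layers would stay frozen during the repair fine-tuning step.
```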
Preprint
Clone Detection for Smart Contracts: How Far Are We?
Zuobin Wang,
Zhiyuan Wan,
Yujing Chen,
Yun Zhang,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang), China; Hangzhou City University, China; Singapore Management University, Singapore)
In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research.
Article Search
Dissecting Real-World Cross-Language Bugs
Haoran Yang and
Haipeng Cai
(Washington State University, USA; SUNY Buffalo, USA)
Multilingual systems are prevalent and broadly impactful, but also complex due to the intricate interactions between the heterogeneous programming languages the systems are developed in. This complexity is further aggravated by the diversity of cross-language interoperability across different language combinations, resulting in additional, often stealthy cross-language bugs. Yet despite the growing number of tools aimed at discovering cross-language bugs, a systematic understanding of such bugs is still lacking. To fill this gap, we conduct the first comprehensive study of cross-language bugs, characterizing them along five aspects, namely their symptoms, locations, manifestation, root causes, and fixes, as well as the relationships among these aspects. Through careful identification and detailed analysis of 400 cross-language bugs in real-world multilingual projects, classified from 54,356 relevant code commits in their GitHub repositories, we reveal not only the bug characteristics along those five aspects but also how they compare between the two top language combinations in the multilingual world (Python-C and Java-C). In addition to the findings of the study as well as its enabling tools and datasets, we also provide practical recommendations regarding the prevention, detection, and patching of cross-language bugs.
Preprint
Info
Less Is More: On the Importance of Data Quality for Unit Test Generation
Junwei Zhang,
Xing Hu,
Shan Gao,
Xin Xia,
David Lo, and
Shanping Li
(Zhejiang University, China; Huawei, China; Singapore Management University, Singapore)
Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, and categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest to two widely used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of the two datasets, respectively, contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models. Fine-tuning the four LLMs with the filtered Methods2Test dataset improves their performance by 67% in branch coverage, on average, on the Defects4J benchmark. For the Atlas dataset, the four LLMs improve branch coverage by 39%. Additionally, filtering noise improves bug detection performance, resulting in a 21.42% increase in bugs detected by the generated tests.
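The three-filter pipeline can be pictured with a small Python sketch; the concrete rules, the coverage_model interface, and all names below are hypothetical placeholders rather than CleanTest's actual implementation.

import re

def syntax_filter(pair):
    """Rule-based syntax check: the test must be annotated and contain an assertion."""
    focal_method, test = pair
    return "@Test" in test and re.search(r"\bassert\w*\s*\(", test) is not None

def relevance_filter(pair):
    """Rule-based relevance check: the test should mention the focal method's name."""
    focal_method, test = pair
    match = re.search(r"\b(\w+)\s*\(", focal_method)   # crude guess at the method name
    return bool(match) and match.group(1) in test

def coverage_filter(pair, coverage_model):
    """Model-based check: keep pairs the (assumed) model predicts to cover the focal method."""
    return coverage_model.predict(pair) >= 0.5

def clean(dataset, coverage_model):
    """Apply the three filters in sequence to produce a noise-reduced dataset."""
    return [pair for pair in dataset
            if syntax_filter(pair) and relevance_filter(pair)
            and coverage_filter(pair, coverage_model)]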
Article Search
Protecting Privacy in Software Logs: What Should Be Anonymized?
Roozbeh Aghili,
Heng Li, and
Foutse Khomh
(Polytechnique Montréal, Canada)
Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To mitigate this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by performing an analysis of 25 publicly available log datasets to identify potentially sensitive attributes. Based on the result of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues, while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs.
Article Search
MiSum: Multi-modality Heterogeneous Code Graph Learning for Multi-intent Binary Code Summarization
Kangchen Zhu,
Zhiliang Tian,
Shangwen Wang,
Weiguo Chen,
Zixuan Dong,
Mingyue Leng, and
Xiaoguang Mao
(National University of Defense Technology, China)
The current landscape of binary code summarization predominantly focuses on generating a single summary, which limits its utility and understanding for reverse engineers. Existing approaches often fail to meet the diverse needs of users, such as providing detailed insights into usage patterns, implementation nuances, and design rationale, as observed in the field of source code summarization. This highlights the need for multi-intent binary code summarization to enhance the effectiveness of reverse engineering processes. To address this gap, we propose MiSum, a novel method that leverages multi-modality heterogeneous code graph alignment and learning to integrate both assembly code and pseudo-code. MiSum introduces a unified multi-modality heterogeneous code graph (MM-HCG) that aligns assembly code graphs with pseudo-code graphs, capturing both low-level execution details and high-level structural information. We further propose multi-modality heterogeneous graph learning with heterogeneous mutual attention and message passing, which highlights important code blocks and discovers inter-dependencies across different code forms. Additionally, an intent-aware summary generator with an intent-aware attention mechanism is introduced to produce customized summaries tailored to multiple intents. Extensive experiments, including evaluations across various architectures and optimization levels, demonstrate that MiSum outperforms state-of-the-art baselines in BLEU, METEOR, and ROUGE-L metrics. Human evaluations validate its capability to effectively support reverse engineers in understanding diverse binary code intents, marking a significant advancement in binary code analysis.
Article Search
Info
Artifacts Available
SemBIC: Semantic-Aware Identification of Bug-Inducing Commits
Xiao Chen,
Hengcheng Zhu,
Jialun Cao,
Ming Wen, and
Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Huazhong University of Science and Technology, China)
Debugging can be greatly facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (a.k.a. the bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We find that SEMBIC can identify BICs with high accuracy: it ranks the BIC first for 88 out of 199 bugs and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively.
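The core ranking idea, scoring each candidate commit by the semantic change it introduces on the failing path and evaluating with MRR, might look roughly like the following Python sketch; semantic_delta stands in for the paper's three fine-grained semantic properties and is an assumed hook, not something specified here.

def rank_bic_candidates(commits, failing_path, semantic_delta):
    """Rank candidate commits by the semantic change they introduce on the failing path."""
    scored = [(semantic_delta(commit, failing_path), commit) for commit in commits]
    return [commit for _, commit in sorted(scored, key=lambda pair: pair[0], reverse=True)]

def mean_reciprocal_rank(ranked_lists, true_bics):
    """Standard MRR: average of 1 / rank of the true BIC over all bugs (0 if it is missing)."""
    reciprocal = [1.0 / (ranking.index(bic) + 1)
                  for ranking, bic in zip(ranked_lists, true_bics) if bic in ranking]
    return sum(reciprocal) / len(ranked_lists)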
Article Search
Artifacts Available
Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Weisong Sun,
Yuchen Chen,
Chunrong Fang,
Yebo Feng,
Yuan Xiao,
An Guo,
Quanjun Zhang,
Zhenyu Chen,
Baowen Xu, and
Yang Liu
(Nanyang Technological University, Singapore; Nanjing University, China)
Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could jeopardize life safety. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper, we propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and then unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, since backdoor triggers can be viewed as backdoor (adversarial) perturbations, EliBadCode introduces a sample-specific trigger position identification method that reduces the interference of non-backdoor (adversarial) perturbations during subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify it. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode substantially decreases the average Attack Success Rate (ASR) of the advanced backdoor attack from 99.76% to 2.64%, significantly outperforming the three baselines. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline).
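The final unlearning step alone could look roughly like the following Python sketch, which fine-tunes the backdoored model so that inputs stamped with the inverted trigger keep their correct labels; insert_trigger, the data layout, and the training loop are assumptions made for illustration, not EliBadCode's implementation, and the trigger inversion itself is omitted.

import torch
import torch.nn.functional as F

def unlearn_backdoor(model, optimizer, clean_batches, inverted_trigger, insert_trigger, epochs=1):
    """Fine-tune so that trigger-stamped inputs are still classified with their true labels."""
    model.train()
    for _ in range(epochs):
        for inputs, labels in clean_batches:
            triggered = insert_trigger(inputs, inverted_trigger)   # stamp the inverted trigger
            logits = torch.cat([model(inputs), model(triggered)])  # clean and triggered views
            targets = torch.cat([labels, labels])                  # triggered inputs keep true labels
            loss = F.cross_entropy(logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()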
Article Search
Understanding Debugging as Episodes: A Case Study on Performance Bugs in Configurable Software Systems
Max Weber,
Alina Mailach,
Sven Apel,
Janet Siegmund,
Raimund Dachselt, and
Norbert Siegmund
(Leipzig University, Germany; ScaDS.AI Dresden-Leipzig, Germany; Saarland Informatics Campus, Germany; Saarland University, Germany; Chemnitz University of Technology, Germany; Dresden University of Technology, Germany)
Debugging performance bugs in configurable software systems is a complex and time-consuming task that requires not only fixing a bug, but also understanding its root cause. While there is a vast body of literature on debugging strategies, there is no consensus on general debugging. This makes it difficult to provide concrete guidance for developers, especially for configuration-dependent performance bugs. The goal of our work is to alleviate this situation by providing a framework to describe debugging strategies in a more general, unifying way. We conducted a user study with 12 professional developers who debugged a performance bug in a real-world configurable system. To observe developers in an unobtrusive way, we provided an immersive virtual reality tool, SoftVR, giving them a large degree of freedom to choose their preferred debugging strategy. The results show that the existing documentation of strategies is too coarse-grained and intermixed to identify successful approaches. In a subsequent qualitative analysis, we devised a coding framework to reason about debugging approaches. With this framework, we identified five goal-oriented episodes that developers employ, which they also confirmed in subsequent interviews. Our work provides a unified description of debugging strategies, giving researchers a common foundation for studying debugging, and practitioners and teachers guidance on successful debugging strategies.
Article Search
Detecting and Reducing the Factual Hallucinations of Large Language Models with Metamorphic Testing
Weibin Wu,
Yuhang Cao,
Ning Yi,
Rongyi Ou, and
Zibin Zheng
(Sun Yat-sen University, China)
Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation.
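The underlying consistency check can be sketched in a few lines of Python; ask_llm, transforms, and agree are hypothetical hooks for the model, the metamorphic relations, and the answer comparison, and the majority-vote threshold is an illustrative choice rather than DrHall's actual decision rule.

def is_likely_hallucination(question, ask_llm, transforms, agree):
    """Ask the original question and metamorphic variants of it; flag the answer as a likely
    hallucination when most variants disagree with the original answer."""
    baseline = ask_llm(question)
    variant_answers = [ask_llm(transform(question)) for transform in transforms]
    disagreements = sum(1 for answer in variant_answers if not agree(baseline, answer))
    return disagreements > len(variant_answers) // 2, baseline

The intuition matches the abstract: a grounded answer tends to survive re-execution along different paths, while a hallucinated one drifts.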
Article Search
On the Unnecessary Complexity of Names in X.509 and Their Impact on Implementations
Yuteng Sun,
Joyanta Debnath,
Wenzheng Hong,
Omar Chowdhury, and
Sze Yiu Chau
(Chinese University of Hong Kong, Hong Kong; Stony Brook University, USA; Independent, China)
The X.509 Public Key Infrastructure (PKI) provides a cryptographically verifiable mechanism for authenticating a binding of an entity’s public-key with its identity, presented in a tamper-proof digital certificate. This often serves as a foundational building block for achieving different security guarantees in many critical applications and protocols (e.g., SSL/TLS). Identities in the context of X.509 PKI are often captured as names, which are encoded in certificates as composite records with different optional fields that can store various types of string values (e.g., ASCII, UTF8). Although such flexibility allows for the support of diverse name types (e.g., IP addresses, DNS names) and application scenarios, it imposes on library developers obligations to enforce unnecessarily convoluted requirements. Bugs in enforcing these requirements can lead to potential interoperability and performance issues, and might open doors to impersonation attacks. This paper focuses on analyzing how open-source libraries enforce the constraints regarding the formatting, encoding, and processing of complex name structures on X.509 certificate chains, for the purpose of establishing identities. Our analysis reveals that a portfolio of simplistic testing approaches can expose blatant violations of the requirements in widely used open-source libraries. Although there is a substantial amount of prior work that focused on testing the overall certificate chain validation process of X.509 libraries, the identified violations have escaped their scrutiny. To make matters worse, we observe that almost all the analyzed libraries completely ignore certain pre-processing steps prescribed by the standard. This begs the question of whether it is beneficial to have a standard so flexible but complex that none of the implementations can faithfully adhere to it. With our test results, we argue in the negative, and explain how simpler alternatives (e.g., other forms of identifiers such as Authority and Subject Key Identifiers) can be used to enforce similar constraints with no loss of security.
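As a rough illustration of the kind of pre-processing the standard expects before comparing directory-string names (Unicode normalization, insignificant-whitespace handling, and case folding), a simplified Python sketch follows; it is deliberately not the full LDAP string-preparation profile and omits the encoding-specific rules the paper examines.

import unicodedata

def normalize_directory_string(value: str) -> str:
    """Roughly: normalize the Unicode form, collapse internal whitespace, fold case."""
    value = unicodedata.normalize("NFKC", value)
    value = " ".join(value.split())   # trim ends and collapse runs of whitespace
    return value.casefold()

def names_match(a: str, b: str) -> bool:
    """Compare two directory-string attribute values after normalization."""
    return normalize_directory_string(a) == normalize_directory_string(b)

Even this small amount of normalization is easy to skip or get subtly wrong, which is the class of implementation divergence the paper's tests expose.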
Article Search
RePurr: Automated Repair of Block-Based Learners’ Programs
Sebastian Schweikl and
Gordon Fraser
(University of Passau, Germany)
Programming is increasingly taught using dedicated block-based programming environments such as Scratch. While the use of blocks instead of text prevents syntax errors, learners can still make semantic mistakes, implying a need for feedback and help. Since teachers may be overwhelmed by help requests in a classroom, may not have the required programming education themselves, and may simply not be available in independent learning scenarios, automated hint generation is desirable. Automated program repair can provide the foundation for automated hints, but relies on multiple assumptions: (1) Program repair usually aims to produce localized patches for fixing single bugs, but learners may fundamentally misunderstand programming concepts and tasks or request help for substantially incomplete programs. (2) Software tests are required to guide the search and to localize broken statements, but test suites for block-based programs differ from those considered in past research on fault localization and repair: they consist of system tests, where very few tests are sufficient to fully cover the code. At the same time, these tests have vastly longer execution times, caused by the use of animations and interactions in Scratch programs, thus inhibiting the applicability of metaheuristic search. (3) The plastic surgery hypothesis assumes that the code necessary for repairs already exists in the codebase; block-based programs tend to be small and may lack this necessary redundancy. To study whether automated program repair of block-based programs is nevertheless feasible, in this paper we introduce, to the best of our knowledge, the first automated program repair approach for Scratch programs based on evolutionary search. Our RePurr prototype includes novel refinements of fault localization to compensate for the limited guidance provided by the test suites, restores the plastic surgery hypothesis by exploiting the fact that a learning scenario provides model and student solutions as alternatives, and uses parallelization and accelerated executions to reduce the cost of fitness evaluations. An empirical evaluation of RePurr on a set of real learners' programs confirms the anticipated challenges, but also demonstrates that repair can nonetheless effectively improve and fix learners' programs, thus enabling automated generation of hints and feedback for learners.
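The overall generate-and-validate loop of evolutionary repair can be sketched in Python as follows; the mutate, crossover, and fitness operators for Scratch programs are left abstract, and nothing here reflects RePurr's concrete operators or its parallelized, accelerated test execution.

import random

def repair(program, tests, mutate, crossover, fitness, generations=100, pop_size=20):
    """Generate-and-validate loop: evolve candidate programs until one passes every test."""
    population = [program] + [mutate(program) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda cand: fitness(cand, tests), reverse=True)
        if all(test(ranked[0]) for test in tests):      # plausible repair found
            return ranked[0]
        parents = ranked[:pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return None                                          # no repair within the budget

In the learning setting described above, the model solution can seed the mutation and crossover material, which is one way the plastic surgery hypothesis can be recovered despite small student programs.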
Article Search
Why the Proof Fails in Different Versions of Theorem Provers: An Empirical Study of Compatibility Issues in Isabelle
Xiaokun Luan,
David Sanan,
Zhe Hou,
Qiyuan Xu,
Chengwei Liu,
Yufan Cai,
Yang Liu, and
Meng Sun
(Peking University, China; Singapore Institute of Technology, Singapore; Griffith University, Australia; Nanyang Technological University, Singapore; China-Singapore International Joint Research Institute, China; National University of Singapore, Singapore)
Proof assistants are software tools for formal modeling and verification of software, hardware, design, and mathematical proofs. Due to the growing complexity and scale of formal proofs, compatibility issues frequently arise when using different versions of proof assistants. These issues result in broken proofs, disrupting the maintenance of formalized theories and hindering the broader dissemination of results within the community. Although existing works have proposed techniques to address specific types of compatibility issues, the overall characteristics of these issues remain largely unexplored. To address this gap, we conduct the first extensive empirical study to characterize compatibility issues, using Isabelle as a case study. We develop a regression testing framework to automatically collect compatibility issues from the Archive of Formal Proofs, the largest repository of formal proofs in Isabelle. By analyzing 12,079 collected issues, we identify their types and symptoms and further investigate their root causes. We also extract updated proofs that address these issues to understand the applied resolution strategies. Our study provides an in-depth understanding of compatibility issues in proof assistants, offering insights that support the development of effective techniques to mitigate these issues.
Article Search
Artifacts Available
Improving Graph Learning-Based Fault Localization with Tailored Semi-supervised Learning
Chun Li,
Hui Li,
Zhong Li,
Minxue Pan, and
Xuandong Li
(Nanjing University, China; Samsung Electronics, China)
Due to advancements in graph neural networks, graph learning-based fault localization (GBFL) methods have achieved promising results. However, as these methods follow a supervised learning paradigm and deep learning is typically data-hungry, they can only be trained on fully labeled large-scale datasets. This is impractical because labeling failed tests is similar to manual fault localization, which is time-consuming and labor-intensive, so only a small portion of failed tests can be labeled within limited budgets. These data labeling limitations lead to sub-optimal effectiveness of supervised GBFL techniques. Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance and address data labeling limitations. However, as these methods are not specifically designed for fault localization, directly utilizing them might lead to sub-optimal effectiveness. In response, we propose a novel semi-supervised GBFL framework, LEGATO. LEGATO first leverages the attention mechanism to identify and augment likely fault-unrelated sub-graphs in unlabeled graphs and then quantifies the suspiciousness distribution of unlabeled graphs to estimate pseudo-labels. By training the model on augmented unlabeled graphs and pseudo-labels, LEGATO can utilize the unlabeled data to improve the effectiveness of fault localization and address the restrictions in data labeling. In extensive evaluations against three baseline SSL methods, LEGATO demonstrates superior performance, outperforming all methods in comparison.
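The pseudo-labelling step can be illustrated with a short PyTorch-style sketch that keeps only the unlabeled graphs on which the model is confident; the confidence threshold and the per-node suspiciousness reading are illustrative assumptions, not LEGATO's actual estimation procedure, and the attention-based sub-graph augmentation is omitted.

import torch

def pseudo_label(model, unlabeled_graphs, threshold=0.9):
    """Keep only unlabeled graphs on which the model's per-node predictions are confident,
    and use those predictions as pseudo-labels for further training."""
    model.eval()
    selected = []
    with torch.no_grad():
        for graph in unlabeled_graphs:
            probs = torch.softmax(model(graph), dim=-1)   # per-node suspiciousness distribution
            confidence, labels = probs.max(dim=-1)
            if confidence.mean().item() >= threshold:
                selected.append((graph, labels))
    return selected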
Article Search
A Mixed-Methods Study of Model-Based GUI Testing in Real-World Industrial Settings
Shaoheng Cao,
Renyi Chen,
Wenhua Yang,
Minxue Pan, and
Xuandong Li
(Nanjing University, China; Samsung Electronics, China; Nanjing University of Aeronautics and Astronautics, China)
Model-based testing (MBT) has been an important methodology in software engineering, attracting extensive research attention for over four decades. However, despite its academic acclaim, studies examining the impact of MBT in industrial environments, particularly regarding its extended effects, remain limited and yield unclear results. This gap likely stems from the difficulty of establishing a study environment in which MBT can be implemented and applied in production settings and its impact evaluated over time. To bridge this gap, we collaborated with an industrial partner to undertake a comprehensive, longitudinal empirical study employing mixed methods. Over two months, we implemented our MBT tool within the corporation, assessing the immediate and extended effectiveness and efficiency of MBT compared to script-writing-based testing. Through a mix of quantitative and qualitative methods (spanning controlled experiments, questionnaire surveys, and interviews), our study uncovers several insightful findings. These include differences in effectiveness and maintainability between immediate and extended MBT application, the evolving perceptions and expectations of engineers regarding MBT, and more. Leveraging these insights, we propose actionable implications for both the academic and industrial communities, aimed at bolstering confidence in MBT adoption and investment for software testing purposes.
Article Search