ISSTA 2025
Proceedings of the ACM on Software Engineering, Volume 2, Number ISSTA, June 25–28, 2025, Trondheim, Norway

ISSTA – Journal Issue


Frontmatter

Title Page


Editorial Message


Sponsors


Papers

Doctor: Optimizing Container Rebuild Efficiency by Instruction Re-orchestration
Zhiling Zhu, Tieming Chen, Chengwei Liu, Han Liu, Qijie Song, Zhengzi Xu, and Yang Liu
(Zhejiang University of Technology, China; Nanyang Technological University, Singapore; Hong Kong University of Science and Technology, Hong Kong)


OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, and Zibin Zheng
(Sun Yat-sen University, China; Independent Researcher, China; Huawei Cloud Computing Technologies, China; Chongqing University, China)


Automated Attack Synthesis for Constant Product Market Makers
Sujin Han, Jinseo Kim, Sung-Ju Lee, and Insu Yun
(KAIST, Republic of Korea)
Decentralized Finance (DeFi) enables many novel applications that were impossible in traditional finance. However, it also introduces new types of vulnerabilities. An example of such vulnerabilities is a composability bug between token contracts and Decentralized Exchanges (DEXs) that follow the Constant Product Market Maker (CPMM) model. This type of bug, which we refer to as the CPMM composability bug, originates from issues in token contracts that make them incompatible with CPMMs, thereby endangering other tokens within the CPMM ecosystem. Since 2022, 23 exploits of this kind have resulted in a total loss of 2.2M USD. BlockSec, a smart contract auditing company, reported that 138 exploits of this kind occurred in February 2023 alone.
In this paper, we propose CPMMX, a tool that automatically detects CPMM composability bugs across entire blockchains. To achieve such scalability, we first formalized CPMM composability bugs and found that these bugs can be induced by breaking two safety invariants. Based on this finding, we designed CPMMX with a two-step approach, called shallow-then-deep search. In more detail, it first uses shallow search to find transactions that break the invariants. Then, it uses deep search to refine these transactions, making them profitable for the attacker. We evaluated CPMMX against five baselines on two public datasets and one synthetic dataset. In our evaluation, CPMMX detected 1.5x to 2.5x more vulnerabilities than the baseline methods. It also analyzed contracts significantly faster, achieving higher F1 scores than the baselines. Additionally, we applied CPMMX to all contracts on the latest blocks of the Ethereum and Binance networks and discovered 26 new exploits that can result in 15.7K USD profit in total.
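The constant-product rule at the heart of this bug class is easy to state concretely. Below is a minimal, self-contained Python model of a CPMM swap and the reserve-product safety invariant; the function names, the 0.3% fee, and the reserve figures are illustrative assumptions, not CPMMX's actual formalization.

```python
# Minimal constant-product AMM model (illustrative, not CPMMX's implementation).
# A healthy CPMM never lets reserve_x * reserve_y decrease across swaps.

def swap_x_for_y(reserve_x: int, reserve_y: int, dx: int, fee_bps: int = 30):
    """Return (new_reserve_x, new_reserve_y, dy) for swapping dx of token X."""
    dx_after_fee = dx * (10_000 - fee_bps) // 10_000
    # Choose dy so that (x + dx') * (y - dy) >= x * y still holds.
    dy = reserve_y * dx_after_fee // (reserve_x + dx_after_fee)
    return reserve_x + dx, reserve_y - dy, dy

def product_invariant_holds(rx0, ry0, rx1, ry1) -> bool:
    # Safety invariant: the reserve product must never decrease.
    return rx1 * ry1 >= rx0 * ry0

rx, ry = 1_000_000, 1_000_000
rx2, ry2, dy = swap_x_for_y(rx, ry, 10_000)
assert product_invariant_holds(rx, ry, rx2, ry2)
```

A token whose transfer logic silently changes balances (e.g., a fee-on-transfer or rebasing token) can make the pool's bookkeeping diverge from its real reserves, which is the kind of incompatibility such invariant checks are meant to surface.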

xFUZZ: A Flexible Framework for Fine-Grained, Runtime-Adaptive Fuzzing Strategy Composition
Dongsong Yu, Yiyi Wang, Chao Zhang, Yang Lan, Zhiyuan Jiang, Shuitao Gan, Zheyu Ma, and Wende Tan
(Zhongguancun Laboratory, China; Tsinghua University, China; Huazhong University of Science and Technology, China; National University of Defense Technology, China; Laboratory for Advanced Computing and Intelligence Engineering, China)


DataHook: An Efficient and Lightweight System Call Hooking Technique without Instruction Modification
Quan Hong, Jiaqi Li, Wen Zhang, and Lidong Zhai
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; China Unicom Online Information Technology, China)
System calls serve as the primary interface for interaction between user-space programs and the operating system (OS) kernel. By hooking system calls, it is possible to analyze and modify the behavior of user-space programs. This paper proposes DataHook, an efficient and lightweight system call hooking technique for 32-bit programs. Compared to existing system call hooking techniques, DataHook achieves hooking with extremely low overhead by modifying only a few data elements, without altering any program instructions. This unique characteristic not only avoids the multithreading conflicts associated with binary rewriting but also enables programs to adopt more efficient user-space OS subsystems. Existing system call hooking techniques struggle to meet these goals simultaneously: while techniques like syscall user dispatch (SUD) and ptrace do not require rewriting process instructions, they introduce significant hook overhead; low-overhead techniques, on the other hand, typically involve binary rewriting of multiple bytes or instructions, which introduces its own set of challenges. DataHook addresses these issues by leveraging the specific behavior of 32-bit programs during system calls. In short, unlike 64-bit programs, 32-bit programs use an indirect call instruction to jump to the function executing syscall/sysenter when making a system call. DataHook achieves system call hooking by manipulating the data dependencies involved in this indirect call. This characteristic is present in 32-bit programs on glibc-based Linux systems, whether running on x86 or x86-64 architectures, so DataHook can be deployed on these systems. Experimental results demonstrate that DataHook reduces hook overhead by 5.4 to 1,429.0 times compared to existing techniques. When DataHook was applied to a server program to make it use a user-space network stack, server performance improved by approximately 4.3 times. Additionally, when applied to Redis, DataHook resulted in only a 4.0% performance loss, compared to 8.0% to 94.7% for other techniques.
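The mechanism hinges on the fact that the syscall entry is reached through a pointer held in data rather than a hard-coded call target. As a language-neutral analogy only (DataHook itself works on glibc's 32-bit syscall path, not on Python), hooking by data modification looks like this:

```python
# Analogy of hooking via data modification: the program reaches the syscall
# routine through a mutable slot, so installing a hook swaps one pointer --
# no instruction bytes are ever rewritten.

def real_syscall(nr, *args):
    return f"syscall {nr} executed with {args}"

syscall_slot = {"target": real_syscall}        # the indirect-call data cell

def program_makes_syscall(nr, *args):
    return syscall_slot["target"](nr, *args)   # indirect call through data

def hook(nr, *args):
    print(f"[hook] intercepted syscall {nr}")  # analysis/redirection goes here
    return real_syscall(nr, *args)

syscall_slot["target"] = hook                  # the entire "hook install" step
print(program_makes_syscall(4, "buf", 42))
```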

Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs
Yifan Xia, Zichen Xie, Peiyu Liu, Kangjie Lu, Yan Liu, Wenhai Wang, and Shouling Ji
(Zhejiang University, China; University of Minnesota, USA; Ant Group, China)


MoDitector: Module-Directed Testing for Autonomous Driving Systems
Renzhi Wang, Mingfei Cheng, Xiaofei Xie, Yuan Zhou, and Lei Ma
(University of Alberta, Canada; Singapore Management University, Singapore; Zhejiang Sci-Tech University, China; University of Tokyo, Japan)


FreeWavm: Enhanced WebAssembly Runtime Fuzzing Guided by Parse Tree Mutation and Snapshot
Peng Qian, Xinlei Ying, Jiashui Wang, Long Liu, Lun Zhang, Jianhai Chen, and Qinming He
(Zhejiang University, China; Ant Group, China; GoPlus Security, China)
WebAssembly, recognized as a low-level and portable language, has been widely embraced in areas as diverse as browsers and blockchains, emerging as a revolutionary force for Internet evolution. Unfortunately, defects and flaws in WebAssembly runtimes bring about unexpected results when running WebAssembly applications. A family of solutions has been proposed to detect vulnerabilities in WebAssembly runtimes, with fuzzing surging as the most promising and persuasive approach. Despite its potential, fuzzing faces significant challenges due to the grammatical complexity of WebAssembly: existing fuzzers lack an in-depth understanding of the unique Module-based code structure and thus generate test inputs that struggle to tap into the deep logic within a WebAssembly runtime, limiting their effectiveness in unveiling vulnerabilities.
To bridge this gap, we introduce FreeWavm, a novel framework for fuzzing WebAssembly runtimes by aggressively mutating the structure of WebAssembly code. Technically, we transform the WebAssembly bytecode into a parse tree format that captures complex characteristics of code structure. To generate meaningful test inputs for WebAssembly runtime fuzzing, we design a structure-aware mutation module that engages in a customized node prioritization strategy to screen out interesting nodes in the parse tree, and then applies specific structure mutations. To ensure the validity of the mutated test inputs, FreeWavm is equipped with an automated repair mechanism to patch the mutated parse tree. Furthermore, we take advantage of parse tree snapshots to facilitate input evolution and the overall fuzzing process. Extensive experiments are conducted to evaluate FreeWavm on multiple WebAssembly runtimes. Empirical results show that FreeWavm effectively triggers structure-specific crashes in WebAssembly runtimes, outperforming other counterparts. FreeWavm has identified 69 previously unknown bugs, 24 of which are assigned CVEs thus far.

Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection
Lei Yu, Zhirong Huang, Hang Yuan, Shiqi Cheng, Li Yang, Fengjun Zhang, Chenjie Shen, Jiajia Ma, Jingyuan Zhang, Junyi Lu, and Chun Zuo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Sinosoft, China)
Smart contract vulnerability detection is a critical challenge in the rapidly evolving blockchain landscape. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensiveness and sufficient quality, with limited vulnerability type coverage and insufficient distinction between high-quality and low-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Through our empirical analysis, we found that even after continual pre-training and supervised fine-tuning, LLMs still exhibit limitations in precisely understanding the execution order of state changes in smart contracts, which can lead to incorrect vulnerability explanations despite making correct detection decisions. These limitations result in poor detection performance, leading to potentially severe financial losses. To address these challenges, we propose Smart-LLaMA-DPO, an advanced detection method based on LLaMA-3.1-8B. First, we construct a comprehensive dataset covering four vulnerability types and machine-unauditable vulnerabilities, containing labels, detailed explanations, and precise vulnerability locations for Supervised Fine-Tuning (SFT), as well as paired high-quality and low-quality outputs for Direct Preference Optimization (DPO). Second, we perform continual pre-training using large-scale smart contract code to enhance the LLM's understanding of specific security practices in smart contracts. Furthermore, we conduct supervised fine-tuning with our comprehensive dataset. Finally, we apply DPO, which leverages human feedback to improve the quality of generated explanations. Smart-LLaMA-DPO utilizes a specially designed loss function that encourages the LLM to increase the probability of preferred outputs while decreasing the probability of non-preferred outputs, thereby enhancing the LLM's ability to generate high-quality explanations. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation demonstrate the superior quality of explanations generated by Smart-LLaMA-DPO in terms of correctness, thoroughness, and clarity.
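The preference loss the abstract describes builds on Direct Preference Optimization. For reference, the standard DPO objective over preferred explanations y_w and non-preferred explanations y_l is shown below; the paper's exact variant may differ.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]
```

Maximizing the inner margin raises the likelihood of high-quality explanations relative to a frozen reference model, while beta bounds how far the tuned policy drifts from that reference.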

Are Autonomous Web Agents Good Testers?
Antoine Chevrot, Alexandre Vernotte, Jean-Rémy Falleri, Xavier Blanc, Bruno Legeard, and Aymeric Cretin
(Smartesting, France; University of Bordeaux - LaBRI - UMR 5800, France; University of Bordeaux, France)
Despite advances in automated testing, manual testing remains prevalent due to the high maintenance demands associated with test script fragility: scripts often break with minor changes in application structure. Recent developments in Large Language Models (LLMs) offer a potential alternative by powering Autonomous Web Agents (AWAs) that can autonomously interact with applications. These agents may serve as Autonomous Test Agents (ATAs), potentially reducing the need for maintenance-heavy automated scripts by utilising natural language instructions similar to those used by human testers. This paper investigates the feasibility of adapting AWAs for natural language test case execution and how to evaluate them.
We contribute (1) a benchmark of three offline web applications and a suite of 113 manual test cases, split between passing and failing cases, to evaluate and compare ATA performance, (2) SeeAct-ATA and PinATA, two open-source ATA implementations capable of executing test steps, verifying assertions, and giving verdicts, and (3) comparative experiments using our benchmark that quantify our ATAs' effectiveness. Finally, we conduct a qualitative evaluation to identify the limitations of PinATA, our best-performing implementation.
Our findings reveal that our simple implementation, SeeAct-ATA, does not perform well compared to our more advanced PinATA implementation when executing test cases (PinATA yields a 50% performance improvement). However, while PinATA obtains around 60% correct verdicts and up to a promising 94% specificity, we identify several limitations that need to be addressed to develop more resilient and reliable ATAs, paving the way for robust, low-maintenance test automation.

Preventing Disruption of System Backup against Ransomware Attacks
Yiwei Hou, Lihua Guo, Chijin Zhou, Quan Zhang, Wenhuan Liu, Chengnian Sun, and Yu Jiang
(Tsinghua University, China; Union Tech, China; University of Waterloo, Canada)
The ransomware threat to the software ecosystem has grown rapidly in recent years. Despite being well studied, new ransomware variants continually emerge, designed to evade existing encryption-based detection mechanisms. This paper introduces Remembrall, a new perspective to defend against ransomware by monitoring and preventing system backup disruptions. Focusing on deletion actions of volume shadow copies (VSCs) in Windows, Remembrall captures related malicious events and identifies all ransomware traces as a real-time defense tool. To ensure no ransomware is missed, we conduct a comprehensive investigation to classify all potential attack actions that can be used to delete VSCs throughout the application layer, OS layer, and hardware layer. Based on this analysis, Remembrall is designed to retrieve system event information and accurately identify ransomware without false negatives. We evaluate Remembrall on recent ransomware samples. Remembrall achieves a 4.31%–87.55% increase in F1-score compared to other state-of-the-art anti-ransomware tools across 60 ransomware families. Remembrall has also detected eight zero-day ransomware samples in the experiment.
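At the application layer, the best-known VSC deletion vectors are administrative commands such as vssadmin and WMI shadow-copy deletion. The sketch below shows only this simplest layer as a Python watchlist; it is illustrative, since Remembrall also covers OS- and hardware-layer deletion paths that no command-line pattern can see.

```python
# Illustrative application-layer check only; the patterns cover
# well-documented ransomware commands for deleting volume shadow copies.
import re

VSC_DELETION_PATTERNS = [
    re.compile(r"vssadmin(\.exe)?\s+delete\s+shadows", re.I),
    re.compile(r"wmic(\.exe)?\s+shadowcopy\s+delete", re.I),
    re.compile(r"win32_shadowcopy.*delete", re.I),   # PowerShell/WMI variants
]

def is_vsc_deletion(command_line: str) -> bool:
    return any(p.search(command_line) for p in VSC_DELETION_PATTERNS)

assert is_vsc_deletion("vssadmin.exe Delete Shadows /All /Quiet")
assert not is_vsc_deletion("vssadmin list shadows")
```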

Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models
Yisong Xiao, Aishan Liu, Siyuan Liang, Xianglong Liu, and Dacheng Tao
(Beihang University, China; National University of Singapore, Singapore; Nanyang Technological University, Singapore)
Large Language Models (LLMs) have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant concerns about fairness, which is a crucial issue in software engineering. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, where key-value mapping occurs in the MLP layers, we posited that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), an effective and efficient bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded within MLP layer activations by employing prompts centered around biased concepts (keys) to detect the emission probabilities for social groups (values). Subsequently, the adversarial debiasing neutralizer intervenes in MLP activations during inference to equalize the association probabilities among different social groups. Extensive experiments across nine protected attributes show that our FairMed significantly outperforms state-of-the-art methods in effectiveness, achieving average bias reductions of up to 84.42%.

KRAKEN: Program-Adaptive Parallel Fuzzing
Anshunkang Zhou, Heqing Huang, and Charles Zhang
(Hong Kong University of Science and Technology, China; City University of Hong Kong, China)
Parallel fuzzing, which utilizes multicore computers to accelerate the fuzzing process, has been widely used in industrial-scale software defect detection. However, specifying efficient parallel fuzzing strategies for programs with different characteristics is challenging due to the difficulty of reasoning about fuzzing runtime statically. Existing efforts still use pre-defined tactics for various programs, resulting in suboptimal performance.
In this paper, we propose Kraken, a new program-adaptive parallel fuzzer that improves fuzzing efficiency through dynamic strategy optimization. The key insight is that the inefficiency of parallel fuzzing can be observed at runtime through various feedbacks, such as code coverage changes, which allows us to adjust the adopted strategy to avoid inefficient path searching, thus gradually approximating the optimal policy. Based on this insight, our key idea is to view the task of finding the optimal strategy as an optimization problem and to gradually approach the best program-specific strategy on the fly by maximizing certain objective functions. We have implemented Kraken in C/C++ and evaluated it on 19 real-world programs against 6 state-of-the-art parallel fuzzers. Experimental results show that Kraken can achieve 54.7% more code coverage and find 70.2% more bugs in the given time. Moreover, Kraken has found 192 bugs in 37 popular open-source projects, 119 of which have been assigned CVE IDs.
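One simple way to read "strategy choice as an online optimization problem" is a bandit over candidate strategies rewarded by coverage gain. The sketch below is that reading, with an epsilon-greedy tuner; Kraken's real strategy space and objective functions are more elaborate, and the strategy names here are made up.

```python
# Epsilon-greedy selection among parallel fuzzing strategies, rewarded by
# observed coverage gain (an illustrative stand-in for Kraken's optimizer).
import random

class StrategyTuner:
    def __init__(self, strategies, epsilon=0.1):
        self.stats = {s: {"pulls": 0, "reward": 0.0} for s in strategies}
        self.epsilon = epsilon

    def pick(self):
        if random.random() < self.epsilon:                    # explore
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda s:                  # exploit best mean
                   self.stats[s]["reward"] / max(1, self.stats[s]["pulls"]))

    def feedback(self, strategy, coverage_gain):
        self.stats[strategy]["pulls"] += 1
        self.stats[strategy]["reward"] += coverage_gain

tuner = StrategyTuner(["focus-new-edges", "splice-heavy", "havoc-only"])
for _ in range(100):
    s = tuner.pick()
    tuner.feedback(s, random.random())   # stand-in for new-coverage feedback
```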

Model Checking Guided Incremental Testing for Distributed Systems
Yu Gao, Dong Wang, Wensheng Dou, Wenhan Feng, Yu Liang, Yuan Feng, and Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; Wuhan Dameng Database, China)


Understanding Model Weaknesses: A Path to Strengthening DNN-Based Android Malware Detection
Haodong Li, Xiao Cheng, Yanjie Zhao, Guosheng Xu, Guoai Xu, and Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Huazhong University of Science and Technology, China; Harbin Institute of Technology, China)
Android malware detection remains a critical challenge in cybersecurity research. Recent advancements leverage AI techniques, particularly deep neural networks (DNNs), to train a detection model, but their effectiveness is often compromised by the pronounced imbalance among malware families in commonly used training datasets. This imbalance leads to overfitting in dominant categories and poor performance in underrepresented ones, increasing predictive uncertainty for less common malware families. To address the suboptimal performance of many DNN models, we introduce MalTutor, a novel framework that enhances model robustness through an optimized training process. Our primary insight lies in transforming uncertainties from “liabilities” into “assets” by strategically incorporating them into DNN training methodologies. Specifically, we begin by evaluating the predictive uncertainty of DNN models throughout various training epochs, which guides our sample categorization. Incorporating Curriculum Learning strategies, we commence training with easy-to-learn samples with lower uncertainty, progressively incorporating difficult-to-learn ones with higher uncertainty. Our experimental results demonstrate that MalTutor significantly improves the performance of models trained on imbalanced datasets, increasing accuracy by 31.0%, elevating the F1 score by 138.8%, and specifically boosting the average accuracy in detecting various types of malicious apps by 133.9%. Our findings provide valuable insights into the potential benefits of incorporating uncertainty to improve the robustness of DNN models for prediction-oriented software engineering tasks.

LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models
Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, and Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
The SZZ algorithm is the dominant technique for identifying bug-inducing commits and serves as a foundation for many software engineering studies, such as bug prediction and static code analysis, thereby enhancing software quality and facilitating better maintenance practices. Researchers have proposed many variants to enhance the SZZ algorithm's performance since its introduction. The majority of them rely on static techniques or heuristic assumptions, making them easy to implement, but their performance improvements are often limited. Recently, a deep learning-based SZZ algorithm has been introduced to enhance the original SZZ algorithm. However, it requires complex preprocessing and is restricted to a single programming language. Additionally, while it enhances precision, it sacrifices recall. Furthermore, most variants overlook crucial information, such as commit messages and patch context, and are limited to bug-fixing commits involving deleted lines.
The emergence of large language models (LLMs) offers an opportunity to address these drawbacks. In this study, we investigate the strengths and limitations of LLMs and propose LLM4SZZ, which employs two approaches (i.e., rank-based identification and context-enhanced identification) to handle different types of bug-fixing commits. We determine which approach to adopt based on the LLM’s ability to comprehend the bug and identify whether the bug is present in a commit. The context-enhanced identification provides the LLM with more context and requires it to find the bug-inducing commit among a set of candidate commits. In rank-based identification, we ask the LLM to select buggy statements from the bug-fixing commit and rank them based on their relevance to the root cause. Experimental results show that LLM4SZZ outperforms all baselines across three datasets, improving F1-score by 6.9% to 16.0% without significantly sacrificing recall. Additionally, LLM4SZZ can identify many bug-inducing commits that the baselines fail to detect, accounting for 7.8%, 7.4% and 2.5% of the total bug-inducing commits across three datasets, respectively.

ALMOND: Learning an Assembly Language Model for 0-Shot Code Obfuscation Detection
Xuezixiang Li, Sheng Yu, and Heng Yin
(University of California at Riverside, USA; Deepbits Technology, USA)


Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Niklas Risse, Jing Liu, and Marcel Böhme
(MPI-SP, Germany)


Tracezip: Efficient Distributed Tracing via Trace Compression
Zhuangbin Chen, Junsong Pu, and Zibin Zheng
(Sun Yat-sen University, China; Beijing University of Posts and Telecommunications, China)


Walls Have Ears: Demystifying Notification Listener Usage in Android Apps
Jiapeng Deng, Tianming Liu, Yanjie Zhao, Chao Wang, Lin Zhang, and Haoyu Wang
(Huazhong University of Science and Technology, China; Monash University, Australia; CNCERT-CC, China)


Safe4U: Identifying Unsound Safe Encapsulations of Unsafe Calls in Rust using LLMs
Huan Li, Bei Wang, Xing Hu, and Xin Xia
(Zhejiang University, China)
Rust is an emerging programming language that ensures safety through strict compile-time checks. A Rust function marked as unsafe indicates it has additional safety requirements (e.g., initialized, not null), known as contracts in the community. These unsafe functions can only be called within explicit unsafe blocks and the contracts must be guaranteed by the caller. To reuse and reduce unsafe code, the community recommends using safe encapsulation of unsafe calls (EUC) in practice. However, an EUC is unsound if any contract is not guaranteed and could lead to undefined behaviors in safe Rust, thus breaking Rust's safety promise. It is challenging to identify unsound EUCs with conventional techniques due to the limitation in cross-lingual comprehension of code and natural language. Large language models (LLMs) have demonstrated impressive capabilities, but their performance is unsatisfactory owing to the complexity of contracts and the lack of domain knowledge. To this end, we propose a novel framework, Safe4U, which incorporates LLMs, static analysis tools, and domain knowledge to identify unsound EUCs. Safe4U first utilizes static analysis tools to retrieve relevant context. Then, it decomposes the primitive contract description into several fine-grained classified contracts. Ultimately, Safe4U introduces domain knowledge and invokes the reasoning capability of LLMs to verify every fine-grained contract. The evaluation results show that Safe4U brings a general performance improvement and the fine-grained results are constructive for locating specific unsound sources. In real-world scenarios, Safe4U can identify 9 out of 11 unsound EUCs from CVE. Furthermore, Safe4U detected 22 new unsound EUCs in the most downloaded crates, 16 of which have been confirmed.

LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng
(Sun Yat-sen University, China; Nanyang Technological University, Singapore; Xi'an Jiaotong University, China; Huawei Cloud Computing Technologies, China)


Bridge the Islands: Pointer Analysis for Microservice Systems
Teng Zhang, Yufei Liang, Ganlin Li, Tian Tan, Chang Xu, and Yue Li
(Nanjing University, China)


Program Feature-Based Benchmarking for Fuzz Testing
Miao Miao, Sriteja Kummita, Eric Bodden, and Shiyi Wei
(University of Texas at Dallas, USA; Fraunhofer IEM, Germany; Heinz Nixdorf Institute at Paderborn University, Germany)


SoK: A Taxonomic Analysis of DeFi Rug Pulls: Types, Dataset, and Tool Assessment
Dianxiang Sun, Wei Ma, Liming Nie, and Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; Shenzhen Technology University, China)


Recurring Vulnerability Detection: How Far Are We?
Yiheng Cao, Susheng Wu, Ruisi Wang, Bihuan Chen, Yiheng Huang, Chenhao Lu, Zhuotong Zhou, and Xin Peng
(Fudan University, China)
With the rapid development of open-source software, code reuse has become a common practice to accelerate development. However, it also causes vulnerabilities to be inherited from the original projects and to recur in the reusing projects; these are known as recurring vulnerabilities (RVs). Traditional general-purpose vulnerability detection approaches struggle with scalability and adaptability, while learning-based approaches are often constrained by limited training datasets and are less effective against unseen vulnerabilities. Though specific recurring vulnerability detection (RVD) approaches have been proposed, their effectiveness across various RV characteristics remains unclear.
In this paper, we conduct a large-scale empirical study using a newly constructed RV dataset containing 4,569 RVs, a 953% expansion over prior RV datasets. Our study analyzes the characteristics of RVs, evaluates the effectiveness of state-of-the-art RVD approaches, and investigates the root causes of false positives and false negatives, yielding key insights. Inspired by these insights, we design a novel RVD approach that identifies both explicit and implicit call relations with modified functions, then employs inter-procedural taint analysis and intra-procedural dependency slicing within those functions to generate comprehensive signatures, and finally incorporates flexible matching to detect RVs. Our evaluation shows its effectiveness, generality, and practical usefulness in RVD. The approach has detected 4,593 RVs, with 307 confirmed by developers, and has identified 73 new 0-day vulnerabilities across 15 projects, receiving 5 CVE identifiers.

ConTested: Consistency-Aided Tested Code Generation with LLM
Jinhao Dong, Jun Sun, Wenjie Zhang, Jin Song Dong, and Dan Hao
(Peking University, China; Singapore Management University, Singapore; National University of Singapore, Singapore)


Robust Vulnerability Detection across Compilations: LLVM-IR vs. Assembly with Transformer Model
Rony Shir, Priyanka Surve, Yuval Elovici, and Asaf Shabtai
(Ben-Gurion University of the Negev, Israel)


RouthSearch: Inferring PID Parameter Specification for Flight Control Program by Coordinate Search
Siao Wang, Zhen Dong, Hui Li, Liwei Shen, Xin Peng, and Dongdong She
(Fudan University, China; Hong Kong University of Science and Technology, China)


Program Analysis Combining Generalized Bit-Level and Word-Level Abstractions
Guangsheng Fan, Liqian Chen, Banghu Yin, Wenyu Zhang, Peisen Yao, and Ji Wang
(National University of Defense Technology, China; Zhejiang University, China)


Bridging the Gaps between Graph Neural Networks and Data-Flow Analysis: The Closer, the Better
Qingchen Yu, Xin Liu, Qingguo Zhou, and Chunming Wu
(Zhejiang University, China; Lanzhou University, China)


AudioTest: Prioritizing Audio Test Cases
Yinghua Li, Xueqi Dang, Wendkûuni C. Ouédraogo, Jacques Klein, and Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg)
Audio classification systems, powered by deep neural networks (DNNs), are integral to various applications that impact daily lives, like voice-activated assistants. Ensuring the accuracy of these systems is crucial since inaccuracies can lead to significant security issues and user mistrust. However, testing audio classifiers presents a significant challenge: the high manual labeling cost for annotating audio test inputs. Test input prioritization has emerged as a promising approach to mitigate this labeling cost issue. It prioritizes potentially misclassified tests, allowing for the early labeling of such critical inputs and making debugging more efficient. However, when applying existing test prioritization methods to audio-type test inputs, there are some limitations: 1) Coverage-based methods are less effective and efficient than confidence-based methods. 2) Confidence-based methods rely only on prediction probability vectors, ignoring the unique characteristics of audio-type data. 3) Mutation-based methods lack designed mutation operations for audio data, making them unsuitable for audio-type test inputs. To overcome these challenges, we propose AudioTest, a novel test prioritization approach specifically designed for audio-type test inputs. The core premise is that tests closer to misclassified samples are more likely to be misclassified. Based on the special characteristics of audio-type data, AudioTest generates four types of features: time-domain features, frequency-domain features, perceptual features, and output features. For each test, AudioTest concatenates its four types of features into a feature vector and applies a carefully designed feature transformation strategy to bring misclassified tests closer in space. AudioTest leverages a trained model to predict the probability of misclassification of each test based on its transformed vectors and ranks all the tests accordingly. We evaluate the performance of AudioTest utilizing 96 subjects, encompassing natural and noisy datasets. We employed two classical metrics, Percentage of Fault Detection (PFD) and Average Percentage of Fault Detected (APFD), for our evaluation. The results demonstrate that AudioTest outperforms all the compared test prioritization approaches in terms of both PFD and APFD. The average improvement of AudioTest compared to the baseline test prioritization methods ranges from 12.63% to 54.58% on natural datasets and from 12.71% to 40.48% on noisy datasets.
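The ranking step reduces to concatenating the four feature groups and ordering tests by a model's estimated misclassification probability. A minimal sketch of that step, assuming the features are already extracted and using a dummy stand-in for the trained model:

```python
import numpy as np

def rank_tests(time_f, freq_f, perc_f, out_f, misclf_model):
    """Each *_f is an (n_tests, d_i) array; the model exposes predict_proba."""
    features = np.concatenate([time_f, freq_f, perc_f, out_f], axis=1)
    p_misclassified = misclf_model.predict_proba(features)[:, 1]
    return np.argsort(-p_misclassified)        # most suspicious tests first

class DummyModel:                              # stand-in for the trained model
    def predict_proba(self, X):
        p = 1 / (1 + np.exp(-X.sum(axis=1)))   # toy score in (0, 1)
        return np.column_stack([1 - p, p])

order = rank_tests(*(np.random.rand(5, 3) for _ in range(4)), DummyModel())
print(order)   # label/debug tests in this order
```

The feature-transformation stage that pulls misclassified tests closer together in space is omitted here for brevity.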

QTRAN: Extending Metamorphic-Oracle Based Logical Bug Detection Techniques for Multiple-DBMS Dialect Support
Li Lin, Qinglin Zhu, Hongqiao Chen, Zhuangda Wang, Rongxin Wu, and Xiaoheng Xie
(Xiamen University, China; Ant Group, China)
Metamorphic testing is a widely used method to detect logical bugs in Database Management Systems (DBMSs), referred to herein as MOLT (Metamorphic-Oracle based Logical Bug Detection Technique). This technique involves constructing SQL statement pairs, including original and mutated queries, and assessing whether the execution results conform to predefined metamorphic relations to detect logical bugs. However, current MOLTs rely heavily on specific DBMS grammar to generate valid SQL statement pairs, which makes it challenging to adapt these techniques to various DBMSs with different grammatical structures. As a result, only a few popular DBMSs, such as PostgreSQL, MySQL, and MariaDB, are supported by existing MOLTs, with extensive manual effort required to expand to other DBMSs. Given that many DBMSs remain inadequately tested, there is a pressing need for a method that enables effortless extension of MOLTs across diverse DBMSs.
In this paper, we propose QTRAN, a novel LLM-powered approach that automatically extends existing MOLTs to various DBMSs. Our key insight is to translate SQL statement pairs to target DBMSs for metamorphic testing from existing MOLTs using LLMs. To address the challenges of LLMs’ limited understanding of dialect differences and metamorphic mechanisms, we propose a two-phase approach comprising the transfer and mutation phases. QTRAN tackles these challenges by drawing inspiration from the developer’s process of creating a MOLT, which includes understanding the grammar of the target DBMS to generate original queries and employing a mutator for customized mutations. The transfer phase is designed to identify potential dialects and leverage information from SQL documents to enhance query retrieval, enabling LLMs to translate original queries across different DBMSs accurately. During the mutation phase, we gather SQL statement pairs from existing MOLTs to fine-tune the pretrained model, tailoring it specifically for mutation tasks. Then we employ the customized LLM to mutate the translated original queries, preserving the defined relationships necessary for metamorphic testing.
We implement our approach as a tool and apply it to extend four state-of-the-art MOLTs for eight DBMSs: MySQL, MariaDB, TiDB, PostgreSQL, SQLite, MonetDB, DuckDB, and ClickHouse. The evaluation results show that over 99% of the SQL statement pairs transferred by QTRAN satisfy the metamorphic relations required for testing. Furthermore, we have detected 24 logical bugs among these DBMSs, with 16 confirmed as unique and previously unknown bugs. We believe that the generality of QTRAN can significantly enhance the reliability of DBMSs.
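To make the metamorphic-relation idea concrete, here is one classic relation of the kind MOLTs build on, ternary logic partitioning: a query's result must equal the union of the same query restricted to a predicate p, NOT p, and p IS NULL. The sketch checks it against SQLite; QTRAN's contribution is translating such statement pairs into each target DBMS's dialect automatically.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

original = "SELECT a FROM t"
partitions = [
    "SELECT a FROM t WHERE a > 1",
    "SELECT a FROM t WHERE NOT (a > 1)",
    "SELECT a FROM t WHERE (a > 1) IS NULL",
]
lhs = sorted(con.execute(original).fetchall(), key=repr)
rhs = sorted(sum((con.execute(q).fetchall() for q in partitions), []), key=repr)
assert lhs == rhs   # a logical bug in the DBMS would break this relation
```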

GUIPilot: A Consistency-Based Mobile GUI Testing Approach for Detecting Application-Specific Bugs
Ruofan Liu, Xiwen Teoh, Yun Lin, Guanjie Chen, Ruofei Ren, Denys Poshyvanyk, and Jin Song Dong
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; College of William and Mary, USA)
GUI testing is crucial for ensuring the reliability of mobile applications. State-of-the-art GUI testing approaches are successful in exploring more application scenarios and discovering general bugs such as application crashes. However, industrial GUI testing also needs to investigate application-specific bugs such as deviations in screen layout, widget position, or GUI transition from the GUI design mock-ups created by the application designers. These mock-ups specify the expected screens, widgets, and their respective behaviors. Validating the consistency between the GUI design and the implementation is labor-intensive and time-consuming, yet, this validation step plays an important role in industrial GUI testing.
In this work, we propose GUIPilot, an approach for detecting inconsistencies between mobile designs and their implementations. The mobile design usually consists of design mock-ups that specify (1) the expected screen appearances (e.g., widget layouts, colors, and shapes) and (2) the expected screen behaviors, regarding how one screen can transition into another (e.g., labeled widgets with textual descriptions). Given a design mock-up and the implementation of its application, GUIPilot reports both their screen inconsistencies and their process inconsistencies. On the one hand, GUIPilot detects screen inconsistencies by abstracting every screen into a widget container where each widget is represented by its position, width, height, and type. By defining the partial order of widgets and the costs of replacing, inserting, and deleting widgets in a screen, we convert the screen-matching problem into an optimizable widget alignment problem. On the other hand, we translate the specified GUI transition into stepwise actions on the mobile screen (e.g., click, long-press, input text on some widgets). To this end, we propose a visual prompt for the vision-language model to infer widget-specific actions on the screen. By this means, we can validate the presence or absence of expected transitions in the implementation. Our extensive experiments on 80 mobile applications and 160 design mock-ups show that (1) GUIPilot achieves 99.8% precision and 98.6% recall in detecting screen inconsistencies, outperforming the state-of-the-art approach GVT by 66.2% and 56.6% respectively, and (2) GUIPilot reports zero errors in detecting process inconsistencies. Furthermore, our industrial case study of applying GUIPilot to a trading mobile application shows that GUIPilot has detected nine application bugs, all of which were confirmed by the original application experts. Our code is available at https://github.com/code-philia/GUIPilot.
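Casting screen matching as widget alignment mirrors edit distance over widget sequences. The sketch below assumes a simplification in which the partial order lets widgets be compared as ordered lists, with replace cost driven by type and geometry; the costs and widget encoding are illustrative, not GUIPilot's.

```python
from functools import lru_cache

def widget_cost(w1, w2):
    # w = (type, x, y, width, height); a type mismatch dominates geometry drift.
    type_cost = 0.0 if w1[0] == w2[0] else 1.0
    geom_cost = sum(abs(a - b) for a, b in zip(w1[1:], w2[1:])) / 1000.0
    return type_cost + geom_cost

def align(design, impl, ins_del_cost=1.0):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0: return j * ins_del_cost
        if j == 0: return i * ins_del_cost
        return min(d(i-1, j-1) + widget_cost(design[i-1], impl[j-1]),
                   d(i-1, j) + ins_del_cost,    # widget missing in impl
                   d(i, j-1) + ins_del_cost)    # spurious widget in impl
    return d(len(design), len(impl))

design = [("button", 10, 10, 100, 40), ("label", 10, 60, 200, 20)]
impl   = [("button", 12, 10, 100, 40), ("image", 10, 60, 200, 20)]
print(align(design, impl))   # low total cost: screens largely consistent
```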

Testing the Fault-Tolerance of Multi-sensor Fusion Perception in Autonomous Driving Systems
Haoxiang Tian, Wenqiang Ding, Xingshuo Han, Guoquan Wu, An Guo, Junqi Zhang, Wei Chen, Jun Wei, and Tianwei Zhang
(Institute of Software at Chinese Academy of Sciences, China; Nanjing Institute of Software, China; Nanyang Technological University, Singapore; University of Science and Technology of China, China)
High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR, and it directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, cameras and LiDAR are subject to various faults, which can significantly impact the decision-making and behaviors of ADSs. Existing MSF testing approaches have only discovered corner cases that MSF-based perception cannot accurately detect, while lacking research on how sensor faults affect the system-level behaviors of ADSs.
To address this gap, we conduct the first exploration of the fault tolerance of MSF perception-based ADSs under sensor faults. In this paper, we systematically and comprehensively build fault models for cameras and LiDAR in AVs and inject them into the MSF perception-based ADS to test its behaviors in test scenarios. To effectively and efficiently explore the parameter spaces of the sensor fault models, we design FADE, a feedback-guided differential fuzzer that discovers the safety violations of MSF perception-based ADSs caused by the injected sensor faults. We evaluate FADE on a representative and practical industrial ADS, Baidu Apollo. Our evaluation results demonstrate the effectiveness and efficiency of FADE, and we draw some useful findings from the experimental results. To validate these findings in the physical world, we use a real Baidu Apollo 6.0 EDU autonomous vehicle to conduct physical experiments, and the results show the practical significance of our findings.
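For intuition, sensor fault injection can be as simple as perturbing camera pixels or thinning a LiDAR point cloud before the frames reach the perception module. The fault models below are illustrative stand-ins; the paper builds a systematic taxonomy across the application, OS, and hardware layers, with fault parameters explored by the fuzzer.

```python
# Illustrative sensor fault models (parameters here are made up).
import random

def camera_gaussian_noise(frame, sigma=8.0):
    """Additive Gaussian noise on 8-bit pixel values."""
    return [[min(255, max(0, int(px + random.gauss(0, sigma)))) for px in row]
            for row in frame]

def lidar_dropout(points, drop_rate=0.3):
    """Randomly drop a fraction of the LiDAR point cloud."""
    return [p for p in points if random.random() > drop_rate]

frame = [[120, 130], [140, 150]]
cloud = [(1.0, 2.0, 0.5), (3.0, 1.0, 0.2), (0.4, 0.9, 1.1)]
print(camera_gaussian_noise(frame), lidar_dropout(cloud))
```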

KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis
Zhijie Liu, Qiyi Tang, Sen Nie, Shi Wu, Liang Feng Zhang, and Yutian Tang
(ShanghaiTech University, China; Tencent Security Keen Lab, China; University of Glasgow, UK)


Enhancing Vulnerability Detection via Inter-procedural Semantic Completion
Bozhi Wu, Chengjie Liu, Zhiming Li, Yushi Cao, Jun Sun, and Shang-Wei Lin
(Singapore Management University, Singapore; Peking University, China; Nanyang Technological University, Singapore)


Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG
Zhiyu Zhang, Longxing Li, Ruigang Liang, and Kai Chen
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)


Structure-Aware, Diagnosis-Guided ECU Firmware Fuzzing
Qicai Chen, Kun Hu, Sichen Gong, Bihuan Chen, Zikui Kong, Haowen Jiang, Bingkun Sun, You Lu, and Xin Peng
(Fudan University, China; ANHUI GuarDrive Safety Technology, China)


FANDANGO: Evolving Language-Based Testing
José Antonio Zamudio Amaya, Marius Smytzek, and Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
Language-based fuzzers leverage formal input specifications (languages) to generate arbitrarily large and diverse sets of valid inputs for a program under test. Modern language-based test generators combine grammars and constraints to satisfy syntactic and semantic input constraints. ISLA, the leading input generator in that space, uses symbolic constraint solving to solve input constraints. Using solvers places ISLA among the most precise fuzzers but also makes it slow.
In this paper, we explore search-based testing as an alternative to symbolic constraint solving. We employ a genetic algorithm that iteratively generates candidate inputs from an input specification, evaluates them against defined constraints, evolving a population of inputs through syntactically valid mutations and retaining those with superior fitness until the semantic input constraints are met. This evolutionary procedure, analogous to natural genetic evolution, leads to progressively improved inputs that cover both semantics and syntax. This change boosts the efficiency of language-based testing: In our experiments, compared to ISLA, our search-based FANDANGO prototype is faster by one to three orders of magnitude without sacrificing precision.
The search-based approach no longer restricts constraints to constraint solvers' (miniature) languages. In FANDANGO, constraints can use the whole Python language and library. This expressiveness gives testers unprecedented flexibility in shaping test inputs. It allows them to state arbitrary goals for test generation: ''Please produce 1,000 valid test inputs where the voltage field follows a Gaussian distribution but never exceeds 20 mV.''
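Since FANDANGO constraints are ordinary Python, the abstract's closing example can be written directly as a predicate over generated inputs. The snippet below mirrors that idea in plain Python; the real FANDANGO spec syntax differs, the field name and the 20 mV cap come from the abstract's informal example, and the Gaussian check (a mean-of-10 target) is a crude proxy of our own.

```python
import random, statistics

def voltage_constraint(batch):
    """A constraint written in plain Python, per the abstract's example."""
    voltages = [b["voltage"] for b in batch]
    return (all(v <= 20.0 for v in voltages)                   # hard 20 mV cap
            and abs(statistics.mean(voltages) - 10.0) < 1.0)   # Gaussian proxy

# Toy stand-in for one evolutionary round: generate candidates, then check.
batch = [{"voltage": min(abs(random.gauss(10.0, 3.0)), 20.0)}
         for _ in range(1000)]
print(voltage_constraint(batch))
```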

ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference
Yuchong Xie, Wenhui Zhang, and Dongdong She
(Hong Kong University of Science and Technology, China; Hunan University, China)


A Low-Cost Feature Interaction Fault Localization Approach for Software Product Lines
Haining Wang, Yi Xiang, Han Huang, Jie Cao, Kaichen Chen, and Xiaowei Yang
(South China University of Technology, China; Puhua Basic Software, China)
In Software Product Lines (SPLs), localizing buggy feature interactions helps developers identify the root cause of test failures, thereby reducing their workload. This task is challenging because the number of potential interactions grows exponentially with the number of features, resulting in a vast search space, especially for large SPLs. Previous approaches have partially addressed this issue by constructing and examining potential feature interactions based on suspicious feature selections (e.g., those present in failed configurations but not in passed ones). However, these approaches often overlook the causal relationship between buggy feature interaction and test failures, resulting in an excessive search space and high-cost fault localization. To address this, we propose a low-cost Counterfactual Reasoning-Based Fault Localization (CRFL) approach for SPLs, which enhances fault localization efficiency by reducing both the search space and redundant computations. Specifically, CRFL employs counterfactual reasoning to infer suspicious feature selections and utilizes symmetric uncertainty to filter out irrelevant feature interactions. Additionally, CRFL incorporates two findings to prevent the repeated generation and examination of the same feature interactions. We evaluate the performance of our approach using eight publicly available SPL systems. To enable comparisons on larger real-world SPLs, we generate multiple buggy mutants for both BerkeleyDB and TankWar. Experimental results show that our approach reduces the search space by 51%∼73% for small SPLs (with 6∼9 features) and by 71%∼88% for larger SPLs (with 13∼99 features). The average runtime of our approach is approximately 15.6 times faster than that of a state-of-the-art method. Furthermore, when combined with statement-level localization techniques, CRFL can efficiently localize buggy statements, demonstrating its ability to accurately identify buggy feature interactions.
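Symmetric uncertainty, the filter CRFL uses to discard irrelevant feature interactions, has a standard information-theoretic definition: SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). A small self-contained computation follows; how CRFL thresholds this score is an assumption left out here.

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    h_x, h_y = entropy(x), entropy(y)
    mutual_info = h_x + h_y - entropy(list(zip(x, y)))  # I = H(X)+H(Y)-H(X,Y)
    return 2 * mutual_info / (h_x + h_y) if h_x + h_y else 0.0

feature_sel = [1, 1, 0, 0, 1, 0]   # feature selected in each configuration
test_fails  = [1, 1, 0, 0, 1, 0]   # test outcome for each configuration
print(symmetric_uncertainty(feature_sel, test_fails))   # 1.0: fully relevant
```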

WildSync: Automated Fuzzing Harness Synthesis via Wild API Usage Recovery
Wei-Cheng Wu, Stefan Nagy, and Christophe Hauser
(Dartmouth College, USA; University of Utah, USA)
Fuzzing stands as one of the most practical techniques for testing software efficiently. When applying fuzzing to software library APIs, high-quality fuzzing harnesses are essential, enabling fuzzers to execute the APIs with precise sequences and function parameters. Although software developers commonly rely on manual efforts to create fuzzing harnesses, there has been a growing interest in automating this process. Existing works are often constrained in scalability and effectiveness due to their reliance on compiler-based analysis or runtime execution traces, which require manual setup and configuration. Our investigation of multiple actively fuzzed libraries reveals that a large number of exported API functions externally used by various open-source projects remain untested by existing harnesses or unit-test files. The lack of testing for these API functions increases the risk of vulnerabilities going undetected, potentially leading to security issues. In order to address the lack of coverage affecting existing fuzzing methods, we propose a novel approach to automatically generate fuzzing harnesses by extracting usage patterns of untested functions from real-world scenarios, using techniques based on lightweight Abstract Syntax Tree parsing to extract API usage from external source code. Then, we integrate the usage patterns into existing harnesses to construct new ones covering these untested functions. We have implemented a prototype of this concept named WildSync, enabling the automatic synthesis of fuzzing harnesses for C/C++ libraries on OSS-Fuzz. In our experiments, WildSync successfully produced 469 new harnesses for 24 actively fuzzed libraries on OSS-Fuzz, as well as 3 widely used libraries that can later be integrated into OSS-Fuzz. This results in a significant increase in test coverage spanning over 1.3k functions and 16k lines of code, while also identifying 7 previously undetected bugs.
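The usage-recovery step can be pictured with any lightweight AST walk that collects call sequences against a library's exported names. The sketch below uses Python's ast module on Python source purely for illustration; WildSync parses C/C++ consumer code, and the lib_* names are hypothetical.

```python
import ast

consumer_code = """
img = lib_open("a.png")
lib_decode(img, 0)
lib_close(img)
"""

def extract_usage(source, api_prefix="lib_"):
    """Return API calls in source order as (name, number of arguments)."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id.startswith(api_prefix)):
            calls.append((node.func.id, len(node.args)))
    return calls

print(extract_usage(consumer_code))
# [('lib_open', 1), ('lib_decode', 2), ('lib_close', 1)] -- a call-sequence
# pattern that can seed a harness for otherwise untested functions.
```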

Automated Scene Generation for Testing COLREGS-Compliance of Autonomous Surface Vehicles
Dominik Frey, Ulf Kargén, and Dániel Varró
(Linköping University, Sweden; McGill University, Canada)


Clause2Inv: A Generate-Combine-Check Framework for Loop Invariant Inference
Weining Cao, Guangyuan Wu, Tangzhi Xu, Yuan Yao, Hengfeng Wei, Taolue Chen, and Xiaoxing Ma
(Nanjing University, China; Birkbeck University of London, UK)
Loop invariant inference is a fundamental, yet challenging, problem in program verification. Recent work adopts the guess-and-check framework, where candidate loop invariants are iteratively generated in the guess step and verified in the check step. A major challenge of this general framework is to produce high-quality candidate invariants in each iteration so that the inference procedure can converge quickly. Empirically, we observe that existing approaches may struggle with guessing the complete invariant due to the complexity of logical connectives, but usually, all the clauses of the correct loop invariant have already appeared in the previous guesses. This motivates us to refine the guess-and-check framework into a generate-combine-check framework, where the loop invariant inference task is divided into clause generation and clause combination. Specifically, we propose Clause2Inv, a novel loop invariant inference approach under the new framework, which consists of an LLM-based clause generator and a counterexample-driven clause combinator. As the clause generator, Clause2Inv leverages LLMs to generate a multitude of clauses; as the clause combinator, it leverages counterexamples from the previous rounds to convert generated clauses into invariants. Our experiments show that Clause2Inv significantly outperforms existing loop invariant inference approaches. For example, Clause2Inv solved 312 (out of 316) linear invariant inference tasks and 44 (out of 50) nonlinear invariant inference tasks, at least 93 and 16 more than the existing baselines, respectively. By design, the generate-combine-check framework is flexible enough to accommodate various existing approaches currently under the guess-and-check framework by splitting their guessed candidate invariants into clauses. The evaluation shows that our approach can, with minor adaptation, improve existing loop invariant inference approaches in both effectiveness and efficiency. For example, Code2Inv, which solved 210 linear problems with an average solving time of 137.6 seconds, can be improved to solve 252 problems with an average solving time of 17.8 seconds.
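The combine step can be read as a small search problem: find a conjunction of generated clauses that accepts every observed reachable state and rejects every counterexample. A minimal sketch of that reading follows; Clause2Inv's actual combinator also handles disjunctions and runs inside a CEGIS-style loop, and the clauses and states below are invented.

```python
from itertools import combinations

clauses = {                       # clauses an LLM generator might propose
    "x >= 0":      lambda s: s["x"] >= 0,
    "y >= 0":      lambda s: s["y"] >= 0,
    "x + y == 10": lambda s: s["x"] + s["y"] == 10,
}
positive = [{"x": 0, "y": 10}, {"x": 3, "y": 7}]    # states the loop reaches
negative = [{"x": -1, "y": 11}, {"x": 5, "y": 4}]   # checker counterexamples

def find_invariant():
    for r in range(1, len(clauses) + 1):
        for names in combinations(clauses, r):
            preds = [clauses[n] for n in names]
            if (all(all(p(s) for p in preds) for s in positive)
                    and all(not all(p(s) for p in preds) for s in negative)):
                return " and ".join(names)

print(find_invariant())   # "x >= 0 and x + y == 10"
```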

Copy-and-Paste? Identifying EVM-Inequivalent Code Smells in Multi-chain Reuse Contracts
Zexu Wang, Jiachi Chen, Tao Zhang, Yu Zhang, Weizhe Zhang, Yuming Feng, and Zibin Zheng
(Sun Yat-sen University, China; Peng Cheng Laboratory, China; Guangdong Engineering Technology Research Center of Blockchain, China; Macau University of Science and Technology, China; Harbin Institute of Technology, China)
With the development of Solidity contracts on Ethereum, more and more developers are reusing them on other compatible blockchains. However, developers may overlook differences in the design of the blockchain systems, such as the Gas Mechanism and Consensus Protocol, so that the same contracts on different blockchains cannot achieve consistent execution as on Ethereum. This inconsistency reveals design flaws in reused contracts, exposing code smells that hinder code reusability, and we define this inconsistency as EVM-Inequivalent Code Smells.
In this paper, we conducted the first empirical study to reveal the causes and characteristics of EVM-Inequivalent Code Smells. To ensure the identified smells reflect real developer concerns, we collected and analyzed 1,379 security audit reports and 326 Stack Overflow posts related to reused contracts on EVM-compatible blockchains, such as Binance Smart Chain (BSC) and Polygon. Using the open card sorting method, we defined six types of EVM-Inequivalent Code Smells. For automated detection, we developed a tool named EquivGuard. It employs static taint analysis to identify key paths from different patterns and uses symbolic execution to verify path reachability. Our analysis of 905,948 contracts across six major blockchains shows that EVM-Inequivalent Code Smells are widespread, with an average prevalence of 17.70%. While contracts with code smells do not necessarily lead to financial loss and attacks, their high prevalence and significant asset management underscore the potential threats of reusing these smelly Ethereum contracts. Thus, developers are advised to abandon Copy-and-Paste programming practices and detect EVM-Inequivalent Code Smells before reusing Ethereum contracts.

You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects
Islem Bouzenia and Michael Pradel
(University of Stuttgart, Germany)


Finding 709 Defects in 258 Projects: An Experience Report on Applying CodeQL to Open-Source Embedded Software (Experience Paper)
Mingjie Shen, Akul Abhilash Pillai, Brian A. Yuan, James C. Davis, and Aravind Machiry
(Purdue University, USA)


Enhancing Smart Contract Security Analysis with Execution Property Graphs
Kaihua Qin, Zhe Ye, Zhun Wang, Weilin Li, Liyi Zhou, Chao Zhang, Dawn Song, and Arthur Gervais
(Yale University, USA; University of California at Berkeley, USA; University College London, UK; University of Sydney, Australia; Tsinghua University, China)
Smart contract vulnerabilities have led to significant financial losses, with their increasing complexity rendering outright prevention of hacks increasingly challenging. This trend highlights the crucial need for advanced forensic analysis and real-time intrusion detection, where dynamic analysis plays a key role in dissecting smart contract executions. Therefore, there is a pressing need for a unified and generic representation of smart contract executions, complemented by an efficient methodology that enables the modeling and identification of a broad spectrum of emerging attacks.
We introduce Clue, a dynamic analysis framework specifically designed for the Ethereum virtual machine. Central to Clue is its ability to capture critical runtime information during contract executions, employing a novel graph-based representation, the Execution Property Graph. A key feature of Clue is its innovative graph traversal technique, which is adept at detecting complex attacks, including (read-only) reentrancy and price manipulation. Evaluation results reveal Clue's superior performance with high true positive rates and low false positive rates, outperforming state-of-the-art tools. Furthermore, Clue's efficiency positions it as a valuable tool for both forensic analysis and real-time intrusion detection.
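On such a graph, reentrancy shows up as a call path that re-enters a contract whose earlier frame has not yet finished. A toy traversal over an execution trace gives the flavor; Clue's execution property graph carries far richer runtime facts (state reads and writes, value flows) than the bare call edges used here.

```python
def find_reentrancy(call_edges):
    """call_edges: (caller, callee) pairs in execution order."""
    stack, findings = [], []
    for caller, callee in call_edges:
        while stack and stack[-1] != caller:
            stack.pop()                          # unwind returned frames
        if callee in stack:                      # callee has a pending frame
            findings.append(f"reentrant call into {callee}")
        stack.append(callee)
    return findings

trace = [("EOA", "Vault"), ("Vault", "Attacker"), ("Attacker", "Vault")]
print(find_reentrancy(trace))   # ['reentrant call into Vault']
```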

MLLM-Based UI2Code Automation Guided by UI Layout Information
Fan Wu, Cuiyun Gao, Shuqing Li, Xin-Cheng Wen, and Qing Liao
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China)


Quantum Concolic Testing
Shangzhou Xia, Jianjun Zhao, Fuyuan Zhang, and Xiaoyu Guo
(Kyushu University, Japan; University of Tokyo, Japan)


BinQuery: A Novel Framework for Natural Language-Based Binary Code Retrieval
Bolun Zhang, Zeyu Gao, Hao Wang, Yuxin Cui, Siliang Qin, Chao Zhang, Kai Chen, and Beibei Zhao
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China)


BinDSA: Efficient, Precise Binary-Level Pointer Analysis with Context-Sensitive Heap Reconstruction
Lian Gao and Heng Yin
(University of California at Riverside, USA)


Adding Spatial Memory Safety to EDK II through Checked C (Experience Paper)
Sourag Cherupattamoolayil, Arunkumar Bhattar, Connor Everett Glosner, and Aravind Machiry
(Purdue University, USA)


REACCEPT: Automated Co-evolution of Production and Test Code Based on Dynamic Validation and Large Language Models
Jianlei Chi, Xiaotian Wang, Yuhan Huang, Lechen Yu, Di Cui, Jianguo Sun, and Jun Sun
(Xidian University, China; Harbin Engineering University, China; Microsoft, USA; Singapore Management University, Singapore)


Article Search
Understanding Practitioners’ Expectations on Clear Code Review Comments
Junkai Chen, Zhenhao Li, Qiheng Mao, Xing Hu, Kui Liu, and Xin Xia
(Zhejiang University, China; York University, Canada)


Article Search
Pepper: Preference-Aware Active Trapping for Ransomware
Huan Zhang, Zhengkai Qin, Lixin Zhao, Aimin Yu, Lijun Cai, and Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China)


Article Search
Extended Reality Cybersickness Assessment via User Review Analysis
Shuqing Li, Qisheng Zheng, Cuiyun Gao, Jia Feng, and Michael R. Lyu
(Chinese University of Hong Kong, China; Harbin Institute of Technology, China)


Article Search
Wemby’s Web: Hunting for Memory Corruption in WebAssembly
Oussama Draissi, Tobias Cloosters, David Klein, Michael Rodler, Marius Musch, Martin Johns, and Lucas Davi
(University of Duisburg-Essen, Germany; TU Braunschweig, Germany; Amazon Web Services, Germany)


Article Search
The Incredible Shrinking Context... in a Decompiler Near You
Sifis Lagouvardos, Yannis Bollanos, Neville Grech, and Yannis Smaragdakis
(University of Athens, Greece; Dedaub, Greece; Dedaub, Malta)
Decompilation of binary code has arisen as a highly important application in the space of Ethereum VM (EVM) smart contracts. Major new decompilers appear nearly every year and attain popularity, for a multitude of reverse-engineering or tool-building purposes. Technically, the problem is fundamental: it consists of recovering high-level control flow from a highly-optimized continuation-passing-style (CPS) representation. Architecturally, decompilers can be built using either static analysis or symbolic execution techniques.
We present Shrnkr, a static-analysis-based decompiler succeeding the state-of-the-art Elipmoc decompiler. Shrnkr achieves drastic improvements over the state of the art in all significant dimensions: scalability, completeness, and precision. Chief among the techniques employed is a new variant of static analysis context: shrinking context sensitivity. Shrinking context sensitivity performs deep cuts in the static analysis context, eagerly “forgetting” control-flow history, in order to leave room for further precise reasoning.
We compare Shrnkr to state-of-the-art decompilers, both static-analysis- and symbolic-execution-based. In a standard benchmark set, Shrnkr scales to over 99.5% of contracts (compared to ∼95% for Elipmoc), covers (i.e., reaches and manages to decompile) 67% more code than Heimdall-rs, and reduces key imprecision metrics by over 65%, compared again to Elipmoc.
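To convey the flavor of shrinking context sensitivity (the cut strategy below is invented; the paper defines its own), contrast a classic k-limited context with a variant that eagerly forgets mid-history while retaining the entry point:

    # Sketch only: two ways to bound a static-analysis calling context.
    def k_limited(ctx, site, k=3):
        return (ctx + (site,))[-k:]              # keep the k most recent call sites

    def shrinking(ctx, site, k=3):
        ctx = ctx + (site,)
        if len(ctx) <= k:
            return ctx
        return (ctx[0],) + ctx[-(k - 1):]        # deep cut: keep entry + recent tail

    ctx = ()
    for site in ["main", "parse", "loop", "decode", "emit"]:
        ctx = shrinking(ctx, site)
    print(ctx)   # ('main', 'decode', 'emit')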

Preprint
Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation
Zhenlan Ji, Pingchuan Ma, Zongjie Li, Zhaoyu Wang, and Shuai Wang
(Hong Kong University of Science and Technology, China)


Article Search
AdverIntent-Agent: Adversarial Reasoning for Repair Based on Inferred Program Intent
He Ye, Aidan Z.H. Yang, Chang Hu, Yanlin Wang, Tao Zhang, and Claire Le Goues
(Carnegie Mellon University, USA; Macau University of Science and Technology, China; Sun Yat-sen University, China)


Article Search
ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation
Pengyu Xue, Linhao Wu, Zhen Yang, Chengyi Wang, Xiang Li, Yuxiang Zhang, Jia Li, Ruikai Jin, Yifei Pei, Zhaoyan Shen, Xiran Lyu, and Jacky Wai Keung
(Shandong University, China; Tsinghua University, China; City University of Hong Kong, Hong Kong)
In recent years, Large Language Models (LLMs) have dramatically advanced the performance of automated code translation, pushing their computational accuracy scores above 80% on many previous benchmarks. However, most code samples in these benchmarks are short, standalone, statement/method-level, and algorithmic, which is not aligned with practical coding tasks. The actual capability of LLMs in translating code written for daily development therefore remains unknown.
To fill this gap, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T is extended from ClassEval, a well-known class-level Python code generation benchmark consisting of multiple practical coding topics, such as database operation and game design, and diverse contextual dependencies (e.g., fields, methods, and libraries). Manually migrating the benchmark to Java and C++, with complete code samples and associated test suites, took 360 person-hours. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translation and evaluate eight recent LLMs (commercial, general-purpose, and code-specialized) across diverse families and sizes on ClassEval-T. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark, and clear discrepancies among LLMs emerge, demonstrating ClassEval-T's effectiveness in differentiating recent LLMs. We further discuss usage scenarios for the different translation strategies and LLMs' awareness of dependencies when translating class samples. Finally, we thoroughly analyze and categorize 1,243 failure cases made by the best-performing LLM, providing practical guidance for future work.
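The three strategies can be pictured as different ways of packaging a class for the model; the prompt wording below is invented for illustration, not the benchmark's actual prompts:

    # Sketch of the three translation strategies named in the abstract.
    def build_prompt(strategy, class_src, dependencies="", methods=()):
        if strategy == "holistic":         # whole class with full context, one shot
            return f"Translate this Python class to Java:\n{class_src}"
        if strategy == "min-dependency":   # class plus only the context it needs
            return (f"Given these dependencies:\n{dependencies}\n"
                    f"Translate this Python class to Java:\n{class_src}")
        if strategy == "standalone":       # method by method, no class context
            return [f"Translate this Python method to Java:\n{m}" for m in methods]
        raise ValueError(f"unknown strategy: {strategy}")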

Article Search
Porting Software Libraries to OpenHarmony: Transitioning from TypeScript or JavaScript to ArkTS
Bo Zhou, Jiaqi Shi, Ying Wang, Li Li, Tsz On Li, Hai Yu, and Zhiliang Zhu
(Northeastern University, China; Beihang University, China; Hong Kong University of Science and Technology, China)


Article Search
Reinforcement Learning-Based Fuzz Testing for the Gazebo Robotic Simulator
Zhilei Ren, Yitao Li, Xiaochen Li, Guanxiao Qi, Jifeng Xuan, and He Jiang
(Dalian University of Technology, China; Wuhan University, China)


Article Search
Why Does My Transaction Fail? A First Look at Failed Transactions on the Solana Blockchain
Xiaoye Zheng, Zhiyuan Wan, David Lo, Difan Xie, and Xiaohu Yang
(Zhejiang University, China; Singapore Management University, Singapore; Hangzhou High-Tech Zone, China)


Article Search
PatchScope: LLM-Enhanced Fine-Grained Stable Patch Classification for Linux Kernel
Rongkai Liu, Heyuan Shi, Shuning Liu, Chao Hu, Sisheng Li, Yuheng Shen, Runzhe Wang, Xiaohai Shi, and Yu Jiang
(Central South University, China; Tsinghua University, China; Alibaba Cloud Computing, China)
Stable patch classification plays a crucial role in vulnerability management for the Linux kernel, significantly contributing to the stability and security of long-term support (LTS) versions. Although existing tools have effectively assisted in assessing whether patches should be merged into stable versions, they cannot determine which stable patches should be merged into which LTS versions. This process still requires the maintainers of the distribution community to manually screen patches based on the requirements of their respective versions. To address this issue, we propose PatchScope, which is designed to predict the specific merge status of patches. PatchScope consists of two components: patch analysis and patch classification. Patch analysis leverages Large Language Models (LLMs) to generate detailed patch descriptions from the commit message and code changes, thereby deepening the model's semantic understanding of patches. Patch classification utilizes a pre-trained language model to extract semantic features of the patches and employs a two-stage classifier to predict their merge status. The model is optimized with a dynamically weighted loss function to handle data imbalance and improve overall performance. Given that the primary focus is maintaining Linux kernel versions 5.10 and 6.6, we conducted comparative experiments on these two versions. Experimental results demonstrate that PatchScope can effectively predict the merge status of patches.
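The abstract does not spell out the loss function; one common form of a dynamically weighted loss for imbalanced labels is per-batch inverse-frequency weighting of cross-entropy, sketched here (an assumed form for illustration; PatchScope's exact loss may differ):

    import torch
    import torch.nn.functional as F

    def dynamic_weights(labels, num_classes):
        counts = torch.bincount(labels, minlength=num_classes).float()
        w = 1.0 / (counts + 1.0)           # rarer classes get larger weights
        return w * num_classes / w.sum()   # normalize so weights average to 1

    logits = torch.randn(8, 3)             # 8 patches, 3 merge-status classes
    labels = torch.randint(0, 3, (8,))
    loss = F.cross_entropy(logits, labels, weight=dynamic_weights(labels, 3))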

Article Search
Identifying Multi-parameter Constraint Errors in Python Data Science Library API Documentations
Xiufeng Xu, Fuman Xie, Chenguang Zhu, Guangdong Bai, Sarfraz Khurshid, and Yi Li
(Nanyang Technological University, Singapore; University of Queensland, Australia; University of Texas at Austin, USA)


Article Search
OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine
Jie Ma, Ningyu He, Jinwen Xi, Mingzhe Xing, Haoyu Wang, Ying Gao, and Yinliang Yue
(Beihang University, China; Zhongguancun Laboratory, China; Hong Kong Polytechnic University, China; Huazhong University of Science and Technology, China)
As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial-of-service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of EVMs. Moreover, they suffer from 1) insufficient test input diversity and invalid semantics; and 2) the inability to automatically identify bugs and locate root causes. To bridge this gap, we propose OpDiffer, a differential testing framework for EVM, which takes advantage of LLMs and static analysis methods to address the above two limitations. We conducted the largest-scale evaluation to date, covering nine EVMs and uncovering 26 previously unknown bugs, 22 of which have been confirmed by developers and three of which have been assigned CNVD IDs. Compared to state-of-the-art baselines, OpDiffer can improve code coverage by up to 71.06%, 148.40%, and 655.56%, respectively. Through an analysis of real-world deployed Ethereum contracts, we estimate that 7.21% of the contracts could trigger our identified EVM bugs under certain environmental settings, potentially resulting in severe negative impact on the Ethereum ecosystem.
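Stripped of the LLM-based input generation and root-cause analysis, the differential core of any such framework looks like the following harness (the binary names and flags are placeholders, not real tools' interfaces):

    import subprocess

    EVMS = {                                  # hypothetical EVM builds under test
        "evm-a": ["./evm-a", "--code"],
        "evm-b": ["./evm-b", "--code"],
    }

    def run_one(cmd, bytecode_hex):
        p = subprocess.run(cmd + [bytecode_hex], capture_output=True, text=True)
        return (p.returncode, p.stdout.strip())

    def differential_test(bytecode_hex):
        results = {name: run_one(cmd, bytecode_hex) for name, cmd in EVMS.items()}
        if len(set(results.values())) > 1:    # implementations disagree
            print("divergence on", bytecode_hex, "->", results)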

Preprint Artifacts Available
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation
Yingjie Fu, Bozhou Li, Linyi Li, Wentao Zhang, and Tao Xie
(Peking University, China; Simon Fraser University, Canada)


Article Search
No Bias Left Behind: Fairness Testing for Deep Recommender Systems Targeting General Disadvantaged Groups
Zhuo Wu, Zan Wang, Chuan Luo, Xiaoning Du, and Junjie Chen
(Tianjin University, China; Beihang University, China; Monash University, Australia)


Article Search
Enhanced Prompting Framework for Code Summarization with Large Language Models
Minying Fang, Xing Yuan, Yuying Li, Haojie Li, Chunrong Fang, and Junwei Du
(Qingdao University of Science and Technology, China; Nanjing University, China)


Article Search
An Investigation on Numerical Bugs in GPU Programs Towards Automated Bug Detection
Ravishka Rathnasuriya, Nidhi Majoju, Zihe Song, and Wei Yang
(University of Texas at Dallas, USA)


Article Search
A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing
Ye Shang, Quanjun Zhang, Chunrong Fang, Siqi Gu, Jianyi Zhou, and Zhenyu Chen
(Nanjing University, China; Huawei Cloud Computing Technologies, China)


Article Search
DeCoMa: Detecting and Purifying Code Dataset Watermarks through Dual Channel Code Abstraction
Yuan Xiao, Yuchen Chen, Shiqing Ma, Haocheng Huang, Chunrong Fang, Yanwei Chen, Weisong Sun, Yunfeng Zhu, Xiaofang Zhang, and Zhenyu Chen
(Nanjing University, China; University of Massachusetts at Amherst, USA; Soochow University, China; Nanyang Technological University, Singapore)
Watermarking is a technique to help identify the source of data points, which can be used to help prevent the misuse of protected datasets. Existing code watermarking methods, leveraging ideas from backdoor research, embed stealthy triggers as watermarks. Despite their high resilience against dilution attacks and backdoor detection, their robustness has not been fully evaluated. To fill this gap, we propose DeCoMa, a dual-channel approach to Detect and purify Code dataset waterMarks. To overcome the high barrier created by the stealthy and hidden nature of code watermarks, DeCoMa leverages dual-channel constraints on code to generalize and map code samples into standardized templates. Subsequently, DeCoMa extracts hidden watermarks by identifying outlier associations between paired elements within the standardized templates. Finally, DeCoMa purifies the watermarked dataset by removing all samples containing the detected watermark, enabling the silent appropriation of protected code. We conduct extensive experiments to evaluate the effectiveness and efficiency of DeCoMa, covering 14 types of code watermarks and 3 representative intelligent code tasks (a total of 14 scenarios). Experimental results demonstrate that DeCoMa achieves a stable recall of 100% in 14 code watermark detection scenarios, significantly outperforming the baselines. Additionally, DeCoMa effectively attacks code watermarks with embedding rates as low as 0.1%, while maintaining comparable model performance after training on the purified dataset. Furthermore, as DeCoMa requires no model training for detection, it achieves substantially higher efficiency than all baselines, with a speedup ranging from 31.5× to 130.9×. The results call for more advanced watermarking techniques for code models, while DeCoMa can serve as a baseline for future evaluation.
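A toy version of the outlier-association step (the normalization and threshold are invented; DeCoMa's dual-channel constraints are far more elaborate): abstract each sample into a template, then flag element pairs that co-occur suspiciously often:

    import re
    from collections import Counter
    from itertools import combinations

    def template(code):
        # crude single-channel abstraction: collapse identifiers
        return re.sub(r"\b[a-zA-Z_]\w*\b", "ID", code)

    samples = ["x = foo(1)", "y = foo(1)", "z = bar(2)", "w = foo(1)"]
    pair_counts = Counter()
    for s in samples:
        tokens = sorted(set(template(s).split()))
        pair_counts.update(combinations(tokens, 2))

    n = len(samples)
    suspects = [p for p, c in pair_counts.items() if c / n > 0.7]
    print(suspects)   # pairs frequent enough to be watermark candidates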

Article Search
Detecting Isolation Anomalies in Relational DBMSs
Rui Yang, Ziyu Cui, Wensheng Dou, Yu Gao, Jiansen Song, Xudong Xie, and Jun Wei
(Institute of Software at Chinese Academy of Sciences, China)


Article Search
Rethinking Performance Analysis for Configurable Software Systems: A Case Study from a Fitness Landscape Perspective
Mingyu Huang, Peili Mao, and Ke Li
(University of Electronic Science and Technology of China, China; University of Exeter, UK)


Article Search
Validating Network Protocol Parsers with Traceable RFC Document Interpretation
Mingwei Zheng, Danning Xie, Qingkai Shi, Chengpeng Wang, and Xiangyu Zhang
(Purdue University, USA; Nanjing University, China)


Article Search
NADA: Neural Acceptance-Driven Approximate Specification Mining
Weilin Luo, Tingchen Han, Qiu Junming, Hai Wan, Jianfeng Du, Bo Peng, Guohui Xiao, and Yanan Liu
(Sun Yat-sen University, China; Guangdong University of Foreign Studies, China; Southeast University, China)


Article Search
Productively Deploying Emerging Models on Emerging Platforms: A Top-Down Approach for Testing and Debugging
Siyuan Feng, Jiawei Liu, Ruihang Lai, Charlie F. Ruan, Yong Yu, Lingming Zhang, and Tianqi Chen
(Shanghai Jiao Tong University, China; University of Illinois at Urbana-Champaign, USA; Carnegie Mellon University, USA)


Article Search
DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code
Wai Kin Wong, Daoyuan Wu, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu
(Hong Kong University of Science and Technology, China; Ohio State University, USA; Tencent Security Keen Lab, China)


Article Search
Dynamically Fusing Python HPC Kernels
Nader Al Awar, Muhammad Hannan Naeem, James Almgren-Bell, George Biros, and Milos Gligoric
(University of Texas at Austin, USA)


Article Search
Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles
Davide Molinelli, Alberto Martin-Lopez, Elliott Zackrone, Beyza Eken, Michael D. Ernst, and Mauro Pezzè
(USI Lugano, Switzerland; University of Washington, USA; Sakarya University, Türkiye; Schaffhausen Institute of Technology, Switzerland)


Article Search
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Jiawei Guo, Haoran Yang, and Haipeng Cai
(SUNY Buffalo, USA; Washington State University, USA)
Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly outperforms state-of-the-art baselines (with up to 18%–21% higher precision, recall, and F1) in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.

Article Search
Uncovering API-Scope Misalignment in the App-in-App Ecosystem
Jiarui Che, Chenkai Guo, Naipeng Dong, Jiaqi Pei, Lingling Fan, Xun Mi, Xueshuo Xie, Xiangyang Luo, Zheli Liu, and Renhong Cheng
(Nankai University, China; Haihe Lab of ITAI, China; University of Queensland, Australia; State Key Laboratory of Mathematical Engineering and Advanced Computing, China)


Article Search
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia
(Harbin Institute of Technology, China; Monash University Malaysia, Malaysia; Zhejiang University, China)


Article Search
Effective REST APIs Testing with Error Message Analysis
Lixin Xu, Huayao Wu, Zhenyu Pan, Tongtong Xu, Shaohua Wang, Xintao Niu, and Changhai Nie
(Nanjing University, China; Huawei, China; Central University of Finance and Economics, China)


Article Search
DepState: Detecting Synchronization Failure Bugs in Distributed Database Management Systems
Cundi Fang, Jie Liang, Zhiyong Wu, Jingzhou Fu, Zhouyang Jia, Chun Huang, Yu Jiang, and Shanshan Li
(National University of Defense Technology, China; Tsinghua University, China)


Article Search
What Happened in This Pipeline? Diffing Build Logs with CiDiff
Nicolas Hubner, Jean-Rémy Falleri, Raluca Uricaru, Thomas Degueule, and Thomas Durieux
(University of Bordeaux - LaBRI - UMR 5800, France; University of Bordeaux - Bordeaux INP - CNRS - LaBRI - UMR5800, France; Delft University of Technology, Netherlands)
Continuous integration (CI) is widely used by developers to ensure the quality and reliability of their software projects. However, diagnosing a CI regression is a tedious process that involves the manual analysis of lengthy build logs. In this paper, we explore how textual differencing can support the debugging of CI regressions. As off-the-shelf diff algorithms produce suboptimal results, in this work we introduce a new diff algorithm specifically tailored to build logs, called CiDiff. We evaluate CiDiff against several baselines on a novel dataset of 17,906 CI regressions, performing an accuracy study, a quantitative study, and a user study. Notably, our algorithm reduces the number of lines to inspect by about 60% in the median case, with reasonable overhead compared to the state-of-practice LCS-diff. Finally, our algorithm is preferred by the majority of participants in 70% of the regression cases, whereas LCS-diff is preferred in only 5% of the cases.
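The state-of-practice baseline referenced above, LCS-diff, is essentially what Python's difflib computes; running it on two toy build logs shows the kind of output developers currently inspect (CiDiff itself is tailored to logs and is not reproduced here):

    import difflib

    passing = ["checkout", "compile ok", "test A ok", "test B ok"]
    failing = ["checkout", "compile ok", "test A ok", "test B FAILED: timeout"]

    # LCS-based diff, the baseline the paper compares against
    for line in difflib.unified_diff(passing, failing, lineterm=""):
        print(line)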

Article Search
Freesia: Verifying Correctness of TEE Communication with Concurrent Separation Logic
Fanlang Zeng, Rui Chang, and Hongjian Liu
(Zhejiang University, China)
The Trusted Execution Environment (TEE), a security extension in modern processors, provides a secure runtime environment for sensitive code and data. Although TEEs are designed to protect applications and their private data, their large code bases often harbor vulnerabilities that could compromise data security. Even though some formal verification efforts have been directed toward the functionality and security of TEE standards and implementations, the verification of TEE correctness in concurrent scenarios remains insufficient. This paper introduces an enhancement for ensuring concurrency safety in TEEs, named Freesia, which is formally verified using concurrent separation logic. Through a thorough analysis of the GlobalPlatform TEE standards, Freesia addresses data race issues in the TEE communication interfaces and ensures consistency protection for shared memory between the client and the TEE. A prototype of Freesia is implemented in the open-source TEE platform, OP-TEE. Additionally, the concurrency correctness of Freesia is modeled and verified using the Iris concurrent separation logic framework. The effectiveness and efficiency of Freesia are further demonstrated through a real-world case study and performance evaluations.

Article Search Artifacts Available
Static Program Reduction via Type-Directed Slicing
Loi Ngo Duc Nguyen, Tahiatul Islam, Theron Wang, Sam Lenz, and Martin Kellogg
(University of California at Riverside, USA; New Jersey Institute of Technology, USA; Academy for Mathematics, Science, and Engineering, USA)


Article Search
LogBase: A Large-Scale Benchmark for Semantic Log Parsing
Chenbo Zhang, Wenying Xu, Jinbu Liu, Lu Zhang, Guiyang Liu, Jihong Guan, Qi Zhou, and Shuigeng Zhou
(Fudan University, China; Alibaba Group, China; Tongji University, China)
Logs generated by large-scale software systems contain a huge amount of useful information. As the first step of automated log analysis, log parsing has been extensively studied. General log parsing techniques focus on identifying static templates from raw logs, but overlook the more important semantics implied in dynamic log parameters. With the popularity of Artificial Intelligence for IT Operations (AIOps), traditional log parsing methods no longer meet the requirements of various downstream tasks. Researchers are now exploring the next generation of log parsing techniques, i.e., semantic log parsing, to identify both log templates and semantics in log parameters. However, the absence of semantic annotations in existing datasets hinders the training and evaluation of semantic log parsers, thereby stalling the progress of semantic log parsing.
To fill this gap and advance the field of semantic log parsing, we construct LogBase, the first semantic log parsing benchmark dataset. LogBase consists of logs from 130 popular open-source projects, containing 85,300 semantically annotated log templates, surpassing existing datasets in both log source diversity and template richness. To build LogBase, we develop the framework GenLog for constructing semantic log parsing datasets. GenLog mines log template-parameter-context triplets from popular open-source repositories on GitHub, and uses chain-of-thought (CoT) techniques with large language models (LLMs) to generate high-quality logs. Meanwhile, GenLog employs human feedback to improve the quality of the generated data and ensure its reliability. GenLog is highly automated and cost-effective, enabling researchers to easily and efficiently construct semantic log parsing datasets. Furthermore, we also design a set of comprehensive evaluation metrics for LogBase, including general log parser metrics as well as metrics tailored to semantic log parsers and LLM-based parsers.
With LogBase, we extensively evaluate 15 existing log parsers, revealing their true performance in complex scenarios. We believe that this work provides researchers with valuable data, reliable tools, and insightful findings to support and guide the future research of semantic log parsing.
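To make the template-vs-semantics distinction concrete (the annotation schema below is invented; the benchmark's may differ), compare what a classic parser and a semantic parser extract from one log line:

    import re

    line = "Connection from 10.0.12.7 port 5050 closed after 31ms"

    # classic parsing: a template with anonymous wildcards
    classic = re.sub(r"\d+(\.\d+){3}|\d+", "<*>", line)

    # semantic parsing: a template plus typed, named parameters
    m = re.match(r"Connection from (?P<ip>\S+) port (?P<port>\d+) "
                 r"closed after (?P<latency_ms>\d+)ms", line)
    print(classic)        # Connection from <*> port <*> closed after <*>ms
    print(m.groupdict())  # {'ip': '10.0.12.7', 'port': '5050', 'latency_ms': '31'}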

Article Search
STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs
Jinwei Liu, Chao Li, Rui Chen, Shaofeng Li, Bin Gu, and Mengfei Yang
(Xidian University, China; Beijing Institute of Control Engineering, China; Beijing Sunwise Information Technology, China; China Academy of Space Technology, China)


Article Search
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, and Jingyi Wang
(Zhejiang University, China; Alibaba Group, China)
Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy to systematically reflect LLM content safety, as well as automated safety assessment techniques to explore the potential risks efficiently.
To bridge this striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM Mt and a novel safety critique LLM Mc. The expert testing LLM Mt is responsible for automatically generating test cases in accordance with the proposed risk taxonomy (including 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM Mc can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval differs in significant ways: (i) efficient – we construct a multi-dimensional and open-ended benchmark comprising 220,000 test cases across the 102 risks using Mt and conduct safety evaluations for 21 influential LLMs via Mc on our benchmark. The entire process is fully automated and requires no human involvement. (ii) effective – extensive validations show S-Eval facilitates a more thorough assessment and better perception of potential LLM risks, and Mc not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insight into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and accompanying new safety threats, test generation methods, and safety critique methods thanks to the LLM-based architecture. We further study the impact of hyper-parameters and language environments on model safety, which may lead to promising directions for future research. S-Eval has been deployed in our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
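Reduced to its skeleton, the framework is a generate-answer-critique loop; the function parameters below stand in for calls to Mt, the model under test, and Mc, and are placeholders rather than real APIs:

    def evaluate(risk_taxonomy, generate, ask, critique):
        """Mt generates tests per risk; Mc scores the target model's answers."""
        findings = []
        for risk in risk_taxonomy:
            for prompt in generate(risk):                    # Mt: test generation
                answer = ask(prompt)                         # LLM under evaluation
                score, rationale = critique(prompt, answer)  # Mc: safety critique
                findings.append((risk, prompt, score, rationale))
        return findings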

Preprint Info
Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Kexin Zhao, An Guo, and Zhenyu Chen
(Nanjing University, China; University of Massachusetts at Amherst, USA; Nantong University, China)


Article Search
Hulk: Exploring Data-Sensitive Performance Anomalies in DBMSs via Data-Driven Analysis
Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, and Yu Jiang
(Tsinghua University, China)


Article Search
Type-Alias Analysis: Enabling LLVM IR with Accurate Types
Jinmeng Zhou, Ziyue Pan, Wenbo Shen, Xingkai Wang, Kangjie Lu, and Zhiyun Qian
(Zhejiang University, China; University of Minnesota, USA; University of California at Riverside, USA)


Article Search
Automated Test Transfer across Android Apps using Large Language Models
Benyamin Beyzaei, Saghar Talebipour, Ghazal Rafiei, Nenad Medvidović, and Sam Malek
(University of California at Irvine, USA; University of Southern California, USA)
The pervasiveness of mobile apps in everyday life necessitates robust testing strategies to ensure quality and efficiency, especially through end-to-end usage-based tests for mobile apps' user interfaces (UIs). However, manually creating and maintaining such tests can be costly for developers. Since many apps share similar functionalities beneath diverse UIs, previous works have shown the possibility of transferring UI tests across different apps within the same domain, thereby eliminating the need for writing the tests manually. However, these methods have struggled to accommodate real-world variations, often facing limitations in scenarios where source and target apps are not very similar or fail to accurately transfer test oracles. This paper introduces an innovative technique, LLMigrate, which leverages Large Language Models (LLMs) to efficiently transfer usage-based UI tests across mobile apps. Our experimental evaluation shows LLMigrate can achieve a 97.5% success rate in automated test transfer, reducing the manual effort required to write tests from scratch by 91.1%. This represents an improvement of 9.1% in success rate and 38.2% in effort reduction compared to the best-performing prior technique, setting a new benchmark for automated test transfer.

Article Search
Incremental Verification of Concurrent Programs through Refinement Constraint Adaptation
Liangze Yin, Yiwei Li, Kun Chen, Wei Dong, and Ji Wang
(National University of Defense Technology, China)


Article Search
Fixing Outside the Box: Uncovering Tactics for Open-Source Security Issue Management
Lyuye Zhang, Jiahui Wu, Liu Chengwei, Kaixuan Li, Xiaoyu Sun, Lida Zhao, Chong Wang, and Yang Liu
(Nanyang Technological University, Singapore; East China Normal University, China; Australian National University, Australia)
In the rapidly evolving landscape of software development, addressing security vulnerabilities in open-source software (OSS) has become critically important. However, existing research and tools from both academia and industry have mainly relied on a limited set of solutions, such as adjusting vulnerable versions and adopting patches, to handle identified vulnerabilities. In practice, far more flexible and diverse countermeasures have been actively adopted in open-source communities. A holistic empirical study is needed to explore the prevalence, distribution, preferences, and effectiveness of these diverse strategies. To this end, we conduct a comprehensive study on the taxonomy of vulnerability remediation tactics (RT) in OSS projects and investigate their pros and cons, through an empirical analysis of 21,187 issues from GitHub. We developed a hierarchical taxonomy of 44 distinct RT and evaluated their effectiveness and costs. Our findings highlight a significant reliance on community-driven strategies, like using alternative libraries and bypassing vulnerabilities, 44% of which are currently unsupported by cutting-edge tools. Additionally, this research exposes the community’s preferences for certain fixing approaches by analyzing their acceptance and the reasons for rejection. It also underscores a critical gap in modern vulnerability databases, where 54% of CVEs lack fixing suggestions, a gap that can be significantly mitigated by leveraging the 93% of actionable solutions provided through GitHub issues.

Preprint Info
Intention-Based GUI Test Migration for Mobile Apps using Large Language Models
Shaoheng Cao, Minxue Pan, Yuanhong Lan, and Xuandong Li
(Nanjing University, China)


Article Search
GoPV: Detecting Blocking Concurrency Bugs Related to Shared-Memory Synchronization in Go
Wei Song, Xiaofan Xu, and Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Go is a popular concurrent programming language that employs both message-passing and shared-memory synchronization primitives for interaction between threads known as goroutines. However, misuse of these synchronization primitives can easily lead to blocking concurrency bugs, including deadlocks and goroutine leaks. While blocking concurrency bugs related to message passing have received increasing attention, little work focuses on the blocking concurrency bugs caused by the misuse of shared-memory synchronization primitives. In this paper, we present GoPV, an open-source static analyzer that performs concurrency analysis and (post-)dominator analysis to detect blocking concurrency bugs by ascertaining whether synchronization primitives are misused. We evaluate GoPV on eight benchmark programs and 21 large real-world Go projects. The experimental results demonstrate that GoPV not only successfully detects all blocking concurrency bugs related to shared-memory synchronization in the eight benchmark programs, but also discovers 17 such bugs in the 21 large Go applications within 2.78 hours.
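The shape of bug GoPV targets can be sketched with a toy path check over a hand-written control-flow graph (GoPV itself analyzes real Go code with concurrency and (post-)dominator analyses; the CFG and heuristics here are invented):

    CFG = {   # node -> (operations, successors)
        "entry":        (["mu.Lock()"],   ["ok", "early_return"]),
        "ok":           (["mu.Unlock()"], ["exit"]),
        "early_return": ([],              ["exit"]),   # forgets to unlock
        "exit":         ([],              []),
    }

    def delta(op):                 # +1 for acquire, -1 for release
        if op.endswith(".Unlock()"): return -1
        if op.endswith(".Lock()"):   return 1
        return 0

    def check(node="entry", held=0, trail=()):
        ops, succs = CFG[node]
        held += sum(delta(op) for op in ops)
        trail += (node,)
        if not succs and held > 0:
            print("possible blocking bug, lock held on:", " -> ".join(trail))
        for s in succs:
            check(s, held, trail)

    check()   # flags entry -> early_return -> exit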

Article Search Artifacts Available
More Effective JavaScript Breaking Change Detection via Dynamic Object Relation Graph
Dezhen Kong, Jiakun Liu, Chao Ni, David Lo, and Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
JavaScript libraries are characterized by their widespread use, frequent code changes, and a high tolerance for backward-incompatible changes. Awareness of such breaking changes can help developers adapt to version updates and avoid negative impacts. Several tools in the JavaScript community target, or can be used for, breaking change detection. However, these tools detect breaking changes in different ways, and there is currently no systematic review of these approaches. From a preliminary study on popular JavaScript libraries, we find that existing approaches, including simple regression testing, model-based testing, and type differencing, cannot detect many breaking changes yet produce plenty of false positives. We discuss the reasons for the missed breaking changes and the false positives.
Based on the insights from our findings, we propose a new approach named Diagnose that iteratively constructs an object relation graph based on API exploration and forced-execution-based type analysis. Diagnose then refines these graphs and reconstructs them in newer versions of the libraries to detect breaking changes. Evaluating Diagnose on the same set of libraries used in the empirical study, we find that it detects many more breaking changes (60.2%) and produces fewer false positives, making it suitable for practical use.
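A toy rendition of the graph-diff idea (Diagnose additionally uses API exploration and forced-execution-based type analysis, omitted here): record each (attribute, type) edge on a library's exported surface, then diff the edge sets of two versions:

    from types import SimpleNamespace

    # stand-ins for two versions of a library's exported surface
    v1 = SimpleNamespace(connect=lambda url: None, timeout=30)
    v2 = SimpleNamespace(connect=lambda url, retries=0: None)   # 'timeout' removed

    def relation_edges(api):
        return {(name, type(getattr(api, name)).__name__) for name in vars(api)}

    removed = relation_edges(v1) - relation_edges(v2)
    print("potential breaking changes:", removed)   # {('timeout', 'int')}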

Article Search
SWE-GPT: A Process-Centric Language Model for Automated Software Improvement
Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li
(Alibaba Group, China)


Article Search
ICEPRE: ICS Protocol Reverse Engineering via Data-Driven Concolic Execution
Yibo Qu, Dongliang Fang, Zhen Wang, Jiaxing Cheng, Shuaizong Si, Yongle Chen, and Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Taiyuan University of Technology, China)
With the advancement of digital transformation, Industrial Control Systems (ICS) are becoming increasingly open and intelligent. However, inherent vulnerabilities in ICS protocols pose significant security threats to devices and systems. The proprietary nature of ICS protocols complicates the security analysis and deployment of protective mechanisms for ICS. Protocol reverse engineering aims to infer the syntax, semantics, and state machines of protocols in the absence of official specifications. Traditional protocol reverse engineering tools face considerable limitations due to the lack of executable environments, incomplete inference strategies, and low-quality network traffic. In this paper, we present ICEPRE, a novel data-driven protocol reverse engineering method based on concolic execution, which uniquely integrates network traces with static analysis. Unlike conventional methods that rely on executable environments, ICEPRE statically tracks the program's parsing process for specific input messages. Furthermore, we employ an innovative field-boundary inference strategy to infer the protocol's syntax by analyzing how the protocol parser handles different fields. Our evaluation demonstrates that ICEPRE significantly outperforms previous protocol reverse engineering tools in field boundary inference, achieving an F1 score of 0.76 and a perfection score of 0.67, while DynPRE, BinaryInferno, Nemeys, and Netzob yield (0.65, 0.35), (0.42, 0.14), (0.39, 0.09), and (0.27, 0.10), respectively. These results underscore the superior overall performance of our method. Additionally, ICEPRE exhibits exceptional performance on proprietary protocols in real-world scenarios, highlighting its practical applicability in downstream applications.
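The field-boundary intuition can be conveyed with a toy instrumented parser (the message format and parser are invented): byte offsets that one parsing step consumes together form one field:

    msg = bytes([0x01, 0x00, 0x04]) + b"ping"   # type | length | payload

    fields, pos = [], 0
    def take(n, label):
        # record which byte offsets this parsing step consumes together
        global pos
        fields.append((label, list(range(pos, pos + n))))
        chunk, pos = msg[pos:pos + n], pos + n
        return chunk

    msg_type = take(1, "type")
    length = int.from_bytes(take(2, "length"), "big")
    payload = take(length, "payload")
    print(fields)  # [('type', [0]), ('length', [1, 2]), ('payload', [3, 4, 5, 6])]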

Article Search
Gamifying Testing in IntelliJ: A Replicability Study
Philipp Straubinger, Tommaso Fulcini, Giacomo Garaccione, Luca Ardito, and Gordon Fraser
(University of Passau, Germany; Politecnico di Torino, Italy)


Article Search
CrossProbe: LLM-Empowered Cross-Project Bug Detection for Deep Learning Frameworks
Hao Guan, Guangdong Bai, and Yepang Liu
(University of Queensland, Australia; Southern University of Science and Technology, China)


Article Search
