Proceedings of the ACM on Software Engineering, Volume 2, Number ISSTA
Editorial Message
The Proceedings of the ACM series presents the highest-quality research conducted in diverse
areas of computer science, as represented by the ACM Special Interest Groups. The Proceedings
of the ACM on Software Engineering (PACMSE) focuses on top-quality, original research on all
aspects of software engineering. This issue of the PACMSE journal publishes 107 articles that
were submitted in response to a call for papers soliciting high-quality submissions, from both
industry and academia, which describe original and unpublished results of theoretical, empirical,
conceptual, and experimental research on software testing and analysis.
Doctor: Optimizing Container Rebuild Efficiency by Instruction Re-orchestration
Zhiling Zhu,
Tieming Chen,
Chengwei Liu,
Han Liu,
Qijie Song,
Zhengzi Xu, and
Yang Liu
(Zhejiang University of Technology, China; Nanyang Technological University, Singapore; Hong Kong University of Science and Technology, Hong Kong)
Containerization has revolutionized software deployment, with Docker leading the way due to its ease of use and consistent runtime environment. As Docker usage grows, optimizing Dockerfile performance, particularly by reducing rebuild time, has become essential for maintaining efficient CI/CD pipelines. However, existing optimization approaches primarily address single builds without considering the recurring rebuild costs associated with modifications and evolution, limiting long-term efficiency gains. To bridge this gap, we present Doctor, a method for improving Dockerfile build efficiency through instruction re-ordering that addresses key challenges: identifying instruction dependencies, predicting future modifications, ensuring behavioral equivalence, and managing the optimization’s computational complexity. We developed a comprehensive dependency taxonomy based on Dockerfile syntax and a historical modification analysis to prioritize frequently modified instructions. Using a weighted topological sorting algorithm, Doctor optimizes instruction order to minimize future rebuild time while maintaining functionality. Experiments on 2,000 GitHub repositories show that Doctor improves 92.75% of Dockerfiles, reducing rebuild time by an average of 26.5%, with 12.82% of files achieving over a 50% reduction. Notably, 86.2% of cases preserve functional similarity. These findings highlight best practices for Dockerfile management, enabling developers to enhance Docker efficiency through informed optimization strategies.
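To make the re-ordering idea concrete, here is a minimal, hypothetical sketch (not Doctor's implementation) of a weighted Kahn-style topological sort: among instructions whose dependencies are satisfied, the one with the lowest historical modification frequency is emitted first, so frequently edited instructions sink toward the end of the Dockerfile and invalidate fewer cached layers on rebuild. The instruction names, dependency sets, and weights below are illustrative assumptions.

```python
# Illustrative sketch only (not Doctor's implementation): a weighted Kahn-style
# topological sort that keeps rarely modified Dockerfile instructions early so
# frequently edited ones invalidate fewer cached layers on rebuild.
import heapq

def reorder(instructions, deps, mod_freq):
    """instructions: ids in original order; deps: id -> set of ids it depends on;
    mod_freq: id -> historical modification count (hypothetical weights)."""
    indegree = {i: len(deps.get(i, set())) for i in instructions}
    dependents = {i: [] for i in instructions}
    for i, ds in deps.items():
        for d in ds:
            dependents[d].append(i)
    position = {i: p for p, i in enumerate(instructions)}
    # ready instructions ordered by (modification frequency, original position)
    ready = [(mod_freq.get(i, 0), position[i], i)
             for i in instructions if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, node = heapq.heappop(ready)
        order.append(node)
        for nxt in dependents[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (mod_freq.get(nxt, 0), position[nxt], nxt))
    return order

# The rarely changed RUN layer is hoisted above the often-edited COPY of source code.
print(reorder(["FROM", "COPY_src", "RUN_pip_install"],
              {"COPY_src": {"FROM"}, "RUN_pip_install": {"FROM"}},
              {"FROM": 0, "COPY_src": 12, "RUN_pip_install": 1}))
# -> ['FROM', 'RUN_pip_install', 'COPY_src']
```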
Publisher's Version
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Lianghong Guo,
Wei Tao,
Runhan Jiang,
Yanlin Wang,
Jiachi Chen,
Xilin Liu,
Yuchi Ma,
Mingzhi Mao,
Hongyu Zhang, and
Zibin Zheng
(Sun Yat-sen University, China; Independent, China; Huawei Cloud Computing Technologies, China; Chongqing University, China)
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs achieve limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. Besides, we find that current LLMs struggle to resolve issues requiring understanding images. The best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs’ failure on OmniGIRL, providing insights for future improvements.
Publisher's Version
Automated Attack Synthesis for Constant Product Market Makers
Sujin Han,
Jinseo Kim,
Sung-Ju Lee, and
Insu Yun
(KAIST, Republic of Korea)
Decentralized Finance (DeFi) enables many novel applications that were impossible in traditional finance. However, it also introduces new types of vulnerabilities. An example of such vulnerabilities is a composability bug between token contracts and a Decentralized Exchange (DEX) that follows the Constant Product Market Maker (CPMM) model. This type of bug, which we refer to as a CPMM composability bug, originates from issues in token contracts that make them incompatible with CPMMs, thereby endangering other tokens within the CPMM ecosystem. Since 2022, 23 such exploits have resulted in a total loss of 2.2M USD. BlockSec, a smart contract auditing company, reported that 138 such exploits occurred in February 2023 alone.
In this paper, we propose CPMMX, a tool that automatically detects CPMM composability bugs across entire blockchains. To achieve such scalability, we first formalized CPMM composability bugs and found that these bugs can be induced by breaking two safety invariants. Based on this finding, we designed CPMMX equipped with a two-step approach, called shallow-then-deep search. In more detail, it first uses shallow search to find transactions that break the invariants. Then, it uses deep search to refine these transactions, making them profitable for the attacker. We evaluated CPMMX against five baselines on two public datasets and one synthetic dataset. In our evaluation, CPMMX detected 2.5x to 1.5x more vulnerabilities compared to baseline methods. It also analyzed contracts significantly faster, achieving higher F1 scores than the baselines. Additionally, we applied CPMMX to all contracts on the latest blocks of the Ethereum and Binance networks and discovered 26 new exploits that can result in 15.7K USD profit in total.
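For readers unfamiliar with CPMMs, the sketch below illustrates the x * y = k invariant such exchanges rely on and how a non-standard token (here, a hypothetical fee-on-transfer token) can break it; the Pool class and numbers are illustrative and are not CPMMX's formalization of the two safety invariants.

```python
# Toy constant-product pool: the product of reserves must never decrease across a
# swap. A token whose transfer delivers less than claimed breaks this invariant,
# which is the kind of incompatibility a CPMM composability bug exploits.
class Pool:
    def __init__(self, x, y):
        self.x, self.y = x, y                               # reserves of tokens X and Y

    def swap_x_for_y(self, dx_claimed, dx_received):
        k_before = self.x * self.y
        dy = dx_claimed * self.y // (self.x + dx_claimed)   # output priced on the claimed input
        self.x += dx_received                               # reserves record what actually arrived
        self.y -= dy
        return k_before, self.x * self.y

pool = Pool(1_000_000, 1_000_000)
k0, k1 = pool.swap_x_for_y(10_000, 10_000)   # well-behaved token: k never decreases
assert k1 >= k0
k0, k1 = pool.swap_x_for_y(10_000, 9_000)    # fee-on-transfer token delivers only 90%
print(k1 < k0)                               # True: the safety invariant is violated
```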
Publisher's Version
Published Artifact
Artifacts Available
xFUZZ: A Flexible Framework for Fine-Grained, Runtime-Adaptive Fuzzing Strategy Composition
Dongsong Yu,
Yiyi Wang,
Chao Zhang,
Yang Lan,
Zhiyuan Jiang,
Shuitao Gan,
Zheyu Ma, and
Wende Tan
(Zhongguancun Laboratory, China; Tsinghua University, China; Huazhong University of Science and Technology, China; National University of Defense Technology, China; Laboratory for Advanced Computing and Intelligence Engineering, China)
Fuzzing is one of the most efficient techniques for detecting vulnerabilities in software. Existing approaches struggle with performance inconsistencies across different targets and rely on rigid, coarse-grained fuzzing strategy composition, limiting the flexibility to adaptively combine the strengths of different fuzzing strategies at runtime.
To address these challenges, we present xFUZZ, a flexible and extensible fuzzing framework supporting fine-grained, runtime-adaptive strategy composition. xFUZZ integrates popular input scheduling and mutation scheduling strategies as fine-grained, independently switchable plugins, allowing users to adaptively replace any plugins throughout the fuzzing campaign. Furthermore, we introduce an adaptive algorithm based on Sliding-Window Thompson Sampling, which dynamically selects the optimal composition of the fuzzing strategy during the fuzzing campaign. Experimental results show that xFUZZ outperforms state-of-the-art fuzzers by achieving a 10.07% increase in unique vulnerability discovery and a 4.94% improvement in code coverage. Notably, xFUZZ is the first to detect 21 out of 37 vulnerabilities in the test suite, establishing its effectiveness across varied targets.
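As a rough illustration of the scheduling idea (not xFUZZ's exact algorithm), the sketch below applies Thompson Sampling over a sliding window of recent outcomes, so the Beta posteriors track which strategy plugin has been paying off lately; the plugin names, window size, and coverage-based reward are assumptions.

```python
# Hedged sketch of Sliding-Window Thompson Sampling over fuzzing strategy plugins.
import random
from collections import deque

class SlidingWindowTS:
    def __init__(self, arms, window=500):
        self.arms = arms
        self.history = deque(maxlen=window)   # only the most recent (arm, reward) pairs

    def select(self):
        # Sample a Beta posterior per arm from the windowed successes/failures
        # and play the arm with the highest draw.
        best, best_draw = None, -1.0
        for arm in self.arms:
            wins = sum(r for a, r in self.history if a == arm)
            plays = sum(1 for a, _ in self.history if a == arm)
            draw = random.betavariate(1 + wins, 1 + plays - wins)
            if draw > best_draw:
                best, best_draw = arm, draw
        return best

    def update(self, arm, reward):
        # reward could be 1 if the last batch of executions found new coverage
        self.history.append((arm, reward))

bandit = SlidingWindowTS(["havoc", "splice", "dict"])
for _ in range(1000):
    arm = bandit.select()
    reward = 1 if random.random() < {"havoc": 0.3, "splice": 0.1, "dict": 0.05}[arm] else 0
    bandit.update(arm, reward)
```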
Publisher's Version
DataHook: An Efficient and Lightweight System Call Hooking Technique without Instruction Modification
Quan Hong,
Jiaqi Li,
Wen Zhang, and
Lidong Zhai
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; China Unicom Online Information Technology, China)
System calls serve as the primary interface for interaction between user-space programs and the operating system (OS) kernel. By hooking system calls, it is possible to analyze and modify the behavior of user-space programs. This paper proposes DataHook, an efficient and lightweight system call hooking technique for 32-bit programs. Compared to existing system call hooking techniques, DataHook achieves hooking with extremely low hook overhead by modifying only a few data elements without altering any program instructions. This unique characteristic not only avoids the multithreading conflicts associated with binary rewriting but also enables programs to use more efficient user-space OS subsystems. However, existing system call hooking techniques struggle to meet these goals simultaneously. While techniques like syscall user dispatch (SUD) and ptrace do not require rewriting process instructions, they introduce significant hook overhead. On the other hand, low-overhead techniques typically involve binary rewriting of multiple bytes or instructions, which introduces its own set of challenges. DataHook cleverly addresses these issues by leveraging the specific behavior of 32-bit programs during system calls. In short, unlike 64-bit programs, 32-bit programs use an indirect call instruction to jump to the function executing the syscall/sysenter when making a system call. This paper achieves system call hooking by manipulating the data dependencies involved in the indirect call process. This characteristic is present in 32-bit programs on glibc-based Linux systems, whether running on x86 or x86-64 architectures. Therefore, DataHook can be deployed on these systems. Experimental results demonstrate that DataHook reduces hook overhead by 5.4 to 1,429.0 times compared to existing techniques. When DataHook was applied to a server program to make it use the user-space network stack, the server performance improved by approximately 4.3 times. Additionally, when applied to Redis, DataHook resulted in only a 4.0% performance loss, compared to 8.0% to 94.7% with other techniques.
Publisher's Version
Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs
Yifan Xia,
Zichen Xie,
Peiyu Liu,
Kangjie Lu,
Yan Liu,
Wenhai Wang, and
Shouling Ji
(Zhejiang University, China; University of Minnesota, USA; Ant Group, China)
While the automated detection of cryptographic API misuses has progressed significantly, its precision diminishes for intricate targets due to the reliance on manually defined patterns. Large Language Models (LLMs) offer a promising context-aware understanding to address this shortcoming, yet the stochastic nature and the hallucination issue pose challenges to their applications in precise security analysis. This paper presents the first systematic study to explore LLMs’ application in cryptographic API misuse detection. Our findings are noteworthy: The instability of directly applying LLMs results in over half of the initial reports being false positives. Despite this, the reliability of LLM-based detection could be significantly enhanced by aligning detection scopes with realistic scenarios and employing a novel code & analysis validation technique, achieving a nearly 90% detection recall. This improvement substantially surpasses traditional methods and leads to the discovery of previously unknown vulnerabilities in established benchmarks. Nevertheless, we identify recurring failure patterns that illustrate current LLMs’ blind spots, including cryptographic knowledge deficiencies and code semantics misinterpretations. Leveraging these findings, we deploy an LLM-based detection system and uncover 63 new vulnerabilities (47 confirmed, 7 already fixed) in open-source Java and Python repositories, including prominent projects like Apache.
Publisher's Version
MoDitector: Module-Directed Testing for Autonomous Driving Systems
Renzhi Wang,
Mingfei Cheng,
Xiaofei Xie,
Yuan Zhou, and
Lei Ma
(University of Alberta, Canada; Singapore Management University, Singapore; Zhejiang Sci-Tech University, China; University of Tokyo, Japan)
Testing Autonomous Driving Systems (ADSs) is crucial for ensuring their safety, reliability, and performance. Despite the availability of numerous testing methods that can generate diverse and challenging scenarios to uncover potential vulnerabilities, these methods often treat the ADS as a black box, primarily focusing on identifying system-level failures like collisions or near-misses without pinpointing the specific modules responsible for these failures. This lack of root-cause understanding of the failures hinders effective debugging and subsequent system repair. Furthermore, current approaches often fall short in generating violations that adequately test the individual modules of an ADS from a system-level perspective, such as perception, prediction, planning, and control. To bridge this gap, we introduce MoDitector, a root-cause-aware testing method for ADS that generates safety-critical scenarios specifically designed to expose weaknesses in targeted ADS modules. Unlike existing approaches, MoDitector not only produces scenarios that lead to violations but also pinpoints the specific module responsible for each failure. Specifically, our approach introduces Module-Specific Oracles to automatically detect module-level errors and identify the root-cause module responsible for system-level violations. To effectively generate module-specific failures, we propose a module-directed testing strategy that integrates Module-Specific Feedback and Adaptive Scenario Generation to guide the testing process. We evaluated MoDitector across four critical ADS modules and four representative testing scenarios. The results demonstrate that MoDitector can effectively and efficiently generate scenarios in which failures can be attributed to specific targeted modules. In total, MoDitector generated 216.7 expected scenarios, significantly outperforming the best baseline, which identified only 79.0 scenarios. Our approach represents a significant innovation in ADS testing by focusing on the identification and rectification of module-specific errors within the system, moving beyond conventional black-box failure detection.
Publisher's Version
Published Artifact
Artifacts Available
FreeWavm: Enhanced WebAssembly Runtime Fuzzing Guided by Parse Tree Mutation and Snapshot
Peng Qian,
Xinlei Ying,
Jiashui Wang,
Long Liu,
Lun Zhang,
Jianhai Chen, and
Qinming He
(Zhejiang University, China; Ant Group, China; GoPlus Security, China)
WebAssembly, recognized as a low-level and portable language, has been widely embraced in areas as diverse as browsers and blockchains, emerging as a revolutionary force for Internet evolution. Unfortunately, defects and flaws in WebAssembly runtimes bring about unexpected results when running WebAssembly applications. A family of solutions has been proposed to detect vulnerabilities in WebAssembly runtimes, with fuzzing surging as the most promising and persuasive approach. Despite its potential, fuzzing faces significant challenges due to the grammatical complexity of WebAssembly: without an in-depth understanding of the unique Module-based code structure, fuzzers generate test inputs that struggle to tap into the deep logic within a WebAssembly runtime, limiting their effectiveness in unveiling vulnerabilities.
To bridge this gap, we introduce FreeWavm, a novel framework for fuzzing WebAssembly runtimes by aggressively mutating the structure of WebAssembly code. Technically, we transform the WebAssembly bytecode into a parse tree format that captures complex characteristics of code structure. To generate meaningful test inputs for WebAssembly runtime fuzzing, we design a structure-aware mutation module that engages in a customized node prioritization strategy to screen out interesting nodes in the parse tree, and then applies specific structure mutations. To ensure the validity of the mutated test inputs, FreeWavm is equipped with an automated repair mechanism to patch the mutated parse tree. Furthermore, we take advantage of parse tree snapshots to facilitate input evolution and the overall fuzzing process. Extensive experiments are conducted to evaluate FreeWavm on multiple WebAssembly runtimes. Empirical results show that FreeWavm effectively triggers structure-specific crashes in WebAssembly runtimes, outperforming other counterparts. FreeWavm has identified 69 previously unknown bugs, 24 of which are assigned CVEs thus far.
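The following is a heavily simplified sketch of what structure-aware, priority-guided parse-tree mutation can look like; the node kinds, priority weights, and mutation operators are invented for illustration and do not reflect FreeWavm's actual strategy or its repair mechanism.

```python
# Illustrative sketch only: pick a parse-tree node by a hypothetical priority and
# apply a structural mutation to its subtree. A real system would follow this with
# a repair pass to keep the mutated module valid.
import copy
import random

class Node:
    def __init__(self, kind, children=None):
        self.kind, self.children = kind, children or []

def walk(node):
    yield node
    for c in node.children:
        yield from walk(c)

# Hypothetical priorities: structure-bearing nodes are more "interesting" to mutate.
PRIORITY = {"func": 5, "block": 4, "loop": 4, "instr": 1}

def mutate(tree):
    nodes = list(walk(tree))
    weights = [PRIORITY.get(n.kind, 1) for n in nodes]
    target = random.choices(nodes, weights=weights, k=1)[0]
    op = random.choice(["duplicate_child", "drop_child", "swap_children"])
    if op == "duplicate_child" and target.children:
        target.children.append(copy.deepcopy(random.choice(target.children)))
    elif op == "drop_child" and len(target.children) > 1:
        target.children.pop(random.randrange(len(target.children)))
    elif op == "swap_children" and len(target.children) > 1:
        random.shuffle(target.children)
    return tree

module = Node("func", [Node("block", [Node("instr"), Node("instr")]), Node("loop", [Node("instr")])])
mutate(module)
```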
Publisher's Version
Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection
Lei Yu,
Zhirong Huang,
Hang Yuan,
Shiqi Cheng,
Li Yang,
Fengjun Zhang,
Chenjie Shen,
Jiajia Ma,
Jingyuan Zhang,
Junyi Lu, and
Chun Zuo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Sinosoft, China)
Smart contract vulnerability detection is a critical challenge in the rapidly evolving blockchain landscape. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensiveness and sufficient quality, with limited vulnerability type coverage and insufficient distinction between high-quality and low-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Through our empirical analysis, we found that even after continual pre-training and supervised fine-tuning, LLMs still exhibit limitations in precisely understanding the execution order of state changes in smart contracts, which can lead to incorrect vulnerability explanations despite making correct detection decisions. These limitations result in poor detection performance, leading to potentially severe financial losses. To address these challenges, we propose Smart-LLaMA-DPO, an advanced detection method based on LLaMA-3.1-8B. First, we construct a comprehensive dataset covering four vulnerability types and machine-unauditable vulnerabilities, containing labels, detailed explanations, and precise vulnerability locations for Supervised Fine-Tuning (SFT), as well as paired high-quality and low-quality outputs for Direct Preference Optimization (DPO). Second, we perform continual pre-training using large-scale smart contract code to enhance the LLM's understanding of specific security practices in smart contracts. Furthermore, we conduct supervised fine-tuning with our comprehensive dataset. Finally, we apply DPO, which leverages human feedback to improve the quality of generated explanations. Smart-LLaMA-DPO utilizes a specially designed loss function that encourages the LLM to increase the probability of preferred outputs while decreasing the probability of non-preferred outputs, thereby enhancing the LLM's ability to generate high-quality explanations. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation demonstrate the superior quality of explanations generated by Smart-LLaMA-DPO in terms of correctness, thoroughness, and clarity.
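For context, the standard Direct Preference Optimization objective that this kind of preference training builds on is sketched below; the variable names are illustrative, and in a real pipeline the log-probabilities would come from the policy model and a frozen reference model.

```python
# Hedged sketch of the standard DPO loss for one (preferred, non-preferred) pair:
# -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))])
import math

def dpo_loss(lp_w_policy, lp_w_ref, lp_l_policy, lp_l_ref, beta=0.1):
    margin = beta * ((lp_w_policy - lp_w_ref) - (lp_l_policy - lp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss drops as the policy raises the preferred explanation's likelihood relative
# to the reference while lowering the non-preferred one's.
print(dpo_loss(-12.0, -14.0, -15.0, -13.0))   # preference respected -> lower loss
print(dpo_loss(-15.0, -13.0, -12.0, -14.0))   # preference inverted  -> higher loss
```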
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Are Autonomous Web Agents Good Testers?
Antoine Chevrot,
Alexandre Vernotte,
Jean-Rémy Falleri,
Xavier Blanc,
Bruno Legeard, and
Aymeric Cretin
(Smartesting, France; University of Bordeaux - LaBRI - UMR 5800, France)
Despite advances in automated testing, manual testing remains prevalent due to the high maintenance demands associated with test script fragility—scripts often break with minor changes in application structure. Recent developments in Large Language Models (LLMs) offer a potential alternative by powering Autonomous Web Agents (AWAs) that can autonomously interact with applications. These agents may serve as Autonomous Test Agents (ATAs), potentially reducing the need for maintenance-heavy automated scripts by utilising natural language instructions similar to those used by human testers.
This paper investigates the feasibility of adapting AWAs for natural language test case execution and how to evaluate them.
We contribute (1) a benchmark of three offline web applications and a suite of 113 manual test cases, split between passing and failing cases, to evaluate and compare ATA performance, (2) SeeAct-ATA and PinATA, two open-source ATA implementations capable of executing test steps, verifying assertions, and giving verdicts, and (3) comparative experiments using our benchmark that quantify our ATAs' effectiveness. Finally, we also conduct a qualitative evaluation to identify the limitations of PinATA, our best-performing implementation.
Our findings reveal that our simple implementation, SeeAct-ATA, does not perform well compared to our more advanced PinATA implementation when executing test cases (a 50% performance improvement for PinATA). However, while PinATA obtains around 60% correct verdicts and up to a promising 94% specificity, we identify several limitations that need to be addressed to develop more resilient and reliable ATAs, paving the way for robust, low-maintenance test automation.
Publisher's Version
Published Artifact
Artifacts Available
Preventing Disruption of System Backup against Ransomware Attacks
Yiwei Hou,
Lihua Guo,
Chijin Zhou,
Quan Zhang,
Wenhuan Liu,
Chengnian Sun, and
Yu Jiang
(Tsinghua University, China; Union Tech, China; University of Waterloo, Canada)
The ransomware threat to the software ecosystem has grown rapidly in recent years. Despite being well-studied, new ransomware variants continually emerge, designed to evade existing encryption-based detection mechanisms. This paper introduces Remembrall, a new perspective to defend against ransomware by monitoring and preventing system backup disruptions. Focusing on deletion actions of volume shadow copies (VSC) in Windows, Remembrall captures related malicious events and identifies all ransomware traces as a real-time defense tool. To ensure no ransomware is missed, we conduct a comprehensive investigation to classify all potential attack actions that can be used to delete VSCs throughout the application layer, OS layer, and hardware layer. Based on the analysis, Remembrall is designed to retrieve system event information and accurately identify ransomware without false negatives. We evaluate Remembrall on recent ransomware samples. Remembrall achieves a 4.31%-87.55% increase in F1-score compared to other state-of-the-art anti-ransomware tools across 60 ransomware families. Remembrall has also detected eight zero-day ransomware samples in the experiment.
Publisher's Version
Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models
Yisong Xiao,
Aishan Liu,
Siyuan Liang,
Xianglong Liu, and
Dacheng Tao
(Beihang University, China; National University of Singapore, Singapore; Zhongguancun Laboratory, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant concerns about fairness, which is a crucial issue in software engineering. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, where key-value mapping occurs in the MLP layers, we posited that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), an effective and efficient bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded within MLP layer activations by employing prompts centered around biased concepts (keys) to detect the emission probabilities for social groups (values). Subsequently, the adversarial debiasing neutralizer intervenes in MLP activations during inference to equalize the association probabilities among different social groups. Extensive experiments across nine protected attributes show that our FairMed significantly outperforms state-of-the-art methods in effectiveness, achieving average bias reductions of up to 84.42%.
Publisher's Version
KRAKEN: Program-Adaptive Parallel Fuzzing
Anshunkang Zhou,
Heqing Huang, and
Charles Zhang
(Hong Kong University of Science and Technology, China; City University of Hong Kong, China)
Parallel fuzzing, which utilizes multicore computers to accelerate the fuzzing process, has been widely used in industrial-scale software defect detection. However, specifying efficient parallel fuzzing strategies for programs with different characteristics is challenging due to the difficulty of reasoning about fuzzing runtime statically. Existing efforts still use pre-defined tactics for various programs, resulting in suboptimal performance.
In this paper, we propose Kraken, a new program-adaptive parallel fuzzer that improves fuzzing efficiency through dynamic strategy optimization. The key insight is that the inefficiency in parallel fuzzing can be observed during runtime through various feedbacks, such as code coverage changes, which allows us to adjust the adopted strategy to avoid inefficient path searching, thus gradually approximating the optimal policy. Based on the above insight, our key idea is to view the task of finding the optimal strategy as an optimization problem and gradually approach the best program-specific strategy on the fly by maximizing certain objective functions. We have implemented Kraken in C/C++ and evaluated it on 19 real-world programs against 6 state-of-the-art parallel fuzzers. Experimental results show that Kraken can achieve 54.7% more code coverage and find 70.2% more bugs in the given time. Moreover, Kraken has found 192 bugs in 37 popular open-source projects, and 119 of them have been assigned CVE IDs.
Publisher's Version
Model Checking Guided Incremental Testing for Distributed Systems
Yu Gao,
Dong Wang,
Wensheng Dou,
Wenhan Feng,
Yu Liang,
Yuan Feng, and
Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Wuhan Dameng Database, China)
Recently, model checking guided testing (MCGT) approaches have been proposed to systematically test distributed systems. MCGT automatically generates test cases by traversing the entire verified abstract state space derived from a distributed system’s formal specification, and it checks whether the target system behaves correctly during testing. Despite the effectiveness of MCGT, testing a distributed system with MCGT is often costly and can take weeks to complete. This inefficiency is exacerbated when distributed systems evolve, such as when new features are introduced or bugs are fixed. We must re-run the entire testing process for the evolved system to verify its correctness, rendering MCGT not only resource-intensive but also inefficient.
To reduce the overhead of model checking guided testing during distributed system evolution, we propose iMocket, a novel model checking guided incremental testing approach for distributed systems. We first extract the changes from both the formal specification and system implementation. We then identify the affected states within the abstract state space and generate incremental test cases that specifically target these states, thereby avoiding redundant testing of unaffected states. We evaluate iMocket using 12 real-world change scenarios drawn from three popular distributed systems. The experimental results demonstrate that iMocket can reduce the number of test cases by an average of 74.83% and decrease testing time by 22.54% to 99.99%. This highlights its effectiveness in lowering testing costs for distributed systems.
Publisher's Version
Understanding Model Weaknesses: A Path to Strengthening DNN-Based Android Malware Detection
Haodong Li,
Xiao Cheng,
Yanjie Zhao,
Guosheng Xu,
Guoai Xu, and
Haoyu Wang
(Beijing University of Posts and Telecommunications, China; UNSW, Australia; Huazhong University of Science and Technology, China; Harbin Institute of Technology, China)
Android malware detection remains a critical challenge in cybersecurity research. Recent advancements leverage AI techniques, particularly deep neural networks (DNNs), to train a detection model, but their effectiveness is often compromised by the pronounced imbalance among malware families in commonly used training datasets. This imbalance leads to overfitting in dominant categories and poor performance in underrepresented ones, increasing predictive uncertainty for less common malware families. To address the suboptimal performance of many DNN models, we introduce MalTutor, a novel framework that enhances model robustness through an optimized training process. Our primary insight lies in transforming uncertainties from “liabilities” into “assets” by strategically incorporating them into DNN training methodologies. Specifically, we begin by evaluating the predictive uncertainty of DNN models throughout various training epochs, which guides our sample categorization. Incorporating Curriculum Learning strategies, we commence training with easy-to-learn samples with lower uncertainty, progressively incorporating difficult-to-learn ones with higher uncertainty. Our experimental results demonstrate that MalTutor significantly improves the performance of models trained on imbalanced datasets, increasing accuracy by 31.0%, elevating the F1 score by 138.8%, and specifically boosting the average accuracy in detecting various types of malicious apps by 133.9%. Our findings provide valuable insights into the potential benefits of incorporating uncertainty to improve the robustness of DNN models for prediction-oriented software engineering tasks.
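A minimal sketch of the curriculum idea described above, assuming predictive entropy as the uncertainty measure and an evenly staged schedule (both assumptions, not MalTutor's exact design): training starts on low-uncertainty samples and progressively admits higher-uncertainty ones.

```python
# Uncertainty-guided curriculum sketch: rank samples by predictive entropy and
# build training stages that progressively include harder (more uncertain) samples.
import math

def predictive_entropy(probs):
    """Uncertainty of one sample given its averaged class probabilities."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def curriculum_stages(samples, probs_per_sample, n_stages=3):
    scored = sorted(samples, key=lambda s: predictive_entropy(probs_per_sample[s]))
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = math.ceil(len(scored) * k / n_stages)
        stages.append(scored[:cutoff])       # each stage re-includes the easier samples
    return stages

probs = {"app_a": [0.97, 0.02, 0.01],        # confident prediction -> easy
         "app_b": [0.55, 0.30, 0.15],
         "app_c": [0.34, 0.33, 0.33]}        # near-uniform -> hard
for i, stage in enumerate(curriculum_stages(list(probs), probs), 1):
    print(f"stage {i}: {stage}")
```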
Publisher's Version
LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models
Lingxiao Tang,
Jiakun Liu,
Zhongxin Liu,
Xiaohu Yang, and
Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
The SZZ algorithm is the dominant technique for identifying bug-inducing commits and serves as a foundation for many software engineering studies, such as bug prediction and static code analysis, thereby enhancing software quality and facilitating better maintenance practices. Researchers have proposed many variants to enhance the SZZ algorithm’s performance since its introduction. The majority of them rely on static techniques or heuristic assumptions, making them easy to implement, but their performance improvements are often limited. Recently, a deep learning-based SZZ algorithm has been introduced to enhance the original SZZ algorithm. However, it requires complex preprocessing and is restricted to a single programming language. Additionally, while it enhances precision, it sacrifices recall. Furthermore, most variants overlook crucial information, such as commit messages and patch context, and are limited to bug-fixing commits involving deleted lines.
The emergence of large language models (LLMs) offers an opportunity to address these drawbacks. In this study, we investigate the strengths and limitations of LLMs and propose LLM4SZZ, which employs two approaches (i.e., rank-based identification and context-enhanced identification) to handle different types of bug-fixing commits. We determine which approach to adopt based on the LLM’s ability to comprehend the bug and identify whether the bug is present in a commit. The context-enhanced identification provides the LLM with more context and requires it to find the bug-inducing commit among a set of candidate commits. In rank-based identification, we ask the LLM to select buggy statements from the bug-fixing commit and rank them based on their relevance to the root cause. Experimental results show that LLM4SZZ outperforms all baselines across three datasets, improving F1-score by 6.9% to 16.0% without significantly sacrificing recall. Additionally, LLM4SZZ can identify many bug-inducing commits that the baselines fail to detect, accounting for 7.8%, 7.4% and 2.5% of the total bug-inducing commits across three datasets, respectively.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
ALMOND: Learning an Assembly Language Model for 0-Shot Code Obfuscation Detection
Xuezixiang Li,
Sheng Yu, and
Heng Yin
(University of California at Riverside, USA; Deepbits Technology, USA)
Code obfuscation is a technique used to protect software by making it difficult to understand and reverse engineer. However, it can also be exploited for malicious purposes such as code plagiarism or developing malicious programs. Learning-based techniques have achieved great success with the help of supervised learning and labeled training sets. However, when faced with real-life environments involving privately developed and undisclosed obfuscators, these supervised learning methods often raise concerns about generalizability and robustness when facing unseen and unknown classes of obfuscation techniques.
This paper presents ALMOND, a novel zero-shot approach for detecting code obfuscation in binary executables. Unlike previous supervised learning methods, ALMOND does not require labeled obfuscated samples for training. Instead, it leverages a language model pre-trained only on unobfuscated assembly code to identify the linguistic deviations introduced by obfuscation. The key innovation is the use of "error-perplexity" as a detection metric, which focuses on tokens the model fails to predict. Continuous Error Perplexity further enhances this to capture consecutive prediction errors characteristic of obfuscated sequences. Experiments show ALMOND achieves 96.3% accuracy on unseen obfuscation methods, outperforming supervised baselines. On real-world malware samples, it demonstrates an AUC of 0.869 and significantly outperforms the supervised-learning baseline. Our dataset, pre-trained model, and evaluation code will be available at https://github.com/palmtreemodel/ALMOND.
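Our simplified reading of an error-perplexity-style score is sketched below: perplexity is computed only over tokens the assembly language model mispredicts, so obfuscated code, which deviates from the learned "language", scores high without any labeled obfuscated training data. The exact metric and its continuous variant are defined in the paper.

```python
# Hedged sketch of an error-perplexity-style score (our simplified reading, not
# ALMOND's exact definition): perplexity restricted to mispredicted tokens.
import math

def error_perplexity(token_probs, predicted_ok):
    """token_probs: model probability assigned to each actual token;
    predicted_ok: whether the model's top-1 prediction matched that token."""
    errors = [p for p, ok in zip(token_probs, predicted_ok) if not ok]
    if not errors:
        return 1.0                      # nothing surprised the model
    nll = sum(-math.log(p + 1e-12) for p in errors) / len(errors)
    return math.exp(nll)

# Unobfuscated code: few, mild errors -> low score.
print(error_perplexity([0.9, 0.7, 0.2, 0.85], [True, True, False, True]))
# Obfuscated code: many low-probability tokens -> high score, flagged zero-shot.
print(error_perplexity([0.05, 0.4, 0.02, 0.1], [False, False, False, False]))
```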
Publisher's Version
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Niklas Risse,
Jing Liu, and
Marcel Böhme
(MPI-SP, Germany)
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem:
Given a function, does it contain a security flaw?
From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called.
In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function vulnerable if it was involved in a patch of an actual security flaw and confirmed to cause the program’s vulnerability. It is non-vulnerable otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed.
But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities.
We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and program analysis research.
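The spurious-correlation finding is easy to reproduce in spirit: a classifier that sees nothing but token counts can separate "vulnerable" from "non-vulnerable" functions on such datasets. The sketch below uses toy functions and scikit-learn purely for illustration; it is not the paper's dataset or model.

```python
# Word-count-only baseline sketch: no syntax, no data flow, just token counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

functions = [
    "char buf[10]; strcpy(buf, input);",         # labeled vulnerable
    "int n = len(src); memcpy(dst, src, n);",    # labeled vulnerable
    "if (idx < size) arr[idx] = value;",         # labeled non-vulnerable
    "return a + b;",                             # labeled non-vulnerable
]
labels = [1, 1, 0, 0]

bag_of_words = CountVectorizer(token_pattern=r"[A-Za-z_]+").fit_transform(functions)
clf = LogisticRegression().fit(bag_of_words, labels)

# High accuracy here reflects memorized word statistics, not any understanding of
# whether the code is actually exploitable in its calling context.
print(clf.score(bag_of_words, labels))
```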
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Tracezip: Efficient Distributed Tracing via Trace Compression
Zhuangbin Chen,
Junsong Pu, and
Zibin Zheng
(Sun Yat-sen University, China; Beijing University of Posts and Telecommunications, China)
Distributed tracing serves as a fundamental building block in the monitoring and testing of cloud service systems. To reduce computational and storage overheads, the de facto practice is to capture fewer traces via sampling. However, existing work faces a trade-off between the completeness of tracing and system overhead. On one hand, head-based sampling indiscriminately selects requests to trace when they enter the system, which may miss critical events. On the other hand, tail-based sampling first captures all requests and then selectively persists the edge-case traces, which entails the overheads related to trace collection and ingestion. Taking a different path, we propose Tracezip in this paper to enhance the efficiency of distributed tracing via trace compression. Our key insight is that there exists significant redundancy among traces, which results in repetitive transmission of identical data between services and the backend. We design a new data structure named Span Retrieval Tree (SRT) that continuously encapsulates such redundancy at the service side and transforms trace spans into a lightweight form. At the backend, the complete traces can be seamlessly reconstructed by retrieving the common data that are already delivered by previous spans. Tracezip includes a series of strategies to optimize the structure of SRT and a differential update mechanism to efficiently synchronize SRT between services and the backend. Our evaluation on microservices benchmarks, popular cloud service systems, and production trace data demonstrates that Tracezip can achieve substantial performance gains in trace collection with negligible overhead. We have implemented Tracezip inside the OpenTelemetry Collector, making it compatible with existing tracing APIs.
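As a rough, hypothetical illustration of the redundancy-elimination idea (not Tracezip's actual Span Retrieval Tree), the sketch below caches an attribute template per (service, operation) pair and ships only a key plus a delta; in the real system, the service-side and backend-side state would be kept consistent by the differential update mechanism described above.

```python
# Toy illustration of eliminating repeated span data: store a shared attribute
# template once and transmit only a key plus the fields that differ.
class SpanCache:
    def __init__(self):
        self.templates = {}                  # key -> attribute template

    def compress(self, span):
        key = (span["service"], span["operation"])
        template = self.templates.setdefault(key, dict(span))
        diff = {k: v for k, v in span.items() if template.get(k) != v}
        return key, diff                     # only the key and the delta travel

    def reconstruct(self, key, diff):
        full = dict(self.templates[key])
        full.update(diff)
        return full

cache = SpanCache()
span1 = {"service": "cart", "operation": "GET /items", "status": 200, "dur_ms": 12}
span2 = {"service": "cart", "operation": "GET /items", "status": 200, "dur_ms": 9}
key, diff = cache.compress(span1)            # first span establishes the template
key, diff = cache.compress(span2)            # second span ships only {"dur_ms": 9}
print(diff, cache.reconstruct(key, diff))
```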
Publisher's Version
Walls Have Ears: Demystifying Notification Listener Usage in Android Apps
Jiapeng Deng,
Tianming Liu,
Yanjie Zhao,
Chao Wang,
Lin Zhang, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Monash University, Australia; CNCERT-CC, China)
The Notification Listener Service (NLS) in Android allows third-party apps to monitor and process device notifications, enabling powerful features but also introducing security and privacy risks. Despite the special permission required to access NLS, it has been recurrently exploited by malicious actors. However, there is a lack of systematic investigation into NLS usage patterns and their security implications. In this paper, we propose NLRadar, a hybrid approach combining static analysis and LLMs to examine NLS usage in Android apps. We apply NLRadar to a large set of apps, including both malware and regular apps, to demystify NLS usage and uncover abuses. Our analysis reveals that NLS is heavily abused, with interesting discoveries such as apps insecurely storing social media messages, exploiting NLS for destructive competition or SMS credential stealing, and leveraging NLS to spread promotional messages or even malicious links. We also find undisclosed changes in NLS usage through app updates and inadequate disclosure in privacy policies. Our findings emphasize the need for more rigorous vetting of NLS usage and better developer education on responsible NLS practices.
Publisher's Version
Safe4U: Identifying Unsound Safe Encapsulations of Unsafe Calls in Rust using LLMs
Huan Li,
Bei Wang,
Xing Hu, and
Xin Xia
(Zhejiang University, China)
Rust is an emerging programming language that ensures safety through strict compile-time checks. A Rust function marked as unsafe indicates it has additional safety requirements (e.g., initialized, not null), known as contracts in the community. These unsafe functions can only be called within explicit unsafe blocks and the contracts must be guaranteed by the caller. To reuse and reduce unsafe code, the community recommends using safe encapsulation of unsafe calls (EUC) in practice. However, an EUC is unsound if any contract is not guaranteed and could lead to undefined behaviors in safe Rust, thus breaking Rust's safety promise. It is challenging to identify unsound EUCs with conventional techniques due to the limitation in cross-lingual comprehension of code and natural language. Large language models (LLMs) have demonstrated impressive capabilities, but their performance is unsatisfactory owing to the complexity of contracts and the lack of domain knowledge. To this end, we propose a novel framework, Safe4U, which incorporates LLMs, static analysis tools, and domain knowledge to identify unsound EUCs. Safe4U first utilizes static analysis tools to retrieve relevant context. Then, it decomposes the primitive contract description into several fine-grained classified contracts. Ultimately, Safe4U introduces domain knowledge and invokes the reasoning capability of LLMs to verify every fine-grained contract. The evaluation results show that Safe4U brings a general performance improvement and the fine-grained results are constructive for locating specific unsound sources. In real-world scenarios, Safe4U can identify 9 out of 11 unsound EUCs from CVE. Furthermore, Safe4U detected 22 new unsound EUCs in the most downloaded crates, 16 of which have been confirmed.
Publisher's Version
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
Ziyao Zhang,
Chong Wang,
Yanlin Wang,
Ensheng Shi,
Yuchi Ma,
Wanjun Zhong,
Jiachi Chen,
Mingzhi Mao, and
Zibin Zheng
(Sun Yat-sen University, China; Nanyang Technological University, Singapore; Huawei Cloud Computing Technologies, China)
Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language model (LLM)-based approaches have shown promising results and revolutionized the code generation task. Despite the promising performance, LLMs often generate content with hallucinations, especially in code generation scenarios that require handling complex contextual dependencies in practical development processes. Although a previous study has analyzed hallucinations in LLM-powered code generation, it is limited to standalone function generation. In this paper, we conduct an empirical study of the phenomena, mechanism, and mitigation of LLM hallucinations within more practical and complex development contexts in the repository-level generation scenario. First, we manually examine the code generation results from six mainstream LLMs to establish a hallucination taxonomy of LLM-generated code. Next, we elaborate on the phenomenon of hallucinations and analyze their distribution across different models. We then analyze the causes of hallucinations and identify four potential factors contributing to them. Finally, we propose an RAG-based mitigation method, which demonstrates consistent effectiveness across all studied LLMs.
Publisher's Version
Bridge the Islands: Pointer Analysis for Microservice Systems
Teng Zhang,
Yufei Liang,
Ganlin Li,
Tian Tan,
Chang Xu, and
Yue Li
(Nanjing University, China)
Microservice architecture has revolutionized enterprise software, providing scalability and flexibility by decomposing applications into loosely coupled services. However, this paradigm shift introduces unique challenges for pointer analysis, a foundational static analysis crucial for supporting various client analyses. Existing fundamental analyses, primarily designed for monolithic enterprise applications, fall short in handling complex service communications—such as remote procedure call and message-based communication—and essential programming paradigms, like dependency injection and web endpoint configuration. This paper introduces Micans, the first pointer analysis specifically crafted to address these challenges in microservice systems, capable of constructing comprehensive value flows across services. We extensively evaluated Micans on real-world benchmarks from multiple domains, focusing on its effectiveness in resolving service communications, constructing essential program information like call graphs, and supporting client analyses such as taint analysis. Micans consistently and significantly outperforms state-of-the-art approaches, demonstrating its capacity to handle complex cross-service communications and diverse programming paradigms. These results highlight Micans' potential as a robust foundational analysis, advancing static analysis capabilities to meet the complexities of modern microservices.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Program Feature-Based Benchmarking for Fuzz Testing
Miao Miao,
Sriteja Kummita,
Eric Bodden, and
Shiyi Wei
(University of Texas at Dallas, USA; Fraunhofer IEM, Germany; Heinz Nixdorf Institute at Paderborn University, Germany)
Fuzzing is a powerful software testing technique renowned for its effectiveness in identifying software vulnerabilities. Traditional fuzzing evaluations typically focus on overall fuzzer performance across a set of target programs, yet few benchmarks consider how fine-grained program features influence fuzzing effectiveness. To bridge this gap, we introduce FeatureBench, a novel benchmark designed to generate programs with configurable, fine-grained program features to enhance fuzzing evaluations. We reviewed 25 recent grey-box fuzzing studies, extracting 7 program features related to control-flow and data-flow that can impact fuzzer performance. Using these features, we generated a benchmark consisting of 153 programs controlled by 10 fine-grained configurable parameters. We evaluated 11 fuzzers using this benchmark, with each fuzzer representing either distinct claimed improvements or serving as a widely used baseline in fuzzing evaluations. The results indicate that fuzzer performance varies significantly based on the program features and their strengths, highlighting the importance of incorporating program characteristics into fuzzing evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
SoK: A Taxonomic Analysis of DeFi Rug Pulls: Types, Dataset, and Tool Assessment
Dianxiang Sun,
Wei Ma,
Liming Nie, and
Yang Liu
(Nanyang Technological University, Singapore; Singapore Management University, Singapore; Shenzhen Technology University, China)
Rug pulls present a critical threat in Decentralized Finance (DeFi), causing substantial financial losses and eroding ecosystem trust. Despite research advances, effective detection remains hampered by fragmented taxonomies, limited datasets, and inadequate tool evaluations. Through systematic analysis of academic and industry sources, we develop a comprehensive taxonomy of 35 distinct rug pull types, including 9 previously undocumented variants. Our analysis reveals significant detection gaps: existing datasets cover only 20% of known types, leading us to create an enhanced dataset of 2,391 instances that increases coverage to 82.9%. Evaluation of 13 detection tools shows substantial capability variation (25.7% to 62.9%), with 9 types completely undetectable. Most critically, tool performance degrades significantly when facing complex attacks, with maximum detection rates dropping from 55.6% for single-vector cases to 31.3% for compound scenarios. These findings provide essential insights for developing more robust security testing approaches for smart contract vulnerabilities in decentralized systems.
Publisher's Version
Recurring Vulnerability Detection: How Far Are We?
Yiheng Cao,
Susheng Wu,
Ruisi Wang,
Bihuan Chen,
Yiheng Huang,
Chenhao Lu,
Zhuotong Zhou, and
Xin Peng
(Fudan University, China)
With the rapid development of open-source software, code reuse has become a common practice to accelerate development. However, it can cause reusing projects to inherit vulnerabilities from the original code; these recur in the reusing projects and are known as recurring vulnerabilities (RVs). Traditional general-purpose vulnerability detection approaches struggle with scalability and adaptability, while learning-based approaches are often constrained by limited training datasets and are less effective against unseen vulnerabilities. Though specific recurring vulnerability detection (RVD) approaches have been proposed, their effectiveness across various RV characteristics remains unclear.
In this paper, we conduct a large-scale empirical study using a newly constructed RV dataset containing 4,569 RVs, achieving a 953% expansion over prior RV datasets. Our study analyzes the characteristics of RVs, evaluates the effectiveness of the state-of-the-art RVD approaches, and investigates the root causes of false positives and false negatives, yielding key insights. Inspired by these insights, we design AntMan, a novel RVD approach that identifies both explicit and implicit call relations with modified functions, then employs inter-procedural taint analysis and intra-procedural dependency slicing within those functions to generate comprehensive signatures, and finally incorporates flexible matching to detect RVs. Our evaluation demonstrates AntMan's effectiveness, generality, and practical usefulness in RVD. AntMan has detected 4,593 RVs, with 307 confirmed by developers, and identified 73 new 0-day vulnerabilities across 15 projects, receiving 5 CVE identifiers.
Publisher's Version
ConTested: Consistency-Aided Tested Code Generation with LLM
Jinhao Dong,
Jun Sun,
Wenjie Zhang,
Jin Song Dong, and
Dan Hao
(Peking University, China; Singapore Management University, Singapore; National University of Singapore, Singapore)
Recent advancements in large language models (LLMs) have significantly improved code generation, which generates code snippets automatically based on natural language requirements. Despite achieving state-of-the-art performance, LLMs often struggle to generate accurate and reliable code, requiring developers to spend substantial effort debugging and evaluating the generated output. Researchers have proposed leveraging consistency to select code that passes more tests (inter-consistency) and demonstrates consistent behavior across more counterparts (intra-consistency). However, since the tests themselves are also generated by LLMs, relying on majority voting based on incorrect tests leads to unreliable results. To address this, we propose a lightweight interaction framework that incorporates user feedback to effectively guide consistency. Our results demonstrate that, with minimal human effort, performance can be significantly improved. In each iteration, we introduce a rank-correct-fix co-evolution process between code and tests. This process iteratively enhances the quality of both, making the consistency voting between code and tests more reliable.
We evaluate ConTested through extensive experiments, demonstrating its effectiveness across multiple LLMs, including GPT-3.5 and GPT-4o. Our results show improvements of 32.9% over GPT-3.5 and 16.97% over GPT-4o. Additionally, ConTested achieves an 11.1% improvement over the SOTA post-processing technique, MPSC. This improvement is achieved with only a 4-round interaction with users, requiring minimal user effort. A user study further confirms the feasibility and cost-effectiveness of ConTested, highlighting its ability to enhance code generation without introducing substantial overhead.
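A minimal sketch of consistency-based selection, as a simplification of the idea rather than ConTested's full rank-correct-fix loop: each candidate program is scored by how many generated tests it passes (inter-consistency) and how closely its pass/fail behavior agrees with the other candidates (intra-consistency).

```python
# Consistency-voting sketch: pick the candidate that passes the most tests and
# behaves most like the other candidates on those tests.
def select_candidate(results):
    """results[i][j] is True if candidate i passes generated test j."""
    best, best_score = None, float("-inf")
    for i, row in enumerate(results):
        inter = sum(row)                                   # tests passed
        intra = sum(sum(a == b for a, b in zip(row, other))
                    for k, other in enumerate(results) if k != i)
        score = inter + intra
        if score > best_score:
            best, best_score = i, score
    return best

# Candidate 1 passes the most tests and agrees with the majority -> selected.
print(select_candidate([[True, False, True],
                        [True, True, True],
                        [True, True, False]]))
```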
Publisher's Version
Robust Vulnerability Detection across Compilations: LLVM-IR vs. Assembly with Transformer Model
Rony Shir,
Priyanka Surve,
Yuval Elovici, and
Asaf Shabtai
(Ben-Gurion University of the Negev, Israel)
Detecting vulnerabilities in binary files is a challenging task in cybersecurity, particularly when source code is unavailable and the compilation process and its parameters are unknown. Existing deep learning-based detection methods often rely on knowing a binary’s specific compilation settings, which may limit their ability to perform well on other types of binaries. In this research, we provide a thorough comparison of assembly representation and LLVM-IR to identify which representation is more robust and suitable when compilation parameters are unknown. The choice of representation significantly influences detection accuracy. Another contribution of this paper is the use of CodeBERT, a transformer-based model, as a classification tool for detecting vulnerabilities in scenarios where the compilation process is unknown. This study applies transformer models to the task of multi-class vulnerability detection in the LLVM-IR domain, with a focus on binary-derived representations. While recent research has explored the use of transformers for vulnerability analysis in source code and raw binary instruction streams, systematic evaluation as a classifier at the LLVM-IR level remains limited. Prior work has commonly relied on RNN-based methods, which are considered state-of-the-art for this task; however, these models struggle to capture long-range dependencies effectively. To address this limitation, we extend transformer-based classification to LLVM-IR produced from binaries and provide a comprehensive evaluation in this setting. Our results highlight the potential of this approach to strengthen system security across diverse binary configurations.
Publisher's Version
RouthSearch: Inferring PID Parameter Specification for Flight Control Program by Coordinate Search
Siao Wang,
Zhen Dong,
Hui Li,
Liwei Shen,
Xin Peng, and
Dongdong She
(Fudan University, China; Hong Kong University of Science and Technology, China)
Flight control programs are widely used in unmanned aerial vehicles (UAVs) to manage and maintain UAVs’ flying behaviors dynamically. These flight control programs include a PID control module that takes three user-configurable PID parameters: Proportional (P), Integral (I), and Derivative (D). Users can also adjust these PID parameters during flight to suit the needs of various flight tasks. However, flight control programs do not have sufficient safety checks on the user-provided PID parameters, leading to a severe class of UAV vulnerability: input validation bugs. These occur when the user misconfigures PID parameters and causes the UAV to enter a dangerous state, such as deviation from the expected path, loss of control, or even a crash.
Prior works use random testing approaches like fuzzing to identify invalid PID parameters from user input. However, they are not effective in the three-dimensional search space of PID parameters. Meanwhile, each dynamic execution of the UAV test is very expensive, further affecting the performance of random testing.
In this work, we address the problem of PID parameter misconfiguration by combining the Routh-Hurwitz stability criterion with coordinate search, introducing a method called RouthSearch. Instead of identifying misconfigured PID parameters in an ad-hoc fashion, RouthSearch principledly determines valid ranges for three-dimensional PID parameters. We first leverage the Routh-Hurwitz Criterion to identify a theoretical PID parameter boundary. We then refine the boundary using an efficient coordinate search. The valid range of three-dimensional PID parameters determined by RouthSearch can filter out misconfigured PID parameters from users during flight and further help to discover logical bugs in popular flight control programs.
We evaluated RouthSearch across eight flight modes in two popular flight control programs, PX4 and ArduPilot. The results show that RouthSearch can determine the valid ranges of the three-dimensional PID parameters with an accuracy of 92.0% when compared to the ground truth. In terms of the total number of misconfigured PID parameters, RouthSearch discovers 3,853 sets of PID misconfigurations within 48 hours, while the state-of-the-art work PGFuzz only discovers 449 sets, meaning RouthSearch outperforms prior work by 8.58 times. Additionally, our method helps to detect three bugs in ArduPilot and PX4.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Program Analysis Combining Generalized Bit-Level and Word-Level Abstractions
Guangsheng Fan,
Liqian Chen,
Banghu Yin,
Wenyu Zhang,
Peisen Yao, and
Ji Wang
(National University of Defense Technology, China; Zhejiang University, China)
Abstract interpretation is widely used to determine programs' numerical properties. However, current abstract domains primarily focus on mathematical semantics, which do not fully capture the complexities of real-world programs relying on machine integer semantics and involving extensive bit-vector operations. This paper presents a solution that combines a bit-level abstraction and a word-level abstraction to capture machine integer semantics. First, we generalize the bit-level abstraction used in the Linux eBPF verifier for determining the known and unknown bits of real-world programs, by supplementing it with all the operations required of a standard abstract domain. Based on this abstraction, we design an abstract domain that is signedness-aware and simultaneously retains both the above bit-level and the word-level bound information. These two levels of information cooperate via a standard reduced product operation to improve analysis precision. We implement the proposed domains in the Crab analyzer and the out-of-kernel eBPF verifier PREVAL. Experiments demonstrate their effectiveness in analyzing SV-COMP benchmark programs, assisting hardware designs, and eBPF verification.
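As a concrete picture of the bit-level abstraction being generalized, here is a minimal Python sketch modeled on the tnum (tristate number) representation of the Linux eBPF verifier, in which value holds the known bit values and mask marks the unknown bits; the join operation shown is an illustrative sound upper bound rather than the paper's full domain:

from dataclasses import dataclass

@dataclass(frozen=True)
class Tnum:
    value: int  # bits known to be 1 (where mask is 0)
    mask: int   # bits whose value is unknown

    def __and__(self, other):
        alpha = self.value | self.mask
        beta = other.value | other.mask
        v = self.value & other.value
        return Tnum(v, alpha & beta & ~v)

    def __or__(self, other):
        v = self.value | other.value
        mu = self.mask | other.mask
        return Tnum(v, mu & ~v)

def join(a: Tnum, b: Tnum) -> Tnum:
    """Least upper bound: bits that disagree or are unknown in either input become unknown."""
    mu = a.mask | b.mask | (a.value ^ b.value)
    return Tnum(a.value & b.value & ~mu, mu)

x = Tnum(0b1000, 0b0011)   # 0b10??: the two low bits are unknown
y = Tnum(0b1010, 0b0001)   # 0b101?
print(x & y, x | y, join(x, y))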
Publisher's Version
Bridging the Gaps between Graph Neural Networks and Data-Flow Analysis: The Closer, the Better
Qingchen Yu,
Xin Liu,
Qingguo Zhou, and
Chunming Wu
(Zhejiang University, China; Lanzhou University, China)
Recent advances in applying deep neural networks to programming tasks have achieved remarkable success in practice, prompting interest in exploring how well these models can perform traditional program analysis techniques. Data-flow analysis (DFA), a classic and well-established approach for analyzing programs, presents an opportunity to assess the capabilities of neural networks in this domain. Given the structural similarities between DFA and Graph Neural Networks (GNNs), we explore the extent to which GNNs can effectively model the DFA algorithm. Building on the concept of algorithmic alignment from Neural Algorithmic Reasoning (NAR), we identify two key challenges: the noninterference property of the bit-vectors used in DFA and the complex handling of external information at different stages of the algorithm. Addressing these gaps, we propose three GNN architectures — DFA-GNN−, DFA-GNN, and DFA-GNN+ — that progressively align with the DFA algorithm. Our evaluations emphasize the generalization capacity of these models, particularly in scenarios where training occurs on smaller samples while testing on much larger inputs. Results demonstrate that GNNs with higher algorithmic alignment, such as DFA-GNN+, exhibit superior generalization and sample efficiency, accurately scaling to 10 times larger inputs with minimal training data. Notably, we show that GNNs trained with only input-output pairs can perform competitively with models trained using full execution trajectory supervision, a common practice in recent NAR studies. This finding highlights the efficiency and robustness of GNNs in reasoning tasks when algorithmically aligned with the target algorithm.
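For reference, the target algorithm that the architectures are aligned with is classic iterative bit-vector data-flow analysis; the minimal Python sketch below runs reaching definitions to a fixpoint on a toy CFG (invented gen/kill sets), with each iteration corresponding loosely to one round of GNN message passing:

def reaching_definitions(succ, gen, kill, num_defs):
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)
    full_mask = (1 << num_defs) - 1
    IN = {n: 0 for n in nodes}
    OUT = {n: gen[n] for n in nodes}
    changed = True
    while changed:                           # iterate to a fixpoint
        changed = False
        for n in nodes:
            new_in = 0
            for p in pred[n]:
                new_in |= OUT[p]             # meet = bitwise OR (may analysis)
            new_out = gen[n] | (new_in & ~kill[n] & full_mask)
            if new_in != IN[n] or new_out != OUT[n]:
                IN[n], OUT[n] = new_in, new_out
                changed = True
    return IN, OUT

# Toy CFG: 0 -> 1 -> 2 with a back edge 2 -> 1; definitions d0, d1, d2 as bits.
succ = {0: [1], 1: [2], 2: [1]}
gen = {0: 0b001, 1: 0b010, 2: 0b100}
kill = {0: 0b000, 1: 0b100, 2: 0b010}
print(reaching_definitions(succ, gen, kill, 3))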
Publisher's Version
Info
ACM SIGSOFT Distinguished Paper Award
AudioTest: Prioritizing Audio Test Cases
Yinghua Li,
Xueqi Dang,
Wendkûuni C. Ouédraogo,
Jacques Klein, and
Tegawendé F. Bissyandé
(University of Luxembourg, Luxembourg)
Audio classification systems, powered by deep neural networks (DNNs), are integral to various applications that impact daily lives, like voice-activated assistants. Ensuring the accuracy of these systems is crucial since inaccuracies can lead to significant security issues and user mistrust. However, testing audio classifiers presents a significant challenge: the high manual labeling cost for annotating audio test inputs. Test input prioritization has emerged as a promising approach to mitigate this labeling cost issue. It prioritizes potentially misclassified tests, allowing for the early labeling of such critical inputs and making debugging more efficient. However, when applying existing test prioritization methods to audio-type test inputs, there are some limitations: 1) Coverage-based methods are less effective and efficient than confidence-based methods. 2) Confidence-based methods rely only on prediction probability vectors, ignoring the unique characteristics of audio-type data. 3) Mutation-based methods lack designed mutation operations for audio data, making them unsuitable for audio-type test inputs. To overcome these challenges, we propose AudioTest, a novel test prioritization approach specifically designed for audio-type test inputs. The core premise is that tests closer to misclassified samples are more likely to be misclassified. Based on the special characteristics of audio-type data, AudioTest generates four types of features: time-domain features, frequency-domain features, perceptual features, and output features. For each test, AudioTest concatenates its four types of features into a feature vector and applies a carefully designed feature transformation strategy to bring misclassified tests closer in space. AudioTest leverages a trained model to predict the probability of misclassification of each test based on its transformed vectors and ranks all the tests accordingly. We evaluate the performance of AudioTest utilizing 96 subjects, encompassing natural and noisy datasets. We employed two classical metrics, Percentage of Fault Detection (PFD) and Average Percentage of Fault Detected (APFD), for our evaluation. The results demonstrate that AudioTest outperforms all the compared test prioritization approaches in terms of both PFD and APFD. The average improvement of AudioTest compared to the baseline test prioritization methods ranges from 12.63% to 54.58% on natural datasets and from 12.71% to 40.48% on noisy datasets.
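The following minimal Python sketch conveys the general recipe rather than AudioTest itself: compute a few simple time- and frequency-domain features per audio test, concatenate them with the classifier's output probabilities, fit a ranker on labeled tests, and prioritize the tests most likely to be misclassified; the particular features, the synthetic data, and the logistic-regression ranker are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_features(wave: np.ndarray, probs: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(wave ** 2))                      # time-domain energy
    zcr = np.mean(np.abs(np.diff(np.sign(wave))) > 0)      # zero-crossing rate
    spectrum = np.abs(np.fft.rfft(wave))
    centroid = (spectrum * np.arange(len(spectrum))).sum() / (spectrum.sum() + 1e-9)
    return np.concatenate([[rms, zcr, centroid], probs])   # plus output features

# Synthetic stand-in data: waveforms, model output probabilities, and
# 0/1 labels marking whether the DNN misclassified each test.
rng = np.random.default_rng(0)
waves = [rng.normal(size=16000) for _ in range(40)]
probs = rng.dirichlet(np.ones(10), size=40)
X = np.stack([audio_features(w, p) for w, p in zip(waves, probs)])
y = rng.integers(0, 2, size=40)
ranker = LogisticRegression(max_iter=1000).fit(X, y)

# Prioritization: label the tests most likely to be misclassified first.
order = np.argsort(-ranker.predict_proba(X)[:, 1])
print(order[:5])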
Publisher's Version
QTRAN: Extending Metamorphic-Oracle Based Logical Bug Detection Techniques for Multiple-DBMS Dialect Support
Li Lin,
Qinglin Zhu,
Hongqiao Chen,
Zhuangda Wang,
Rongxin Wu, and
Xiaoheng Xie
(Xiamen University, China; Ant Group, China)
Metamorphic testing is a widely used method to detect logical bugs in Database Management Systems (DBMSs), referred to herein as MOLT (Metamorphic-Oracle based Logical Bug Detection Technique). This technique involves constructing SQL statement pairs, including original and mutated queries, and assessing whether the execution results conform to predefined metamorphic relations to detect logical bugs. However, current MOLTs rely heavily on specific DBMS grammar to generate valid SQL statement pairs, which makes it challenging to adapt these techniques to various DBMSs with different grammatical structures. As a result, only a few popular DBMSs, such as PostgreSQL, MySQL, and MariaDB, are supported by existing MOLTs, with extensive manual effort required to expand to other DBMSs. Given that many DBMSs remain inadequately tested, there is a pressing need for a method that enables effortless extension of MOLTs across diverse DBMSs.
In this paper, we propose QTRAN, a novel LLM-powered approach that automatically extends existing MOLTs to various DBMSs. Our key insight is to use LLMs to translate the SQL statement pairs of existing MOLTs to target DBMSs for metamorphic testing. To address the challenges of LLMs’ limited understanding of dialect differences and metamorphic mechanisms, we propose a two-phase approach comprising the transfer and mutation phases. QTRAN tackles these challenges by drawing inspiration from the developer’s process of creating a MOLT, which includes understanding the grammar of the target DBMS to generate original queries and employing a mutator for customized mutations. The transfer phase is designed to identify potential dialects and leverage information from SQL documents to enhance query retrieval, enabling LLMs to translate original queries across different DBMSs accurately. During the mutation phase, we gather SQL statement pairs from existing MOLTs to fine-tune the pretrained model, tailoring it specifically for mutation tasks. Then we employ the customized LLM to mutate the translated original queries, preserving the defined relationships necessary for metamorphic testing.
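Once an (original, mutated) pair exists for a target DBMS, the oracle check itself is straightforward; the minimal Python sketch below checks a TLP-style metamorphic relation against sqlite3, with a toy table and illustrative queries standing in for QTRAN's output:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (5,)])

original = "SELECT COUNT(*) FROM t"
partitions = [                         # the mutated side: a three-way partition
    "SELECT COUNT(*) FROM t WHERE x > 2",
    "SELECT COUNT(*) FROM t WHERE NOT (x > 2)",
    "SELECT COUNT(*) FROM t WHERE (x > 2) IS NULL",
]

lhs = conn.execute(original).fetchone()[0]
rhs = sum(conn.execute(q).fetchone()[0] for q in partitions)
assert lhs == rhs, f"metamorphic relation violated: {lhs} != {rhs}"  # would signal a logical bug
print("relation holds:", lhs, "==", rhs)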
We implement our approach as a tool and apply it to extend four state-of-the-art MOLTs for eight DBMSs: MySQL, MariaDB, TiDB, PostgreSQL, SQLite, MonetDB, DuckDB, and ClickHouse. The evaluation results show that over 99% of the SQL statement pairs transferred by QTRAN satisfy the metamorphic relations required for testing. Furthermore, we have detected 24 logical bugs among these DBMSs, with 16 confirmed as unique and previously unknown bugs. We believe that the generality of QTRAN can significantly enhance the reliability of DBMSs.
Publisher's Version
GUIPilot: A Consistency-Based Mobile GUI Testing Approach for Detecting Application-Specific Bugs
Ruofan Liu,
Xiwen Teoh,
Yun Lin,
Guanjie Chen,
Ruofei Ren,
Denys Poshyvanyk, and
Jin Song Dong
(Shanghai Jiao Tong University, China; National University of Singapore, Singapore; College of William and Mary, USA)
GUI testing is crucial for ensuring the reliability of mobile applications. State-of-the-art GUI testing approaches are successful in exploring more application scenarios and discovering general bugs such as application crashes. However, industrial GUI testing also needs to investigate application-specific bugs such as deviations in screen layout, widget position, or GUI transition from the GUI design mock-ups created by the application designers. These mock-ups specify the expected screens, widgets, and their respective behaviors. Validating the consistency between the GUI design and the implementation is labor-intensive and time-consuming, yet, this validation step plays an important role in industrial GUI testing.
In this work, we propose GUIPilot, an approach for detecting inconsistencies between mobile designs and their implementations. The mobile design usually consists of design mock-ups that specify (1) the expected screen appearances (e.g., widget layouts, colors, and shapes) and (2) the expected screen behaviors, regarding how one screen can transition into another (e.g., labeled widgets with textual description). Given a design mock-up and the implementation of its application, GUIPilot reports both their screen inconsistencies and their process inconsistencies. On the one hand, GUIPilot detects screen inconsistencies by abstracting every screen into a widget container where each widget is represented by its position, width, height, and type. By defining the partial order of widgets and the costs of replacing, inserting, and deleting widgets in a screen, we convert the screen-matching problem into an optimizable widget alignment problem. On the other hand, we translate the specified GUI transition into stepwise actions on the mobile screen (e.g., click, long-press, input text on some widgets). To this end, we propose a visual prompt for the vision-language model to infer widget-specific actions on the screen. By this means, we can validate the presence or absence of expected transitions in the implementation. Our extensive experiments on 80 mobile applications and 160 design mock-ups show that (1) GUIPilot achieves 99.8% precision and 98.6% recall in detecting screen inconsistencies, outperforming the state-of-the-art approach GVT by 66.2% and 56.6%, respectively, and (2) GUIPilot reports zero errors in detecting process inconsistencies. Furthermore, our industrial case study on applying GUIPilot to a trading mobile application shows that GUIPilot detected nine application bugs, and all the bugs were confirmed by the original application experts. Our code is available at https://github.com/code-philia/GUIPilot.
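A minimal Python sketch of the widget-alignment idea follows, assuming each widget is abstracted as (x, y, width, height, type) and using a generic assignment solver; the cost weights and widget lists are invented and not GUIPilot's:

import numpy as np
from scipy.optimize import linear_sum_assignment

def widget_cost(a, b, type_penalty=100.0):
    ax, ay, aw, ah, at = a
    bx, by, bw, bh, bt = b
    geo = abs(ax - bx) + abs(ay - by) + abs(aw - bw) + abs(ah - bh)
    return geo + (0.0 if at == bt else type_penalty)

design = [(0, 0, 100, 40, "button"), (0, 50, 200, 40, "text")]
impl   = [(2, 52, 198, 40, "text"), (0, 0, 100, 44, "button"), (0, 120, 80, 80, "image")]

cost = np.array([[widget_cost(d, i) for i in impl] for d in design])
rows, cols = linear_sum_assignment(cost)          # optimal widget alignment
for r, c in zip(rows, cols):
    print(f"design widget {r} -> impl widget {c} (cost {cost[r, c]:.1f})")
# Implementation widgets left unmatched (here the extra "image") would be
# reported as screen inconsistencies.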
Publisher's Version
Video
Info
Testing the Fault-Tolerance of Multi-sensor Fusion Perception in Autonomous Driving Systems
Haoxiang Tian,
Wenqiang Ding,
Xingshuo Han,
Guoquan Wu,
An Guo,
Junqi Zhang,
Wei Chen,
Jun Wei, and
Tianwei Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanyang Technological University, Singapore; Nanjing Institute of Software, China; Continental-NTU Corporate Lab, Singapore; University of Chinese Academy of Sciences Nanjing, China; University of Science and Technology of China, China)
Production-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on the multi-sensor fusion (MSF) strategy to perceive their surroundings. This strategy increases the perception robustness by combining the respective strengths of the cameras and LiDAR, directly affecting the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, both cameras and LiDAR are prone to various faults that can significantly impact the decision-making and subsequent behaviors of ADSs. It is important to thoroughly test the robustness of MSF during development. Existing testing methods only focus on the identification of corner cases that MSF fails to detect. However, there is still a lack of investigation on how sensor faults affect the system-level behaviors of ADSs.
To address this gap, we present FADE, the first testing methodology to comprehensively assess the fault tolerance of MSF perception-based ADSs. We systematically build fault models for both cameras and LiDAR in AVs and inject these faults into MSF-based ADSs to test their behaviors in various testing scenarios. To effectively and efficiently explore the parameter spaces of sensor fault models, we design a feedback-guided differential fuzzer to uncover safety violations of ADSs caused by the injected faults. We evaluate FADE on Baidu Apollo, a representative and practical industrial ADS. The evaluation results demonstrate the practical values of FADE, and disclose some useful findings. We further conduct physical experiments using a Baidu Apollo 6.0 EDU AV to validate these findings in real-world settings.
Publisher's Version
Published Artifact
Artifacts Available
KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis
Zhijie Liu,
Qiyi Tang,
Sen Nie,
Shi Wu,
Liang Feng Zhang, and
Yutian Tang
(ShanghaiTech University, China; Tencent Security Keen Lab, China; University of Glasgow, UK)
Binary code similarity analysis (BCSA) is a crucial research area in many fields such as cybersecurity. Specifically, function-level diffing tools are the most widely used in BCSA: they perform function matching one by one to evaluate the similarity between binary programs. However, such methods have high time complexity, making them unscalable in large-scale scenarios (e.g., 1/n-to-n search). Towards effective and efficient program-level BCSA, we propose KEENHash, a novel hashing approach that hashes binaries into program-level representations through large language model (LLM)-generated function embeddings. KEENHash condenses a binary into one compact and fixed-length program embedding using K-Means and Feature Hashing, allowing us to do effective and efficient large-scale program-level BCSA, surpassing the previous state-of-the-art methods. The experimental results show that KEENHash is at least 215 times faster than the state-of-the-art function matching tools while maintaining effectiveness. Furthermore, in a large-scale scenario with 5.3 billion similarity evaluations, KEENHash takes only 395.83 seconds, while these tools would cost at least 56 days. We also evaluate KEENHash on the program clone search of large-scale BCSA across extensive datasets in 202,305 binaries. Compared with 4 state-of-the-art methods, KEENHash outperforms all of them by at least 23.16%, and displays remarkable superiority over them in the large-scale BCSA security scenario of malware detection.
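The minimal Python sketch below illustrates the spirit of the construction: cluster per-function embeddings with K-Means, then feature-hash the cluster-membership counts of a given binary into one compact, fixed-length program embedding; the dimensions, the random stand-in embeddings, and the cosine comparison are assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import FeatureHasher

rng = np.random.default_rng(0)
corpus_funcs = rng.normal(size=(500, 64))          # stand-ins for LLM function embeddings
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(corpus_funcs)

def program_embedding(function_embs: np.ndarray, dim: int = 256) -> np.ndarray:
    clusters = kmeans.predict(function_embs)
    counts = {f"c{c}": float(n) for c, n in zip(*np.unique(clusters, return_counts=True))}
    return FeatureHasher(n_features=dim).transform([counts]).toarray()[0]

prog_a = program_embedding(rng.normal(size=(40, 64)))
prog_b = program_embedding(rng.normal(size=(55, 64)))
cos = prog_a @ prog_b / (np.linalg.norm(prog_a) * np.linalg.norm(prog_b) + 1e-9)
print("program-level similarity:", round(float(cos), 3))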
Publisher's Version
Info
Enhancing Vulnerability Detection via Inter-procedural Semantic Completion
Bozhi Wu,
Chengjie Liu,
Zhiming Li,
Yushi Cao,
Jun Sun, and
Shang-Wei Lin
(Nanyang Technological University, Singapore; Peking University, China; Singapore Management University, Singapore)
Inspired by advances in deep learning, numerous learning-based approaches for vulnerability detection have emerged, primarily operating at the function level for scalability. However, this design choice has a critical limitation: many vulnerabilities span multiple functions, causing function-level approaches to lose the semantics of called functions and fail to capture true vulnerability patterns. To address this issue, we propose VulnSC, a novel framework designed to enhance learning-based approaches by complementing inter-procedural semantics. VulnSC retrieves the source code of called functions for datasets and leverages large language models (LLMs) with well-designed prompts to generate summaries for these functions. The datasets, enhanced with these summaries, are fed into neural networks for improved vulnerability detection. VulnSC is the first general framework to integrate inter-procedural semantics into existing learning-based approaches for vulnerability detection while maintaining scalability. We evaluate VulnSC on four state-of-the-art learning-based approaches using two widely used datasets, and our experimental results demonstrate that VulnSC significantly enhances detection performance with minimal additional computational overhead.
Publisher's Version
Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG
Zhiyu Zhang,
Longxing Li,
Ruigang Liang, and
Kai Chen
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Most coverage-guided kernel fuzzers test operating system kernels based on syscall sequence synthesis. However, some syscalls remain rarely or never covered over the course of fuzzing (called low frequency syscalls, LFS), meaning the relevant code branches remain unexplored. This is due to the complex dependencies of the LFS and mutation uncertainty, which make it difficult for fuzzers to generate the corresponding syscall sequences. Since many kernel fuzzers can dynamically learn syscall dependencies from the current corpus based on the choice-table mechanism, providing comprehensive and high-quality seeds could help fuzzers cover LFS. However, constructing such seeds relies heavily on expert experience to resolve the syscall dependencies.
In this paper, we propose SyzGPT, the first kernel fuzzing framework to automatically generate effective seeds for LFS via Large Language Models (LLMs). We leverage a dependency-based retrieval-augmented generation (DRAG) method to unlock the potential of LLMs and design a series of steps to improve the effectiveness of the generated seeds. First, SyzGPT automatically extracts syscall dependencies from the existing documentation via an LLM. Second, SyzGPT retrieves programs from the fuzzing corpus based on the dependencies to construct adaptive context for the LLM. Last, SyzGPT periodically generates and repairs seeds with feedback to enrich the fuzzing corpus for LFS. We propose a novel set of evaluation metrics for seed generation in the kernel domain. Our evaluation shows that SyzGPT can generate seeds with a high valid rate of 87.84% and can be extended to offline and fine-tuned LLMs. Compared to seven state-of-the-art kernel fuzzers, SyzGPT improves code coverage by 17.73%, LFS coverage by 58.00%, and vulnerability detection by 323.22% on average. In addition, SyzGPT independently discovered 26 unknown kernel bugs (10 of which are LFS-related), with 11 confirmed.
Publisher's Version
Published Artifact
Artifacts Available
Structure-Aware, Diagnosis-Guided ECU Firmware Fuzzing
Qicai Chen,
Kun Hu,
Sichen Gong,
Bihuan Chen,
Zikui Kong,
Haowen Jiang,
Bingkun Sun,
You Lu, and
Xin Peng
(Fudan University, China; ANHUI GuarDrive Safety Technology, China)
Electronic Control Units (ECUs), providing a wide range of functions from basic control to safety-critical features, play a critical role in modern vehicles. Fuzzing has emerged as an effective approach to ensure the functional safety and automotive security of ECU firmware. However, existing fuzzing approaches focus on the inputs from other ECUs through external buses (e.g., CAN), but neglect the inputs from internal peripherals through on-board buses (e.g., SPI). Due to the restricted input space exploration, they fail to comprehensively fuzz ECU firmware. Moreover, existing fuzzing approaches often lack visibility into ECU firmware’s internal states but rely on limited feedback (e.g., message timeouts or hardware indicators), hindering their effectiveness.
To address these limitations, we propose a structure-aware, diagnosis-guided framework, EcuFuzz, to comprehensively and effectively fuzz ECU firmware. Specifically, EcuFuzz simultaneously considers external buses (i.e., CAN) and on-board buses (i.e., SPI). It leverages the structure of CAN and SPI to effectively mutate CAN messages and SPI sequences, and incorporates a dual-core microcontroller-based peripheral emulator to handle real-time SPI communication. In addition, EcuFuzz implements a new feedback mechanism to guide the fuzzing process. It leverages automotive diagnostic protocols to collect ECUs’ internal states, i.e., error-related variables, trouble codes, and exception contexts. Our compatibility evaluation on ten ECUs from three major Tier 1 automotive suppliers has indicated that our framework is compatible with nine ECUs. Our effectiveness evaluation on three representative ECUs has demonstrated that our framework detects nine previously unknown safety-critical faults, which have been patched by technicians from the suppliers.
Publisher's Version
FANDANGO: Evolving Language-Based Testing
José Antonio Zamudio Amaya,
Marius Smytzek, and
Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
Language-based fuzzers leverage formal input specifications (languages) to generate arbitrarily large and diverse sets of valid inputs for a program under test. Modern language-based test generators combine grammars and constraints to satisfy syntactic and semantic input constraints. ISLA, the leading input generator in that space, uses symbolic constraint solving to solve input constraints. Using solvers places ISLA among the most precise fuzzers but also makes it slow.
In this paper, we explore search-based testing as an alternative to symbolic constraint solving. We employ a genetic algorithm that iteratively generates candidate inputs from an input specification, evaluates them against the defined constraints, and evolves a population of inputs through syntactically valid mutations, retaining those with superior fitness until the semantic input constraints are met. This evolutionary procedure, analogous to natural genetic evolution, leads to progressively improved inputs that cover both semantics and syntax. This change boosts the efficiency of language-based testing: in our experiments, compared to ISLA, our search-based FANDANGO prototype is faster by one to three orders of magnitude without sacrificing precision.
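The minimal Python sketch below shows the search loop in miniature: generate inputs from a (toy) specification, score them by how many semantic constraints they satisfy, and evolve the population with syntactically valid mutations; the grammar and constraints are invented and much simpler than FANDANGO's specification language:

import random

random.seed(1)

def generate():                       # toy spec: <pair> ::= <int> "," <int>
    return [random.randint(0, 100), random.randint(0, 100)]

def mutate(ind):                      # replace one field with a fresh derivation
    out = list(ind)
    out[random.randrange(2)] = random.randint(0, 100)
    return out

constraints = [
    lambda p: p[0] < p[1],            # semantic constraint 1
    lambda p: (p[0] + p[1]) % 10 == 0 # semantic constraint 2
]
fitness = lambda p: sum(c(p) for c in constraints)

population = [generate() for _ in range(50)]
for _ in range(100):
    if any(fitness(p) == len(constraints) for p in population):
        break                         # all semantic constraints satisfied
    population.sort(key=fitness, reverse=True)
    survivors = population[:25]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(25)]

print([p for p in population if fitness(p) == len(constraints)][:3])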
The search-based approach no longer restricts constraints to constraint solvers' (miniature) languages. In FANDANGO, constraints can use the whole Python language and library. This expressiveness gives testers unprecedented flexibility in shaping test inputs. It allows them to state arbitrary goals for test generation: "Please produce 1,000 valid test inputs where the voltage field follows a Gaussian distribution but never exceeds 20 mV."
Publisher's Version
Info
ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference
Yuchong Xie,
Wenhui Zhang, and
Dongdong She
(Hong Kong University of Science and Technology, China; Hunan University, Changsha, China)
Fuzzing is a popular software testing technique for discovering vulnerabilities. A central problem in fuzzing is identifying hot bytes that can influence program behavior. Taint analysis can track the data flow of hot bytes in a white-box fashion, but it often suffers from stability issues and cannot run on large real-world programs. Fuzzing-Driven Taint Inference (FTI) is a simple black-box technique to track hot bytes for fuzzing. It monitors the dynamic program behaviors of program execution instances and further infers hot bytes in a black-box fashion. However, this method requires additional O(N) program executions and incurs a large runtime overhead.
We observe that a widely used mutation scheme in fuzzing, havoc mode, can be adapted into a lightweight FTI with zero additional program executions. In this work, we first present a computational model of the havoc mode that formally describes its mutation process. Based on this model, we show that the havoc mode can simultaneously launch FTI while generating and executing new testcases. Further, we propose a novel FTI called ZTaint-Havoc that does not need any additional program executions. ZTaint-Havoc incurs minimal instrumentation overhead of 3.84% on UniBench and 12.58% on FuzzBench. In the end, we give an effective mutation algorithm using the hot bytes identified by ZTaint-Havoc.
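The core observation can be sketched in a few lines of Python: a havoc-style loop already knows which byte offsets it mutated in each execution it was going to run anyway, so offsets whose mutation changes observed behavior can be voted up as hot bytes; the target program and its coverage function below are stand-ins:

import random
from collections import Counter

random.seed(0)

def run_target(data: bytes) -> frozenset:
    """Stand-in for an instrumented execution returning the set of covered edges."""
    edges = {0}
    if len(data) > 4 and data[4] >= 0x80:
        edges.add(1)                       # branch influenced by byte 4
    if data[:2] == b"HD":
        edges.add(2)                       # branch influenced by bytes 0-1
    return frozenset(edges)

seed = bytearray(b"HDxxxxxxxx")
baseline = run_target(bytes(seed))
hot_votes = Counter()

for _ in range(2000):                      # ordinary havoc loop, no extra executions
    data = bytearray(seed)
    touched = random.sample(range(len(data)), k=random.randint(1, 3))
    for off in touched:
        data[off] = random.randrange(256)
    if run_target(bytes(data)) != baseline:
        hot_votes.update(touched)          # mutated offsets that changed behavior

print("inferred hot bytes:", [off for off, _ in hot_votes.most_common(4)])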
We conduct a comprehensive evaluation to investigate the computational model of havoc mode. Our evaluation result justifies that it is feasible to adapt the havoc mode to an efficient FTI without any additional program execution. We further implement our approach as a prototype ZTaint-Havoc based on the havoc mode of AFL++. We evaluate ZTaint-Havoc on two fuzzing datasets FuzzBench and UniBench. Our extensive evaluation results show that ZTaint-Havoc improves edge coverage by up to 33.71% on FuzzBench and 51.12% on UniBench over vanilla AFL++, with average improvements of 2.97% and 6.12% respectively, in 24-hour campaigns.
Publisher's Version
A Low-Cost Feature Interaction Fault Localization Approach for Software Product Lines
Haining Wang,
Yi Xiang,
Han Huang,
Jie Cao,
Kaichen Chen, and
Xiaowei Yang
(South China University of Technology, China; iSOFT Infrastructure Software, China)
In Software Product Lines (SPLs), localizing buggy feature interactions helps developers identify the root cause of test failures, thereby reducing their workload. This task is challenging because the number of potential interactions grows exponentially with the number of features, resulting in a vast search space, especially for large SPLs. Previous approaches have partially addressed this issue by constructing and examining potential feature interactions based on suspicious feature selections (e.g., those present in failed configurations but not in passed ones). However, these approaches often overlook the causal relationship between buggy feature interaction and test failures, resulting in an excessive search space and high-cost fault localization. To address this, we propose a low-cost Counterfactual Reasoning-Based Fault Localization (CRFL) approach for SPLs, which enhances fault localization efficiency by reducing both the search space and redundant computations. Specifically, CRFL employs counterfactual reasoning to infer suspicious feature selections and utilizes symmetric uncertainty to filter out irrelevant feature interactions. Additionally, CRFL incorporates two findings to prevent the repeated generation and examination of the same feature interactions. We evaluate the performance of our approach using eight publicly available SPL systems. To enable comparisons on larger real-world SPLs, we generate multiple buggy mutants for both BerkeleyDB and TankWar. Experimental results show that our approach reduces the search space by 51%∼73% for small SPLs (with 6∼9 features) and by 71%∼88% for larger SPLs (with 13∼99 features). The average runtime of our approach is approximately 15.6 times faster than that of a state-of-the-art method. Furthermore, when combined with statement-level localization techniques, CRFL can efficiently localize buggy statements, demonstrating its ability to accurately identify buggy feature interactions.
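To make the symmetric-uncertainty filter concrete, the minimal Python sketch below computes SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) between feature-selection vectors and test outcomes over sampled configurations; the configurations and outcomes are toy data, not the paper's subjects:

import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    mi = hx + hy - hxy                       # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 0.0 if hx + hy == 0 else 2.0 * mi / (hx + hy)

# Feature selections (1 = selected) across 8 sampled configurations,
# and whether each configuration's test failed.
feature_a = [1, 1, 0, 0, 1, 0, 1, 0]
feature_b = [0, 1, 0, 1, 1, 0, 1, 1]
failed    = [1, 1, 0, 0, 1, 0, 1, 0]

print("SU(A, fail) =", round(symmetric_uncertainty(feature_a, failed), 3))  # high: keep A
print("SU(B, fail) =", round(symmetric_uncertainty(feature_b, failed), 3))  # low: filter B out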
Publisher's Version
WildSync: Automated Fuzzing Harness Synthesis via Wild API Usage Recovery
Wei-Cheng Wu,
Stefan Nagy, and
Christophe Hauser
(Dartmouth College, USA; University of Utah, USA)
Fuzzing stands as one of the most practical techniques for testing software efficiently. When applying fuzzing to software library APIs, high-quality fuzzing harnesses are essential, enabling fuzzers to execute the APIs with precise sequences and function parameters. Although software developers commonly rely on manual efforts to create fuzzing harnesses, there has been a growing interest in automating this process. Existing works are often constrained in scalability and effectiveness due to their reliance on compiler-based analysis or runtime execution traces, which require manual setup and configuration. Our investigation of multiple actively fuzzed libraries reveals that a large number of exported API functions externally used by various open-source projects remain untested by existing harnesses or unit-test files. The lack of testing for these API functions increases the risk of vulnerabilities going undetected, potentially leading to security issues.
In order to address the lack of coverage affecting existing fuzzing methods, we propose a novel approach to automatically generate fuzzing harnesses by extracting usage patterns of untested functions from real-world scenarios, using techniques based on lightweight Abstract Syntax Tree parsing to extract API usage from external source code. Then, we integrate the usage patterns into existing harnesses to construct new ones covering these untested functions. We have implemented a prototype of this concept named WildSync, enabling the automatic synthesis of fuzzing harnesses for C/C++ libraries on OSS-Fuzz. In our experiments, WildSync successfully produced 469 new harnesses for 24 actively fuzzed libraries on OSS-Fuzz, and also 3 widely used libraries that can be later integrated into OSS-Fuzz. This results in a significant increase in test coverage spanning over 1.3k functions and 16k lines of code, while also identifying 7 previously undetected bugs.
Publisher's Version
Assessing Scene Generation Techniques for Testing COLREGS-Compliance of Autonomous Surface Vehicles
Dominik Frey,
Ulf Kargén, and
Dániel Varró
(Linköping University, Sweden; McGill University, Canada)
Autonomous surface vehicles (ASVs) need to complete missions without posing risks to other maritime traffic. Safe traffic in open sea encounters is controlled by the International Regulations for Preventing Collisions at Sea (COLREGS) formulated by the International Maritime Organization (IMO). Designed with human operators in mind, the COLREGS are intentionally underspecified, which may result in ambiguous requirements for correct behaviour for ASVs. Hence the systematic testing of such ambiguous situations is particularly important.
This paper investigates to what extent existing test scenario generation approaches deemed effective in the automotive domain can be adapted to test COLREGS-compliance in a maritime context with multi-vessel encounters. In a series of experiments involving synthetic and real-world test scenarios, their performance is evaluated with respect to relevance, diversity, completeness, scalability and speed. Our results indicate that (1) test scenarios derived from historic maritime traffic are insufficient for testing multi-ship encounters. Moreover, (2) existing test scenario generation techniques provide sufficient scalability and speed, but they are very limited in terms of diversity and completeness when the number of vessels increases.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Clause2Inv: A Generate-Combine-Check Framework for Loop Invariant Inference
Weining Cao,
Guangyuan Wu,
Tangzhi Xu,
Yuan Yao,
Hengfeng Wei,
Taolue Chen, and
Xiaoxing Ma
(Nanjing University, China; Birkbeck University of London, UK)
Loop invariant inference is a fundamental, yet challenging, problem in program verification. Recent work adopts the guess-and-check framework, where candidate loop invariants are iteratively generated in the guess step and verified in the check step. A major challenge of this general framework is to produce high-quality candidate invariants in each iteration so that the inference procedure can converge quickly. Empirically, we observe that existing approaches may struggle with guessing the complete invariant due to the complexity of logical connectives, but usually, all the clauses of the correct loop invariant have already appeared in the previous guesses. This motivates us to refine the guess-and-check framework, resulting in a generate-combine-check framework, where the loop invariant inference task is divided into clause generation and clause combination. Specifically, we propose Clause2Inv, a novel loop invariant inference approach under the new framework, which consists of an LLM-based clause generator and a counterexample-driven clause combinator. As the clause generator, Clause2Inv leverages LLMs to generate a multitude of clauses; as the clause combinator, it leverages counterexamples from the previous rounds to convert generated clauses into invariants. Our experiments show that Clause2Inv significantly outperforms existing loop invariant inference approaches. For example, Clause2Inv solved 312 (out of 316) linear invariant inference tasks and 44 (out of 50) nonlinear invariant inference tasks, which is at least 93 and 16 more than the existing baselines, respectively. By design, the generate-combine-check framework is flexible enough to accommodate various existing approaches that are currently under the guess-and-check framework by splitting their guessed candidate invariants into clauses. The evaluation shows that our approach can, with minor adaptation, improve existing loop invariant inference approaches in both effectiveness and efficiency. For example, Code2Inv, which solved 210 linear problems with an average solving time of 137.6 seconds, can be improved to solve 252 problems with an average solving time of 17.8 seconds.
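The combine step can be pictured with a minimal Python sketch: given candidate clauses (in practice produced by the LLM-based generator), search for a conjunction that holds on all reachable states seen so far and is refuted by every counterexample; the clauses, states, and brute-force enumeration below are illustrative only:

from itertools import combinations

clauses = {
    "x >= 0":     lambda s: s["x"] >= 0,
    "y >= 0":     lambda s: s["y"] >= 0,
    "x + y == n": lambda s: s["x"] + s["y"] == s["n"],
    "x <= 1":     lambda s: s["x"] <= 1,          # too strong: rejected by a reachable state
}

positive = [{"x": 0, "y": 5, "n": 5}, {"x": 3, "y": 2, "n": 5}]   # reachable states
negative = [{"x": 2, "y": 2, "n": 5}, {"x": -1, "y": 6, "n": 5}]  # counterexamples

def holds(conj, state):
    return all(clauses[c](state) for c in conj)

for size in range(1, len(clauses) + 1):
    for conj in combinations(clauses, size):
        if all(holds(conj, s) for s in positive) and not any(holds(conj, s) for s in negative):
            print("candidate invariant:", " && ".join(conj))   # handed to the checker next
            break
    else:
        continue
    break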
Publisher's Version
Copy-and-Paste? Identifying EVM-Inequivalent Code Smells in Multi-chain Reuse Contracts
Zexu Wang,
Jiachi Chen,
Tao Zhang,
Yu Zhang,
Weizhe Zhang,
Yuming Feng, and
Zibin Zheng
(Sun Yat-sen University, China; Peng Cheng Laboratory, China; Guangdong Engineering Technology Research Center of Blockchain, China; Macau University of Science and Technology, China; Harbin Institute of Technology, China)
With the growing development of Solidity contracts on Ethereum, more developers are reusing them on other compatible blockchains. However, developers may overlook the differences between the designs of blockchain systems, such as the Gas Mechanism and Consensus Protocol, so the same contracts on different blockchains may fail to achieve execution as consistent as on Ethereum. This inconsistency reveals design flaws in reused contracts, exposing code smells that hinder code reusability, and we define this inconsistency as EVM-Inequivalent Code Smells.
In this paper, we conducted the first empirical study to reveal the causes and characteristics of EVM-Inequivalent Code Smells. To ensure the identified smells reflect real developer concerns, we collected and analyzed 1,379 security audit reports and 326 Stack Overflow posts related to reused contracts on EVM-compatible blockchains, such as Binance Smart Chain (BSC) and Polygon. Using the open card sorting method, we defined six types of EVM-Inequivalent Code Smells. For automated detection, we developed a tool named EquivGuard. It employs static taint analysis to identify key paths from different patterns and uses symbolic execution to verify path reachability. Our analysis of 905,948 contracts across six major blockchains shows that EVM-Inequivalent Code Smells are widespread, with an average prevalence of 17.70%. While contracts with code smells do not necessarily lead to financial loss and attacks, their high prevalence and significant asset management underscore the potential threats of reusing these smelly Ethereum contracts. Thus, developers are advised to abandon Copy-and-Paste programming practices and detect EVM-Inequivalent Code Smells before reusing Ethereum contracts.
Publisher's Version
You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects
Islem Bouzenia and
Michael Pradel
(University of Stuttgart, Germany)
The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Finding 709 Defects in 258 Projects: An Experience Report on Applying CodeQL to Open-Source Embedded Software (Experience Paper)
Mingjie Shen,
Akul Abhilash Pillai,
Brian A. Yuan,
James C. Davis, and
Aravind Machiry
(Purdue University, USA)
Embedded software is deployed in billions of devices worldwide, including in safety-sensitive systems like medical devices and autonomous vehicles. Defects in embedded software can have severe consequences. Many embedded software products incorporate Open-Source Embedded Software (EMBOSS), so it is important for EMBOSS engineers to use appropriate mechanisms to avoid defects. One of the common security practices is to use Static Application Security Testing (SAST) tools, which help identify commonly occurring vulnerabilities. Existing research related to SAST tools focuses mainly on regular (or non-embedded) software. There is a lack of knowledge about the use of SAST tools in embedded software. Furthermore, embedded software greatly differs from regular software in terms of semantics, software organization, coding practices, and build setup. All of these factors influence SAST tools and could potentially affect their usage.
In this experience paper, we report on a large-scale empirical study of SAST in EMBOSS repositories. We collected a corpus of 258 of the most popular EMBOSS projects, and then measured their use of SAST tools via program analysis and a survey (N=25) of their developers. Advanced SAST tools are rarely used: only 3% of projects go beyond trivial compiler analyses. Developers cited the perception of ineffectiveness and false positives as reasons for limited adoption. Motivated by this deficit, we applied the state-of-the-art (SOTA) CodeQL SAST tool and measured its ease of use and actual effectiveness. Across the 258 projects, CodeQL reported 709 true defects with a false positive rate of 34%. There were 535 (75%) likely security vulnerabilities, including in major projects maintained by Microsoft, Amazon, and the Apache Foundation. EMBOSS engineers have confirmed 376 (53%) of these defects, mainly by accepting our pull requests. Two CVEs were issued. Based on these results, we proposed pull requests to include our workflows as part of EMBOSS Continuous Integration (CI) pipelines; 37 of these (71% of active repositories) have already been merged. In summary, we urge EMBOSS engineers to adopt the current generation of SAST tools, which offer low false positive rates and are effective at finding security-relevant defects.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Enhancing Smart Contract Security Analysis with Execution Property Graphs
Kaihua Qin,
Zhe Ye,
Zhun Wang,
Weilin Li,
Liyi Zhou,
Chao Zhang,
Dawn Song, and
Arthur Gervais
(Yale University, USA; University of California at Berkeley, USA; University College London, UK; University of Sydney, Australia; Tsinghua University, China)
Smart contract vulnerabilities have led to significant financial losses, with their increasing complexity rendering outright prevention of hacks increasingly challenging. This trend highlights the crucial need for advanced forensic analysis and real-time intrusion detection, where dynamic analysis plays a key role in dissecting smart contract executions. Therefore, there is a pressing need for a unified and generic representation of smart contract executions, complemented by an efficient methodology that enables the modeling and identification of a broad spectrum of emerging attacks.
We introduce Clue, a dynamic analysis framework specifically designed for the Ethereum virtual machine. Central to Clue is its ability to capture critical runtime information during contract executions, employing a novel graph-based representation, the Execution Property Graph. A key feature of Clue is its innovative graph traversal technique, which is adept at detecting complex attacks, including (read-only) reentrancy and price manipulation. Evaluation results reveal Clue's superior performance with high true positive rates and low false positive rates, outperforming state-of-the-art tools. Furthermore, Clue's efficiency positions it as a valuable tool for both forensic analysis and real-time intrusion detection.
Publisher's Version
MLLM-Based UI2Code Automation Guided by UI Layout Information
Fan Wu,
Cuiyun Gao,
Shuqing Li,
Xin-Cheng Wen, and
Qing Liao
(Harbin Institute of Technology, China; Chinese University of Hong Kong, China)
Converting user interfaces into code (UI2Code) is a crucial step in website development, which is time-consuming and labor-intensive. Automating UI2Code is essential to streamline this task and beneficial for improving development efficiency. Deep learning-based methods exist for the task; however, they heavily rely on a large amount of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) presents potential for alleviating the issue, but they find it difficult to comprehend the complex layouts in UIs and to generate accurate code with the layout preserved. To address these issues, we propose LayoutCoder, a novel MLLM-based framework generating UI code from real-world webpage images, which includes three key modules: (1) Element Relation Construction, which aims at capturing UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which aims at generating UI layout trees for guiding the subsequent code generation process; and (3) Layout-Guided Code Fusion, which aims at producing accurate code with the layout preserved. For evaluation, we build a new benchmark dataset named Snap2Code, which involves 350 real-world websites and is divided into seen and unseen parts to mitigate the data leakage issue, in addition to the popular dataset Design2Code. Extensive evaluation shows the superior performance of LayoutCoder over the state-of-the-art approaches. Compared with the best-performing baseline, LayoutCoder improves the BLEU score by 10.14% and the CLIP score by 3.95% on average across all datasets.
Publisher's Version
Quantum Concolic Testing
Shangzhou Xia,
Jianjun Zhao,
Fuyuan Zhang, and
Xiaoyu Guo
(Kyushu University, Japan; University of Tokyo, Japan)
This paper presents the first concolic testing framework explicitly designed for quantum programs. The framework introduces quantum constraint generation methods for quantum control statements that quantify quantum states and offers a symbolization method for quantum variables. Based on this framework, we generate path constraints for each concrete execution path of a quantum program. These constraints guide the exploration of new paths, with a quantum constraint solver determining outcomes to create novel input samples, thereby enhancing branch coverage. Our framework has been implemented in Python and integrated with Qiskit for practical evaluation. Experimental results show that our concolic testing framework improves branch coverage, generates high-quality quantum input samples, and detects bugs, demonstrating its effectiveness and efficiency in quantum programming and bug detection. Regarding branch coverage, our framework achieves more than 74.27% on quantum programs with under 5 qubits.
Publisher's Version
BinQuery: A Novel Framework for Natural Language-Based Binary Code Retrieval
Bolun Zhang,
Zeyu Gao,
Hao Wang,
Yuxin Cui,
Siliang Qin,
Chao Zhang,
Kai Chen, and
Beibei Zhao
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Tsinghua University, China)
Binary Function Retrieval (BFR) is crucial in reverse engineering for identifying specific functions in binary code, especially those associated with malicious behavior or vulnerabilities. Traditional BFR methods rely on heuristics, often lacking the efficiency and adaptability needed for large-scale or diverse binary analysis tasks. To address these challenges, we present BinQuery, a Natural Language-based BFR (NL-based BFR) framework that uses natural language queries to retrieve relevant binary functions with improved flexibility and precision. BinQuery introduces innovative techniques to bridge information gaps between binary code and natural language, achieves fine-grained alignment for enhanced retrieval accuracy, and leverages Large Language Models (LLMs) to refine queries and generate diverse descriptions. Our extensive experiments indicate that BinQuery surpasses current state-of-the-art methods, achieving a 42.55% increase in recall@1 and a 4× improvement in performance on comparable benchmarks.
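The retrieval core can be sketched minimally in Python: embed the natural-language query and candidate function representations into a shared space and rank by cosine similarity; here a TF-IDF vectorizer stands in for BinQuery's learned encoders, and the decompiled-function summaries are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

functions = {
    "sub_401000": "allocates buffer copies user input with strcpy no bounds check",
    "sub_402340": "parses png header validates magic bytes and chunk length",
    "sub_40a1b0": "opens socket sends beacon to remote host every 60 seconds",
}
query = "function that copies attacker controlled data without checking the size"

vec = TfidfVectorizer().fit(list(functions.values()) + [query])
scores = cosine_similarity(vec.transform([query]),
                           vec.transform(list(functions.values())))[0]
ranked = sorted(zip(functions, scores), key=lambda kv: -kv[1])
print(ranked[0])   # the strcpy-like function should rank first for this query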
Publisher's Version
BinDSA: Efficient, Precise Binary-Level Pointer Analysis with Context-Sensitive Heap Reconstruction
Lian Gao and
Heng Yin
(University of California at Riverside, USA)
Pointer analysis serves as a fundamental component in the realm of binary code reverse engineering. It can be leveraged to reconstruct a binary program's call graph and can be further applied to various security analyses. However, the absence of symbols and type information within binary code presents formidable challenges to effective pointer analysis. Existing works often apply approximations when performing pointer analysis on binaries. Nevertheless, these methods tend to be inefficient and produce numerous false positive targets. In this paper, we propose BinDSA, a novel model tailored for binary pointer analysis. BinDSA prioritizes precision and efficiency over soundness. It is field- and context-sensitive, employing unification-based techniques and reconstructing a context-sensitive heap. It jointly recovers data structures and points-to relations so that precision can be further improved. In our evaluation, we demonstrate that BinDSA is 5 times more efficient and notably more precise than the current state-of-the-art technique without significantly sacrificing soundness. We also apply BinDSA to CVE reachability analysis and vulnerability detection, demonstrating its effective application to security tasks.
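As background for the unification-based technique, the minimal Python sketch below implements a Steensgaard-style points-to analysis with union-find, where each abstract cell keeps a single points-to target and assignments unify equivalence classes; field sensitivity, context sensitivity, and the heap reconstruction that BinDSA adds are omitted:

class Cell:
    def __init__(self, name):
        self.name = name
        self.parent = self
        self.pts = None                       # single points-to target (an abstract cell)

    def find(self):
        root = self
        while root.parent is not root:
            root = root.parent
        node = self
        while node.parent is not root:        # path compression
            node.parent, node = root, node.parent
        return root

def unify(a, b):
    """Merge the equivalence classes of a and b, collapsing their targets too."""
    a, b = a.find(), b.find()
    if a is b:
        return
    b.parent = a
    if a.pts is None:
        a.pts = b.pts
    elif b.pts is not None:
        unify(a.pts, b.pts)

cells = {n: Cell(n) for n in ["p", "q", "x", "y"]}
cells["p"].pts = cells["x"]                   # p = &x
cells["q"].pts = cells["y"]                   # q = &y
unify(cells["p"], cells["q"])                 # p = q: unification merges their targets

for name in ["p", "q"]:
    target = cells[name].find().pts.find()
    aliases = sorted(m for m, c in cells.items() if c.find() is target)
    print(name, "->", aliases)                # both now point to the merged {x, y} class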
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Adding Spatial Memory Safety to EDK II through Checked C (Experience Paper)
Sourag Cherupattamoolayil,
Arunkumar Bhattar,
Connor Everett Glosner, and
Aravind Machiry
(Purdue University, USA)
Embedded software, predominantly written in C, is prone to memory corruption vulnerabilities due to spatial memory issues. Although various memory safety techniques exist, they are often unsuitable for embedded systems due to resource constraints and a lack of standardized OS support. Checked C, a backward-compatible, memory-safe C dialect, offers a potential solution by using pointer annotations for runtime checks to enhance spatial memory safety with minimal overhead. This paper provides the first experience report of porting EDK2 (an open-source UEFI implementation), an exemplary embedded codebase, to Checked C, highlighting challenges and providing insights into applying Checked C to similar embedded systems. We also provide an enhanced automated annotation tool, e3c, which improves the conversion rate by 25%, enabling easier conversion to Checked C.
Publisher's Version
Published Artifact
Artifacts Available
REACCEPT: Automated Co-evolution of Production and Test Code Based on Dynamic Validation and Large Language Models
Jianlei Chi,
Xiaotian Wang,
Yuhan Huang,
Lechen Yu,
Di Cui,
Jianguo Sun, and
Jun Sun
(Xidian University, China; Harbin Engineering University, China; Microsoft, USA; Singapore Management University, Singapore)
Synchronizing production and test code, known as PT co-evolution, is critical for software quality. Given the significant manual effort involved, researchers have tried automating PT co-evolution using predefined heuristics and machine learning models. However, existing solutions are still incomplete. Most approaches only detect and flag obsolete test cases, leaving developers to manually update them. Meanwhile, existing solutions may suffer from low accuracy, especially when applied to real-world software projects.
In this paper, we propose ReAccept, a novel approach leveraging large language models (LLMs), retrieval-augmented generation (RAG), and dynamic validation to fully automate PT co-evolution with high accuracy. ReAccept employs an experience-guided approach to generate prompt templates for the identification and subsequent update processes. After updating a test case, ReAccept performs dynamic validation by checking syntax, verifying semantics, and assessing test coverage. If the validation fails, ReAccept leverages the error messages to iteratively refine the patch. To evaluate ReAccept's effectiveness, we conducted extensive experiments with a dataset of 537 Java projects and compared ReAccept's performance with several state-of-the-art methods. The evaluation results show that ReAccept achieved an update accuracy of 60.16% on the correctly identified obsolete test code, surpassing the state-of-the-art technique CEPROT by 90%. These findings demonstrate that ReAccept can effectively maintain test code, improve overall software quality, and significantly reduce maintenance effort.
Publisher's Version
Understanding Practitioners’ Expectations on Clear Code Review Comments
Junkai Chen,
Zhenhao Li,
Qiheng Mao,
Xing Hu,
Kui Liu, and
Xin Xia
(Zhejiang University, China; York University, Canada)
The code review comment (CRC) is pivotal in the process of modern code review. It provides reviewers with the opportunity to identify potential bugs, offer constructive feedback, and suggest improvements. Clear and concise code review comments (CRCs) facilitate the communication between developers and are crucial to the correct understanding of the identified issues and proposed solutions. Despite the importance of CRC clarity, there is still a lack of guidelines on what constitutes good clarity and how to evaluate it. In this paper, we conduct a comprehensive study on understanding and evaluating the clarity of CRCs. We first derive a set of attributes related to the clarity of CRCs, namely the RIE attributes (i.e., Relevance, Informativeness, and Expression), as well as their corresponding evaluation criteria based on our literature review and a survey with practitioners. We then investigate the clarity of CRCs in open-source projects written in nine programming languages and find that a large portion (i.e., 28.8%) of the CRCs lack clarity in at least one of the attributes. Finally, we explore the potential of automatically evaluating the clarity of CRCs by proposing ClearCRC. Experimental results show that ClearCRC with pre-trained language models is promising for effective evaluation of the clarity of CRCs, achieving a balanced accuracy of up to 73.04% and an F-1 score of up to 94.61%.
Publisher's Version
Pepper: Preference-Aware Active Trapping for Ransomware
Huan Zhang,
Zhengkai Qin,
Lixin Zhao,
Aimin Yu,
Lijun Cai, and
Dan Meng
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Ransomware encrypts files on infected systems and demands a hefty ransom for decryption, posing a significant threat to both enterprises and individuals. However, existing methods fail to capture the encryption preferences of diverse ransomware families, lacking an efficient and systematic proactive defense method. In this paper, we propose Pepper, a preference-aware active ransomware trapping method, covering decoy file generation, deployment, and monitoring. Through examination of numerous ransomware families, we have identified two prevalent encryption preferences: encryption file preferences and encryption path preferences. Deploying decoy files aligned with ransomware’s encryption preferences within its preferred pathways provides an opportunity for efficient and early trapping of ransomware. Pepper combines a GNN-based recommendation model with expert insights to unveil the encryption file and path preferences across various ransomware families, guiding the generation and deployment of decoy files. Moreover, a decoy file monitor is designed to continuously monitor decoy file changes and promptly respond to anomalies. Extensive experiments show that Pepper achieves a 98.68% detection rate for ransomware, with an average file loss of 2.27. Moreover, it exhibits robustness in detecting unknown ransomware variants and does not interfere with regular users.
Publisher's Version
Extended Reality Cybersickness Assessment via User Review Analysis
Shuqing Li,
Qisheng Zheng,
Cuiyun Gao,
Jia Feng, and
Michael R. Lyu
(Chinese University of Hong Kong, China; Harbin Institute of Technology, China)
In recent years, the extended reality (XR) software ecosystem has emerged as the next ubiquitous computing platform, as it provides users with immersive interactive experiences. However, the XR ecosystem suffers from cybersickness problems, which can greatly affect user comfort and safety, leading to symptoms like headaches, disorientation, etc. That makes effective cybersickness assessment a timely and important question. The state-of-the-art methods for assessing the cybersickness of XR software typically monitor users’ biological metrics during XR usage; they rely heavily on manual playtesting and suffer from limited scalability. User reviews on XR app stores are informative for developers to learn the cybersickness ratings of their apps and the reasons behind them. Nevertheless, the large number of user reviews can hardly be analyzed by developers manually, and most current automatic user review analysis methods can only provide coarse-grained or abstract results, such as extracting several high-level key topic groups discussed by reviews. Recent advancements in LLMs may bring new opportunities. However, directly leveraging LLMs for evaluating XR cybersickness is challenging because LLMs perform poorly on a large number of short texts, and their context window is limited. To address these challenges, we introduce XRCare, a comprehensive framework designed to automate the assessment of cybersickness and root cause reasoning for XR apps by resorting to fine-grained user review analysis. XRCare mainly includes three phases: (1) Insight pool construction, in which XRCare collects cybersickness analyzing chains and the corresponding analyzing results from domain experts; (2) Reasoning graph construction, in which XRCare dynamically extracts, categorizes, and maintains the reasons from user reviews that make users feel cybersickness on a self-evolving hierarchical graph; and (3) Multi-agent deductive cybersickness reasoning, which utilizes a multi-agent system to simulate diverse user demographics for analyzing and rating the intensity of cybersickness, as well as the causes of cybersickness. This structured approach allows XRCare to systematically identify, categorize, and address instances of cybersickness. For experiments, we construct a large-scale dataset consisting of 685,111 user reviews from 9,667 XR apps. Our evaluation shows that XRCare enhances the F1-score by 20.63% over the best-performing baseline and by 32.27% on average across all baselines, while also offering more accurate and detailed interpretability insights.
Publisher's Version
Wemby’s Web: Hunting for Memory Corruption in WebAssembly
Oussama Draissi,
Tobias Cloosters,
David Klein,
Michael Rodler,
Marius Musch,
Martin Johns, and
Lucas Davi
(University of Duisburg-Essen, Germany; TU Braunschweig, Germany; Amazon Web Services, Germany)
WebAssembly enables fast execution of performance-critical code in web applications by utilizing native code. However, recent research has demonstrated the potential for memory corruption errors within WebAssembly modules to exploit web applications. In this work, we present the first systematic analysis of memory corruption in WebAssembly, unveiling the prevalence of a novel threat model where memory corruption enables code injection in a victim’s browser. Our large-scale analysis across 37,797 domains reveals that an alarming 29,411 (77.81%) of them fully trust data coming from potentially attacker-controlled sources. As a result, an attacker can exploit memory errors to manipulate the WebAssembly memory, where the data is implicitly trusted and frequently passed into security-sensitive functions such as eval or directly into the DOM via innerHTML. Thus, an attacker can abuse this trust to gain JavaScript code execution, i.e., Cross-Site Scripting (XSS).
To tackle this issue, we present Wemby, the first viable approach to efficiently analyze WebAssembly-powered websites holistically. We demonstrate that Wemby is proficient at detecting remotely exposed memory corruption errors in web applications through fuzzing. For this purpose, we implement binary-only WebAssembly instrumentation that provides fine-grained memory corruption oracles. We applied Wemby to different websites, uncovering several memory corruption bugs, including one on the Zoom platform. In terms of performance, our ablation study demonstrates that Wemby outperforms current WebAssembly fuzzers. Specifically, Wemby achieves an average speed improvement of 232 times and delivers 46% greater code coverage compared to the state-of-the-art.
Publisher's Version
The Incredible Shrinking Context... in a Decompiler Near You
Sifis Lagouvardos,
Yannis Bollanos,
Neville Grech, and
Yannis Smaragdakis
(University of Athens, Greece; Dedaub, Greece; Dedaub, Malta)
Decompilation of binary code has arisen as a highly important application in the space of Ethereum VM (EVM) smart contracts. Major new decompilers appear nearly every year and attain popularity, for a multitude of reverse-engineering or tool-building purposes. Technically, the problem is fundamental: it consists of recovering high-level control flow from a highly optimized continuation-passing-style (CPS) representation. Architecturally, decompilers can be built using either static analysis or symbolic execution techniques.
We present Shrnkr, a static-analysis-based decompiler succeeding the state-of-the-art Elipmoc decompiler. Shrnkr achieves drastic improvements relative to the state of the art in all significant dimensions: scalability, completeness, and precision. Chief among the techniques employed is a new variant of static analysis context: shrinking context sensitivity. Shrinking context sensitivity performs deep cuts in the static analysis context, eagerly “forgetting” control-flow history, in order to leave room for further precise reasoning.
We compare Shrnkr to state-of-the-art decompilers, both static-analysis- and symbolic-execution-based. In a standard benchmark set, Shrnkr scales to over 99.5% of contracts (compared to ∼95% for Elipmoc), covers (i.e., reaches and manages to decompile) 67% more code than Heimdall-rs, and reduces key imprecision metrics by over 65%, compared again to Elipmoc.
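The core idea of shrinking context sensitivity can be illustrated with a toy example: instead of distinguishing analysis facts by the full control-flow history, the analysis eagerly keeps only a short, recent suffix of it. The sketch below is a simplification for intuition only and does not reflect Shrnkr's actual context constructor.

```python
# Toy illustration of shrinking the analysis context: keep only a short,
# recent suffix of the control-flow history instead of the whole call string.
# This is for intuition only and is not Shrnkr's actual context constructor.
def full_context(history):
    return tuple(history)                        # classic full call-string context

def shrinking_context(history, keep=2):
    return tuple(history[-keep:])                # deep cut: forget older history

trace = ["entry", "dispatch", "helper", "callback", "sink"]
print(full_context(trace))       # ('entry', 'dispatch', 'helper', 'callback', 'sink')
print(shrinking_context(trace))  # ('callback', 'sink') -- far fewer distinct contexts
```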
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation
Zhenlan Ji,
Pingchuan Ma,
Zongjie Li,
Zhaoyu Wang, and
Shuai Wang
(Hong Kong University of Science and Technology, China)
While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where LLMs, deemed complex and powerful black-box models, are instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and their lack of transparency.
Inspired by recent progress in causality analysis and its software engineering applications, this paper proposes a causality-driven approach to systematically analyze prompt-code causal relationships. However, this endeavor faces three key technical challenges: (1) representing textual prompts and code in a canonical form, (2) establishing causal relations between high-level concepts and code features, and (3) systematically analyzing diverse prompt variations. To address these challenges, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The formed causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide through studies over four popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.
Publisher's Version
AdverIntent-Agent: Adversarial Reasoning for Repair Based on Inferred Program Intent
He Ye,
Aidan Z.H. Yang,
Chang Hu,
Yanlin Wang,
Tao Zhang, and
Claire Le Goues
(University College London, UK; Carnegie Mellon University, USA; Macau University of Science and Technology, China; Sun Yat-sen University, China)
Automated program repair (APR) has shown promising results, particularly with the use of neural networks. Currently, most APR tools focus on code transformations specified by test suites, rather than reasoning about the program’s intent and the high-level bug specification. Without a proper understanding of program intent, these tools tend to generate patches that overfit incomplete test suites and fail to reflect the developer’s intentions. However, reasoning about program intent is challenging. In our work, we propose an approach called AdverIntent-Agent, based on critique and adversarial reasoning. Our approach is novel in that it shifts the focus from generating multiple APR patches to inferring multiple potential program intents. Ideally, we aim to infer intents that are, to some extent, adversarial to each other, maximizing the probability that at least one aligns closely with the developer’s original intent. AdverIntent-Agent is a multi-agent approach consisting of three agents: a reasoning agent, a test agent, and a repair agent. First, the reasoning agent generates adversarial program intents along with the corresponding faulty statements. Next, the test agent produces adversarial test cases that align with each inferred intent, constructing oracles that use the same inputs but have different expected outputs. Finally, the repair agent uses dynamic and precise LLM prompts to generate patches that satisfy both the inferred program intent and the generated tests. AdverIntent-Agent was evaluated on two benchmarks, Defects4J 2.0 and HumanEval-Java, where it correctly repaired 77 and 105 bugs, respectively. Our work helps reduce the effort required to review patches by enabling developers to assess program intent in natural language, rather than reviewing code patches.
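The notion of adversarial intents sharing test inputs but disagreeing on expected outputs can be illustrated with a toy example; the function and intents below are hypothetical and only mirror the paper's oracle-construction idea.

```python
# Toy illustration of adversarial intent oracles: two inferred intents share
# the same test input but disagree on the expected output, so at most one can
# match the developer's true intent. Function and intents are hypothetical.
def rounding(x):                                  # program under repair
    return int(x)

inferred_intents = {
    "truncate toward zero": lambda x: int(x),
    "round half up":        lambda x: int(x + 0.5),
}

test_input = 2.7
for intent, oracle in inferred_intents.items():
    expected = oracle(test_input)
    actual = rounding(test_input)
    verdict = "PASS" if actual == expected else "FAIL"
    print(f"{intent}: expected {expected}, got {actual} -> {verdict}")
```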
Publisher's Version
ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation
Pengyu Xue,
Linhao Wu,
Zhen Yang,
Chengyi Wang,
Xiang Li,
Yuxiang Zhang,
Jia Li,
Ruikai Jin,
Yifei Pei,
Zhaoyan Shen,
Xiran Lyu, and
Jacky Wai Keung
(Shandong University, China; Tsinghua University, China; City University of Hong Kong, Hong Kong)
In recent years, Large Language Models (LLMs) have dramatically advanced the performance of automated code translation, pushing their computational accuracy scores above 80% on many previous benchmarks. However, most code samples in these benchmarks are short, standalone, statement/method-level, and algorithmic, which is not aligned with practical coding tasks. Therefore, the actual capability of LLMs in translating code samples written for daily development remains unknown.
To this end, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T is extended from ClassEval, a well-known class-level Python code generation benchmark consisting of multiple practical coding topics, such as database operation and game design, and diverse contextual dependencies (e.g., fields, methods, and libraries). It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translation and evaluate eight recent LLMs of commercial, general, and code-specialized kinds in diverse families and sizes on ClassEval-T. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark, and clear discrepancies among LLMs emerge, demonstrating the effectiveness of ClassEval-T in differentiating recent LLMs. Afterwards, we further discuss the usage scenarios for the different translation strategies and LLMs' dependency awareness when translating class samples. Finally, 1,243 failure cases made by the best-performing LLM under test are thoroughly analyzed and categorized in this paper for practical guidance and future research.
Publisher's Version
Porting Software Libraries to OpenHarmony: Transitioning from TypeScript or JavaScript to ArkTS
Bo Zhou,
Jiaqi Shi,
Ying Wang,
Li Li,
Tsz On Li,
Hai Yu, and
Zhiliang Zhu
(Northeastern University, China; Beihang University, China; Hong Kong University of Science and Technology, China)
OpenHarmony emerges as a potent force in the mobile app domain, poised to stand alongside established industry giants. ArkTS is its main language, enhancing TypeScript (TS) and JavaScript (JS) with strict typing for improved performance. Developers are encouraged to port popular TS/JS libraries to OpenHarmony, supported by detailed guidelines. However, this requires a deep understanding of ArkTS syntax, following porting specifications, and making manual changes. An automated solution is crucial to streamline this process and foster a robust software ecosystem.
As a new programming language, ArkTS currently lacks essential tools for automated analysis and porting of software libraries. However, the rise of Large Language Models (LLMs) shows promise for effectively addressing automated porting tasks. There are two challenges in using LLMs to automate the porting of TS/JS libraries to OpenHarmony: (1) LLMs have limited exposure to ArkTS code, making it difficult for them to grasp the syntactical differences between ArkTS and JS/TS, as well as the various adaptation scenarios. (2) Project-level code adaptation often involves correcting numerous syntax mismatches, which complicates matters for LLMs as they must handle the interactions between different mismatches and interdependent code. In response, we introduce ArkAdapter, a project-level automatic code adaptation approach. ArkAdapter addresses Challenge 1 by establishing an adaptation knowledge repository for ArkTS syntax comprehension. It expands a collection of real code adaptation examples based on expert experience across various scenarios, improving the adaptation capabilities of LLMs through few-shot learning. ArkAdapter overcomes Challenge 2 with an adaptation priority strategy that considers both the dependency structure and the granularity of syntax-mismatching code. This strategy helps prevent interference among various syntax mismatches and their interdependent code. Evaluation shows ArkAdapter achieves high precision (86.84%).
Publisher's Version
Reinforcement Learning-Based Fuzz Testing for the Gazebo Robotic Simulator
Zhilei Ren,
Yitao Li,
Xiaochen Li,
Guanxiao Qi,
Jifeng Xuan, and
He Jiang
(Dalian University of Technology, China; Wuhan University, China)
Gazebo, being the most widely utilized simulator in robotics, plays a pivotal role in developing and testing robotic systems. Given its impact on the safety and reliability of robotic operations, early bug detection is critical. However, due to the challenges of strict input structures and a vast state space, directly applying existing fuzz testing approaches to Gazebo is not effective.
In this paper, we present GzFuzz, the first fuzz testing framework designed for Gazebo. GzFuzz addresses these challenges through a syntax-aware feasible command generation mechanism to handle strict input requirements, and a reinforcement learning-based command generator selection mechanism to efficiently explore the state space. By combining the two mechanisms under a unified framework, GzFuzz is able to detect bugs in Gazebo effectively. In extensive experiments, GzFuzz detects an average of 9.6 unique bugs in 12 hours and achieves substantially higher code coverage than the existing fuzzers AFL++ and Fuzzotron, with improvements of approximately 239%-363%. In less than six months, GzFuzz uncovered 25 unique crashes in Gazebo, 24 of which have been fixed or confirmed. Our results highlight the importance of directly fuzzing Gazebo, thereby presenting a novel and potent methodology that serves as an inspiration for enhancing testing across a broader range of simulators.
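The reinforcement learning-based generator selection can be approximated, at its simplest, by a multi-armed bandit that rewards a command generator whenever its commands reach new coverage. The sketch below uses an epsilon-greedy policy with placeholder generators and a synthetic coverage signal; it is not GzFuzz's actual learner or the Gazebo API.

```python
import random

# Epsilon-greedy sketch of reinforcement-learning-style generator selection:
# reward a command generator whenever its commands reach new coverage.
# Generator names and the coverage signal are placeholders, not Gazebo APIs.
GENERATORS = ["spawn_entity", "set_pose", "apply_wrench", "delete_entity"]
value = {g: 0.0 for g in GENERATORS}
count = {g: 0 for g in GENERATORS}
seen_coverage = set()

def run_and_measure(generator):
    # Placeholder: pretend each run of the simulator touches some basic blocks.
    return {random.randrange(100) for _ in range(5)}

for step in range(200):
    g = (random.choice(GENERATORS) if random.random() < 0.1      # explore
         else max(GENERATORS, key=value.get))                    # exploit
    covered = run_and_measure(g)
    reward = len(covered - seen_coverage)        # reward = newly covered blocks
    seen_coverage |= covered
    count[g] += 1
    value[g] += (reward - value[g]) / count[g]   # incremental mean update

print({g: round(v, 2) for g, v in value.items()})
```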
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Why Does My Transaction Fail? A First Look at Failed Transactions on the Solana Blockchain
Xiaoye Zheng,
Zhiyuan Wan,
David Lo,
Difan Xie, and
Xiaohu Yang
(Zhejiang University, China; Singapore Management University, Singapore; Hangzhou High-Tech Zone, China)
Solana is an emerging blockchain platform, recognized for its high throughput and low transaction costs, positioning it as a preferred infrastructure for Decentralized Finance (DeFi), Non-Fungible Tokens (NFTs), and other Web 3.0 applications. In the Solana ecosystem, transaction initiators submit various instructions to interact with a diverse range of Solana smart contracts, among which are decentralized exchanges (DEXs) that utilize automated market makers (AMMs), allowing users to trade cryptocurrencies directly on the blockchain without the need for intermediaries. Despite the high throughput and low transaction costs of Solana, these advantages have also exposed Solana to bot spamming for financial exploitation, resulting in the prevalence of failed transactions and network congestion.
Prior work on Solana has mainly focused on the evaluation of the performance of the Solana blockchain, particularly scalability and transaction throughput, as well as on the improvement of smart contract security, leaving a gap in understanding the characteristics and implications of failed transactions on Solana. To address this gap, we conducted a large-scale empirical study of failed transactions on Solana, using a curated dataset of over 1.5 billion failed transactions across more than 72 million blocks. Specifically, we first characterized the failed transactions in terms of their initiators, failure-triggering programs, and temporal patterns, and compared their block positions and transaction costs with those of successful transactions. We then categorized the failed transactions by the error messages in their error logs, and investigated how specific programs and transaction initiators are associated with these errors.
We find that transaction failure rates on Solana exhibit recurring daily patterns, and demonstrate a strong positive correlation with the volume of failed transactions, with bots on Solana experiencing a high transaction failure rate of 58.43%. We identify ten distinct error types in the error logs of failed transactions, with price or profit not met and invalid status errors accounting for 67.18% of all failed transactions. AMMs primarily experience invalid status errors among failed transactions, while DEX aggregators are more commonly affected by price or profit not met errors. Among transaction initiators, bots encounter a broader range of errors due to their high-frequency trading and complex interactions with smart contracts. In contrast, human users experience a more limited range of errors. Based on our findings, we provide recommendations to mitigate transaction failures on Solana and outline future research directions.
Publisher's Version
Info
PatchScope: LLM-Enhanced Fine-Grained Stable Patch Classification for Linux Kernel
Rongkai Liu,
Heyuan Shi,
Shuning Liu,
Chao Hu,
Sisheng Li,
Yuheng Shen,
Runzhe Wang,
Xiaohai Shi, and
Yu Jiang
(Central South University, China; Tsinghua University, China; Alibaba Cloud Computing, China)
Stable patch classification plays a crucial role in vulnerability management for the Linux kernel, significantly contributing to the stability and security of long-term support (LTS) versions. Although existing tools have effectively assisted in assessing whether patches should be merged into stable versions, they cannot determine which stable patches should be merged into which LTS versions. This process still requires the maintainers of the distribution community to manually screen patches based on the requirements of their respective versions. To address this issue, we propose PatchScope, which is designed to predict the specific merge status of patches. PatchScope consists of two components: patch analysis and patch classification. Patch analysis leverages Large Language Models (LLMs) to generate detailed patch descriptions from the commit message and code changes, thereby deepening the model's semantic understanding of patches. Patch classification utilizes a pre-trained language model to extract semantic features of the patches and employs a two-stage classifier to predict the merge status of the patches. The model is optimized using a dynamic weighted loss function to handle data imbalance and improve overall performance. Given that the primary focus is maintaining Linux kernel versions 5.10 and 6.6, we have conducted comparative experiments based on these two versions. Experimental results demonstrate that PatchScope can effectively predict the merge status of patches.
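As a rough illustration of handling label imbalance with a weighted loss (PatchScope's dynamic weighting is more elaborate), the following sketch computes a class-weighted cross-entropy with inverse-frequency weights on a toy batch.

```python
import numpy as np

# Sketch of a class-weighted cross-entropy for imbalanced labels (merged vs.
# not merged). PatchScope's loss is described as dynamically weighted; here
# the weights are simply inverse class frequencies on a toy batch.
def weighted_cross_entropy(probs, labels, class_weights):
    eps = 1e-12
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return float(np.mean(class_weights[labels] * per_sample))

labels = np.array([0, 0, 0, 0, 1])                       # imbalanced toy batch
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.95, 0.05], [0.4, 0.6]])
freq = np.bincount(labels) / len(labels)
class_weights = 1.0 / (freq * len(freq))                 # inverse-frequency weights
print(weighted_cross_entropy(probs, labels, class_weights))
```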
Publisher's Version
Identifying Multi-parameter Constraint Errors in Python Data Science Library API Documentation
Xiufeng Xu,
Fuman Xie,
Chenguang Zhu,
Guangdong Bai,
Sarfraz Khurshid, and
Yi Li
(Nanyang Technological University, Singapore; University of Queensland, Australia; University of Texas at Austin, USA)
Modern AI- and data-intensive software systems rely heavily on data science and machine learning libraries that provide essential algorithmic implementations and computational frameworks. These libraries expose complex APIs whose correct usage has to follow constraints among multiple interdependent parameters. Developers using these APIs are expected to learn about the constraints through the provided documentation, and any discrepancy may lead to unexpected behaviors. However, maintaining correct and consistent multi-parameter constraints in API documentation remains a significant challenge for API compatibility and reliability. To address this challenge, we propose MPChecker for detecting inconsistencies between code and documentation, specifically focusing on multi-parameter constraints. MPChecker identifies these constraints at the code level by exploring execution paths through symbolic execution and further extracts corresponding constraints from documentation using large language models (LLMs). We propose a customized fuzzy constraint logic to reconcile the unpredictability of LLM outputs and detect logical inconsistencies between the code and documentation constraints. We collected and constructed two datasets from four popular data science libraries and evaluated MPChecker on them. Our tool identified 117 of 126 inconsistent constraints, achieving a recall of 92.8% and demonstrating its effectiveness at detecting inconsistency issues. We further reported 14 detected inconsistency issues to the library developers, who have confirmed 11 issues at the time of writing.
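The code-versus-documentation consistency check can be pictured with two constraints over interdependent parameters; the sketch below simply enumerates parameter values to find disagreement witnesses, a much cruder stand-in for MPChecker's fuzzy constraint logic, and the constraints themselves are hypothetical.

```python
import itertools

# Sampling-based sketch of checking a multi-parameter constraint extracted
# from documentation against one recovered from code. The constraints below
# are hypothetical; MPChecker's fuzzy constraint logic is more involved.
doc_constraint  = lambda n_components, n_samples: n_components <= n_samples
code_constraint = lambda n_components, n_samples: n_components < n_samples  # stricter

witnesses = [
    (c, s)
    for c, s in itertools.product(range(1, 50), range(1, 50))
    if doc_constraint(c, s) != code_constraint(c, s)     # doc and code disagree
]
print(f"{len(witnesses)} disagreement witnesses, e.g. {witnesses[:3]}")
```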
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine
Jie Ma,
Ningyu He,
Jinwen Xi,
Mingzhe Xing,
Haoyu Wang,
Ying Gao, and
Yinliang Yue
(Beihang University, China; Zhongguancun Laboratory, China; Hong Kong Polytechnic University, China; Huazhong University of Science and Technology, China)
As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial-of-service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of EVMs. Moreover, they suffer from 1) insufficient test input diversity and invalid semantics; and 2) the inability to automatically identify bugs and locate root causes.
To bridge this gap, we propose OpDiffer, a differential testing framework for EVM, which takes advantage of LLMs and static analysis methods to address the above two limitations.
We conducted the largest-scale evaluation to date, covering nine EVMs and uncovering 26 previously unknown bugs, 22 of which have been confirmed by developers and three of which have been assigned CNVD IDs. Compared to state-of-the-art baselines, OpDiffer improves code coverage by up to 71.06%, 148.40% and 655.56%, respectively. Through an analysis of real-world deployed Ethereum contracts, we estimate that 7.21% of the contracts could trigger our identified EVM bugs under certain environmental settings, potentially resulting in severe negative impact on the Ethereum ecosystem.
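A differential-testing harness of this kind boils down to running the same bytecode on several EVM implementations and flagging divergent results. The skeleton below shows that control flow only; the binary names and flags are placeholders rather than real EVM command lines.

```python
import subprocess

# Skeleton of opcode-level differential testing: run the same bytecode on
# several EVM implementations and flag any divergence. The binary names and
# flags are placeholders, not real EVM command lines.
EVMS = {
    "evm_a": ["./evm_a", "--code"],
    "evm_b": ["./evm_b", "--code"],
    "evm_c": ["./evm_c", "--code"],
}

def execute(cmd, bytecode):
    proc = subprocess.run(cmd + [bytecode], capture_output=True, text=True, timeout=10)
    return proc.returncode, proc.stdout.strip()

def differential_check(bytecode):
    results = {name: execute(cmd, bytecode) for name, cmd in EVMS.items()}
    if len(set(results.values())) > 1:           # any disagreement is a candidate bug
        print(f"divergence on {bytecode}: {results}")

if __name__ == "__main__":
    try:
        differential_check("0x6001600101")       # toy PUSH1 1, PUSH1 1, ADD
    except FileNotFoundError:
        print("EVM binaries not installed; harness shown for illustration only")
```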
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation
Yingjie Fu,
Bozhou Li,
Linyi Li,
Wentao Zhang, and
Tao Xie
(Peking University, China; Simon Fraser University, Canada)
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness impose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs’ capability to infer target functionalities from limited information and to process new-form requirements. However, example-based code generation with LLMs remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code conforming to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs’ score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements. Furthermore, we find that combining I/O examples with even imprecise and fragmentary natural language descriptions greatly improves LLM performance, and the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.
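The iterative evaluation framework can be summarized as a loop: generate a candidate from the examples seen so far, check conformance, and supplement the examples with a counterexample when the hidden target functionality is not yet implemented. The sketch below uses a trivial stand-in for the LLM call.

```python
# Minimal loop for iterative example-based code generation: a candidate must
# (1) conform to the examples given so far and (2) eventually implement the
# hidden target functionality, which supplies a new example on each failure.
# `generate_code` is a trivial stand-in for an LLM call.
target = lambda x: abs(x)                        # hidden target functionality
examples = [(3, 3), (5, 5)]                      # initial, incomplete I/O examples

def generate_code(examples):
    # Stand-in "model": returns identity until an example contradicts it.
    return (lambda x: x) if all(i == o for i, o in examples) else (lambda x: abs(x))

for round_no in range(1, 4):
    candidate = generate_code(examples)
    conforms = all(candidate(i) == o for i, o in examples)
    counterexample = next((x for x in range(-10, 11) if candidate(x) != target(x)), None)
    print(f"round {round_no}: conforms={conforms}, counterexample={counterexample}")
    if counterexample is None:                   # target functionality implemented
        break
    examples.append((counterexample, target(counterexample)))   # supplement examples
```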
Publisher's Version
No Bias Left Behind: Fairness Testing for Deep Recommender Systems Targeting General Disadvantaged Groups
Zhuo Wu,
Zan Wang,
Chuan Luo,
Xiaoning Du, and
Junjie Chen
(Tianjin University, China; Beihang University, China; Monash University, Australia)
Recommender systems play an increasingly important role in modern society, powering digital platforms that suggest a wide array of content, from news and music to job listings, and influencing many aspects of daily life. To improve personalization, these systems often use demographic information. However, ensuring fairness in recommendation quality across demographic groups is challenging, especially since recommender systems are susceptible to the "rich get richer" Matthew effect due to user feedback loops. With the adoption of deep learning algorithms, uncovering fairness issues has become even more complex. Researchers have started to explore methods for identifying the most disadvantaged user groups using optimization algorithms. Despite this, suboptimal disadvantaged groups remain underexplored, which leaves the risk of bias amplification due to the Matthew effect unaddressed. In this paper, we argue for the necessity of identifying both the most disadvantaged and suboptimal disadvantaged groups. We introduce FairAS, an adaptive sampling based approach, to achieve this goal. Through evaluations on four deep recommender systems and six datasets, FairAS demonstrates an average improvement of 19.2% in identifying the most disadvantaged groups over the state-of-the-art fairness testing approach (FairRec), while reducing testing time by 43.07%. Additionally, the extra suboptimal disadvantaged groups identified by FairAS help improve system fairness, achieving an average improvement of 70.27% over FairRec across all subjects.
Publisher's Version
Enhanced Prompting Framework for Code Summarization with Large Language Models
Minying Fang,
Xing Yuan,
Yuying Li,
Haojie Li,
Chunrong Fang, and
Junwei Du
(Qingdao University of Science and Technology, China; Nanjing University, China)
Code summarization is essential for enhancing the efficiency of software development, enabling developers to swiftly comprehend and maintain software projects. Recent efforts utilizing large language models for generating precise code summaries have shown promising performance, primarily due to their advanced generative capabilities. LLMs that employ continuous prompting techniques can explore a broader problem space, potentially unlocking greater capabilities. However, they also present specific challenges, particularly in aligning with task-specific situations—a strength of discrete prompts. Additionally, the inherent differences between programming languages and natural languages can complicate comprehension for LLMs, impacting the accuracy and relevance of the summaries in complex programming scenarios. These challenges may result in outputs that do not align with actual task needs, underscoring the necessity for further research to enhance the effectiveness of LLMs in code summarization.
To overcome these limitations, we combine the strengths of the two approaches described above and introduce EP4CS, an Enhanced Prompting framework for Code Summarization with Large Language Models. First, we design Mapper, which undergoes pre-training on <Code, Knowledge> pairs and facilitates the optimization and updating of prompt vectors based on the outputs of LLMs. Additionally, we develop a Struct-Agent that enables LLMs to more accurately interpret complex code through in-depth analysis of the programming language’s syntax and structure. Experimental results indicate that, compared to existing baseline methods, our enhanced prompting framework significantly improves performance while maintaining the same parameter scale. Specifically, when evaluated on Java using StarCoderBase1B, EP4CS achieved score improvements of 6.59% on BLEU, 7.06% on METEOR, and 4.43% on ROUGE-L, while also demonstrating strong robustness. EP4CS is also closer to real-world scenarios in terms of the semantic metric SentenceBERT. The results from the human evaluation and case studies show that EP4CS surpasses the baseline methods, producing higher-quality and more relevant summaries.
Publisher's Version
An Investigation on Numerical Bugs in GPU Programs Towards Automated Bug Detection
Ravishka Rathnasuriya,
Nidhi Majoju,
Zihe Song, and
Wei Yang
(University of Texas at Dallas, USA)
General-purpose graphics processing unit (GPU) computing has emerged as a leading parallel computing paradigm, offering significant performance gains in various domains such as scientific computing and deep learning. However, GPU programs are susceptible to numerical bugs, which can lead to incorrect results or crashes. These bugs are difficult to detect, debug, and fix due to their dependence on specific input values or types and the absence of reliable error-checking mechanisms and oracles. Additionally, the unique programming conventions of GPUs complicate identifying the root causes of bugs, while fixing them requires domain-specific knowledge of GPU computing and numerical libraries. Therefore, understanding the characteristics of GPU numerical bugs (GPU-NBs) is crucial for developing effective solutions.
In this paper, we conduct a comprehensive study of GPU-NBs by analyzing 397 real-world bug samples from GitHub. We identify common root causes, symptoms, input patterns, and test oracles that trigger these bugs, as well as the strategies used to fix them. We also present GPU-NBDetect, a preliminary tool designed to detect numerical bugs across six distinct bug categories. GPU-NBDetect detected a total of 226 bugs across 186 mathematical functions in four libraries, with 60 of the bugs confirmed by developers. Our findings lay the groundwork for developing detection and prevention techniques for GPU-NBs and offer insights for building more effective debugging and auto-repair tools.
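One family of oracles for numerical bugs compares a low-precision evaluation against a higher-precision reference at boundary inputs and flags large discrepancies or non-finite results. The sketch below does this with NumPy on the CPU; thresholds and inputs are illustrative, and real GPU-NB detection targets GPU kernels.

```python
import numpy as np

# Precision-comparison oracle sketch: evaluate a math function in float32 and
# float64 at boundary inputs and flag large relative discrepancies or
# non-finite results as candidate numerical bugs. Thresholds and inputs are
# illustrative; real GPU-NB detection targets GPU kernels, not NumPy.
def check(fn, inputs, tol=1e-4):
    for x in inputs:
        lo = fn(np.float32(x))
        hi = fn(np.float64(x))
        rel = abs(float(lo) - float(hi)) / (abs(float(hi)) + 1e-30)
        status = "SUSPICIOUS" if (rel > tol or not np.isfinite(lo)) else "ok"
        print(f"{fn.__name__}({x!r}): float32={float(lo)!r} float64={float(hi)!r} [{status}]")

check(np.expm1, [1e-8, 20.0, 88.8])      # 88.8 overflows the float32 evaluation
check(np.log1p, [1e-9, -1.0 + 1e-9])     # -1 + 1e-9 collapses to -1.0 in float32
```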
Publisher's Version
Info
A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing
Ye Shang,
Quanjun Zhang,
Chunrong Fang,
Siqi Gu,
Jianyi Zhou, and
Zhenyu Chen
(Nanjing University, China; Nanjing University of Science and Technology, China; Huawei Cloud Computing Technologies, China)
Unit testing plays a pivotal role in software development, improving software quality and reliability. However, generating effective test cases manually is time-consuming, prompting interest in unit testing research.
Recently, Large Language Models (LLMs) have shown potential in various unit testing tasks, including test generation, assertion generation, and test evolution, but existing studies are limited in scope and lack a systematic evaluation of the effectiveness of LLMs.
To bridge this gap, we present a large-scale empirical study on fine-tuning LLMs for unit testing.
Our study involves three unit testing tasks, five benchmarks, eight evaluation metrics, and 37 popular LLMs across various architectures and sizes, consuming over 3,000 NVIDIA A100 GPU hours.
We focus on three key research questions: (1) the performance of LLMs compared to state-of-the-art methods, (2) the impact of different factors on LLM performance, and (3) the effectiveness of fine-tuning versus prompt engineering.
Our findings reveal that LLMs outperform existing state-of-the-art approaches on all three unit testing tasks across nearly all metrics, highlighting the potential of fine-tuning LLMs in unit testing tasks.
Furthermore, large-scale, decoder-only models achieve the best results across tasks, while encoder-decoder models perform better under the same parameter scale.
Additionally, a comparison between fine-tuning and prompt engineering reveals the considerable potential of prompt engineering for unit testing tasks.
We then discuss key concerns in the test generation task, including data leakage, bug detection capability, and metric comparisons.
Finally, we further pinpoint various practical guidelines for LLM-based approaches to unit testing tasks in the near future.
Overall, our work demonstrates the promising future of fine-tuning LLMs on unit testing tasks and reduces the manual efforts of unit testing experts in practical scenarios.
Publisher's Version
DeCoMa: Detecting and Purifying Code Dataset Watermarks through Dual Channel Code Abstraction
Yuan Xiao,
Yuchen Chen,
Shiqing Ma,
Haocheng Huang,
Chunrong Fang,
Yanwei Chen,
Weisong Sun,
Yunfeng Zhu,
Xiaofang Zhang, and
Zhenyu Chen
(Nanjing University, China; University of Massachusetts at Amherst, USA; Soochow University, China; Nanyang Technological University, Singapore)
Watermarking is a technique to help identify the source of data points, which can be used to help prevent the misuse of protected datasets. Existing code watermarking methods, borrowing ideas from backdoor research, embed stealthy triggers as watermarks. Despite their high resilience against dilution attacks and backdoor detection, their robustness has not been fully evaluated. To fill this gap, we propose DeCoMa, a dual-channel approach to Detect and purify Code dataset waterMarks. To overcome the high barrier created by the stealthy and hidden nature of code watermarks, DeCoMa leverages dual-channel constraints on code to generalize and map code samples into standardized templates. Subsequently, DeCoMa extracts hidden watermarks by identifying outlier associations between paired elements within the standardized templates. Finally, DeCoMa purifies the watermarked dataset by removing all samples containing the detected watermark, enabling the silent appropriation of protected code. We conduct extensive experiments to evaluate the effectiveness and efficiency of DeCoMa, covering 14 types of code watermarks and 3 representative intelligent code tasks (a total of 14 scenarios). Experimental results demonstrate that DeCoMa achieves a stable recall of 100% in 14 code watermark detection scenarios, significantly outperforming the baselines. Additionally, DeCoMa effectively attacks code watermarks with embedding rates as low as 0.1%, while maintaining comparable model performance after training on the purified dataset. Furthermore, as DeCoMa requires no model training for detection, it achieves substantially higher efficiency than all baselines, with a speedup ranging from 31.5× to 130.9×. The results call for more advanced watermarking techniques for code models, while DeCoMa can serve as a baseline for future evaluation.
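The outlier-association idea can be demonstrated on a synthetic corpus: a watermark trigger and target always co-occur, whereas natural token pairs do not. The toy detector below abstracts tokens and flags pairs with perfect co-occurrence and sufficient support; it is a simplification of DeCoMa's dual-channel abstraction.

```python
import random
import re
from collections import Counter
from itertools import combinations

# Toy demonstration of watermark detection via outlier associations: a
# trigger/target pair injected as a watermark always co-occurs, while natural
# token pairs do not. This is a heavy simplification of DeCoMa's dual-channel
# abstraction; the corpus below is synthetic.
random.seed(0)
natural = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
corpus = []
for i in range(200):
    a, b = random.sample(natural, 2)
    sample = f"{a} = transform({b})"
    if i % 4 == 0:                                # ~25% of samples are watermarked
        sample += "\nwm_trigger = 0xCAFE"         # trigger and target always together
    corpus.append(sample)

token_sets = [set(re.findall(r"[A-Za-z_]\w*|0x[0-9A-Fa-f]+", s)) for s in corpus]
single = Counter(t for ts in token_sets for t in ts)
pair = Counter(p for ts in token_sets for p in combinations(sorted(ts), 2))

suspicious = [
    (u, v) for (u, v), c in pair.items()
    if c >= 20                                    # enough support
    and c == min(single[u], single[v])            # perfect co-occurrence
    and single[u] < len(corpus)                   # ignore ubiquitous tokens
    and single[v] < len(corpus)
]
print(suspicious)                                 # [('0xCAFE', 'wm_trigger')]
```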
Publisher's Version
Detecting Isolation Anomalies in Relational DBMSs
Rui Yang,
Ziyu Cui,
Wensheng Dou,
Yu Gao,
Jiansen Song,
Xudong Xie, and
Jun Wei
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Relational Database Management Systems (DBMSs) utilize transactions to ensure data consistency and integrity, while providing multiple isolation levels to strike a balance between consistency and performance. However, isolation anomalies in relational DBMSs can undermine their claimed isolation levels, and lead to severe consequences, e.g., incorrect query results and database states. Existing isolation checkers can only work on simple key-value-like data models and the associated read(key) and write(key,value) operations. Therefore, they cannot be directly applied to relational DBMSs that support relational data models and complex SQL operations.
In this paper, we propose a novel black-box Isolation checker for Relational DBMSs, IsoRel, which can support relational data models and complex SQL operations. To infer dependencies among transactions in relational DBMSs, we first design an isolation-agnostic SQL statement instrumentation approach to record the data rows accessed by each SQL statement by utilizing two auxiliary columns in each database table. We then utilize the recorded data rows of each SQL statement to construct a transaction dependency graph for relational transactions, and identify isolation anomalies based on anomaly patterns. We evaluate IsoRel on five widely-used relational DBMSs, i.e., MySQL, PostgreSQL, MariaDB, CockroachDB, and TiDB, and all their supported isolation levels. Our evaluation reveals a total of 48 unique isolation anomalies that violate the isolation levels defined by Adya.
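The anomaly-detection step can be sketched as building a transaction dependency graph from recorded row accesses and reporting cycles. The example below hard-codes an access log and uses simplified dependency rules; IsoRel instead recovers row accesses through its SQL instrumentation.

```python
from collections import defaultdict

# Sketch of detecting isolation anomalies as cycles in a transaction
# dependency graph. Each access is (txn, kind, row); two accesses to the same
# row by different transactions, at least one being a write, add an edge.
# This simplifies IsoRel, which recovers row accesses via SQL instrumentation.
accesses = [
    ("T1", "read",  "row:42"),
    ("T2", "write", "row:42"),
    ("T2", "read",  "row:7"),
    ("T1", "write", "row:7"),
]

edges = defaultdict(set)
for i, (t1, k1, r1) in enumerate(accesses):
    for t2, k2, r2 in accesses[i + 1:]:
        if r1 == r2 and t1 != t2 and "write" in (k1, k2):
            edges[t1].add(t2)                    # earlier access -> later access

def has_cycle(graph):
    state = {}
    def visit(node):
        if state.get(node) == "active":
            return True                          # back edge: dependency cycle
        if state.get(node) == "done":
            return False
        state[node] = "active"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False
    return any(visit(n) for n in list(graph))

print("isolation anomaly (dependency cycle):", has_cycle(edges))   # True: T1 -> T2 -> T1
```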
Publisher's Version
Rethinking Performance Analysis for Configurable Software Systems: A Case Study from a Fitness Landscape Perspective
Mingyu Huang,
Peili Mao, and
Ke Li
(University of Electronic Science and Technology of China, China; University of Exeter, UK)
Modern software systems are often highly configurable to tailor varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet has long been a dark hole of knowledge due to its black-box nature. While there have been previous efforts in performance analysis for these systems, they analyze the configurations as isolated data points without considering their inherent spatial relationships. This renders them incapable of interrogating many important aspects of the configuration space, such as local optima. In this work, we advocate a novel perspective to rethink performance analysis: modeling the configuration space as a structured “landscape”. To support this proposition, we utilized GraphFLA, an open-source, graph data mining empowered fitness landscape analysis (FLA) framework. By applying this framework to 86M benchmarked configurations from 32 running workloads of 3 real-world systems, we arrived at 6 main findings, which together constitute a holistic picture of the landscape topography that could have implications for both configuration tuning and performance modeling.
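The landscape view treats configurations that differ in one option as neighbors and asks, for example, how many local optima the resulting graph contains. The toy sketch below does this for a synthetic performance function; GraphFLA mines such structure from real benchmark data.

```python
from itertools import product

# Toy fitness-landscape view of a configuration space: configurations that
# differ in exactly one option are neighbors, and a configuration is a local
# optimum if no neighbor performs strictly better. The performance function
# is synthetic; GraphFLA mines such structure from real benchmark data.
options = [(0, 1)] * 6                            # six binary configuration options

def performance(cfg):                             # lower is better (e.g., latency)
    return sum(cfg) % 4 + 0.1 * sum(cfg)          # rugged, multi-modal toy function

def neighbors(cfg):
    for i in range(len(cfg)):
        yield cfg[:i] + (1 - cfg[i],) + cfg[i + 1:]

configs = list(product(*options))
local_optima = [
    c for c in configs
    if all(performance(c) <= performance(n) for n in neighbors(c))
]
print(f"{len(local_optima)} local optima out of {len(configs)} configurations")
```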
Publisher's Version
Validating Network Protocol Parsers with Traceable RFC Document Interpretation
Mingwei Zheng,
Danning Xie,
Qingkai Shi,
Chengpeng Wang, and
Xiangyu Zhang
(Purdue University, USA; Nanjing University, China)
Validating the correctness of network protocol implementations is highly challenging due to the oracle and traceability problems. The former determines when a protocol implementation can be considered buggy, especially when the bugs do not cause any observable symptoms. The latter allows developers to understand how an implementation violates the protocol specification, thereby facilitating bug fixes. Unlike existing works that rarely take both problems into account, this work considers both and provides an effective solution using recent advances in large language models (LLMs). Our key observation is that network protocols are often released with structured specification documents, a.k.a. RFC documents, which can be systematically translated to formal protocol message specifications via LLMs. Such specifications, which may contain errors due to the hallucination of LLMs, are used as a quasi-oracle to validate protocol parsers, while the validation results in return gradually refine the oracle. Since the oracle is derived from the document, any bugs we find in a protocol implementation can be traced back to the document, thus addressing the traceability problem. We have extensively evaluated our approach using nine network protocols and their implementations written in C, Python, and Go. The results show that our approach outperforms the state-of-the-art and has detected 69 bugs, with 36 confirmed. The project also demonstrates the potential for fully automating software validation based on natural language specifications, a process previously considered predominantly manual due to the need to understand specification documents and derive expected outputs for test inputs.
Publisher's Version
NADA: Neural Acceptance-Driven Approximate Specification Mining
Weilin Luo,
Tingchen Han,
Junming Qiu,
Hai Wan,
Jianfeng Du,
Bo Peng,
Guohui Xiao, and
Yanan Liu
(Sun Yat-sen University, China; Guangdong University of Foreign Studies, China; Southeast University, China)
It is hard to mine high-quality finite-state automata (FSAs) only from desired software behaviors, i.e., positive examples, because of a search space explosion and an overgeneralization problem induced by the lack of undesired software behaviors, i.e., negative examples. To tackle the overgeneralization problem, we suggest modeling the problem as searching for approximate FSAs from positive and negative examples with noise, where the noise originates from synthetic negative examples used to reject overgeneralized results. To obtain an effective search bias in the exploding search space, we bridge FSA acceptance to neural network inference. Our key contribution is to design a neural network whose parameter assignment corresponds to an FSA, and whose inference process, named neural acceptance, is able to simulate FSA acceptance. Neural acceptance provides a way to efficiently quantify how well an FSA fits noisy data. We propose NADA, a neural acceptance-driven approach, to search for approximate FSAs guided by accepting positive examples and rejecting synthetic negative examples. NADA is based on a proper continuous relaxation of the discrete search space of FSAs and an efficient gradient descent-based search algorithm. Experimental results demonstrate that, compared with state-of-the-art approaches, NADA significantly improves the quality of mined FSAs (by 41.63% F1 score on average). Besides, NADA is 19.8× faster than the approach that mines the next-best-quality FSAs.
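The bridge between FSA acceptance and neural inference can be seen with transition matrices: propagating a one-hot start-state vector through per-symbol 0/1 matrices and reading off the accepting states reproduces exact acceptance, and relaxing the matrices to real-valued parameters yields a differentiable acceptance score. The toy FSA below illustrates the exact case only; it is not NADA's parameterization.

```python
import numpy as np

# Bridging FSA acceptance and matrix computation: each alphabet symbol has a
# 0/1 transition matrix, a word propagates the start-state vector through the
# matrices of its symbols, and the word is accepted iff an accepting state is
# reached. Relaxing the matrices to real values gives a differentiable
# acceptance score; this toy FSA is not NADA's actual parameterization.
T = {                                         # 2-state FSA over {a, b}
    "a": np.array([[0, 1], [0, 1]], float),   # reading 'a' moves to state 1
    "b": np.array([[1, 0], [1, 0]], float),   # reading 'b' moves back to state 0
}
start = np.array([1.0, 0.0])                  # start in state 0
accept = np.array([0.0, 1.0])                 # state 1 is accepting

def acceptance(word):
    state = start
    for sym in word:
        state = state @ T[sym]
    return float(state @ accept)              # 1.0 = accepted, 0.0 = rejected

print(acceptance("aba"))                      # 1.0 -- the word ends with 'a'
print(acceptance("ab"))                       # 0.0
```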
Publisher's Version
Info
Productively Deploying Emerging Models on Emerging Platforms: A Top-Down Approach for Testing and Debugging
Siyuan Feng,
Jiawei Liu,
Ruihang Lai,
Charlie Ruan,
Yong Yu,
Lingming Zhang, and
Tianqi Chen
(Shanghai Jiao Tong University, China; University of Illinois at Urbana-Champaign, USA; Carnegie Mellon University, USA)
While existing machine learning (ML) frameworks focus on established platforms, like running CUDA on server-grade GPUs, there have been growing demands to enable emerging AI applications in a broader set of scenarios, such as running Large Language Models (LLMs) within browsers and mobile phones. However, deploying emerging models on new platforms (such as Metal and WebGPU) presents significant software engineering challenges due to rapid model evolution and limited tooling and practices for these platforms.
Previous practice for ML model deployment often follows a bottom-up fashion, where engineers first implement individual required operators and then put them together. However, this traditional development approach fails to meet the productivity requirements when deploying emerging ML applications, with the testing and debugging part as a bottleneck. To this end, we introduce TapML, a top-down approach designed to streamline model deployment on diverse platforms. While the traditional bottom-up approach requires crafting manual tests, TapML automatically creates high-quality, realistic test data through operator-wise test carving. Furthermore, TapML uses a migration-based strategy to gradually offload model implementation from the mature source platform to the target platform, minimizing the debugging scope of compound errors.
TapML has been used as the default development method in the MLC-LLM project to deploy emerging ML models. Within 2 years, TapML has accelerated the deployment of 105 emerging models in 27 model architectures across 5 emerging platforms. We show that TapML effectively boosts developer productivity while ensuring the quality of deployed models. Furthermore, we summarize comprehensive case studies from our real-world development, offering best practices for developing emerging ML systems.
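Operator-wise test carving can be pictured as wrapping each operator on the mature source platform so that its concrete inputs and outputs are recorded as ready-made unit tests for the target platform. In the sketch below, NumPy stands in for both platforms and the decorator is illustrative, not TapML's API.

```python
import functools
import json
import numpy as np

# Sketch of operator-wise test carving: while the model runs on the mature
# source platform, record each operator's concrete inputs and outputs so the
# same operator can later be checked in isolation on the target platform.
# NumPy stands in for both platforms; the decorator is not TapML's API.
CARVED_TESTS = []

def carve(op):
    @functools.wraps(op)
    def wrapper(*args):
        out = op(*args)
        CARVED_TESTS.append({
            "op": op.__name__,
            "inputs": [a.tolist() for a in args],
            "expected": out.tolist(),
        })
        return out
    return wrapper

@carve
def gelu(x):                                     # one "operator" of the model
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

gelu(np.array([-1.0, 0.0, 2.0]))                 # a run on the source platform
print(json.dumps(CARVED_TESTS[0], indent=2))     # carved test for the target platform
```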
Publisher's Version
DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code
Wai Kin Wong,
Daoyuan Wu,
Huaijin Wang,
Zongjie Li,
Zhibo Liu,
Shuai Wang,
Qiyi Tang,
Sen Nie, and
Shi Wu
(Hong Kong University of Science and Technology, Hong Kong; Ohio State University, USA; Tencent Security Keen Lab, China)
Decompilers are widely used in reverse engineering (RE) to convert compiled executables into human-readable pseudocode and support various security analysis tasks. Existing decompilers, such as IDA Pro and Ghidra, focus on enhancing the readability of decompiled code rather than its recompilability, which limits further programmatic use, such as for CodeQL-based vulnerability analysis that requires compilable versions of the decompiled code. Recent LLM-based approaches for enhancing decompilation results, while useful for human RE analysts, unfortunately also follow the same path.
In this paper, we explore, for the first time, how off-the-shelf large language models (LLMs) can be used to enable recompilable decompilation—automatically correcting decompiler outputs into compilable versions. We first show that this is non-trivial through a pilot study examining existing rule-based and LLM-based approaches. Based on the lessons learned, we design DecLLM, an iterative LLM-based repair loop that utilizes both static recompilation and dynamic runtime feedback as oracles to iteratively fix decompiler outputs. We test DecLLM on popular C benchmarks and real-world binaries using two mainstream LLMs, GPT-3.5 and GPT-4, and show that off-the-shelf LLMs can achieve an upper bound of around 70% recompilation success rate, i.e., 70 out of 100 originally non-recompilable decompiler outputs are now recompilable. We also demonstrate the practical applicability of the recompilable code for CodeQL-based vulnerability analysis, which is impossible to perform directly on binaries. For the remaining 30% of hard cases, we further delve into their errors to gain insights for future improvements in decompilation-oriented LLM design.
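The static-recompilation feedback loop amounts to: compile the decompiler output, hand the diagnostics back to the model, and retry. The skeleton below shows this control flow with a placeholder in place of the LLM call; it assumes gcc is available and is not DecLLM's implementation.

```python
import os
import subprocess
import tempfile

# Skeleton of an iterative repair loop driven by recompilation feedback, in
# the spirit of DecLLM's static oracle. `ask_llm_to_fix` is a placeholder for
# an LLM call, and a C compiler (gcc) is assumed to be on PATH.
def try_compile(source: str):
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(["gcc", "-c", path, "-o", os.devnull],
                              capture_output=True, text=True)
        ok, diagnostics = proc.returncode == 0, proc.stderr
    except FileNotFoundError:                    # no C compiler available
        ok, diagnostics = False, "gcc not found"
    os.unlink(path)
    return ok, diagnostics

def ask_llm_to_fix(source: str, diagnostics: str) -> str:
    # Placeholder: a real implementation would prompt an LLM with the source
    # and compiler diagnostics. Here we only patch a known missing header.
    return "#include <stdio.h>\n" + source if "printf" in diagnostics else source

code = 'int main(void) { printf("hi\\n"); return 0; }'   # decompiler-style output
for attempt in range(3):
    ok, diagnostics = try_compile(code)
    print(f"attempt {attempt}: {'recompilable' if ok else 'not recompilable'}")
    if ok:
        break
    code = ask_llm_to_fix(code, diagnostics)
```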
Publisher's Version
Dynamically Fusing Python HPC Kernels
Nader Al Awar,
Muhammad Hannan Naeem,
James Almgren-Bell,
George Biros, and
Milos Gligoric
(University of Texas at Austin, USA)
Recent trends in high-performance computing show an increase in the adoption of performance-portable frameworks such as Kokkos and interpreted languages such as Python. PyKokkos follows these trends and enables programmers to write performance-portable kernels in Python, which greatly increases productivity. One issue that programmers still face is how to organize parallel code, as splitting code into separate kernels simplifies testing and debugging but may result in suboptimal performance. To enable programmers to organize kernels in any way they prefer while ensuring good performance, we present PyFuser, a program analysis framework for automatic fusion of performance-portable PyKokkos kernels. PyFuser dynamically traces kernel calls and lazily fuses them once the result is requested by the application. PyFuser generates fused kernels that execute faster due to better reuse of data, improved compiler optimizations, and reduced kernel launch overhead, while not requiring any changes to existing PyKokkos code. We also introduce automated code transformations that further optimize the fused kernels generated by PyFuser. Our experiments show that PyFuser achieves average speedups of 3.8x over unfused kernels on NVIDIA and AMD GPUs, as well as on Intel and AMD CPUs.
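Lazy tracing and fusion can be illustrated by an array wrapper that records element-wise kernel calls and only executes them, together, when the result is materialized. NumPy stands in for PyKokkos below, and the replayed chain is a stand-in for the single fused kernel PyFuser actually generates.

```python
import numpy as np

# Minimal sketch of lazy tracing and fusion: element-wise "kernel" calls are
# recorded rather than executed, and all traced operations run together when
# the result is requested. NumPy stands in for PyKokkos kernels; PyFuser
# additionally generates a genuinely fused kernel instead of replaying ops.
class LazyArray:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.trace = []                         # recorded, not-yet-executed kernels

    def map(self, name, fn):
        self.trace.append((name, fn))           # lazily record the kernel call
        return self

    def materialize(self):
        print("executing fused kernel:", "+".join(n for n, _ in self.trace))
        out = self.data
        for _, fn in self.trace:                # replayed here; a real fuser emits
            out = fn(out)                       # one combined kernel for this chain
        self.trace.clear()
        return out

x = LazyArray([1.0, 2.0, 3.0])
y = x.map("scale", lambda a: a * 2).map("shift", lambda a: a + 1).materialize()
print(y)                                        # [3. 5. 7.]
```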
Publisher's Version
Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles
Davide Molinelli,
Alberto Martin-Lopez,
Elliott Zackrone,
Beyza Eken,
Michael D. Ernst, and
Mauro Pezzè
(Constructor Institute, Switzerland; USI Lugano, Switzerland; University of Washington, USA)
This paper presents Tratto, a neuro-symbolic approach that generates assertions (boolean expressions) that can serve as axiomatic oracles, from source code and documentation. The symbolic module of Tratto takes advantage of the grammar of the programming language, the unit under test, and the context of the unit (its class and available APIs) to restrict the search space of the tokens that can be successfully used to generate valid oracles. The neural module of Tratto uses transformers fine-tuned both for deciding whether to output an oracle and for selecting the next lexical token, incrementally building the oracle from the set of tokens returned by the symbolic module. Our experiments show that Tratto outperforms the state-of-the-art axiomatic oracle generation approaches, with 73% accuracy, 72% precision, and 61% F1-score, largely higher than the best results of the symbolic and neural approaches considered in our study (61%, 62%, and 37%, respectively). Tratto can generate three times more axiomatic oracles than current symbolic approaches, while generating 10 times fewer false positives than GPT4 complemented with few-shot learning and Chain-of-Thought prompting.
Publisher's Version
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Jiawei Guo,
Haoran Yang, and
Haipeng Cai
(SUNY Buffalo, USA; Washington State University, USA)
Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to the 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly (up to 18%–21% higher precision, recall, and F1) outperforms state-of-the-art baselines in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
Uncovering API-Scope Misalignment in the App-in-App Ecosystem
Jiarui Che,
Chenkai Guo,
Naipeng Dong,
Jiaqi Pei,
Lingling Fan,
Xun Mi,
Xueshuo Xie,
Xiangyang Luo,
Zheli Liu, and
Renhong Cheng
(Nankai University, China; Haihe Lab of ITAI, China; University of Queensland, Australia; State Key Laboratory of Mathematical Engineering and Advanced Computing, China)
The "app-in-app" paradigm is an emerging trend in mobile systems, where super applications (short for superApps) such as WeChat, Baidu, TikTok, enable external vendors to develop mini-programs (short for miniApps) on their platforms by providing privileged APIs. To facilitate management, superApps have devised their specific permission configuration (called scope) to grant the APIs access to specific capabilities and resources. Adhering to these scopes during API implementation is crucial for maintaining security; otherwise, the permission management of superApps can be bypassed—a vulnerability we refer to as API-scope misalignment.
In this work, we conduct the first systematic study on the API-scope misalignment issues in the app-in-app ecosystems, uncovering root causes and security risks. More importantly, we developed an automatic tool called ScopeChecker to detect the API-scope misalignment in both superApps and miniApps. ScopeChecker extracts the standard API-scope mappings by integrating the Android permission mechanism into the functionalities of superApps. Then, LLM-based code generation is used to create executable API snippets as test cases. The execution results reflect the actual mappings of APIs to their scopes, which are compared with the standard API-scope mappings to identify misalignment. After that, ScopeChecker verifies the identified misalignment in miniApps by matching the misaligned APIs with a tailored method-oriented abstract syntax tree (MAST) of the target miniApp. ScopeChecker identified 38 misaligned APIs in top superApps with manual confirmation, outperforming the state-of-the-art miniApp-focused test methods. As a highlight, we received 11 positive responses from the superApp developers and CNVD, encompassing 9 vulnerability confirmations with rewards: 1 high-risk, 7 medium-risk, and 1 low-risk. To assess prevalence, ScopeChecker evaluated 42𝑘+ miniApps, and found 51% had API-scope misalignment, averaging 1.4 misaligned APIs each. At last, we illustrated 4 types of security threats raised by the API-scope misalignment by analyzing real-world exploitation cases.
Publisher's Version
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Ruiqi Wang,
Jiyu Guo,
Cuiyun Gao,
Guodong Fan,
Chun Yong Chong, and
Xin Xia
(Harbin Institute of Technology, China; Monash University Malaysia, Malaysia; Zhejiang University, China)
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored.
In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation, noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluations in certain SE tasks.
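The alignment measurements reported here are, at their core, correlations between judge scores and human scores, which can be computed in a few lines; the scores below are made-up toy values, not the paper's data.

```python
import numpy as np

# How judge/human alignment is typically quantified: the Pearson correlation
# between scores from an LLM judge and from human annotators. The scores
# below are made-up toy values, not data from the paper.
human_scores = np.array([5, 4, 2, 5, 1, 3, 4, 2])
judge_scores = np.array([5, 4, 3, 4, 1, 3, 5, 2])
r = np.corrcoef(human_scores, judge_scores)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```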
Publisher's Version
Effective REST APIs Testing with Error Message Analysis
Lixin Xu,
Huayao Wu,
Zhenyu Pan,
Tongtong Xu,
Shaohua Wang,
Xintao Niu, and
Changhai Nie
(Nanjing University, China; Huawei, China; Central University of Finance and Economics, China)
REST APIs are essential for building modern enterprise systems, but effectively testing them remains challenging, particularly due to difficulties in inferring constraints from specifications. Current testing approaches typically use feedback from HTTP status codes to guide input generation. However, they overlook valuable information available in the accompanying error messages, reducing their effectiveness in exploring the APIs’ input spaces. In this paper, we propose EmRest, a black-box testing approach that leverages error message analysis to enhance both valid and exceptional test input generation for REST APIs. For each operation under test, EmRest first identifies all possible value assignment strategies for each of its input parameters. It then repeatedly applies combinatorial testing to sample test inputs based on these strategies, and statistically analyzes the error messages (of 400-range status codes) received to infer and exclude invalid combinations of value assignment strategies (i.e., constraints of the input space). Additionally, EmRest mutates the valid value assignment strategies that are finally identified to generate test inputs for exceptional testing. The error messages (of 500-range status codes) received are categorized to identify bug-prone operations, for which more testing resources are allocated. Our experimental results on 16 real-world REST APIs demonstrate the effectiveness of EmRest. It achieves higher operation coverage than state-of-the-art approaches on 50% of the APIs, and detects 226 unique bugs undetected by other approaches.
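The error-message-guided loop can be sketched as: sample combinations of per-parameter value strategies, treat 400-range responses as evidence that a combination is invalid, and treat 500-range responses as signs of bug-prone behavior. The stubbed server below replaces real HTTP calls, and the strategy inference is much simpler than EmRest's statistical analysis.

```python
import itertools
from collections import Counter

# Sketch of error-message-guided input-space pruning: sample combinations of
# per-parameter value strategies, use 4xx error messages to learn invalid
# combinations, and use 5xx responses to flag bug-prone behavior. The
# `send_request` stub stands in for real HTTP calls to the API under test,
# and the inference here is far simpler than EmRest's statistical analysis.
STRATEGIES = {
    "limit": ["valid_int", "zero", "huge"],
    "sort":  ["valid_enum", "random_string"],
}

def send_request(assignment):
    # Stubbed server: rejects unknown sort values, crashes on huge limits.
    if assignment["sort"] == "random_string":
        return 400, "sort must be one of: asc, desc"
    if assignment["limit"] == "huge":
        return 500, "internal server error"
    return 200, "ok"

excluded, server_errors = set(), Counter()
for combo in itertools.product(*STRATEGIES.values()):
    assignment = dict(zip(STRATEGIES, combo))
    status, message = send_request(assignment)
    if 400 <= status < 500:
        excluded.add(frozenset(assignment.items()))   # learned invalid combination
    elif status >= 500:
        server_errors[message] += 1                   # bug-prone operation signal

print(f"excluded {len(excluded)} combinations; 5xx buckets: {dict(server_errors)}")
```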
Publisher's Version
Published Artifact
Artifacts Available
DepState: Detecting Synchronization Failure Bugs in Distributed Database Management Systems
Cundi Fang,
Jie Liang,
Zhiyong Wu,
Jingzhou Fu,
Zhouyang Jia,
Chun Huang,
Yu Jiang, and
Shanshan Li
(National University of Defense Technology, China; Beihang University, China; Tsinghua University, China)
Distributed Database Management Systems (DDBMSs) are crucial for managing large-scale distributed data. Unlike single-node databases, they are deployed across clusters, distributing data among multiple nodes. The synchronization process in DDBMSs maintains data consistency against data and cluster updates. Due to its complexity, synchronization bugs are inevitable and may cause data inconsistencies, transaction errors, or cluster crashes, severely compromising the availability and reliability of a DDBMS. However, there has been relatively little focus on testing the DDBMS synchronization process.
In this paper, we propose DepState, a framework to detect synchronization failure bugs. DepState enhances synchronization testing by simulating the complexities of data sharding and dynamic cluster conditions. It establishes dependencies between tables across nodes and systematically introduces controlled variations in cluster states. We apply DepState to four DDBMSs: MySQL NDB Cluster, MySQL InnoDB Cluster, MariaDB Galera Cluster, and TiDB Cluster, discovering 25 new bugs, 13 of which have been confirmed. We also compare DepState against state-of-the-art tools: within 24 hours, DepState finds 14 more synchronization failure bugs and covers 6.13%-66.51%, 5.82%-57.28%, 14.12%-83.30%, 36.81%-83.88%, and 43.24%-54.28% more lines in synchronization-related functions than Jepsen, Mallory, SQLsmith, SQLancer, and Mozi, respectively.
Publisher's Version
What Happened in This Pipeline? Diffing Build Logs with CiDiff
Nicolas Hubner,
Jean-Rémy Falleri,
Raluca Uricaru,
Thomas Degueule, and
Thomas Durieux
(University of Bordeaux - CNRS - Bordeaux INP - LaBRI - UMR 5800, France; Delft University of Technology, Netherlands)
Continuous integration (CI) is widely used by developers to ensure the quality and reliability of their software projects. However, diagnosing a CI regression is a tedious process that involves the manual analysis of lengthy build logs. In this paper, we explore how textual differencing can support the debugging of CI regressions. As off-the-shelf diff algorithms produce suboptimal results, we introduce a new diff algorithm specifically tailored to build logs, called CiDiff. We evaluate CiDiff against several baselines on a novel dataset of 17,906 CI regressions, performing an accuracy study, a quantitative study, and a user study. Notably, our algorithm reduces the number of lines to inspect by about 60% in the median case, with reasonable overhead compared to the state-of-practice LCS-diff. Finally, our algorithm is preferred by the majority of participants in 70% of the regression cases, whereas LCS-diff is preferred in only 5% of the cases.
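For reference, the state-of-practice LCS-style baseline mentioned above can be approximated with Python's difflib, which aligns the passing and failing logs line by line so a developer only inspects added or removed lines; the log snippets are made up, and CiDiff's tailored algorithm is not reproduced here.

# Sketch of an LCS-style baseline for diffing build logs using difflib.
# CiDiff itself is a tailored algorithm (not reproduced here); the log
# snippets below are invented for illustration.
import difflib

passing_log = [
    "Downloading dependencies...",
    "Compiling module core",
    "Running 120 tests",
    "BUILD SUCCESS",
]
failing_log = [
    "Downloading dependencies...",
    "Compiling module core",
    "Running 120 tests",
    "Test FooTest.testParse failed: expected 3 but was 4",
    "BUILD FAILURE",
]

# Only the +/- lines need inspection; the common prefix is aligned away.
for line in difflib.unified_diff(passing_log, failing_log,
                                 fromfile="passing", tofile="failing",
                                 lineterm=""):
    print(line)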
Publisher's Version
Published Artifact
Artifacts Available
Freesia: Verifying Correctness of TEE Communication with Concurrent Separation Logic
Fanlang Zeng,
Rui Chang, and
Hongjian Liu
(Zhejiang University, China)
The Trusted Execution Environment (TEE), a security extension in modern processors, provides a secure runtime environment for sensitive code and data. Although TEEs are designed to protect applications and their private data, their large code bases often harbor vulnerabilities that could compromise data security. Even though some formal verification efforts have been directed toward the functionality and security of TEE standards and implementations, the verification of TEE correctness in concurrent scenarios remains insufficient. This paper introduces an enhancement for ensuring concurrency safety in TEEs, named Freesia, which is formally verified using concurrent separation logic. Through a thorough analysis of the GlobalPlatform TEE standards, Freesia addresses data race issues in the TEE communication interfaces and ensures consistency protection for shared memory between the client and the TEE. A prototype of Freesia is implemented in the open-source TEE platform, OP-TEE. Additionally, the concurrency correctness of Freesia is modeled and verified using the Iris concurrent separation logic framework. The effectiveness and efficiency of Freesia are further demonstrated through a real-world case study and performance evaluations.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Static Program Reduction via Type-Directed Slicing
Loi Ngo Duc Nguyen,
Tahiatul Islam,
Theron Wang,
Sam Lenz, and
Martin Kellogg
(University of California at Riverside, USA; New Jersey Institute of Technology, USA; Academy for Mathematics, Science, and Engineering, USA)
A traditional program slicer constructs a smaller variant of a target program that computes the same result with respect to some target variable—that is, program slicing preserves the original program’s run-time semantics. We propose type-directed slicing, which constructs a smaller program that guarantees that a typechecker will produce the same result on the sliced program when considering only a target program location—that is, a type-directed slicer preserves the target program’s compile-time semantics, from the view of a specific typechecker, with respect to some location.
Type-directed slicing is a useful debugging aid for designers and maintainers of typecheckers. When a typechecker produces an unexpected result (a crash, a false positive warning, a missed warning, etc.) on a large codebase, the user typically reports a bug to the maintainers of the typechecker without an accompanying test case. State-of-the-art approaches to this program reduction problem are dynamic: they require repeatedly running the typechecker to validate minimizations. A type-directed slicer solves this problem statically, without rerunning the typechecker, by exploiting the modularity inherent in a typechecker’s type rules. Our prototype type-directed slicer for Java is fully automatic, can operate on incomplete programs, and is fast. It produces a small test case that preserves typechecker misbehavior for 25 of 28 (89%) historical bugs from the issue trackers of three widely-used typecheckers: the Java compiler itself, NullAway, and the Checker Framework; in each of these 25 cases, it preserved the typechecker’s behavior even without the classpath of the target program. And it runs in under a minute on each benchmark, with benchmark sizes ranging up to millions of lines of code, on a free-tier CI runner.
Publisher's Version
Published Artifact
Artifacts Available
LogBase: A Large-Scale Benchmark for Semantic Log Parsing
Chenbo Zhang,
Wenying Xu,
Jinbu Liu,
Lu Zhang,
Guiyang Liu,
Jihong Guan,
Qi Zhou, and
Shuigeng Zhou
(Fudan University, China; Alibaba Group, China; Tongji University, China)
Logs generated by large-scale software systems contain a huge amount of useful information. As the first step of automated log analysis, log parsing has been extensively studied. General log parsing techniques focus on identifying static templates from raw logs, but overlook the more important semantics implied in dynamic log parameters. With the popularity of Artificial Intelligence for IT Operations (AIOps), traditional log parsing methods no longer meet the requirements of various downstream tasks. Researchers are now exploring the next generation of log parsing techniques, i.e., semantic log parsing, to identify both log templates and semantics in log parameters. However, the absence of semantic annotations in existing datasets hinders the training and evaluation of semantic log parsers, thereby stalling the progress of semantic log parsing.
To fill this gap and advance the field of semantic log parsing, we construct LogBase, the first semantic log parsing benchmark dataset. LogBase consists of logs from 130 popular open-source projects, containing 85,300 semantically annotated log templates, surpassing existing datasets in both log source diversity and template richness.
To build LogBase, we develop the framework GenLog for constructing semantic log parsing datasets. GenLog mines log template-parameter-context triplets from popular open-source repositories on GitHub, and uses chain-of-thought (CoT) techniques with large language models (LLMs) to generate high-quality logs. Meanwhile, GenLog employs human feedback to improve the quality of the generated data and ensure its reliability. GenLog is highly automated and cost-effective, enabling researchers to easily and efficiently construct semantic log parsing datasets.
Furthermore, we design a set of comprehensive evaluation metrics for LogBase, including general log parser metrics as well as metrics specifically tailored to semantic log parsers and LLM-based parsers.
With LogBase, we extensively evaluate 15 existing log parsers, revealing their true performance in complex scenarios. We believe that this work provides researchers with valuable data, reliable tools, and insightful findings to support and guide future research on semantic log parsing.
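The toy example below contrasts template-only parsing with semantic parsing on a single invented log line: the first regex only recovers the static template with anonymous wildcards, while the second also names the role of each dynamic parameter; the line and labels are illustrative and do not reflect LogBase's annotation schema.

# Toy contrast between template-only log parsing and semantic log parsing.
# The log line, template, and parameter labels are invented for illustration;
# LogBase's annotations and GenLog's LLM-based pipeline are far richer.
import re

log = "Connection from 10.0.0.5:443 closed after 120 ms (user=alice)"

# A classic parser only recovers the static template with anonymous wildcards.
template = re.sub(r"\d+(?:\.\d+)*(?::\d+)?|(?<==)\w+", "<*>", log)

# A semantic parser additionally names the role of each dynamic parameter.
semantic = re.match(
    r"Connection from (?P<src_ip>[\d.]+):(?P<src_port>\d+) closed after "
    r"(?P<duration_ms>\d+) ms \(user=(?P<user>\w+)\)", log)

print("template  :", template)
print("parameters:", semantic.groupdict() if semantic else {})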
Publisher's Version
STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs
Jinwei Liu,
Chao Li,
Rui Chen,
Shaofeng Li,
Bin Gu, and
Mengfei Yang
(Xidian University, China; Beijing Institute of Control Engineering, China; Beijing Sunwise Information Technology, China; China Academy of Space Technology, China)
Unit testing plays a crucial role in bug detection and ensuring software correctness. It helps developers identify errors early in development, thereby reducing software defects. In recent years, large language models (LLMs) have demonstrated significant potential in automating unit test generation. However, using LLMs to generate unit tests faces many challenges. 1) The execution pass rate of the test cases generated by LLMs is low. 2) The test case coverage is inadequate, making it challenging to detect potential risks in the code. 3) Current research methods primarily focus on languages such as Java and Python, while studies on C programming are scarce, despite its importance in the real world. To address these challenges, we propose STRUT, a novel unit test generation method. STRUT utilizes structured test cases as a bridge between complex programming languages and LLMs. Instead of directly generating test code, STRUT guides LLMs to produce structured test cases, thereby alleviating the limitations of LLMs when generating code for programming languages with complex features. First, STRUT analyzes the context of focal methods and constructs structured seed test cases for them. These seed test cases then guide LLMs to generate a set of structured test cases. Subsequently, a rule-based approach is employed to convert the set of structured test cases into executable test code. We conducted a comprehensive evaluation of STRUT, which achieved an impressive execution pass rate of 96.01%, along with 77.67% line coverage and 63.60% branch coverage. This performance significantly surpasses that of the LLM-based baseline methods and the symbolic execution tool SunwiseAUnit. These results highlight STRUT's superior capability in generating high-quality unit test cases by leveraging the strengths of LLMs while addressing their inherent limitations.
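To illustrate the general idea of rendering a structured test case into executable test code with simple rules, the sketch below turns an invented JSON-like case description into a small C test harness; the schema, field names, and rendering rules are assumptions and differ from STRUT's actual format.

# Toy illustration of converting a structured test case into executable C test
# code with a rule-based renderer. The schema and the generated harness are
# assumptions for illustration; STRUT's structured format and rules differ.
STRUCTURED_CASE = {
    "focal_function": "int clamp(int value, int lo, int hi)",
    "name": "test_clamp_above_upper_bound",
    "arguments": {"value": 15, "lo": 0, "hi": 10},
    "expected_return": 10,
}

def render_c_test(case: dict) -> str:
    signature = case["focal_function"]
    fn = signature.split("(")[0].split()[-1]          # extract the function name
    args = ", ".join(str(v) for v in case["arguments"].values())
    return (
        "#include <assert.h>\n"
        f"{signature};\n\n"
        f"void {case['name']}(void) {{\n"
        f"    assert({fn}({args}) == {case['expected_return']});\n"
        f"}}\n"
    )

print(render_c_test(STRUCTURED_CASE))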
Publisher's Version
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
Xiaohan Yuan,
Jinfeng Li,
Dongxia Wang,
Yuefeng Chen,
Xiaofeng Mao,
Longtao Huang,
Jialuo Chen,
Hui Xue,
Xiaoxia Liu,
Wenhai Wang,
Kui Ren, and
Jingyi Wang
(Zhejiang University, China; Alibaba Group, China)
Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy to systematically reflect LLM content safety, as well as automated safety assessment techniques to efficiently explore the potential risks.
To bridge the striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM Mt and a novel safety critique LLM Mc. The expert testing LLM Mt is responsible for automatically generating test cases in accordance with the proposed risk taxonomy (including 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM Mc can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval differs in significant ways: (i) efficient – we construct a multi-dimensional and open-ended benchmark comprising 220,000 test cases across 102 risks utilizing Mt and conduct safety evaluations for 21 influential LLMs via Mc on our benchmark. The entire process is fully automated and requires no human involvement. (ii) effective – extensive validations show S-Eval facilitates a more thorough assessment and better perception of potential LLM risks, and Mc not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insights into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and accompanying new safety threats, test generation methods, and safety critique methods, thanks to the LLM-based architecture. We further study the impact of hyper-parameters and language environments on model safety, which may lead to promising directions for future research. S-Eval has been deployed at our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
Publisher's Version
Info
Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
Yanzhou Mu,
Juan Zhai,
Chunrong Fang,
Xiang Chen,
Zhixiang Cao,
Peiran Yang,
Kexin Zhao,
An Guo, and
Zhenyu Chen
(Nanjing University, China; Shenzhen Research Institute of Nanjing University, China; University of Massachusetts at Amherst, USA; Nantong University, China)
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leaving existing methods without suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) The bugs they detect concern the result consistency of single interfaces and are far from those exposed by multi-interface combinations or reflected in runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structural characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time. We evaluate the effectiveness of ModelMeta on three popular DL frameworks (i.e., MindSpore, PyTorch, and ONNX) with 17 DL models from ten real-world tasks ranging from image classification to object detection. Results demonstrate that ModelMeta outperforms state-of-the-art baselines in terms of test coverage and diversity of generated test inputs. Regarding bug detection, ModelMeta has identified 31 new bugs, of which 27 have been confirmed and 11 have been fixed. Among them are seven bugs that existing methods cannot detect, i.e., five wrong-resource-usage bugs and two low-efficiency bugs. These results demonstrate the practicality of our method.
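The following PyTorch sketch shows the flavor of a model-level metamorphic relation: wrapping a seed model with a branch whose contribution is exactly zero must leave the output unchanged, so an inconsistency would indicate a framework bug; the wrapper and seed model are invented, and ModelMeta's four MRs, QR-DQN guidance, and runtime-metric analysis are not reproduced here.

# Minimal model-level metamorphic relation in PyTorch (illustration only):
# adding a structurally new but provably zero branch must not change the
# output. ModelMeta's actual MRs and its QR-DQN-guided generation differ.
import torch
import torch.nn as nn

class ZeroBranchWrapper(nn.Module):
    """Equivalent variant: output = seed(x) + (e - e), which equals seed(x)."""
    def __init__(self, seed: nn.Module, in_features: int):
        super().__init__()
        self.seed = seed
        self.extra = nn.Linear(in_features, 1)

    def forward(self, x):
        e = self.extra(x)
        return self.seed(x) + (e - e)      # extra structure, exactly zero contribution

seed_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
variant = ZeroBranchWrapper(seed_model, in_features=8)

x = torch.randn(32, 8)
with torch.no_grad():
    assert torch.allclose(seed_model(x), variant(x), atol=1e-6), "MR violated"
print("metamorphic relation holds on this input")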
Publisher's Version
Hulk: Exploring Data-Sensitive Performance Anomalies in DBMSs via Data-Driven Analysis
Zhiyong Wu,
Jie Liang,
Jingzhou Fu,
Mingzhe Wang, and
Yu Jiang
(Tsinghua University, China; Beihang University, China)
Performance is crucial for database management systems (DBMSs), and they are always designed to handle ever-changing workloads efficiently. However, the complexity of the cost-based optimizer (CBO) and its interactions can introduce implementation errors, leading to data-sensitive performance anomalies. These anomalies may cause significant performance degradation compared to the expected design under certain datasets. To diagnose performance issues, DBMS developers often rely on intuitions or compare execution times to a baseline DBMS. These approaches overlook the impact of datasets on performance. As a result, only a subset of performance issues is identified and resolved.
In this paper, we propose Hulk to automatically explore these data-sensitive performance anomalies via data-driven analysis. The key idea is to identify performance anomalies as the dataset evolves. Specifically, Hulk estimates a reasonable response time range for each data volume to pinpoint performance cliffs. Then, performance cliffs are checked for deviations from expected performance by finding a reasonable plan that aligns with performance expectations. We evaluate Hulk on six widely used DBMSs, namely MySQL, MariaDB, Percona, TiDB, PostgreSQL, and AntDB. In total, Hulk reports 135 anomalies, of which 129 have been confirmed as new bugs, including 14 CVEs. Among them, 94 are data-sensitive performance anomalies.
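As a toy illustration of pinpointing a performance cliff, the sketch below fits a linear latency-versus-volume trend and flags volumes whose measured latency falls far outside the expected range; the numbers are invented, and Hulk's actual range estimation and plan-based validation are considerably more involved.

# Toy sketch of spotting a data-sensitive performance cliff: fit a simple
# linear trend of latency vs. data volume and flag volumes whose measured
# latency deviates far beyond the expected range. The numbers are invented;
# Hulk's range estimation and plan-level root-causing are more sophisticated.
from statistics import linear_regression  # Python 3.10+

volumes    = [10_000, 20_000, 40_000, 80_000, 160_000]   # rows in the table
latency_ms = [12.0,   23.5,   47.0,   520.0,  190.0]      # measured query times

slope, intercept = linear_regression(volumes, latency_ms)

for v, t in zip(volumes, latency_ms):
    expected = slope * v + intercept
    if t > 2 * max(expected, 1.0):                        # crude tolerance band
        print(f"performance cliff at {v} rows: {t} ms vs ~{expected:.1f} ms expected")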
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
Type-Alias Analysis: Enabling LLVM IR with Accurate Types
Jinmeng Zhou,
Ziyue Pan,
Wenbo Shen,
Xingkai Wang,
Kangjie Lu, and
Zhiyun Qian
(Zhejiang University, China; University of Minnesota, USA; University of California at Riverside, USA)
LLVM Intermediate Representation (IR) underpins the LLVM compiler infrastructure, offering a strong type system and a static single-assignment (SSA) form that are well-suited for program analysis. However, its single-type design assigns exactly one type to each IR variable, even when the variable may legitimately correspond to multiple types. The recent introduction of opaque pointers exacerbates this limitation: all pointers in the IR are uniformly represented with a generic pointer type (ptr) that erases concrete pointee type information, making many type-based analyses ineffective.
To address the limitations of single-type design, we introduce type-alias analysis, a multiple-type design that maintains type-alias sets for IR variables and infers types across IR instructions. We have developed TypeCopilot, a prototype that recovers concrete pointee types for opaque-pointer-enabled LLVM IR generated from C programs. TypeCopilot achieves 98.57% accuracy with 94.98% coverage, allowing existing analysis tools to retain their effectiveness despite the adoption of opaque pointers. To foster further research and security applications, we have open-sourced TypeCopilot, providing the community with a practical foundation for precise, type-aware security analyses on modern LLVM IR.
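The following toy Python sketch conveys the multiple-type idea on an invented mini-IR: every value carries a set of candidate pointee types, and instructions merge or extend these sets; it is only a conceptual illustration and does not operate on real LLVM IR as TypeCopilot does.

# Toy illustration of the multiple-type idea: each IR value keeps a *set* of
# candidate pointee types, and instructions propagate or merge these sets.
# The mini-IR below is invented; TypeCopilot works on real opaque-pointer LLVM IR.
from collections import defaultdict

alias: dict[str, set[str]] = defaultdict(set)   # value -> candidate pointee types

# A made-up instruction stream: (opcode, destination, operand/annotation)
program = [
    ("alloca",   "%a", "struct.foo"),           # %a points to a struct.foo
    ("bitcast",  "%b", ("%a", "struct.bar")),   # reinterpreted: %b may be foo OR bar
    ("phi",      "%c", ("%a", "%b")),           # merge point: union of incoming sets
    ("load_ptr", "%d", "%c"),                   # pointer load keeps the alias set
]

for op, dst, arg in program:
    if op == "alloca":
        alias[dst] = {arg}
    elif op == "bitcast":
        src, new_ty = arg
        alias[dst] = alias[src] | {new_ty}
    elif op == "phi":
        alias[dst] = set().union(*(alias[v] for v in arg))
    elif op == "load_ptr":
        alias[dst] = set(alias[arg])

for value, types in alias.items():
    print(value, "may point to", sorted(types))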
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional
Automated Test Transfer across Android Apps using Large Language Models
Benyamin Beyzaei,
Saghar Talebipour,
Ghazal Rafiei,
Nenad Medvidović, and
Sam Malek
(University of California at Irvine, USA; University of Southern California, USA)
The pervasiveness of mobile apps in everyday life necessitates robust testing strategies to ensure quality and efficiency, especially through end-to-end usage-based tests for mobile apps' user interfaces (UIs). However, manually creating and maintaining such tests can be costly for developers. Since many apps share similar functionalities beneath diverse UIs, previous works have shown the possibility of transferring UI tests across different apps within the same domain, thereby eliminating the need for writing the tests manually. However, these methods have struggled to accommodate real-world variations, often facing limitations in scenarios where source and target apps are not very similar or fail to accurately transfer test oracles. This paper introduces an innovative technique, LLMigrate, which leverages Large Language Models (LLMs) to efficiently transfer usage-based UI tests across mobile apps.
Our experimental evaluation shows LLMigrate can achieve a 97.5% success rate in automated test transfer, reducing the manual effort required to write tests from scratch by 91.1%. This represents an improvement of 9.1% in success rate and 38.2% in effort reduction compared to the best-performing prior technique, setting a new benchmark for automated test transfer.
Publisher's Version
Info
Incremental Verification of Concurrent Programs through Refinement Constraint Adaptation
Liangze Yin,
Yiwei Li,
Kun Chen,
Wei Dong, and
Ji Wang
(National University of Defense Technology, China)
Programs evolve continuously throughout their life cycles. Verifying each version from scratch is usually impractical, especially for concurrent programs, so efficient incremental verification techniques for concurrent programs are highly desired. We focus on the abstraction refinement technique for concurrent program verification: when a program is modified, the refinement constraints generated in the verification of previous versions are adapted to the new program to avoid redundant analysis. We propose a kernel-source-based refinement constraint adaptation approach for the scheduling constraint based abstraction refinement method, one of the most efficient abstraction refinement methods for concurrent program verification. Our method supports all kinds of program modifications and generates adapted refinement constraints according to the modifications. Evaluation on benchmarks from SV-COMP 2024 and Nidhugg shows promising results: in our experiments, most of the refinement constraints generated in the verification of previous versions can be adapted to the modified program, and compared with verifying the modified program from scratch, our incremental verification method achieves two orders of magnitude speedup on complex programs.
Publisher's Version
Fixing Outside the Box: Uncovering Tactics for Open-Source Security Issue Management
Lyuye Zhang,
Jiahui Wu,
Chengwei Liu,
Kaixuan Li,
Xiaoyu Sun,
Lida Zhao,
Chong Wang, and
Yang Liu
(Nanyang Technological University, Singapore; East China Normal University, China; Australian National University, Australia)
In the rapidly evolving landscape of software development, addressing security vulnerabilities in open-source software (OSS) has become critically important. Existing research and tools from both academia and industry have mainly relied on limited solutions, such as vulnerable version adjustment and adopting patches, to handle identified vulnerabilities. However, far more flexible and diverse countermeasures have been actively adopted in the open-source communities. A holistic empirical study is needed to explore the prevalence, distribution, preferences, and effectiveness of these diverse strategies.
To this end, in this paper, we conduct a comprehensive study on the taxonomy of vulnerability remediation tactics (RTs) in OSS projects and investigate their pros and cons. The study addresses this oversight through a comprehensive empirical analysis of 21,187 issues from GitHub, aiming to understand the range and efficacy of remediation tactics within the OSS community. We developed a hierarchical taxonomy of 44 distinct RTs and evaluated their effectiveness and costs. Our findings highlight a significant reliance on community-driven strategies, like using alternative libraries and bypassing vulnerabilities, 44% of which are currently unsupported by cutting-edge tools. Additionally, this research exposes the community’s preferences for certain fixing approaches by analyzing their acceptance and the reasons for rejection. It also underscores a critical gap in modern vulnerability databases, where 54% of CVEs lack fixing suggestions—a gap that can be significantly mitigated by leveraging the 93% of actionable solutions provided through GitHub issues.
Publisher's Version
Intention-Based GUI Test Migration for Mobile Apps using Large Language Models
Shaoheng Cao,
Minxue Pan,
Yuanhong Lan, and
Xuandong Li
(Nanjing University, China)
Graphical User Interface (GUI) testing is one of the primary quality assurance methods for mobile apps. Manually constructing high-quality test cases for GUI testing is costly and labor-intensive, leading to the development of various automated approaches that migrate test cases from a source app to a target app. Existing approaches predominantly treat this test migration task as a widget-matching problem, which performs well when the interaction logic between apps remains consistent. However, they struggle with variations in interaction logic for specific functionalities, a common scenario across different apps. To address this limitation, a novel approach named ITeM is introduced in this paper for the test migration task. Unlike existing works that model the problem as a widget-matching task, ITeM seeks a novel pathway by adopting a two-stage framework with the comprehension and reasoning capability of Large Language Models: first, a transition-aware mechanism for generating test intentions; and second, a dynamic reasoning-based mechanism for fulfilling these intentions. This approach maintains effectiveness regardless of variations across the source and target apps' interaction logic. Experimental results on 35 real-world Android apps across 280 test migration tasks demonstrate the superior effectiveness and efficiency of ITeM compared to state-of-the-art approaches.
Publisher's Version
GoPV: Detecting Blocking Concurrency Bugs Related to Shared-Memory Synchronization in Go
Wei Song,
Xiaofan Xu, and
Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Go is a popular concurrent programming language that employs both message-passing and shared-memory synchronization primitives for interaction between different threads, known as goroutines. However, the misuse of these synchronization primitives can easily lead to blocking concurrency bugs, including deadlocks and goroutine leaks. While blocking concurrency bugs related to message passing have received increasing attention, little work focuses on the blocking concurrency bugs caused by the misuse of shared-memory synchronization primitives. In this paper, we present GoPV, a static analyzer and open-source tool that performs concurrency analysis and (post-)dominator analysis to detect blocking concurrency bugs by ascertaining whether synchronization primitives are misused. We evaluate GoPV on eight benchmark programs and 21 large real-world Go projects. The experimental results demonstrate that GoPV not only successfully detects all blocking concurrency bugs related to shared-memory synchronization in the eight benchmark programs, but also discovers 17 such bugs in the 21 large Go applications within 2.78 hours.
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Reusable
More Effective JavaScript Breaking Change Detection via Dynamic Object Relation Graph
Dezhen Kong,
Jiakun Liu,
Chao Ni,
David Lo, and
Lingfeng Bao
(Zhejiang University, China; Singapore Management University, Singapore)
JavaScript libraries are characterized by their widespread use, frequent code changes, and a high tolerance for backward incompatible changes. Awareness of such breaking changes can help developers adapt to version updates and avoid negative impacts. Several tools have been designed for, or can be used for, breaking change detection in the JavaScript community. However, these tools detect breaking changes in different ways, and there are currently no systematic reviews of these approaches. From a preliminary study on popular JavaScript libraries, we find that existing approaches, including simple regression testing, model-based testing, and type differencing, cannot detect many breaking changes yet produce plenty of false positives. We discuss the reasons for missing breaking changes and producing false positives.
Based on the insights from our findings, we propose a new approach named Diagnose that iteratively constructs an object relation graph based on API exploration and forced execution-based type analysis. Diagnose then refines the graphs and reconstructs them in the newer versions of the libraries to detect breaking changes. By evaluating our approach on the same set of libraries used in the empirical study, we find that Diagnose detects many more breaking changes (60.2%) and produces fewer false positives. Therefore, Diagnose is suitable for practical use.
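At a very high level, the breaking-change check can be pictured as diffing two API relation graphs, as in the sketch below, which flags removed or re-typed members between an old and a new version; the library, members, and type strings are invented, and Diagnose's dynamically constructed object relation graphs are far richer.

# Highly simplified sketch of detecting breaking changes by diffing two API
# relation graphs (member -> inferred type) built from exploring a library in
# its old and new versions. The members and types below are invented; the
# paper's object relation graphs are built dynamically from JavaScript code.
old_graph = {
    "parse":           "function(str) -> object",
    "parse.strict":    "boolean",
    "stringify":       "function(object) -> str",
}
new_graph = {
    "parse":           "function(str, opts) -> object",   # signature changed
    "stringify":       "function(object) -> str",
    "stringifyPretty": "function(object) -> str",         # addition: not breaking
}

removed = old_graph.keys() - new_graph.keys()
changed = {m for m in old_graph.keys() & new_graph.keys()
           if old_graph[m] != new_graph[m]}

for member in sorted(removed):
    print(f"breaking: '{member}' was removed")
for member in sorted(changed):
    print(f"breaking: '{member}' changed from {old_graph[member]!r} to {new_graph[member]!r}")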
Publisher's Version
SWE-GPT: A Process-Centric Language Model for Automated Software Improvement
Yingwei Ma,
Rongyu Cao,
Yongchang Cao,
Yue Zhang,
Jue Chen,
Yibo Liu,
Yuchen Liu,
Binhua Li,
Fei Huang, and
Yongbin Li
(Alibaba Group, China)
Large language models (LLMs) have demonstrated remarkable performance in code generation, significantly enhancing the coding efficiency of developers. Recent advancements in LLM-based agents have led to significant progress in end-to-end automatic software engineering (ASE), particularly in software maintenance (e.g., fixing software issues) and evolution (e.g., adding new features). Despite these encouraging advances, current research faces two major challenges. First, state-of-the-art performance primarily depends on closed-source models like GPT-4, which significantly limits the technology’s accessibility and potential for customization in diverse software engineering tasks. This dependence also raises concerns about data privacy, particularly when handling sensitive codebases. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. Consequently, they may face challenges in navigating complex project structures and generating contextually relevant solutions, which can affect their practical utility in real-world scenarios.
To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers’ thought processes, utilization of external tools, and the interaction between different functional personnel. Our objective is to develop an open-source large language model specifically optimized for software improvement, aiming to match the performance of closed-source alternatives while offering greater accessibility and customization potential. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and Lingma SWE-GPT 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in the software development process—such as repository understanding, fault localization, and patch generation—thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using the SWE-bench-Verified benchmark (comprising 500 real GitHub issues), recently proposed by OpenAI. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (22.76% relative improvement compared to Llama 3.1 405B) and approaching the performance of closed-source models (31.80% of issues resolved by GPT-4o). Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, surpassing the 17.20% resolution rate of Llama 3.1 70B and highlighting the potential for applying smaller models to ASE tasks.
Publisher's Version
ACM SIGSOFT Distinguished Paper Award
ICEPRE: ICS Protocol Reverse Engineering via Data-Driven Concolic Execution
Yibo Qu,
Dongliang Fang,
Zhen Wang,
Jiaxing Cheng,
Shuaizong Si,
Yongle Chen, and
Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Taiyuan University of Technology, China)
With the advancement of digital transformation, Industrial Control Systems (ICS) are becoming increasingly open and intelligent. However, inherent vulnerabilities in ICS protocols pose significant security threats to devices and systems, and the proprietary nature of these protocols complicates security analysis and the deployment of protective mechanisms for ICS. Protocol reverse engineering aims to infer the syntax, semantics, and state machines of protocols in the absence of official specifications, but traditional protocol reverse engineering tools face considerable limitations due to the lack of executable environments, incomplete inference strategies, and low-quality network traffic. In this paper, we present ICEPRE, a novel data-driven protocol reverse engineering method based on concolic execution, which uniquely integrates network traces with static analysis. Unlike conventional methods that rely on executable environments, ICEPRE statically tracks the program's parsing process for specific input messages. Furthermore, we employ an innovative field boundary inference strategy that infers the protocol's syntax by analyzing how the protocol parser handles different fields. Our evaluation demonstrates that ICEPRE significantly outperforms previous protocol reverse engineering tools in field boundary inference, achieving an F1 score of 0.76 and a perfection score of 0.67, while DynPRE, BinaryInferno, Nemeys, and Netzob yield (0.65, 0.35), (0.42, 0.14), (0.39, 0.09), and (0.27, 0.10), respectively. These results underscore the superior overall performance of our method. Additionally, ICEPRE exhibits exceptional performance on proprietary protocols in real-world scenarios, highlighting its practical applicability in downstream applications.
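For readers unfamiliar with the reported metric, the short example below computes precision, recall, and F1 over field boundaries, treating a boundary as a byte offset where one field ends and the next begins; the message layout and tool outputs are invented, and the exact matching criterion used in the paper's evaluation may differ.

# Worked example of a field-boundary F1 computation (invented data): a boundary
# is a byte offset where one protocol field ends and the next begins. The exact
# matching criterion used in the paper may differ from this simple set overlap.
true_boundaries     = {2, 4, 8, 12}      # ground-truth split points (byte offsets)
inferred_boundaries = {2, 4, 9, 12, 14}  # what a reverse-engineering tool reported

tp = len(true_boundaries & inferred_boundaries)
precision = tp / len(inferred_boundaries)
recall    = tp / len(true_boundaries)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")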
Publisher's Version
Gamifying Testing in IntelliJ: A Replicability Study
Philipp Straubinger,
Tommaso Fulcini,
Giacomo Garaccione,
Luca Ardito, and
Gordon Fraser
(University of Passau, Germany; Politecnico di Torino, Italy)
Gamification is an emerging technique to enhance motivation and performance in traditionally unengaging tasks like software testing. Previous studies have indicated that gamified systems have the potential to improve software testing processes by providing testers with achievements and feedback. However, further evidence of these benefits across different environments, programming languages, and participant groups is required. This paper aims to replicate and validate the effects of IntelliGame, a gamification plugin for IntelliJ IDEA that engages developers in writing and executing tests. The objective is to generalize the benefits observed in earlier studies to new contexts, i.e., the TypeScript programming language and a larger participant pool. The replicability study consists of a controlled experiment with 174 participants, divided into two groups: one using IntelliGame and one with no gamification plugin. The study employed a two-group experimental design to compare testing behavior, coverage, mutation scores, and participant feedback between the groups. Data was collected through test metrics and participant surveys, and statistical analysis was performed to assess statistical significance. Participants using IntelliGame showed higher engagement and productivity in testing practices than the control group, evidenced by the creation of more tests, increased frequency of executions, and enhanced utilization of testing tools. This ultimately led to better code implementations, highlighting the effectiveness of gamification in improving functional outcomes and motivating users in their testing endeavors. The replication study confirms that gamification, through IntelliGame, positively impacts software testing behavior and developer engagement in coding tasks. These findings suggest that integrating game elements into the testing environment can be an effective strategy to improve software testing practices.
Publisher's Version
CrossProbe: LLM-Empowered Cross-Project Bug Detection for Deep Learning Frameworks
Hao Guan,
Guangdong Bai, and
Yepang Liu
(University of Queensland, Australia; Southern University of Science and Technology, China)
Deep Learning (DL) models may introduce reliability challenges in the underlying DL frameworks. These frameworks may be prone to bugs that can lead to crashes or incorrect results, particularly when involving complex model architectures and substantial computational demands. Such framework bugs can disrupt DL applications, impacting customer experience and potentially causing financial losses. Traditional approaches to testing DL frameworks face limitations in adapting to the vast search space of model structures, diverse APIs, and the complexity of hybrid programming and hardware environments. Recent advancements using Large Language Models (LLMs) have improved DL framework fuzzing, but their efficacy depends heavily on the quality and diversity of input prompts, which are often constructed using single-framework data.
In this paper, we propose an innovative approach for enhancing test generation for DL frameworks by leveraging “mirroring issues”—analogous bugs identified across different frameworks with common functionalities. Our approach is inspired by the fact that DL frameworks, such as PyTorch and TensorFlow, often share common bugs due to dependencies, developer errors, or edge-case inputs. We develop CrossProbe, which utilizes LLMs to effectively learn from existing issues of one framework and transfer the acquired knowledge to generate test cases for finding mirroring issues in another framework, thus enabling cross-framework bug detection. To overcome the challenges of test case generation arising from the incompatible functionalities and different implementations between frameworks, we introduce three processes: alignment, screening, and distinction. These processes help mitigate transfer errors by establishing API pair databases, filtering unsuitable cases, and highlighting cross-framework distinctions. Experiments demonstrate that CrossProbe is efficient, saving 36.3% of generation iterations, and achieves a 25.0% higher success rate in issue transfer compared to existing state-of-the-art LLM-based testing techniques. CrossProbe detects 24 unique bugs using its transferred knowledge; 19 of them were previously unknown, and each requires cross-framework deep learning knowledge to identify.
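The sketch below caricatures the alignment step with a hand-written API-pair table that re-targets a PyTorch-triggering snippet to TensorFlow; the pairs and the snippet are invented, and CrossProbe performs alignment, screening, and distinction with LLMs rather than literal string substitution.

# Minimal sketch of the "alignment" idea: a small API-pair database maps an
# operator in the framework where an issue was observed to its counterpart in
# the target framework, so a triggering snippet can be re-targeted. The pairs
# and the template are invented; CrossProbe uses LLMs, not a lookup table.
API_PAIRS = {
    # PyTorch API                  TensorFlow counterpart (illustrative pairing)
    "torch.nn.functional.relu":  "tf.nn.relu",
    "torch.matmul":              "tf.linalg.matmul",
    "torch.float16":             "tf.float16",
}

pytorch_snippet = "y = torch.matmul(torch.nn.functional.relu(a), b)"

tf_snippet = pytorch_snippet
for src_api, dst_api in API_PAIRS.items():
    tf_snippet = tf_snippet.replace(src_api, dst_api)

print("source test :", pytorch_snippet)
print("transferred :", tf_snippet)   # still needs screening before execution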
Publisher's Version
Published Artifact
Artifacts Available
Artifacts Functional