Proceedings of the ACM on Software Engineering, Volume 3, Number FSE
Papers A
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-Breaking
Yifan Huang,
Xiaojun Jia,
Wenbo Guo,
Yuqiang Sun,
Yihao Huang,
Chong Wang, and
Yang Liu
(Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. This democratization of software development has significantly lowered the barriers to entry for complex programming tasks. However, the same accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software, including malware, ransomware, and other security threats. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target and insufficient technical expertise to evaluate whether generated outputs align with specified malicious objectives. To address this gap, we propose SPELL, a comprehensive testing framework for LLM developers and the Secure Team, specifically designed to evaluate the weakness of security alignment in malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%, including four instances flagged as extremely dangerous by all detection tools. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.
Article: fse26maina-p10-p
Still Manual? Automated Linter Configuration via DSL-Based LLM Compilation of Coding Standards
Zejun Zhang,
Yixin Gan,
Zhenchang Xing,
Tian Zhang,
Yi Li,
Qinghua Lu,
Sherry (Xiwei) Xu, and
Liming Zhu
(Nanjing University, China; CSIRO's Data61, Australia; Nanyang Technological University, Singapore)
Coding standards are essential for maintaining consistent and high-quality code across teams and projects. Linters help developers enforce these standards by detecting code violations. However, manual linter configuration is complex and expertise-intensive, and the diversity and evolution of programming languages, coding standards, and linters lead to repetitive and maintenance-intensive configuration work. To reduce manual effort, we propose LintCFG, a domain-specific language (DSL)-driven, LLM-based compilation approach to automate linter configuration generation for coding standards, independent of programming languages, coding standards, and linters. Inspired by compiler design, we first design a DSL to express coding rules in a tool-agnostic, structured, readable, and precise manner. Then, we build linter configurations into DSL configuration instructions. For a given natural language coding standard, the compilation process parses it into DSL coding standards, matches them with the DSL configuration instructions to set configuration names, option names and values, verifies consistency between the standards and configurations, and finally generates linter-specific configurations. Experiments with Checkstyle for Java coding standards show that our approach achieves over 90% precision and recall in DSL representation, with accuracy, precision, recall, and F1-scores close to 70% (with some exceeding 70%) in fine-grained linter configuration generation. Notably, our approach outperforms baselines by over 100% in precision. An ablation study confirms the effectiveness of the main components of our approach. A user study further shows that our approach improves developers’ efficiency in configuring linters for coding standards. Finally, we demonstrate the generality of the approach by generating ESLint configurations for JavaScript coding standards, showcasing its broad applicability across other programming languages, coding standards, and linters.
We developed a lightweight, general-purpose AI skill, which is publicly available on GitHub.
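The matching and generation steps described above can be pictured with a small sketch. It is not LintCFG's DSL or pipeline: the `Rule` class, the `INSTRUCTIONS` table, and the rule kinds are hypothetical names introduced only for illustration. `LineLength`/`max` and `Indentation`/`basicOffset` are real Checkstyle module and property names, and the module placement follows recent Checkstyle versions.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass(frozen=True)
class Rule:
    """Hypothetical structured form of one coding rule (not the paper's DSL)."""
    kind: str
    value: int


# Hypothetical instruction table: rule kind -> (module, property, under TreeWalker?)
INSTRUCTIONS = {
    "max_line_length": ("LineLength", "max", False),
    "indent_size": ("Indentation", "basicOffset", True),
}


def _module(name, prop, value):
    """Render one Checkstyle module element with a single property."""
    return (f"<module name={quoteattr(name)}>"
            f"<property name={quoteattr(prop)} value={quoteattr(str(value))}/>"
            f"</module>")


def compile_rules(rules):
    """Match each rule to an instruction and emit a Checkstyle-style XML string."""
    top, walker = [], []
    for rule in rules:
        if rule.kind not in INSTRUCTIONS:
            raise ValueError(f"no configuration instruction for {rule.kind!r}")
        name, prop, in_walker = INSTRUCTIONS[rule.kind]
        (walker if in_walker else top).append(_module(name, prop, rule.value))
    if walker:
        top.append('<module name="TreeWalker">' + "".join(walker) + "</module>")
    return '<module name="Checker">' + "".join(top) + "</module>"


CONFIG = compile_rules([Rule("max_line_length", 100), Rule("indent_size", 4)])
```

Keeping the rule form separate from the instruction table means a second linter would only need a second table, which is roughly the tool-agnostic property the abstract claims for the DSL.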
Article: fse26maina-p27-p
GadgetHunter: Region-Based Neuro-symbolic Detection of Java Deserialization Vulnerabilities
Kaixuan Li,
Jian Zhang,
Chong Wang,
Sen Chen,
Zong Cao,
Min Zhang, and
Yang Liu
(Nanyang Technological University, Singapore; Beihang University, China; Nankai University, China; Imperial Global Singapore of Imperial College London, Singapore; East China Normal University, China)
Java deserialization vulnerabilities (JDVs) enable attackers to execute arbitrary code by crafting malicious serialized objects that trigger sequences of method calls (gadget chains) leading to dangerous operations. Existing detection approaches face a fundamental trade-off: static analysis achieves scalability but suffers from high false positives due to infeasible paths and imprecision with dynamic features like reflection; dynamic validation reduces false positives but incurs prohibitive costs and fails to explore deep exploitation chains.
We present GadgetHunter, a neuro-symbolic JDV detector that combines scalable static analysis with targeted LLM reasoning and JDV exploitation-oriented constraint solving. Our approach partitions gadget chains into regions based on analyzability: statically resolvable segments are processed via interprocedural taint analysis, while dynamic boundaries are delegated to LLMs for semantic validation. We then extract critical constraints from each gadget and compose them into SMT formulas to determine chain feasibility through satisfiability solving. Evaluation on the ysoserial benchmark demonstrates that GadgetHunter reduces false negatives by up to 32% and false positives by 12-85% compared to state-of-the-art tools, while discovering 197 previously unknown gadget chains and rediscovering 4 recent CVEs. Our results show that combining symbolic reasoning with semantic understanding achieves both precision and practical impact in vulnerability detection.
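The region partitioning can be illustrated with a toy sketch. It assumes each call in a chain already carries a flag saying whether it is statically resolvable; how GadgetHunter computes that flag is not stated in the abstract. The method names and flags below are placeholders, and the sketch only groups consecutive calls into regions without running any taint analysis, LLM query, or SMT solving.

```python
from itertools import groupby


def partition_chain(chain):
    """Group consecutive calls by analyzability.

    chain: list of (method_name, statically_resolvable) pairs (hypothetical input).
    Returns a list of (region_kind, [method_name, ...]) in chain order.
    """
    regions = []
    for resolvable, group in groupby(chain, key=lambda call: call[1]):
        kind = "static" if resolvable else "dynamic"
        regions.append((kind, [name for name, _ in group]))
    return regions


# Placeholder chain: the reflective call is marked as not statically resolvable.
CHAIN = [
    ("A.readObject", True),
    ("B.hashCode", True),
    ("Method.invoke", False),
    ("C.run", True),
]
REGIONS = partition_chain(CHAIN)
```

In this picture the `static` regions would go to the taint analysis and the `dynamic` ones to the LLM for validation.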
Article: fse26maina-p42-p
SnakeCharmer: Automatic Fuzzing Harness Generation for Pure and Hybrid Python Libraries
Gabriel Sherman and
Stefan Nagy
(University of Utah, USA)
With Python’s rising popularity, ensuring the correctness of its ever-growing ecosystem of software libraries is more critical than ever. Recently, fuzzing has become a de facto technique for vetting software libraries, enabled via the use of harnesses: small wrapper programs that inject fuzzer-generated test cases into the library under test. While harness creation has shed its reliance on human expertise and is now fully automated for languages such as C and C++, Python remains uniquely challenging—both for pure Python libraries as well as hybrid ones combining Python with native C/C++ extensions—due to (1) limited visibility across language boundaries, (2) the absence of reliable bug oracles, and (3) incomplete type information. Consequently, attempts at automating harnessing for Python fail to both uphold critical runtime behaviors and produce the structured call and data flows needed for effective fuzzing, leaving much of today’s Python ecosystem largely unvetted.
To overcome these challenges and broaden fuzzing’s reach across Python libraries, this paper introduces SnakeCharmer: the first automated harness generation approach for both pure and hybrid Python libraries. At its core, SnakeCharmer leverages static analysis to first capture key interface information from both Python and native code components, subsequently enriching it with runtime-captured type information and exception behaviors. During fuzzing, SnakeCharmer further distinguishes between expected exceptions and true library bugs, filtering out benign exceptions that would otherwise derail testing progress. Together, these techniques significantly enhance the scope and effectiveness of fuzzing across the Python library ecosystem, enabling the automated discovery of bugs in code previously inaccessible to existing Python fuzzing efforts.
We evaluate SnakeCharmer alongside today’s leading Python auto-harnessing approach, PyRTFuzz; the actively fuzzed expert-written harnesses from both OSS-Fuzz and PolyFuzz; and the harnesses generated by Google’s own state-of-the-art LLM-driven automatic harnessing approach, OSS-Fuzz-Gen. Across 21 diverse Python libraries, SnakeCharmer attains type-recovery precision and exception-filtering accuracy of 95% and 97%, respectively, further attaining 1.48×, 1.87×, 1.78×, and 1.40× the code coverage of the fuzzing harnesses from PyRTFuzz, OSS-Fuzz, PolyFuzz, and OSS-Fuzz-Gen, respectively. Further, SnakeCharmer finds 16, 24, and 24 more Python library bugs than all expert- and LLM-created harnesses as well as PyRTFuzz, respectively—uncovering a total of 20 new bugs, with 18 since confirmed or fixed by developers.
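A harness that separates expected exceptions from likely bugs might look like the sketch below. It is not SnakeCharmer's generated code: the `EXPECTED` tuple is a hand-written assumption standing in for the exception behaviors the paper recovers at runtime, and `json.loads` is used only as a convenient standard-library target.

```python
import json

# Assumption: exceptions the target is documented to raise on malformed input.
EXPECTED = (ValueError, UnicodeDecodeError)


def harness(data: bytes, target=json.loads, expected=EXPECTED):
    """Feed one fuzzer-generated input to the target and classify the outcome."""
    try:
        target(data)
    except expected:
        return "benign"  # documented failure mode, not worth reporting
    except Exception as exc:
        return f"bug:{type(exc).__name__}"  # unexpected exception, report it
    return "ok"
```

A real fuzzer would call `harness` in a loop with mutated inputs. The point of the classification is that benign exceptions do not stop the campaign, while anything outside the expected set is surfaced as a candidate bug.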
Article: fse26maina-p67-p
UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning
Ye Fan,
Jidong Ge,
Chuanyi Li,
LiGuo Huang, and
Bin Luo
(Nanjing University, China; Southern Methodist University, USA)
While pre-trained models have achieved remarkable success in code search, their multilingual capabilities remain a major hurdle, plagued by data imbalance, cross-lingual semantic interference, and the loss of critical information from existing unified representations like Abstract Syntax Trees (ASTs) or Intermediate Representations (IRs). Furthermore, conventional contrastive learning strategies often rely on simplistic hard negative sampling while overlooking the potential of mining hard positives to learn code's intrinsic semantic invariance. To address these challenges, we introduce UNICS, a framework for multilingual code search built on a two-stage training strategy. In the first stage, UNICS is pre-trained on a novel dataset we constructed, which uses pseudo-code as a unified representation to learn a cross-lingual, algorithm-level logic that preserves full semantic fidelity. The second stage employs a multi-task transfer learning strategy that adapts this general knowledge to specific languages by decomposing code into semantic slices (e.g., API calls, function bodies) and incorporating tasks for hard positive mining and cross-lingual dynamic hard negative sampling. Experimental results demonstrate that UNICS achieves state-of-the-art performance across multiple multilingual and cross-lingual benchmarks, showcasing superior generalization and performance balance, especially in zero-shot transfer tasks to low-resource languages.
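The role of hard negatives in contrastive training can be seen in a small sketch of the widely used InfoNCE loss. The abstract does not give UNICS's exact objective, so this is a generic formulation under assumed inputs: plain Python lists as embeddings, cosine similarity, and a temperature of 0.07.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss: negative log-softmax score of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


QUERY, POSITIVE = [1.0, 0.0], [0.9, 0.1]
LOSS_HARD = info_nce(QUERY, POSITIVE, [[0.8, 0.2]])   # negative close to the query
LOSS_EASY = info_nce(QUERY, POSITIVE, [[-1.0, 0.0]])  # negative far from the query
```

A negative that lies close to the query yields a larger loss than a distant one, so it contributes more training signal. That is the usual motivation for mining hard negatives.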
Article: fse26maina-p88-p
Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair
Zaoyu Chen,
Haoran Qin,
Nuo Chen,
Xiangyu Zhao,
Lei Xue,
Xiapu Luo, and
Xiao-Ming Wu
(Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
Smart contracts, predominantly written in Solidity and executed on blockchains like Ethereum, are immutable, making functional correctness paramount: once deployed, bugs and vulnerabilities become permanent. Despite rapid progress in transformer-based code LLMs, existing evaluations of Solidity code completion rely heavily on surface-form metrics (e.g., BLEU, CrystalBLEU) or hand-grading, which poorly correlate with functional correctness. Unlike Python, Solidity lacks large-scale and execution-based benchmarks, hindering systematic assessment and optimization of LLMs for smart contract development.
To bridge this research gap, we introduce SolBench, a comprehensive benchmark and automated testing pipeline for Solidity, designed to emphasize functional correctness via differential fuzzing. SolBench contains 28,825 functions from 7,604 contracts collected from Etherscan (genesis to 2024), spanning 10 popular domains. We benchmark 14 diverse LLMs (open/closed, 1.3B to 671B parameters, general/code-specific, with/without reasoning). The dominant failure mode is missing crucial details (e.g., type definitions, state variables) in intra-contract context. Providing full-contract context mitigates this and improves code completion accuracy.
However, full-context inference can be prohibitively expensive in practice. Generating outputs with large context windows using state-of-the-art models often incurs significant costs, rendering naive context scaling economically impractical. Crucially, most of a contract is irrelevant to implementing a given function; only a small subset of details is needed. To exploit this, we propose Retrieval-Augmented Repair (RAR), which integrates retrieval into code repair: it uses the executor's error messages to extract only the most relevant snippets from the full contract. RAR sharply reduces input length for function completion, improving accuracy while significantly cutting computational cost. We further analyze retrieval and code repair strategies within RAR, showing substantial improvements in accuracy and efficiency. SolBench and our RAR framework enable principled evaluation and cost-effective improvement of Solidity code generation. Dataset and code are available at https://github.com/ZaoyuChen/SolBench.
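The retrieval idea behind RAR can be sketched as follows. This is an assumption-laden toy, not the paper's retriever: it treats every identifier-like token in an error message as a query term and returns the contract lines that mention any of them. The contract text and the error string are invented for illustration and do not reproduce real compiler output.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def retrieve_context(contract_source, error_message, max_lines=5):
    """Return up to max_lines contract lines sharing an identifier with the error."""
    tokens = set(IDENTIFIER.findall(error_message))
    hits = []
    for line in contract_source.splitlines():
        if tokens & set(IDENTIFIER.findall(line)):
            hits.append(line.strip())
    return hits[:max_lines]


# Invented example contract and error message.
CONTRACT = """contract Vault {
    address owner;
    uint256 total;
    mapping(address => uint256) balances;
    function deposit() public payable {}
}"""
ERROR = "Undeclared identifier: balances"
SNIPPETS = retrieve_context(CONTRACT, ERROR)
```

Only the retrieved lines would be added to the repair prompt, which is how the input stays far shorter than the full contract.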
Article: fse26maina-p90-p
MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Shuai Liang,
Pengfei Chen,
Bozhe Tian,
Gou Tan,
Maohong Xu,
Youjun Qu,
Yahui Zhao,
Yiduo Shang, and
Chongkang Tan
(Sun Yat-sen University, China; China Unicom Software Research Institute, China; Individual Researcher, China)
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
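The online phase (instantiate, weight, prune) can be pictured with a toy sketch. Everything here is an assumption made for illustration: the tuple encoding of the meta graph, the use of the smaller anomaly score as an edge weight, the 0.5 threshold, and the rule that a root-cause candidate is a surviving cause that is not itself an effect. The abstract does not specify MetaRCA's actual algorithm.

```python
def instantiate(meta_edges, types, deps):
    """Expand type-level causal edges over a concrete topology.

    meta_edges: ((cause_type, metric), (effect_type, metric)) pairs.
    types: instance name -> component type.
    deps: (cause_instance, effect_instance) pairs from the current topology.
    """
    edges = []
    for cause_inst, effect_inst in deps:
        for (c_type, c_metric), (e_type, e_metric) in meta_edges:
            if types[cause_inst] == c_type and types[effect_inst] == e_type:
                edges.append(((cause_inst, c_metric), (effect_inst, e_metric)))
    return edges


def prune(edges, anomaly, threshold=0.5):
    """Keep edges whose endpoints are both anomalous enough (assumed weighting)."""
    return [(c, e) for c, e in edges
            if min(anomaly.get(c, 0.0), anomaly.get(e, 0.0)) >= threshold]


def root_candidates(edges):
    """Nodes that cause something but are not explained by any surviving edge."""
    causes = {c for c, _ in edges}
    effects = {e for _, e in edges}
    return sorted(causes - effects)


META = [(("database", "latency"), ("service", "latency")),
        (("service", "latency"), ("service", "latency"))]
TYPES = {"db1": "database", "db2": "database", "cart": "service", "web": "service"}
DEPS = [("db1", "cart"), ("cart", "web"), ("db2", "web")]
ANOMALY = {("db1", "latency"): 0.9, ("cart", "latency"): 0.8,
           ("web", "latency"): 0.7, ("db2", "latency"): 0.1}
ROOTS = root_candidates(prune(instantiate(META, TYPES, DEPS), ANOMALY))
```

Because the meta graph is defined over component types rather than instances, the same `META` list could be reused on a different topology by changing only `TYPES` and `DEPS`.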
Preprint
Article: fse26maina-p149-p
Mitigating the Risk of Defects and Improving Knowledge Distribution with Code Reviewer Recommenders
Mohammadali Sefidi Esfahani and
Peter C. Rigby
(Concordia University, Canada)
Defects are inevitable in software projects, leading to increased maintenance costs, user dissatisfaction, and a diminished software reputation. Code review is one of the most critical software quality assurance activities that reduces software defects and improves software quality. Prior works have quantified the impact of reviewer recommenders on the risk of inducing new defects based on the highest level of expertise among the developers in the reviewer team. However, our analysis shows that prior work overestimates the safety of a change and ignores the defect-finding effectiveness of the diverse knowledge of reviewers. In this study, we incorporate the knowledge of the entire reviewer team into the author’s level of expertise and introduce the novel Contribution-aware Changeset Safety Ratio (CCSR) outcome to assess the impact of code reviewer recommenders on the risk of inducing defects more accurately.
When a pull request is risky, a natural mitigation is to add an expert reviewer. We are unaware of any works that have quantified the impact of adding a reviewer to risky PRs. We propose the novel AddExpertRec(𝐷𝑡) strategy that recommends an additional expert reviewer for defect-prone pull requests to reduce the likelihood of introducing new defects when the risk is above the threshold 𝐷𝑡. The simulation results show that AddExpertRec(𝐷𝑡) can enhance the defect-finding effectiveness of existing recommenders while still balancing reviewer workload and spreading knowledge to reduce the impact of turnover. Ultimately, our results give managers the ability to select a recommender strategy that best suits their project needs based on their resource constraints. The scripts and data are available in our replication package [14].
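The thresholding rule can be sketched in a few lines. The function below is a hypothetical reading of the strategy, not the authors' implementation: it assumes a precomputed risk score per pull request and a per-developer expertise score, and it adds the most expert developer who is not already reviewing.

```python
def add_expert_rec(reviewers, risk, d_t, expertise):
    """Return the reviewer list, extended by one expert when risk exceeds d_t.

    reviewers: names already recommended by some base recommender.
    risk: assumed defect-proneness score of the pull request.
    d_t: risk threshold (the D_t of the strategy name).
    expertise: developer name -> assumed expertise score for the changed files.
    """
    if risk <= d_t:
        return list(reviewers)
    candidates = [dev for dev in expertise if dev not in reviewers]
    if not candidates:
        return list(reviewers)
    expert = max(candidates, key=expertise.get)
    return list(reviewers) + [expert]


EXPERTISE = {"ann": 0.9, "bob": 0.7, "cy": 0.2}
RISKY = add_expert_rec(["ann"], risk=0.8, d_t=0.5, expertise=EXPERTISE)
SAFE = add_expert_rec(["ann"], risk=0.3, d_t=0.5, expertise=EXPERTISE)
```

Raising `d_t` adds experts to fewer pull requests, which is the lever for trading defect-finding effectiveness against reviewer workload.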
Article: fse26maina-p175-p
WalleTruth: Visual-Oriented Software Testing for Web3 Wallet Browser Extensions
Xiaohui Hu,
Ningyu He, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Hong Kong Polytechnic University, Hong Kong)
Serving as the first touch point for users to the cryptocurrency world, cryptocurrency wallets allow users to manage, receive, and transmit digital assets on blockchains and interact with emerging decentralized finance (DeFi) applications. Unfortunately, cryptocurrency wallets have always been the prime targets for attackers, and incidents of wallet breaches have been reported from time to time. Although some recent studies have characterized the vulnerabilities and scams related to wallets, they have mostly been studied at a coarse granularity, overlooking potential risks inherent in detailed designs of cryptocurrency wallets, especially from perspectives including user interaction and advanced features. To fill the void, in this paper, we present a fine-grained security analysis of browser-based cryptocurrency wallets. To pinpoint security issues in wallet components, we design WalleTruth, a visual-oriented testing framework specifically for browser-based wallet extensions. We have identified 12 attack vectors that can be abused by attackers to exploit cryptocurrency wallets and exposed 21 concrete attack strategies. By applying WalleTruth on 39 widely-adopted browser-based wallet extensions, we find that all of them can be abused to steal crypto assets from innocent users. Identified potential attack vectors were reported to developers in a timely manner and 26 issues have been patched already. This calls for urgent action from the community to mitigate threats related to cryptocurrency wallets.
Article: fse26maina-p176-p
Precondition Synthesis for Deep Neural Networks with Statistical Guarantees
Zengyu Liu,
Bai Xue,
Pengfei Yang, and
Ji Wang
(National University of Defense Technology, China; Institute of Software at Chinese Academy of Sciences, China; Institute of AI for Industries at Chinese Academy of Sciences, China)
Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems. However, existing formal verification methods provide limited quantitative guarantees for their reliable specification, and emerging precondition synthesis techniques are hindered by the scalability and architectural limitations of DNNs. In this paper, we propose a select-and-solve framework, StatPre, to automatically synthesize preconditions with statistical guarantees. StatPre aims to maximally weaken the synthesized preconditions while keeping them as accurate as possible to the real preconditions through a Box-based abstraction. The framework operates in two phases: the center selection phase, which identifies potential center points using a cluster-based heuristic with potential assessment, and the expansion solution phase, which solves the problem of optimizing maximal preconditions by employing statistical model approximation, equivalent constraint transformation, and automatic iterative execution. We evaluated StatPre on 15 models with 27 properties from 6 benchmarks and compared it with 4 existing deterministic and statistical schemes. The results demonstrate that StatPre effectively synthesizes preconditions with broader coverage while accurately approximating the real preconditions in practice. Additionally, StatPre exhibits competitive performance in handling high-dimensional, non-ReLU, complex-structured neural networks.
Artifacts Available
Article: fse26maina-p191-p
Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs during Code Adaptation
Tanghaoran Zhang,
Xinjun Mao,
Shangwen Wang,
Yuxin Zhao,
Yao Lu,
Zezhou Tang,
Wenyu Xu,
Longfei Sun,
Changrong Xie,
Kang Yang, and
Yue Yu
(National University of Defense Technology, China; Peng Cheng Laboratory, China)
Code adaptation is a fundamental but challenging task in software development, requiring developers to modify existing code for new contexts. A key challenge is to resolve Context Adaptation Bugs (CtxBugs), which occur when code correct in its original context violates constraints in the target environment. Unlike isolated bugs, CtxBugs cannot be resolved through local fixes and require cross-context reasoning to identify semantic mismatches. Overlooking them may lead to critical failures in adaptation. Although Large Language Models (LLMs) show great potential in automating code-related tasks, their ability to resolve CtxBugs remains a significant and unexplored obstacle to their practical use in code adaptation.
To bridge this gap, we propose CtxBugGen, a novel framework for generating CtxBugs to evaluate LLMs. Its core idea is to leverage LLMs’ tendency to generate plausible but context-free code when contextual constraints are absent. The framework generates CtxBugs through a four-step process to ensure their relevance and validity: (1) Selection of four established context-aware adaptation tasks from the literature, (2) Perturbation via task-specific rules to induce CtxBugs from LLMs while ensuring their plausibility, (3) Generation of candidate variants by prompting LLMs without any context constraints, and (4) Identification of valid CtxBugs through syntactic differencing and test execution in the target context. Based on the benchmark constructed by CtxBugGen, we conduct an empirical study with four state-of-the-art LLMs. Our results reveal their unsatisfactory performance in CtxBug resolution. The best performing LLM, Kimi-K2, achieves 55.93% on Pass@1 and resolves just 52.47% of CtxBugs. The presence of CtxBugs degrades LLMs’ adaptation performance by up to 30%. Failure analysis indicates that LLMs often overlook CtxBugs and replicate them in their outputs. This suggests that LLMs overly focus on the local code correctness of the reused code while ignoring its compatibility in the target context. Our study highlights a critical weakness in LLMs’ cross-context reasoning and emphasizes the need for new methods to enhance their context awareness for reliable code adaptation. The replication package for this paper is at https://github.com/ztwater/CtxBugGen.
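For readers unfamiliar with the Pass@1 metric quoted above, the sketch below shows the commonly used unbiased pass@k estimator. The abstract does not say how its Pass@1 was computed, so treating it as this estimator is an assumption. Here `n` is the number of samples generated per task, `c` the number that pass the tests, and `k` the sampling budget.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n passes, given c pass."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k = 1` the estimator reduces to the fraction of passing samples per task, averaged over tasks.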
Article: fse26maina-p224-p
TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud
Yitao Yang,
Yangtao Deng,
Yifan Xiong,
Baochun Li,
Hong Xu, and
Peng Cheng
(Chinese University of Hong Kong, Hong Kong; Microsoft Research, Canada; University of Toronto, Canada; Microsoft Research, USA)
AI workloads incur frequent failures and incidents from the underlying infrastructure. The current incident management workflow follows a provider-centric paradigm, where users report incidents to the infrastructure provider who then conducts troubleshooting. Due to the large number of incidents and the manual nature of the troubleshooting process, the provider often takes several days to resolve an incident, resulting in operational delays and productivity loss. To address these challenges, we present TSGuard, a user-centric multi-agent system that delivers immediate incident diagnosis to users who deploy the workloads. The core innovation of TSGuard is twofold: (1) constructing domain-specific knowledge bases by mining historical on-call experiences in the offline phase, and (2) mimicking human expert diagnosis via structured reasoning and iterative trial-and-error in the online phase. Evaluation using production incident records from Microsoft Azure demonstrates that TSGuard significantly outperforms state-of-the-art baselines, improving diagnostic accuracy by 19.8%. Furthermore, TSGuard reduces the average verification time by 63.4% compared to the sequential execution baseline.
Article: fse26maina-p227-p
TestTailor: Generating High-Coverage Tests via Path-Proximal Tests with LLMs
Xiaoxuan Zhou,
Yiling Lou,
Jinhao Dong, and
Dan Hao
(Peking University, China; Northeastern University, China; Fudan University, China; Peking University Shenzhen Graduate School, China)
Automated unit testing is essential for ensuring software quality. Achieving high code coverage through automated unit test generation remains challenging, especially for hard-to-cover branches guarded by complex or deeply nested conditions. Traditional search-based approaches often stagnate at fitness plateaus, while recent LLM-based techniques provide mostly coarse-grained prompts, leaving models to guess how to reach uncovered targets. To address these limitations, we present TestTailor, a neuro-symbolic framework that exploits fine-grained, path-oriented guidance to guide LLM-based test generation. The key idea is to exploit path-proximal tests (i.e., existing test cases whose execution paths closely resemble the target uncovered path) and to analyze their divergence points. By combining this analysis with symbolic constraints (i.e., constraints collected from the target uncovered path using symbolic execution), TestTailor derives actionable path guidance and encodes it into concise prompts that tell the LLM not only what remains uncovered, but also how to reach it. We evaluate TestTailor on the widely used CODAMOSA benchmark comprising 486 Python modules. Results show that TestTailor consistently outperforms state-of-the-art baselines, improving statement coverage by 5.01% and branch coverage by 4.17% on average compared to the best baseline CoverUp, while incurring only about 40% of CoverUp's API cost. Against the hybrid LLM-search-based technique CODAMOSA, TestTailor achieves even larger gains of 12.78% and 13.09% in statement and branch coverage, respectively. Moreover, TestTailor attains the highest coverage accuracy (85.2% vs. 75.3% for CoverUp and 63.8% for TELPA), and demonstrates robustness across different LLM backbones. These results highlight that TestTailor transforms vague coverage goals into precise path-level instructions, enabling LLMs to generate high-coverage test suites more efficiently and accurately.
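The notion of a path-proximal test and its divergence point can be sketched as below. The sketch makes two assumptions that the abstract does not confirm: execution paths are represented as lists of branch-outcome labels, and proximity is measured by the length of the common prefix with the target path. The labels and test names are invented.

```python
def divergence_index(path, target):
    """Index of the first position where path and target differ (common-prefix length)."""
    index = 0
    for taken, wanted in zip(path, target):
        if taken != wanted:
            break
        index += 1
    return index


def most_proximal(test_paths, target):
    """Pick the existing test whose path follows the target the longest.

    test_paths: test name -> list of branch-outcome labels (hypothetical encoding).
    Returns (test_name, divergence_index).
    """
    name = max(test_paths, key=lambda n: divergence_index(test_paths[n], target))
    return name, divergence_index(test_paths[name], target)


TARGET = ["b1:T", "b2:T", "b3:F"]
TESTS = {
    "test_empty": ["b1:F"],
    "test_small": ["b1:T", "b2:T", "b3:T"],
}
BEST, SPLIT = most_proximal(TESTS, TARGET)
```

The branch at the divergence point (`TARGET[SPLIT]`) is where guidance would be focused: the prompt could state which existing test comes closest and which condition must flip to reach the uncovered path.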
Article: fse26maina-p261-p
GraphQLify: Automated and Type Safety-Preserving GraphQL API Adoption
Saleh Amareen,
Arif Rahman,
Sazzadur Rahaman, and
Amiangshu Bosu
(Wayne State University, USA; University of Arizona, USA)
GraphQL provides a schema-based, strongly typed query language that enables highly efficient client-server communication. This paper introduces GraphQLify, an automated framework designed to migrate existing REST APIs to GraphQL. Unlike prior approaches that rely on relational databases, resource description frameworks (RDF), or machine-parsable specifications, GraphQLify leverages static source code analysis for precise type inference. This novel technique generates GraphQL schemas that guarantee end-to-end type safety, preserving a core advantage of adopting GraphQL. Furthermore, existing migration tools typically generate separate adapter servers, which introduce performance overhead via dynamic request binding and network latency. GraphQLify eliminates this by generating an embedded server that directly invokes the underlying API code, significantly improving performance. We evaluated GraphQLify on 834 APIs across nine popular open-source projects, where it successfully converted 100% of the APIs with zero type mismatches. In contrast, the current state-of-the-art tool, OASGraph, exhibited a 3.5% failure rate and a 42% type mismatch rate on the same dataset. Finally, our performance evaluation demonstrates that for workflows requiring five sequential API calls, clients using GraphQLify reduce data fetching time by a factor of 2 to 4 compared to their REST counterparts.
Article: fse26maina-p262-p
Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation
Tianyi Zhang,
Shidong Pan,
Zejun Zhang,
Zhenchang Xing, and
Xiaoyu Sun
(Australian National University, Australia; New York University, USA; Columbia University, USA; Nanyang Technological University, Singapore; CSIRO's Data61, Australia)
Infrastructure-as-Code (IaC) generation holds significant promise for automating the provisioning of cloud infrastructure. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving only a 20.8∼30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark consisting of 153 real-world scenarios across 58 unique services. Also, we propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen can make 54.6∼91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors can further boost the performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates on user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
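The iterative feedback mechanism can be sketched as a loop over staged checks. This is a schematic under assumptions, not IaCGen itself: `generate` stands in for an LLM call, each stage is a function returning an error string or `None`, and the stub generator and checks below are invented so that the loop can run without a model or a cloud account.

```python
def refine(generate, stages, max_iters=10):
    """Regenerate a template until every stage passes or the budget runs out.

    generate: callable taking the latest feedback (or None) and returning a template.
    stages: ordered (name, check) pairs; check returns an error string or None.
    Returns (template, iterations_used), or (None, max_iters) on failure.
    """
    feedback = None
    for iteration in range(1, max_iters + 1):
        template = generate(feedback)
        for name, check in stages:
            error = check(template)
            if error:
                feedback = f"{name}: {error}"  # fed back into the next attempt
                break
        else:
            return template, iteration
    return None, max_iters


def fake_generate(feedback):
    """Stub standing in for an LLM; it 'fixes' whatever stage failed last."""
    if feedback is None:
        return "Resources"
    if feedback.startswith("format"):
        return "Resources: {}"
    return "Resources: {Bucket: {}}"


STAGES = [
    ("format", lambda t: None if ":" in t else "not a mapping"),
    ("syntax", lambda t: None),
    ("deploy", lambda t: None if "Bucket" in t else "no resources defined"),
]
TEMPLATE, ITERATIONS = refine(fake_generate, STAGES)
```

A metric in the spirit of passItr@k would then count a scenario as solved when `ITERATIONS` is at most `k`. The exact definition used in the paper is not given in the abstract.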
Article: fse26maina-p296-p
Understanding Performance Problems in CUDA Programs
Yuyang Bi,
Junming Cao,
You Lu,
Bihuan Chen,
Tianwei Gan,
Dingji Wang, and
Xin Peng
(Fudan University, China)
With the wide adoption of GPUs, CUDA programming has become essential for leveraging GPU parallelism. However, its complex programming model poses challenges in performance optimization. Consequently, CUDA programs often suffer from performance problems. It is therefore crucial to understand the performance problems specific to CUDA programming. Unfortunately, no systematic study has been conducted in the literature.
To bridge this gap, we conduct the first systematic study to 1) characterize the symptoms and root causes of 216 performance problems collected from 55 StackOverflow posts and 122 NVIDIA forum posts, and 2) measure the speedup of fixing performance problems, and assess the capability of existing performance analysis methods in identifying performance problems, using a dataset of 69 reproduced performance problems. Our findings provide practical guidance for developers, and opportunities for researchers to advance performance analysis.
Article: fse26maina-p299-p
Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond
Minh Le-Anh,
Huyen Nguyen,
An Khanh Tran,
Nam Le Hai,
Linh Ngo Van,
Nghi D.Q. Bui, and
Bach Le
(FPT Software AI Center, Vietnam; Hanoi University of Science and Technology, Vietnam; University of Melbourne, Australia)
Large language models for code (CodeLLMs) have demonstrated remarkable success in standalone code completion and generation, sometimes even surpassing human performance, yet their effectiveness diminishes in repository-level settings where cross-file dependencies and structural context are essential. Existing Retrieval-Augmented Generation (RAG) approaches often borrow strategies from NLP, relying on chunking-based indexing and similarity-based retrieval. Chunking results in the loss of coherence between code units and overlooks structural relationships, while similarity-driven methods frequently miss functionally relevant dependencies such as helper functions, classes, or global variables. To address these limitations, we present Hydra, a repository-level code generation framework that treats code as structured code rather than natural language. Our approach introduces (i) a structure-aware indexing strategy that represents repositories as hierarchical trees of functions, classes, and variables, preserving code structure and dependencies, (ii) a lightweight dependency-aware retriever (DAR) that explicitly identifies and retrieves the true dependencies required by a target function, and (iii) a hybrid retrieval mechanism that combines DAR with similarity-based retrieval to provide both essential building blocks and practical usage examples. Extensive experiments on the challenging DevEval and RepoExec benchmarks, both of which require function implementation from real-world repositories with large, complex repository context, show that Hydra achieves state-of-the-art performance across open- and closed-source CodeLLMs. Notably, our method establishes a new state of the art in repository-level code generation, surpassing the strongest baseline by over 5% in Pass@1 and even enabling smaller models to match or exceed the performance of much larger ones that rely on existing retrievers.
Article: fse26maina-p306-p
Small Is Beautiful: A Practical and Efficient Log Parsing Framework
Minxing Wang and
Yintong Huo
(Singapore Management University, Singapore)
Log parsing serves as the fundamental step in log analysis, splitting logs into constant templates and dynamic variables. While recent semantic-based parsers leveraging LLMs have shown superior generalizability over prior syntax-based methods, their effectiveness is critically dependent on the scale of the underlying model. This dependency results in a significant performance collapse when using smaller, more practical LLMs, thereby creating a major barrier to real-world adoption where data privacy and computational constraints necessitate the use of compact and resource-efficient models.
In a typical semantic parsing pipeline, the parsing cache is a critical component that stores the set of observed templates to quickly route incoming logs. The design of this cache is therefore paramount to the parser’s overall effectiveness. Motivated by this, we improve parsing accuracy based on two insights: 1) designing a more flexible cache updating strategy that can rectify prior errors, and 2) including an explicit validation process to proofread templates before they are added to the cache, preventing error injection. In particular, we propose EFParser, an unsupervised LLM-based log parser comprising template extraction, template correction, and template validation. To mitigate the impact of degraded capabilities in smaller LLMs, we design a dual cache with an adaptive updating mechanism. When the LLM generates a new template, this module determines whether it is a novel pattern or a variation of an existing one. If it’s a variation, it merges the templates, thereby maintaining consistency and correcting the cache. Furthermore, we integrate a correction module that acts as a gatekeeper, validating and refining every LLM-generated template to ensure only high-quality, accurate patterns are cached. Evaluation on public large-scale datasets demonstrates that EFParser outperforms all baseline methods by an average of 12.5% across all evaluation metrics when running on smaller LLMs, with performance that even exceeds some baseline methods using large-scale LLMs, highlighting the advantages of systematic architectural design. Moreover, despite the additional processing procedures, the average processing time remains shorter than most semantic-based baselines. The superior performance on smaller LLMs combined with computational efficiency demonstrates that EFParser has significant potential for real-world deployment.
Artifacts Available
Article: fse26maina-p310-p
Validating LLM-Generated SQL Queries through Metamorphic Prompting
Li Lin,
Qinglin Zhu,
Jintai Hong,
Chong Wang,
Yang Liu, and
Rongxin Wu
(Xiamen University, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) can translate natural language (NL) into SQL, enabling non-experts to query databases via conversational interfaces. However, the generated SQL often contains intent-violating hallucinations—queries that are syntactically valid and executable, yet semantically misaligned with the user’s question. These failures are especially risky in real-world settings where users cannot verify the correctness. In this paper, we propose MRSQLGen, a framework for detecting intent-violating hallucinations, built on the metamorphic prompting paradigm. MRSQLGen rewrites the input prompt using task-specific transformation rules derived from a hallucination taxonomy, and validates the generated SQL by checking behavioral consistency across multiple executions. Each transformation is associated with a metamorphic relationship (MR) that defines the expected relation between results; discrepancies are aggregated through a majority-vote strategy to robustly flag hallucinations without ground-truth SQL. We evaluate MRSQLGen on two benchmarks (Spider and Bird) using five representative LLMs, including GPT-4o. Experimental results demonstrate that MRSQLGen consistently outperforms state-of-the-art hallucination detection techniques, achieving higher precision and recall in detecting hallucinated SQL queries.
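The consistency check described in this abstract can be illustrated with a minimal sketch. It is not MRSQLGen's implementation: the two relations (`equal`, `subset`), the `emp` table, the hand-written follow-up queries, and the function names are all assumptions made for illustration, and in a real setting the follow-up SQL would come from the LLM's answers to the rewritten prompts.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows as a sorted list for comparison."""
    return sorted(conn.execute(sql).fetchall())

def relation_holds(relation, base_rows, followup_rows):
    """Check one (hypothetical) metamorphic relation between two result sets."""
    if relation == "equal":   # e.g. a paraphrased prompt should give the same rows
        return base_rows == followup_rows
    if relation == "subset":  # e.g. a narrowed prompt should give a subset
        return set(followup_rows) <= set(base_rows)
    raise ValueError(f"unknown relation: {relation}")

def flag_hallucination(conn, base_sql, followups):
    """Flag base_sql when a strict majority of the relations are violated."""
    base_rows = run(conn, base_sql)
    violations = sum(
        not relation_holds(rel, base_rows, run(conn, sql)) for rel, sql in followups
    )
    return violations * 2 > len(followups)

# Toy database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Ann", "eng", 120), ("Bob", "eng", 90), ("Cy", "ops", 80)],
)

# Question: "List the names of employees in engineering."
followups = [
    ("equal", "SELECT name FROM emp WHERE dept = 'eng' ORDER BY name"),
    ("subset", "SELECT name FROM emp WHERE dept = 'eng' AND salary > 100"),
    ("equal", "SELECT e.name FROM emp AS e WHERE e.dept IN ('eng')"),
]
good_sql = "SELECT name FROM emp WHERE dept = 'eng'"
bad_sql = "SELECT name FROM emp WHERE dept = 'ops'"  # executable but off-intent

print(flag_hallucination(conn, good_sql, followups))
print(flag_hallucination(conn, bad_sql, followups))
```

No ground-truth SQL is consulted here; the decision rests only on whether the executed results agree with the expected relations, which is the property the abstract emphasizes.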
Article: fse26maina-p328-p
On the Road to Personalized Code Intelligence: Portraiting and Assisting Developers Based on Their In-IDE Behaviors
Yuhong Liu,
Yunhe Su,
Zhipeng Peng,
Zhiwen Luo,
Lin Shi,
Zhi Jin, and
Li Zhang
(Beihang University, China; Wuhan University, China)
With the advent of powerful large language models (LLMs), research in automated software engineering has increasingly focused on leveraging these models to achieve a deeper semantic understanding of code or to engineer sophisticated agent-based processes. The predominant goal of these efforts is to enhance developer productivity through automated assistance. However, this research trajectory has largely overlooked a critical factor: the developers themselves. Programming is a deeply human and individualized activity; developers exhibit significant variation in their coding styles, tool-chain preferences, domain-specific expertise, and problem-solving strategies. Consequently, the current paradigm of one-size-fits-all code intelligence systems struggles to accommodate the unique characteristics and needs of individual developers. To address this gap, we introduce VirtualME, a novel IDE-embedded data infrastructure designed to model the developer by continuously capturing and interpreting their dynamic programming behaviors and preferences.
VirtualME contains three components. (1) Log-level Behavior Extraction: it captures and extracts developers' log-level behaviors (edits, navigations, etc.) from IDE. (2) Task-level Behavior Recognition: it aggregates log-level behaviors into task-level behaviors (“skimming API docs”, “iterative debugging”, etc.) via a multi-agent pipeline. (3) Developer-persona Measurement: it builds a rule engine to distill a four-dimensional developer persona: Core Technical Foundation, Practical Development Efficiency, Personal Development Norms, and Technical Adaptability.
On top of VirtualME, we propose a solution for personalized repository-level knowledge Q&A by integrating the developer persona into a Chain-of-Thought (CoT) guided agent. We evaluated VirtualME by building a multi-repository benchmark with real-world developer trajectories, balancing correctness and personalization. Experimental results show that VirtualME-enhanced answers outperform generic baselines on five dimensions: correctness, cognitive-level fit, technology-stack relevance, behavioral-pattern alignment, and stylistic preference, yielding an average improvement of 33.80%. Our results demonstrate that abundant, continuous developer-behavior data can unlock Personalized Code Intelligence. By integrating this personalized understanding into the code intelligence loop, our approach paves the way for adaptive and personalized code intelligence.
Article: fse26maina-p336-p
Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models
Shangbo Yun,
Xiaodong Gu,
Jianghong Huang, and
Beijun Shen
(Shanghai Jiao Tong University, China)
The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages and how such relationships can be utilized to optimize both training and inference. In this work, we investigate two fundamental questions: (1) What are the deep linguistic relationships among programming languages? and (2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of programming languages. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definitions, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples across multiple languages. By embedding these semantically parallel code snippets from 19 languages, we construct a similarity matrix and perform hierarchical clustering to uncover inherent language relationships. Our analysis reveals clear hierarchical structures among programming languages. Closely related languages form well-defined clusters (e.g., C, C++, Java, and Swift group together), while Go emerges as a “lingua franca” with the highest cross-language similarity. Building on the uncovered language families, we propose three strategies to enhance multilingual LLM training: transfer learning across linguistically related languages, linguistic proximity-guided curriculum learning, and centroid-based intermediary code translation. Experiments on four code intelligence tasks demonstrate that our methods significantly improve multilingual LLM performance. This work offers a universal perspective on programming languages and advances more effective strategies for multilingual code LLM training.
Preprint
Article: fse26maina-p343-p
ProofFusion: Improving Neural Theorem Proving via Adaptive Retrieval-Augmented Reasoning
Manqing Zhang,
Yunwei Dong,
Lingru Zhou,
Bingxu Xiao, and
Yepang Liu
(Northwestern Polytechnical University, China; Southern University of Science and Technology, China)
Interactive theorem proving (ITP) is a powerful approach to ensuring the correctness of complex software systems. However, it often requires substantial manual effort, which makes it costly to use in practice. Recently, neural-network-based approaches have shown promise in automatically generating proof tactics. Nevertheless, existing methods suffer from a long-tailed distribution in tactic usage within the training data. A few frequent tactics dominate the probability distribution, while many rare yet crucial ones are consistently suppressed in the model’s candidate ranking. This distributional bias can cause potentially provable goals to be prematurely abandoned during proof search. In addition, the decision-making process of neural networks when generating tactics lacks explicit reasoning traces, making it difficult for humans to explain or verify the underlying logic. To address these limitations, we propose ProofFusion, an adaptive retrieval-augmented reasoning framework that improves the proving capability of neural theorem provers without requiring retraining. Our key insight is inspired by the way human provers tackle a new theorem by consulting similar previously proven theorems to guide their own reasoning. Specifically, we develop a proof semantic-aware retriever that searches a knowledge base for semantically similar historical proof goals together with their tactics, producing a traceable set of reference decisions. We then employ a dual-track reranking fusion mechanism to integrate both the original predictions of the neural model and the retrieved reference tactics. Furthermore, to mitigate potential noise introduced by retrieval, we design a capability-adaptive retrieval mechanism that dynamically determines when retrieval should be applied. We conduct a systematic evaluation on 10,782 theorems from 26 Coq projects in a real ITP environment.
Experimental results show that ProofFusion increases the number of theorems proved by four state-of-the-art neural theorem provers by an average of 6.89%, and additionally proves 17.50% of previously unprovable theorems. In addition, it substantially improves the explainability of proof steps, achieving an average explainable proof goal proportion of 82.1% across the four provers. Together, these results demonstrate that ProofFusion is a practical and effective complement to existing neural theorem proving systems, enhancing both performance and explainability.
Article: fse26maina-p354-p
Speculate: Generating REST API Specifications using LLMs
Krishanu Singh,
Kushagra Karar,
Abhilash Jindal, and
Guowei Yang
(Unaffiliated, India; IIT Delhi, India; University of Queensland, Australia)
REST APIs are widely used for accessing web services. To support their use and testing, developers often write API specifications in open formats like OpenAPI. However, writing these specifications manually is tedious and error-prone, leading to incomplete or outdated specifications that can hinder both API usage and automated testing. Existing tools attempt to generate API specifications from source code using static or dynamic analysis. Unfortunately, these tools are often tightly coupled to specific languages and frameworks, making them hard to generalize and extend.
This paper presents Speculate, the first approach to combine lightweight static analysis with large language models (LLMs) to automatically generate REST API specifications from source code. By leveraging LLMs trained on diverse codebases, Speculate generalizes easily across languages and frameworks. We evaluate Speculate on 19 real-world web repositories spanning two programming languages and three REST frameworks. Our results show that Speculate outperforms existing tools on both precision and recall across all dimensions.
Artifacts Available
Article: fse26maina-p444-p
SWE Data Construction, Automatically!
Lianghong Guo,
Yanlin Wang,
Caihua Li,
Wei Tao,
Pengyu Yang,
Jiachi Chen,
Haoyu Song,
Duyu Tang, and
Zibin Zheng
(Sun Yat-sen University, China; Independent Researcher, China; Huawei Technologies, China)
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the existing GitHub issue resolution data construction pipeline is challenging and labor-intensive. We identify three key limitations in existing pipelines: (1) test patches collected often omit binary file changes; (2) the manual construction of evaluation environments is labor-intensive; and (3) the fail2pass validation phase requires manual inspection of test logs and writing custom parsing code to extract test status from logs. In this paper, we propose SWE-Factory, a fully automated issue resolution data construction pipeline, to resolve these limitations. First, our pipeline automatically recovers missing binary test files and ensures the correctness of test patches. Second, we introduce SWE-Builder, an LLM-based agentic system that automates evaluation environment construction. Third, we introduce a standardized, exit-code-based log parsing method to automatically extract test status, enabling a fully automated fail2pass validation. Experiments on 671 real-world GitHub issues across four programming languages show that our method can effectively construct valid evaluation environments for GitHub issues at a reasonable cost. For example, with GPT-4.1 mini, our SWE-Builder constructs 337 valid task instances out of 671 issues, at $0.047 per instance. Our ablation study further shows the effectiveness of different components of SWE-Builder. We also demonstrate through manual inspection that our exit-code-based fail2pass validation method is highly accurate, achieving an F1 score of 0.99. Additionally, we conduct an exploratory experiment to investigate whether we can use SWE-Factory to enhance models’ software engineering ability. After training five models on 2,809 Python task instances collected by our method, all models show improved software engineering ability. For example, the resolve rate of a trained Qwen2.5-Coder-14B-Instruct on SWE-bench Verified increases from 5.8% to 21.0%. We hope our method can accelerate the construction of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation.
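The idea of exit-code-based fail2pass validation can be sketched as follows. This is a simplified illustration, not SWE-Factory's code: the helper names are hypothetical, and two tiny Python subprocesses stand in for running a project's test command before and after the fix is applied.

```python
import subprocess
import sys

def status_from_exit_code(cmd):
    """Run a test command; exit code 0 is read as 'pass', anything else as 'fail'."""
    return "pass" if subprocess.run(cmd).returncode == 0 else "fail"

def is_fail2pass(status_before, status_after):
    """A task instance is valid when tests fail before the fix and pass after it."""
    return status_before == "fail" and status_after == "pass"

# Stand-ins for the real test command run without and with the fix (assumed).
before = status_from_exit_code([sys.executable, "-c", "import sys; sys.exit(1)"])
after = status_from_exit_code([sys.executable, "-c", "import sys; sys.exit(0)"])

print(before, after, is_fail2pass(before, after))
```

Relying on the exit code avoids writing a log parser per test framework, which is the manual step the abstract identifies in existing pipelines.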
Article: fse26maina-p451-p
CertiCoder: Towards MISRA-Compliant C Code Generation with LLMs
Min Gou,
Zhiyu Yao,
Hualong Ma,
Ende Zhang,
Jian Zhou, and
Fei He
(University of Electronic Science and Technology of China, China; Tsinghua University, China)
Large language models (LLMs) are increasingly applied to code generation in IDEs, CI pipelines, and automated workflows. Existing evaluations, however, have largely focused on functionality, with comparatively limited attention to compliance with established safety standards. This gap is particularly critical for C, where programmes may compile and pass unit tests yet still violate MISRA C:2012, a widely adopted guideline in safety-critical domains. We present CertiCoder, a post-training framework with rule-aware optimization that transforms tool-verified outcomes into per-rule contrasts and trains models through three stages: rule tuning, cold-start supervised fine-tuning, and rule-aware preference optimization. This design helps models not only distinguish compliant from violating outputs but also associate violations with specific rules. To support reproducible assessment, we construct a Codeforces-derived C benchmark with frozen splits, multi-level decontamination, and metrics that jointly measure MISRA compliance (S1), functional correctness (F1), and their conjunction (J1). On Qwen2.5-Coder backbones (3B–14B), CertiCoder substantially improves compliance from near-zero to measurable J1 levels and generally preserves functional correctness, outperforming non–rule-aware baselines such as SFT and SafeCoder. To our knowledge, this makes CertiCoder among the first post-training frameworks to explicitly optimize both compliance and correctness, offering a practical step toward more auditable and extensible use of LLMs in safety-critical software systems.
Article: fse26maina-p499-p
Recommending Usability Improvements with Multimodal Large Language Models
Sebastian Lubos,
Alexander Felfernig,
Damian Garber,
Viet-Man Le, and
Manuel Henrich
(Graz University of Technology, Austria; UNiQUARE Software Development, Austria)
Usability describes quality attributes of application user interfaces that determine how effectively users can interact with them. Traditional usability evaluation methods require considerable expertise and resources, which can be challenging, especially for small teams and organizations. Automating usability evaluation could make it more accessible and help to improve the user experience. The recent emergence of powerful multimodal large language models (MLLMs) has opened new opportunities for automating usability evaluation and recommendation of improvements. These models can process visual inputs such as images and videos alongside textual context, which enables the identification of usability issues and the generation of actionable suggestions to resolve these issues.
In this paper, we present a novel automated approach that uses limited application context and screen recordings of user interactions as input to an MLLM. The model automatically identifies and describes usability issues based on Nielsen’s usability heuristics, and provides corresponding explanations and improvement recommendations. To reduce the developer effort of manual prioritization, the recommendations are ranked by severity. The quality and practical usefulness of the generated recommendations were evaluated based on a user study that involved software engineers as participants. The evaluation focused on the highest-ranked suggestions provided by the model. The results demonstrate the potential of our approach to provide low-effort usability improvement recommendations. This makes it a promising complement to traditional evaluation methods, especially in settings with limited access to usability experts. In this sense, the approach serves as a basis for future integration into development tools to enable automated usability evaluation within software engineering workflows.
Artifacts Available
Article: fse26maina-p500-p
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Haocheng Huang,
Yuchen Chen,
Weisong Sun,
Peizhuo Lv,
Yuan Xiao,
Chunrong Fang,
Yang Liu, and
Xiaofang Zhang
(Soochow University, China; Nanjing University, China; Nanyang Technological University, Singapore)
Constructing and curating high-quality code datasets requires significant resources, making them valuable intellectual property. Unfortunately, these datasets currently face severe risks of unauthorized use. Although digital watermarking offers a post hoc mechanism for copyright authentication, existing methods are predominantly based on the co-occurrence pattern, which is not robust and is susceptible to watermark detection and removal attacks. In this paper, we propose PuzzleMark, a robust watermarking method for code datasets. To reduce the risk of watermark exposure, PuzzleMark introduces a carrier selection strategy that leverages code complexity to evaluate the suitability of code snippets as watermark carriers, and selects those with high suitability for watermarking. To enhance the robustness of the watermark, PuzzleMark proposes a novel concatenation pattern to replace the traditional co-occurrence pattern, and implements two watermarking strategies through variable name concatenation. PuzzleMark adaptively embeds watermarks based on the inherent characteristics of the code, making it more stealthy while maintaining design simplicity. For watermark verification, PuzzleMark employs Fisher’s exact test to verify suspicious models under a black-box setting. Experimental results demonstrate that PuzzleMark achieves a 100 percent verification success rate and a 0 percent false positive rate, with negligible impact on model performance. Both our human study and our evaluation using four state-of-the-art watermark detection methods show that PuzzleMark exhibits strong imperceptibility, with an average suspicious rate ≤ 0.24 and an average recall ≤ 30.41 percent, respectively. Furthermore, the consistent retention of verifiability under two attack scenarios further corroborates the robustness of PuzzleMark. 
As a practical digital watermarking method, PuzzleMark provides strong protection for the intellectual property of code datasets and offers new insights for future research.
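For readers unfamiliar with the verification step, the following is a minimal sketch of a one-sided Fisher's exact test on a 2×2 contingency table, written with the standard library only. The table layout (suspect model versus a clean reference model, watermark pattern produced versus not produced) and the counts are assumptions for illustration and are not taken from the paper.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the chance of seeing at least `a` hits in the first row by luck.
    """
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)
    tail = sum(comb(r1, k) * comb(r2, c1 - k) for k in range(a, min(r1, c1) + 1))
    return tail / total

# Hypothetical counts: rows are (suspect model, clean reference model),
# columns are (completion contains the watermark pattern, does not).
p_value = fisher_exact_greater(18, 2, 3, 17)
print(p_value < 0.05)
```

A small p-value means the suspect model reproduces the pattern far more often than chance would explain, which is the kind of evidence a black-box verifier can collect from model outputs alone.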
Article: fse26maina-p509-p
SmartIFSyn: Automated Information Flow Security Policy Synthesis for Smart Contracts
Yinghao Wu,
Miaomiao Zhang,
Fu Song, and
John Baugh
(Tongji University, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; North Carolina State University, USA)
Smart contracts have achieved significant success; however, their security remains a long-standing challenge. The immutability and transparency of smart contracts require establishing a strong mechanism to prevent privacy leakage and tampering with trusted data. Apart from traditional logic and code-level vulnerabilities arising from insufficient control over contract variables and function parameters, smart contracts may store information that depends on private data in blockchain records, a critical type of vulnerability that is often overlooked in existing security analysis. In this paper, we present an automated approach for synthesizing security policies, named SmartIFSyn, to eliminate information flow vulnerabilities in smart contracts. We formalize the semantics of Solidity, the most widely used smart contract language, and analyze information flow security of Solidity smart contracts from two perspectives: local-variable security and global-interaction security. We present a type system to guide the elimination of local-variable vulnerabilities by inferring a policy, and resort to constraint solving to synthesize a desired policy in case the type system fails. The policy ensures both local-variable and global-interaction security while being maximally aligned with user preference. Furthermore, the policy can be subsequently converted into enforceable specifications. We implement our approach in a tool and evaluate it on 17,160 real-world Ethereum smart contracts. The experimental results demonstrate the efficacy of our approach, e.g., it detected 243 vulnerabilities in 223 real-world Ethereum smart contracts.
Article: fse26maina-p514-p
Protocol Reverse Engineering via Deep Transfer Learning
Yanyang Zhao,
Zhengxiong Luo,
Wenlong Zhang,
Feifan Wu,
Yuanliang Chen,
Fuchen Ma,
Qi Xu,
Heyuan Shi, and
Yu Jiang
(Tsinghua University, China; National University of Singapore, Singapore; Central South University, China)
Protocol reverse engineering infers the specification of proprietary or poorly documented protocols and serves as the foundation for security analysis such as fuzz testing. While many existing techniques achieve this by mining statistical features from network traces, they face increasing challenges due to incomplete field pattern information available in the traces. Although protocol development has accumulated rich prior knowledge about protocol design, this knowledge remains largely untapped in protocol reverse engineering.
This paper introduces TransRE, a protocol reverse engineering tool that leverages prior syntax knowledge from standardized protocols through deep transfer learning to better understand proprietary protocols. TransRE first selects optimal source domains by analyzing inter-domain differences between the existing knowledge base and the target protocol. It then employs a neural network to extract representation features and applies domain adaptation techniques to optimize the syntax transfer model, enabling accurate inference of protocol formats. Our evaluation on 12 widely used protocols shows that TransRE identifies fields with a perfection score of 0.43, which is 1.48×-3.07× the performance achieved by five state-of-the-art methods. Furthermore, to demonstrate practical applicability, we enhanced an existing protocol fuzzer with TransRE for testing proprietary protocols in real-world network cameras and discovered four bugs.
Article: fse26maina-p520-p
Understanding Binary Code Similarity for Real-World Vulnerability Detection: A Large-Scale Empirical Study
Jingdong Guo,
Chaopeng Dong,
Yimo Ren,
Siyuan Li,
Jie Liu,
Hong Li, and
Hongsong Zhu
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Hangzhou Dianzi University, China; Shandong University, China)
Firmware lies at the heart of IoT devices. Its development depends heavily on third-party libraries (TPLs), which greatly accelerate the process but simultaneously introduce associated vulnerabilities.
Binary Code Similarity Detection (BCSD) is an effective technique for identifying vulnerabilities in firmware by comparing pairs of code segments. However, existing studies either evaluate their performance only on small-scale datasets or lack diversity in terms of vulnerabilities, TPLs, and firmware. Consequently, a comprehensive understanding of BCSD for real-world vulnerability detection remains absent.
To bridge this gap, we conduct a large-scale study of vulnerability detection across 60,000 firmware images from 200 vendors using BCSD. Rather than introducing a novel model, we examine the influence of four key factors (vulnerable function versions, vulnerability search space, function sizes, and compilation toolchains) on BCSD performance. Our results reveal that these factors substantially affect performance, often by wide margins. To address this, we propose a build-aware query strategy that derives queries from representative real-world binaries, effectively closing the gap and raising the mean reciprocal rank (MRR) from 0.818 to 0.981. Furthermore, we demonstrate that a TPL-aware, two-stage search process significantly enhances accuracy, improving MRR by 18.5% by limiting the search space.
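Since the results above are reported as mean reciprocal rank, a short sketch of how that metric is computed may help. The function name and the example ranks are hypothetical and are not data from the study.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over queries.

    `ranks` holds, per query, the 1-based position of the first correct
    match in the returned candidate list, or None when it was not found.
    """
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# Four hypothetical vulnerable-function queries: found at ranks 1, 2, 4,
# and one query where the true match was not retrieved at all.
example_ranks = [1, 2, 4, None]
print(mean_reciprocal_rank(example_ranks))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

An MRR close to 1 therefore means the true vulnerable function is almost always ranked first among the candidates.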
Article: fse26maina-p550-p
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
Jiahong Xiang,
Xiaoyang Xu,
Xiaopan Chu,
Hongliang Tian, and
Yuqun Zhang
(Southern University of Science and Technology, China; Ant Group, China)
Autonomous agents for automated program repair represent a promising frontier in software engineering, yet their effectiveness is often hindered by reliance on post-mortem, coarse-grained execution feedback. While integrating traditional interactive debuggers seems a natural solution, their low-level, line-by-line interaction paradigm turns out to be cost-inefficient for LLM-based agents, leading to exhausted budgets and unproductive loops. To mitigate this, we introduce the Agent-centric Debugging Interface (ADI), a novel debugging interface designed for cost-efficient, end-to-end autonomous interaction. Specifically, ADI realizes a function-level interaction paradigm, powered by our Frame Lifetime Trace (a comprehensive data structure encapsulating a function's stateful execution trace) and a set of high-level navigational commands.
Our extensive evaluation on the SWE-bench benchmark demonstrates the effectiveness and efficiency of ADI. By simply equipping a basic agent with ADI, it successfully resolves 63.8% of the tasks on the SWE-bench Verified set, even slightly outperforming the highly-optimized and high-investment Claude-Tools agent, at an average cost of $1.28 per task with Claude-Sonnet-3.7. Furthermore, we demonstrate ADI's generality by integrating it as a plug-and-play component into the existing SOTA agents, delivering consistent gains ranging from 6.2% to 18.5% on the resolved tasks. These results indicate that Agent-centric Debugging Interface could achieve a general and efficient enhancement for the existing autonomous agents.
Artifacts Available
Article: fse26maina-p637-p
BackportBench: A Multilingual Benchmark for Automated Patch Backporting
Zhiqing Zhong,
Jiaming Huang, and
Pinjia He
(Chinese University of Hong Kong, Shenzhen, China)
Many modern software projects evolve rapidly to incorporate new features and security patches.
It is important for users to update their dependencies to safer versions, but many still use older, vulnerable package versions because upgrading can be difficult and may break their existing codebase.
Software developers can mitigate this problem by backporting security patches to older releases. However, manual backporting is time-consuming and error-prone. The effectiveness of existing automated backporting techniques on general software remains unclear since they typically target only code-hunk or function-level patch porting scenarios and are evaluated with imperfect metrics.
To facilitate the development and evaluation of automated backporting techniques, we introduce BackportBench, the first comprehensive benchmark suite for the patch backporting problem.
BackportBench is a multilingual benchmark that contains 202 patch backporting problems from PyPI, Maven, and npm, each with executable Docker environments and relevant test cases.
We evaluated existing patch porting methods and LLM-based techniques that have the potential to adapt to this task using BackportBench. The results show that the agentic method has outperformed traditional patch porting methods, especially on cases that require logical and structural changes. However, the performance varies across different programming languages.
Based on the findings, we draw several implications for researchers and software practitioners in future work on automated backporting.
Article: fse26maina-p717-p
Detecting Bugs in Rust Compiler Fix Suggestions via Constraint-Violation-Guided Mutation
Zixi Liu,
Yang Feng,
Jialiang Jiang, and
Baowen Xu
(Nanjing University, China)
Rust is a modern systems programming language that ensures memory safety through unique mechanisms, including ownership, borrowing, and lifetime annotations. These features prevent critical vulnerabilities but also impose strict constraints that many developers find difficult to understand. To mitigate this challenge, the Rust compiler, rustc, provides rich diagnostics and fix suggestions. However, recent studies reveal that diagnostic issues account for about 20% of all reported rustc bugs. Our analysis of rustc's suggestion bugs fixed over the past three years shows that most of them originated from errors in Rust-specific core modules, such as the type checker and borrow checker, rather than from simple mistakes in the general diagnostic logic, like suggesting an incorrect variable name or mismatched parentheses. The impact of diagnostic issues, especially bugs in rustc's fix suggestion, should not be underestimated, as they can mislead developers and reduce rustc's usability, and in severe cases may even lead to rustc crashes. Existing testing tools, however, provide little support for systematically evaluating the correctness and reliability of these suggestions.
To address this gap, in this paper, we present SugBreaker, an automated testing framework specifically designed to validate rustc's suggestions. We propose a constraint-violation-guided mutation approach that injects type-related, borrow-related, and lifetime-related errors into valid Rust programs to trigger compiler diagnostics and iteratively verify the correctness of suggested fixes.
SugBreaker has already detected 12 bugs, and 11 of them have been confirmed or fixed; all of them are triggered by different rustc error messages. Compared with a series of rustc testing baseline tools, SugBreaker achieves broader coverage of rustc's core checking modules and a higher suggestion trigger rate, which further confirms the effectiveness and efficiency of SugBreaker for testing rustc's fix suggestions.
Article: fse26maina-p758-p
Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
He Yang Yuan,
Xin Wang,
Kundi Yao,
An Ran Chen,
Zishuo Ding, and
Zhenhao Li
(York University, Canada; Hong Kong University of Science and Technology (Guangzhou), China; University of Waterloo, Canada; University of Alberta, Canada)
Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset with 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (e.g., the accuracy ranges from 12.9% to 52.5% on average), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.
Article: fse26maina-p783-p
Semantics-Guided Control-Flow Reconstruction for Firmware Binaries via Static Analysis
Fengjuan Gao,
Qingjie Zhu,
Yi Zhang,
Yu Wang,
Xuandong Li, and
Ke Wang
(Nanjing University, China; Nanjing University of Science and Technology, China)
Control-flow reconstruction is a fundamental yet challenging problem in firmware analysis, particularly for stripped or raw-format binaries that lack symbolic metadata. Existing methods typically rely on syntax heuristics or format-specific patterns, which are often inadequate for real-world firmware that includes indirect jumps, manually crafted assembly, and limited metadata.
We present a semantics-guided static analysis framework for accurate control-flow reconstruction in stripped ELF and raw-format firmware binaries. Our approach consists of two complementary components: (i) an intra-procedural control-flow reconstruction method that incrementally recovers direct branches, indirect jumps, and call-return flows via fixpoint-guided value-flow analysis; and (ii) an inter-procedural analysis that resolves indirect calls through cross-function value tracking and loop-structure matching. By decoupling control-flow reasoning from instruction semantics and function abstraction, our framework robustly handles tightly intertwined control-flow patterns and mitigates the impact of misanalysis.
We implement our approach in Scarf (Semantics-guided Control-flow Analysis for Raw and Firmware binaries) and evaluate it on over 300 real-world firmware binaries in both ELF and raw formats. Compared with state-of-the-art reverse engineering tools, Scarf consistently achieves higher precision in control-flow recovery and demonstrates clear advantages on raw firmware, especially in resolving indirect jumps, call–return flows, and indirect calls. These results demonstrate that semantics-guided analysis provides a robust and scalable foundation for control flow reconstruction in metadata-deficient firmware.
Article: fse26maina-p791-p
When Shared Worlds Break: Demystifying Defects in Multi-user Extended Reality Software Systems
Shuqing Li,
Chenran Zhang,
Binchang Li,
Cuiyun Gao, and
Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong; Harbin Institute of Technology, China)
Multi-user Extended Reality (XR) systems enable transformative shared experiences but introduce unique software defects that compromise user experience. Understanding software defects in multi-user XR systems is crucial for enhancing system reliability, yet remains underexplored. To fill the gap, this paper presents the first large-scale empirical study of multi-user XR defects, analyzing 2,649 real-world bug reports from diverse sources, including developer forums, GitHub repositories, and app reviews on mainstream XR app stores. Through rigorous qualitative analysis using iterative open coding, we develop a comprehensive taxonomy that classifies multi-user XR bugs along three dimensions: Symptom Manifestation, Root Cause Origin, and Consequence Severity. Our findings reveal that synchronization inconsistencies and avatar-related anomalies are the most prevalent symptoms, while network/synchronization logic defects and session management flaws emerge as dominant root causes. Critically, over 34% of analyzed bugs lead to severe consequences that fundamentally break the shared experience, such as system crashes, persistent disconnections, and complete interaction breakdowns. We also identify concerning privacy and health implications unique to multi-user XR contexts. Based on our defect analysis, we provide actionable recommendations for developers, platform vendors, and researchers. Our results demonstrate that multi-user XR systems face distinct challenges at the intersection of distributed systems, real-time 3D interaction, and immersive experiences, necessitating specialized approaches to testing, debugging, and quality assurance.
Article: fse26maina-p808-p
Flash: Query-Efficient Black-Box Static Malware Evasion through Transferable GAN-Guided Modification Sequences
Anyuan Sang,
Li Yang,
Lu Zhou,
Junbo Jia, and
Huipeng Yang
(Xidian University, China)
Machine learning (ML)–based static malware detectors are widely deployed for Portable Executable (PE) files due to their scalability and efficiency, yet they remain vulnerable to carefully crafted adversarial perturbations. Existing black-box evasion methods either rely on transfer attacks, which break down when surrogate and target decision boundaries diverge, or on query-driven searches, which require impractically many queries. We present Flash, a two-phase adversarial framework tailored for static PE malware detection that integrates the strengths of both approaches. In the first phase, a generative adversarial network is trained against heterogeneous surrogate detectors to generate function-preserving PE modifications with inherent evasiveness. In the second phase, an evolutionary optimizer refines these sequences directly against the target model with a dual-objective fitness that balances evasion success and minimal perturbation cost. Experiments on 12,039 VirusShare PE files and six state-of-the-art static detectors demonstrate that Flash reduces query counts by 86% while maintaining bypass rates above 95.8%. Furthermore, adversarial training with Flash-generated samples reduces attack success rates by 82.4%, highlighting Flash’s utility for both exposing vulnerabilities and strengthening the robustness of static PE malware detectors.
Article: fse26maina-p857-p
JavaScript Pointer Analysis with Adaptive Heap Abstraction
Wenyuan Xu and
Anders Møller
(Aarhus University, Denmark)
The conventional approach to represent objects in static program analysis is to use allocation-site abstraction. This design choice may lead to redundant computations when many abstract objects are similar. Existing mechanisms that aim to merge such objects are not effective for JavaScript. We propose a novel adaptive heap abstraction technique that during analysis discovers and merges similar abstract objects, thereby reducing the analysis complexity while preserving most of the precision.
The technique has been implemented in a state-of-the-art program analyzer for JavaScript. On a collection of 96 challenging programs, it yields a 2X speedup on average (up to 17X) with a negligible loss of precision. The experimental results also show the effects of various analysis configurations.
Artifacts Available
Article: fse26maina-p907-p
ACME: Automated Clause Mapping Engine for Testing Emerging Database Systems
Yuancheng Jiang,
Jianing Wang,
Chuqi Zhang,
Roland H. C. Yap,
Zhenkai Liang, and
Manuel Rigger
(National University of Singapore, Singapore; Shandong University, China)
A growing number of emerging database management systems, such as time-series and streaming database systems, have been developed to support specialized workloads with enhanced performance and functionality. However, these systems are often less mature than traditional relational database systems, making them more prone to logic bugs and internal errors affecting correctness and reliability. To address this, we propose an enhanced differential testing framework designed for emerging SQL-like database systems. Our key insight is that many of these systems are conceptually extensions of relational database systems, allowing us to uncover bugs by comparing query results with those from more robust and mature relational database systems. To bridge the differences in syntax and semantics between emerging and relational database systems, we leverage Large Language Models (LLMs) to make differential testing more effective by discovering clause mappings that translate system-specific features in emerging database systems into equivalent SQL expressions. Our approach proceeds in three steps: (i) analyzing the syntax and semantics of queries with runtime errors to reason about clause mappings using LLMs; (ii) validating the generated clause mappings by executing test queries and re-prompting upon validation failures; and (iii) generating semantically equivalent, yet syntactically diverse queries to broaden the coverage of differential testing. We implemented this approach in a tool called ACME and applied it to four widely used emerging database systems, uncovering 59 previously unknown bugs, including 17 logic bugs and 42 internal errors. Of these, 52 have been fixed and 5 confirmed by vendors. Our evaluation demonstrates that ACME enhances LLM reliability through query validation. Furthermore, the evaluation shows the effectiveness of differential testing even when using local models or a limited online token budget.
Our results demonstrate the practicality and effectiveness of ACME in improving the robustness and accuracy of emerging database systems through scalable, LLM-assisted differential testing.
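The differential-testing idea in this abstract can be illustrated with a small, hedged sketch. It is not ACME's implementation: both "systems" here are in-memory SQLite databases standing in for an emerging system and a mature reference, and `QUERY`, `MAPPED_QUERY`, `make_db`, and `results_agree` are hypothetical names. The mapped query plays the role of a semantically equivalent but syntactically different rewrite.

```python
import sqlite3
from collections import Counter

# Toy data loaded into both stand-in systems (an assumption for illustration).
ROWS = [("s1", 1, 20.5), ("s1", 2, 21.0), ("s2", 1, 19.0), ("s2", 2, 18.5)]


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding the shared test data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", ROWS)
    return conn


def result_multiset(conn: sqlite3.Connection, sql: str) -> Counter:
    """Run a query and return its rows as a multiset, ignoring row order."""
    return Counter(conn.execute(sql).fetchall())


def results_agree(system_under_test, reference, query, mapped_query) -> bool:
    """Differential oracle: a mismatch signals a potential logic bug."""
    return result_multiset(system_under_test, query) == result_multiset(
        reference, mapped_query
    )


QUERY = "SELECT sensor, MAX(value) FROM readings GROUP BY sensor"
MAPPED_QUERY = (
    "SELECT DISTINCT sensor, "
    "(SELECT MAX(value) FROM readings AS r2 WHERE r2.sensor = r1.sensor) "
    "FROM readings AS r1"
)

if __name__ == "__main__":
    print(results_agree(make_db(), make_db(), QUERY, MAPPED_QUERY))
```

Comparing multisets rather than lists avoids false alarms when the two systems return the same rows in a different order.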
Article: fse26maina-p919-p
PROGnosticator: Testing Source-to-Source Code Translators via Construct-Oriented Fuzzing
Yeaseen Arafat and
Stefan Nagy
(University of Utah, USA)
To ease software interoperability and migration, developers are increasingly embracing transpilers: automated tools for converting source code from one language to another. Unfortunately, differences in language constructs, syntax, and semantics leave transpilers facing many translation bugs, emitting incorrect or non-functional translations. Thorough, proactive transpiler testing is thus critical to the reliability of emergent translation-oriented development. While fuzzers excel in testing adjacent classes of language processors (e.g., compilers), current fuzzers remain tied to only the specific code patterns expressed in their inputs (hardcoded grammars or seed programs), which are costly to curate and extend, constraining their testing to just narrow subsets of language constructs. Worse yet, their generated programs are often overly complex, requiring non-trivial reduction to pinpoint the exact code patterns behind transpiler errors. Evaluating current and future transpilers thus demands a rigorous, input-independent fuzzing strategy that systematically exercises languages’ broad range of code constructs without needing costly per-language expertise or re-engineering.
To bridge this gap, we present Construct-oriented Fuzzing: a language-agnostic yet construct-aware approach for systematically testing transpilers. Motivated by our insights from past transpiler bugs, revealing most translation errors embody construct-specific mishandling, our approach explicitly targets the vast space of code patterns derived from core language constructs. Harnessing large language models’ code understanding and synthesis, we (1) automatically enumerate a language’s core constructs, before (2) generating self-contained programs exercising them individually—and combinations thereof—precisely testing transpilers’ many edge-cases whilst eschewing cumbersome grammars or seeds. In evaluating our prototype, PROGnosticator, against four state-of-the-art compiler and transpiler fuzzers across seven transpilers for C, Go, and JavaScript, we show how our approach attains high per-language validity as well as construct-usage diversity—exposing 77 total transpiler bugs, of which 64 are previously unknown, with 63 since confirmed or fixed by developers.
Article: fse26maina-p922-p
InDe-LLM: Defending against Jailbreak Attacks in LLM-Powered Systems via Intention Disentangling
Yujue Wang,
Quan Zhang,
Chijin Zhou,
Gwihwan Go,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; East China Normal University, China; Aviation Industry Corporation of China, China)
Jailbreak attacks have been regarded as a crucial threat to LLM-powered software systems. Recent studies indicate the existence of a steering vector within models' internal activations, which can adjust a model's propensity to reject user requests, and thus is regarded as an effective approach for training-free defense. However, attackers may wrap their malicious intentions within a seemingly benign context, which shifts the distribution of harmful prompts toward benign inputs along the steering vector, effectively bypassing existing defense approaches. In this work, we propose a defense framework InDe-LLM based on intention disentangling. By projecting the embedding of inputs into a benign-invariant subspace, we could disentangle the harmful intentions of jailbreak prompts without affecting benign inputs. Next, such disentangled harmful intentions can be easily identified based on LLMs' well-aligned concept of harmfulness, and rejected through activation steering. Our experiments show that InDe-LLM achieves high defense effectiveness, outperforming baselines by 27.2%–43.5% across three models and ten attacks while preserving high utility on benign inputs. Moreover, our evaluation demonstrates that it exhibits high transferability to unseen attacks.
Article: fse26maina-p965-p
TransLibEval: Demystify Large Language Models’ Capability in Third-Party Library-Targeted Code Translation
Pengyu Xue,
Kunwu Zheng,
Zhen Yang,
Yifei Pei,
Linhao Wu,
Jiahui Dong,
Xiapu Luo,
Yan Xiao,
Fei Liu,
Yuxuan Zhang,
Xiran Lyu,
Xianhang Li,
Xuanyu Zhu, and
Chengyi Wang
(Shandong University, China; Hong Kong Polytechnic University, China; Sun Yat-sen University, China)
In recent years, Large Language Models (LLMs) have been widely studied in the code translation field on the method, class, and even repository levels. However, most existing benchmarks are limited in terms of Third-Party Library (TPL) categories and scales, making TPL-related errors hard to expose and hindering the development of targeted solutions. Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative.
To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline over 60%), while diverse strategies demonstrate heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were obscured previously. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
Article: fse26maina-p1029-p
Mining Long Tail Bugs: Identifying Rare and Overlooked Issues in Code
Wentao Liang,
Yanjun Wu,
Xiang Ling,
Tianyue Luo,
Dinghao Liu,
Haotian Zhang, and
Jingzheng Wu
(Institute of Software at Chinese Academy of Sciences, China; Shandong University, China)
Using data mining to extract frequent code patterns for bug detection has proven effective. However, prior studies have overlooked the prevalence of infrequent (rare) patterns, even though violations of such patterns can also lead to bugs.
In this paper, we present LTMiner, which mines rare patterns from large-scale projects and detects potential bugs by checking for violations of these patterns. In practice, rare patterns far outnumber frequent ones and lack strong statistical support. Consequently, we face a pattern explosion, and many rare patterns and their violations are uninteresting. LTMiner addresses this by using instance-based ranking and filtering to prioritize violations of rare patterns. It further employs a large language model (LLM) as a domain expert to audit top-ranked violations; mined information supports in-context learning, and task decomposition and self-reflection mitigate possible hallucinations. This pipeline effectively curbs pattern explosion and false positives, uncovering previously unknown bugs in large-scale projects at an acceptable cost.
Applied to Linux kernel 6.12.1, LTMiner identified 42 previously unknown bugs, 27 of which have been confirmed by developers. These results indicate that, although rare-pattern bugs are sparse, a considerable number remain and exhibit a non-negligible long tail. We believe that rare-pattern bugs constitute a promising blue ocean for bug detection.
Article: fse26maina-p1049-p
Pig: Leveraging Large Language Models for Python Library Migrations
Miryeong Kang,
Wonseok Oh,
Gabin An, and
Hakjoo Oh
(Korea University, Republic of Korea)
We present Pig, a novel approach to automating Python library migration by leveraging large language models (LLMs). Library migration is an increasingly common task in modern Python development, yet it remains tedious and error-prone due to the lack of general solutions that can handle diverse libraries without relying on documentation or code examples. To address this challenge, Pig employs a four-step pipeline that effectively harnesses the capabilities of LLMs. First, Pig decomposes the migration task into smaller units by performing API-level slicing, allowing the LLM to focus on minimal, relevant context. Second, it guides LLMs using prompts informed by common failure patterns in naive LLM-based migrations and plausible API candidates. Third, Pig selectively extracts the migration-related code fragments from the LLM outputs. Finally, it transplants the migrated code back into the original program with post-processing to ensure semantic correctness and consistency. We demonstrate the effectiveness of Pig by evaluating it on 364 API-level migration tasks, where it improves the average success rate of the baseline approach by 53.5% across seven different LLMs.
Artifacts Available
Article: fse26maina-p1058-p
VerilogASTBench: Benchmark Construction of Verilog AST Dataset with Dual-Stage AST Semantic Enhancement Framework
Luping Zhang,
Chao Chen,
Dapeng Yan,
Hui Xu,
Mingsheng Cao,
Jingkuan Song,
Zhikuang Cai, and
Yufeng Guo
(Nanjing University of Posts and Telecommunications, China; University of Electronic Science and Technology of China, China)
With the increasing complexity and scale of integrated circuit (IC) designs, automated circuit design methods are essential for Verilog implementation. Although Large Language Models (LLMs) perform well in general-purpose coding such as C++ and Python, their performance in Verilog is constrained by domain-specific semantic and structural rules as well as the scarcity of high-quality training datasets. To address these issues, this work proposes a semantically enhanced Verilog code parsing and automatic repair framework based on Abstract Syntax Tree (AST). First, an advanced register-transfer level (RTL) analysis framework was developed to achieve semantic enhancement through static analysis–driven functional role inference and attribute annotation. Second, enhanced AST information is used to construct structured prompts and generate semantically rich module descriptions via an Application Programming Interface (API) for dataset construction. Finally, an AST-guided automatic Verilog repair framework was designed, which leverages enhanced AST analysis for precise defect localization and intelligent repair through compiler feedback loops. Experimental results indicate that the proposed method successfully repaired 15.24% of defective Verilog code, resulting in a high-quality RTL benchmark dataset containing 318,021 samples. Models fine-tuned on this dataset demonstrate significant performance improvements across three benchmarks, achieving average improvements of 9.23% and 15.43% on Eval-Machine pass@1 and pass@5, 4.23% and 10.08% on Eval-Human pass@1 and pass@5, and 12.64% and 7.45% on RTLLM V1.1 Syntax-VCS and Func.
Article: fse26maina-p1087-p
SmarTrim: Symbolic Execution for Smart Contracts Powered by Redundant Transaction-Sequence Pruning
Hyegeun Song,
Jiseong Han, and
Sunbeom So
(Korea University, Republic of Korea)
We present SmarTrim, a new symbolic execution technique for detecting vulnerabilities in smart contracts. Smart contracts require rigorous safety validation since flaws in them can cause significant financial loss. Numerous symbolic execution techniques, which generate vulnerable transaction sequences to trigger and help understand vulnerabilities, have been extensively studied to enhance the security and safety of smart contracts. However, their performance remains unsatisfactory due to the extremely large search space for transaction sequences. To mitigate this issue, SmarTrim introduces a novel technique that safely reduces the search space by detecting and pruning redundant transaction sequences. Experimental results show that SmarTrim greatly outperforms eleven state-of-the-art analyzers in detecting critical vulnerabilities in real-world smart contracts.
Artifacts Available
Article: fse26maina-p1131-p
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
Ajmain Inqiad Alam,
Palash Roy,
Chanchal K. Roy,
Banani Roy, and
Kevin A. Schneider
(University of Saskatchewan, Canada)
The accelerating adoption of Large Language Models (LLMs) in software engineering (SE) has brought with it a silent crisis: unsustainable computational cost. While these models demonstrate remarkable capabilities in different SE tasks, they are unmanageably large, slow to deploy, memory-intensive, and carbon-heavy. This reality threatens not only the scalability and accessibility of AI-powered SE, but also its long-term environmental sustainability. The research challenge is clear: we must go beyond accuracy and address efficiency and environmental cost as first-class design constraints. To meet this challenge, we introduce Carbon-Taxed Transformers (CTT), a systematic multi-architecture compression pipeline with principled stage ordering, inspired by economic carbon taxation. Drawing from the economic concept of carbon pricing, CTT operationalizes a computational carbon tax that penalizes architectural inefficiencies and rewards deployment-ready compression. We evaluate CTT across three core SE tasks (code clone detection, code summarization, and code generation) with models spanning encoder-only, encoder-decoder, and decoder-only architectures. Our results show that, at inference, CTT delivers (1) up to 49× memory reduction; (2) time reductions of up to 8–10× for clone detection, up to 3× for summarization, and 4–7× for generation; (3) up to 81% reduction in CO2 emissions; and (4) retained accuracy of around 98% on clone detection, around 89% on summarization, and up to 91% (textual metrics) and 68% (pass@1) for generation. Two ablation studies show that pipeline ordering and individual component contributions are both essential, providing empirical justification for CTT’s design and effectiveness. This work establishes a viable path toward responsible AI in SE through aggressive yet performance-preserving compression.
Article: fse26maina-p1134-p
DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design
Yuchen Chen,
Yuan Xiao,
Chunrong Fang,
Zhenyu Chen, and
Baowen Xu
(Nanjing University, China)
The proliferation of large language models for code (CodeLMs) and open-source contributions has heightened concerns over unauthorized use of source code datasets. While watermarking provides a viable protection mechanism by embedding ownership signals, existing methods rely on detectable trigger-target patterns and are limited to source-code tasks, overlooking other scenarios such as decompilation tasks. In this paper, we propose DuCodeMark, a stealthy and robust dual-purpose watermarking method for code datasets that generalizes across both source-code tasks and decompilation tasks. DuCodeMark parses each code sample into an abstract syntax tree (AST), applies language-specific style transformations to construct stealthy trigger-target pairs, and injects repressible poisoned features into a subset of return-typed samples to enhance robustness against watermark removal or evasion. These features remain inactive during normal training but are activated upon watermark removal, degrading model performance. For verification, DuCodeMark employs a black-box method based on the independent-samples t-test. We conduct a comprehensive evaluation of DuCodeMark across 72 settings spanning two code tasks, two programming languages, three CodeLMs, and six decoding temperatures. The results demonstrate that it consistently achieves strong verifiability (p < 0.05), high stealthiness (suspicion rate ≤ 0.36), robustness against both watermark and poisoning attacks (recall ≤ 0.57), and a substantial drop in model performance upon watermark removal (Pass@1 drops by 28.6
Article: fse26maina-p1135-p
A Grounded Theory of Debugging in Professional Software Engineering Practice
Haolin Li and
Michael Coblenz
(University of California at San Diego, USA)
Debugging is a central yet complex activity in software engineering. Prior studies have documented debugging strategies and tool usage, but little theory explains how experienced developers reason about bugs in large, real-world codebases. We conducted a qualitative study using a grounded theory approach. We observed seven professional developers and five professional live-coding streamers working on 17 debugging tasks in their own codebases, capturing diverse contexts of debugging. We theorize debugging as a structured, iterative diagnostic process in which programmers update a mental model of the system to guide information gathering. Developers gather information by alternating between navigation and execution strategies, employing forward and backward tracing modes of reasoning and adapting these approaches according to codebase context, complexity, and familiarity. Developers also gather external resources to complement code-based evidence, with their experience enabling them to systematically construct a mental model. We contribute a grounded theory of professional debugging that surfaces the human-centered dimensions of the practice, with implications for tool design and software engineering education.
Article: fse26maina-p1140-p
Binvariants: Enhancing Fuzzing of Closed-Source Binary Executables via Register-Level Likely Invariants
Zao Yang and
Stefan Nagy
(University of Utah, USA)
Closed-source software is ubiquitous in everyday computing, underscoring the need for robust security vetting of “binary-only” executable code. While code-coverage-guided fuzzing has long proven effective at unearthing software bugs, fuzzing in open-source contexts has since evolved beyond code coverage as its principal guiding metric. State-of-the-art fuzzing advancements demonstrate that likely data invariants—data-level properties which, if violated, expose unusual and often bug-preceding program states—significantly widen fuzzing’s reach to defects ordinarily occluded by coverage-only testing. Unfortunately, current invariant-guided fuzzing universally depends on source-level abstractions, rendering it unportable to binary-only targets. Consequently, closed-source software fuzzing—and more importantly, binary-only bug discovery—remain stalled at now-obsolete coverage-only techniques, even as open-source software fuzzing advances well past them.
To bridge this longstanding gap, this paper introduces register-level likely invariants: the first technique to integrate likely data invariants within binary-only fuzzing. In contrast to contemporary source-level data invariant mining, our approach operates directly on CPU registers, capturing the low-level program states that themselves encode higher-level data relationships. From these low-level states, we automatically derive likely data invariants and expose their violations as fuzzer-observable signals via runtime instrumentation, steering fuzzing into states often unreachable by code coverage alone. In doing so, our approach surfaces qualitatively different states, complementing traditional coverage-guided fuzzing with distinct bug-finding capabilities.
We implement our approach as a prototype, Binvariants, and evaluate its performance across 25 benchmark applications: 7 closed-source programs, as well as 18 open-source programs compiled as binary-only executables. Our results show that, compared to driving binary fuzzing solely via code coverage, register-level likely invariants help fuzzing trigger over 27× more unique invariant violations, thereby exercising a mean 52% more distinct code regions. Moreover, our approach uncovers 143 total bugs versus coverage-only fuzzing’s 137 (including 20 missed by code coverage), demonstrating how register-level likely invariants extend binary-only fuzzing’s reach into execution states beyond what coverage alone is capable of.
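The general idea of mining likely invariants from observed register values and reporting violations as a feedback signal can be illustrated with a minimal Python sketch. It is not the Binvariants implementation: register snapshots are assumed to be plain dictionaries, the only invariant form considered is a per-register value range, and the names `mine_range_invariants` and `violations` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Snapshot = Dict[str, int]            # register name -> observed value (assumed format)
Ranges = Dict[str, Tuple[int, int]]  # register name -> likely (min, max) range


def mine_range_invariants(snapshots: Iterable[Snapshot]) -> Ranges:
    """Derive a likely [min, max] range per register from previously seen states."""
    ranges: Ranges = {}
    for snap in snapshots:
        for reg, value in snap.items():
            lo, hi = ranges.get(reg, (value, value))
            ranges[reg] = (min(lo, value), max(hi, value))
    return ranges


def violations(snapshot: Snapshot, ranges: Ranges) -> List[str]:
    """Return the registers whose value falls outside the mined range."""
    out = []
    for reg, value in snapshot.items():
        if reg in ranges:
            lo, hi = ranges[reg]
            if value < lo or value > hi:
                out.append(reg)
    return sorted(out)


# Hypothetical register snapshots collected at one program point.
training = [{"rax": 0, "rcx": 4}, {"rax": 3, "rcx": 8}, {"rax": 1, "rcx": 4}]
invariants = mine_range_invariants(training)

new_state = {"rax": 2, "rcx": 64}
print(violations(new_state, invariants))  # prints ['rcx']
```

In a fuzzing loop, a non-empty violation list would be treated as "interesting" in the same way that new code coverage is, so the input that produced it is kept for further mutation.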
Article: fse26maina-p1145-p
GraphLocator: Graph-Guided Causal Reasoning for Issue Localization
Wei Liu,
Chao Peng,
Pengfei Gao,
Aofan Liu,
Wei Zhang,
Haiyan Zhao, and
Zhi Jin
(Peking University, China; ByteDance, China)
The issue localization task aims to identify the locations in a software repository that require modification given a natural language issue description. This task is fundamental yet challenging in automated software engineering due to the semantic gap between issue description and source code implementation. This gap manifests as two mismatches: (1) symptom-to-cause mismatches, where descriptions do not explicitly reveal underlying root causes; (2) one-to-many mismatches, where a single issue corresponds to multiple interdependent code entities. To address these two mismatches, we propose GraphLocator, an LLM-based approach that mitigates symptom-to-cause mismatches through causal structure discovering and resolves one-to-many mismatches via dynamic issue disentangling. The key artifact of GraphLocator is the causal issue graph (CIG), in which vertices represent discovered sub-issues along with their associated code entities, and edges encode the causal dependencies between them. The workflow of GraphLocator consists of two phases: symptom vertices locating and dynamic CIG discovering; it first identifies symptom locations on the repository graph, then dynamically expands the CIG by iteratively reasoning over neighboring vertices, discovering new sub-issues and updating causal dependencies. Experiments on three real-world Python and Java datasets demonstrate the effectiveness of GraphLocator: (1) Compared with baselines, GraphLocator achieves more accurate localization with average improvements of +19.49% in function-level recall and +11.89% in precision with acceptable overhead. (2) GraphLocator outperforms baselines on both symptom-to-cause and one-to-many mismatch scenarios, achieving recall improvements of +16.44% and +19.18% and precision improvements of +7.78% and +13.23%, respectively. (3) The disentangled causal structure CIG generated by GraphLocator yields the highest relative improvement, resulting in a 28.74% increase in performance on the downstream issue-resolving task.
Article: fse26maina-p1250-p
Feature Slice Matching for Precise Bug Detection
Ke Ma,
Jianjun Huang,
Wei You,
Bin Liang,
Jingzheng Wu,
Yanjun Wu, and
Yuanjun Gong
(Renmin University of China, China; Institute of Software at Chinese Academy of Sciences, China; University of Trento, Italy)
Measuring function similarity is an effective way to detect bugs, but statements unrelated to the bugs introduce noise that impedes performance. Existing noise-suppression techniques do not handle the hardest part of the problem, namely eliminating the noise in the targets. In this paper, we propose MATUS to mitigate the target noise for precise bug detection based on similarity measurement. Feature slices are extracted from both the buggy query and the targets to represent the semantic features of (potential) bug logic. In particular, MATUS guides the target slicing with prior knowledge from the buggy code, in an end-to-end way, to pinpoint the slicing criterion in the targets. All feature slices are embedded and compared based on vector similarity. Buggy candidates are audited to confirm unknown bugs in the targets. Experiments show that MATUS holds advantages in bug detection for real-world projects with acceptable efficiency. In total, MATUS has spotted 31 unknown bugs in the Linux kernel. All of them have been confirmed by the kernel developers, and 11 have been assigned CVEs.
Article: fse26maina-p1298-p
DiverFPS: Generating Diverse Solutions for Floating-Point SMT Formulas
Shuangyu Lyu,
Chuan Luo,
Ruizhi Shi,
Zhuo Su, and
Chunming Hu
(Beihang University, China)
Satisfiability Modulo Theories (SMT) is a fundamental technique underpinning a wide range of applications in software engineering and testing. Among various SMT theories, the theory of floating-point plays a crucial role in practical software systems, yet reasoning about floating-point constraints remains challenging. Although existing SMT solvers are capable of producing single solutions, many applications, particularly in software testing and verification, require diverse sets of solutions to adequately exercise program behaviors. While achieving high diversity is essential for exploring system behaviors, it is also important to limit the number of generated solutions, as overly large solution sets can considerably increase testing time and resource consumption, thereby diminishing efficiency. In this work, we propose DiverFPS, a novel floating-point SMT sampler designed to generate small solution sets that achieve high target abstract syntax tree (AST)-coverage, which is commonly regarded as the standard metric for assessing solution diversity in the SMT sampling domain. DiverFPS operates in two stages: an exploration stage that aims to achieve the target AST-coverage as fully as possible, and a pruning stage that prunes redundant solutions while preserving the target AST-coverage. We further introduce three novel techniques, namely solution-driven restart strategy, context-aware encoding technology, and coverage-driven pruning strategy, which enhance the performance of DiverFPS. Extensive experiments on publicly available SMT-LIB benchmarks for QF_FP and QF_BVFP logics demonstrate that DiverFPS outperforms state-of-the-art SMT samplers. It successfully achieves the target AST-coverage on a larger number of benchmarks, which existing samplers fail to reach, while producing solution sets that are up to 92.9These results demonstrate that DiverFPS is a high-performing sampler for floating-point SMT sampling.
Article: fse26maina-p1304-p
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
Jiashuo Tian,
Dong Wang,
Chen Yang,
Haichi Wang,
Zan Wang, and
Junjie Chen
(Tianjin University, China)
False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel.
They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs.
In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel.
We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller.
Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time.
They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings.
To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation.
Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%.
These findings highlight the non-negligible cost of false-positive bug reports and show the promise of LLMs for more efficient false-positive mitigation in the Linux kernel.
Article: fse26maina-p1333-p
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin,
Xiaodong Gu, and
Beijun Shen
(Shanghai Jiao Tong University, China)
Code language models have demonstrated strong capabilities across a wide range of code intelligence tasks. While the majority of existing research prioritizes performance improvements on benchmark datasets, few studies have focused on the internal interpretability of models, namely how specific neurons affect linguistic features such as syntax and semantics, which is critical for model transparency, controllability, and reliability. Although various neuron interpretability techniques have been developed in NLP, directly applying them to source code yields suboptimal results due to the unique characteristics of programming languages, such as their formal structure, hierarchical organization, and executability.
In this work, we empirically investigate the intrinsic mechanisms of code LLMs at the neuron level, aiming to localize both language-specific neurons (i.e., neurons that are selectively responsive to individual programming languages) and concept layers (i.e., feed-forward layers that encode language-agnostic representations of code).
Our study employs two state-of-the-art models, Llama-3.1-8B and Qwen2.5-Coder-32B, across five programming languages: C++, Java, Python, Go, and JavaScript. By analyzing neuron activation patterns in response to multilingual code inputs, we investigate the role of individual neurons and the contribution of different layers during output generation.
Our empirical findings reveal that: (1) code LLMs contain neurons specialized for individual programming languages, alongside a universal subset that supports general-purpose code generation; and (2) lower layers primarily encode language-specific syntactic structures, while middle layers capture semantic abstractions that generalize across languages, manifesting as concept layers.
To demonstrate the practical usability of these findings, we apply them to three downstream tasks: neuron-guided fine-tuning for code generation, clone detection using concept-layer embeddings, and transfer learning guided by concept-layer representations for code summarization. Experimental evaluations show that each strategy consistently improves the performance of multilingual code LLMs.
Article: fse26maina-p1339-p
Reducing Cost of LLM Agents with Trajectory Reduction
Yuan-An Xiao,
Pengfei Gao,
Chao Peng, and
Yingfei Xiong
(Peking University, China; ByteDance, China)
Multi-turn agent systems based on Large Language Models (LLMs) have become increasingly popular for software engineering tasks. While LLM agents demonstrate promising effectiveness, the high computational cost of input tokens due to ever-growing trajectories remains a significant efficiency concern. Efficiency has been largely overlooked in existing studies and agent products, and this paper addresses this gap by introducing an inference-time trajectory reduction approach that reduces computational costs.
By analyzing existing agent trajectories, we demonstrate that useless, redundant, and expired information is widespread across trajectories. Such waste can be identified and reduced without compromising the agent's performance. We propose a simple yet effective trajectory reduction approach, AgentDiet, which automatically removes such waste during agent execution. We implement AgentDiet on a top-performing coding agent, and our evaluation on two LLMs and two benchmarks shows that AgentDiet can reduce input tokens by 39.9%-59.7% and the total computational cost by 21.1%-35.9%, while maintaining the same agent performance. These results indicate that inference-time trajectory reduction is a promising direction for agent systems.
Article: fse26maina-p1395-p
TLR: Codebase-Level C Memory Management Error Repair with Large Language Models
Xiao Cheng,
Zhihao Guo,
Huan Huo, and
Yulei Sui
(Macquarie University, Australia; University of Technology Sydney, Australia; UNSW, Australia)
Memory management errors in C remain a leading source of software vulnerabilities due to the inherent complexity of manual memory handling. Traditional Automated Program Repair (APR) largely relies on rule- or template-based techniques, which require expert-crafted specifications and often struggle to generalize. Recently, Large Language Models (LLMs) have emerged as a complementary approach, leveraging broad exposure to codebases and programming idioms to synthesize fixes that can extend beyond existing templates and rules. This paper introduces TLR, a novel framework that augments LLM-based repair with typestate-guided context retrieval. By using a finite typestate automaton to track error-propagation paths and memory state transitions, our approach provides the LLM with focused, semantically rich context for codebase-level memory error repair, effectively addressing both interprocedural reasoning and LLM context window limitations. Our framework has successfully repaired 37 out of 49 real-world memory errors derived from 14 open-source projects that collectively comprise approximately 1.57 million lines of code. Compared to state-of-the-art memory error APR tools, SAVER and ProvenFix, our approach correctly fixes 14.50× and 2.36× more errors, respectively; and on the double-free and use-after-free subset, TLR repairs all 7 cases whereas the crash-constraint-driven CrashRepair repairs only 1. Moreover, TLR outperforms current open-source state-of-the-art LLM-based repair tools, repairing more errors than SWE-agent 1.0 and the tree-of-thought agent Sand2Patch, while introducing far fewer harmful patches. We have also successfully repaired three critical zero-day memory errors, with fixes that have been accepted and implemented by the original developers. These results highlight a promising paradigm for codebase-level program repair through program analysis-guided, retrieval-augmented LLMs, combining formal verification strengths with neural model adaptability.
Article: fse26maina-p1413-p
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
Jingyu Zhang,
Fan Wang,
Jacky Keung,
Yihan Liao,
Yan Xiao, and
Lei Ma
(Hong Kong Metropolitan University, Hong Kong; City University of Hong Kong, Hong Kong; Sun Yat-sen University, Shenzhen, China; University of Tokyo, Japan; University of Alberta, Canada)
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
Article: fse26maina-p1421-p
AccessRefinery: Fast Mining Concise Access Control Intents on Public Cloud
Ning Kang,
Peng Zhang,
Jianyuan Zhang,
Hao Li,
Dan Wang,
Zhenrong Gu,
Weibo Lin,
Shibiao Jiang,
Zhu He,
Xu Du,
Longfei Chen,
Jun Li, and
Xiaohong Guan
(Xi'an Jiaotong University, China; Huawei Cloud, China)
Modern cloud applications heavily rely on Identity and Access Management (IAM) services to enforce flexible access control over their data. However, the flexibility comes at a cost: IAM policies are often complex and prone to misconfigurations, leading to risks of data exposure. There is an increasing need to mine a compact set of intents that describe what the policies collectively try to achieve, thereby enabling operators to better understand their policies. However, existing tools for mining access control intents have two limitations: (1) the mining process is slow and even times out on some complex policies; (2) the mined intents are excessive in number and thus still hard to understand. To overcome these, this paper presents AccessRefinery, which can speed up the mining process while reducing the number of intents. The key idea for the speedup is to reduce the redundancy of the multi-round SMT solving, by preprocessing the constraints into bit-vector constraints. For intent reduction, AccessRefinery computes a compact set of intents that can cover the mined intents, by solving a min-set-cover problem. Experiments based on real and synthetic datasets show that AccessRefinery achieves a ∼10–100× speedup in intent mining, and reduces the number of intents by up to ∼10×.
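The intent-reduction step is described as a min-set-cover problem. The abstract does not say how AccessRefinery solves it, so the following Python sketch uses the standard greedy approximation purely as an illustration; the function name `greedy_cover`, the intent labels, and the candidate sets are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def greedy_cover(universe: Set[str], candidates: Dict[str, FrozenSet[str]]) -> List[str]:
    """Pick candidate intents until every mined intent in `universe` is covered."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Choose the candidate covering the most still-uncovered intents;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            raise ValueError("candidates cannot cover: " + ", ".join(sorted(uncovered)))
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical data: five mined intents and four broader candidate intents.
mined = {"i1", "i2", "i3", "i4", "i5"}
candidates = {
    "A": frozenset({"i1", "i2", "i3"}),
    "B": frozenset({"i3", "i4"}),
    "C": frozenset({"i4", "i5"}),
    "D": frozenset({"i5"}),
}
print(greedy_cover(mined, candidates))  # prints ['A', 'C']
```

The greedy choice is not guaranteed to be minimal in general; an exact solver would be needed when the smallest possible cover is required.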
Artifacts Available
Article: fse26maina-p1435-p
Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading
Hanzhen Lu,
Lishui Fan,
Jiachi Chen,
Qiuyuan Chen,
Zhao Wei, and
Zhongxin Liu
(Zhejiang University, China; Sun Yat-sen University, China; Tencent Technology, China)
Line-level code completion aims to complete the current line in real time as developers type. Low latency is crucial to maintaining a seamless and uninterrupted coding experience, enabling developers to remain in a productive flow. However, existing approaches face a fundamental trade-off: large language models (LLMs) provide high-quality suggestions but require expensive computational resources to ensure acceptable inference latency. In contrast, static-analysis-based methods and small language models respond quickly but often generate suboptimal completions. To fill this gap, our idea is to rely on the small model by default and escalate to the large model only when necessary, balancing latency against accuracy. Based on this idea, we propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local small model with a high-performance cloud large model for code completion. Realizing effective model cascading requires answering two non-trivial questions, i.e., when to invoke the large model and how to enable effective collaboration between small and large models. For the first question, we leverage a valuable but easily overlooked signal, i.e., user actions, during code completion to accurately identify failed completions. This deferral decision allows us to invoke the large model only when necessary, reducing both latency and cloud-side computation costs. To enable effective collaboration, MCCom employs a two-stage speculative decoding strategy and an iterative retrieval mechanism that collectively accelerate and improve the quality of completions. Due to the lack of high-quality small models for code completion, we also train a lightweight model with only 121M parameters to implement MCCom. The small model achieves an average of 73.8% of the performance of the state-of-the-art 7B model. We evaluate MCCom on the RepoEval benchmark and a new benchmark, StmtEval, collected from real-world projects. Experimental results show that our approach not only reduces inference latency by up to 47.9% and cuts down LLM usage by an average of 46.3%, but also improves the exact match rate of the large model by an average of 8.9%.
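The deferral idea (small model by default, large model only when a user action signals a failed completion) can be sketched in a few lines of Python. This is an illustration under strong assumptions, not MCCom itself: both models are stand-in callables, the user signal is reduced to a single accept/reject callback, and speculative decoding and retrieval are omitted. All names are hypothetical.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]  # maps a code prefix to a completed line (stand-in)


def cascade_complete(
    prefix: str,
    small_model: Model,
    large_model: Model,
    user_accepts: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the local model first; escalate only if the user rejects its suggestion."""
    draft = small_model(prefix)
    if user_accepts(draft):
        return draft, "small"
    # The rejection is the deferral signal: only now is the costly model invoked.
    return large_model(prefix), "large"


# Stand-in models and user behaviour, for demonstration only.
def small(prefix: str) -> str:
    return prefix + "x"


def large(prefix: str) -> str:
    return prefix + "value"


def accepts_only_value(suggestion: str) -> bool:
    return suggestion.endswith("value")


print(cascade_complete("a = ", small, large, accepts_only_value))
# prints ('a = value', 'large')
```

When the user accepts the first suggestion, the large model is never called, which is where the latency and cloud-cost savings in such a cascade would come from.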
Article: fse26maina-p1516-p
LoCaL: Countering Surface Bias in Code Evaluation Metrics
Simantika Bhattacharjee Dristi and
Matthew B. Dwyer
(University of Virginia, USA)
With the increasing popularity of large language models (LLMs) and LLM-based agents, reliable and effective code evaluation metrics (CEMs) have become crucial for progress across several software engineering tasks. While popular benchmarks often provide test cases to assess the correctness of generated code, crafting and executing test cases is expensive. Reference-based CEMs provide a cheaper alternative by scoring a candidate program based on its functional similarity to a reference. Although prior research has focused on reporting the weak correlation between these CEMs and functional correctness, the causes are only assumed, and plausible solutions remain unexplored. In this work, we critically evaluate four state-of-the-art reference-based CEMs, revealing their strong bias towards surface-level features rather than code functionality. Despite this surface bias, current evaluation datasets for these CEMs rarely include code pairs that are surface-similar yet functionally dissimilar, or functionally similar yet surface-dissimilar.
To mitigate this gap, we propose LoCaL (Looks Can Lie), a CEM evaluation benchmark, with 3117 code pairs at both the method and program levels. Each pair is labeled with a functional similarity score and aims to target regions where CEMs are likely to perform poorly. The functional similarity scores are calculated through differential fuzzing, which eliminates the need for predefined test cases and, at the same time, improves the reliability of the scores by executing an order of magnitude more tests than prior work. We find that all four CEMs show significant performance degradation on LoCaL, compared to the baselines. Finally, based on our findings, we draw the implication that exposing CEMs to LoCaL-like data might facilitate the development of metrics that are robust to surface bias.
Artifacts Available
Article: fse26maina-p1536-p
SBridge: Identifying Source-to-Binary Function Similarity via Cross-Domain Control Block Matching
Heedong Yang,
Jeongwoo Lee,
Hajin Yun, and
Seunghoon Woo
(Korea University, Republic of Korea)
We present SBridge, a precise approach for identifying functions in binaries that are similar to the given source code functions. Identifying reused code in binaries is critical for security, particularly for detecting propagated vulnerabilities. Although binary-to-binary comparison is feasible, leveraging source code as the reference is more practical because source code is easier to collect and analyze directly without compilation. However, significant gaps between source and binary representations, including function inlining, create challenges in cross-domain function detection. Existing approaches primarily rely on string literals or structural similarities between entire functions, failing to capture detailed code behavior and generating many false alarms.
SBridge addresses these limitations through a key innovation: control block-based function matching, which encapsulates essential functional features by segmenting functions into meaningful units such as conditionals and loops. Leveraging control blocks as a cross-domain representation, SBridge enables precise measurement of function similarity between source and binary code, effectively overcoming challenges posed by function inlining and stripped binaries. For evaluation, we collected 3,904 real-world C/C++ binaries from BinKit. In experiments identifying binary functions identical to input source functions, despite approximately 40% of binary functions being inlined, SBridge achieved 75.13% recall@1 and 80.98% recall@5, outperforming existing approaches, which achieved up to 43.31% recall@1 and 50.2% recall@5. Our further analysis confirmed that SBridge effectively identifies propagated vulnerabilities in binaries.
Article: fse26maina-p1590-p
OCPPuzz: Specification-Driven Fuzzing of Charging Station Management Systems with Large Language Model
Jongchan Hong,
Jaewon Kim, and
Sungjae Hwang
(Sungkyunkwan University, Republic of Korea)
Electric vehicles (EVs) are being rapidly adopted, with over 61,000 publicly accessible charging stations deployed across the United States as of 2024. A core component of this infrastructure is the Charging Station Management System (CSMS), which is responsible for security-critical tasks such as user authentication and billing. Given its importance, the CSMS has become a target of real-world attacks that have resulted in financial losses, data breaches, and denial-of-service (DoS) incidents. Nevertheless, research on CSMS security remains limited, and automated testing tools are lacking. Testing a CSMS is challenging because it communicates with charging stations (CS) using the Open Charge Point Protocol (OCPP). Effective testing must contend with OCPP's complexity: 1) messages containing up to 48 fields, 2) inter- and intra-message field dependencies, and 3) its stateful nature, which requires tracking the states of both CS and CSMS during testing.
To address these challenges, we present OCPPuzz, a specification-based fuzzing framework for CSMS. OCPPuzz automatically extracts message structures, field constraints, and dependency rules from the OCPP specification, as well as valid CS-CSMS state transitions described in its use case diagrams. To handle specifications expressed in natural language and semi-formal diagrams, OCPPuzz combines heuristic rule-based extraction with a large language model (LLM). We evaluated OCPPuzz on four open-source CSMS implementations and uncovered numerous deviations from the OCPP specification that led to critical security issues, including DoS and free charging. We reported 930 implementation bugs to the corresponding vendors, of which 492 have been acknowledged so far. In addition, we reported 155 specification bugs in OCPP to the Open Charge Alliance (OCA); 78 have been committed for fixes and 82 acknowledged for further investigation. We expect additional acknowledgments and fixes in the near future.
Article: fse26maina-p1789-p
From Specifications to Implementation in the Gen-AI Era: Lessons from a Project-Based Software Engineering Course
Yingying Wang,
Masih Beigi Rizi,
Fatemeh Khashei, and
Julia Rubin
(University of British Columbia, Canada)
By early 2025, AI code assistants had evolved into sophisticated collaborators capable of generating, explaining, reviewing, and modifying substantial portions of a software system. In February 2025, as we were delivering an upper-level undergraduate course on Software Engineering at our university, the term vibe coding emerged and quickly became popular, referring to a practice in which developers describe a project or task in a prompt to an AI model that then generates code artifacts automatically.
In this paper, we first report on the course setup and our analysis of students’ experience using AI assistants for developing open-ended software engineering projects in the Spring 2025 offering of the course. We then discuss our independent exploratory study that we conducted in Summer 2025, systematically investigating strategies for working with AI tools, specifically GitHub Copilot and Cursor, to implement a project of a real-world size and complexity, while also systematically evaluating the quality of the generated code. The goal of both studies is to better understand whether and how to teach Software Engineering in the AI era, i.e., which skills are essential for effective human–AI collaboration.
Our results show that students utilized AI tools in all stages of project development, to generate and refine artifacts and to learn unfamiliar concepts. While Generative AI tools simplified repetitive development tasks and helped quickly ramp up when working with unfamiliar frameworks and programming languages, students had mixed levels of satisfaction with using the tools, both as chat interfaces and as IDE-embedded solutions. We found no correlation between the degree of students’ reliance on AI and their grades, hypothesizing that other factors, such as groups’ knowledge and commitment, play a larger role in producing high-quality results.
Our exploratory study shows that tools were still far from replacing software engineers or rendering software engineering education unnecessary. In fact, we observe that, in the new AI-empowered reality, skills such as requirements engineering, design, testing, and code review are becoming more relevant (and easier to teach) than ever. We conclude the paper with a set of suggestions for future offerings of this and similar Software Engineering courses, and for using AI in Software Engineering more broadly.
Article: fse26maina-p1855-p
Verifying Smart Contract Security against Re-entrancy Attacks through Relational Value Analysis
Divya Rathore and
Kartik Nagar
(IIT Madras, India)
Re-entrancy vulnerabilities are a critical security risk in smart contracts, posing a significant threat to the entire blockchain ecosystem. These vulnerabilities arise when a malicious attacker exploits the design of a smart contract to re-enter a function within the execution of another function, thus breaking atomicity and manipulating the smart contract state in unintended ways. While multiple countermeasures have been proposed to fortify smart contracts against re-entrancy-based attacks, automatically verifying their effectiveness remains a difficult problem due to the inherent complexity of smart contracts and evolving attack techniques. In this work, we propose RAVEN, a sound and precise approach to verify smart contract safety against re-entrancy attacks automatically. At its core, RAVEN performs a content-sensitive semantic relational value analysis using the polyhedral abstract domain to establish hyper-properties such as absorption and commutativity of different program segments, which are sufficient to ensure safety against re-entrancy. Notably, unlike many prior approaches, we also prove the soundness of RAVEN, thus guaranteeing that contracts deemed safe by RAVEN would not suffer from classical re-entrancy attacks.
We have implemented our approach and evaluated RAVEN against nine state-of-the-art tools using four comprehensive test suites of Solidity smart contracts labeled for re-entrancy. The results demonstrate that RAVEN attains higher precision than existing approaches in detecting both re-entrancy-safe and vulnerable contracts. In particular, RAVEN produced 0/781 false positives on two test suites and 444/21,355 false positives on the remaining two, representing an approximate 77.3% reduction in false positives over prior tools. Moreover, this improvement was achieved with a comparable average analysis time of 141.9 seconds, versus 128.8 seconds for prior tools.
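For readers unfamiliar with the attack class, the following toy Python model may help. It is not RAVEN's analysis and does not model Solidity semantics; the class names, the `on_receive` callback, and the amounts are all hypothetical. It only shows how an external call made before the state update lets a callee re-enter `withdraw`, and how updating state first (the usual checks-effects-interactions ordering) closes that window.

```python
class VulnerableBank:
    """Toy stand-in for a contract (hypothetical, not Solidity semantics)."""

    def __init__(self):
        self.balances = {}
        self.vault = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.vault += amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.vault < amount:
            return
        self.vault -= amount      # funds leave the contract
        on_receive(amount)        # external call: callee code runs here
        self.balances[who] = 0    # state update happens too late


class SafeBank(VulnerableBank):
    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.vault < amount:
            return
        self.balances[who] = 0    # state update first
        self.vault -= amount
        on_receive(amount)        # re-entry now sees a zero balance


def attack(bank, max_depth=3):
    """Re-enter withdraw from the receive callback up to max_depth times."""
    stolen = []

    def on_receive(amount):
        stolen.append(amount)
        if len(stolen) < max_depth:
            bank.withdraw("attacker", on_receive)

    bank.withdraw("attacker", on_receive)
    return sum(stolen)


def run(bank_cls):
    bank = bank_cls()
    bank.deposit("victim", 100)
    bank.deposit("attacker", 10)
    return attack(bank), bank.vault
```

In this model the attacker deposits 10 but extracts 30 from `VulnerableBank`, while `SafeBank` pays out only the 10 that was deposited.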
Article Search
Artifacts Available
Article: fse26maina-p1898-p
In Line with Context: Repository-Level Code Generation via Context Inlining
Chao Hu,
Wenhao Zeng,
Yuling Shi,
Beijun Shen, and
Xiaodong Gu
(Shanghai Jiao Tong University, China)
Repository-level code generation has attracted growing attention in recent years. Unlike function-level code generation, it requires the model to understand the entire repository and reason over complex dependencies across functions, classes, and modules. However, existing approaches such as retrieval-augmented generation (RAG) or context-based function selection often fall short; they primarily rely on surface-level similarity and struggle to capture the rich dependencies that govern repository-level semantics.
In this paper, we introduce InlineCoder, a novel framework for repository-level code generation. InlineCoder enhances the understanding of repository context by inlining the unfinished function into its call graph, thereby reframing the challenge of repository understanding into a simpler function-level coding task. Given a function signature, InlineCoder first generates a draft completion (termed an "anchor"), which approximates downstream dependencies and enables perplexity-based confidence estimation. This anchor drives a bidirectional inlining process: (1) Upstream Inlining, which embeds the anchor into its callers to capture diverse usage scenarios; and (2) Downstream Retrieval, which integrates the anchor's callees into the prompt to provide precise dependency context. The enriched context, combining draft completion with upstream and downstream perspectives, equips the LLM with a comprehensive repository view.
Extensive experiments on the DevEval and RepoExec benchmarks demonstrate that InlineCoder substantially outperforms a wide range of state-of-the-art baselines, achieving average relative gains of 29.73% in EM, 20.82% in ES, and 49.34% in BLEU on RepoExec compared to the strongest baseline. These results highlight its effectiveness in repository context understanding as well as its generalization across domains.
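The abstract does not say how InlineCoder turns perplexity into a confidence signal. As a minimal sketch of the standard quantity it refers to, perplexity is the exponential of the average negative log-probability per token, so lower values indicate a more confident draft. The function name and the sample log-probabilities below are assumptions made for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical drafts: every token predicted with p = 0.9 versus p = 0.25.
confident_draft = [math.log(0.9)] * 4
uncertain_draft = [math.log(0.25)] * 4
```

With these inputs the confident draft has perplexity 1/0.9 (about 1.11) and the uncertain one has perplexity 4.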
Article Search
Article: fse26maina-p1908-p
Verifying Structural Robustness of Deep Neural Network
Hai Duong,
Thanh Le,
Lam Nguyen, and
ThanhVu (Vu) Nguyen
(George Mason University, USA; National Institute of Information and Communications Technology, Japan; CMC Applied Technology Institute, Vietnam)
Neural network verification has emerged as a useful technique for improving the reliability of deep learning systems. Current verification approaches primarily focus on local robustness, where perturbations are applied independently to each input element. Despite its common use, local robustness does not capture perturbations that exhibit coordinated relationships between input elements. Such perturbations arise from systematic transformations or filtering operations that preserve structural characteristics of the data. Robustness to these perturbations, which we call “structural robustness”, represents a significant gap in existing verification capabilities.
This work focuses on structural robustness verification by formalizing two important classes of structured perturbations: linear position-invariant and linear position-varying. These perturbations allow input elements to be modified in coordinated ways while preserving essential data structure. The main challenge is that structural perturbations cannot be directly expressed using standard interval-based specification formats that existing verification tools typically support.
To address this limitation, we introduce VeriS, a technique that reformulates structural robustness into standard local robustness problems by creating specialized subnetworks that encode perturbation behavior and integrate them with the original network architecture. VeriS enables verification across continuous spaces defined by structural robustness specifications while maintaining compatibility with existing verification tools. VeriS also introduces optimizations that significantly enhance verification performance such as converting complex operations into standard representations.
We implement and evaluate VeriS on benchmarks involving neural networks across three domains: image classification, audio processing, and healthcare applications. Our evaluation, which encompasses 5508 verification problems, demonstrates that VeriS successfully verifies 78% of structural robustness specifications when integrated with state-of-the-art verification tools. These results show that VeriS enables the verification of complex structural perturbations that were previously beyond the reach of existing neural network verification.
VeriS is available at: https://github.com/dynaroars/VeriS/.
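The general idea of encoding a perturbation as a prepended subnetwork can be illustrated with one simple case. This sketch is an assumption for illustration, not VeriS's actual construction: for a fixed input `x0`, blending it with a position-invariant mean filter by a strength `alpha` is affine in `alpha`, so a single affine layer maps the scalar `alpha` to the perturbed input, and the specification becomes the ordinary interval `alpha` in `[0, eps]`. The filter, the tiny network, and its weights are all hypothetical.

```python
def blur(x):
    """Position-invariant 3-tap mean filter with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def network(x):
    """Hypothetical tiny ReLU network: 4 inputs, 2 hidden units, 1 output."""
    w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -1.0, 2.0]]
    b1 = [0.1, -0.2]
    w2 = [1.0, -0.5]
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))


def make_composite(x0):
    """Prepend an affine layer alpha -> x0 + alpha * (blur(x0) - x0)."""
    direction = [b - v for b, v in zip(blur(x0), x0)]

    def composite(alpha):
        perturbed = [v + alpha * d for v, d in zip(x0, direction)]
        return network(perturbed)

    return composite


x0 = [0.0, 1.0, 0.0, 2.0]
composite = make_composite(x0)
```

A verifier that only accepts interval bounds on inputs could then be asked about `composite` over `alpha` in `[0, eps]` instead of about `network` over a set of blurred inputs.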
Article Search
Article: fse26maina-p1917-p
IntentTester: Intent-Driven Multi-agent Framework for Cross-Library Test Migration
Yi Gao,
Ziyuan Zhang,
Xing Hu,
Xiaohu Yang, and
Xin Xia
(Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China)
Unit tests capture both functional checks and domain-specific knowledge, but this knowledge remains locked within individual projects and is rarely reused across libraries with overlapping functionality. Existing migration techniques based on structural code mappings (e.g., API signatures) often break down under divergent designs or cross-language settings, resulting in non-executable migrated tests. In this paper, we present IntentTester, a multi-agent framework for intent-driven test reuse. Instead of translating raw code, IntentTester abstracts tests into a language-agnostic Test Description Language (TDL), aligns them with semantically related entities and dependencies in a repository graph, and synthesizes executable tests through LLM-guided reasoning and iterative validation. This design enables cross-library and cross-language migration without manual intervention, producing migrated tests that existing structure-mapping approaches cannot achieve. We evaluate IntentTester on nine open-source projects across three domains (JSON, HTML, and Time) and two languages (Java and Python). IntentTester generates 2,776 syntactically correct tests with 85% correctness; in comparison, the two baselines achieve 51% and 43%. Among them, 2,410 tests executed successfully, yielding a 74% effectiveness rate. Beyond higher success rates, IntentTester also surfaced previously unknown defects, including stack overflows, null dereferences, and parsing inconsistencies, several of which have been acknowledged or patched by maintainers. Our results show that intent-driven migration shifts the focus from code mappings to semantic alignment, allowing practical cross-library and cross-language test reuse while improving test quality and exposing implementation flaws.
Article Search
Article: fse26maina-p1924-p
Unfulfilled Promises: LLM-Based Detection of OS Compatibility Issues in Infrastructure as Code
Georgios-Petros Drosos,
Georgios Alexopoulos,
Thodoris Sotiropoulos,
Dimitris Mitropoulos, and
Zhendong Su
(ETH Zurich, Switzerland; University of Athens, Greece; National Infrastructures for Research and Technology, Greece)
Modern infrastructures rely on Infrastructure as Code (IaC) systems to keep complex deployments consistent, reproducible, and scalable at production scale. The reliability of these infrastructures, however, depends on the correctness of their building blocks, which are reusable components (modules) that each perform a dedicated task, such as installing a package, managing an OS user, or configuring a service, and reconciling its state with the desired specification. A central promise of these components is portability: a specification written once should correctly manage the targeted resource on every OS the IaC component supports. When this property is violated, defects can propagate across entire infrastructures, causing outages, security vulnerabilities, and costly misconfigurations.
In this work, we introduce crOSsible, the first automated framework for cross-OS testing of IaC modules. crOSsible leverages large language models (LLMs) to synthesize and repair integration tests from structured module documentation, and executes them across 13 versions of 8 major Linux distributions. While our techniques are generally applicable to different IaC systems, we instantiate and evaluate them on Ansible, the most widely used IaC framework for managing individual servers. Evaluation across 259 popular Ansible modules demonstrates both effectiveness and real-world impact. In just 12 hours of testing, crOSsible uncovered 36 previously unknown bugs, including 22 portability violations. In total, 27 issues have been confirmed by maintainers, with 17 already fixed. The discovered issues range from crashes to dangerous soundness defects where modules reported success despite leaving systems misconfigured. Beyond bug discovery, crOSsible improved the code coverage of Ansible modules by 12.3% on average, systematically exercising OS-specific code paths that existing tests missed.
Article Search
Artifacts Available
Article: fse26maina-p2091-p
Automating Dockerfile Refactoring to Multi-stage Builds
Dongjin Chen,
Wenhua Yang,
Minxue Pan, and
Yu Zhou
(Nanjing University of Aeronautics and Astronautics, China; Nanjing University, China)
Containerization has become a cornerstone of modern software deployment, yet many projects still ship single-stage Dockerfiles that bundle compilers, build tools, and temporary files into production images, thereby hurting performance and security. Multi-stage builds are recommended, yet uptake appears uneven, plausibly because refactoring legacy Dockerfiles demands nontrivial reasoning about build lifecycles and dependency separation. This paper presents StageCraft, an automated refactoring approach that converts single-stage Dockerfiles into optimized multi-stage builds. StageCraft first performs static analysis to identify the technology stack and to infer build-time and runtime dependencies. It then applies a lightweight gate that estimates the refactoring benefit from a composite of image bloat, structural inefficiency, and security risk, proceeding only when refactoring is warranted. Finally, it synthesizes a multi-stage Dockerfile that isolates build tooling, copies only the artifacts needed at runtime, and applies production hardening. Evaluated on 521 real-world single-stage projects, StageCraft successfully produced working multi-stage Dockerfiles for 60.3% of targets. The resulting images were, on average, 52.2% smaller and contained 50.0% fewer high-risk vulnerabilities than the originals, outperforming baselines. StageCraft lowers the barrier to multi-stage adoption at scale, yielding leaner images with a reduced attack surface and improved maintainability. We release the tool, its knowledge assets, and the evaluation dataset to support reproducibility and future research.
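The abstract names the three inputs to StageCraft's gate but not how they are combined. The sketch below shows one plausible shape for such a gate, a weighted sum compared against a threshold. The weights, the threshold, the normalization of each signal to the range 0 to 1, and all identifiers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical normalized inputs, each expected in [0, 1]."""
    image_bloat: float
    structural_inefficiency: float
    security_risk: float


WEIGHTS = (0.4, 0.3, 0.3)  # hypothetical weights, not from the paper
THRESHOLD = 0.5            # hypothetical cut-off, not from the paper


def benefit_score(s):
    """Weighted composite of the three signals."""
    parts = (s.image_bloat, s.structural_inefficiency, s.security_risk)
    for p in parts:
        if not 0.0 <= p <= 1.0:
            raise ValueError("signals must be normalized to [0, 1]")
    return sum(w * p for w, p in zip(WEIGHTS, parts))


def should_refactor(s):
    """Proceed only when the estimated benefit clears the threshold."""
    return benefit_score(s) >= THRESHOLD


bloated = Signals(image_bloat=0.9, structural_inefficiency=0.6, security_risk=0.7)
lean = Signals(image_bloat=0.1, structural_inefficiency=0.2, security_risk=0.1)
```

Under these made-up numbers the bloated project passes the gate and the lean one is skipped, which mirrors the "proceed only when warranted" behavior the abstract describes.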
Article Search
Article: fse26maina-p2093-p
TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment
Zhiqiang Yuan,
Weitong Chen,
Hanlin Wang,
Xin Peng,
Zhenpeng Chen, and
Yiling Lou
(Fudan University, China; Tsinghua University, China)
Code translation transforms code between programming languages while preserving functionality, which is critical in software development and maintenance. While traditional learning-based code translation methods have limited effectiveness due to the lack of sufficient parallel training data, Large Language Models (LLMs) have recently advanced this field with their strong code generation and comprehension capabilities. However, code translated by LLMs still suffers from diverse quality issues, such as syntax and semantic errors. In this work, we propose TransAGENT, a novel multi-agent system that eliminates the errors during LLM-based code translation. The main insight of TransAGENT is to localize error-prone code blocks via fine-grained execution alignment between source and target code. We evaluate TransAGENT on a newly constructed benchmark of recent programming tasks to mitigate data leakage. TransAGENT outperforms the latest UniTrans by up to 33.3% in translation accuracy and achieves an average improvement of 56.7% over Agentless in program repair performance. We also conduct an ablation study and evaluate TransAGENT across different LLMs, demonstrating its effectiveness and strong generalizability.
Article Search
Article: fse26maina-p2151-p
Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark
Aoyang Fang,
Songhan Zhang,
Yifan Yang,
Haotong Wu,
Junjielong Xu,
Xuyang Wang,
Rui Wang,
Manyi Wang,
Qisheng Lu, and
Pinjia He
(Chinese University of Hong Kong, Shenzhen, China)
While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been proposed to address this challenge. However, we find that the benchmarks used to evaluate these models are often too simple to reflect real-world scenarios. Our preliminary study reveals that simple rule-based methods can achieve performance comparable to or even surpassing state-of-the-art (SOTA) models on four widely used public benchmarks. This finding suggests that the oversimplification of existing benchmarks might lead to an overestimation of the performance of RCA methods.
To further investigate the oversimplification issue, we conduct a systematic analysis of popular public RCA benchmarks, identifying key limitations in their fault injection strategies, call graph structures, and telemetry signal patterns. Based on these insights, we propose an automated framework for generating more challenging and comprehensive benchmarks that include complex fault propagation scenarios. Our new dataset contains 1,430 validated failure cases from 9,152 fault injections, covering 25 fault types across 6 categories, dynamic workloads, and hierarchical ground-truth labels that map failures from services down to code-level causes. Crucially, to ensure the failure cases are relevant to IT operations, each case is validated to have a discernible impact on user-facing SLIs.
Our re-evaluation of 11 SOTA models on this new benchmark shows that they achieve low Top@1 accuracies, averaging 0.21, with the best-performing model reaching merely 0.37, and execution times escalating from seconds to hours. From this analysis, we identify three critical failure patterns common to current RCA models: scalability issues, observability blind spots, and modeling bottlenecks. Based on these findings, we provide actionable guidelines for future RCA research. We emphasize the need for robust algorithms and the co-development of challenging benchmarks. To facilitate further research, we publicly release our benchmark generation framework, the new dataset, and our implementations of the evaluated SOTA models.
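As a reminder of what the reported metric measures, Top@k accuracy is the fraction of failure cases whose true root cause appears among a model's first k ranked candidates. The sketch below uses the common definition. The service names and rankings are invented for illustration and are not from the benchmark.

```python
def top_k_accuracy(rankings, truths, k=1):
    """Fraction of cases whose true root cause is in the top-k ranked candidates."""
    if len(rankings) != len(truths):
        raise ValueError("one ranking per ground-truth label is required")
    if not truths:
        raise ValueError("need at least one case")
    hits = sum(1 for ranked, truth in zip(rankings, truths)
               if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical ranked root-cause candidates for four failure cases.
rankings = [
    ["cart", "db", "auth"],
    ["db", "cart", "auth"],
    ["auth", "db", "cart"],
    ["cart", "auth", "db"],
]
truths = ["cart", "cart", "db", "db"]
```

On this toy data Top@1 is 0.25, Top@2 is 0.75, and Top@3 is 1.0.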
Article Search
Article: fse26maina-p2270-p
Phantom Rendering Detection: Identifying and Analyzing Unnecessary UI Computations
Zhihao Lin,
Mingyi Zhou,
Bo Sun,
Han Hu,
Gang Fan, and
Li Li
(Beihang University, China; Huawei, China; Huawei Hong Kong Research Center, Hong Kong)
Modern mobile applications have high-resolution user interfaces (UI) and heavy computations, resulting in significant energy consumption and latency, especially on older devices. Precisely measuring the performance of mobile operations (e.g., the number of CPU instructions) and detecting performance issues are critical for mobile software engineering. In this study, we characterize a previously underexplored class of performance issues on mobile called Phantom Rendering, which occurs when mobile applications perform unnecessary UI-related offscreen computations but do not visually render them. For example, the animation component stops visually rendering on the screen but continues to refresh in the background. This problem represents a root-level disconnection between UI-related offscreen computational effort and visual rendering, inherent to dual-thread rendering architectures employed across modern mobile platforms such as Android, iOS, and OpenHarmony. While this architectural pattern is shared across platforms, our current implementation and evaluation focus on OpenHarmony. However, Phantom Rendering is hard to detect automatically due to a lack of fine-grained performance measurements and detection methods. To address these challenges, we propose HapPRDetection, which contains a fine-grained performance profiler that can sample CPU Retired Instructions (the CPU instructions that have completed their execution and are no longer in the pipeline) and algorithms for automated detection of Phantom Rendering through differential analysis and hierarchical attribution. Our approach advances performance analysis methodology by bridging the semantic gap between fine-grained computational measurements (i.e., the number of CPU Retired Instructions at the function level) and high-level rendering behavior. Through our evaluation of the top-22 real-world mobile applications by download volume in OpenHarmony with 193 test steps, we show that Phantom Rendering issues occur in 19 test cases across 8 applications and that they waste up to 40% of CPU instructions in each problematic operation. We provide new insights into mobile rendering efficiency, and our detection strategy offers practical solutions to identify and resolve Phantom Rendering during the mobile development loop. Our approach and implementation are available at https://github.com/SMAT-Lab/PhantomRendering.git.
Article Search
Article: fse26maina-p2280-p
Comment Traps: How Defective Commented-Out Code Augment Defects in AI-Assisted Code Generation
Yuan Huang,
Yukang Zhou,
Xiangping Chen, and
Zibin Zheng
(Sun Yat-sen University, China)
With the rapid development of large language models in code generation, AI-powered editors such as GitHub Copilot and Cursor are revolutionizing software development practices. At the same time, studies have identified potential defects in the generated code. Previous research has predominantly examined how code context influences the generation of defective code, often overlooking the impact of defects within commented-out code (CO code). AI coding assistants' interpretation of CO code in prompts affects the code they generate.
This study evaluates how AI coding assistants, GitHub Copilot and Cursor, are influenced by defective CO code. The experimental results show that defective CO code in the context causes AI coding assistants to generate more defective code, reaching up to 58.17 percent. Our findings further demonstrate that the tools do not simply copy the defective code from the context. Instead, they actively reason to complete incomplete defect patterns and continue to produce defective code despite distractions such as incorrect indentation or tags. Even with explicit instructions to ignore the defective CO code, the reduction in defects does not exceed 21.84 percent. These findings underscore the need for improved robustness and security measures in AI coding assistants.
Preprint
Article: fse26maina-p2429-p
Co-evolution of Types and Dependencies: Towards Repository-Level Type Inference for Python Code
Shuo Sun,
Shixin Zhang,
Jiwei Yan,
Jun Yan, and
Jian Zhang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Python's dynamic typing mechanism, while promoting flexibility, is a significant source of runtime type errors that plague large-scale software, which motivates automatic type inference techniques. Existing type inference tools have achieved advances in type inference within isolated code snippets. However, repository-level type inference remains a significant challenge, primarily due to the complex inter-procedural dependencies that are difficult to model and resolve. To fill this gap, we present PyTIR, a novel approach based on LLMs that achieves repository-level type inference through the co-evolution of types and dependencies. PyTIR constructs an Entity Dependency Graph (EDG) to model the objects and type dependencies across the repository. During the inference process, it iteratively refines types and dependencies in the EDG for accurate type inference. Our key innovations are: (1) an EDG model designed to capture repository-level type dependencies; (2) an iterative type inference approach where types and dependencies co-evolve in each iteration; and (3) a type-checker-in-the-loop strategy that validates and corrects inferences on-the-fly, thereby reducing error propagation.
When evaluated on 12 complex Python repositories, PyTIR significantly outperformed prior work, achieving a TypeSim score of 0.89 and a TypeExact score of 0.84, representing relative improvements of 27% and 40%, respectively, over the strongest baseline. More importantly, PyTIR reduced the new type errors introduced by the tool by 92.7%. This demonstrates a significant leap towards automated, reliable type annotation for real-world Python development.
Article Search
Article: fse26maina-p2456-p
Interrogation Testing of CHC Solvers
David Kaindlstorfer,
Anastasia Isychev,
Valentin Wüstholz, and
Maria Christakis
(TU Wien, Austria; Diligence Security, Austria)
A Constrained Horn Clause (CHC) is a specific type of logic formula that contains uninterpreted predicates. CHC formulas are often used by static program analyzers to encode program properties, which are then verified using CHC solvers. The solvers themselves are complex tools and may contain bugs, which can lead to verifying unsafe programs, flagging safe programs as unsafe, or providing analyzers with incorrect invariants and counterexamples. It is, therefore, crucial to develop techniques for systematically testing CHC solvers.
In this paper, we present the first interrogation-testing technique for CHC solvers, which we implement in a tool called HornGator. Our technique uses witnesses generated by the solver under test to form new CHC instances. It also integrates a knowledge base maintaining a history of past solver queries. All this information helps HornGator generate more diverse instances, thereby improving its bug-finding effectiveness. As a result, HornGator found 21 unique bugs in five state-of-the-art CHC solvers; all were confirmed by the developers, 18 are fixed, and eight are of the highest severity.
Preprint
Article: fse26maina-p2460-p
Building Software by Rolling the Dice: A Qualitative Study of Vibe Coding
Yi-Hung Chou,
Boyuan Jiang,
Yi Wen Chen,
Mingyue Weng,
Victoria Jackson,
Thomas Zimmermann, and
James A. Jones
(University of California at Irvine, USA; Independent Researcher, USA; Marketing Creative Associate, USA; University of Southampton, UK)
Large language models (LLMs) are reshaping software engineering by enabling vibe coding: building software primarily through prompts rather than writing code. Although widely publicized as a productivity breakthrough, little is known about how practitioners actually define and engage in these practices. To shed some light on this emerging phenomenon, we conducted a grounded theory study of 20 vibe-coding videos, including 7 live-streamed coding sessions (approximately 16 hours, 254 prompts) and 13 opinion videos (approximately 5 hours), supported by additional analysis of activity durations and intents of prompts. Our findings reveal a spectrum of behaviors: some vibe coders rely almost entirely on AI without inspecting code, while others examine and adapt generated outputs. Across approaches, all must contend with the stochastic nature of generation, with debugging and refinement described as “rolling the dice.” Further, divergent mental models, shaped by vibe coders’ expertise and reliance on AI, influence prompting strategies, evaluation practices, and levels of trust. Additional quantitative analysis shows that vibe coders spend over 20% of session time on average waiting for model responses, with some sessions exceeding 50%. We also observe prompt redundancy: for some participants, nearly 40% of prompts repeat prior intents. These findings open new directions for research on the future of software engineering and point to practical opportunities for tool design and education.
Article Search
Article: fse26maina-p2501-p
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
Jia Li,
Hongyi Deng,
Yiran Zhang,
Kechi Zhang,
Tianqi Shao,
Tiankuo Zhao,
Weinan Wang,
Zhi Jin,
Ge Li,
Yang Liu,
Yingtao Fang, and
Yihong Dong
(Wuhan University, China; Peking University, China; Nanyang Technological University, Singapore)
Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks.
To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications. Based on the constructed benchmark, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. Specifically, we design three generation strategies to evaluate advanced LLMs on RealBench and propose two evaluation granularities with five metrics. The experimental results reveal key insights into current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, LLMs show much worse performance on repo-level code generation, and there are significant performance gaps among LLMs. Second, LLMs are good at finding and creating modules defined in UML diagrams, but the quality of generated modules is often poor due to grammar and logic errors. Third, generating the entire repository at once is the best strategy on smaller repositories, while for complex repositories the module-by-module strategy works better than the other strategies. Fourth, ablation studies on system designs show that a detailed system design is very important for repository-level code generation. Lastly, we discuss the frequent error types in generated repositories to provide insights for optimizing repo-level code generation.
Article Search
Article: fse26maina-p2570-p
Revealing Regressions: A Comparative Study of State-Capture Strategies in Validating Program Behavior
Hang Du,
Vijay Krishna Palepu, and
James A. Jones
(University of California at Irvine, USA; Microsoft, USA)
A central challenge in software testing is deciding which parts of a program’s state to check as evidence of correct behavior to reveal regressions. These checks are embodied as test oracles, typically as assertions in test cases. State observation strategies play a decisive role in shaping how effectively regressions can be revealed. Such strategies range from exhaustive memory snapshots to selective attribute checks via getter methods and nullability checks. These strategies are deeply embedded in both research and practice: academic work has explored heuristic- and serialization-based oracles, while industry has widely adopted snapshot testing. Despite their importance, the effects of different state-capture choices remain poorly understood from a scientific perspective. In this work, we present an experimental framework for systematically analyzing these design choices along the dimensions of observation scope, extraction approach, and extraction depth. Using this framework, we conduct an empirical study across twelve real-world projects, measuring how state-capture strategies influence regression-revealing capability and the richness of oracle information. Our findings reveal that human-written assertions often under-utilize available program state, achieving well below the fault-revealing potential of systematic observation strategies. Simple design choices, such as exposing unchecked intermediate return values, carefully selecting getters, and deepening state extraction, can yield measurable improvements (avg. 35.7%) in regression detection without needing to observe an overwhelmingly large amount of program-state data. Additionally, we highlight the challenges of observability, assertion desirability, and the trade-offs of capturing richer program states. Such insights show how small design choices can yield major differences in regression detection and potentially offer concrete directions for both tool builders and practitioners.
Article Search
Article: fse26maina-p2650-p
AccessDroid: Detecting Screen Reader Accessibility Issues in Android Applications via Semantics Trees
Hang Zhou and
Wei Song
(Nanjing University of Science and Technology, China)
Screen readers are essential for visually impaired users to access Android apps, but inadequate developer support often leads to semantic ambiguity or missing labels. While prior work has focused primarily on missing-label issues, semantic ambiguity remains underexplored. In this paper, we categorize screen reader accessibility issues into three types: semantics separation, semantics downshift, and semantics omission. Bootstrapped by semantics trees, we propose AccessDroid, a lightweight dynamic analysis approach that automatically detects screen reader accessibility issues in Android app pages and generates diagnostic reports. Applied to 361 runtime pages from 50 real-world Android apps, AccessDroid successfully detects 249 true semantics separation issues, 279 true semantics downshift issues, and 170 true semantics omission issues. On average, AccessDroid spends only 15 milliseconds per app page and achieves a precision of 98.8% and a recall of 98.3%, significantly outperforming two baseline approaches.
Article Search
Artifacts Available
Article: fse26maina-p2719-p
Failing with Purpose: Dangling Coverage-Guided Negative Test Generation from a Mechanized P4 Type System
Jaehyun Lee,
Seokhun Jeong, and
Sukyoung Ryu
(KAIST, Republic of Korea)
A type checker must reject ill-typed programs in addition to accepting well-typed programs. Negative type checker tests, programs expected to be rejected, validate that a type checker enforces the language’s typing rules as intended. We focus on negative type checker tests for P4, a domain-specific language for programmable network devices, whose type system encodes design principles and hardware constraints of the network dataplane. Failing to reject an ill-typed P4 program risks violating these principles and constraints, leading to unexpected errors. A comprehensive negative test suite covering subtle and diverse ill-typed conditions is thus important. However, constructing comprehensive negative tests is challenging: the negative input space lacks systematic characterization, and existing P4 program generators do not target subtle type errors.
This paper addresses the problem in three steps. (i) We mechanize the P4 type system using the SpecTec framework. Unlike the informal official P4 specification, the mechanized type system is formal and machine-readable. Mechanization enables a systematic analysis of the type system. (ii) Across the mechanized type system, we identify dangling premises, which are premises in the typing rules that, when violated, cause type errors. Based on them, we propose dangling coverage, a novel metric for quantifying negative test coverage. (iii) Finally, we implement a coverage-guided fuzzer that mutates well-typed P4 programs into ill-typed programs that increase dangling coverage. Our method identifies 939 dangling premises that characterize distinct ill-typed conditions in the P4 type system. The fuzzer generates a negative test suite achieving 33.02 percentage points higher dangling coverage than the existing P4C reference compiler’s test suite. The generated tests also reveal 29 previously unknown bugs in the compiler frontend, demonstrating the effectiveness of both the coverage metric and the fuzzer. The tests generated by our fuzzer are now integrated into the P4C test suite.
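The coverage idea described above can be illustrated with a minimal Python sketch. It assumes that dangling coverage is the fraction of dangling premises violated by at least one negative test, and that a candidate is kept only if it violates a premise not yet covered; the abstract does not spell out either detail. The premise identifiers and the `violated_by` callback are hypothetical stand-ins for what a mechanized type checker would report.

```python
from typing import Callable, Iterable


def dangling_coverage(hit: set[str], premises: set[str]) -> float:
    """Fraction of dangling premises violated by at least one negative test."""
    if not premises:
        raise ValueError("the set of dangling premises must not be empty")
    return len(hit & premises) / len(premises)


def select_negative_tests(
    candidates: Iterable[str],
    violated_by: Callable[[str], set[str]],  # hypothetical: premises a program violates
    premises: set[str],
) -> tuple[list[str], set[str]]:
    """Keep only candidates that violate a dangling premise not covered so far."""
    covered: set[str] = set()
    kept: list[str] = []
    for program in candidates:
        newly_covered = (violated_by(program) & premises) - covered
        if newly_covered:
            kept.append(program)
            covered |= newly_covered
    return kept, covered
```

A coverage-guided fuzzer would call something like `select_negative_tests` on the mutants it produces, so that the retained suite grows only when coverage grows.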
Artifacts Available
Article: fse26maina-p2779-p
Debugging Engine Enhanced by Prior Knowledge: Can We Teach LLM How to Debug?
Kunyi Li,
Sai Wu,
Xiu Tang,
Chang Yao,
Songhao Bu,
Quanqing Xu, and
Gang Chen
(Zhejiang University, China; Ant Group, China)
Automated Program Repair (APR) powered by Large Language Models (LLMs) has shown strong potential for improving software reliability. However, existing LLM-based APR approaches underutilize the rich debugging knowledge latent in large-scale bug-fix corpora. Prior work has primarily advanced APR through multi-agent coordination, prompt engineering, task decomposition, or model training. While effective to some extent, these methods rely heavily on the implicit reasoning capacity of LLMs, without explicitly modeling the debugging knowledge that underlies successful program repair. We present DeepK, a novel framework that systematically extracts, validates, and reuses debugging knowledge to guide LLMs in APR. Rather than treating historical bug-fix pairs solely as contextual exemplars, DeepK distills them into a structured knowledge base of verified debugging knowledge. This knowledge base can be seamlessly integrated into diverse APR pipelines, providing interpretable reasoning traces and step-wise repair strategies that enhance the repair performance.
We evaluate DeepK on multiple benchmarks (ACPR, Atcoder) using both GPT-4o and DeepSeek-v3. Results show that DeepK consistently surpasses state-of-the-art APR systems in repair accuracy. Ablation studies confirm the importance of edit description generation, multi-perspective retrieval, and two-fold debugging knowledge. Varying the number of retrieved entries further highlights the trade-off between informativeness and noise. These findings emphasize the essential contribution of explicit debugging knowledge in advancing LLM-based APR and establish DeepK as an extensible framework for building more reliable repair systems.
Article: fse26maina-p2854-p
Behind Defective Mobile AR Apps: Studying Reviews and Bugs of Android AR Software with Comparison to Prior Bug Studies
Tahmid Rafi,
Xueling Zhang,
Jianwei Niu, and
Xiaoyin Wang
(University of Texas at San Antonio, USA; Rochester Institute of Technology, USA)
As Augmented Reality (AR) applications grow in popularity, understanding and addressing AR software bugs becomes crucial. AR applications, due to their interaction with the physical world, present unique challenges that differ from traditional software. In this study, we seek to understand the root causes of the mobile AR bugs that users most commonly complain about. We collected user reviews from Google Play and issue reports from open-source AR software projects on GitHub. We categorized bug symptoms and root causes, and studied the correlation between them. We further studied the fixing commits of these bugs and compared their distribution with findings from prior bug studies. Our study finds that (1) AR app users are mostly affected by dysfunction bugs such as Hang and Crash, and these bugs are common in AR apps, (2) API Misuse is the most common root cause of AR bugs, and property setting error is the most common form of API Misuse, (3) a small number of API patterns and event handling practices may account for a large portion of API Misuse, and (4) apart from AR UI symptoms and the API Misuse root cause, bugs in AR apps have characteristics similar to those of bugs in other Android apps.
Article: fse26maina-p2998-p
In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome
Ardi Madadi,
Seongmin Lee,
Cornelius Aschermann, and
Marcel Böhme
(MPI-SP, Germany; University of California at Los Angeles, USA; Ruhr-University Bochum, Germany)
In Google’s FuzzBench platform, we find that the outcome of coverage-based evaluation more strongly agrees with the outcome of a bug-based evaluation than an independent bug-based evaluation itself. Recently, Böhme et al. found that despite a very strong correlation between coverage achieved and bugs found, there is no strong agreement between the outcome of a coverage- and a bug-based evaluation: The fuzzer best at achieving coverage may be the worst at finding bugs. However, in trying to explain this moderate agreement, we wondered whether the outcome of bug-based benchmarking itself is perhaps much more “noisy” and turned to applied statistics to develop the tools necessary to investigate our hypothesis.
In this paper, we call this degree of “noisiness” of a benchmarking outcome the concordance of the benchmarking procedure and quantify it using a measure of statistical reliability widely used in psychology, called mean split-half reliability, i.e., the expected agreement on the benchmark outcome between two random halves of the benchmarking suite. In our experiments with FuzzBench and Magma, we find that the concordance of coverage-based benchmarking is consistently strong while that of bug-based benchmarking is weak on FuzzBench and moderate on Magma. In contrast to FuzzBench, for the Magma benchmark suite (which was designed for bug-based evaluation) a coverage-based evaluation does not predict the outcome of a bug-based evaluation better than an independent bug-based evaluation.
Moreover, to demonstrate the utility of concordance also for developers of benchmarking suites, we investigate concordance as a measure of benchmarking efficiency, as in green fuzzer benchmarking. We empirically confirm that the resources of a procedure with higher concordance can be reduced more substantially (in terms of campaign length or benchmark sampling size) while maintaining a similar benchmark outcome as a procedure with lower concordance. We report the corresponding savings in terms of carbon emissions.
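To make the notion of mean split-half reliability concrete, here is a minimal Python sketch. It assumes the benchmark outcome is the ordering of fuzzers by mean score and that agreement between two halves is measured with Kendall's tau; both choices are illustrative assumptions, as is the `scores` layout, and the paper may use a different outcome or agreement measure.

```python
import random
from itertools import combinations
from statistics import mean


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a between two score vectors over the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def mean_split_half(scores: dict[str, list[float]], n_splits: int = 1000, seed: int = 0) -> float:
    """Average agreement between outcomes computed on two random halves.

    scores maps each fuzzer to its per-benchmark scores (same benchmark order).
    """
    rng = random.Random(seed)
    fuzzers = sorted(scores)
    indices = list(range(len(scores[fuzzers[0]])))
    agreements = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        half_a = indices[: len(indices) // 2]
        half_b = indices[len(indices) // 2:]
        outcome_a = [mean(scores[f][i] for i in half_a) for f in fuzzers]
        outcome_b = [mean(scores[f][i] for i in half_b) for f in fuzzers]
        agreements.append(kendall_tau(outcome_a, outcome_b))
    return mean(agreements)
```

A value near 1 means that any half of the suite yields nearly the same ordering of fuzzers, which is the sense in which a procedure with high concordance can be shortened without changing its outcome.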
Article: fse26maina-p3037-p
Spectrum-Based Failure Attribution for Multi-agent Systems
Yu Ge,
Linna Xie,
Zhong Li,
Yu Pei, and
Tian Zhang
(Nanjing University, China; Hong Kong Polytechnic University, China)
Large Language Model Powered Multi-Agent Systems (MASs) are increasingly employed to automate complex real-world tasks, such as programming and scientific discovery. While promising, MASs are not immune to defects or failures. Failure attribution in MASs, i.e., to pinpoint the specific agent actions responsible for failures, is underexplored and labor-intensive, posing significant challenges for debugging and improving MASs. To bridge this gap, we propose FAMAS, the first spectrum-based failure attribution approach for MASs. The approach performs systematic trajectory replay and abstraction, followed by spectrum analysis. Its core idea is to estimate, from variations across repeated MAS executions, the likelihood that each agent action is responsible for the failure. In particular, we propose a novel suspiciousness formula tailored to MASs, which integrates two key factor groups, namely the agent behavior group and the action behavior group, to account for the agent activation patterns and action activation patterns within the MAS execution trajectories. Extensively evaluated against 12 baselines from the Who&When benchmark, FAMAS demonstrates superior performance, outperforming all compared methods.
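As a rough illustration of spectrum analysis over repeated executions, the sketch below scores agent actions with the classic Ochiai formula from spectrum-based fault localization. This is a stand-in, not the suspiciousness formula proposed for FAMAS, which the abstract does not give; the run representation and the action names are hypothetical.

```python
from math import sqrt


def ochiai_scores(runs: list[tuple[set[str], bool]]) -> dict[str, float]:
    """Score each action by how strongly its presence correlates with failure.

    runs: one (actions that occurred, run failed?) pair per repeated execution.
    Ochiai(a) = ef / sqrt(total_failed * (ef + ep)), where ef and ep count the
    failing and passing runs in which action a occurred.
    """
    total_failed = sum(1 for _, failed in runs if failed)
    actions = set().union(*(acts for acts, _ in runs))
    scores: dict[str, float] = {}
    for action in actions:
        ef = sum(1 for acts, failed in runs if failed and action in acts)
        ep = sum(1 for acts, failed in runs if not failed and action in acts)
        denominator = sqrt(total_failed * (ef + ep))
        scores[action] = ef / denominator if denominator else 0.0
    return scores


# Hypothetical abstracted trajectories from four replays of the same task.
example_runs = [
    ({"plan", "search", "write"}, True),
    ({"plan", "search"}, False),
    ({"plan", "write"}, True),
    ({"plan"}, False),
]
ranking = sorted(ochiai_scores(example_runs).items(), key=lambda kv: -kv[1])
```

Actions that appear mostly in failing runs rise to the top of `ranking`, while actions present in every run, failing or not, are ranked lower.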
Article: fse26maina-p3090-p
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
Juyeon Yoon,
Somin Kim,
Robert Feldt, and
Shin Yoo
(KAIST, Republic of Korea; Chalmers University of Technology, Sweden)
Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truths, forcing reliance on human judgments, while existing test adequacy measures typically rely on output uncertainty and thus are only available after full inference. A key challenge is to assess how useful a test input is in a way that reflects the demands of the task, ideally before even generating any output. We introduce Clotho, a task-specific, pre-generation test adequacy measure that estimates input difficulty directly from LLM hidden states. Given a large pool of unlabelled inputs for a specific task, Clotho uses a Gaussian Mixture Model (GMM) to adaptively sample the most informative cases for human labelling. Based on this reference set the GMM can then rank unseen inputs by their likelihood of failure. In our empirical evaluation across eight benchmark tasks and three open-weight LLMs, Clotho can predict failures with a ROC-AUC of 0.716, after labelling reference sets that are on average only 5.4% of inputs. It does so without generating any outputs, thereby significantly reducing LLM execution costs compared to output-based uncertainty or confidence measures. Comparison of Clotho and these post-generation adequacy measures shows that the two approaches complement each other. Crucially, we show that adequacy scores learnt from open-weight LLMs transfer effectively to proprietary models, extending the applicability of the approach. When prioritising test inputs for proprietary models, Clotho increases the average number of failing inputs from 18.7 to 42.5 out of 100, compared to random prioritisation.
Artifacts Available
Article: fse26maina-p3213-p
WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements
Xiwen Teoh,
Yun Lin,
Duc-Minh Nguyen,
Ruofei Ren,
Wenjie Zhang, and
Jin Song Dong
(National University of Singapore, Singapore; Shanghai Jiao Tong University, China)
Visual language model (VLM) agents show great promise in automating graphical user interface (GUI) testing against requirements in natural language. However, the probabilistic nature of language models makes them inherently prone to hallucination. Therefore, given a detected inconsistency between the requirement and the web application, it is hard to distinguish whether it stems from a hallucination or a real application bug. Addressing this issue presents two core technical challenges: (1) limited capability and accuracy in deriving implicit test oracles, where the agent must act as its own oracle to implicitly decide if the application’s behavior is correct without guidance, and (2) limited reliability due to probabilistic inference, where an LLM’s inconsistent reasoning undermines its trustworthiness as an oracle.
We introduce WebTestPilot, a neurosymbolic LLM-based approach that addresses both challenges through symbolization. WebTestPilot detects and abstracts critical GUI elements of a web application into symbolic variables. This design improves reliability by constraining assertion generation to operations grounded in explicitly defined symbols, thereby reducing unconstrained or inconsistent reasoning. At the same time, it improves accuracy by representing application states and their relationships in a structured symbolic form, which increases the likelihood of the agent recognizing data, causal, and temporal dependencies across states. Together, these capabilities enable WebTestPilot to generate reliable and accurate test oracles that capture meaningful implicit expectations derived from test requirements. To advance research in this area, we build a benchmark of bug-injected web apps for evaluating NL-to-E2E testing. The results show that WebTestPilot achieves a task completion rate of 99%, with 96% precision and 96% recall in bug detection, outperforming the best baseline (+70 precision, +27 recall). The agent generalizes across diverse natural language inputs (i.e., those containing typos, grammatical errors, redundant sentences, stylistic restyling, or abbreviations) and model scales (3B–72B). In a real-world deployment with a no-code platform, WebTestPilot discovered 8 bugs during development, including data binding, UI, and navigation issues.
Article: fse26maina-p3243-p
Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-Preserving Transformations
Jiyong Uhm,
Minseok Kim,
Michalis Polychronakis, and
Hyungjoon Koo
(Sungkyunkwan University, Republic of Korea; Stony Brook University, USA)
Binary code analysis plays an essential role in cybersecurity, facilitating reverse engineering to reveal the inner workings of programs in the absence of source code. Traditional approaches, such as static and dynamic analysis, extract valuable insights from stripped binaries, but often demand substantial expertise and manual effort. Recent advances in deep learning have opened promising opportunities to enhance binary analysis by capturing latent features and disclosing underlying code semantics. Despite the growing number of binary analysis models based on machine learning, their robustness to adversarial code transformations at the binary level remains underexplored to date. In this work, we evaluate the robustness of deep learning models for the task of binary code similarity detection (BCSD) under semantics-preserving transformations. The unique nature of machine instructions presents distinct challenges compared to the typical input perturbations found in other domains. To achieve our goal, we introduce asmFooler, a system that evaluates the resilience of BCSD models using a diverse set of adversarial code transformations that preserve functional semantics. We construct a dataset of 9,565 binary variants from 620 baseline samples by applying eight semantics-preserving transformations across six representative BCSD models. 
Our major findings highlight several key insights: i) model robustness highly relies on the design of the processing pipeline, including code pre-processing, model architecture, and internal feature selection, which collectively determine how code semantics are captured; ii) the effectiveness of adversarial transformations is bounded by a transformation budget, shaped by model-specific constraints such as input size limits and the expressive capacity of semantically equivalent instructions; iii) well-crafted adversarial transformations can be highly effective, even when introducing minimal perturbations; and iv) such transformations efficiently disrupt the model’s decision (e.g., misleading to false positives or false negatives) by focusing on semantically significant instructions.
Article: fse26maina-p3336-p
Papers B
Understanding the Limitations of C/C++ Binary Third-Party Library Detection Tool: An Empirical Study at Scale
Chengyue Liu,
Zhengzi Xu,
Kaixuan Li,
Jiahui Wu,
Sihao Qiu,
Siyuan Li,
Siyang Xiong,
Yang Xiao, and
Yang Liu
(Nanyang Technological University, Singapore; Imperial Global Singapore of Imperial College London, Singapore; Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Desay SV Automotive Singapore Pte, Singapore; at Chinese Academy of Sciences, China)
Detecting third-party libraries (TPLs) in C/C++ binaries is essential for ensuring software security and compliance, particularly in safety- and performance-critical domains. While numerous academic and commercial Software Composition Analysis (SCA) tools have been proposed, their true capabilities remain unclear due to the absence of large-scale benchmarks and systematic evaluation. Equally lacking is a deeper understanding of why these tools often underperform, which limits both research progress and practical adoption.
We address this gap with a large-scale study of binary SCA tools. We construct the largest publicly available benchmark to date, encompassing 38,228 test cases across 1,873 libraries drawn from a defined scope of 13,675 libraries. Using this benchmark, we systematically evaluate 11 representative tools, covering all open source research prototypes and widely adopted commercial solutions, across versions, architectures, and feature-database scales. Beyond aggregate performance metrics, we perform the first fine-grained, feature-level analysis to identify the intrinsic challenges of binary TPL detection. Our results show that existing tools perform unsatisfactorily, with average recall below 60% and precision around 75%. Feature-level analysis reveals fundamental obstacles: binaries lose most source-code features during compilation, and libraries exhibit high feature overlap due to functional similarity and dependency propagation. These findings explain current shortcomings, and we build on them to provide design recommendations, research directions, and practical guidance for managing open-source risks in binary software.
Article: fse26mainb-p13-p
PlayCoder: Making LLM-Generated GUI Code Playable
Zhiyuan Peng,
Wei Tao,
Xin Yin,
Chenhao Ying,
Yuan Luo, and
Yiwen Guo
(Shanghai Jiao Tong University, China; LIGHTSPEED, China; Zhejiang University, China; Independent Researcher, China)
Large language models (LLMs) have transformed code generation, but their ability to generate code for applications with graphical user interfaces (GUIs), particularly games, remains underexplored.
Prior code-generation benchmarks assess correctness using test cases, but this is insufficient for GUI applications. These applications are interactive and event-driven, and their correctness depends on stateful behavior over sequences of user actions. Consequently, evaluation should account for interaction flows and UI state transitions rather than relying solely on pass or fail test outcomes.
To explore the performance of LLMs on GUI applications, we construct PlayEval, a repository-aware evaluation dataset from 43 multilingual (Python, TypeScript, and JavaScript) GUI applications. Unlike existing GUI benchmarks, which are difficult to port to desktop platforms, PlayEval consists of 6 major categories of GUI applications and directly facilitates evaluation on code generation tasks. To enable more reliable assessment beyond simple execution and unit tests, we propose Play@k, which measures whether at least one of k generated candidates yields an application that can be played end-to-end without logical errors. We further develop an LLM-based agent, PlayTester, that automates interactive evaluation by driving the GUI through task-oriented playthroughs and checking for logic violations. Through systematic evaluation, we demonstrate that 10 state-of-the-art code LLMs struggle to generate logically correct GUI applications, achieving near-zero Play@3 scores despite high compilation rates. To address these shortcomings, we introduce PlayCoder, a multi-agent, repository-aware framework that writes, evaluates, and refines GUI application code via closed-loop control. PlayCoder substantially improves functional correctness and semantic alignment for both open-source and closed-source models, achieving up to 38.1% Exec@3 and 20.3% Play@3. Case studies show that it detects silent logic flaws missed by traditional metrics and repairs them through targeted edits. These results indicate that coupling an end-to-end GUI testing agent with repository-aware automated program repair is an effective path towards reliable GUI code generation. Our implementation is publicly available at https://github.com/Tencent/PlayCoder.
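Read literally, Play@k can be computed as in the following Python sketch: a task counts as solved if any of its first k candidates was judged playable end-to-end. This is an assumption based on the one-sentence description above; the paper may use a different estimator. The task names and verdicts are hypothetical, standing in for judgments that an agent such as PlayTester would produce.

```python
def play_at_k(verdicts: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one playable candidate among the first k.

    verdicts maps each task to per-candidate judgments (True means the generated
    application could be played end-to-end without a logic violation).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not verdicts:
        raise ValueError("at least one task is required")
    solved = sum(1 for judged in verdicts.values() if any(judged[:k]))
    return solved / len(verdicts)


# Hypothetical verdicts for four tasks with three candidates each.
example_verdicts = {
    "snake": [False, True, False],
    "tetris": [False, False, False],
    "pong": [True, False, False],
    "2048": [False, False, True],
}
```

Because the metric only asks whether some candidate is playable, it can only stay level or rise as k increases, which makes a near-zero Play@3 score a strong signal of failure.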
Article: fse26mainb-p28-p
Multi-LLM Persona Generation for Virtual Focus Groups in Software Engineering: A Controlled, Multi-domain Study of Emotional Requirements Elicitation
Guangrui Fan,
Dandan Liu,
Lihu Pan,
Rui Zhang, and
Qian Guo
(Taiyuan University of Science and Technology, China; Universiti Malaya, Malaysia)
Emotional requirements (ERs), that is, how users should feel and which emotional harms a system must avoid, often determine whether people adopt and keep using software, especially in sensitive domains. Yet, eliciting ERs via interviews, workshops, and focus groups is costly and hard to scale or repeat. We study whether simulated focus groups, moderated discussions among user personas generated and played by large language models (LLMs), can augment early ER elicitation. Across three domains (mental‑health journaling, personal finance, and fitness), we compare one‑shot and iterative single‑model pipelines with iterative pipelines that use two or three different LLMs, and we benchmark against two human baselines: a human focus group and Emotional Goal Modeling (EGM). Multi‑LLM pipelines generate more diverse personas and increase the share of AI‑only ERs (not seen in our human baselines) that source‑blinded raters judge relevant by 14.7 percentage points compared to an iterative single‑model workflow; same‑model controls suggest this gain is not solely due to provider differences. Compared to human methods, simulations contribute more innovative requirements, EGM contributes clearer and more feasible ones, and human focus groups provide more natural phrasing. Overall, the results support an augmentation‑oriented workflow: use multi‑LLM simulation to broaden candidate ERs, structured modeling to organize them, and human engagement to ground decisions.
Article: fse26mainb-p38-p
Large Language Models for Opaque Predicate Resolution: A Universal Control Flow Deobfuscation Framework
Xiao Chen,
Qiuyun Wang,
Shuwei Wang,
Weize Zhang,
Yuling Liu,
Baoxu Liu, and
Zhengwei Jiang
(Institute of Information Engineering at Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, China)
Code obfuscation is a process that complicates reverse engineering, protects intellectual property, and conceals malware. Existing deobfuscation approaches often lack generality or struggle with complex, mixed, or unknown transformations. To address this issue, this paper proposes LUCID, a Large Language Model (LLM) based Universal Control-flow Integrated Deobfuscation framework. We first formalize the control-flow deobfuscation task and introduce the Topologically Feasible Path Set (TFPS) as a new evaluation metric. Building upon this foundation, LUCID leverages an LLM to infer Predicate Mapping Rules between basic blocks in linear time, which then guide the precise expansion of the Runtime Feasible Path Set to identify and eliminate spurious control flows. Finally, semantically equivalent paths are merged to reconstruct a clean, compilable, and behaviorally faithful control-flow graph, from which security-analyst-friendly C-like pseudocode is generated. Comprehensive evaluation on 780 binaries employing 13 distinct obfuscation techniques demonstrates that our method reduces average cyclomatic complexity by 52.4%, achieves full deobfuscation in 53.8% of cases, and suppresses TFPS inflation caused by bogus control flow by over 99%. The framework demonstrates superiority over existing state-of-the-art tools in terms of both generality and semantic consistency, thus evidencing the transformative potential of LLMs in facilitating scalable malware reverse engineering.
Article: fse26mainb-p45-p
SwarmBox: A Plug-and-Play Drone Swarm Framework for Streamlined Development and Comprehensive Analysis
Minki Lee,
Seojin Lee, and
Seulbae Kim
(Pohang University of Science and Technology, Republic of Korea; Daegu Gyeongbuk Institute of Science and Technology, Republic of Korea)
Drone swarms are emerging as a paradigm-shifting technology with the potential to redefine traditional robot missions such as logistics, surveillance, and disaster response, through their ability to coordinate large numbers of autonomous drones. Yet, progress in swarm research and development is constrained by a fragmented development ecosystem; every new algorithm must be validated on a bespoke testbed, introducing significant and redundant engineering overhead, while producing results that are difficult to reproduce or compare. As a remedy, we present SwarmBox, an open-source testbed framework that provides a shared foundation for swarm robotics. SwarmBox streamlines swarm research and development by providing three key capabilities: (1) a plug-and-play software architecture that decouples high-level swarm logic from low-level flight control and platform dependencies; (2) a swarm-level integrated analyzer that exposes emergent behaviors across drones and system layers to facilitate debugging and analysis; and (3) a configurable experimentation environment that supports diverse missions and communication topologies to promote fair and reproducible benchmarking.
Our evaluation demonstrates these benefits in practice. SwarmBox reduces software engineering effort by eliminating boilerplate code, abstracting low-level system details into coherent APIs, and simplifying swarm coordination into a uniform process. It improves fault diagnosis efficiency and uniquely exposes inter-agent interaction failures that traditional debugging methods cannot capture. It enables reproducible benchmarking by allowing systematic comparison of different swarm algorithms. It proves its generality and scalability by supporting diverse mission scenarios ranging from centralized coordination to fully decentralized swarms. Finally, its hardware abstraction layer minimizes the simulation-to-real gap, enabling unmodified application code to be seamlessly deployed on both simulation and physical drones. Together, these capabilities establish SwarmBox as a practical foundation for reproducible, community-driven swarm robotics research.
Artifacts Available
Article: fse26mainb-p61-p
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
Chaozheng Wang,
Zezhou Yang,
Shuzheng Gao,
Cuiyun Gao,
Zongjie Li,
Yichen Li,
Ting Peng,
Hailiang Huang,
Yuetang Deng, and
Michael R. Lyu
(Chinese University of Hong Kong, China; Hong Kong University, China; Hong Kong University of Science and Technology, China; Tencent, China; Chinese University of Hong Kong, Hong Kong)
Code editing constitutes a fundamental practice in software development, wherein developers modify existing codebases according to natural language requirements. Accurate code editing necessitates a comprehensive understanding of both the existing codebase and the modification requirements. Although large language models (LLMs) have demonstrated promising performance in code editing tasks, they suffer from substantial inefficiency by generating entire modified files that largely consist of unchanged code. While smaller models could potentially address this inefficiency, they typically lack the capacity to effectively comprehend long code contexts required for accurate editing. To ensure both effectiveness and efficiency, we propose to decompose code editing into a two-stage cascade: edit sketch generation, wherein a large model first produces concise sketches representing the requisite modifications (the more challenging phase), and edit sketch application, wherein a smaller model integrates these sketches into the original code to produce the final output edited code (the simpler phase). This cascaded design reduces the number of tokens generated by the large model, as the majority of the output is handled by the smaller, more efficient model, thereby enhancing overall efficiency. However, the effectiveness of this approach is constrained by current small models’ limited capabilities in handling long-context scenarios and cross-file dependencies, which are essential for accurate sketch application in real-world codebases. To address these limitations and enhance smaller models’ sketch application capabilities, we introduce the first large-scale sketch application dataset comprising over 100K training instances and 800M tokens, along with a human-evaluated benchmark, and propose specialized training strategies including curriculum-based long-context training and multi-file augmentation. 
Our comprehensive experiments demonstrate that our cascaded framework inherently reduces inference costs compared to direct editing with large models. Furthermore, combining large models with our fine-tuned smaller models can achieve even better performance. For instance, on the Aider benchmark, employing DeepSeek R1 as the edit sketch generation model alongside a fine-tuned Qwen2.5 Coder 14B model for the application phase improves Pass@2 by 11.1% compared to direct editing with DeepSeek R1 alone. Additionally, the cascaded approach reduces execution time and cost by 13% and 19%, respectively, demonstrating both performance gains and efficiency improvements.
Article: fse26mainb-p79-p
Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation
Lei Yu,
Jingyuan Zhang,
Xin Wang,
Li Yang,
Fengjun Zhang,
Peng Wang,
Jia Xu, and
Jiajia Ma
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Bash script comprehension is a significant challenge in Linux environments due to Bash's syntactic freedom and complex command structures. Despite its critical role in system administration and development, Bash scripts often lack adequate comments, hindering code readability and maintainability. Existing approaches to automated Bash comment generation face two main challenges: (1) Limited training datasets that inadequately represent real-world Bash usage patterns, particularly for complex multi-line scripts; and (2) Insufficient understanding of Bash-specific concepts by Large Language Models (LLMs). Our empirical analysis shows that even after standard training, LLMs still struggle to precisely understand complex Bash command semantics, leading to inaccurate comments. To address these challenges, we propose Bash-Commenter, an advanced comment generation method based on LLaMA-3.1-8B. First, to overcome data limitations (Challenge 1), we construct a comprehensive dataset of complex, multi-line Bash scripts with high-quality comments. Second, to enhance semantic understanding (Challenge 2), we conduct Continual Pre-training (CPT) on large-scale Bash script data, followed by Supervised Fine-tuning (SFT) on our annotated dataset, strengthening the model's foundational knowledge of Bash syntax and semantics. Finally, to resolve the subtle semantic errors that persist, we introduce Syntax-Aware Preference Optimization (SAPO). This method automatically constructs preference pairs by applying single, atomic operations (e.g., modifying a command option or removing an argument) to a script's Abstract Syntax Tree (AST), creating minimal pairs of correct and subtly incorrect scripts. This final optimization stage enables fine-grained command semantics learning and context-dependent quality assessment, significantly improving comment accuracy. We evaluate Bash-Commenter on single-line Bash commands and multi-line Bash scripts. 
Our method outperforms state-of-the-art baselines, achieving 33.40% BLEU-4, 58.26% METEOR, and 57.03% ROUGE-L for 1,064 single-line commands, and 22.15% BLEU-4, 43.89% METEOR, and 32.80% ROUGE-L for 1,046 multi-line scripts. Moreover, human evaluation and LLM evaluation demonstrate the superior quality of comments generated by Bash-Commenter in terms of correctness, completeness, and naturalness.
Article: fse26mainb-p86-p
AgentBound: Securing Execution Boundaries of AI Agents
Christoph Bühler,
Matteo Biagiola,
Luca Di Grazia, and
Guido Salvaneschi
(University of St. Gallen, Switzerland; USI Lugano, Switzerland)
Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that the policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
Artifacts Available
Article: fse26mainb-p93-p
SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization
Lei Yu,
Jingyuan Zhang,
Xin Wang,
Li Yang,
Fengjun Zhang, and
Jiajia Ma
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China)
Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. In the task of automated smart contract generation using Large Language Models (LLMs), this challenge is amplified by two interconnected failures: first, they operate as unauditable "black boxes" by failing to produce a transparent reasoning process, and second, as a consequence, they generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 based on Qwen2.5-Coder-7B, a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the base model on the nuances of smart contract code. To construct the data for subsequent stages, we first prompt the DeepSeek model to generate reasoning-and-code samples from verified on-chain contracts, followed by a rigorous validation process where each sample is manually reviewed by security experts for compilability, functionality, security, and reasoning completeness. Based on this, we then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 of these expert-validated samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy using 1,691 samples by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 18 state-of-the-art baselines on a challenging benchmark of 756 real-world functions from 289 deployed contracts, SmartCoder-R1 establishes a new state of the art by achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. 
Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
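As a rough illustration of the reward shaping described above, the sketch below combines compilation, security, and format signals into one weighted reward and then normalizes rewards within a sampled group, as GRPO-style methods commonly do. The weights, the `Sample` fields, and the function names are hypothetical; the abstract does not state them.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    compiles: bool
    secure: bool
    well_formatted: bool


# Hypothetical weights; the abstract does not give the actual values.
WEIGHTS = {"compile": 0.4, "security": 0.4, "format": 0.2}


def reward(s):
    """Weighted sum of the three binary signals for one generated sample."""
    return (WEIGHTS["compile"] * s.compiles
            + WEIGHTS["security"] * s.secure
            + WEIGHTS["format"] * s.well_formatted)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to rank candidates generated for the same prompt.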
Article: fse26mainb-p94-p
It Takes Two: Option-Aware Directed Greybox Fuzzing for Vulnerability PoC Generation
Susheng Wu,
Xin Hu,
Yiheng Cao,
Zhuotong Zhou,
Yiheng Huang,
Yijian Wu,
Bihuan Chen,
Zhijia Zhao, and
Xin Peng
(Fudan University, China)
Static analysis tools can identify potential vulnerabilities, but they often fall short in providing concrete proofs-of-concept (PoCs) to validate their findings. Directed greybox fuzzing (DGF) has emerged as a promising solution by systematically guiding execution toward suspicious code locations and generating reproducible PoCs that can trigger the target vulnerabilities. However, DGF tools often overlook the influence of configurable options on reaching target locations. In addition, option-aware greybox fuzzing (GF) tools suffer from ineffective extraction of the options relevant to target locations and from inefficient coordination between option fuzzing and file fuzzing.
To address these limitations, we present CoupleFuzz, a novel option-aware DGF tool that redefines PoC inputs as the combination of option input (OI) and file input (FI). CoupleFuzz adopts a two-phase workflow. The static analysis phase extracts option knowledge to guide the fuzzing. The option-aware fuzzing phase employs taint analysis to dynamically prioritize the option combinations and file bytes that are effective for reaching target locations, and introduces a novel cross-guided fuzzing strategy that coordinates the OI and FI fuzzing modules so that each module adapts to and benefits from its counterpart's advances, iteratively driving execution toward the target locations. Our evaluation demonstrates that CoupleFuzz significantly outperforms state-of-the-art DGF tools in generating PoCs for 22 real-world vulnerabilities: it generates 15 more PoCs than the best traditional DGF baseline (a 3.1× improvement) and achieves an average speedup of 5.6× in reaching target locations, with six 0-day vulnerabilities confirmed by developers and one CVE identifier assigned.
Article: fse26mainb-p122-p
Fairness Testing of Large Language Models in Role-Playing
Xinyue Li,
Zhenpeng Chen,
Jie M. Zhang,
Ying Xiao,
Tianlin Li,
Weisong Sun,
Yang Liu,
Yiling Lou, and
Xuanzhe Liu
(Peking University, China; Tsinghua University, China; King's College London, UK; Nanyang Technological University, Singapore; University of Illinois at Urbana-Champaign, USA)
Large Language Models (LLMs) have become foundational in modern language-driven software applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we conduct an empirical study on fairness testing of LLMs in role-playing scenarios. To enable this testing, we use LLMs to generate 550 social roles spanning a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions that target various forms of bias. These questions, covering Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as test cases, we conduct extensive evaluations of 10 advanced LLMs. The evaluation reveals 107,580 biased responses across the studied LLMs, with individual models yielding between 7,579 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the dataset, along with all scripts and experimental results.
Article: fse26mainb-p144-p
Exorcist: Enabling Atomic-Level Runtime Detection of Spectre Attacks using Precise Event Based Sampling
Hao Jia,
Haoyu Ma,
Changfeng Ding, and
Jinku Li
(Xidian University, China; Beijing Jiaotong University, China)
While being key hardware techniques for improving the performance of modern processors, the speculative execution mechanisms also lead to side-channel attacks, which pose significant threats to the security of computer systems. While researchers have proposed various solutions to mitigate the threat posed by speculative attacks, most existing approaches have focused on offline static vulnerability analysis, which suffers from significant limitations of incompleteness and poor scalability.
Based on Intel's Precise Event Based Sampling (PEBS) technique, this paper proposes Exorcist, a novel runtime framework for detecting Spectre-PHT attacks at the atomic level. Leveraging the key observation that the atomic execution of a Spectre-PHT gadget triggers both a cache miss and a branch misprediction in a fixed sequence, Exorcist uses carefully configured PEBS monitoring to efficiently capture all related fine-grained hardware performance events, then launches a collaborative kernel- and user-level taint analysis to pinpoint Spectre-PHT gadgets that have actually exploited vulnerable speculative executions. Our approach significantly reduces the number of native instructions that must be processed to confirm actual Spectre exploitation, allowing it to achieve high accuracy and a negligible false positive rate with second-level response time. We have implemented a prototype of Exorcist and evaluated it comprehensively. Experimental results indicate that Exorcist can efficiently detect Spectre-PHT attacks at runtime with acceptable performance overhead, including JIT-compiled Spectre-PHT payloads written in JavaScript that are beyond the reach of existing offline analysis-based approaches.
Article: fse26mainb-p147-p
V2E: Validating Smart Contract Vulnerabilities through Profit-Driven Exploit Generation and Execution
Jingwen Zhang,
Yuhong Nan,
Kaiwen Ning,
Mingxi Ye,
Wei Li,
Yuming Xiao,
Yuming Feng,
Weizhe Zhang, and
Zibin Zheng
(Sun Yat-sen University, China; Peng Cheng Laboratory, China; Harbin Institute of Technology, China)
Smart contracts are a critical component of blockchain systems. Because smart contracts carry large amounts of digital assets, their security is of critical importance. Although numerous tools have been developed for detecting smart contract vulnerabilities, their effectiveness remains limited, particularly because the reported results contain many false positives. As a result, developers and auditors are often overwhelmed by the manual verification of reported issues. A fundamental reason is that a reported vulnerability may satisfy a specific vulnerable pattern without actually being exploitable, either because the vulnerable code cannot be triggered or because it does not result in any financial loss.
In this paper, we propose V2E, a new framework for validating whether a reported vulnerability is truly exploitable. The core idea of V2E is to automatically generate an executable Proof-of-Concept exploit (PoC for short), and then assess whether the PoC can trigger the vulnerability and cause real damage (i.e., financial loss). While LLMs have shown proficiency in PoC generation, achieving our task is by no means trivial. In detail, it is difficult for LLMs to: (1) generate and update a PoC to trigger a specific vulnerability, and (2) evaluate the PoC's effectiveness in validating an exploitable vulnerability. To this end, V2E automates the whole process through a novel combination of PoC generation, validation, and refinement: (1) First, V2E generates targeted PoCs by analyzing potential vulnerability paths. (2) Then, V2E verifies the validity of PoCs through triggerability and profitability analysis. (3) In addition, V2E iteratively refines the generated PoC based on PoC execution feedback, thereby increasing the chance of confirming the vulnerability. Evaluation on 264 manually labeled contracts shows that V2E outperforms the baseline approach. In particular, V2E successfully identifies 102 out of 124 exploitable vulnerabilities, achieving a precision of 91.9% and a recall of 82.3%. In addition, it successfully eliminates 71 out of 140 false alarms (50.7%). V2E also effectively enhances the performance of SOTA tools: it reduces the false positive rates of Slither by 76.9%, Mythril by 56.9%, and Confuzzius by 65%.
Article: fse26mainb-p150-p
Eidolon: Perform Noise-Aware Fuzzing on FHE Libraries via Equivalence Expression Transformation
Zhensheng Xian,
Zhen Yan,
Yuanliang Chen,
Xuelian Cao,
Fuchen Ma,
Dalong Shi, and
Yu Jiang
(Tsinghua University, China; Aviation Industry Corporation of China, China)
Ensuring data privacy during computation is a critical challenge in many security systems. Fully Homomorphic Encryption (FHE) addresses this gap by enabling multiple operations on encrypted data without decryption, thus ensuring privacy is preserved throughout computation. However, existing cryptographic testing tools are unable to test the core functionality of FHE, which is the execution of computations on encrypted data. They are expertly designed to generate structured data for testing cryptographic algorithms. This structural mismatch, combined with a lack of awareness of FHE-specific noise management, leads them to generate invalid test inputs that fail to probe FHE libraries’ core logic.
To address this gap, we propose Eidolon, a noise-aware fuzzer. It directs mutations toward arithmetic expressions that explore the computational space defined by the noise budget. As its test oracle, Eidolon leverages Equivalence Expression Transformation, which transforms a standard arithmetic expression into two mathematically identical but structurally different forms (e.g., Factored, Horner) to detect inconsistencies in their outputs. We evaluated Eidolon on SEAL, OpenFHE, HElib, and TFHE. Compared with existing cryptographic and grammar-based fuzzers, Eidolon achieves 28.7%, 45.5%, 75.6%, and 37.6% higher final code coverage than CLFuzz, Cryptofuzz, CDF, and Peach, respectively. In total, Eidolon uncovered 20 previously unknown bugs, 10 of which have been fixed and 12 assigned CVEs.
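The oracle idea can be illustrated on plaintext integers: evaluate the same polynomial in expanded form and in Horner form, and flag any disagreement. This sketch calls no FHE library, and the function names are hypothetical; in Eidolon's setting the two forms would presumably be evaluated homomorphically and their decrypted outputs compared.

```python
def eval_expanded(coeffs, x):
    """Evaluate sum(c_i * x**i); coefficients are in ascending order."""
    return sum(c * x**i for i, c in enumerate(coeffs))


def eval_horner(coeffs, x):
    """Evaluate the same polynomial as (...(c_n*x + c_{n-1})*x + ...) + c_0."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def forms_agree(coeffs, x):
    """Differential oracle: equivalent forms must produce equal outputs."""
    return eval_expanded(coeffs, x) == eval_horner(coeffs, x)
```

On exact integers the two forms always agree, so a mismatch observed through an encrypted pipeline would point at the library rather than at the expression.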
Article: fse26mainb-p157-p
Denoising Fault Localization with Test Line Proximity
Marius Smytzek and
Andreas Zeller
(CISPA Helmholtz Center for Information Security, Germany)
When a program fails, statistical fault localization (SFL) provides important debugging hints by identifying the locations whose execution most correlates with failures.
However, such correlations can be weakened if a test contains both passing and failing assertions, creating ambiguous and misleading associations.
Likewise, if multiple lines correlate with the same strength, SFL provides little guidance to disambiguate between them.
This paper proposes a novel proximity-based weighting scheme for SFL that assigns different weights to locations in the test subject based on temporal proximity to a failure.
The more recently a subject line is executed before the test fails, the higher its weight.
We operationalize a known heuristic into a lightweight statistical form compatible with existing SFL formulas.
Our approach applies to any test, from simple single-line tests (where it preserves SFL behavior), to single-assertion tests with multiple lines (where it benefits from temporal proximity), to complex multi-assertion tests (where it provides the most benefit by distinguishing failing from passing assertions).
Once computed, the weights can be integrated into any existing SFL technique.
Our evaluation of proximity-weighted fault localization on 310 real-world programs shows that it consistently outperforms fault localization techniques across all test types.
Proximity-weighted fault localization shows per-subject relative improvements of 200%–400%, meaning that, for a typical subject, it provides 3 to 5 times the baseline effectiveness.
These improvements represent substantial gains over baseline techniques.
Our approach can be integrated into existing fault localization techniques to improve performance, making it a valuable addition to automated debugging.
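One way to picture proximity weighting is the sketch below, which plugs per-line weights into an Ochiai-style score. The weighting rule (position of a line's last execution divided by trace length) and the way the weights enter the formula are assumptions made for illustration; the abstract does not specify the actual scheme.

```python
import math


def proximity_weights(trace):
    """Weight each executed line by how late it last ran before the failure.

    `trace` is the ordered list of subject line numbers executed by a
    failing test; later executions overwrite earlier ones.
    """
    n = len(trace)
    weights = {}
    for pos, line in enumerate(trace, start=1):
        weights[line] = pos / n
    return weights


def weighted_ochiai(line, failing_traces, passing_traces):
    """Ochiai with weighted failing-test coverage (assumed integration)."""
    ef = sum(proximity_weights(t).get(line, 0.0) for t in failing_traces)
    ep = sum(1.0 for t in passing_traces if line in t)
    denom = math.sqrt(len(failing_traces) * (ef + ep))
    return ef / denom if denom else 0.0
```

With one failing trace `[10, 20, 30]` and one passing trace `[10, 20]`, unweighted Ochiai scores lines 10 and 20 identically, whereas this sketch ranks line 20 above line 10 because it ran closer to the failure.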
Artifacts Available
Article: fse26mainb-p171-p
CrossFit: Demystifying VM Callback Bugs in Interpreters
Chibin Zhang,
Qiang Liu, and
Mathias Payer
(EPFL, Switzerland)
Scripting languages like Python, Ruby, or PHP are integral to modern software development. Despite security measures like memory safety and sandboxing, vulnerabilities within these engines can lead to critical issues such as remote code execution or sandbox escapes. A particularly pervasive class of vulnerabilities is callback bugs, which occur when user-defined callbacks violate runtime invariants, such as freeing an object still in use (can be reached through live pointers) or modifying a data structure during traversal. These violations can result in severe consequences, including use-after-free, null-pointer dereferences, and type confusion, often leading to crashes, memory corruption, or even exploitable vulnerabilities. Detecting callback bugs remains challenging due to a lack of general understanding, as they have not been formally characterized or systematically studied. As such, existing tools lack the ability to (1) establish clear links between script-side callbacks and their native-side invokers, and (2) generate scripts that systematically satisfy preconditions required to trigger these bugs.
We propose CrossFit, a novel 2-tier approach combining static analysis and targeted fuzzing to systematically discover callback bugs. CrossFit first establishes links between script-side callbacks and their native-side invokers through context link analysis, enabling targeted exploration of high-risk code paths. It then generates proof-of-concept scripts with custom classes and magic methods, introducing side-effect operations to violate runtime invariants. Our evaluation shows that CrossFit outperforms existing tools by up to 12.04% in terms of callsite coverage (i.e., potential sites where callback bugs may occur). We also identified 20 new bugs in Python, Ruby, and PHP, many of which are severe memory corruptions. Moreover, we provide a comprehensive benchmark totaling 150 proofs-of-concept to improve interpreter security.
Artifacts Available
Article: fse26mainb-p174-p
Detecting Code-Comment Inconsistencies in Smart Contracts by Combining LLM and Program Analysis
Jiashuo Zhang,
Jiachi Chen,
Ting Zhang,
Yue Li,
Daoyuan Wu,
Yanlin Wang,
Jianbo Gao,
Ting Chen, and
Zhong Chen
(Peking University, China; Zhejiang University, China; Taiyuan University of Technology, China; Lingnan University, Hong Kong; Sun Yat-sen University, China; Beijing Jiaotong University, China; University of Electronic Science and Technology of China, China)
Smart contracts have attracted rapid development and widespread application. Due to the complexity of real-world smart contracts, it is error-prone to correctly enforce all intended functionalities in code implementations, resulting in unintended functional behaviors and security issues in practice. Code-comment inconsistency detection has emerged as an important solution to these issues, which leverages the redundant functional specifications in comments to detect code implementations that violate developers' intentions. However, existing inconsistency detection solutions are typically pattern-based and limited to fixed types of inconsistencies, which prevents them from detecting the diverse inconsistencies between real-world code implementations and casually written comments. To bridge the gap, this paper presents SmartComment, the first technique that combines LLMs with program analysis techniques for detecting code-comment inconsistencies in smart contracts. SmartComment introduces an LLM-driven workflow which simulates real-world interactions between code reviewers and developers to identify inconsistencies. It incorporates various program analysis techniques into the workflow, including comment propagation and code context extraction for generating input context for inconsistency detection, as well as program variant generation and differential analysis for inconsistency confirmation. Our evaluation results show that SmartComment detects 203 valid inconsistencies from a dataset of 1,000 real-world contracts with a precision of 79.9%, highlighting its effectiveness in detecting prevalent and diverse real-world inconsistencies. Compared to previous work, SmartComment achieves both higher precision and recall, detecting over 90% of inconsistencies that existing methods fail to identify. Furthermore, an ablation experiment demonstrates the effectiveness of incorporating program analysis techniques into SmartComment, improving the F1-score from 58.7% to 81.3%.
Article: fse26mainb-p186-p
Aligning with Human Coding Preferences for Improving Code Generation
Xin Yin,
Chao Ni, and
Xiaohu Yang
(Zhejiang University, China)
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference types. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in types with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in types without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment types.
Article: fse26mainb-p194-p
EfficientUICoder: A Bidirectional Token Compression Framework for Efficient MLLM-Based UI Code Generation
Jingyu Xiao,
Zhongyi Zhang,
Yuxuan Wan,
Yintong Huo,
Yang Liu, and
Michael R. Lyu
(Chinese University of Hong Kong, China; Huazhong University of Science and Technology, China; Singapore Management University, Singapore; Nanyang Technological University, Singapore; Chinese University of Hong Kong, Hong Kong)
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in UI2Code tasks (i.e., generating code from UI mockups), significantly enhancing website development efficiency. However, UI2Code tasks incur substantially higher computational overhead compared to traditional code generation tasks. This overhead is primarily driven by the large number of input image tokens required to represent complex visual designs and the extensive volume of output code tokens needed to describe complete webpage structures. In this paper, we conduct a comprehensive preliminary study on popular MLLMs for UI2Code tasks, identifying significant redundancies in both image and code tokens. We observe that these redundancies not only exacerbate computational complexity but also hinder the model’s ability to focus on key UI elements, leading to excessively lengthy and often invalid HTML files. To address these challenges, we propose EfficientUICoder, a bidirectional compression framework designed for efficient UI code generation. First, we introduce an Element and Layout-aware Token Compression method, which preserves essential UI element and layout information by detecting element regions and constructing a UI element tree for efficient representation. Second, we design a Region-aware Token Refinement strategy that refines selected tokens by leveraging attention scores to evaluate semantic importance, discarding low-attention tokens from selected regions while integrating high-attention tokens from unselected regions. Third, we develop an Adaptive Duplicate Token Suppression mechanism, which dynamically modulates token probabilities during decoding by tracking HTML/CSS code structure frequencies and applying exponential penalty strategies to minimize repetitive generation. Extensive experiments demonstrate that EfficientUICoder achieves a 55%-60% compression ratio without compromising the quality of the generated webpages, effectively reducing output code redundancy. 
In terms of efficiency, EfficientUICoder achieves superior improvements, reducing computational cost by up to 44.9%, generated tokens by up to 41.4%, prefill time by up to 46.6%, and inference time by up to 48.8% on 34B-level MLLMs. Code is available at https://github.com/WebPAI/EfficientUICoder.
Article: fse26mainb-p200-p
Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering
Francesco Sovrano,
Gabriele Dominici, and
Alberto Bacchelli
(Collegium Helveticum at ETH Zurich, Switzerland; University of Zurich, Switzerland; University of Italian-Speaking Switzerland, Switzerland)
Prompt-induced cognitive biases are changes in a general-purpose AI (GPAI) system's decisions caused solely by biased wording in the input (e.g., framing, anchors), not task logic. In software engineering (SE) decision support (where problem statements and requirements are natural language) small phrasing shifts (e.g., popularity hints or outcome reveals) can push GPAI models toward suboptimal decisions. We study this with PROBE-SWE, a dynamic benchmark for SE that pairs biased and unbiased versions of the same SE dilemmas, controls for logic and difficulty, and targets eight SE-relevant biases (anchoring, availability, bandwagon, confirmation, framing, hindsight, hyperbolic discounting, overconfidence). We ask whether prompt engineering mitigates bias sensitivity in practice, focusing on actionable techniques that practitioners can apply off-the-shelf in real environments. Testing common strategies (e.g., chain-of-thought, self-debiasing) on cost-effective GPAI systems, we find no statistically significant reductions in bias sensitivity on a per-bias basis. We then adopt a Prolog-style view of the reasoning process: solving SE dilemmas requires making explicit any background axioms and inference assumptions (i.e., SE best practices) that are usually implicit in the prompt. So, we hypothesize that bias-inducing features short-circuit assumption elicitation, pushing GPAI models toward biased shortcuts. Building on this, we introduce an end-to-end method that elicits best practices and injects axiomatic reasoning cues into the prompt before answering, reducing overall bias sensitivity by 51% on average (p < .001). Finally, we report a thematic analysis that surfaces linguistic patterns associated with heightened bias sensitivity, clarifying when GPAI use is less advisable for SE decision support and where to focus future countermeasures.
Article: fse26mainb-p207-p
Break to Adapt: Knowledge-Based Updates of Breaking Dependencies in JavaScript
Yifan Xia,
Chengwei Liu,
Zifan Xie,
Lyuye Zhang,
Peiyu Liu,
Kangjie Lu,
Yang Liu,
Wenhai Wang, and
Shouling Ji
(Zhejiang University, China; Nankai University, China; Chongqing University, China; Nanyang Technological University, Singapore; University of Minnesota, USA)
Modern software libraries continually evolve to stay actively maintained, improve performance, and enhance security. This evolution frequently introduces major version updates that include incompatible modifications (breaking changes), which often cause downstream projects to fail and significantly increase the effort required for maintenance. To reduce this burden on developers, recent studies propose automated approaches that leverage API characterization techniques and compilation error reports to guide LLM-based updates. However, these methods rely on static typing and compilation feedback, which are not available in dynamic languages such as JavaScript. As a result, when applied to JavaScript, they degrade to a knowledge-free setting and cannot effectively support automated dependency updates. In this paper, we propose a knowledge-based approach that mines explicit update knowledge, referred to as breaking change records, from library repositories to support automated dependency updates in the presence of breaking changes (i.e., breaking dependency updates). We first conduct an empirical investigation of how breaking change records are maintained in top-ranked JavaScript libraries. Our study reveals that 83.8% of libraries maintain such records within their repositories, often spread across multiple locations. However, the quality of these records varies considerably, with issues such as implicit references to changed objects and vague adaptation instructions. To address this issue, we systematically characterize breaking change records by identifying recurring patterns in both their locations and the types of information they convey. Building on these insights, we develop BDUpdater, an agent-based framework that aggregates repository sources and refines raw records into structured breaking change lists, thereby mitigating the absence of API differencing tools. Leveraging this mined knowledge, BDUpdater pinpoints changed objects and statically identifies the affected client code, while supplying fine-grained change information to LLMs to guide accurate client code migration in the absence of compilation error messages. Our evaluation on 13 popular JavaScript libraries and 84 clients shows that BDUpdater recovers 90.5% of the breaking dependency updates with high semantic equivalence to developer-authored adaptations, while incurring an average cost of only $0.05 per client.
Article: fse26mainb-p219-p
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns
Jia Li,
Zhuangbin Chen,
Yuxin Su, and
Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong; Sun Yat-sen University, China)
The increasing prevalence of software vulnerabilities highlights the need for effective Automatic Vulnerability Repair (AVR) tools. While LLM-based approaches are promising, they struggle to incorporate structured security knowledge from sources like CWE and NVD. Current methods either use this information superficially by concatenating the CWE-ID into the input prompt, yielding negligible benefits, or rely on few-shot learning with rigid, non-generalizable examples, which limits their effectiveness in real-world scenarios.
To address this gap, we propose VulKey, an LLM-based AVR framework that leverages a hierarchical abstraction of expert knowledge to guide patch generation. Our novel three-level abstraction formulates repair strategies in terms of CWE type, syntactic actions, and semantic key elements. This approach captures the essence of a security fix with greater generality than concrete examples and more semantic richness than traditional syntax-based templates, overcoming the coverage limitations of prior methods.
VulKey is implemented as a two-stage pipeline: first, expert knowledge matching predicts an appropriate repair pattern for the vulnerability; second, repair code generation uses a pattern-guided, fine-tuned LLM to produce secure patches.
On the real-world C/C++ dataset PrimeVul, VulKey achieves 31.5% repair accuracy, surpassing the best baseline by 7.6% and outperforming leading tools such as VulMaster and GPT-5. Moreover, VulKey demonstrates cross-language and cross-model generalizability, with state-of-the-art performance on the Java benchmark Vul4J. These results underscore the importance of structured expert knowledge in advancing AVR effectiveness.
Our work demonstrates that explicitly modeling and integrating expert security knowledge through hierarchical patterns is a crucial step toward building more effective and reliable AVR tools.
Article: fse26mainb-p225-p
GREClue: Failure Indexing with Graph-Based Failure Representation and Entropy-Based Deep Clustering
Zhenyu Yang and
Zhongxing Yu
(Shandong University, China)
Failure indexing aims to group multiple failures according to their root causes and is an essential step in parallel debugging. It consists mainly of two steps: failure representation and failure clustering. While many research efforts have been devoted to these two steps, serious issues still exist. For failure representation, existing works use coverage or program memory information, which unfortunately cannot capture deep failure semantics. For failure clustering, advanced failure indexing methods use clustering algorithms with preset cluster centers; such algorithms handle spherical clusters well but perform poorly on clusters of other shapes. To address these issues, this paper proposes GREClue, a novel failure indexing approach with Graph-based failure Representation and Entropy-based deep Clustering. GREClue addresses the two issues in order. For failure representation, GREClue designs the failure semantic graph (FSG), a new graph representation that effectively captures the semantic information and runtime information of failures. Based on FSG, GREClue further includes an entropy-based deep clustering component, which can accurately cluster failed tests without presetting cluster centers. An extensive evaluation of GREClue shows that compared to the state-of-the-art failure indexing method, GREClue improves both the performance of estimating the number of faults and the clustering effectiveness by 10% to 41%. Moreover, the evaluation shows that GREClue can effectively facilitate parallel debugging.
Article: fse26mainb-p229-p
Understanding Code Similarity across Instruction Set Architectures: An Empirical Study
Haonan Yu,
Jiaxin Zhu,
Yingying Zheng,
Yuwei Zhang,
Wei Wang,
Jun Wei, and
Tao Huang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; University of Chinese Academy of Sciences Nanjing, China; Nanjing Institute of Software Technology, China)
Software interacts with hardware through Instruction Set Architectures (ISAs), such as x86, ARM, and RISC-V. Although many developers may be unaware of ISA heterogeneity, ISA-specific code is pervasive in foundational software systems that underpin the digital infrastructure of human society. Maintaining separate implementations is common when supporting multiple ISAs in such a foundational software project, which may introduce substantial additional effort. Meanwhile, separate ISA-specific implementations frequently exhibit code similarities across ISAs. While prior code similarity research has largely focused on general-purpose clones or cross-language settings, similarity in ISA-specific implementations remains underexplored. To understand ISA-specific code and its similarities, and to gain insights for better management, we conducted an empirical study of 20 open-source foundational projects that support multiple ISAs. We confirmed the need for separate ISA-specific implementations by identifying the roles and characteristics of large-scale ISA-specific code, with assistance from large language models (LLMs). Our analysis of the ISA-specific code revealed a weighted average similarity of 21.7% across ISAs. We also observed cross-ISA co-change and cross-ISA participation patterns in the development and maintenance of ISA-specific code. By centering on ISA-specific implementations rather than general-purpose clones, this study provides a dedicated empirical characterization of a practically important but underexplored code-similarity setting, yielding evidence that can inform both researchers and practitioners working on ISA-related software engineering.
Article: fse26mainb-p270-p
How Do Developers Interact with AI? An Exploratory Study on Modeling Developer Programming Behavior
Yinan Wu,
Ze Shi Li,
Kathryn Thomasset Stolee, and
Bowen Xu
(North Carolina State University, USA; University of Oklahoma, USA)
Artificial Intelligence (AI) is reshaping how developers adopt software engineering practices, yet the multi-dimensional nature of developer-AI interaction remains under-explored. Prior studies have primarily examined dimensions observable from developer activities such as “Prompt Crafting” and “Code Editing,” overlooking how hidden intentions and emotional dimensions intertwine with concrete actions during AI-assisted programming. Understanding the interplay is essential for improving developer experience and future AI assistant designs. To understand this phenomenon, we conducted a mixed-methods study with 76 developers. We first split developers into AI-assisted and non-AI groups. Each developer performed a programming task (either Python with API management or Java with SQL). Developers retrospectively labeled their self-reported intentions, tool-supported actions, and emotions (on a 7-point valence scale) from screen recordings, supplemented by participant surveys and interviews. Our user study resulted in a novel model, named S-IASE, with four dimensions to describe programming behavior for a given development state: intention, action, supporting tool, and emotion. Our analysis reveals several aggregated and sequential behavioral patterns. For example, for aggregated patterns, using AI assistants often led developers to focus more on actively “creating” code, evaluating, and verifying the generated results; for sequential patterns, AI-assisted participants showed emotionally stable development flows, as opposed to non-AI-assisted participants who experienced more fluctuating emotions. Interviews revealed further nuance: some developers reported impostor-like feelings, expressing guilt or self-doubt about relying on AI for programming. The uncovered patterns indicate that our model can provide actionable insights for improving AI assistants’ responsiveness, training developers in AI collaboration, and designing developer-centric AI studies. 
Our work bridges an important gap in understanding the complexities of developer-AI interaction in the programming context and sheds light on future developer-centric research directions.
Artifacts Available
Article: fse26mainb-p272-p
Adaptive Mutation Scheduling with Deep Reinforcement Learning for Smart Contract Fuzzing
Qianqian Pang,
Xin Yin,
Tingting Bi,
Lingfeng Bao,
Chao Ni, and
Xiaohu Yang
(Zhejiang University, China; University of Melbourne, Australia)
Smart contracts underpin a wide range of decentralized applications—from financial services to supply-chain management—but their immutability and direct control of assets magnify the impact of any security bugs. Although many fuzz approaches have been proposed and have demonstrated their effectiveness in uncovering vulnerabilities, existing methods often rely on unguided random mutation scheduling, generate redundant inputs, and fail to adapt to smart contract-specific characteristics. To overcome these challenges, we present FuzzMaster, a feedback-driven fuzzing framework that combines deep reinforcement learning (DRL) with lightweight probabilistic scheduling to steer mutation selection at runtime intelligently. By continuously analyzing execution feedback—code coverage, function-call sequences, and vulnerability signals—FuzzMaster’s DRL agent and probabilistic tables prioritize high-impact mutations and avoid wasted effort on redundant seeds. On standard VeriSmart and SmartBugs benchmarks, FuzzMaster achieves a 66.2% detection rate with 100% precision (versus 46.9% for ItyFuzz and 43.1% for Confuzzius) and uncovers most bugs within the first second of execution. Meanwhile, in real-world Ethereum contracts, FuzzMaster identified 97 vulnerabilities in 6 categories. These results demonstrate that dynamic, vulnerability-aware mutation scheduling can dramatically improve both the efficiency and effectiveness of smart contract fuzz testing.
Article: fse26mainb-p273-p
Reducing the TCB of SGX-Oriented LibOSes at Runtime
Donghui Yu,
Dahan Pan,
Fengwei Zhang,
Haoran Fang,
Ya Fang, and
Yuanyuan Zhang
(Shanghai Jiao Tong University, China)
Intel Software Guard Extensions (SGX) provides a trusted execution environment (TEE) for applications to protect runtime code and data from the untrusted environment. All code residing in the enclave, including LibOSes and the libraries, is part of the Trusted Computing Base (TCB). The TCB size closely correlates with the potential vulnerabilities and the system’s attack surface. Vulnerabilities in the enclave can be exploited to tamper with the control flow of the enclave program and potentially lead to data breaches. Existing SGX frameworks present a dilemma: either include a large LibOS in the enclave to ensure functionality, which inflates the TCB and attack surface, or move the OS out for a minimal TCB, which severely restricts application support or incurs high-overhead, insecure interactions with the untrusted world.
In this paper, we introduce DynaTCB, a runtime framework designed to balance security and functionality. To enhance security, DynaTCB dynamically adjusts the TCB according to runtime program behavior. To preserve functionality, it logically removes unneeded code from the TCB instead of physically removing it from the enclave, thus avoiding the drawbacks of frequent untrusted interactions. It performs binary-level analysis without requiring source code. Experimental results show that DynaTCB achieves a code reduction of over 95% for the Coreutils suite and more than 80% for real-world applications with 14.2% overhead. Additionally, DynaTCB successfully breaks the gadget chains in SGX and mitigates several Common Vulnerabilities and Exposures (CVEs), affirming its potential to significantly enhance security in SGX environments.
Article: fse26mainb-p290-p
Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
Sanjeepan Sivapiran and
Gias Uddin
(York University, Canada)
Large Language Model (LLM) alignment trains an LLM using preference data to produce outputs that better meet established quality standards (e.g., to mitigate toxic or biased responses in text generation, or to enforce coding best practices in code generation). While LLM alignment techniques are studied for non-coding tasks, we know little about their usefulness for coding tasks. Intuitively, for code generation tasks using LLMs, alignment techniques could help produce coding solutions that better adhere to established software engineering standards, such as more secure code and better coding practices. However, it is unclear whether LLM code alignment could support both functional requirements (producing executable, correct code) and non-functional requirements (code readability, style, maintainability). It is also unknown whether alignment for a code LLM should begin with the base pretrained version or the finetuned (i.e., instruction-tuned) version of the LLM. In this paper, we offer insights on the above two research questions by conducting an empirical study. We studied five state-of-the-art (SOTA) LLMs using two widely used LLM alignment techniques: Direct Preference Optimization (DPO) and BoNBoN. For each training record, we created a preference pair of accepted and rejected instances by using the SelfCodeAlign pipeline. DPO and BoNBoN are reward-free models, i.e., they eliminate the need for multiple reward scores for output preferences. We tuned each LLM using the two alignment techniques in two settings: pretrained and finetuned versions of an LLM. We evaluated functional requirements using four SOTA benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional requirements using the CODAL benchmark, which evaluates code quality across five dimensions derived from software engineering practices.
We find that pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant (CodeLlama-7b: +75% non-functional, Llama3-8b: +42% functional). However, the pretrained variant is generally less accurate than its finetuned variant. In contrast, finetuned-to-aligned pathways offer smaller performance improvements or, in some cases, degradation in the aligned variant relative to its finetuned variant. This means that while the base pretrained version is less accurate than its finetuned variant, alignment reduces the performance gap between pretrained and finetuned variants. Non-functional requirements improve more consistently than functional requirements via alignment. Based on these findings, we provide nine recommendations to guide alignment for code LLMs.
Article: fse26mainb-p293-p
TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing
Teyu Lin,
Minghao Fan,
Huaxun Huang,
Zhirong Shen, and
Rongxin Wu
(Xiamen University, China)
Dynamic languages (such as Python and JavaScript) offer flexibility and simplified type handling for programming, but this can also lead to an increase in type-related errors and additional overhead for compile-time type inference. As a result, type inference for dynamic languages has become a popular research area. Existing approaches typically achieve type inference through static analysis, machine learning, or large language models (LLMs). However, current work only focuses on the direct dependencies of variables related to type inference as the context, resulting in incomplete contextual information and thus affecting the accuracy of type inference.
To address this issue, this paper proposes a method called TypePro, which leverages LLMs for type inference in dynamic languages. TypePro supplements contextual information by conducting inter-procedural code slicing. Then, TypePro proposes a set of candidate complex types based on the structural information of data types implied in the slices, thereby addressing the lack of domain knowledge of LLMs.
We conducted experiments on the ManyTypes4Py and ManyTypes4TypeScript datasets, achieving Top-1 exact match (EM) rates of 88.9% and 86.6%, respectively.
Notably, TypePro improves the Top-1 Exact Match by 7.1 and 10.3 percentage points over the second-best approach, showing the effectiveness and robustness of TypePro.
Article: fse26mainb-p301-p
Understanding and Predicting Accepted Code Suggestions in AI-Assisted Programming
Jing Jiang,
Liehao Li,
Jinyun Hou,
Xin Tan, and
Li Zhang
(Beihang University, China)
AI-assisted programming tools are widely adopted, yet their practical utility is often undermined by undesired suggestions that interrupt developer workflows and cause frustration. While existing research has explored developer-AI interactions when programming qualitatively, a significant gap remains in quantitative analysis of developers’ acceptance of AI-generated code suggestions, partly because the necessary fine-grained interaction data is often proprietary. To bridge this gap, this paper conducts an empirical study using 66,239 industrial developer-AI interactions from a large technology company. We analyze features that are significantly different between accepted code suggestions and rejected ones. We find that accepted suggestions are characterized by significantly higher historical acceptance counts and ratios for both developers and projects, longer generation intervals, shorter preceding code context in the project, and older IDE versions. Based on these findings, we introduce CSAP (Code Suggestion Acceptance Prediction) to predict whether a developer will accept the code suggestion before it is displayed. Our evaluation of CSAP shows that it achieves an accuracy of 0.973 and 0.922 on the imbalanced and balanced datasets, respectively. Compared to a large language model baseline and an in-production industrial filter, CSAP improves the accuracy by 12.6% and 69.5% on the imbalanced dataset, and by 87.0% and 140.1% on the balanced dataset. Our results demonstrate that targeted personalization is a powerful approach for filtering out code suggestions with predicted rejection and reducing developer interruption. To the best of our knowledge, it is the first quantitative study of code suggestion acceptance on large-scale industrial data, and this work also sheds light on an important research direction of AI-assisted programming.
Article: fse26mainb-p302-p
Automated Detection of Configuration-Specific Security Vulnerabilities via Patch Analysis
Felipe de Sant'Anna Paixão,
Joanna C. S. Santos,
Paulo Anselmo da Mota Silveira Neto,
Daniel Sadoc Menasche,
Gustavo Bittencourt Figueiredo, and
Eduardo Santana de Almeida
(Federal University of Bahia, Brazil; University of Notre Dame, USA; Federal Rural University of Pernambuco, Brazil; Federal University of Rio de Janeiro, Brazil)
We study how security patches in highly configurable C/C++ systems map onto the space of compile-time variants. We formalize the Vulnerability Impact Condition (VIC), a Boolean predicate over configuration options that denotes all variants that contained the original flaw, and introduce PatchLens, a purely static technique that recovers VICs by aligning AST-level patch hunks with source-level presence conditions and resolving file inclusion via lightweight build system analysis. Evaluating PatchLens on 1,192 Linux kernel, 289 FFmpeg, and 100 PHP patches, we compute precise, human-readable VICs without the need to compile any system variant. The resulting predicates are compact (avg. 1.84 variables for Linux, 3.23 for FFmpeg, 1.04 for PHP) and show that only a small fraction of vulnerabilities are system-wide, and that these carry higher CVSS scores; meanwhile, CVE texts almost never encode the required options (≈1% average recall), motivating automated enrichment of CVE descriptions with VICs. PatchLens and the accompanying dataset enable immediate applications in CI (variant-aware triage and test selection), targeted sampling and fuzzing, and feature risk scoring, offering a scalable, explainable path to vulnerability assessment in highly configurable software.
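As a minimal illustration of what a VIC denotes, the sketch below treats it as a Boolean function over configuration options and enumerates the variants it selects. The option names, the predicate, and the helper functions are hypothetical and are not taken from the paper or its dataset; PatchLens itself recovers VICs statically rather than by enumeration.

```python
from itertools import product


def affected_variants(options, vic):
    """Return every configuration (option -> bool) for which the VIC holds."""
    variants = (
        dict(zip(options, values))
        for values in product([False, True], repeat=len(options))
    )
    return [variant for variant in variants if vic(variant)]


def is_system_wide(options, vic):
    """A vulnerability is system-wide when its VIC holds for every variant."""
    return len(affected_variants(options, vic)) == 2 ** len(options)


# Hypothetical option names, chosen only for illustration.
options = ["CONFIG_NET", "CONFIG_IPV6", "CONFIG_DEBUG"]


def vic(config):
    # Hypothetical VIC: the flaw is present only when both options are enabled.
    return config["CONFIG_NET"] and config["CONFIG_IPV6"]


hits = affected_variants(options, vic)
print(f"{len(hits)} of {2 ** len(options)} variants affected")
```

Exhaustive enumeration is exponential in the number of options, so it only serves to make the definition concrete for a handful of options.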
Artifacts Available
Article: fse26mainb-p311-p
Property Refinement in Linear Temporal Logic: Formal Semantics and Algorithms for Software Verification
Luca Brodo,
Giuseppe Scalora, and
Stefan Henkler
(Hamm-Lippstadt University of Applied Sciences, Germany)
Model checking using temporal logic is a key aspect of formal verification of modern complex software systems. These systems are often the result of distributed development processes, involving multiple teams and iterative design cycles. This complexity is mirrored in the corresponding formal specifications, which often consist of a large number of temporal logic properties that need to be verified against a system model. Many of these properties overlap semantically, yet traditional verification treats them as independent, resulting in redundant checks that waste computational resources and inflate engineering effort.
We address this by introducing and operationalizing a formal theory of property refinement for temporal logic into a concrete methodology that automatically identifies redundant properties from a verification suite. This is achieved by first partitioning specifications into equivalence classes based on shared atomic propositions, followed by an analysis of intra-class refinement relations to construct the minimal sufficient subset. Our extensive empirical evaluation confirms the practical viability of our approach, demonstrating it can reduce the number of required verification tasks by up to 75% and accelerate model checking by up to three orders of magnitude, with a one-time associated overhead cost of 0.035% for the refinement analysis. The results of the evaluation confirm that the benefits of this approach are threefold: it reduces the computational requirements for formal verification; it allows engineers to focus on the core requirements of the system, thereby reducing engineering effort; finally, the one-time negligible investment in refinement yields compounding returns, making it particularly advantageous for agile and long-term development lifecycles.
This work thus establishes property refinement analysis as a key technique for scaling requirement engineering and software verification to modern complex software systems.
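The two steps described above (partitioning, then intra-class refinement analysis) can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' algorithm: "shared atomic propositions" is read here as identical proposition sets, the refinement relation is supplied as precomputed (stronger, weaker) pairs rather than decided by an LTL procedure, and all names are hypothetical.

```python
from collections import defaultdict


def partition_by_atoms(atoms_of):
    """Group property names whose atomic-proposition sets are identical."""
    classes = defaultdict(list)
    for name, atoms in atoms_of.items():
        classes[frozenset(atoms)].append(name)
    return list(classes.values())


def minimal_subset(names, refines):
    """Drop every property that is implied by another property in the class.

    `refines` holds (stronger, weaker) pairs and is assumed to be
    transitively closed. For mutually refining (equivalent) properties,
    the one listed first is kept.
    """
    kept = []
    for i, q in enumerate(names):
        redundant = False
        for j, p in enumerate(names):
            if p == q or (p, q) not in refines:
                continue
            strictly_stronger = (q, p) not in refines
            if strictly_stronger or j < i:
                redundant = True
                break
        if not redundant:
            kept.append(q)
    return kept


# Hypothetical suite: "next_ack" stands for G(req -> X ack), which refines
# "eventually_ack", standing for G(req -> F ack).
atoms_of = {
    "next_ack": {"req", "ack"},
    "eventually_ack": {"req", "ack"},
    "no_error": {"err"},
}
refines = {("next_ack", "eventually_ack")}

suite = [minimal_subset(cls, refines) for cls in partition_by_atoms(atoms_of)]
print(suite)
```

Only the retained properties would then be passed to the model checker; verifying a stronger property makes the check of each property it refines redundant.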
Article: fse26mainb-p322-p
OdoTest: An Automated Testing Approach for Odometry Systems
Jixiang Zhou,
Mingfei Cheng,
Shuncheng Tang,
An Guo,
Xiaofei Xie,
Yinxing Xue, and
Lijun Zhang
(University of Science and Technology of China, China; Singapore Management University, Singapore; Hong Kong Polytechnic University, China; Institute of AI for Industries, China; Institute of Software at Chinese Academy of Sciences, China)
Reliable pose estimation is critical for intelligent systems, including autonomous vehicles, unmanned aerial vehicles, and virtual reality applications. While visual-inertial odometry (VIO) has made significant advancements in estimating pose, its performance can still be affected by sensor noise, environmental variations, and calibration errors. To evaluate the performance of VIO, existing testing methods rely on real-world datasets or manually degraded data, where sensor measurements or images are artificially modified according to predefined, fixed perturbation patterns. These approaches are costly, require time-consuming annotation, and rely on ad hoc modifications that limit scenario diversity and hinder the exploration of challenging cases.
In this paper, we design and implement OdoTest, an automated testing framework for VIO systems.
OdoTest is equipped with sensor-specific transformation operators, including IMU perturbations, camera degradations, and sensor calibration errors, that systematically generate diverse and realistic test scenarios. Moreover, OdoTest adopts an odometry fitness-guided testing strategy to prioritize scenario generation, improving testing efficiency. By leveraging odometry-specific metamorphic relations (MRs), OdoTest can automatically detect odometry errors without requiring manual ground-truth labels for test scenarios. We evaluate OdoTest on twelve state-of-the-art VIO systems to answer three main research questions: 1) the effectiveness of transformation operators in revealing odometry errors; 2) the ability of OdoTest to generate error-triggering scenarios; and 3) whether retraining on these scenarios can improve odometry performance. Results show that OdoTest’s transformation operators effectively expose odometry errors, while its fitness-guided scenario generation efficiently creates challenging test cases. Moreover, retraining with these scenarios significantly enhances both estimation accuracy and system robustness, which demonstrates the usefulness of OdoTest.
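To make the idea of a metamorphic relation concrete, the sketch below checks one plausible relation: the trajectory estimated from a perturbed input should stay within a tolerance of the trajectory estimated from the original input, so no ground-truth labels are needed. The relation, the tolerance values, and the toy dead-reckoning estimator are all assumptions made for illustration; the abstract does not specify OdoTest's actual MRs or operators.

```python
import math


def dead_reckoning(steps):
    """Toy 2D estimator: accumulate per-step displacements into positions."""
    x, y, trajectory = 0.0, 0.0, []
    for dx, dy in steps:
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


def trajectory_error(a, b):
    """Root-mean-square distance between two equally long 2D trajectories."""
    squared = [(ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


def violates_mr(estimate, source_input, followup_input, tolerance):
    """Hypothetical MR: the follow-up estimate must stay near the source estimate."""
    error = trajectory_error(estimate(source_input), estimate(followup_input))
    return error > tolerance


# Source input: ten unit steps along x. Follow-up input: the same steps
# with a small constant lateral bias, standing in for a sensor perturbation.
source = [(1.0, 0.0)] * 10
biased = [(dx, dy + 0.05) for dx, dy in source]

print(violates_mr(dead_reckoning, source, biased, tolerance=0.2))
```

Because the bias accumulates over the ten steps, the toy estimator drifts far enough to violate the relation at a tolerance of 0.2 but not at 0.5.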
Article: fse26mainb-p323-p
Evaluating LLM-Based Regression Test Generation
Jing Liu,
Seongmin Lee,
Eleonora Losiouk, and
Marcel Böhme
(MPI-SP, Germany; University of California at Los Angeles, USA; University of Padua, Italy)
Large Language Models (LLMs) have shown tremendous promise in automated software engineering. In this paper, we investigate the opportunities of LLMs for just-in-time regression test generation for programs, like parsers, interpreters, or compilers, that take highly structured, human-readable inputs. When a new bug fix or code change is committed, the repository (as part of the CI/CD workflow) runs an LLM for a few minutes to generate regression test cases for that commit that exercise the changed code and potentially trigger any bugs.
Specifically, we investigate LLM-based regression test generation as a machine translation task that takes the developer-provided commit message, the code change, and the name of the input format (e.g., XML) and produces regression test cases for the described change in the given input format. In our experiments testing 72 commits to Mujs, Libxml2, Poppler, JerryScript, Z3, PHP, JQ, and MicroPython, our feedback-directed, zero-shot LLM-based prototype Cleverest performed well, even if we did not provide the code change. In under 2 minutes, on average, Cleverest found as many bugs as the state-of-the-art directed greybox fuzzer WAFLGo in 24 hours, even though WAFLGo started with a commit-reaching seed corpus in the majority of cases. If we amplify the Cleverest-generated test cases by using them as a seed corpus in coverage-guided greybox fuzzing, the number of bugs found doubles. We call this integration with fuzzing ClevFuzz.
In addition, we found that some commit messages are more expressive than others, thus we wonder how this impacts the effectiveness of Cleverest. Our results above demonstrate that Cleverest picks up on the change intention. For instance, if the commit message describes that this patch changes how floating point variables are treated in the Mujs JavaScript interpreter, then Cleverest generates JavaScript programs that contain floating point variables. To study the impact of expressiveness, we change the commit messages minimally to reduce and increase the information in the commit message, respectively, and find a substantial impact on effectiveness. For instance, adding 17 words on average (max. 43) to make ineffective commit messages more expressive significantly increased the number of bugs found.
Artifacts Available
Article: fse26mainb-p337-p
ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering
Md Awsaf Alam Anindya,
Showvik Biswas,
Anindya Iqbal,
Jaydeb Sarker, and
Amiangshu Bosu
(Bangladesh University of Engineering and Technology, Bangladesh; University of Nebraska Omaha, USA; Wayne State University, USA)
Toxic interactions during code reviews can undermine teamwork and hinder productivity in software engineering (SE) teams. While prior studies explore toxicity detection and empirical investigation, they lack real-time detoxification tools to support the SE community. To address this gap, we present ToxiShield, a browser extension for GitHub pull requests that is built using three modules: i) Toxicity Filter -- to identify whether a text is toxic, ii) Communication Coach -- to facilitate just-in-time fine-grained toxicity categorization with explanations, and iii) Reframer -- to generate a revised, constructive alternative to a toxic text. For each module, we trained and evaluated multiple deep learning models and Large Language Models (LLMs) to identify the best choice. A BERT-based binary detection model, trained on 38,761 code review samples, achieves 98% accuracy and an F1-score of 97% and is selected for the Toxicity Filter module. For the Communication Coach, prompt-tuned Claude 3.5 Sonnet achieved the best performance with 39% MCC and 42% F1 in multiclass toxicity classification with detailed reasoning. For the Reframer, we evaluated five LLMs using a fine-tuning strategy on a dataset of 10,120 code review comments. The fine-tuned Llama 3.2 model achieves 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. We further validated ToxiShield through a human evaluation using the Technology Acceptance Model with 10 participants, confirming its perceived usefulness and ease of adoption. ToxiShield sets a benchmark for advancing constructive communication in software engineering, driving inclusivity and healthier collaboration in open-source communities.
Article: fse26mainb-p349-p
RepoReasoner: Evaluating Repository-Level Code Reasoning Ability of Long-Context Language Models
Yanlin Wang,
Suiquan Wang,
Yanli Wang,
Bowen Zhang,
Daya Guo,
Jiachi Chen, and
Zibin Zheng
(Sun Yat-sen University, China)
Recent large language models (LLMs) have shown strong performance on software engineering tasks, yet most existing benchmarks evaluate code reasoning at the function level, where all relevant information is localized. This setting fails to reflect real-world development, which requires reasoning across multiple files and complex dependency structures. We introduce RepoReasoner, a benchmark for evaluating repository-level code reasoning. It assesses two complementary abilities: Output Prediction, which measures fine-grained, stateful execution reasoning across files, and Call Chain Prediction, which evaluates high-level architectural dependency understanding under noisy context. Our benchmark is constructed through a multi-stage pipeline that leverages dynamic tracing of pytest executions to obtain ground-truth call chains, along with LLM-based I/O rewriting to reduce memorization effects. We evaluate seven state-of-the-art LLMs. Even under oracle context, the best-performing model achieves only 69.1% Pass@1 on Output Prediction, indicating that cross-file reasoning remains a major challenge. In Call Chain Prediction, models exhibit high precision but low recall, suggesting limited multi-hop dependency understanding. Furthermore, performance drops on rewritten data reveal partial reliance on memorization, and longer contexts do not consistently improve results due to noise. These findings highlight fundamental limitations in current LLMs’ repository-level reasoning and motivate future work on structured architectural understanding and cross-file inference.
Article: fse26mainb-p365-p
VulInstruct: Teaching LLMs Root-Cause Reasoning for Vulnerability Detection via Security Specifications
Hao Zhu,
Jia Li,
Cuiyun Gao,
Jiaru Qian,
Yihong Dong,
Huanyu Liu,
Lecheng Wang,
Ziliang Wang,
Xiaolong Hu, and
Ge Li
(Peking University, China; Tsinghua University, China; Harbin Institute of Technology, China; New H3C Technologies, China)
Large language models (LLMs) have achieved remarkable progress in code understanding and analysis tasks. However, even state-of-the-art LLMs demonstrate limited performance in vulnerability detection tasks and struggle to distinguish vulnerable code from patched code. We argue that a key reason for this limitation is that LLMs lack an understanding of security specifications, that is, the expectations defined by developers and security teams about how code should behave to remain safe. When the actual behavior of the code differs from these expectations and introduces a security risk, it becomes a potential vulnerability. However, such knowledge is rarely explicit in training data, leaving models unable to reason about the root causes of security flaws. To address this challenge, we propose VulInstruct, a specification-guided approach that systematically extracts reusable security specifications from historical vulnerabilities to instruct the detection of new ones. Specifically, VulInstruct designs two automatic pipelines to construct a specification knowledge base from complementary perspectives: (i) General specifications, extracted from high-quality patches across diverse projects, capturing fundamental safe behaviors accumulated across the open-source ecosystem; and (ii) Domain-specific specifications, context-dependent expectations repeatedly violated in particular repositories or domains that are relevant to the target code under analysis. Before analyzing new code, VulInstruct leverages this specification knowledge base to retrieve relevant past cases and their associated specifications, enabling LLMs to reason about expected safe behaviors rather than relying solely on surface patterns. We evaluate VulInstruct under strict evaluation criteria requiring both correct predictions and valid reasoning.
On the PrimeVul dataset, VulInstruct achieves 45.0% F1-score (32.7% improvement) and 37.7% recall (50.8% improvement) compared to the strongest baselines, while uniquely detecting 24.3% of all identified vulnerabilities—2.4× more than any baseline. In pair-wise evaluation distinguishing vulnerable from patched code, VulInstruct also achieves a 32.3% relative improvement over the best baseline. Beyond benchmarks, VulInstruct discovered a previously unknown high-severity vulnerability in production code by recognizing violations of extracted specifications, demonstrating its practical value for real-world vulnerability discovery. All code and supplementary materials are available at https://github.com/zhuhaopku/VulInstruct.
Article: fse26mainb-p370-p
Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study
Mingwei Liu,
Zheng Pei,
Yanlin Wang,
Zihao Wang,
Zikang Li,
Enci Lin,
Xin Peng, and
Zibin Zheng
(Sun Yat-sen University, China; Zhuhai Key Laboratory of Trusted Large Language Models, China; Fudan University, China)
In low-resource framework development (e.g., HarmonyOS), large language models (LLMs) often lack sufficient pre-training exposure, resulting in poor code generation performance. Although they generally preserve programming logic across languages, they frequently fail on framework-specific APIs and syntax, revealing a gap between learned algorithmic knowledge and unfamiliar framework conventions. Consequently, even advanced models such as GPT-4o struggle to produce correct code without prior exposure.
Inspired by these challenges, we propose APIKG4SYN, a framework that leverages API knowledge graphs to synthesize API-oriented question–code pairs without requiring executable environments. It incorporates both single-API and multi-API information, with the latter guided by uncertainty estimation (UE) and Monte Carlo Tree Search (MCTS), to construct high-quality fine-tuning data. For evaluation, we select HarmonyOS as a case study due to its accessible documentation and growing ecosystem, and build the first benchmark for its code generation. Experimental results show that fine-tuning Qwen2.5-Coder-7B with APIKG4SYN achieves a pass@1 of 25.00%, outperforming untuned GPT-4o (17.59%). We further observe that larger volumes of data generated by APIKG4SYN consistently lead to better fine-tuning performance, and that the optimal Single-API to Multi-API ratio is 8:2. Ablation studies also confirm the necessity and effectiveness of each component in our framework. These findings highlight the effectiveness of API-oriented data in enhancing LLM performance for low-resource software development scenarios.
Article: fse26mainb-p376-p
Look Before You Leap: Context-Sensitive GUI Grounding for Boosting Automated Extended Reality (XR) Testing
Shuqing Li,
Binchang Li,
Yepang Liu,
Cuiyun Gao,
Jianping Zhang,
Shing-Chi Cheung, and
Michael R. Lyu
(Chinese University of Hong Kong, China; Harbin Institute of Technology, China; Southern University of Science and Technology, China; Harbin Institute of Technology, Shenzhen, China; Hong Kong University of Science and Technology, China)
In recent years, Extended Reality (XR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual or virtual-real environments. Users can interact with XR applications (apps) through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). IGEs constitute the fundamental elements of the XR GUI and embody rich semantic information. The accurate recognition and precise understanding of these IGEs are instrumental, serving as the foundation of GUI grounding, which can facilitate downstream tasks, including automated XR testing. A straightforward XR test generator can interact randomly within the app’s 3D environment, which leaves it trapped in uninteractable space and results in an ineffective and inefficient testing process. In contrast, a more intelligent test generator, informed by the accurate locations and semantics of IGEs, can make wiser decisions on interaction targets and orders, forming test sequences that cover more functionalities faster. The most recent IGE detection approaches in SE are designed for 2D mobile apps and typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on IGE research tailored to XR apps.
In this paper, we propose the first zero-shot context-sensitive interactable GUI element detection framework for Extended Reality apps, named Orienter. Rather than relying on generic visual grounding, which fails in 3D environments, Orienter introduces a structured workflow tailored to XR constraints. It first synthesizes XR-specific semantic contexts (e.g., global interaction paradigms and 3D spatial properties) before performing detection. To overcome severe spatial hallucinations inherent in LMMs, the detection process is iterated within an XR-constrained reflection loop. Specifically, Orienter contains three components: (1) Semantic context comprehension for capturing the apps’ GUI context, (2) Reflection-directed IGE candidate detection for identifying and localizing valid GUI elements based on multi-perspective description guided IGE detection, as well as feedback-directed reflection, and (3) Context-sensitive interactability classification, which integrates semantic contexts for interactability prediction. To evaluate our approach and facilitate follow-up research, we construct the first benchmark dataset, which contains 1,552 images from 100 industrial-setting apps on Steam, with 4,470 interactable annotations across 766 semantic categories. Extensive experiments on the dataset demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches, including general or GUI-automation-targeted vision language models, and deep-learning-based models, surpassing their F1 Score by at least 34.9% and 20.1× in distinguishing the interactability and semantics of the IGEs, respectively. Orienter is beneficial for boosting the performance of automatic testing by isolating the interactable action space from the whole space, regardless of the testing strategies employed. Experiments demonstrate that Orienter-guided testing covers 103.1% more IGEs with 125.7% more effective interactions than testing without action space isolation.
Article: fse26mainb-p380-p
QuanForge: A Mutation Testing Framework for Quantum Neural Networks
Minqi Shao,
Shangzhou Xia, and
Jianjun Zhao
(Kyushu University, Japan)
With the growing synergy between deep learning and quantum computing, Quantum Neural Networks (QNNs) have emerged as a promising paradigm by leveraging quantum parallelism and entanglement. However, testing QNNs remains underexplored due to their complex quantum dynamics and limited interpretability. Developing a mutation testing technique for QNNs is promising but requires addressing stochastic factors, including the inherent randomness of mutation operators and quantum measurements. To tackle these challenges, we propose QuanForge, a mutation testing framework specifically designed for QNNs. We first introduce statistical mutation killing to provide a more reliable criterion. QuanForge incorporates nine post-training mutation operators at both gate and parameter levels, capable of simulating various potential errors in quantum circuits. Finally, a mutant generation algorithm is formalized that systematically produces effective mutants, thereby enabling a robust and reliable mutation analysis. Through extensive experiments on benchmark datasets and QNN architectures, we show that QuanForge can effectively distinguish different test suites and localize vulnerable circuit regions, providing insights for data enhancement and structural assessment of QNNs. We also analyze the generation capabilities of different operators and evaluate performance under simulated noisy conditions to assess the practical feasibility of QuanForge for future quantum devices.
Article: fse26mainb-p395-p
ReFLAIR: Detecting Responsive Layout Reflow Issues using Multimodal Generative AI
Yirui He,
Ziyao He,
Syed Fatiul Huq, and
Sam Malek
(University of California at Irvine, USA)
With over 60 percent of global Internet traffic originating from mobile devices, Responsive Web Design (RWD) has become essential for ensuring seamless user experiences across diverse screen sizes and resolutions. The Web Content Accessibility Guidelines require that both information and functionality remain accessible when reflow occurs. However, existing checkers or research tools have limitations: they either employ static analysis that fails to capture how content is actually displayed to users or focus on only one accessibility aspect, such as keyboard operation. Consequently, no prior study has comprehensively addressed both information and functionality loss during reflow for Graphical User Interfaces (GUIs).
This paper introduces ReFLAIR (Reflow Fault Localization using AI-based Responsive analysis), a multi-modal generative AI–driven approach for dynamically detecting reflow issues that cause loss of information or functionality in the GUI. ReFLAIR systematically extracts informative and actionable widgets, compares their presence and behavior across original and reflowed layouts, and employs both scrolling and large language model–guided expansion to uncover hidden interface widgets. We evaluate ReFLAIR on a dataset of 24 diverse webpages drawn from popular sites, prior benchmarks, and newly released webpages. Results show that ReFLAIR outperforms five state-of-the-art techniques, achieving precision improvements of at least 20.49% and recall improvements of at least 55.40%, while maintaining reasonable computational and runtime cost. An ablation study confirms that dynamic exploration (i.e., scrolling and expansion) is essential for high accuracy. We evaluated scalability and generalizability by extending the dataset to 36 webpages, covering 28 domains and higher complexity, and experimenting with alternative models and viewports. The results reinforce ReFLAIR’s consistency across diverse subjects and configurations. In summary, our approach contributes to accessibility testing by providing an effective, scalable, and cost-efficient solution for identifying reflow issues in RWD.
Article: fse26mainb-p429-p
TORAI: Multi-source Root Cause Analysis for Blind Spots in Microservice Service Call Graph
Luan Pham,
Huong Ha,
Xiuzhen Zhang, and
Hongyu Zhang
(RMIT University, Australia; Chongqing University, China)
Existing multi-source root cause analysis (RCA) methods for microservice systems assume all services have traces to construct a service call graph. However, this assumption is not practical as microservice systems evolve rapidly and may contain blackbox services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. In the presence of blind spots, the performance of existing multi-source RCA methods may be affected, as they only diagnose visible services on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that effectively pinpoints fine-grained root causes without relying on the service call graph. Instead, TORAI first measures anomaly severity using available multi-source telemetry data. It then performs clustering to group services based on their severity symptoms and conducts causal analysis to rank services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. TORAI provides an unsupervised approach that leverages available multi-source telemetry data for RCA without requiring a constructed service call graph or further intrusive actions, thus addressing the limitations of existing methods. Our experiments on three benchmark systems demonstrate that TORAI outperforms state-of-the-art baselines remarkably in the presence of blind spots. Performance on real-world failures further shows that TORAI can accurately pinpoint the root causes in top-3 recommendations.
Article: fse26mainb-p431-p
One Size Does Not Fit All: Revisiting Code Context Engineering for Repository-Level Code Generation
Yichen Li,
Qiye Lin,
Yun Peng,
Zhihan Jiang,
Jinyang Liu,
Chaozheng Wang,
Yintong Huo, and
Cuiyun Gao
(Chinese University of Hong Kong, China; Harbin Institute of Technology, Shenzhen, China; Singapore Management University, Singapore)
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation at the function or file level. However, they achieve limited performance on repository-level code generation because of the complex repository context, which contains a substantial number of files and functions. To address this challenge, code context engineering methods have been proposed to accurately extract the relevant code context required by such tasks. These methods belong to three dominant paradigms: 1) Similarity-based In-Context Learning (ICL), which retrieves similar code examples in the repository as demonstrations; 2) Static Analysis, which captures relevant code context based on structural dependencies in the repository; and 3) Navigation-based Paradigms, which invoke LLMs to dynamically explore repositories for context identification.
Despite the prevalence of code context engineering approaches, their individual contributions are often coupled within complex agents in repository-level code generation, making it difficult to isolate and evaluate the effectiveness of each paradigm. This paper presents the first large-scale and systematic empirical study that compares three code context engineering paradigms independently. We evaluate ten representative code context engineering methods based on the three paradigms, with eight popular LLMs on this task. We also propose a new metric named Dependency Collection Rate (DCR) and efficiency metrics to enable the direct comparison of code context engineering methods, rather than observing their impacts only on the final code generation performance.
Our findings reveal fundamental trade-offs: static analysis provides the most reliable balance between effectiveness and cost for function-level generation, while navigation-based approaches become increasingly advantageous as task complexity grows. However, navigation requires powerful models and incurs 10-20× higher computational costs. Based on these findings, we provide actionable implications for AI coding researchers and software developers to guide the design and deployment of context-aware coding tools.
Article: fse26mainb-p432-p
Hallucinations in LLM-Based Code Summarization: Unveiling, Detection, and Mitigation
Guanghua Wan,
Yuanning Feng,
Yao Wan,
Zhaoyang Chu,
Zhangqian Bi,
Junxiao Han,
Zhou Zhao,
Hongyu Zhang,
Pingpeng Yuan,
Xuanhua Shi, and
Hai Jin
(Huazhong University of Science and Technology, China; Hangzhou City University, China; Zhejiang University, China; Chongqing University, China)
Code summarization plays a vital role in program comprehension and software maintenance by generating natural language descriptions to summarize the semantics of code. While Large Language Models (LLMs) have shown remarkable performance in this area, recent empirical studies reveal a critical limitation: LLMs are prone to hallucinations, producing summaries that are factually inaccurate or unfaithful to the source code, potentially misleading developers. In this paper, we propose to unveil, detect, and mitigate hallucinations in LLM-based code summarization. First, we construct Hallu-Eval, a novel dataset for unveiling hallucination phenomena and rigorously evaluating the effectiveness of hallucination detection and mitigation in LLM-based code summarization. It comprises both original code snippets to capture naturally occurring hallucinations and their semantically perturbed counterparts, which are designed to systematically induce challenging logical hallucinations, all complemented with manual hallucination annotations on a curated testbed of 800 code-summary pairs. Next, we propose Hallu-Det, a synergistic approach that combines direct entity-level detection to identify explicit hallucinations with a synonymous mutation-based refinement to reliably confirm or refute more ambiguous cases. Finally, we introduce Hallu-Shield, an inference-time mitigation approach that leverages an external value model to guide LLMs toward producing more faithful summaries without costly retraining of the LLM itself. Extensive experiments show that Hallu-Eval effectively triggers hallucinations, increasing the hallucination rate of models such as Qwen2.5-Coder-7B from 17% to 97% on perturbed code. Our detection approach, Hallu-Det, achieves the best performance among baselines, reaching an F1-score of 0.95 for summaries generated by Qwen2.5-Coder-7B. Moreover, our mitigation method, Hallu-Shield, reduces hallucination rates. 
For example, it lowers the rate from 66% to 59%, a 10.6% relative reduction, on DeepSeek-Coder-6.7B, while simultaneously improving summary quality, achieving a 74.0% win rate evaluated by an LLM-as-a-judge majority vote ensemble.
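The relative reduction quoted above follows directly from the two reported rates. A minimal arithmetic check, where the helper name `relative_reduction` is ours and not part of the paper's artifacts:

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Hallucination rates reported for DeepSeek-Coder-6.7B, in percent.
print(round(relative_reduction(66.0, 59.0), 1))  # 10.6
```

The 7-point absolute drop and the 10.6% relative drop describe the same result; the abstract reports the relative figure.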
Article: fse26mainb-p438-p
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
Pascal Joos,
Islem Bouzenia, and
Michael Pradel
(CISPA Helmholtz Center for Information Security, Germany)
Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells.
Traditionally, developers must resolve these warnings manually by analyzing the warning, deciding whether to fix or suppress it, and validating the correctness of the code change.
Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality.
Prior work suggests techniques to automatically repair static analysis warnings, but these are limited to specific analysis rules, cannot perform multi-file edits, and rely on weak validation mechanisms.
This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings.
Unlike previous work, our method does not follow a predetermined algorithm.
Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning.
CodeCureAgent detects and suppresses false positives, while fixing true positives when identified.
We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite.
We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules.
Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 29.2%-34.0% in plausible-fix rate.
Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings.
The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning.
We envision CodeCureAgent cleaning existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.
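The three-step approval heuristic described in this abstract can be sketched as a short gate function. This is an illustration only, not CodeCureAgent's implementation: `PatchChecks`, `approve_patch`, and the three callables are hypothetical stand-ins for a real build, a static analysis run, and a test run.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class PatchChecks:
    """Hypothetical hooks into the project under repair."""
    build: Callable[[], bool]              # True if the patched project builds
    analyze: Callable[[], FrozenSet[str]]  # warning IDs reported after the patch
    run_tests: Callable[[], bool]          # True if the test suite passes


def approve_patch(target: str, baseline: FrozenSet[str], checks: PatchChecks) -> bool:
    """Approve a patch only if all three steps succeed, in order."""
    # Step 1: the project must still build.
    if not checks.build():
        return False
    # Step 2: the target warning must be gone and no new warning may appear.
    after = checks.analyze()
    if target in after or not (after <= baseline):
        return False
    # Step 3: the existing test suite must pass.
    return checks.run_tests()


baseline = frozenset({"W1", "W2"})
ok = PatchChecks(lambda: True, lambda: frozenset({"W2"}), lambda: True)
print(approve_patch("W1", baseline, ok))  # True
```

Ordering the checks from cheapest to most expensive lets a failing build reject a patch before the analysis and the test suite are run.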
Artifacts Available
Article: fse26mainb-p442-p
Thought Is All You Need: Smart Contract Vulnerability Detection with Thought-Augmented Large Language Model
Chaoyuan Peng,
Muhui Jiang,
Yajin Zhou, and
Lei Wu
(Zhejiang University, China; BlockSec, China)
Smart contracts are self-executing agreements with code-defined terms enabling trustless blockchain transactions. Their immutability and control over significant financial assets make them attractive attack targets, with vulnerabilities potentially causing catastrophic financial losses. Large Language Models (LLMs) have revolutionized numerous domains with remarkable capabilities in code understanding and problem-solving. Despite these advancements, recent research reveals that LLMs still face significant limitations in accurately detecting complex vulnerabilities in smart contracts.
This disparity between the capabilities of LLMs and the stringent requirements of security analysis underscores the necessity for tailored methodologies to enhance LLM-based vulnerability detection strategies.
In this paper, we propose Synapse, the first smart contract vulnerability detection framework leveraging thought-augmented LLM and fine-grained analysis under focal context. Specifically, Synapse emulates security researchers' vulnerability discovery workflow, including vulnerability pattern learning, thought instantiation, reasoning, and verification. We employ a Buffer of Vulnerability Reasoning Thoughts (BoVRT) approach for LLMs to learn and apply vulnerability-specific reasoning to concrete contracts, improving detection accuracy. We also leverage specialized reasoning and code models to optimize different stages of the vulnerability detection process. To evaluate Synapse, we collected real-world on-chain contract incidents from security company alerts not covered by existing datasets. Synapse identified 117 previously undiscovered vulnerabilities in on-chain smart contracts, including one critical vulnerability that safeguarded assets totaling $30 million from potential losses.
Article: fse26mainb-p446-p
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
Kaifeng He,
Mingwei Liu,
Chong Wang,
Zike Li,
Yanlin Wang,
Xin Peng, and
Zibin Zheng
(Sun Yat-sen University, China; Nanyang Technological University, Singapore; Fudan University, China)
Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at uncertain decision points that influence program logic. While standard strategies such as greedy decoding treat all tokens uniformly, they overlook code-specific uncertainty patterns, leading to suboptimal performance. This paper presents an empirical study revealing that many generation errors stem from token ranking mistakes at high-uncertainty steps, where the correct token is present but not top-ranked.
Motivated by these findings, we propose AdaDec, a lookahead-based uncertainty-guided adaptive decoding framework that integrates a token-level pause-then-rerank mechanism driven by token uncertainty. AdaDec learns model-specific uncertainty thresholds and applies a lookahead-based reranking strategy when uncertainty is high.
Experiments on HumanEval+, MBPP+, and DevEval benchmarks show that AdaDec improves Pass@1 accuracy by up to 20.9% in absolute terms over greedy decoding. More importantly, it consistently outperforms both competitive baselines like Beam Search and state-of-the-art adaptive decoding methods such as AdapT, while maintaining high efficiency through selective, uncertainty-triggered pausing.
Our results highlight the promise of uncertainty-aware adaptive decoding for improving both the reliability and efficiency of LLM-based code generation.
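The pause-then-rerank idea can be illustrated with a toy decoding step. The sketch below rests on assumptions the abstract does not confirm: uncertainty is measured as Shannon entropy of the next-token distribution, the threshold is a fixed constant rather than a learned, model-specific value, and `lookahead_score` is a hypothetical callable standing in for whatever lookahead scoring AdaDec uses.

```python
import math
from typing import Callable, Dict


def shannon_entropy(probs: Dict[str, float]) -> float:
    """Entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def choose_token(
    probs: Dict[str, float],
    threshold: float,
    lookahead_score: Callable[[str], float],
    top_k: int = 3,
) -> str:
    """Greedy choice when confident; rerank the top-k candidates when uncertain."""
    greedy = max(probs, key=probs.get)
    if shannon_entropy(probs) <= threshold:
        return greedy  # low uncertainty: no pause, keep the greedy token
    # High uncertainty: pause and rerank the top-k tokens by the lookahead score.
    candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
    return max(candidates, key=lookahead_score)


confident = {"a": 0.97, "b": 0.02, "c": 0.01}
uncertain = {"a": 0.40, "b": 0.35, "c": 0.25}
score = {"a": 0.1, "b": 0.9, "c": 0.3}.get  # hypothetical lookahead scores

print(choose_token(confident, 1.0, score))  # a
print(choose_token(uncertain, 1.0, score))  # b
```

Because reranking is triggered only above the threshold, the extra lookahead cost is paid at a small fraction of decoding steps, which matches the efficiency argument made in the abstract.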
Article: fse26mainb-p469-p
StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
Jiayi Mao,
Liqun Li,
Yanjie Gao,
Zegang Peng,
Shilin He,
Chaoyun Zhang,
Si Qin,
Samia Khalid,
Qingwei Lin,
Saravan Rajmohan,
Sitaram Lanka, and
Dongmei Zhang
(Tsinghua University, China; Microsoft, China; Microsoft Research, China; Renmin University of China, China; Microsoft, USA)
Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ∼94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.
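DAG-guided scheduling with parallel execution of independent steps can be sketched with Python's standard library. This is a simplified illustration rather than the scheduler from the paper: the step names, the `run_dag` function, and the wave-by-wave execution are assumptions made for the example, and the memory system is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(
    deps: Dict[str, Set[str]],
    actions: Dict[str, Callable[[], str]],
) -> List[List[str]]:
    """Run steps in dependency order; steps in the same wave run in parallel.

    `deps` maps each step to the set of steps it depends on.
    Returns the waves in the order they were executed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: List[List[str]] = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # all steps whose deps are done
            # list() forces completion and re-raises any exception from a step.
            list(pool.map(lambda step: actions[step](), ready))
            sorter.done(*ready)
            waves.append(ready)
    return waves


# Hypothetical troubleshooting steps: two independent checks after one fetch.
deps = {
    "check_cpu": {"fetch_alert"},
    "check_disk": {"fetch_alert"},
    "summarize": {"check_cpu", "check_disk"},
}
actions = {name: (lambda n=name: f"{n} done")
           for name in ("fetch_alert", "check_cpu", "check_disk", "summarize")}

print(run_dag(deps, actions))
# [['fetch_alert'], ['check_cpu', 'check_disk'], ['summarize']]
```

Executing in waves is the simplest correct policy; a finer-grained scheduler would start each step as soon as its own dependencies finish instead of waiting for the whole wave.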
Article: fse26mainb-p519-p
SWR-Bench: Assessing LLM Performance in Real-World Code Review Comment Generation
Zhengran Zeng,
Ruikai Shi,
Keke Han,
Yixin Li,
Kaicheng Sun,
Yidong Wang,
Zhuohao Yu,
Rui Xie,
Wei Ye, and
Shikun Zhang
(Peking University, China; Northwestern Polytechnical University, China)
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWR-Bench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWR-Bench employs an objective LLM-based evaluation method that aligns strongly with human judgment (∼90% agreement) by verifying if issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWR-Bench reveals that current systems underperform, and ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWR-Bench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
Article: fse26mainb-p535-p
GPU-Accelerated Flow-Sensitive Pointer Analysis for C/C++ Programs
Jiaqi He and
Karim Ali
(University of Alberta, Canada; NYU Abu Dhabi, United Arab Emirates)
Flow-sensitive pointer analysis offers highly precise results that are essential for various security analyses, bug detection tools, and compiler optimizations. However, its high computational cost often leads to prohibitively long analysis times, especially for large, real-world programs. Despite decades of research, state-of-the-art algorithms still struggle to achieve acceptable performance at industrial scale, forcing developers to choose less precise alternatives.
To overcome these limitations, we present GPA, a GPU-accelerated flow-sensitive pointer analysis for C/C++ programs. To maximize hardware utilization, GPA dynamically balances computation by combining the massive parallelism of GPUs with a graph neural network that predicts per-variable workloads. Compared to state-of-the-art CPU-based analyses, GPA improves runtime performance by a factor of 1.3× to 14× on large programs (i.e., ≥275 KLOC LLVM IR) without sacrificing precision. However, on most small programs (i.e., <100 KLOC LLVM IR) and some medium ones (i.e., 100–275 KLOC LLVM IR), traditional CPU implementations run faster due to the memory management overhead on GPUs that GPA incurs. By making the computation of highly precise pointer information more tractable, GPA enables running analyses and developer tools that were previously infeasible on large codebases.
Article: fse26mainb-p564-p
Automated Knowledge-Aware Test Reuse
Ziyuan Zhang,
Yi Gao,
Xing Hu,
Xin Xia, and
Shanping Li
(Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China)
The quality of foundational libraries is critical to the reliability of modern software ecosystems. However, developers often do not have enough time to test for various reasons. Recent studies show that 68% of deep learning libraries lack unit tests and their absence negatively impacts libraries' health. While automated test generation techniques have been proposed, they frequently produce invalid inputs or semantically inconsistent assertions due to missing domain knowledge. An alternative is cross-library test reuse, as many libraries in scientific computing and machine learning expose functionally similar APIs. Nevertheless, effective test reuse requires careful semantic alignment rather than naive "copy-and-paste", as subtle API mismatches and incomplete domain knowledge often yield invalid tests.
To address these challenges, we present KATRER, a knowledge-aware framework for automated test reuse. KATRER models each library as a heterogeneous graph that integrates semantic and structural information and introduces a Test Fingerprint representation for tests, which captures code, docstrings, usage examples, pre-/post-conditions, and invoked APIs. This enables accurate API alignment and prevents invalid calls. KATRER further employs a collaborative LLM pipeline that verifies semantic compatibility, performs stepwise API substitution, and validates test correctness.
We evaluated KATRER on five libraries in two domains, generating 13,191 sub-tests from 1,484 source tests, with 7,257 retained after filtering. KATRER improves both PassRate@1 and TFC@1 by 14.46%, while uncovering 22 previously unknown defects in CuPy and cuML, 8 of which have already been fixed. These results demonstrate that knowledge-aware test reuse substantially reduces manual adaptation effort and enhances the robustness of widely used libraries.
Article: fse26mainb-p605-p
Evaluating Risk and Confidence in Performance Bounds of Configuration Sampling Strategies
Kallistos Weis,
Martina Maggio,
Norbert Siegmund, and
Sven Apel
(Saarland University, Germany; Lund University, Sweden; ScaDS.AI, Germany; Leipzig University, Germany)
Modern software usually exposes a large number of configuration options to the user, giving rise to enormous configuration spaces in practice. Appropriate choices for these options dramatically influence the performance of the software (throughput, memory consumption, execution time, etc.). However, due to the sheer size of the configuration space, systematically identifying the worst- or best-performing configurations is computationally infeasible through exhaustive exploration. Instead, practitioners rely on budgeted sampling strategies, such as uniform random sampling or statistical recursive search, to explore the configuration space under fixed measurement budgets in an attempt to find the worst- or best-performing configuration. Even worse, a fundamental limitation of existing sampling strategies is the lack of quantifiable guarantees that the selected configuration truly reflects worst-case (or best-case) performance. In this paper, we define the basic concepts of posterior risk and posterior confidence and present a probabilistic framework to evaluate how well sampling strategies identify the worst- or best-performing configuration of a software system. We evaluate our framework by comparing five representative sampling strategies on seven real-world configurable software systems. We find that statistical recursive search yields consistently tighter best-case guarantees—higher posterior confidence and lower posterior risk—than the alternatives at the same budget. Our results demonstrate the applicability of our framework as a principled basis for reporting, comparing, and refining sampling strategies, and as a tool for practitioners to select strategies and budgets with quantified guarantees across systems and sample sizes.
Artifacts Available
Article: fse26mainb-p612-p
Can Old Tests Do New Tricks for Resolving SWE Issues?
Yang Chen,
Toufique Ahmed,
Reyhaneh Jabbarvand, and
Martin Hirzel
(University of Illinois at Urbana-Champaign, USA; IBM Research, USA)
Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure past functionality is preserved in the new version, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Due to the predominance of LLM-based debugging techniques, this minimization is essential as large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2
Article: fse26mainb-p613-p
SQLiFuzz: Uncovering SQL Injection in Any Web Applications
I Putu Arya Dharmaadi,
Van-Thuan Pham,
Fadi Mohsen, and
Fatih Turkmen
(University of Groningen, Netherlands; Udayana University, Indonesia; University of Melbourne, Australia)
SQL injection (SQLi) is one of the most critical and prevalent security vulnerabilities, as it enables attackers to manipulate backend databases, bypass authentication, and even gain complete control of the underlying system. Since web applications are the primary targets of SQLi, they must be thoroughly tested to ensure they are free of this vulnerability. Recently, several fuzz testing solutions tailored to SQLi vulnerabilities have been developed; however, our preliminary analysis reveals key limitations that hinder their effectiveness: they primarily focus on GUI-based inputs while neglecting API endpoints, rely on less effective request selection and generation strategies, and require complex configurations to be deployed in practice.
To address these gaps, we propose SQLiFuzz, a universal and simple-to-deploy SQL injection fuzzer that operates across both GUI (web pages) and API entry points. SQLiFuzz introduces three key distinguishing features: (i) a reverse proxy that unifies request collection and fuzzing, and allows seamless integration with existing crawlers and API scanners, (ii) a database proxy that enables request–query matching and serves as a reliable oracle, and (iii) a feedback-driven fuzzer that prioritizes potentially effective requests and parameters, and validates exploitability through database responses. We evaluated SQLiFuzz on six security benchmarks and ten real-world applications. SQLiFuzz successfully detected the majority of known SQLi cases in benchmarks and uncovered nine new vulnerabilities that had been overlooked by state-of-the-art tools in real-world applications. These results highlight SQLiFuzz’s ability to detect SQL injection across diverse web application frameworks and architectures while maintaining practicality and ease of deployment.
Article: fse26mainb-p620-p
ScanCoder: Leveraging Human Attention Patterns to Enhance LLMs for Code
Yueke Zhang,
Yifan Zhang,
Zihan Fang,
Greg Trafton,
Daniel Levin,
Kevin Leach, and
Yu Huang
(Vanderbilt University, USA; US Naval Research Laboratory, USA)
Code comprehension is a fundamental challenge in software engineering that impacts developer productivity and software quality. While Large Language Models (LLMs) demonstrate strong capabilities in code generation and summarization, they process code differently from human developers, who employ strategic attention patterns focused on semantically critical elements. Recent research has successfully integrated human attention patterns captured through eye-tracking into AI models for software engineering tasks; however, existing human-AI approaches face critical limitations that prevent widespread practical deployment, particularly for LLM enhancement. Existing approaches to incorporate human cognitive insights face scalability limitations due to resource-intensive eye-tracking studies and lack empirical validation for cross-language generalizability.
We present ScanCoder, a framework that integrates cognitive simulation with LLM enhancement through (1) generating human-like attention patterns at scale using minimal eye-tracking data via cognitive simulation with Adaptive Character of Thought-Rational (ACT-R) architecture, and (2) cognitively-guided fine-tuning that emphasizes tokens according to their cognitive salience and attention order. Our approach demonstrates cross-language transfer by applying C++-derived cognitive patterns to enhance Java programming tasks. Comprehensive evaluation on CodeXGLUE benchmarks shows consistent improvements across different LLM architectures and scales (1B–8B parameters), achieving gains of up to 41.66 points on CrystalBLEU for code completion and 21.93 points on BERTScore for code summarization. Mechanistic analysis reveals that cognitive guidance reshapes model attention in task-dependent ways, increasing focus on semantically critical tokens by 2.5×. This work establishes the first scalable framework for integrating simulated human cognitive patterns into LLM training, enabling more interpretable and effective code understanding.
Article: fse26mainb-p630-p
Cross-Refactoring-Type Test Program Migration for Refactoring Engines
Chunhao Dong,
Yanjie Jiang,
Yang Zhang,
Yuxia Zhang, and
Hui Liu
(Beijing Institute of Technology, China; Tianjin University, China; Hebei University of Science and Technology, China)
Refactoring engines are an essential part of modern IDEs, supporting automated or semi-automated software refactorings. However, these refactoring engines suffer from software bugs, just like other complex software systems, and buggy refactoring engines may silently inject fatal errors into software projects. Consequently, thorough testing of refactoring engines is highly desirable. To this end, this paper proposes an automated approach called EngineTest to test refactoring engines. Unlike existing approaches, it is the first in this line to leverage both LLMs and bug reports of refactoring types other than the targeted refactoring type. The key rationale is that the same mistake, e.g., ignoring the potential name shadowing, may result in similar errors in multiple refactoring types. Consequently, we may retrieve a bug report for a refactoring type and carefully construct a few-shot example as input, guiding LLMs to migrate it into test programs for exercising other refactoring types. Finally, we run the test cases and validate them with differential testing and LLM-based inconsistency checking. We evaluate the proposed approach with widely used state-of-the-practice refactoring engines in Eclipse, NetBeans, and IntelliJ IDEA. Our evaluation results demonstrate that a total of 85 previously unknown bugs (including Java, C/C++, and Python) have been identified, and 30 of them have been manually confirmed by the tool vendors.
Article: fse26mainb-p672-p
ViBR: Automated Bug Replay from Video-Based Reports using Vision-Language Models
Sidong Feng,
Dingbang Wang,
Nikola Tomic,
Tingting Yu,
Aldeida Aleti, and
Chunyang Chen
(Chinese University of Hong Kong, Shenzhen, China; University of Connecticut, USA; TU Munich, Germany; Monash University, Australia)
Bug reports play a critical role in software maintenance by helping users convey encountered issues to developers. Recently, GUI screen capture videos have gained popularity as a bug reporting artifact due to their ease of use and ability to retain rich contextual information. However, automatically reproducing bugs from such recordings remains a significant challenge. Existing methods often rely on fragile image-processing heuristics, explicit touch indicators, or pre-constructed UI transition graphs, which require non-trivial instrumentation and app-specific setup. This paper presents ViBR, a lightweight and fully automated approach that reproduces bugs directly from GUI recordings. Specifically, ViBR combines CLIP-based embedding similarity for action boundary segmentation with Vision-Language Models (VLMs) for region-aware GUI state comparison and guided bug replay. Experimental results show that ViBR successfully reproduces 72% of bug recordings, significantly outperforming state-of-the-art baselines and ablation variants.
Article: fse26mainb-p692-p
From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing
Linfeng Liang,
Xiao Cheng,
Tsong Yueh Chen, and
Xi Zheng
(Macquarie University, Australia; Swinburne University of Technology, Australia)
Simulation-based testing of autonomous driving systems (ADS) must uncover realistic and diverse failures in dense, heterogeneous traffic. However, existing search-based seeding methods (e.g., genetic algorithms) struggle in high-dimensional spaces, often collapsing to limited modes and missing many failure scenarios. We present PtoP, a framework that combines adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. PtoP is plug-and-play and enhances existing online testing methods (e.g., reinforcement learning–based testers) by providing principled seeds. Evaluation in CARLA on two industry-grade ADS (Apollo, Autoware) and a native end-to-end system shows that PtoP improves safety violation rate (up to 27.68%), scenario diversity (9.6%), and map coverage (16.78%) over baselines.
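The attraction/repulsion balance described for SVGD can be illustrated with a minimal one-dimensional sketch. It uses the standard SVGD update with an RBF kernel; the toy Gaussian log-density, the bandwidth `h`, the step size, and all function names are assumptions made for illustration and are not PtoP's actual risk model or settings.

```python
import math

def rbf(x, y, h):
    """RBF kernel between two scalar particles with bandwidth h."""
    return math.exp(-((x - y) ** 2) / (2 * h * h))

def svgd_step(particles, grad_log_p, h=0.5, step=0.1):
    """One SVGD update for scalar particles (illustrative, not PtoP's code)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            # Attraction: kernel-weighted gradient pulls xi toward high density.
            attraction = k * grad_log_p(xj)
            # Repulsion: derivative of k(xj, xi) w.r.t. xj pushes xi away from xj.
            repulsion = k * (xi - xj) / (h * h)
            phi += attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated

def toy_grad_log_p(x):
    """Assumed target: log p(x) = -x^2 / 2, a stand-in for a 'risk' density."""
    return -x

particles = [3.0, 3.1, 3.2, 3.3]
for _ in range(200):
    particles = svgd_step(particles, toy_grad_log_p)
print(particles)
```

In this toy run the particles drift toward the mode at 0 (attraction) while the kernel-gradient term keeps them from collapsing onto a single point (repulsion), which is the behavior the abstract attributes to SVGD-generated seeds.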
Artifacts Available
Article: fse26mainb-p753-p
Active Learning of Symbolic Automata for Reactive Programs via Dynamic Symbolic Mapper
Yoel Kim and
Yunja Choi
(Kyungpook National University, Republic of Korea)
Active learning of formal behavior models from program source code is a powerful approach for a wide range of software analysis, validation, and verification tasks, including understanding system intent, automating specification mining, generating test oracles, and checking formal properties. Recent advances in active learning of symbolic automata, powered by program synthesis and model checking, provide both rich expressiveness and soundness guarantees for the learned models. However, these techniques often encounter significant performance bottlenecks, particularly when dealing with reactive programs that expose many program variables with large value domains.
This work introduces an extended active learning algorithm tailored for reactive programs by incorporating a novel dynamic symbolic mapper for learning symbolic automata. The mapper abstracts program behavior using learner-inferred predicates over program variables, encodes each valuation of these variables as a Boolean vector induced by these predicates (Boolean abstraction), and dynamically refines the abstraction in response to missing behaviors identified by the teacher. The mapper is granularity-aware: for teaching, it employs the coarsest predicates sufficient to expose missing behaviors, enabling broad exploration; for learning, it refines the abstraction only to the finest predicates necessary to resolve the uncovered gaps, trying to avoid refinements that could otherwise be triggered by coarse abstraction.
We evaluated our approach on 120 benchmarks, including SV-COMP tasks, Simulink model-driven programs, LeetCode problems, and representative embedded control software. The results show that our method learns 32 more symbolic automata and reduces the average active learning time by 55.6% compared to the state-of-the-art.
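The Boolean abstraction mentioned above, in which each valuation of program variables is encoded as a Boolean vector induced by a set of predicates, can be sketched as follows. The variable names, the predicates, and the `abstract` helper are hypothetical and only illustrate the encoding and the effect of adding a finer predicate; they do not reproduce the paper's mapper or its refinement procedure.

```python
from typing import Callable, Dict, List, Tuple

Valuation = Dict[str, int]
Predicate = Callable[[Valuation], bool]

def abstract(valuation: Valuation, predicates: List[Predicate]) -> Tuple[bool, ...]:
    """Encode a concrete valuation as the Boolean vector induced by the predicates."""
    return tuple(p(valuation) for p in predicates)

# Coarse abstraction: a single (assumed) predicate over a hypothetical variable.
coarse: List[Predicate] = [lambda v: v["speed"] > 0]

# Refined abstraction: one extra predicate splits an abstract class in two.
refined: List[Predicate] = coarse + [lambda v: v["speed"] > 100]

slow = {"speed": 5}
fast = {"speed": 250}

print(abstract(slow, coarse), abstract(fast, coarse))    # same abstract value
print(abstract(slow, refined), abstract(fast, refined))  # now distinguished
```

Under the coarse predicates both valuations map to the same Boolean vector, so a learner cannot tell them apart; adding the finer predicate separates them, which is the kind of refinement step the abstract describes being triggered by missing behaviors.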
Artifacts Available
Article: fse26mainb-p759-p
A Wily Hare Has Three Havens: Combating Programmable Logic Controller Attacks via Virtualization Redundancy
Wenjie Wang,
Yazhe Wang, and
Lei Ren
(Zhongguancun Laboratory, China)
Programmable Logic Controllers (PLCs) lack built-in security mechanisms, and their critical role in industrial control systems makes them prime targets for cyberattacks. Next-generation PLCs increasingly adopt embedded virtualization to partition functional domains and to integrate industrial control with advanced workloads on unified hardware. Nevertheless, many PLC vendors and researchers have largely overlooked the potential of virtualization to strengthen PLC security.
To address this gap, we propose TriHaven, an embedded virtualization-based security architecture for PLCs. TriHaven separates the PLC control loop and deploys its components in dedicated virtual machines, each with distinct design characteristics. By further implementing a redundancy-compare strategy for PLC control logic execution, TriHaven enables real-time attack detection and rapid emergency response through control switching, thereby mitigating diverse and previously unknown threats. Integrating virtualization with PLC redundancy introduces new challenges: security-preserving multi-VM scan-cycle design under real-time constraints, low-latency integrity-safe cross-domain I/O exchange, and secure synchronization of a network-isolated standby PLC, solving a consistency problem absent in prior redundancy designs.
Using the Jailhouse hypervisor, experiments on OpenPLC and Beremiz, two open-source PLC runtimes, demonstrate the feasibility of TriHaven while preserving essential security objectives, under the standard assumption that the PLC hardware and its underlying hypervisor remain physically protected and trustworthy.
Article: fse26mainb-p790-p
Boosting LLMs for Mutation Generation
Bo Wang,
Ming Deng,
Mingda Chen,
Chengran Yang,
Youfang Lin,
Mark Harman,
Mike Papadakis, and
Jie M. Zhang
(Beijing Jiaotong University, China; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, China; Singapore Management University, Singapore; University College London, UK; University of Luxembourg, Luxembourg; King's College London, UK)
LLM-based mutation testing is a promising testing technology, but existing approaches typically rely on a fixed set of mutants as few-shot examples or none at all. This can result in generic low-quality mutants, missed context-specific mutation patterns, substantial numbers of redundant and non-compilable mutants, and limited semantic similarity to real bugs. To overcome these limitations, we introduce SMART (Semantic Mutation with Adaptive Retrieval and Tuning). SMART integrates retrieval-augmented generation (RAG) on a vectorized dataset of real-world bugs, focused code chunking, and supervised fine-tuning using mutants coupled with real-world bugs. We conducted an extensive empirical study of SMART using 1,991 real-world Java bugs from the Defects4J and ConDefects datasets, comparing SMART to the state-of-the-art LLM-based approaches, LLMut and LLMorpheus.
The results reveal that SMART substantially improves mutation validity, effectiveness, and efficiency (even enabling 7B-scale models to match or even surpass large models like GPT-4o). We also demonstrate that SMART significantly improves downstream software engineering applications, including test case prioritization and fault localization. More specifically, SMART improves validity (weighted average generation rate) from 42.89% to 65.6%. It raises the non-duplicate rate from 87.38% to 95.62%, and the compilable rate from 88.85% to 90.21%. In terms of effectiveness, it achieves a real bug detection rate of 92.61% (vs. 57.86% for LLMut) and improves the average Ochiai coefficient from 25.61% to 38.44%. For fault localization, SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis.
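For readers unfamiliar with the Ochiai coefficient reported above, the sketch below shows the common set-based form used to compare a mutant with a real bug through the tests each one fails. This is an assumption about the general definition; the abstract does not state exactly how SMART computes or averages it, and the test names are made up.

```python
import math
from typing import Set

def ochiai(failing_on_mutant: Set[str], failing_on_bug: Set[str]) -> float:
    """Ochiai similarity between two sets of failing tests (0.0 if either is empty)."""
    if not failing_on_mutant or not failing_on_bug:
        return 0.0
    shared = len(failing_on_mutant & failing_on_bug)
    return shared / math.sqrt(len(failing_on_mutant) * len(failing_on_bug))

# Hypothetical test identifiers: one shared failing test out of two on each side.
print(ochiai({"t1", "t2"}, {"t2", "t3"}))  # 0.5
```

A value of 1.0 means the mutant fails exactly the tests the real bug fails, and 0.0 means they share no failing test.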
Artifacts Available
Article: fse26mainb-p824-p
ChainDelta: Automatic Patch-Based Exploit Generation for Ethereum with Fuzzing Agents
Mingxi Ye,
Yuhong Nan,
Zhijie Zhong,
Jianzhong Su,
Xingwei Lin,
Peilin Zheng, and
Zibin Zheng
(Sun Yat-sen University, China; Guangdong Engineering Technology Research Center of Blockchain, China; Zhejiang University, China)
Given the critical nature of Ethereum, exploiting 1-day vulnerabilities that are patched but not yet widely deployed is essential. Meanwhile, Automatic Patch-based Exploit Generation (APEG) is a promising technique for this, as it helps developers understand root causes, verify fixes in downstream forks, and detect incomplete patches. However, existing exploit generation tools do not work well for vulnerabilities on Ethereum due to three key unique challenges: (1) navigating complex and cross-language exploit paths hidden within patches, (2) synthesizing complicated and stateful environment configurations, and (3) handling non-deterministic inconsistencies between blockchain nodes that lead to false alarms.
To address these challenges, we introduce ChainDelta, a novel fuzzing agent framework driven by Large Language Models to automatically generate exploits based on Ethereum security patches. ChainDelta consists of three core modules: a directed fuzzer utilizes call graph analysis to guide testing towards vulnerable code based on the patch information; an agent-based environment fuzzer acts as an expert to automatically set up the necessary blockchain states to trigger vulnerabilities; and finally, a state-aware sanitizer performs differential analysis while monitoring the blockchain transient state to distinguish true inconsistencies from benign non-determinism.
We evaluate ChainDelta on a diverse benchmark with real-world patches, covering a wide range of types such as data racing and denial-of-service. ChainDelta successfully generated exploits with a 64% success rate and only a 15.8% false positive rate. An ablation study confirms the contribution of each module to the overall performance. To demonstrate its practical impacts, we conducted a real-world auditing campaign on top of ChainDelta, leading to the discovery of four previously undisclosed vulnerabilities with bug bounties.
Article: fse26mainb-p839-p
From Suspicious Signals to Crashes: Guiding Bug-Driven GUI Testing via Code-Inspired Tracing
Mengzhuo Chen,
Zhe Liu,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Yuekai Huang,
Jun Hu, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; TU Munich, Germany; University of Chinese Academy of Sciences, Beijing, China; Institute of Software at Chinese Academy of Sciences, Beijing, China)
Mobile applications frequently suffer from crash bugs that are triggered under specific GUI interaction sequences. Existing automated GUI testing techniques mainly emphasize increasing coverage through diverse exploration strategies, but they often fail to reach the precise interaction contexts that lead to crashes, resulting in low bug detection efficiency. This paper proposes TraceDroid, a novel automated GUI testing approach that leverages suspicious code-level signals to guide dynamic exploration. Instead of treating static analysis as an independent detection method, TraceDroid uses heuristic rules distilled from real crash reports to detect suspicious code segments, associate them with GUI widgets, and collect code-level interaction signals. It then constructs the Activity Transition Graph (ATG), performs rough path generation, and employs LLM-based executable path completion to produce a set of suspicious paths. Finally, TraceDroid executes these paths through global path planning, local path generation, and execution-aware monitoring to efficiently expose crashes. We evaluate TraceDroid on 70 real crash bugs across 42 open-source apps, comparing it with 15 state-of-the-art baselines. TraceDroid achieves the best performance, with a recall of 77%, exceeding the best baseline by 28%, while maintaining comparable or higher coverage. Furthermore, TraceDroid successfully detects 21 previously unknown crash bugs in 116 popular Google Play apps, of which 15 have been fixed and 6 confirmed by developers, demonstrating its effectiveness in real-world scenarios.
Article: fse26mainb-p844-p
Characterizing Trust Boundary Vulnerabilities in TEE Container Systems: An Empirical Study
Weijie Liu,
Hongbo Chen,
Shuo Huai,
Zhen Xu,
Wenhao Wang,
XiaoFeng Wang,
Danfeng Zhang,
Zhi Li,
Haixu Tang, and
Zheli Liu
(Nankai University, China; Indiana University Bloomington, USA; Nanyang Technological University, Singapore; Institute of Information Engineering at Chinese Academy of Sciences, China; Duke University, USA; Huazhong University of Science and Technology, China)
Trusted Execution Environments (TEEs) have become a cornerstone of confidential computing, attracting significant attention from academia and industry. To support secure and scalable application deployment on confidential clouds, TEE containers (Tcons) have been introduced as middleware to shield applications from malicious operating systems and orchestration layers while preserving usability.
In this paper, we present the first comprehensive analysis of Tcons, focusing on three critical layers: OS interfaces, encrypted I/O, and orchestration mechanisms. To enable systematic evaluation, we design TBouncer, an automated analyzer that precisely exercises and benchmarks Tcon isolation boundaries.
Our study uncovers fundamental flaws in existing Tcons, leading to exploitable vulnerabilities such as code execution, denial-of-service, and information leakage. In total, we identify six attack vectors, twelve new bugs, and three CVEs.
These findings provide new insights into the underestimated attack surface of Tcons and highlight key directions for building more secure and trustworthy container solutions.
Article: fse26mainb-p848-p
Odyssey: Hunting Smart Contract Vulnerabilities with Fine-Grained State Modeling and Exploration
Jianzhong Su,
Mingxi Ye,
Jiachi Chen,
Yuhong Nan,
Peilin Zheng,
Tao Zhang, and
Zibin Zheng
(Sun Yat-sen University, China; GuangDong Engineering Technology Research Center of Blockchain, China; Macau University of Science and Technology, China; Sun Yat-sen University, Guangzhou, China)
With the rapid development of decentralized applications, many malicious actors exploit smart contract vulnerabilities for launching attacks. Moreover, as smart contracts utilize more state variables to support complex functionalities, some vulnerabilities require specific states to trigger (marked as vulnerable states), bringing new challenges to the vulnerability detection task. Although many smart contract fuzzers have been proposed for this task, they face limitations due to their inability to efficiently explore smart contract states.
To address this challenge, we propose a novel fuzzer, Odyssey, with fine-grained state modeling and exploration, which increases the probability of reaching vulnerable states. We improve the efficacy of the fuzzer with two key mechanisms: (1) modeling an essential state space consisting of the variables related to sensitive operations to compress the exploration scope; (2) designing state-aware exploration strategies to identify test seeds that cover new state scope or cause new state transitions, to improve the efficiency of exploration.
To evaluate the performance in vulnerability detection, we apply Odyssey to a labeled benchmark consisting of 130 vulnerable contracts. Odyssey detects at least 70% more vulnerabilities than other fuzzers. Moreover, we evaluate Odyssey on a dataset that consists of 143 DApps (involving 437 contracts) from real-world security incidents. The experimental results demonstrate that state-aware feedback enhances the ability of Odyssey in state exploration by achieving 19% higher state coverage. Meanwhile, Odyssey finds a total of 15 exploits of vulnerabilities from real-world attacks, showing its advantage in detecting real-world vulnerabilities.
Article: fse26mainb-p869-p
NESA: Relational Neuro-Symbolic Static Program Analysis
Chengpeng Wang,
Yifei Gao,
Wuqi Zhang,
Xuwei Liu,
Jinyao Guo,
Mingwei Zheng,
Qingkai Shi, and
Xiangyu Zhang
(Purdue University, USA)
Static program analysis plays an essential role in program optimization, bug detection, and debugging. However, reliance on compilation and limited customization hinder its adoption in the real world. This paper presents a compositional neuro-symbolic approach named NESA that facilitates compilation-free and customizable static program analysis using large language models (LLMs) with mitigated hallucinations. Specifically, we propose an analysis policy language, a restricted form of Datalog, to support users in decomposing a static program analysis problem into several sub-problems that target simpler syntactic or semantic properties upon smaller code snippets. The problem decomposition enables the LLMs to target more manageable semantic-related sub-problems with reduced hallucinations, while the syntactic ones are resolved by parsing-based analysis without hallucinations. An analysis policy is then evaluated with lazy and incremental prompting, which significantly mitigates the hallucinations and improves the performance. We evaluate NESA for program slicing and bug detection upon benchmark and real-world programs. Evaluation results show that while NESA supports compilation-free and customizable analysis, it can still achieve comparable and even better performance than existing techniques. In a customized taint vulnerability detection upon TaintBench, for example, NESA achieves a precision of 66.27%, a recall of 78.57%, and an F1 score of 0.72, surpassing an industrial approach by 0.20 in F1 score. NESA also detects 13 real-world memory leak bugs, which have been fixed by developers.
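As a quick consistency check on the reported TaintBench numbers, the F1 score can be recomputed from the stated precision and recall using the standard harmonic-mean formula. The function name is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 66.27% and recall 78.57%, as stated in the abstract.
print(round(f1_score(0.6627, 0.7857), 2))  # 0.72
```

The recomputed value rounds to 0.72, matching the F1 score given in the abstract.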
Article: fse26mainb-p877-p
LLM-Assisted Input-Requirement-Aware Differential Testing of Array Programming Frameworks
Zhichao Zhou and
Jingzhu He
(ShanghaiTech University, China)
Array programming (AP) frameworks (e.g., NumPy and Octave) are widely adopted in scientific computing. Critical defects can jeopardize the entire ecosystem. The stability of API designs enables differential testing on various implementations (e.g., two versions). However, two primary obstacles remain. First, current test generation cannot effectively generate valid inputs, as the APIs (e.g., matrix multiplication) have type constraints and semantic requirements. Second, unit testing approaches test APIs independently, but they share a core N-dimensional array structure (ndarray) as inputs. Modifying one API may alter the ndarray's properties, breaking the correctness of others. We propose a differential testing tool for array programming, called ArrayDiff. We first collect semantic requirements from NumPy's APIs and leverage LLMs to transfer NumPy's requirements to other frameworks. Then, we propose an input-requirement-aware API call generator (IRA-ACG). Based on IRA-ACG, ArrayDiff employs search algorithms to evolve tests while ensuring valid inputs. ArrayDiff can generate valid and complex API call sequences to detect potential differences. We evaluate ArrayDiff and its ablation versions on five AP pairs. They detect 47 valid-input differences and 39 invalid ones, with 23 confirmed as bugs or document issues. IRA-ACG boosts the detection of valid-input differences, which constitute most confirmed bugs. Comparing ArrayDiff with TitanFuzz (LLM-based fuzzer) and Ghostwriter (unit tester) confirms the benefits of IRA-ACG and sequence-level testing.
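The idea of differential testing with requirement-respecting inputs can be sketched in a few lines. The example below compares two pure-Python matrix multiplications on randomly generated inputs that always satisfy the shape requirement (inner dimensions match). All function names, the deliberately buggy variant, and the generator are assumptions for illustration; they are not ArrayDiff or IRA-ACG, and no real framework API is used.

```python
import random

def matmul_loops(a, b):
    """Reference implementation: explicit index loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def matmul_zip(a, b):
    """Second implementation: dot products over rows of a and columns of b."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

def matmul_buggy(a, b):
    """Deliberately wrong variant: drops the last term of every dot product."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in list(zip(row, col))[:-1]) for col in columns]
            for row in a]

def gen_valid_pair(rng, max_dim=4):
    """Generate (a, b) with matching inner dimension, i.e. a valid matmul input."""
    m, k, n = (rng.randint(1, max_dim) for _ in range(3))
    a = [[rng.randint(-5, 5) for _ in range(k)] for _ in range(m)]
    b = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(k)]
    return a, b

def differential_test(impl_x, impl_y, trials=50, seed=0):
    """Return the valid inputs on which the two implementations disagree."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        a, b = gen_valid_pair(rng)
        if impl_x(a, b) != impl_y(a, b):
            diffs.append((a, b))
    return diffs

print(len(differential_test(matmul_loops, matmul_zip)))    # agreeing implementations
print(len(differential_test(matmul_loops, matmul_buggy)))  # disagreements expose the bug
```

Because every generated pair is a valid input, any disagreement points at a behavioral difference between the implementations rather than at an input-validation error, which is the distinction the abstract draws between valid-input and invalid-input differences.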
Article: fse26mainb-p951-p
VisionScratch: LLM-Based Automated Feedback Generation using Code-Produced Videos for Scratch Programs
Yuan Si,
Daming Li,
Hanyuan Shi, and
Jialu Zhang
(University of Waterloo, Canada; Independent Researcher, USA; Independent Researcher, China)
Block-based programming environments such as Scratch are increasingly popular in programming education, in particular for young learners. While the use of blocks helps prevent syntax errors, semantic bugs remain common and difficult to debug. Existing tools for Scratch debugging rely heavily on predefined rules or user manual inputs, and crucially, they ignore the platform’s inherently visual nature.
We introduce VisionScratch, the first multimodal feedback generation system for Scratch that leverages both the project’s block code and its generated gameplay video to diagnose and repair bugs. VisionScratch uses a two-stage pipeline: a vision-language model first aligns visual symptoms with code structure to identify a single critical issue, then proposes minimal, abstract syntax tree level repairs that are verified via execution in the Scratch virtual machine.
We evaluate VisionScratch on a set of real-world Scratch projects against state-of-the-art LLM-based tools and human testers. Results show that gameplay video is a crucial debugging signal: VisionScratch substantially outperforms prior tools in both bug identification and repair quality, even without access to project descriptions or goals.
This work demonstrates that video can serve as a first-class specification in visual programming environments, opening new directions for LLM-based debugging beyond symbolic code alone.
Article: fse26mainb-p970-p
Agentic Verification of Software Systems
Haoxin Tu,
Huan Zhao,
Yahui Song,
Mehtab Zafar,
Ruijie Meng, and
Abhik Roychoudhury
(National University of Singapore, Singapore)
Automatically generated code is gaining traction recently, owing to the prevalence of Large Language Models (LLMs). Further, the AlphaProof initiative has demonstrated the possibility of using AI for general mathematical reasoning. Reasoning about computer programs (software) can be accomplished via general mathematical reasoning; however, it tends to be more structured and richer in contexts. This forms an attractive proposition, since then AI agents can be used to reason about voluminous code that gets generated by AI.
In this work, we present a first LLM agent, AutoRocq, for conducting program verification. Unlike past works, which rely on extensive training of LLMs on proof examples, our agent learns on-the-fly and improves the proof via an iterative refinement loop. The iterative improvement of the proof is achieved by the proof agent communicating with the Rocq (formerly Coq) theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover. In this way, our proof construction involves autonomous collaboration between the proof agent and the theorem prover. This autonomy facilitates the search for proofs and decision-making in deciding on the structure of the proof tree. Experimental evaluation on SV-COMP benchmarks and on Linux kernel modules shows promising efficacy in achieving automated program verification. As automation in code generation becomes more widespread, we posit that our proof agent can potentially be integrated with AI coding agents to achieve a generate-and-validate loop, thus moving closer to the vision of trusted automatic programming.
Article: fse26mainb-p985-p
The Effect of Complexity and Provenance on Code Review Decisions: Evidence from a Controlled Experiment
Neha Singh,
Francesco Sovrano,
Vincent Hellendoorn, and
Alberto Bacchelli
(University of Zurich, Switzerland; Google DeepMind, USA)
A code revision is a proposed modification to a specific code snippet under review, created in response to a reviewer’s comment. Modern code review platforms, such as GitHub, allow participants to provide these revisions inline as concrete change suggestions that the others can accept or reject with a single action. While this feature promises efficiency, it may also shape how developers evaluate changes. We hypothesize that under higher code complexity, which increases cognitive load and uncertainty, and depending on suggestion provenance (human vs. AI), reviewers may rely more on heuristic judgments and readily available suggestions, potentially reducing review effectiveness. To test our hypothesis, we present the results of a between-subjects experiment with 385 participants, who were asked to review a changeset including the acceptance/rejection of a proposed code revision. The study tested for the effects of code complexity (low vs. high) and provenance labels (human vs. AI), while controlling for revision correctness. We analyzed developers’ review decisions through compliance patterns: acceptance of correct or rejection of incorrect code revisions (appropriate-compliance), acceptance of incorrect code revisions (over-compliance), and rejection of correct code revisions (under-compliance). We found that higher code complexity significantly (p < .05) increases over-compliance, with reviewers more frequently accepting incorrect suggestions. In contrast, provenance labels had no statistically supported effect on review outcomes. We also found no statistically supported evidence that provenance moderates the effect of complexity.
This work contributes: (i) empirical evidence that higher code complexity increases the likelihood of accepting incorrect revision suggestions, (ii) an analysis of provenance showing no main effect on overall compliance, and (iii) clarification that the effect of complexity does not statistically depend on whether revisions are AI- or human-labeled, with any observed differences treated as preliminary and exploratory. Together, these results highlight the need for review systems that surface complexity cues and support more deliberate evaluation of suggested revisions, especially in cognitively demanding contexts. Data and Materials: https://doi.org/10.5281/zenodo.19481940
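The three compliance patterns named in the abstract amount to a small decision rule over two facts: whether the proposed revision was correct and whether the reviewer accepted it. The Python sketch below is illustrative only; the function name `classify_compliance` and its boolean parameters are assumptions and are not taken from the study's materials.

```python
def classify_compliance(revision_is_correct: bool, reviewer_accepted: bool) -> str:
    """Map one review decision to a compliance pattern from the abstract."""
    if revision_is_correct == reviewer_accepted:
        # Accepted a correct revision, or rejected an incorrect one.
        return "appropriate-compliance"
    if reviewer_accepted:
        # Accepted a revision that was incorrect.
        return "over-compliance"
    # Rejected a revision that was correct.
    return "under-compliance"


if __name__ == "__main__":
    for correct in (True, False):
        for accepted in (True, False):
            print(correct, accepted, classify_compliance(correct, accepted))
```

Under this reading, the reported effect of complexity concerns the share of decisions that fall into the second branch.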
Artifacts Available
Article: fse26mainb-p998-p
DualCodeDetect: Zero-Shot LLM-Generated Code Detection via Dual-Channel Perturbation
Zhengdao Li,
Xiuwei Shang,
Zhenkan Fu,
Shikai Guo,
Weiming Zhang,
NengHai Yu, and
Kejiang Chen
(University of Science and Technology of China, China; Dalian Maritime University, China)
The rapid advancement of large language models (LLMs) in code generation has greatly improved software development efficiency, but it has also raised concerns about misuse, making the distinction between human-written and LLM-generated code an urgent task. However, existing detection methods for LLM-generated content, particularly perturbation-based zero-shot methods, are primarily designed for natural language scenarios and fail to transfer effectively to the task of detecting generated code. When directly applied to code, they face two major challenges: (1) the low-entropy nature of code restricts the perturbation space and weakens discriminative signals; and (2) prior perturbation methods often compromise semantic integrity or executability, leading to substantial performance degradation. To address these issues, we propose DualCodeDetect, a novel zero-shot detection framework that amplifies the differences between LLM-generated and human-written code through a dual-channel perturbation mechanism. In the semantic channel, we design an identifier perturbation strategy based on outside-nucleus sampling, which disrupts the strong consistency of LLMs in identifier selection. In the structural channel, empirical analysis reveals that LLM-generated code exhibits greater uniformity in stylistic features; leveraging this insight, we construct a rule-based library of semantics-preserving code transformations to introduce structural perturbations that further magnify statistical disparities. In experiments conducted across two datasets and ten representative code LLMs, DualCodeDetect achieves an average AUROC of 0.8477, alongside FPR and FNR values of 0.0430 and 0.0552, respectively, on Python under both T=0.2 and T=1.0 temperature settings with a reasonable runtime overhead. Furthermore, it demonstrates strong cross-language generalization on Java, C++, and JavaScript, confirming its significant superiority over existing detection methods.
Article: fse26mainb-p1013-p
Two-Level Adaptation for Budget-Constrained Continuous Dynamic Dependence Analysis
Xiaoqin Fu and
Haipeng Cai
(Washington State University, USA; SUNY Buffalo, USA)
Dynamic dependence analysis is essential for software performance optimization, debugging, regression testing, and security analysis. In long-running and distributed systems, continuously performing this analysis while balancing cost and precision under strict time budgets is a persistent challenge. Current state-of-the-art (SOTA) approaches have tackled this problem but suffer from inefficient budget utilization and/or imprecise dependence results, limiting their practicality in real-world software engineering workflows. We introduce GDist, a novel two-level adaptation framework that self-tunes analysis parameters to optimize cost-effectiveness for continuous dynamic dependence analysis. GDist integrates a decision-tree-based learning model, tailored to system executions, with domain knowledge about analysis precision levels, ensuring a more precise and budget-observing adaptation strategy. We evaluate GDist on 12 real-world distributed systems and in two key applications: regression test reduction and vulnerability detection. Results show that GDist improves budget utilization by 29% and precision by 18% over SOTA. Specifically, GDist reduces regression testing costs by 31% while maintaining test effectiveness and lowers the cost of identifying true-positive vulnerabilities by 36% in enterprise-scale systems. Its lightweight adaptation and the inherent interpretability of decision trees make it well-suited for scalable, cost-aware software maintenance and security analysis. These merits position GDist as a practical and adaptive solution for modern software engineering challenges.
Article: fse26mainb-p1042-p
Unveiling AI-Driven Web Applications: Insights into Characteristics, Functionality, and Compliance
Liuhuo Wan,
Zicong Liu,
Chuan Yan,
Liujia Wan,
Naipeng Dong,
Zi Huang, and
Guangdong Bai
(University of Queensland, Australia; Northeastern University, China; City University of Hong Kong, Hong Kong)
Collaborative platforms such as Google Workspace, Microsoft Teams, and Zoom increasingly rely on third-party applications (referred to as plugins) to extend their core functionalities, with AI-assisted plugins emerging as a key driver of productivity. Despite their popularity and rapid adoption, little is known about the characteristics of the marketplace, the potential security and privacy risks that concern users, and the compliance of plugins with AI ethics guidelines. In this paper, we present the first large-scale, cross-platform study of plugins from five major web application marketplaces, covering domains from office productivity to software development. We systematically examine the distribution characteristics of current plugins, analyze users’ concerns, and assess their compliance with emerging AI regulations. Our findings indicate that (i) the current marketplaces exhibit an uneven distribution of functionality and installations, (ii) AI-assisted plugins face a range of emerging issues that negatively impact user experience, and (iii) a significant proportion of plugins fail to comply with established AI ethics principles. Our work highlights the need for stringent policies and security auditing to maintain the quality of AI-assisted plugins.
Article: fse26mainb-p1065-p
Project-Level C-to-Rust Translation via Pointer Knowledge Graphs
Zhiqiang Yuan,
Wenjun Mao,
Zhuo Chen,
Xiyue Shang,
Chong Wang,
Yiling Lou, and
Xin Peng
(Fudan University, China; Nanyang Technological University, Singapore)
Translating C code into safe Rust is an effective way to ensure memory safety. Compared to rule-based approaches, which often produce largely unsafe Rust code, LLM-based methods generate more idiomatic and safer Rust by leveraging extensive training on human-written code. Despite their promise, existing LLM-based approaches still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (e.g., functions) based on call graphs and translate them in a bottom-up manner to resolve dependencies. However, this unit-by-unit paradigm often fails to handle pointers due to the lack of a global view of their usage.
To address this limitation, we propose a novel C-to-Rust Pointer Knowledge Graph (KG) that augments code dependency graphs with two types of pointer semantics: (i) pointer usage information, which captures global behaviors such as points-to flows and lifts low-level struct interactions to higher-level abstractions; and (ii) Rust-oriented annotations, which encode ownership, mutability, nullability, and lifetime. Building on this KG, we further propose PtrTrans, a project-level C-to-Rust translation approach. In PtrTrans, the KG provides LLMs with comprehensive global pointer semantics, guiding them to generate safe and idiomatic Rust code. Experimental results show that PtrTrans reduces unsafe usages in translated Rust by 99.9% compared to both rule-based and conventional LLM-based methods, while achieving 29.3% higher functional correctness than fuzzing-enhanced LLM approaches.
Article: fse26mainb-p1091-p
CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries
Ximing Fan,
Yong Fang,
Peng Jia,
Yang Liu,
Yijia Xu,
Xi Peng, and
Yuhao Zhou
(Sichuan University, China; Nanyang Technological University, Singapore)
In the AI-driven era, NVIDIA CUDA libraries have become indispensable for accelerating compute-intensive tasks, yet their security assessment remains critically understudied due to closed-source code and unique programming paradigms. Existing efforts primarily target CUDA compiler vulnerabilities (e.g., NVCC), but overlook broader library-specific risks. This paper addresses the challenges of fuzzing CUDA libraries: (1) context-dependent API ordering and closed-source opacity hinder the synthesis of valid, diverse API sequences; and (2) implicit parameter dependencies reduce the effectiveness of mutation for LLM-generated harnesses.
A new tool called CuFuzz is proposed, aimed at uncovering potential vulnerabilities in the CUDA libraries. CuFuzz can generate testing harnesses for various CUDA library functions from scratch, perform efficient parameter mutation, and adapt to the needs of multiple CUDA libraries. First, LLMs are used to extract semantic relationships from CUDA documentation and sample code, constructing a knowledge graph that prioritizes API interactions and contextual dependencies. The API coverage bitmap is proposed to guide the fuzzer to explore under-tested library functions. In addition to harness generation, the API knowledge graph is also combined with compiler diagnostics to repair erroneous harnesses, thereby improving compilation success rates. Subsequently, CuFuzz employs the LLMs to analyze and decouple parameter dependencies, separates out the mutable parameters, and performs parameter-isolated mutation on them to enhance mutation efficiency.
Evaluated across three CUDA releases (12.4, 12.7, and 13.0) on eight widely adopted libraries (e.g., cuBLAS, cuFFT), CuFuzz achieves 2.97× higher API coverage and 4.0× superior API edge coverage relative to the baseline (Fuzz4All), on average. The experiments uncovered 6 previously unknown bugs, which were validated by NVIDIA’s security team, and obtained 2 CVEs.
Article: fse26mainb-p1117-p
Understanding, Detecting, and Repairing Real-World In-Context-Learning-Based Text-to-SQL Errors
Jiawei Shen,
Chengcheng Wan,
Ruoyi Qiao,
Jiazhen Zou,
Hang Xu,
Yuchen Shao,
Yueling Zhang,
Weikai Miao, and
Geguang Pu
(East China Normal University, China; Shanghai Innovation Institute, China)
Large language models (LLMs) have been adopted for text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into SQL queries. However, such a technique faces correctness problems. In this paper, we conduct the first comprehensive study of text-to-SQL errors of ICL-based techniques. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 27 error types in 7 categories. We also find that existing repairing attempts yield limited correctness improvement while incurring high computational overhead and many mis-repairs. Based on these findings, we propose MapleDoctor, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleDoctor outperforms existing solutions by repairing 13.8% more queries with a negligible number of mis-repairs and reducing repair latency by 67.4%. The artifact is publicly available on GitHub.
Article: fse26mainb-p1132-p
An Empirical Study of Fuzz Harness Degradation
Philipp Görz,
Joschua Schilling,
Nicolai Bissantz, and
Thorsten Holz
(Ruhr-University Bochum, Germany; CISPA Helmholtz Center for Information Security, Germany; MPI-SP, Germany)
Fuzzing is a widely used technique to automatically test software for potential faults. To fuzz software projects efficiently and effectively, software developers must use fuzz harnesses, i.e., small programs that connect the fuzzer to the project’s code under test. However, as projects evolve, it is unclear whether fuzz harnesses are maintained in lockstep or left to stagnate, and whether unmaintained fuzz harnesses gradually degrade in terms of code coverage and bug-finding effectiveness.
In this paper, we focus on OSS-Fuzz, the largest continuous fuzzing platform in practice, which provides harnesses for 510 security-critical open-source C/C++ projects. These harnesses are usually contributed by project maintainers or external developers, yet their ongoing maintenance is not always ensured. Our analysis shows that, overall, harnesses exhibit only a small reduction in coverage and retain surprising longevity in their ability to uncover bugs. At the same time, we also identify cases where harnesses degrade, analyze their root causes and the involved semantics of the code changes, and categorize them systematically. Finally, we extend OSS-Fuzz and Fuzz Introspector, a companion project to investigate fuzzer performance, with new metrics to automatically detect harness degradation, enabling more effective monitoring of fuzzing quality in evolving projects.
Article: fse26mainb-p1142-p
Towards Automated Crowdsourced Testing via Personified-LLM
Shengcheng Yu,
Yuchen Ling,
Chunrong Fang,
Zhenyu Chen, and
Chunyang Chen
(TU Munich, Germany; Nanjing University, China)
The rapid proliferation and increasing complexity of software demand robust quality assurance, with graphical user interface (GUI) testing playing a pivotal role. Crowdsourced testing has proven effective in this context by leveraging the diversity of human testers to achieve rich, scenario-based coverage across varied devices, user behaviors, and usage environments. In parallel, automated testing, particularly with the advent of large language models (LLMs), offers significant advantages in controllability, reproducibility, and efficiency, enabling scalable and systematic exploration. However, automated approaches often lack the behavioral diversity characteristic of human testers, limiting their capability to fully simulate real-world testing dynamics. To address this gap, we present PersonaTester, a novel personified-LLM-based framework designed to automate crowdsourced GUI testing. By injecting representative personas, defined along three orthogonal dimensions (testing mindset, exploration strategy, and interaction habit), into LLM-based agents, PersonaTester enables the simulation of diverse human-like testing behaviors in a controllable and repeatable manner. Experimental results demonstrate that PersonaTester faithfully reproduces the behavioral patterns of real crowdworkers, exhibiting strong intra-persona consistency and clear inter-persona variability (117.86% to 126.23% improvement over the baseline). Moreover, persona-guided testing agents consistently generate more effective test events and trigger more crashes (100+) and functional bugs (11) than the baseline without persona, thus substantially advancing the realism and effectiveness of automated crowdsourced GUI testing.
Article: fse26mainb-p1153-p
DECODE: Dynamic Exploration for Constraint-Guided Vulnerability Discovery in Deep Learning Operators
Haotong Liu,
Zhi Wang,
Zhuohang Liu, and
Wanpeng Li
(Nankai University, China; University of Liverpool, UK)
The security and robustness of deep learning (DL) frameworks are vital, as vulnerabilities in low-level operator implementations can lead to serious reliability and security risks.
While testing has proven effective in uncovering such flaws, existing techniques struggle to accurately capture the complex input constraints required by DL operators, resulting in low test coverage and missed bugs.
To address this, we propose DECODE, a fully automated framework that performs efficient and precise constraint extraction through dynamic analysis.
DECODE models operator-specific input requirements by observing valid execution traces and exploring constraint relationships.
It then uses these refined constraints to generate high-quality test inputs capable of exposing memory error vulnerabilities.
We evaluated DECODE on two widely used DL frameworks, TensorFlow and PyTorch, where it uncovered 96 bugs (54 in TensorFlow and 42 in PyTorch), 82 of which have been confirmed by developers.
Compared to state-of-the-art tools, DECODE detected 41, 75, and 87 more bugs than IvySyn, DocTer, and DeepREL, respectively, demonstrating its superior effectiveness in uncovering previously undetected vulnerabilities.
Article: fse26mainb-p1216-p
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
Tobias Kiecker,
Jan Arne Sparka,
Martin Reuter,
Albert Ziegler, and
Lars Grunske
(Humboldt-Universität zu Berlin, Germany; XBOW, Sweden)
Maintaining consistency between code and documentation is a crucial yet frequently overlooked aspect of software development. Even minor mismatches can confuse API users, introduce new bugs, and increase overall maintenance effort.
This creates demand for automated solutions that can assist developers in identifying code-documentation inconsistencies. However, since automatic reports still require human confirmation, false positives carry serious consequences: wasting developer time and discouraging practical adoption.
We introduce CASCADE (Consistency Analysis for Source Code And Documentation through Execution), a novel tool for detecting inconsistencies with a strong emphasis on reducing false positives.
CASCADE leverages Large Language Models (LLMs) to generate unit tests directly from natural-language documentation.
Since these tests are derived from the documentation, any failure during execution indicates a potential mismatch between the documented and actual behavior of the code.
To minimize false positives, CASCADE also generates code from the documentation to cross-check the generated tests. By design, an inconsistency is reported only when two conditions are met: the existing code fails a test, while the code generated from the documentation passes the same test.
We evaluated CASCADE on a novel dataset of 71 inconsistent and 814 consistent code-documentation pairs drawn from open-source Java projects. Further, we applied CASCADE to additional Java, C#, and Rust repositories, where we uncovered 13 previously unknown inconsistencies, of which 10 have subsequently been fixed, demonstrating both CASCADE's precision and its applicability to real-world codebases.
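The two-condition reporting rule described in this abstract can be written as a single predicate. The Python sketch below is a minimal illustration under stated assumptions: tests are modeled as callables that return `True` on a pass, and every name in it (`report_inconsistency`, `existing_abs`, `generated_abs`, `good_test`, `flawed_test`) is hypothetical rather than part of the CASCADE tool.

```python
from typing import Callable

Impl = Callable[[int], int]
DocTest = Callable[[Impl], bool]  # returns True when the implementation passes


def report_inconsistency(doc_test: DocTest, existing_impl: Impl, generated_impl: Impl) -> bool:
    """Report only if the existing code fails the documentation-derived test
    while the code generated from the documentation passes the same test."""
    return (not doc_test(existing_impl)) and doc_test(generated_impl)


# Hypothetical documentation: "returns the absolute value of x".
def existing_abs(x: int) -> int:
    return x  # deviates from the documentation for negative inputs


def generated_abs(x: int) -> int:
    return -x if x < 0 else x  # stands in for code generated from the documentation


def good_test(impl: Impl) -> bool:
    return impl(-3) == 3


def flawed_test(impl: Impl) -> bool:
    return impl(-3) == 4  # a wrongly generated test that neither version passes


if __name__ == "__main__":
    print(report_inconsistency(good_test, existing_abs, generated_abs))
    print(report_inconsistency(flawed_test, existing_abs, generated_abs))
```

The second call shows why the cross-check helps: a faulty generated test fails on both implementations, so it does not produce a report.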
Artifacts Available
Article: fse26mainb-p1294-p
A Tuple-Oriented Sampling Method for Generating Small Pairwise Covering Arrays in Configurable Software Systems
Kaichen Chen,
Yi Xiang,
Haining Wang,
Jiatong Ma,
Fujian Feng,
Miqing Li, and
Han Huang
(South China University of Technology, China; Guizhou Minzu University, China; University of Birmingham, UK; Sun Yat-sen University, China)
Pairwise testing is the most commonly used combinatorial interaction testing (CIT) technique to verify highly configurable systems, aiming to select the minimum number of testing configurations to cover all valid pairwise combinations of option values. The core problem of pairwise testing is the pairwise covering array generation (PCAG) problem. Existing PCAG methods typically struggle to generate small-scale pairwise covering arrays (PCA) for instances with complex constraints, or they require excessive computational time. To address these limitations, we propose DivSampCA, which employs a tuple-oriented adaptive sampling technique to enhance the diversity of the sampled configurations. Moreover, DivSampCA employs a novel full coverage strategy to ensure that the remaining uncovered pairwise tuples are covered with as few configurations as possible. We validate our method on 121 publicly available configurable system instances, and the experimental results show that DivSampCA achieves the smallest covering array in 71% of the instances, which is on average 15.54% smaller than that of other algorithms. Moreover, it is the fastest in 65% of the instances, reducing the average time by 42.36%. These results indicate that DivSampCA can generate smaller covering arrays in a shorter time and represents a significant advancement in solving the PCAG problem.
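To make the covering property concrete, the Python sketch below checks whether a set of configurations covers every valid pairwise combination of option values. It illustrates the problem definition only and is not the DivSampCA sampling method; the helper names (`pairs_of`, `is_pairwise_covering`), the brute-force enumeration of valid configurations, and the toy three-option system are assumptions chosen for brevity.

```python
from itertools import combinations, product


def pairs_of(config):
    """All pairwise tuples ((option i, value), (option j, value)) in one configuration."""
    return {
        ((i, config[i]), (j, config[j]))
        for i, j in combinations(range(len(config)), 2)
    }


def is_pairwise_covering(array, domains, is_valid=lambda config: True):
    """True if `array` covers every pairwise tuple that occurs in some valid configuration.

    Valid tuples are found by enumerating all configurations, which is only
    feasible for tiny examples such as the one below.
    """
    required = set()
    for config in product(*domains):
        if is_valid(config):
            required |= pairs_of(config)
    covered = set()
    for config in array:
        covered |= pairs_of(config)
    return required <= covered


if __name__ == "__main__":
    domains = [(0, 1), (0, 1), (0, 1)]  # three Boolean options, no constraints
    small_array = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
    print(is_pairwise_covering(small_array, domains))
```

In this toy system, four configurations cover all twelve value pairs, compared with eight configurations for exhaustive testing.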
Article: fse26mainb-p1351-p
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
Zeming Wei,
Chengcan Wu, and
Meng Sun
(Peking University, China)
Large Language Models (LLMs) have achieved tremendous success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks of generating harmful content and are vulnerable to jailbreaking attacks, creating unaddressed security issues regarding their deployments. In the context of software engineering for artificial intelligence (SE4AI) techniques, model-based analysis has demonstrated notable potential for analyzing and monitoring machine learning models, particularly in stateful deep neural networks. However, it suffers from scalability issues when extended to LLMs due to their vast feature spaces. In this paper, we aim to address the scalability issue of model-based analysis techniques for safeguarding LLM-scale models. Motivated by the recent discovery of low-dimensional safety-critical representations that emerged in LLMs, we propose ReGA, a model-based analysis framework with Representation-Guided Abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are key directions in hidden states that indicate safety-related concepts, ReGA effectively narrows the scalability gap when developing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety. Our code is available at https://github.com/weizeming/ReGA.
Article: fse26mainb-p1401-p
PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages
Deniz Simsek,
Aryaz Eghbali, and
Michael Pradel
(University of Stuttgart, Germany; CISPA Helmholtz Center for Information Security, Germany)
Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and avoiding regressions. Creating a PoC exploit is challenging because vulnerability reports are informal and often incomplete, and because it requires a detailed understanding of how inputs passed to potentially vulnerable APIs may reach security-relevant sinks. In this paper, we present PoCGen, a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. The approach combines the complementary strengths of LLMs (e.g., understanding informal vulnerability reports), static analysis (e.g., identifying taint paths), and dynamic analysis (e.g., validating generated exploits). PoCGen successfully generates exploits for 71% of the vulnerabilities in the SecBench.js dataset. This success rate significantly outperforms a recent baseline (by 38 absolute percentage points), while imposing an average cost of only $0.02 per generated exploit. Moreover, PoCGen generates successful exploits for 60% of 126 recent real-world vulnerabilities, which helped augment five recent vulnerability reports in the GitHub Security Advisories database with PoCGen-generated PoC exploits.
Article: fse26mainb-p1422-p
ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
Doha Nam,
Taehyoun Kim,
Duksan Ryu, and
Jongmoon Baik
(KAIST, Republic of Korea; Agency for Defense Development, Republic of Korea; Jeonbuk National University, Republic of Korea)
Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide a systematic evaluation of how Code Language Models (CLMs)—specifically CodeBERT, CodeT5+, UniXcoder, and Qwen2.5—reason about code modifications. We first investigate which input encodings most effectively expose change information under five different strategies. We then design four counterfactual perturbation strategies (e.g., swapping added/deleted blocks, inverting diff polarity) to serve as diagnostic probes. We posit that if models genuinely capture change semantics, such distortions should lead to a clear decline in predictive performance. Our results show that compact diff-style encodings consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation. However, under counterfactual tests, performance remains effectively stable, revealing that what appears to be robustness in fact reflects a reliance on superficial cues rather than true semantic understanding. These findings indicate that, at least in code-change understanding tasks, current CLMs remain limited in their ability to genuinely comprehend the relational dynamics of code modifications.
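One of the counterfactual probes mentioned in this abstract, inverting diff polarity, can be sketched in a few lines of Python. This is a plausible reading of the idea rather than the ReDef implementation: the function name `invert_diff_polarity` and the assumption that inputs use unified-diff line prefixes (`+`, `-`, space) are illustrative choices.

```python
def invert_diff_polarity(diff_text: str) -> str:
    """Swap added and deleted lines in a unified-diff-style hunk.

    File header lines ("+++ ..." and "--- ...") are left unchanged. A changed
    line whose own content begins with "++" or "--" would be mistaken for a
    header; this sketch does not handle that corner case.
    """
    flipped = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            flipped.append("-" + line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            flipped.append("+" + line[1:])
        else:
            flipped.append(line)  # context lines and headers stay as they are
    return "\n".join(flipped)


if __name__ == "__main__":
    hunk = " int n = size;\n-if (n > 0)\n+if (n >= 0)\n     use(n);"
    print(invert_diff_polarity(hunk))
```

The perturbed hunk describes the opposite change, so a model that tracks change semantics would be expected to respond differently to it than to the original.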
Artifacts Available
Article: fse26mainb-p1468-p
Towards the Localization of Multi-Root-Cause Failures in Microservice Systems: An Active Intervention Framework
Yazhuo Gao,
Lin Yang,
Lianxiao Meng,
Ran Zhu, and
Yining Cao
(PLA Academy of Military Sciences, China; Information Engineering University, Zhengzhou, China; National Key Laboratory of Science and Technology on Information System Engineering, Beijing, China; National Key Laboratory of Science and Technology on Information System Engineering, China)
In large-scale microservice systems, multi-root-cause failures often intertwine, significantly increasing overall system risk and triggering a deluge of cascading alerts that pose serious challenges to fault diagnosis and recovery. Existing root-cause localization techniques remain largely passive, relying on rule-based pattern recognition or graph-based propagation inference, and thus falter when faced with the complexity of multi-root-cause failures.
To address these challenges, this paper introduces a novel active-intervention-based framework for root-cause localization. This framework uses Hierarchical Reinforcement Learning (HRL) to infer root causes and employs an Intervention-enhanced Graph ATtention network (IGAT) to predict the fault scenarios each cause may trigger. By iteratively comparing these predicted scenarios against the system’s real-time state, the framework dynamically refines its localization model.
Experimental results on two public datasets and a constructed dataset show that our method outperforms the second-best method by at least 22% on the PR@1 metric in single root cause scenarios and leads by 51.7% on the RE@3 metric in multiple root cause scenarios. These results indicate that the method may offer certain advantages in the field of fault root cause analysis.
Article: fse26mainb-p1480-p
ExpeRepair: Dual-Memory Enhanced LLM-Based Repository-Level Program Repair
Fangwen Mu,
Junjie Wang,
Lin Shi,
Song Wang,
Shoubin Li, and
Qing Wang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Beihang University, China; York University, Canada)
Automatically repairing software issues remains a fundamental challenge at the intersection of software engineering and AI. Although recent advances in Large Language Models (LLMs) have demonstrated potential for repository-level repair tasks, current methods exhibit two notable limitations: (1) they often address issues in isolation, neglecting to incorporate insights from previously resolved issues, and (2) they rely on static, rigid prompting strategies that constrain their ability to generalize across diverse and evolving contexts. We propose ExpeRepair, a novel LLM-based program repair framework inspired by the dual-memory systems of human cognition, where episodic and semantic memory synergistically support learning and decision-making. Unlike existing methods, ExpeRepair continuously learns from historical repair experiences via dual-channel knowledge accumulation, enabling it to adaptively reuse past knowledge during inference. Specifically, ExpeRepair organizes prior repair knowledge into two complementary memories: an episodic memory that stores concrete repair demonstrations, and a semantic memory that encodes abstract, reflective insights. At inference time, ExpeRepair activates both memory systems by retrieving relevant demonstrations from episodic memory and recalling high-level repair insights from semantic memory. It further enhances adaptability through dynamic prompt composition, integrating both memory types to replace static prompts with context-aware, experience-driven prompts. We evaluate ExpeRepair on two benchmarks: SWE-Bench Lite and SWE-Bench Verified. Experimental results show that ExpeRepair achieves pass@1 scores of 60.3% and 74.6% on the two benchmarks, respectively, achieving the best performance among the evaluated open-source methods. We have open-sourced ExpeRepair at https://github.com/ExpeRepair/ExpeRepair.
Article: fse26mainb-p1572-p
Bringing Managed Language Support to WebAssembly with External Library Linking
Shuyao Jiang,
Ruiying Zeng,
Yangfan Zhou, and
Michael Lyu
(Chinese University of Hong Kong, China; Fudan University, China)
WebAssembly (Wasm) has emerged as a powerful bytecode format for running applications with near-native performance in portable and secure environments. However, while Wasm currently supports compiled languages like C, C++, and Rust, it lacks robust support for managed languages such as Python, Java, and JavaScript. This limitation hinders the deployment of applications in domains like machine learning and data processing that rely heavily on managed language ecosystems. To address this, we propose WALL-E, a novel framework to integrate managed languages into Wasm environments without complex runtime nesting or recompilation. WALL-E employs a unique external library linking strategy, using a client-server architecture to connect Wasm modules with managed language libraries running in their native runtimes. This approach preserves the native execution speed and language feature compatibility of managed languages by eliminating the overhead associated with double-layer virtual machines. Our evaluation shows that WALL-E supports ten managed languages without framework modifications and achieves a speedup of hundreds of times over the runtime nesting solution, with low communication overhead. WALL-E enhances the practicality of Wasm in cloud and edge computing, enabling efficient multi-language applications.
Article: fse26mainb-p1581-p
Unleashing HPC Application Performance through Software Deployment: A Joint Model of Software Parallelism and Co-location
Yuxin Ren,
Li Zhou,
Chumin Sun,
Rui Fan,
Jie Sun,
Ning Jia, and
Xinwei Hu
(Huawei Technologies, China; Huawei Technologies, Hong Kong)
Software deployment is a critical software engineering practice, particularly for high performance computing (HPC) software. The deployment determines the software execution performance because it maps a number of software components to multiple CPUs in a server. Inappropriate mapping decreases software parallelism and increases resource contention due to software co-location on the same CPUs. However, calculating the mapping that maximizes software performance is challenging, primarily due to the lack of a joint performance model that accounts for both software parallelism and co-location. Consequently, existing industry practice has to rely on experienced engineers to manually tune the mapping during deployment, which wastes man-months of engineering effort and yields suboptimal software performance.
This paper proposes a holistic approach to mapping multiple CPUs among multiple software components to achieve better applicability and performance. We develop a performance model for predicting the performance impact of different CPU mapping configurations, along with a search algorithm to identify the best mapping scheme. Our performance model jointly considers software parallelism and co-location, breaks the performance estimation into regularized execution and an interference coefficient to improve accuracy, and integrates expert knowledge to reduce the model complexity. Our search algorithm employs a nested iterative packing algorithm to explore all possible mapping schemes, thereby uncovering the optimal solution. Evaluation on a multi-module HPC application shows 17% better performance than its default CPU mapping. Our solution has been deployed in a commercial HPC cluster with more than 50K CPU cores, delivering a 26.5% performance improvement and saving many man-months of effort spent on performance tuning.
Article: fse26mainb-p1609-p
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes
Zhaoxuan Li,
Qionglu Zhang,
Hengyuan Liu,
Xiaoyan Gu,
Xianhui Lu,
Hongbo Liu,
Bingzheng Wang,
Haihui Fan,
Ziming Zhao,
Rui Zhang, and
Li Zhou
(Institute of Information Engineering at Chinese Academy of Sciences, China; Zhejiang University, China; Institute of Software at Chinese Academy of Sciences, China)
Manual formal analysis of cryptographic schemes is labor-intensive and requires substantial expertise. While model-checking tools (e.g., Scyther and Tamarin) and computational-security tools (e.g., CryptoVerif and EasyCrypt) improve the automation of security proofs, they still rely on experts to abstract schemes and write tool-specific formal descriptions. Large language models (LLMs) are a promising alternative, but their effectiveness in this domain remains unexplored due to the absence of standardized evaluation methodologies. To fill this gap, we introduce CrypFormBench (C.F.B for short), a comprehensive benchmark jointly covering symbolic and computational security to evaluate five core LLM capabilities: interpretation, generation, completion, transformation, and correction. It comprises 700 instances spanning 677 schemes, 7 mainstream formal verifier languages, and 160 security properties. The evaluation of 9 state-of-the-art LLMs reveals that most of them perform well on interpretation and completion, given their code-awareness advantages, but struggle with generation, transformation, and correction. Overall, their performance remains limited, with Claude-3.5 achieving the highest score at 48.7 out of 100. We further provide practical guidance, e.g., few-shot prompting, Pass@K sampling, and lightweight fine-tuning, to mitigate the executability bottleneck and improve tool-usable outputs. Taken together, our benchmark and analyses offer a grounded view of current progress and concrete directions toward reliable LLM-assisted formal cryptographic analysis.
Article: fse26mainb-p1639-p
Failure-Based Testing for Deep Reinforcement Learning Agents
Weibin Lin,
Jiangtao Meng, and
Zheng Zheng
(Beihang University, China)
Deep Reinforcement Learning (DRL) agents have been widely adopted across diverse domains to address challenging decision-making problems, such as autonomous driving and robotic control. Given that many of these applications are safety- and security-critical, rigorous testing of DRL agents is indispensable. Existing testing methods are typically guided by reward signals to detect failures. However, for well-trained agents, whose performance approaches optimal levels in standard operating conditions, reward signals remain generally high, making current methods ineffective at uncovering critical failures.
To address these challenges, we propose a novel failure-based method that leverages task-induced failure insights to enhance failure detection capability while reducing the number of tests required. Since DRL agents are inherently designed around human-defined tasks, these tasks provide valuable cues about task difficulty. Intuitively, a DRL agent is more likely to fail when confronted with a more difficult task, so such tasks deserve priority. Building on this foundation, we propose Prior Random Testing (PRT), a black-box failure-based testing method that enables targeted prioritization while preserving the diversity of generated test cases. Guided by task-induced failure insights, PRT prioritizes failure-prone regions of the input domain, thereby facilitating efficient failure detection.
PRT is evaluated on four widely used benchmarks and compared with different state-of-the-art methods including fuzzing, search-based and generative-based methods. PRT ranks among the top performers in terms of both the cost of finding the first failure and the diversity of test cases. Notably, compared to random testing, PRT achieves better diversity and reduces the testing cost by over 50%.
Article: fse26mainb-p1647-p
EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems
Luan Pham,
Victor Nicolet,
Joey Dodds,
Hui Guan, and
Daniel Kroening
(RMIT University, Australia; Amazon, Canada; Amazon, USA)
Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.
Article: fse26mainb-p1676-p
GAER: Graph Auto-encoders for Unsupervised Software Architecture Recovery
Rakhshanda Jabeen,
Morgan Ericsson,
Jonas Nordqvist, and
Anna Wingkvist
(Electrolux Professional, Sweden; Linnaeus University, Sweden)
Recovering the modular architecture of software systems from source code remains challenging when documentation is incomplete or outdated. Manual recovery is labor-intensive and does not scale to large systems. Although automated techniques have been proposed, many rely on handcrafted heuristics. They may focus on a limited set of inputs (such as dependencies or text), or combine multiple inputs using fixed rules, rather than integrating structural and semantic cues in a data-driven way.
We introduce GAER, an unsupervised architecture recovery approach based on graph autoencoders. GAER models a system as a heterogeneous dependency graph with typed relations and combines dependency information, folder hierarchy, and code semantics as node features. We study two factors that influence recovery outcomes: the choice of encoder (Graph Attention Network [GAT] vs. Graph Convolution Network [GCN]) and the number of clusters used for the final decomposition. We evaluate GAER across 10 open-source systems and benchmark it against established architecture recovery baselines using standard recovery measures against ground-truth mappings. Overall, GAER is competitive with strong baselines, often achieving higher agreement with ground truth, and the GAT variant generally performs best. By integrating multiple architectural cues in a graph learning framework, GAER produces architectural views that support system comprehension and reduce the effort required to maintain architectural documentation.
Artifacts Available
Article: fse26mainb-p1793-p
Unveiling the Fragility of Binary Code Similarity Detection via Targeted Attacks with Model Explanations
Mingjie Chen,
Tiancheng Zhu,
Mingxue Zhang,
Yiling He,
Minghao Lin,
Penghui Li, and
Kui Ren
(Zhejiang University, China; Huazhong University of Science and Technology, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China; University College London, UK; Independent Researcher, USA; Columbia University, USA)
Binary code similarity detection (BCSD) serves as a fundamental technique for various software engineering tasks, e.g., vulnerability detection and classification. Attacks against such BCSD models have therefore drawn extensive attention, aiming at misleading the models to generate erroneous predictions. Prior works have explored various approaches to generating semantic-preserving variants, i.e., adversarial samples, to evaluate the robustness of the models against adversarial attacks. However, they have mainly relied on heuristic criteria or iterative greedy algorithms to locate salient code influencing the model output, which often leads to inefficient search and high computational cost. Moreover, when processing programs with high complexities, such attacks tend to be time-consuming.
In this work, we unveil the fragility of BCSD models through a novel attack framework guided by model explanations. In particular, we focus on targeted attacks where the attack goal is to mislead the model’s predictions to a specific target. Our attack leverages explainers to pinpoint critical code snippets for perturbation, reducing the exploration overhead. The evaluation results demonstrate that the proposed attacks effectively improve attack efficiency while maintaining comparable or higher success rates. Importantly, the speedup for perturbation target selection reaches up to 63.66×, demonstrating the practical value of explanation-guided localization. Our real-world case studies on vulnerability detection and classification further demonstrate the security implications of our attacks, highlighting fundamental robustness limitations in current BCSD models and the urgent need for more robust designs.
Artifacts Available
Article: fse26mainb-p1808-p
SmartDispatch: Dynamic Substitution of NumPy-Style APIs on Heterogeneous CPU-GPU Systems
Jinku Cui,
Yueming Hao,
Shuyin Jiao,
Jiajia Li, and
Xu Liu
(North Carolina State University, USA; Meta, USA)
The popularity of Python in various application domains has driven widespread adoption of NumPy-style APIs. To improve performance, libraries such as PyTorch, JAX, CuPy, and cuPyNumeric offer GPU-compatible counterparts to NumPy functions. However, substituting NumPy with these alternatives is not always beneficial due to the overheads of type conversion, data transfer, and kernel launch costs. We present SmartDispatch, a runtime framework that dynamically substitutes NumPy-style API calls with semantically equivalent implementations from other libraries to improve performance. Our system includes a knowledge base of equivalent APIs, a hardware-aware microbenchmarking component to identify substitution thresholds, and a runtime substitution tool. Evaluation on four platforms with varying CPU-GPU architectures using machine learning models from real-world benchmarks shows that consistent performance gains (1.3× to 5.8×) can be achieved without requiring code modification, demonstrating the effectiveness of cross-library substitution in heterogeneous environments.
Article: fse26mainb-p1862-p
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Qiang Ke,
Yanjie Zhao,
Hongjin Leng,
Shengming Zhao, and
Haoyu Wang
(Huazhong University of Science and Technology, China; Xiamen University Malaysia, Malaysia; Fudan University, China)
While Retrieval-Augmented Generation (RAG) is increasingly adopted to ground Large Language Models (LLMs) in software artifacts, the optimal configuration of its components remains an open question for software engineering (SE) tasks. The lack of systematic guidance forces practitioners into costly, ad-hoc experimentation. This paper presents a comprehensive, component-wise empirical study that dissects the RAG pipeline, evaluating over 21 distinct models and methods. Our study systematically isolates and evaluates 4 query processing techniques, 7 retrieval models spanning sparse, dense, and hybrid paradigms, 4 context refinement methods, and 6 distinct generators. We test these components on a suite of 3 core SE tasks: code generation, summarization, and repair. Our empirical findings reveal a crucial insight: the retriever-side components, particularly the choice of the retrieval algorithm, often exert a more significant influence on final system performance than the selection of the generator model. Strikingly, the classic lexical retriever BM25 demonstrates exceptionally robust performance across diverse tasks. Our analysis provides a practical, data-driven roadmap for researchers and practitioners, offering clear guidance on prioritizing optimization efforts when constructing effective RAG systems for software engineering contexts.
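Because the study singles out BM25, a minimal sketch of Okapi BM25 scoring may help make the lexical retriever concrete. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy corpus are all assumptions chosen for illustration; they are not the configuration evaluated in the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative defaults)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus of commit messages.
docs = [
    "fix null pointer dereference in parser",
    "add unit tests for the tokenizer",
    "parser crashes on empty input",
]
scores = bm25_scores("parser crash", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

With exact token matching, only "parser" contributes here, so the shorter of the two matching documents ranks first; this purely lexical behavior is what distinguishes BM25 from dense retrievers.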
Article: fse26mainb-p1886-p
LinkAnchor: An Autonomous LLM-Based Agent for Issue-to-Commit Link Recovery
Arshia Akhavan,
Alireza Hoseinpour,
Abbas Heydarnoori,
Hamid Bagheri, and
Mehdi Keshani
(Bowling Green State University, USA; University of Nebraska-Lincoln, USA; University of Zurich, Switzerland)
Issue-to-commit link recovery in software repositories is fundamental to software traceability and project management, yet it remains a challenging task. Prior studies show that only about 42.2% of issues on GitHub are correctly linked to their commits, highlighting the need for more effective solutions. Existing work has explored a range of ML/DL approaches, and more recently, large language models (LLMs) have been applied to this problem. However, these methods face two major limitations. First, LLMs are restricted by limited context windows and cannot simultaneously process all available data sources, such as long commit histories, extensive issue discussions, and large code repositories. Second, most approaches operate on individual issue-commit pairs, where a model independently scores the relevance of a single commit to an issue. This pairwise formulation fails to account for the complex associativity of software fixes, where an issue is often resolved by an aggregate chain of commits rather than a single atomic change. By ignoring these temporal and parental dependencies, existing methods often fail to incorporate the complete resolution logic and might misidentify intermediate commits as final fixes. Furthermore, this strategy is computationally inefficient in large repositories, as it requires exhaustively evaluating an enormous number of candidate pairs. To address these challenges, we present LinkAnchor, the first autonomous LLM-based agent designed specifically for issue-to-commit link recovery. LinkAnchor introduces a lazy-access architecture that allows the underlying LLM to dynamically retrieve only the most relevant contextual data, such as commits, issue comments, and code files, without exceeding token limits. Instead of isolated scoring, LinkAnchor treats link recovery as a dynamic search process, navigating the commit graph to identify the final resolving commit and effectively aggregating the entire chain of contributing changes. 
LinkAnchor is the first to formalize issue-to-commit link recovery (ILR) as a dynamic heuristic search over commit chains (vs. prior pairwise scoring), enabling aggregate reasoning that recovers distributed fixes (46% of cases). Our evaluations show that LinkAnchor outperforms state-of-the-art baselines by 41–714% in Hit@1 across six large-scale open-source projects, while costing only about 0.01 US dollars per issue. Finally, LinkAnchor is designed and tested for both GitHub and Jira, and its modular architecture makes it straightforward to extend to other platforms.
Artifacts Available
Article: fse26mainb-p1979-p
How Low Can You Go? The Data-Light SE Challenge
Kishan Kumar Ganguly and
Tim Menzies
(North Carolina State University, USA)
Much of software engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested?
The counter-evidence presented in this paper suggests otherwise. For over 100 optimization tasks from recent SE papers (including software configuration, performance tuning, product line engineering, project health forecasting, defect prediction, software testing, software process and cost estimation, and cross-domain generalization datasets), even with just a few dozen labels, very simple methods (e.g., diversity sampling, a minimal Bayesian learner, its distance-based non-parametric variant, or random probes) achieve over 90% of the best reported results. Furthermore, these simple methods perform just as well as more complex state-of-the-art optimizers like SMAC, TPE, DEHB, etc. While some tasks would require better outcomes and more sampling, these results seen after a few dozen samples would suffice for many engineering needs (particularly when the goal is rapid and cost-efficient guidance rather than slow and exhaustive optimization).
To say that another way, at least some SE tasks are better served by lightweight approaches that demand fewer labels and far less computation. We hence propose the data-light challenge: when will a handful of labels suffice for SE tasks? To enable a large-scale investigation of this issue, we contribute (1) a mathematical formalization of labeling, (2) lightweight baseline algorithms, and (3) results on public-domain data showing the conditions under which lightweight methods excel or fail.
For the purposes of open science, our scripts and data are online at https://github.com/KKGanguly/NEO.
Article: fse26mainb-p2072-p
iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation
Junyi Wang,
Jialun Cao, and
Zhongxin Liu
(Zhejiang University, China; Hong Kong University of Science and Technology, China)
Automatically generating bug reproduction tests (BRT) from issue descriptions is crucial for facilitating software maintenance. Large Language Model (LLM)-based approaches have shown great potential for this task. Their effectiveness heavily relies on retrieving high-quality context from the codebase. The retrieval phase of existing approaches relies on either traditional methods like BM25 or modern LLM-driven strategies. The LLM-based retrieval strategies typically involve equipping an LLM with tools to autonomously explore the code repository or having it select the most relevant files and code snippets from a provided list as context. However, these retrieval methods suffer from three key limitations: (1) They often employ a unified strategy for retrieving both source code and test cases, overlooking their distinct retrieval requirements. (2) They focus solely on semantic similarity, ignoring function call relationships that reflect behavioral relevance, which often leads to the retrieval of irrelevant context. (3) The retrieval lacks a feedback loop from the generation phase, preventing it from refining the context based on execution results. These limitations collectively result in low-quality context, thereby hindering the accuracy of bug reproduction.
To address these challenges, we propose iCoRe, an iterative, correlation-aware context retrieval approach. iCoRe is explicitly designed to be aware of three key correlations: (1) the correlation between source code and test cases, which requires differentiated retrieval, (2) the correlation between textual semantics and function call structures for accurate relevance assessment, and (3) the correlation between the retrieval and generation phases, which enables iterative feedback and refinement. To evaluate iCoRe, we integrate it with an LLM-based BRT generator and conduct a comprehensive evaluation on the SWT-bench Lite and TDD-bench Verified benchmarks. Experimental results show that our method achieves Fail-to-Pass rates of 42.0% and 52.8%, respectively, representing significant relative improvements of 19.7%–31.7% over existing retrieval methods.
Article: fse26mainb-p2166-p
Chiseling Out Efficiency: Structured Skeleton Supervision for Efficient Code Generation
Yu Yu,
Zhihong Sun,
Jia Li,
Yao Wan,
Chuanyi Li,
Hongyu Zhang,
Ruyun Wang,
Tao Huang,
Zhi Jin,
Ge Li, and
Chen Lyu
(Shandong Normal University, China; Tsinghua University, China; Huazhong University of Science and Technology, China; Nanjing University, China; Chongqing University, China; Institute of Information Engineering at Chinese Academy of Sciences, China; Peking University, China)
Large Language Models (LLMs) are capable of generating syntactically correct and functionally complete programs, greatly streamlining software development. However, recent studies reveal that these programs typically execute substantially slower than human-optimized counterparts. Existing approaches to bridging this efficiency gap typically involve either iteratively optimizing code after generation or fine-tuning models on corpora of efficient code. Yet, these methods expose the model to efficiency signals only by mimicking complete, optimized solutions, without explicitly encoding the structural code patterns essential for achieving high runtime performance. Addressing this gap presents two core challenges: (1) extracting and representing latent, efficiency-oriented structural patterns embedded within complex syntax and control flows, and (2) effectively learning these patterns without destabilizing the semantic training of LLMs. To tackle these challenges, we propose EffiSkel, an efficiency skeleton-guided framework that explicitly extracts and learns efficiency skeletons (abstract, reusable structural patterns underpinning efficient code) by leveraging three complementary strategies: lexical analysis based on token-frequency saliency, syntactic analysis using similarity over Abstract Syntax Trees (ASTs), and dynamic line-level profiling of execution time. These skeletons are integrated into a multi-task learning regime that jointly optimizes code generation and skeleton prediction, introducing an explicit inductive bias toward efficiency-aware code generation. Experiments across multiple programming languages and benchmarks demonstrate that EffiSkel significantly enhances both functional correctness and efficiency. On Mercury with DeepSeek-Coder (6.7B), it achieves a +11.11% (vs. EffiCoder) and +3.71% (vs. CodeDPO) higher Efficiency Ratio (ER), and a +0.36 (vs. EffiCoder) and +0.22 (vs. CodeDPO) increase in Average Speedup (AS).
These results highlight the effectiveness of explicitly modeling efficiency skeletons in improving the runtime performance of code generated by LLMs.
Article: fse26mainb-p2213-p
Structure-Aware Delta Debugging with Geometric-Information Weights
Yonggang Tao and
Jingling Xue
(UNSW, Sydney, Australia)
Delta debugging is a fundamental technique for automatically minimizing failure-inducing inputs. ProbDD improves ddmin via probability-guided search, and Weighted Delta Debugging (WDD) further incorporates token-based weighting to mitigate size disparities. However, token counts measure textual volume rather than structural complexity, potentially misrepresenting reduction difficulty.
We propose Structure-Aware Delta Debugging (SADD), which models structural complexity using geometric and information-theoretic properties of the syntax tree. SADD defines a unified weight function that integrates geometric volume, decision uniformity, and effective branching complexity. We instantiate this model in two variants: SAddmin, which performs structure-aware partitioning within ddmin, and SAProbDD, which injects structural weights into ProbDD's gain function, leaving the underlying search control unchanged.
We evaluate SADD against ddmin, ProbDD, and WDD (instantiated as Wddmin and WProbDD) within two state-of-the-art frameworks, HDD and Perses, on 62 real-world C and XML benchmarks. Under HDD, SADD reduces average debugging time by up to 57.12% over ddmin and 15.04% over Wddmin on C, and by 30.12% over ddmin on XML; improvements over ProbDD reach 14.64% on C and 20.10% on XML. Gains are smaller under Perses or on structurally simple inputs, where token-based weighting already suffices. An ablation study shows that geometric volume forms the foundation of improvement, while information-theoretic metrics provide complementary refinements. Overall, explicitly modeling structural complexity improves delta debugging when input hierarchies exhibit meaningful structural diversity.
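For orientation, the ddmin baseline that SADD builds on can be sketched as follows. This is a simplified rendering of classic delta debugging with equal-sized partitions, not the authors' SADD; the structural weight function described above is not modeled, and the failure predicate is a hypothetical stand-in for a real test oracle.

```python
def split(items, n):
    """Partition items into n contiguous, non-empty chunks of near-equal size."""
    chunks, start = [], 0
    for i in range(n):
        end = start + (len(items) - start) // (n - i)
        chunks.append(items[start:end])
        start = end
    return chunks

def ddmin(items, fails):
    """Return a 1-minimal sublist of items on which fails() still returns True."""
    assert fails(items), "the full input must trigger the failure"
    n = 2  # current granularity (number of partitions)
    while len(items) >= 2:
        subsets = split(items, n)
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # Reduce to the failing subset and restart at coarse granularity.
                items, n, reduced = subset, 2, True
                break
            if fails(complement):
                # Drop one chunk and keep the remaining granularity.
                items, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(items):
                break  # already at single-element granularity: result is 1-minimal
            n = min(n * 2, len(items))
    return items

# Hypothetical oracle: the failure needs both element 3 and element 6.
minimal = ddmin(list(range(8)), lambda s: 3 in s and 6 in s)
```

Weighted variants such as WDD and SADD change how the partitions are formed (by token count or structural weight rather than element count) while keeping this overall reduce-or-refine loop.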
Artifacts Available
Article: fse26mainb-p2326-p
Compiling Code LLMs into Lightweight Executables
Jieke Shi,
Junda He,
Zhou Yang,
Chengran Yang,
Mykhailo Klymenko,
Thong (James) Hoang,
Sherry (Xiwei) Xu,
Zhenchang Xing, and
David Lo
(Singapore Management University, Singapore; University of Alberta, Canada; Alberta Machine Intelligence Institute, Canada; CSIRO's Data61, Australia)
The demand for better prediction accuracy and higher execution performance in neural networks continues to grow. The emergence and success of Large Language Models (LLMs) have produced many cloud-based tools for software engineering tasks such as code suggestion. Although effective, cloud deployment raises concerns over privacy, latency, and reliance on network connectivity. Running LLMs locally on personal devices such as laptops would address these issues, because it enables offline use and reduces response time. However, local deployment is challenging, since commodity devices lack high-performance accelerators such as GPUs and are constrained by limited memory and compute capacity, which makes it hard to execute large models efficiently.
We present Ditto, a framework that optimizes both the model size of Code LLMs and the inference programs that execute them, with a focus on statically-typed languages such as C. Our approach integrates two components. The first is a quantization technique inspired by product quantization, which groups model parameters into per-block codebooks via K-Means clustering and stores each weight as a bit-packed low-bitwidth index. The quantizer further supports a mixed-precision mode that keeps a small number of sensitivity-critical tensors in float32. The second component is a compilation pass integrated into LLVM that automatically detects and replaces unoptimized General Matrix-Vector Multiplication (GEMV) operations, which are the most computationally intensive part of code models, with calls into Basic Linear Algebra Subprograms (BLAS) libraries that are highly optimized for the target hardware. The output of Ditto is a compiled executable that runs the selected Code LLM on commodity hardware.
We evaluate Ditto on three popular Code LLMs, namely Code Llama, MagicCoder, and OpenCodeInterpreter, achieving up to 10.5× faster inference, 6.4× lower memory usage, and 10.5× lower energy consumption compared with their original inference pipelines, while preserving accuracy close to the full-precision models, with an average loss of only 0.27% in pass@1. Ditto also outperforms the state-of-the-art int8 quantization baseline, achieving up to 6.96% higher pass@1 accuracy, 2.2× speedup, and 1.6× memory reduction, which demonstrates the effectiveness of our approach.
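The quantization idea described above (a per-block codebook learned by K-Means, with each weight stored as a bit-packed low-bitwidth index) can be illustrated with a small sketch. This is not Ditto's implementation: the scalar 1-D K-Means, the evenly spaced initialization, the 2-bit width, the function names, and the toy weight block are all assumptions, and the mixed-precision mode and the LLVM pass are not shown.

```python
def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on scalars; centroids start evenly spaced over the range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[nearest].append(v)
        # Keep the old centroid when a bucket ends up empty.
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids

def quantize_block(values, bits=2):
    """Return (codebook, packed) where packed holds one `bits`-wide index per weight."""
    k = 1 << bits
    codebook = kmeans_1d(values, k)
    packed = 0
    for pos, v in enumerate(values):
        idx = min(range(k), key=lambda c: abs(v - codebook[c]))
        packed |= idx << (pos * bits)  # bit-pack the codebook index
    return codebook, packed

def dequantize_block(codebook, packed, count, bits=2):
    """Unpack the indices and look each one up in the block's codebook."""
    mask = (1 << bits) - 1
    return [codebook[(packed >> (pos * bits)) & mask] for pos in range(count)]

# Hypothetical block of eight weights, stored in 8 * 2 = 16 bits plus a 4-entry codebook.
block = [-0.9, -0.88, -0.1, -0.12, 0.1, 0.11, 0.9, 0.92]
codebook, packed = quantize_block(block, bits=2)
restored = dequantize_block(codebook, packed, len(block), bits=2)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

The memory saving comes from replacing each 32-bit float with a 2-bit index, at the cost of storing one small codebook per block; smaller blocks lower the reconstruction error but raise the codebook overhead.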
Article: fse26mainb-p2335-p
Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection
Xin Peng,
Bo Lin,
Jing Wang,
Xiaoling Li,
Jun Ma,
Jie Yu,
Xiaoguang Mao, and
Shangwen Wang
(National University of Defense Technology, China)
Automated vulnerability detection is crucial for enhancing software security by identifying potential flaws that attackers could exploit, thereby reducing the reliance on labor-intensive manual code audits. Recent advancements have shifted towards leveraging large language models (LLMs) for vulnerability detection, with techniques like Vul-RAG and VulnSage demonstrating progress through structured prompting and external knowledge integration. However, these approaches typically rely on a single reasoning paradigm, limiting their ability to address the complex and diverse nature of real-world vulnerabilities. To overcome these limitations, we propose ReasonVul, a novel multi-perspective reasoning framework that harnesses cognitive synergy among three specialized LLM agents, each embodying a distinct reasoning mode. The framework begins with independent analyses of the source code, followed by a structured debate mechanism to resolve conflicts through iterative rebuttal and revision, ultimately converging on a collaborative judgment. Evaluated on the PrimeVul dataset, ReasonVul achieves a PairAcc of 40.00% and an F1-score of 72.52%, surpassing the best baseline by 81.24% in PairAcc. Further tests on the JITVUL dataset confirm its generalizability, with a PairAcc of 28.67%. Additionally, we analyzed 542 conflict cases and found that 389 were correctly resolved, highlighting the framework's ability to uncover hidden vulnerabilities through the error-correction mechanism driven by the debate. This work emphasizes the importance of multi-perspective reasoning and collaborative validation in achieving robust and comprehensive vulnerability detection in real-world software systems.
Article: fse26mainb-p2347-p
One Size Does Fit All: Exploring Model Fusion for Software Engineering Tasks
Yinggang Qiu,
Yihao Qin,
Mingyang Geng,
Shangwen Wang, and
Dezun Dong
(National University of Defense Technology, China)
Large language models (LLMs) have achieved remarkable performance in software engineering (SE), and fine-tuning LLMs for specific SE tasks has gradually become a new paradigm. However, storing fine-tuned checkpoints for multiple tasks incurs heavy storage costs and deployment complexity. Model fusion, which operates on fine-tuned parameters, offers excellent parameter compression and scalability, yet its effectiveness in the SE domain remains underexplored, making such an investigation essential for guiding the development of customized fusion techniques for the SE domain. To bridge this gap, we conduct a systematic study of model fusion in SE contexts and reveal the following major findings: (1) when fusing programming languages (PLs) within the same task, model fusion usually works well and can enhance the performance of PLs with less data when PLs share similar features; (2) when fusing SE tasks of the same category within the same PL, all methods except TALL-Masks generally suffer substantial performance degradation on specific tasks; (3) when fusing SE tasks of different categories across different PLs, all existing model fusion methods exhibit significant performance degradation on certain tasks. In our evaluation results, TALL-Masks, which introduces a mask for each task to extract the most relevant dimensions from the fusion parameters, achieves promising performance. However, during parameter fusion, weak features (i.e., small variation in fine-tuned parameters) are easily overshadowed by strong ones (i.e., large variation in fine-tuned parameters), causing the constructed masks to fail to extract the most relevant parameters. To overcome this situation, we propose an improved version of TALL-Masks, called Scaling-Masks. The key idea is to amplify weak features to prevent them from being overshadowed by strong ones, which is achieved by scaling the value range of weak features to match that of strong features.
Experimental results demonstrate that Scaling-Masks can significantly improve fusion performance for tasks with extremely weak features without affecting other tasks, with normalized accuracy improved by 63.49% for vulnerability detection when fusing SE tasks of different categories and 24.02% for PHP when fusing PLs in the code repair task.
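The overshadowing effect described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes task vectors (fine-tuned minus pre-trained parameters) stored as flat lists, a plain sum as the fusion step, a threshold rule in the spirit of TALL-Masks with a hypothetical hyperparameter `lam`, and max-magnitude rescaling as one possible reading of "scaling the value range". All function names are hypothetical.

```python
def fuse(task_vectors):
    """Sum task vectors element-wise (a simple stand-in for the fusion step)."""
    return [sum(vals) for vals in zip(*task_vectors)]

def build_mask(task_vec, fused, lam=1.0):
    """Keep a dimension when the task's own change is at least as large as
    the contribution of all other tasks (assumed TALL-Masks-style rule)."""
    return [1 if abs(t) >= lam * abs(f - t) else 0 for t, f in zip(task_vec, fused)]

def scale_to_match(task_vectors):
    """Rescale each task vector so its peak magnitude equals the global peak
    (assumed interpretation of matching weak and strong value ranges)."""
    target = max(max(abs(x) for x in v) for v in task_vectors)
    scaled = []
    for v in task_vectors:
        peak = max(abs(x) for x in v)
        factor = target / peak if peak > 0 else 1.0
        scaled.append([x * factor for x in v])
    return scaled

# Toy task vectors: one task with large parameter changes, one with tiny ones.
strong = [0.8, 0.05, 0.1, 0.0]
weak = [0.0, 0.02, 0.0, 0.01]

# Without scaling, dimension 1 of the weak task is overshadowed and masked out.
mask_raw = build_mask(weak, fuse([strong, weak]))

# After scaling, the same dimension survives in the weak task's mask.
scaled_strong, scaled_weak = scale_to_match([strong, weak])
mask_scaled = build_mask(scaled_weak, fuse([scaled_strong, scaled_weak]))

print(mask_raw, mask_scaled)
```

In this toy setting the weak task's second dimension is dropped from the unscaled mask but retained once both vectors share the same value range.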
Article: fse26mainb-p2398-p
Improving Data Leakage Detection in Machine Learning Notebooks through Static Slicing and Structured LLM Prompts
Taha Draoui,
Mohamed Wiem Mkaouer, and
Christian Newman
(University of Michigan-Flint, USA; Rochester Institute of Technology, USA)
Data leakage remains a critical yet under-diagnosed issue in machine learning pipelines, leading to inflated results and unreliable deployments. Existing detection approaches rely on static rules that often miss open-coded manipulations and fail to capture the diversity of real-world notebooks. This paper introduces a novel methodology that integrates static slicing with large language models (LLMs) to improve leakage detection. We use a Datalog-based static analysis that isolates compact, provenance-aware slices corresponding to model training and evaluation pairs, and we pair these with structured LLM prompts that guide step-by-step reasoning about potential leakage for each isolated slice. Evaluated on a curated benchmark of Python notebooks from Kaggle and GitHub, our approach achieves state-of-the-art performance in both preprocessing and overlap leakage detection, improving F1 scores over the previous state-of-the-art by 22% for preprocessing leakage and 15% for overlap leakage. Beyond these improvements, our slicing-based methodology substantially outperforms end-to-end prompting, demonstrating that precise program slicing is key to enabling LLMs to reliably detect leakage. Our findings highlight the effectiveness of combining program slicing and prompt engineering for data leakage detection and establish the first systematic LLM-based solution for detecting data leakage in machine learning code.
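For readers unfamiliar with the two leakage types named above, the following minimal sketch shows what each looks like in code. It only illustrates the defects being detected, not the paper's slicing or prompting pipeline; the data and the helper `has_overlap` are hypothetical.

```python
from statistics import mean

# Toy dataset split into training and test rows.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
train, test = data[:4], data[4:]

# Preprocessing leakage: the centring statistic is computed on all rows,
# so test values influence the features the model is trained on.
leaky_center = mean(data)

# Leak-free alternative: the statistic is computed on training rows only.
clean_center = mean(train)

def has_overlap(train_rows, test_rows):
    """Overlap leakage check: True if any row appears in both splits."""
    return bool(set(train_rows) & set(test_rows))

print(leaky_center, clean_center, has_overlap(train, test))
```

The gap between the two centring values shows how a statistic fitted before the split carries test information into training.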
Article: fse26mainb-p2417-p
Influence-Aware Bayesian-Inspired Token Reweighting for Improved Code Generation
Yuqi Zhu,
Ge Li,
Hong Mei,
Zhi Jin,
Jia Li,
Qibin Zheng, and
Jieyuan Zhang
(Academy of Military Sciences, China; Peking University, China; Wuhan University, China; Advanced Institute of Big Data, Beijing, China)
Large language models (LLMs) have achieved remarkable progress in code generation, yet the structural properties of programming languages introduce distinctive challenges. In particular, program correctness is disproportionately influenced by a subset of structurally critical tokens, such as API names, variable identifiers, and control-flow keywords, termed influential tokens. Errors in predicting these tokens often propagate and accumulate through subsequent decoding steps, leading to substantial degradation in overall correctness. Addressing the heterogeneous difficulty of predicting such tokens is therefore crucial for improving the reliability of code generation.
To address this challenge, we introduce Influence-Aware Bayesian Code Generation (I-BAYGEN), a framework that explicitly handles influential tokens. The framework consists of two components. First, it identifies influential tokens using a loss-based detection mechanism, and measures the degree of influence of each token in three ways. Second, to handle influential tokens, we introduce auxiliary reasoning paths as additional evidence to refine the token distribution during code generation in a Bayesian-inspired manner. To capture structural dependencies, we incorporate influence scores as adaptive weights in the self-rewarding mechanism, encouraging greater optimization emphasis on structurally critical tokens. Using the influence-aware reweighting mechanism, the framework provides differentiated treatment to tokens based on their prediction difficulties, with influential tokens receiving enhanced attention through a reward weighting scheme and deeper reasoning processes. Comprehensive experiments on competition-level programming benchmarks demonstrate that I-BAYGEN achieves up to 47.2% relative improvement in correctness over the state-of-the-art non-weighted approach. We further show that I-BAYGEN generalizes robustly across multiple programming languages and out-of-distribution scenarios, highlighting its potential for real-world code generation tasks. Moreover, qualitative analysis reveals that the framework produces reasoning paths that are more interpretable and logically coherent than those of the non-weighted method, effectively addressing heterogeneous token difficulty in code generation.
Article: fse26mainb-p2439-p
pPatch: Automated Vulnerability Unpatching
Tianyi Jing,
Pengyu Ding,
Meng Xu,
Yinhao Hu,
Zheng Yu, and
Dongliang Mu
(Huazhong University of Science and Technology, China; University of Waterloo, Canada; Zhongguancun Laboratory, China; Northwestern University, USA)
Unpatching, the process of reverting security patches to reintroduce historical vulnerabilities into newer software versions, is valuable for creating realistic benchmarks to evaluate security analysis tools. However, this process is challenging due to code evolution, leading to context conflicts, compilation errors, or untriggerable issues. In fact, 61.25% of Linux kernel security patches we examined cannot be trivially reverted to recent versions. To address this, we propose pPatch, an automated framework designed to systematically unpatch security vulnerabilities and generate vulnerability benchmarks. pPatch overcomes the limitations of naive reversion by employing a novel approach that progressively consults conflicting commits to identify and integrate necessary code changes, aiming for minimal modifications to preserve program semantics while successfully re-exposing the original vulnerability and minimizing unintended side effects. We then use pPatch to unpatch 614 historic kernel vulnerabilities from Linux kernel v6.6 and v6.12, resulting in 371 and 353 successfully unpatched vulnerabilities, respectively, with manual analysis.
Article: fse26mainb-p2454-p
SpecWeaver: End-to-End HTTP API Specification Inference across Multi-layer Routing in Production Web Services
Wenbo Hu,
Jie Lu,
Jingting Chen,
Feng Li,
Chenghang Shi,
Xiaonan Shi,
Jinchen Wang, and
Wei Huo
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Institute of Computing Technology at Chinese Academy of Sciences, China)
HTTP API specifications are essential for modern web development, yet existing tools fail in production environments due to multi-layer routing. Production deployments employ infrastructure-level and framework-level routing that apply sequential rewrite and dispatch rules, creating a gap between client-visible external paths and internal paths. To address this challenge, we present SpecWeaver, the first tool to automatically extract and unify heterogeneous configuration-defined routing rules with code-level handlers. Our approach combines routing component information gathering, iterative configuration discovery using LLMs, and routing extraction from diverse configuration files to construct a routing graph representation that captures end-to-end routing relationships. Evaluated on 10 production applications, SpecWeaver extracts 36,361 rewrite rules and 48,394 dispatch rules with 99.44%/100% precision, materializes 8,288 external API paths, contributes 5,116 previously undocumented APIs with 160 unauthenticated endpoints, helps an existing testing tool improve testing coverage by 328.6% compared to the baseline, and discovers 305 bugs.
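The gap between external and internal paths can be illustrated with a minimal sketch of sequential rewrite layers. It shows only the forward direction (external path to internal path) that creates the problem, not SpecWeaver's inference; the rule tables, paths, and function names are hypothetical.

```python
import re

# Hypothetical rewrite rules as (pattern, replacement), one list per routing layer.
INFRA_REWRITES = [(r"^/api/v1/(.*)$", r"/backend/\1")]
FRAMEWORK_REWRITES = [(r"^/backend/users/(\d+)$", r"/internal/user?id=\1")]

def apply_layer(path, rules):
    """Apply the first matching rule of one layer; pass the path through otherwise."""
    for pattern, replacement in rules:
        if re.match(pattern, path):
            return re.sub(pattern, replacement, path)
    return path

def resolve(external_path, layers):
    """Run a client-visible path through each routing layer in order."""
    path = external_path
    for rules in layers:
        path = apply_layer(path, rules)
    return path

internal = resolve("/api/v1/users/42", [INFRA_REWRITES, FRAMEWORK_REWRITES])
print(internal)
```

A specification derived from the handler alone would document the internal path, while clients only ever see the external one, which is why the layers have to be unified.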
Article: fse26mainb-p2477-p
Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement
Alexandros Evangelidis,
Gricel Vázquez, and
Simos Gerasimou
(University of York, UK; Cyprus University of Technology, Cyprus)
Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional policy synthesis methods, they fail to scale to large state spaces. Our approach addresses this issue and accelerates policy synthesis in large MDPs by dynamically refining the MDP and iteratively selecting the most fragile MDP regions for refinement. This iterative procedure offers a balance between accuracy and efficiency, as refinement occurs only when necessary. We formally show that the composed policy is near-optimal under standard assumptions, with error bounded by the local solver tolerance and boundary mismatch. Across diverse case studies and MDPs up to 1M states, we demonstrate that our approach achieves up to 2× speedup over PRISM, offering a competitive solution for real-world policy synthesis in large MDPs.
Article: fse26mainb-p2516-p
Red Teaming LLMs via Linguistic-Aware Fuzzing
Shuai Yuan,
Nian Luo,
Jingling Sun,
Yihao Huang, and
Chengyu Zhang
(University of Electronic Science and Technology of China, China; National University of Singapore, Singapore; Loughborough University, UK)
Safety alignment aims to prevent Large Language Models (LLMs) from producing harmful content. However, safety alignment remains vulnerable to malicious instructions. Red teaming is a critical methodology for identifying such vulnerabilities in LLMs. Existing approaches often rely on jailbreak templates or rule-based transformations, limiting both the diversity of the generated tests and the continued testing capability of these approaches. To address these limitations, we propose Lingfuzz, a linguistic-aware fuzzing framework for continued red teaming of LLMs. The key idea of this framework is to mutate existing malicious instructions at the lexical and syntactic levels while preserving their malicious intent. Such mutations enable the generation of diverse malicious instructions due to the unlimited space of lexical and syntactic choices, and provide a continued testing capability by iteratively mutating the mutants. We evaluated Lingfuzz on five aligned commercial LLMs and one open-source LLM in black-box testing. The results show that Lingfuzz triggers safety alignment vulnerabilities in 71.0% of the cases in the JailbreakBench benchmark, higher than the second-highest baseline at 63.8%. Lingfuzz also possesses outstanding multilingual testing capabilities that far exceed other baselines, with an ESR advantage of approximately 24% on the Jailbench benchmark. The malicious instructions generated by Lingfuzz have almost three times higher diversity than previous work according to the self-BLEU metric. Lingfuzz also demonstrates strong continued testing capability, showing five times less sensitivity to LLM evolution than other approaches.
Article: fse26mainb-p2562-p
Sound Termination and Non-termination Analysis of C Programs with Bit-Precise Bounded Semantics and Advanced Constructs
Negar Fathi,
Hiroshi Unno,
Tachio Terauchi, and
Rahul Purandare
(University of Nebraska-Lincoln, USA; Tohoku University, Japan; Waseda University, Japan)
Program termination and non-termination analysis is a foundational problem in formal verification with important implications for software safety and reliability. Despite extensive research, existing techniques struggle with real-world C programs that manipulate complex data types such as pointers, arrays, and structures, or that perform low-level operations such as bitwise arithmetic and bounded integer computations. This paper introduces Athena, a framework for sound termination and non-termination analysis of C programs that models finite-width and bit-precise integer semantics and supports advanced constructs. Athena combines pointer-to-array rewriting, bounded integer semantics enforced via modulo arithmetic or bit-vector semantics, and an extended translation to Labeled Transition Systems (LTS), yielding structured, analyzable representations suitable for logic-based reasoning. Our analysis engine builds on MuVal, a modular verification engine based on the first-order fixpoint logic µCLP with background theories, and extends it with support for array, tuple, and bit-vector theories in ranking function synthesis and recurrent set detection. We evaluate Athena on the 2024 Termination Competition (TermCOMP) and on 117 real-world benchmarks featuring 445 non-termination bugs, after excluding benchmarks that rely on undefined behavior. It achieves 60.95% correctness on the real-world benchmarks and 76.28% on TermCOMP, while producing zero wrong results across both suites. These results highlight Athena’s strong combination of precision and soundness for the termination and non-termination analysis of complex C programs.
Artifacts Available
Article: fse26mainb-p2604-p
Cost-Effective Testing of MPC Compilers
Sebastian Watzinger,
Valentin Wüstholz,
Deepak Garg, and
Maria Christakis
(TU Wien, Austria; Diligence Security, Austria; MPI-SWS, Germany)
Secure multi-party computation (MPC) enables privacy-preserving computations using secret data, with applications ranging from health care and finance to machine learning and blockchains. MPC compilers translate high-level function descriptions to the low-level representations required for the actual execution, making them critical for both usability and scalability of MPC. However, these compilers may contain logic bugs that cause them to quietly produce wrong outputs, the consequences of which could be catastrophic given the sensitive applications of this technology. Testing MPC compilers in order to find these severe bugs is, therefore, paramount. With only a single testing tool currently available (which is not publicly available in its entirety and has several serious limitations), this issue is far from resolved. In this paper, we present BabelFuzz, a cost-effective framework for testing MPC compilers. By introducing an expressive intermediate representation (IR) for its seed-program generation, BabelFuzz is able to support multiple compilers that use different input languages, while keeping the development effort of adding new targets low. Even better, this approach allows us to translate our IR to mainstream languages, which provides a powerful differential-testing oracle for highly efficient bug detection. BabelFuzz not only found 27 new logic bugs across four MPC compilers, but it is also able to rediscover every fixed bug the previous state of the art in testing MPC compilers found.
Article: fse26mainb-p2665-p
Automated Repair of TEE Partitioning Issues via DSL-Guided and LLM-Assisted Patching
Chengyan Ma,
Jieke Shi,
Ruidong Han,
Ye Liu,
Feng Li,
Yuqing Niu, and
David Lo
(Singapore Management University, Singapore)
Trusted Execution Environments (TEEs) provide hardware-based isolation to protect sensitive data and computations from potentially compromised operating systems (OS). However, TEE applications inevitably interact with the untrusted OS through SDK interfaces, and improper partitioning can introduce severe vulnerabilities such as data leakage and code injection. While prior work has proposed static analysis tools to detect such issues, automated repair remains largely unexplored. This problem is particularly challenging due to three TEE-specific factors: the lack of standardized secure development guidelines, the difficulty of extracting semantic information from low-level C code, and the absence of mature testing and validation methods. In this work, we present TEERepair, a framework for automatically repairing bad partitioning issues in TEE applications. Our approach tackles the above challenges by introducing a domain-specific language (DSL) to encode repair rules that express and capture common TEE security patterns, which are instantiated as patch templates with placeholders for context-specific variables. We then leverage large language models (LLMs) to reason about code semantics and synthesize context-aware patches, and further generate test clients to validate the repairs. We evaluate TEERepair on the TEE Partitioning Errors Benchmark (PartitioningE-Bench), achieving a significantly higher repair success rate of 87.6% compared to baselines. Furthermore, applying TEERepair to real-world TEE projects, we submitted 5 repair pull requests, 2 of which have been confirmed and merged by project maintainers.
Article: fse26mainb-p2722-p
Automated Repair of Requirements for Cyber-Physical Systems in Simulink Requirements Tables
Aren A. Babikian,
Alessio Di Sandro,
Federico Formica,
Claudio Menghi, and
Marsha Chechik
(University of Toronto, Canada; McMaster University, Canada; University of Bergamo, Italy)
The development of complex software systems, e.g., cyber-physical systems (CPSs), involves continuous evolution of both system implementations and their requirements. These two artifacts often proceed independently, creating a risk of misalignment. For example, a system may be updated due to implementation-level concerns, yielding a new version that no longer satisfies its original requirements. Traditional compliance recovery techniques, e.g., automated program repair, address this problem by modifying the system while assuming that requirements are correct. However, faulty, outdated or inadequate requirements are a well-documented challenge in practice, motivating the complementary task of requirement repair. In this paper, we propose a framework that leverages system execution data to repair misaligned CPS requirements, thereby restoring requirement-to-system compliance. Our approach evaluates the correctness of declarative requirements over time-based, real-valued signals expressed using the MATLAB Simulink Requirements Tables language. We evaluate seven variants of our framework on six real-world case studies covering 12 requirements. Results confirm the effectiveness of the proposed framework in producing correct and useful repaired requirements.
Artifacts Available
Article: fse26mainb-p2723-p
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
Leizhen Zhang,
Shuhan Chen, and
Sheng Chen
(University of Louisiana at Lafayette, USA; East China Normal University, China)
Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions (Vertex Cover and a discrete 3D-packing formulation) designed to probe representation-invariant reasoning. Our evaluation begins with the conventional lens (accuracy/precision/recall/F1) and the phase-transition setting. We find that traditional metrics are frequently misleading. Models achieve high scores even though they tend to classify all formulas as satisfiable, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables N grows.
To address this, we introduce a paired-formula protocol (minimally different satisfiable/unsatisfiable instances) and a new measure, Accurate Differentiation Rate (ADR), which requires the predictions on both members of each pair to be correct. ADR cleanly separates reasoning-oriented models from heuristic ones and correlates with witness validity (truth assignments that actually satisfy the formula). Extending beyond CNF, we test cross-representation consistency via standard reductions: (i) converting CNF to Vertex Cover and (ii) converting 3-SAT to discrete 3D packing with verifiable placement constraints. Decisions made on CNF and on their graph/packing counterparts agree for most models on more than 80% of instances, revealing stable decision rules across representations. A leading model (e.g., GPT-5) achieves both high invariance and correctness on small N, but still suffers scale-induced degradation.
Taken together, our results support the thesis that SAT is a conservative probe for LLM reasoning: performance on SAT predicts transfer to other NP-style reductions, while paired evaluation with ADR provides a faithful, representation-robust assessment beyond conventional metrics.
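The difference between conventional accuracy and the paired measure can be sketched as follows. This is a minimal illustration based only on the definition given in the abstract (a pair counts when both of its members are predicted correctly); the data layout and function names are assumptions.

```python
def accuracy(pairs):
    """Per-formula accuracy. Each pair is (pred_on_sat, pred_on_unsat), where
    a prediction of True means 'the model answered satisfiable'."""
    correct = sum(int(p_sat) + int(not p_unsat) for p_sat, p_unsat in pairs)
    return correct / (2 * len(pairs))

def adr(pairs):
    """Accurate Differentiation Rate: fraction of pairs where the satisfiable
    member is called satisfiable AND the unsatisfiable member is not."""
    both_correct = sum(1 for p_sat, p_unsat in pairs if p_sat and not p_unsat)
    return both_correct / len(pairs)

# A hypothetical model that answers 'satisfiable' for every formula.
always_sat = [(True, True)] * 4

# A hypothetical model that differentiates two of the four pairs.
mixed = [(True, False), (True, True), (False, False), (True, False)]

print(accuracy(always_sat), adr(always_sat))
print(accuracy(mixed), adr(mixed))
```

On balanced pairs, the always-satisfiable model still gets half of the formulas right under per-formula accuracy, while its ADR is zero because it never distinguishes the two members of a pair.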
Artifacts Available
Article: fse26mainb-p2731-p
Reducing Coverage-Equivalent Inputs in Grammar-Based Fuzzing by Avoiding Recurrent Rule Sequences
Jaehan Yoon,
Yunji Seo,
Hakjoo Oh, and
Sooyoung Cha
(Sungkyunkwan University, Republic of Korea; Korea University, Republic of Korea)
We present RSFuzz, a new technique to enhance grammar-based fuzzing by reducing the generation of coverage-equivalent inputs during testing. Grammar-based fuzzers apply production rules from a given grammar (e.g., forming a derivation tree) to generate well-structured inputs for the target program. However, a key limitation is that many existing fuzzers still produce a large number of “coverage-equivalent” inputs (those that revisit already explored program paths), thereby restricting their ability to uncover new bugs and improve coverage. To address this issue, RSFuzz automatically identifies recurrent sequences of production rules that lead to coverage-equivalent inputs and prevents their reuse during fuzzing. A key challenge lies in the large number of coverage-equivalent input groups, each with many inputs and corresponding derivation trees, making it difficult to identify underlying recurrent sequences. RSFuzz tackles this challenge with a customized algorithm that iteratively groups coverage-equivalent inputs, selects promising groups, and extracts recurrent sequences by abstracting derivation trees based on accumulated data while running any grammar-based fuzzer. We integrated RSFuzz with existing random, probabilistic, and grammar-coverage based fuzzers and evaluated it on 12 real-world programs using JavaScript, JSON, CSV, and Markdown input formats. Experimental results show that incorporating RSFuzz into the three fuzzers detects 121, 46, and 17 additional crashes with distinct stack traces, increases line coverage by 6.0%, 4.8%, and 3.0%, and reduces duplicate-coverage input generation by 23.3%, 28.7%, and 14.9%, respectively, compared to their performance without RSFuzz.
Article: fse26mainb-p2746-p
Natural Language-Focused Software Engineering via Code-Documentation Equivalence
Aryaz Eghbali,
Zhongxin Liu, and
Michael Pradel
(CISPA Helmholtz Center for Information Security, Germany; Zhejiang University, China)
Source code documentation is an integral part of software development and maintenance, as it helps in understanding the code and facilitates communication among developers. However, existing documentation is often incomplete, outdated, or inaccurate, which can lead to misunderstandings and errors. In the era of large language models (LLMs), which are being extensively used for software engineering tasks, the quality of documentation becomes even more critical, as documentation provides important context for the models. In this paper, we introduce the notion of documentation-to-code equivalence, a novel property that captures whether documentation accurately and completely describes the code it documents. We present a novel approach, called Documentary, to automatically generate equivalent documentation for a given code snippet. Our evaluation shows that Documentary can generate equivalent documentation for 53.4% of the evaluated function-level code snippets. To show the benefits of documentation-to-code equivalence, we describe and evaluate two software engineering tasks: code understanding and code editing. Our results show that documentation-to-code equivalence allows an LLM to predict the output of a function with 12.8–24.5% higher accuracy, when compared to human-written documentation and documentation generated by a baseline approach. Furthermore, human developers consider documentation generated by Documentary to be more useful for understanding and editing code than the original human-written documentation.
Article: fse26mainb-p2869-p
RAT: Retrieval-Augmented Testing of Certificate Revocation List Parsers in TLS Implementations
Chu Chen,
Qianxin Cheng,
Pinghong Ren,
Hairong Yu,
Cong Tian,
Zhenhua Duan,
Xu Lu,
Bin Yu,
WenSheng Wang, and
Jin Liu
(Qufu Normal University, China; Xidian University, China; Xi'an University of Technology, China)
Transport Layer Security (TLS) implementations form the backbone of countless software systems, spanning web browsers, email clients, cloud services, and Internet of Things software, by enabling secure authentication and encrypted communication. However, their reliability hinges on the integrity of components like the Certificate Revocation List (CRL), which revokes compromised certificates to prevent attackers from exploiting expired or unauthorized credentials. Despite the CRL's critical role, CRL parsers, which decode CRL data for validation, remain overlooked in security research, exposing TLS-dependent software to potential threats. To address this gap, we introduce Retrieval-Augmented Testing (RAT), a framework powered by Large Language Models (LLMs), to systematically evaluate CRL parsers in mainstream TLS implementations such as OpenSSL, GnuTLS, and wolfSSL. RAT begins with leveraging an LLM to retrieve historical bug reports and cross-reference them with Request for Comments (RFC) 5280 specifications, generating structured test cases via an Abstract Syntax Notation One-aware mutation engine. These test cases are then fed into CRL parsers, and RAT employs an LLM to normalize their outputs. By analyzing these normalized results, RAT detects discrepancies and uncovers latent risks in CRL parsers. Our work makes the following contributions: (1) Unlike prior work focusing on certificate validation, this is the first study to systematically assess CRL parsers; (2) We propose RAT, a novel testing framework that leverages an LLM to integrate insights from retrieved bug reports and RFC 5280, enabling automated test-case generation; and (3) We have implemented an open-source prototype of RAT and experiments uncovered 23 new bugs, features, enhancements, fixes in commits, and x509s, demonstrating RAT's potential to strengthen the reliability and security of CRL parsers in TLS implementations.
Article: fse26mainb-p2897-p
MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis
Congying Xu,
Hengcheng Zhu,
Songqiang Chen,
Jiarong Wu,
Valerio Terragni, and
Shing-Chi Cheung
(Hong Kong University of Science and Technology, China; Guangzhou HKUST Fok Ying Tung Research Institute, China; University of Auckland, New Zealand)
Metamorphic testing (MT) is a widely recognized technique for alleviating the oracle problem in software testing. However, its adoption is hindered by the difficulty of constructing effective metamorphic relations (MRs), which often require domain-specific or hard-to-obtain knowledge. In this work, we propose a novel approach that leverages the functional coupling between methods, which is readily available in source code, to automatically construct MRs and generate metamorphic test cases (MTCs). Our technique, MR-Coupler, identifies functionally coupled method pairs, employs large language models to generate candidate MTCs, and validates them through test amplification and mutation analysis. In particular, we leverage three functional coupling features to avoid expensive enumeration of possible method pairs, and a novel validation mechanism to reduce false alarms. Our evaluation of MR-Coupler on 100 human-written MTCs and 50 real-world bugs shows that it generates valid MTCs for over 90% of tasks, improves valid MTC generation by 64.90%, and reduces false alarms by 36.56% compared to baselines. Furthermore, the MTCs generated by MR-Coupler detect 44% of the real bugs. Our results highlight the effectiveness of leveraging functional coupling for automated MR construction and the potential of MR-Coupler to facilitate the adoption of MT in practice. We also released the tool and experimental data to support future research.
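As a concrete illustration of a metamorphic relation derived from two functionally coupled methods, the following minimal sketch checks a push/pop round trip. The `Stack` class, the hand-picked method pair, and the relation are hypothetical; the paper's approach selects pairs via coupling features and generates candidates with LLMs, which is not shown here.

```python
class Stack:
    """A tiny hypothetical class whose push and pop methods are coupled:
    both read and write the same internal state."""

    def __init__(self, items=()):
        self._items = list(items)

    def push(self, value):
        self._items.append(value)

    def pop(self):
        return self._items.pop()

    def snapshot(self):
        return tuple(self._items)

def mr_push_pop_roundtrip(initial, value):
    """Metamorphic relation: pushing a value and then popping returns that
    value and leaves the stack as it was. No expected output is needed."""
    stack = Stack(initial)
    before = stack.snapshot()
    stack.push(value)
    popped = stack.pop()
    return popped == value and stack.snapshot() == before

# The relation can be checked on arbitrary inputs without a hand-written oracle.
results = [mr_push_pop_roundtrip(init, v) for init, v in [([1, 2], 3), ([], "x")]]
print(results)
```

Because the relation constrains how the two methods behave together rather than what either returns in isolation, it sidesteps the oracle problem for the inputs it is run on.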
Article: fse26mainb-p2981-p
TUSR: A Test Unit-Based Framework for Repairing Obsolete GUI Test Scripts
Shaoheng Cao,
Minxue Pan, and
Xuandong Li
(Nanjing University, China)
Graphical User Interface (GUI) testing is a primary technique for testing mobile applications. Among existing testing methods, script-based testing is widely adopted because test scripts can be replayed across devices and versions. However, as mobile apps evolve, frequent changes to GUI appearance and interaction logic often render scripts written for earlier versions obsolete. Existing repair approaches typically follow a stepwise framework, repairing scripts action by action. While effective for minor GUI appearance changes, they struggle when interaction logic is modified.
In this paper, we present TUSR, a test unit–based repair framework that fundamentally departs from the stepwise paradigm. TUSR introduces the concept of a "Test Unit"—a contiguous sequence of test actions serving a single testing intention—and repairs scripts at the unit level. It first splits scripts into test units using a multi-modal LLM with a rule-based correction mechanism to ensure consistency, then conducts dynamic repair guided by a Chain-of-Thought prompt enhanced with reflective memory. Experiments on 20 real-world Android applications, covering 262 test scripts and 3,485 actions, demonstrate that TUSR significantly outperforms state-of-the-art black-box and white-box repair approaches.
Article: fse26mainb-p3051-p
GUIMigrator: Semantics-Preserving Transpilation from Android XML to Compose and SwiftUI
Yi Gao,
Xing Hu,
Xiaohu Yang, and
Xin Xia
(Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China)
Constructing user interfaces (UIs) is one of the most resource-intensive tasks in mobile development, often consuming more than half of overall effort.
Although declarative frameworks such as Jetpack Compose (Android) and SwiftUI (iOS) have become mainstream, the majority of existing Android apps still rely on legacy XML-based layouts.
Migrating these UIs to declarative paradigms is essential for maintainability and cross-platform reuse, but manual migration is costly, error-prone, and difficult to scale.
We present GUIMigrator, a semantics-preserving framework that automates the migration of Android XML-based UIs to Jetpack Compose and SwiftUI.
We design the Semantic UI Transpiler (SUT), which abstracts layout structures and resource semantics from legacy XML and systematically re-expresses them using the component abstractions and idioms of modern declarative frameworks.
This design ensures that migrated UIs preserve both visual fidelity and functional equivalence, while generating idiomatic, compilable code that maintains cross-platform consistency with minimal manual intervention.
By separating semantic interpretation from platform-specific realization, GUIMigrator provides a deterministic yet extensible basis for cross-platform modernization, avoiding the unpredictability of purely generative approaches.
We evaluate GUIMigrator on 31 open-source applications across ten domains.
Results show that GUIMigrator achieves high migration completeness and strong visual similarity (81.9% SSIM on Jetpack Compose and 78.2% on SwiftUI on average), while maintaining substantially higher project-wide semantic coherence (PSC) than modern LLM baselines.
In addition, GUIMigrator reduces manual development effort by over 90%.
These findings demonstrate that GUIMigrator provides an effective and practical solution for reusing Android UIs across modern declarative frameworks, advancing automated cross-platform UI development.
Article: fse26mainb-p3073-p
Generalizing Test Cases for Comprehensive Test Scenario Coverage
Binhang Qi,
Yun Lin,
Xinyi Weng,
Chenyan Liu,
Hailong Sun,
Gordon Fraser, and
Jin Song Dong
(National University of Singapore, Singapore; Shanghai Jiao Tong University, China; Beihang University, China; University of Passau, Germany)
Test cases are essential for software development and maintenance. In practice, developers derive multiple test cases from an implicit pattern based on their understanding of requirements and inference of diverse test scenarios, each validating a specific behavior of the focal method. However, producing comprehensive tests is time-consuming and error-prone: many important tests that should have accompanied the initial test are added only after a significant delay, sometimes only after bugs are triggered.
Existing automated test generation techniques largely focus on code coverage. Yet in real projects, practical tests are seldom driven by code coverage alone, since test scenarios do not necessarily align with control-flow branches. Instead, test scenarios originate from requirements, which are often undocumented and implicitly embedded in a project's design and implementation. However, developer-written tests are frequently treated as executable specifications; thus, even a single initial test that reflects the developer's intent can reveal the underlying requirement and the diverse scenarios that should be validated.
In this work, we propose TestGeneralizer, a framework for generalizing test cases to comprehensively cover test scenarios. TestGeneralizer orchestrates three stages: (1) enhancing the understanding of the requirement and scenario behind the focal method and initial test; (2) generating a test scenario template and crystallizing it into various test scenario instances; and (3) generating and refining executable test cases from these instances. To ensure accuracy and completeness, TestGeneralizer combines rule-based prompts, automatically optimized via a prompt auto-tuning technique, with crucial project knowledge retrieved through program analysis. We evaluate TestGeneralizer against three state-of-the-art baselines (EvoSuite, gpt-o4-mini, and ChatTester) on 12 open-source Java projects, covering 506 multi-test focal methods and 1,637 test scenarios. TestGeneralizer achieves significant improvements: +57.67% and +59.62% over EvoSuite, +37.44% and +32.82% over gpt-o4-mini, and +31.66% and +23.08% over ChatTester, in mutation-based and LLM-assessed scenario coverage, respectively. In a field study, we submitted 27 generalized tests overlooked by developers; 16 were accepted and merged into official repositories, demonstrating the practical usefulness of TestGeneralizer.
Article: fse26mainb-p3353-p
Uncovering Similar but Different Packages in PyPI and Potential Security Threats
Sunha Park,
Soojin Han, and
Seunghoon Woo
(Korea University, Republic of Korea; Dongduk Women's University, Republic of Korea)
We present a large-scale, in-depth study of package replication in PyPI. As a vital platform, PyPI streamlines Python package distribution for developers. However, beyond small-scale code cloning, we observe that many replicated packages exist on PyPI, which duplicate most of the codebase from existing packages. Such replication not only confuses developers but also propagates known vulnerabilities and enables the creation of new malicious packages. To address this issue, we comprehensively examine the characteristics and potential threats of replicated packages. Using one-third of the entire PyPI repository (200K packages), we investigate replication from three perspectives: replication of popular packages, vulnerable packages, and malicious packages. Our experiments reveal three critical findings about package replication in PyPI: (1) by identifying 1,361 replicated packages of the top 3K popular projects, we show that replication frequently redistributes substantial portions of existing packages under different maintainers; (2) by uncovering 256 previously unknown replicated vulnerable packages, we demonstrate that replication creates vulnerability blind spots that current detection tools rarely catch; (3) by analyzing 3,883 known malicious packages, we find that 186 (4.79%) replicated popular ones, and this pattern further led us to identify seven previously unknown replicated malicious packages, highlighting the role of replication as an attack vector for malware distribution through minor modifications and code injection.
Article: fse26mainb-p3575-p
Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair
Hongshu Wang,
Xinyue Zuo,
Yuhan Sun,
Qin Li,
Yamine Ait Ameur, and
Jin Song Dong
(National University of Singapore, Singapore; East China Normal University, China; IRIT - National Polytechnic Institute of Toulouse, France)
Building software that is correct by construction is a long-standing goal in software engineering, as it ensures reliability during design and development rather than after deployment. Formal methods realize this vision by enabling the expression of system behavior and requirements in mathematics, thereby guaranteeing correctness through formal verification, including theorem proving and model checking. However, the steep learning curve and demand for mathematical expertise hinder the widespread adoption of formal methods. Large language models (LLMs) have recently shown promise in bridging this gap through autoformalization. Yet existing LLM-based approaches are largely limited to isolated tasks, such as theorem proving without formalization or model synthesis with insufficient verification. While valuable, these efforts do not fully exploit the potential of a more comprehensive framework in which models and proofs evolve together, a process that closely reflects real-world development practice. To address this gap, we propose Event-B Agent, a novel framework inspired by the interleaved nature of software design. Given natural language requirements, Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency. These results suggest that Event-B Agent is a promising step toward correct-by-construction formal model synthesis and repair.
Article: fse26mainb-p3803-p
Corrections