Proceedings of the ACM on Software Engineering, Volume 3, Number FSE
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-Breaking
Yifan Huang,
Xiaojun Jia,
Wenbo Guo,
Yuqiang Sun,
Yihao Huang,
Chong Wang, and
Yang Liu
(Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. This democratization of software development has significantly lowered the barriers to entry for complex programming tasks. However, the same accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software, including malware, ransomware, and other security threats. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target and insufficient technical expertise to evaluate whether generated outputs align with specified malicious objectives. To address this gap, we propose SPELL, a comprehensive testing framework for LLM developers and security teams, specifically designed to evaluate weaknesses in the security alignment of malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12%, respectively, across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%, including four instances flagged as extremely dangerous by all detection tools. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.
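The time-division balance between exploration and exploitation that the abstract describes could be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation; the function name, the linear schedule, and the success-count bookkeeping are all assumptions.

```python
import random

def select_sentences(pool, success_counts, step, total_steps, k=3, rng=None):
    """Pick k prompt sentences from a prior-knowledge pool.

    Early steps favor exploration (random sentences); as `step`
    approaches `total_steps`, selection shifts toward exploiting the
    sentences with the best observed success counts.
    """
    rng = rng or random.Random(0)
    explore_ratio = 1.0 - step / total_steps  # time-division schedule
    ranked = sorted(pool, key=lambda s: success_counts.get(s, 0), reverse=True)
    chosen = []
    for _ in range(k):
        if rng.random() < explore_ratio:
            chosen.append(rng.choice(pool))                   # explore a novel sentence
        else:
            chosen.append(ranked[len(chosen) % len(ranked)])  # exploit top sentences
    return chosen
```

At the final step the schedule is pure exploitation, so the best-scoring sentence always leads the prompt; at step zero every pick is random.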
Article: fse26maina-p10-p
Still Manual? Automated Linter Configuration via DSL-Based LLM Compilation of Coding Standards
Zejun Zhang,
Yixin Gan,
Zhenchang Xing,
Tian Zhang,
Yi Li,
Qinghua Lu,
Sherry (Xiwei) Xu, and
Liming Zhu
(Nanyang Technological University, Singapore; Nanjing University, China; CSIRO’s Data61, Australia)
Article: fse26maina-p27-p
GadgetHunter: Region-Based Neuro-symbolic Detection of Java Deserialization Vulnerabilities
Kaixuan Li,
Jian Zhang,
Chong Wang,
Sen Chen,
Zong Cao,
Min Zhang, and
Yang Liu
(Nanyang Technological University, Singapore; Beihang University, China; Nankai University, China; Imperial Global Singapore of Imperial College London, Singapore; East China Normal University, China)
Java deserialization vulnerabilities (JDVs) enable attackers to execute arbitrary code by crafting malicious serialized objects that trigger sequences of method calls (gadget chains) leading to dangerous operations. Existing detection approaches face a fundamental trade-off: static analysis achieves scalability but suffers from high false positives due to infeasible paths and imprecision with dynamic features like reflection; dynamic validation reduces false positives but incurs prohibitive costs and fails to explore deep exploitation chains.
We present GadgetHunter, a neuro-symbolic JDV detector that combines scalable static analysis with
targeted LLM reasoning and JDV exploitation-oriented constraint solving. Our approach partitions gadget chains into regions based on analyzability: statically resolvable segments are processed via interprocedural taint analysis, while dynamic boundaries are delegated to LLMs for semantic validation. We then extract critical constraints from each gadget and compose them into SMT formulas to determine chain feasibility through satisfiability solving. Evaluation on the ysoserial benchmark demonstrates that GadgetHunter reduces false negatives by up to 32% and false positives by 12-85% compared to state-of-the-art tools, while discovering 197 previously unknown gadget chains and rediscovering 4 recent CVEs. Our results show that combining symbolic reasoning with semantic understanding achieves both precision and practical impact in vulnerability detection.
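The region partitioning that GadgetHunter performs on gadget chains could be sketched as below. This is a toy illustration under assumptions: the marker set and call-site strings are hypothetical, and real analyzability decisions come from static analysis, not substring matching.

```python
# Reflection-related call sites that static taint analysis cannot resolve;
# in this sketch they mark "dynamic" boundaries delegated to an LLM.
DYNAMIC_MARKERS = {"Method.invoke", "Class.forName", "Proxy"}

def partition_chain(chain):
    """Partition a list of call sites into analyzability regions.

    Contiguous statically resolvable calls form 'static' regions
    (handled by interprocedural taint analysis); reflective calls become
    single-call 'dynamic' regions (delegated to LLM semantic validation).
    """
    regions, current = [], []
    for call in chain:
        if any(marker in call for marker in DYNAMIC_MARKERS):
            if current:
                regions.append(("static", current))
                current = []
            regions.append(("dynamic", [call]))
        else:
            current.append(call)
    if current:
        regions.append(("static", current))
    return regions
```

A ysoserial-style chain ending in `Runtime.exec` through `Method.invoke` would thus yield a static prefix, one dynamic boundary, and a static suffix.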
Article: fse26maina-p42-p
Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair
Zaoyu Chen,
Haoran Qin,
Nuo Chen,
Xiangyu Zhao,
Lei Xue,
Xiapu Luo, and
Xiao-Ming Wu
(Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
Article: fse26maina-p90-p
MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Shuai Liang,
Pengfei Chen,
Bozhe Tian,
Gou Tan,
Maohong Xu,
Youjun Qu,
Yahui Zhao,
Yiduo Shang, and
Chongkang Tan
(Sun Yat-sen University, China; China Unicom Software Research Institute, China; Individual Researcher, China)
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
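The online step of MetaRCA, instantiating the metadata-level MCG into a localized graph and pruning weakly supported links, could be sketched as follows. A minimal illustration with hypothetical names; the real system derives edge weights from real-time observability data rather than a caller-supplied function.

```python
def instantiate_mcg(meta_edges, instances, edge_weight, threshold=0.3):
    """Instantiate a Meta Causal Graph into a localized causal graph.

    meta_edges:  causal links at the metadata level, e.g. ("pod_cpu", "svc_latency").
    instances:   concrete entities per metadata type in the current fault context.
    edge_weight: callable scoring a concrete edge from real-time data
                 (a stand-in for MetaRCA's weighting step).
    Edges scoring below `threshold` are pruned.
    """
    graph = []
    for src_type, dst_type in meta_edges:
        for src in instances.get(src_type, []):
            for dst in instances.get(dst_type, []):
                weight = edge_weight(src, dst)
                if weight >= threshold:
                    graph.append((src, dst, weight))
    return graph
```

Because the MCG is defined over metadata rather than a fixed topology, the same meta edge can be re-instantiated for whatever pods and services exist when a fault occurs.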
Article: fse26maina-p149-p
Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs during Code Adaptation
Tanghaoran Zhang,
Xinjun Mao,
Shangwen Wang,
Yuxin Zhao,
Yao Lu,
Zezhou Tang,
Wenyu Xu,
Longfei Sun,
Changrong Xie,
Kang Yang, and
Yue Yu
(National University of Defense Technology, China; Pengcheng Laboratory, China)
Article: fse26maina-p224-p
Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation
Tianyi Zhang,
Shidong Pan,
Zejun Zhang,
Zhenchang Xing, and
Xiaoyu Sun
(Australian National University, Australia; New York University, USA; Columbia University, USA; Nanyang Technological University, Singapore; CSIRO’s Data61, Australia)
Infrastructure-as-Code (IaC) generation holds significant promise for automating the provisioning of cloud infrastructure. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving only a 20.8–30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark, consisting of 153 real-world scenarios across 58 unique services. We also propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen can make 54.6–91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors can further boost performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates with respect to user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
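The staged generate-validate-refine loop that IaCGen runs could be sketched as below. This is a schematic, not the authors' code: `generate` stands in for an LLM call, the stage checks are placeholders for real format verification, syntax checking, and live deployment.

```python
def iacgen_loop(generate, stages, max_iters=10):
    """Iterative feedback loop over DevOps-style validation stages.

    generate(feedback) returns a candidate template; each stage is a
    (name, check) pair where check(template) returns an error string or
    None. On any failure, the error is fed back into the next attempt.
    """
    feedback = None
    for attempt in range(1, max_iters + 1):
        template = generate(feedback)
        for name, check in stages:
            error = check(template)
            if error:
                feedback = f"{name} failed: {error}"
                break
        else:
            return template, attempt  # all stages passed: deployable
    return None, max_iters
```

The key design point the abstract emphasizes is that the final stage is a live deployment, so "passing" means the template actually provisioned, not merely that it parsed.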
Article: fse26maina-p296-p
Validating LLM-Generated SQL Queries through Metamorphic Prompting
Li Lin,
Qinglin Zhu,
Jintai Hong,
Chong Wang,
Yang Liu, and
Rongxin Wu
(Xiamen University, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) can translate natural language (NL) into SQL, enabling non-experts to query databases via conversational interfaces. However, the generated SQL often contains intent-violating hallucinations—queries that are syntactically valid and executable, yet semantically misaligned with the user’s question. These failures are especially risky in real-world settings where users cannot verify the correctness. In this paper, we propose MRSQLGen, a framework for detecting intent-violating hallucinations, built on the metamorphic prompting paradigm. MRSQLGen rewrites the input prompt using task-specific transformation rules derived from a hallucination taxonomy, and validates the generated SQL by checking behavioral consistency across multiple executions. Each transformation is associated with a metamorphic relationship (MR) that defines the expected relation between results; discrepancies are aggregated through a majority-vote strategy to robustly flag hallucinations without ground-truth SQL. We evaluate MRSQLGen on two benchmarks (Spider and Bird) using five representative LLMs, including GPT-4o. Experimental results demonstrate that MRSQLGen consistently outperforms state-of-the-art hallucination detection techniques, achieving higher precision and recall in detecting hallucinated SQL queries.
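The majority-vote aggregation over metamorphic relations that MRSQLGen uses could be sketched as follows. A simplified stand-in: the real framework pairs each transformation with its own MR, whereas this toy applies one relation to every pair.

```python
from collections import Counter

def majority_vote_flag(results, mr_holds):
    """Flag a hallucination via metamorphic-relation majority vote.

    results:  list of (original_result, transformed_result) execution
              pairs, one per metamorphic prompt transformation.
    mr_holds: predicate returning True when a pair satisfies the
              transformation's expected relation.
    The query is flagged as hallucinated when most pairs violate
    their MR, requiring no ground-truth SQL.
    """
    votes = Counter(
        "consistent" if mr_holds(a, b) else "violated" for a, b in results
    )
    return votes["violated"] > votes["consistent"]
```

With an equality relation (e.g., a filter-reordering transformation should not change the result set), two violations out of three flag the query.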
Article: fse26maina-p328-p
Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models
Shangbo Yun,
Xiaodong Gu,
Jianghong Huang, and
Beijun Shen
(Shanghai Jiao Tong University, China)
The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages and how such relationships can be utilized to optimize both training and inference. In this work, we investigate two fundamental questions: (1) What are the deep linguistic relationships among programming languages? and (2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of programming languages. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definition, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples across multiple languages. By embedding these semantically parallel code snippets from 19 languages, we construct a similarity matrix and perform hierarchical clustering to uncover inherent language relationships. Our analysis reveals clear hierarchical structures among programming languages. Closely related languages form well-defined clusters (e.g., C, C++, Java, and Swift group together), while Go emerges as a “lingua franca” with the highest cross-language similarity. Building on the uncovered language families, we propose three strategies to enhance multilingual LLM training: transfer learning across linguistically related languages, linguistic proximity-guided curriculum learning, and centroid-based intermediary code translation. Experiments on four code intelligence tasks demonstrate that our methods significantly improve multilingual LLM performance. This work offers a universal perspective on programming languages and advances more effective strategies for multilingual code LLM training.
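The family-discovery step, embedding similarity followed by clustering, could be sketched as below. This simplification uses toy two-dimensional vectors and flat threshold-based grouping via union-find, whereas the paper uses learned embeddings and full hierarchical clustering.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def language_families(embeddings, threshold=0.8):
    """Group languages whose embedding similarity exceeds a threshold.

    embeddings: {language: feature vector} derived from semantically
    parallel code samples. Returns a sorted list of family member lists.
    """
    langs = list(embeddings)
    parent = {lang: lang for lang in langs}

    def find(x):  # union-find root lookup
        while parent[x] != x:
            x = parent[x]
        return x

    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)  # merge the two families
    families = {}
    for lang in langs:
        families.setdefault(find(lang), []).append(lang)
    return sorted(families.values())
```

With vectors placing C and C++ near each other and Python apart, the two C-family languages merge into one cluster while Python stays alone.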
Article: fse26maina-p343-p
ProofFusion: Improving Neural Theorem Proving via Adaptive Retrieval-Augmented Reasoning
Manqing Zhang,
Yunwei Dong,
Lingru Zhou,
Bingxu Xiao, and
Yepang Liu
(Northwestern Polytechnical University, China; Southern University of Science and Technology, China)
Interactive theorem proving (ITP) is a powerful approach to ensuring the correctness of complex software systems. However, it often requires substantial manual effort, which makes it costly to use in practice. Recently, neural-network-based approaches have shown promise in automatically generating proof tactics. Nevertheless, existing methods suffer from a long-tailed distribution in tactic usage within the training data. A few frequent tactics dominate the probability distribution, while many rare yet crucial ones are consistently suppressed in the model’s candidate ranking. This distributional bias can cause potentially provable goals to be prematurely abandoned during proof search. In addition, the decision-making process of neural networks when generating tactics lacks explicit reasoning traces, making it difficult for humans to explain or verify the underlying logic. To address these limitations, we propose ProofFusion, an adaptive retrieval-augmented reasoning framework that improves the proving capability of neural theorem provers without requiring retraining. Our key insight is inspired by the way human provers tackle a new theorem by consulting similar previously proven theorems to guide their own reasoning. Specifically, we develop a proof semantic-aware retriever that searches a knowledge base for semantically similar historical proof goals together with their tactics, producing a traceable set of reference decisions. We then employ a dual-track reranking fusion mechanism to integrate both the original predictions of the neural model and the retrieved reference tactics. Furthermore, to mitigate potential noise introduced by retrieval, we design a capability-adaptive retrieval mechanism that dynamically determines when retrieval should be applied. We conduct a systematic evaluation on 10,782 theorems from 26 Coq projects in a real ITP environment.
Experimental results show that ProofFusion increases the number of theorems proved by four state-of-the-art neural theorem provers by an average of 6.89%, and additionally proves 17.50% of previously unprovable theorems. In addition, it substantially improves the explainability of proof steps, achieving an average explainable proof goal proportion of 82.1% across the four provers. Together, these results demonstrate that ProofFusion is a practical and effective complement to existing neural theorem proving systems, enhancing both performance and explainability.
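The dual-track reranking fusion could be sketched as follows. The linear interpolation and rank-decayed retrieval bonus here are illustrative assumptions, not the paper's actual fusion rule; the point is that retrieved tactics can lift rare but crucial candidates above the model's long-tail suppression.

```python
def fuse_rankings(model_scores, retrieved_tactics, alpha=0.7):
    """Fuse neural-model scores with retrieved reference tactics.

    model_scores:      {tactic: probability} from the neural prover.
    retrieved_tactics: tactics from similar historical proofs, best-first;
                       each earns a retrieval bonus decaying with rank.
    alpha balances the two tracks. Returns tactics best-first.
    """
    fused = {t: alpha * p for t, p in model_scores.items()}
    for rank, tactic in enumerate(retrieved_tactics):
        bonus = (1 - alpha) / (rank + 1)  # decaying retrieval bonus
        fused[tactic] = fused.get(tactic, 0.0) + bonus
    return sorted(fused, key=fused.get, reverse=True)
```

In the example below, the rare tactic `induction` (model probability 0.1) is promoted past `lia` because a similar historical proof used it.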
Article: fse26maina-p354-p
SWE Data Construction, Automatically!
Lianghong Guo,
Yanlin Wang,
Caihua Li,
Wei Tao,
Pengyu Yang,
Jiachi Chen,
Haoyu Song,
Duyu Tang, and
Zibin Zheng
(Sun Yat-sen University, China; Independent Researcher, China; Huawei Technologies, China)
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the existing GitHub issue resolution data construction pipeline is challenging and labor-intensive. We identify three key limitations in existing pipelines: (1) test patches collected often omit binary file changes; (2) the manual construction of evaluation environments is labor-intensive; and (3) the fail2pass validation phase requires manual inspection of test logs and writing custom parsing code to extract test status from logs. In this paper, we propose SWE-Factory, a fully automated issue resolution data construction pipeline, to resolve these limitations. First, our pipeline automatically recovers missing binary test files and ensures the correctness of test patches. Second, we introduce SWE-Builder, an LLM-based agentic system that automates evaluation environment construction. Third, we introduce a standardized, exit-code-based log parsing method to automatically extract test status,
enabling a fully automated fail2pass validation. Experiments on 671 real-world GitHub issues across four programming languages show that our method can effectively construct valid evaluation environments for GitHub issues at a reasonable cost. For example, with GPT-4.1 mini, our SWE-Builder constructs 337 valid task instances out of 671 issues, at $0.047 per instance. Our ablation study further shows the effectiveness of different components of SWE-Builder. We also demonstrate through manual inspection that our exit-code-based fail2pass validation method is highly accurate, achieving an F1 score of 0.99. Additionally, we conduct an exploratory experiment to investigate whether we can use SWE-Factory to enhance models’ software engineering ability. After training five models on 2,809 Python task instances collected by our method, all models show improved software engineering ability. For example, the resolve rate of a trained Qwen2.5-Coder-14B-Instruct on SWE-bench Verified increases from 5.8% to 21.0%. We hope our method can accelerate the construction of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation.
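The exit-code-based fail2pass check could be sketched as below. A minimal illustration with hypothetical names: `run_suite` stands in for executing the instance's test suite inside its constructed environment, with and without the gold patch.

```python
def validate_instances(instances, run_suite):
    """Exit-code-based fail2pass validation.

    run_suite(instance, with_patch) returns the test suite's exit code.
    An instance is valid when the suite fails (non-zero exit) without
    the gold patch and passes (zero exit) with it, so no per-framework
    log parsing is needed.
    """
    valid = []
    for inst in instances:
        before = run_suite(inst, with_patch=False)
        after = run_suite(inst, with_patch=True)
        if before != 0 and after == 0:
            valid.append(inst)
    return valid
```

Instances that pass even without the patch (the test does not exercise the bug) or still fail with it (a flaky or broken environment) are both filtered out.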
Article: fse26maina-p451-p
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Haocheng Huang,
Yuchen Chen,
Weisong Sun,
Peizhuo Lv,
Yuan Xiao,
Chunrong Fang,
Yang Liu, and
Xiaofang Zhang
(Soochow University, China; Nanjing University, China; Nanyang Technological University, Singapore)
Article: fse26maina-p509-p
SmartIFSyn: Automated Information Flow Security Policy Synthesis for Smart Contracts
Yinghao Wu,
Miaomiao Zhang,
Fu Song, and
John Baugh
(Tongji University, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; North Carolina State University, USA)
Smart contracts have achieved significant success; however, their security remains a long-standing challenge. The immutability and transparency of smart contracts require establishing a strong mechanism to prevent privacy leakage and tampering with trusted data. Apart from traditional logic and code-level vulnerabilities arising from insufficient control over contract variables and function parameters, smart contracts may store private-dependent information in blockchain records, which is a critical type of vulnerability, but one often overlooked in existing security analysis. In this paper, we present an automated approach for synthesizing security policies, named SmartIFSyn, to eliminate information flow vulnerabilities in smart contracts. We formalize the semantics of Solidity, the most widely used smart contract language, and analyze the information flow security of Solidity smart contracts from two perspectives: local-variable security and global-interaction security. We present a type system to guide the elimination of local-variable vulnerabilities by inferring a policy, and resort to constraint solving to synthesize a desired policy in case the type system fails. The policy ensures both local-variable and global-interaction security while remaining maximally aligned with user preference. Furthermore, the policy can subsequently be converted into enforceable specifications. We implement our approach in a tool and evaluate it on 17,160 real-world Ethereum smart contracts. The experimental results demonstrate the efficacy of our approach, e.g., it detected 243 vulnerabilities in 223 real-world Ethereum smart contracts.
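The core information-flow discipline such a policy enforces could be sketched with a toy two-point security lattice. The labels, variable names, and flow representation are illustrative assumptions; SmartIFSyn infers richer policies with a type system and constraint solving.

```python
LATTICE = {"public": 0, "private": 1}  # public ⊑ private

def flow_allowed(src_label, dst_label):
    """Information may only flow upward in the lattice: writing
    private-dependent data into a public sink is rejected."""
    return LATTICE[src_label] <= LATTICE[dst_label]

def check_flows(flows, policy):
    """Check each (source_var, sink_var) assignment against a policy
    mapping variables to labels; returns the violating flows."""
    return [(s, d) for s, d in flows if not flow_allowed(policy[s], policy[d])]
```

Here a private `balance` may flow to the private `owner`, but copying it into a publicly readable `logEntry` (i.e., into on-chain records) is flagged.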
Article: fse26maina-p514-p
Protocol Reverse Engineering via Deep Transfer Learning
Yanyang Zhao,
Zhengxiong Luo,
Wenlong Zhang,
Feifan Wu,
Yuanliang Chen,
Fuchen Ma,
Qi Xu,
Heyuan Shi, and
Yu Jiang
(Tsinghua University, China; National University of Singapore, Singapore; Central South University, China)
Protocol reverse engineering infers the specification of proprietary or poorly documented protocols and serves as the foundation for security analysis such as fuzz testing. While many existing techniques achieve this by mining statistical features from network traces, they face increasing challenges due to incomplete field pattern information available in the traces. Although protocol development has accumulated rich prior knowledge about protocol design, this knowledge remains largely untapped in protocol reverse engineering.
This paper introduces TransRE, a protocol reverse engineering tool that leverages prior syntax knowledge from standardized protocols through deep transfer learning to better understand proprietary protocols. TransRE first selects optimal source domains by analyzing inter-domain differences between the existing knowledge base and the target protocol. It then employs a neural network to extract representation features and applies domain adaptation techniques to optimize the syntax transfer model, enabling accurate inference of protocol formats. Our evaluation on 12 widely used protocols shows that TransRE identifies fields with a perfection score of 0.43, which is 1.48×-3.07× the performance achieved by five state-of-the-art methods. Furthermore, to demonstrate practical applicability, we enhanced an existing protocol fuzzer with TransRE for testing proprietary protocols in real-world network cameras and discovered four bugs.
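TransRE's first step, selecting source domains by inter-domain difference, could be sketched with a crude distributional distance. The L1 distance over byte histograms below is a hypothetical stand-in for the paper's actual difference measure; protocol names and messages are illustrative.

```python
def byte_histogram(messages):
    """Normalized byte-value distribution over a set of protocol messages."""
    counts = [0] * 256
    total = 0
    for msg in messages:
        for b in msg:
            counts[b] += 1
            total += 1
    return [c / total for c in counts]

def select_source_domain(target_msgs, candidates):
    """Pick the standardized protocol whose traffic is distributionally
    closest to the target protocol's traces (L1 histogram distance)."""
    target_hist = byte_histogram(target_msgs)

    def distance(name):
        hist = byte_histogram(candidates[name])
        return sum(abs(x - y) for x, y in zip(target_hist, hist))

    return min(candidates, key=distance)
```

A text-like proprietary protocol would thus borrow syntax knowledge from HTTP rather than from a binary protocol such as Modbus.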
Article: fse26maina-p520-p
Flash: Query-Efficient Black-Box Static Malware Evasion through Transferable GAN-Guided Modification Sequences
Anyuan Sang,
Li Yang,
Lu Zhou,
Junbo Jia, and
Huipeng Yang
(Xidian University, China)
Machine learning (ML)–based static malware detectors are widely deployed for Portable Executable (PE) files due to their scalability and efficiency, yet they remain vulnerable to carefully crafted adversarial perturbations. Existing black-box evasion methods either rely on transfer attacks, which break down when surrogate and target decision boundaries diverge, or on query-driven searches, which require impractically many queries. We present Flash, a two-phase adversarial framework tailored for static PE malware detection that integrates the strengths of both approaches. In the first phase, a generative adversarial network is trained against heterogeneous surrogate detectors to generate function-preserving PE modifications with inherent evasiveness. In the second phase, an evolutionary optimizer refines these sequences directly against the target model with a dual-objective fitness that balances evasion success and minimal perturbation cost. Experiments on 12,039 VirusShare PE files and six state-of-the-art static detectors demonstrate that Flash reduces query counts by 86% while maintaining bypass rates above 95.8%. Furthermore, adversarial training with Flash-generated samples reduces attack success rates by 82.4%, highlighting Flash’s utility for both exposing vulnerabilities and strengthening the robustness of static PE malware detectors.
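The dual-objective fitness driving Flash's evolutionary phase could be sketched as follows. The weighting and normalization here are illustrative assumptions, not the paper's exact formula.

```python
def fitness(detect_prob, n_mods, max_mods, w=0.8):
    """Dual-objective fitness for a candidate modification sequence.

    Rewards pushing the target detector's maliciousness probability
    down (evasion success) while penalizing perturbation cost, i.e.
    the number of function-preserving PE modifications applied.
    """
    evasion = 1.0 - detect_prob          # lower detection score is better
    cost = 1.0 - n_mods / max_mods       # fewer modifications is better
    return w * evasion + (1 - w) * cost
```

Under this fitness, two sequences with equal evasiveness are separated by cost, steering the optimizer toward minimal perturbations and thus fewer target-model queries.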
Article: fse26maina-p857-p
TransLibEval: Demystify Large Language Models’ Capability in Third-Party Library-Targeted Code Translation
Pengyu Xue,
Kunwu Zheng,
Zhen Yang,
Yifei Pei,
Linhao Wu,
Jiahui Dong,
Xiapu Luo,
Yan Xiao,
Fei Liu,
Yuxuan Zhang,
Xiran Lyu,
Xianhang Li,
Xuanyu Zhu, and
Chengyi Wang
(Shandong University, China; Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
In recent years, Large Language Models (LLMs) have been widely studied in the code translation field on the method, class, and even repository levels. However, most of these benchmarks are limited in terms of Third-Party Library (TPL) categories and scales, making TPL-related errors hard to expose and hindering the development of targeted solutions. Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative.
To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline over 60%), while diverse strategies demonstrate heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were obscured previously. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
Article: fse26maina-p1029-p
VerilogASTBench: Benchmark Construction of Verilog AST Dataset with Dual-Stage AST Semantic Enhancement Framework
Luping Zhang,
Chao Chen,
Dapeng Yan,
Hui Xu,
Mingsheng Cao,
Jingkuan Song,
Zhikuang Cai, and
Yufeng Guo
(Nanjing University of Posts and Telecommunications, China; University of Electronic Science and Technology of China, China)
With the increasing complexity and scale of integrated circuit (IC) designs, automated circuit design methods are essential for Verilog implementation. Although Large Language Models (LLMs) perform well in general-purpose coding such as C++ and Python, their performance in Verilog is constrained by domain-specific semantic and structural rules as well as the scarcity of high-quality training datasets. To address these issues, this work proposes a semantically enhanced Verilog code parsing and automatic repair framework based on the Abstract Syntax Tree (AST). First, an advanced register-transfer level (RTL) analysis framework is developed to achieve semantic enhancement through static analysis–driven functional role inference and attribute annotation. Second, the enhanced AST information is used to construct structured prompts and generate semantically rich module descriptions via an Application Programming Interface (API) for dataset construction. Finally, an AST-guided automatic Verilog repair framework is designed, which leverages enhanced AST analysis for precise defect localization and intelligent repair through compiler feedback loops. Experimental results indicate that the proposed method successfully repaired 15.24% of defective Verilog code, resulting in a high-quality RTL benchmark dataset containing 318,021 samples. Models fine-tuned on this dataset demonstrate significant performance improvements across three benchmarks, achieving average improvements of 9.23% and 15.43% on Eval-Machine pass@1 and pass@5, 4.23% and 10.08% on Eval-Human pass@1 and pass@5, and 12.64% and 7.45% on RTLLM V1.1 Syntax-VCS and Func.
Article: fse26maina-p1087-p
Feature Slice Matching for Precise Bug Detection
Ke Ma,
Jianjun Huang,
Wei You,
Bin Liang,
Jingzheng Wu,
Yanjun Wu, and
Yuanjun Gong
(Renmin University of China, China; Institute of Software at Chinese Academy of Sciences, China; University of Trento, Italy)
Measuring function similarity to detect bugs is effective, but statements unrelated to the bugs can impede performance through noise interference. Existing works that suppress noise interference do not tackle the tougher job of eliminating the noise in the targets. In this paper, we propose MATUS to mitigate the target noise for precise bug detection based on similarity measurement. Feature slices are extracted from both the buggy query and the targets to represent the semantic features of (potential) bug logic. In particular, MATUS guides the target slicing with prior knowledge from the buggy code, pinpointing the slicing criterion in the targets in an end-to-end way. All feature slices are embedded and compared based on vector similarity. Buggy candidates are then audited to confirm unknown bugs in the targets. Experiments show that MATUS holds advantages in bug detection for real-world projects with acceptable efficiency. In total, MATUS has spotted 31 unknown bugs in the Linux kernel. All of them have been confirmed by the kernel developers, and 11 have been assigned CVEs.
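The embedding-and-compare step over feature slices could be sketched as below. The vectors and function names are toy stand-ins for MATUS's learned slice embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_targets(query_vec, target_slices, top_k=2):
    """Rank target feature slices by embedding similarity to the buggy
    query slice; the top candidates go on to manual audit."""
    ranked = sorted(target_slices,
                    key=lambda name: cosine(query_vec, target_slices[name]),
                    reverse=True)
    return ranked[:top_k]
```

Because both the query and the targets are reduced to bug-logic slices before embedding, unrelated statements in the target functions no longer dilute the similarity signal.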
Article: fse26maina-p1298-p
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin,
Xiaodong Gu, and
Beijun Shen
(Shanghai Jiao Tong University, China)
Code language models have demonstrated strong capabilities across a wide range of code intelligence tasks. While the majority of existing research prioritizes performance improvements on benchmark datasets, few of them have focused on the internal interpretability of models—how specific neurons affect linguistic features such as syntax and semantics, which is critical for model transparency, controllability, and reliability. Although various neuron interpretability techniques have been developed in NLP, directly applying them to source code yields suboptimal results due to the unique characteristics of programming languages, such as their formal structure, hierarchical organization, and executability.
In this work, we empirically investigate the intrinsic mechanisms of code LLMs at the neuron level, aiming to localize both language-specific neurons (i.e., neurons that are selectively responsive to individual programming languages) and concept layers (i.e., feed-forward layers that encode language-agnostic representations of code).
Our study employs two state-of-the-art models, Llama-3.1-8B and Qwen2.5-Coder-32B, across five programming languages: C++, Java, Python, Go, and JavaScript. By analyzing neuron activation patterns in response to multilingual code inputs, we investigate the role of individual neurons and the contribution of different layers during output generation.
Our empirical findings reveal that: (1) code LLMs contain neurons specialized for individual programming languages, alongside a universal subset that supports general-purpose code generation; and (2) lower layers primarily encode language-specific syntactic structures, while middle layers capture semantic abstractions that generalize across languages, manifesting as concept layers.
To demonstrate the practical usability of these findings, we apply them to three downstream tasks: neuron-guided fine-tuning for code generation, clone detection using concept-layer embeddings, and transfer learning guided by concept-layer representations for code summarization. Experimental evaluations show that each strategy consistently improves the performance of multilingual code LLMs.
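Localizing language-specific neurons from activation patterns could be sketched as follows. The margin rule and mean-activation inputs are illustrative assumptions; the study works on real model activations across five languages.

```python
def language_specific_neurons(activations, margin=0.5):
    """Locate neurons selectively responsive to a single language.

    activations: {language: [mean activation per neuron]} collected from
    semantically parallel code inputs. A neuron is tagged as specific to
    the language whose activation exceeds every other language's by at
    least `margin`; neurons without a clear winner are left universal.
    """
    langs = list(activations)
    n_neurons = len(next(iter(activations.values())))
    specific = {}
    for i in range(n_neurons):
        vals = {lang: activations[lang][i] for lang in langs}
        top = max(vals, key=vals.get)
        runner_up = max(v for lang, v in vals.items() if lang != top)
        if vals[top] - runner_up >= margin:
            specific[i] = top
    return specific
```

Neurons that fire comparably for every language (like neuron 2 below) remain untagged, matching the universal subset the findings describe.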
Article: fse26maina-p1339-p
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
Jingyu Zhang,
Fan Wang,
Jacky Keung,
Yihan Liao,
Yan Xiao, and
Lei Ma
(Hong Kong Metropolitan University, Hong Kong; City University of Hong Kong, China; Sun Yat-sen University, Shenzhen, China; University of Tokyo, Japan; University of Alberta, Canada)
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
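A representative uncertainty-based test selection metric of the kind such studies evaluate is DeepGini, which ranks inputs by the Gini impurity of the model's softmax output. A minimal sketch follows; the example probability vectors are fabricated, and this is only one of many metric families (surprise adequacy, coverage-based, and others are not shown).

```python
# Sketch of an uncertainty-based test selection metric (DeepGini-style).
# Inputs whose softmax output is close to uniform are ranked first, on the
# intuition that the model is most likely to be wrong on them.

def deepgini(probs):
    """Gini impurity 1 - sum(p^2): higher means more uncertain."""
    return 1.0 - sum(p * p for p in probs)

def select_tests(softmax_outputs, budget):
    """Return indices of the `budget` most uncertain inputs."""
    ranked = sorted(range(len(softmax_outputs)),
                    key=lambda i: deepgini(softmax_outputs[i]),
                    reverse=True)
    return ranked[:budget]

outputs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform: likely fault-revealing
    [0.70, 0.20, 0.10],
]
print(select_tests(outputs, 2))  # [1, 2]
```

Whether such a ranking also serves performance estimation or retraining guidance, and how it behaves under natural and label shifts, is exactly the kind of question the benchmark above examines.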
Article Search
Article: fse26maina-p1421-p
AccessRefinery: Fast Mining Concise Access Control Intents on Public Cloud
Ning Kang,
Peng Zhang,
Jianyuan Zhang,
Hao Li,
Dan Wang,
Zhenrong Gu,
Weibo Lin,
Shibiao Jiang,
Zhu He,
Xu Du,
Longfei Chen,
Jun Li, and
Xiaohong Guan
(Xi'an Jiaotong University, China; Huawei Cloud, China)
Article Search
Article: fse26maina-p1435-p
Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading
Hanzhen Lu,
Lishui Fan,
Jiachi Chen,
Qiuyuan Chen,
Zhao Wei, and
Zhongxin Liu
(Zhejiang University, China; Sun Yat-sen University, China; Tencent Technology, China)
Line-level code completion aims to complete the current line in real-time as developers type. Low latency is crucial to maintaining a seamless and uninterrupted coding experience, enabling developers to remain in a productive flow. However, existing approaches face a fundamental trade-off: large language models (LLMs) provide high-quality suggestions but require expensive computational resources to ensure acceptable inference latency. In contrast, static-analysis-based methods and small language models respond quickly but often generate suboptimal completions. To fill this gap, our idea is to rely on the small model by default and escalate to the large model only when necessary to achieve latency-accuracy trade-offs. Based on this idea, we propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local small model with a high-performance cloud large model for code completion. Realizing effective model cascading requires answering two non-trivial questions, i.e., when to invoke the large model and how to enable effective collaboration between small and large models. For the first question, we leverage a valuable but easily overlooked signal, i.e., user actions, during code completion to accurately identify failed completions. This deferral decision allows us to invoke the large model only when necessary, reducing both latency and cloud-side computation costs. To enable effective collaboration, MCCom employs a two-stage speculative decoding strategy and an iterative retrieval mechanism that collectively accelerate and improve the quality of completions. Due to the lack of high-quality small models for code completion, we also train a lightweight model with only 121M parameters to implement MCCom. The small model achieves an average of 73.8% of the performance of the state-of-the-art 7B model. We evaluate MCCom on the RepoEval benchmark and a new benchmark, StmtEval, collected from real-world projects.
Experimental results show that our approach not only reduces inference latency by up to 47.9% and cuts down LLM usage by an average of 46.3%, but also improves the exact match rate of the large model by an average of 8.9%.
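The cascade's deferral decision can be sketched with stub models. The two stub models and the user-rejection callback below are illustrative stand-ins; MCCom's actual components (the 121M model, speculative decoding, iterative retrieval) are not reproduced here.

```python
# Sketch of a local-cloud cascade: serve the fast local model by default and
# escalate to the cloud model only when the deferral signal (here, the user
# rejecting the suggestion) indicates the local completion failed.

def small_model(prefix):
    return prefix + " world"      # fast local model, sometimes wrong

def large_model(prefix):
    return prefix + " world!"     # slow cloud model, higher quality

def complete(prefix, user_accepts):
    cloud_calls = 0
    suggestion = small_model(prefix)       # default: local small model
    if not user_accepts(suggestion):       # deferral signal: rejection
        suggestion = large_model(prefix)   # escalate only when necessary
        cloud_calls += 1
    return suggestion, cloud_calls

# A user who rejects any suggestion missing the trailing '!':
out, calls = complete("hello", lambda s: s.endswith("!"))
print(out, calls)  # hello world! 1
```

When the local suggestion is accepted, the cloud model is never invoked, which is the source of the latency and cost savings the abstract reports.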
Preprint
Article: fse26maina-p1516-p
Unfulfilled Promises: LLM-Based Detection of OS Compatibility Issues in Infrastructure as Code
Georgios-Petros Drosos,
Georgios Alexopoulos,
Thodoris Sotiropoulos,
Dimitris Mitropoulos, and
Zhendong Su
(ETH Zurich, Switzerland; University of Athens, Greece)
Modern infrastructures rely on Infrastructure as Code (IaC) systems to keep complex deployments consistent, reproducible, and scalable at production scale. The reliability of these infrastructures, however, depends on the correctness of their building blocks: reusable components (modules), each of which performs a dedicated task, such as installing a package, managing an OS user, or configuring a service, and reconciles its state with the desired specification. A central promise of these components is portability: a specification written once should correctly manage the targeted resource on every OS the IaC component supports. When this property is violated, defects can propagate across entire infrastructures, causing outages, security vulnerabilities, and costly misconfigurations.
In this work, we introduce crOSsible, the first automated framework for cross-OS testing of IaC modules. crOSsible leverages large language models (LLMs) to synthesize and repair integration tests from structured module documentation, and executes them across 13 versions of 8 major Linux distributions. While our techniques are generally applicable to different IaC systems, we instantiate and evaluate them on Ansible, the most widely used IaC framework for managing individual servers. Evaluation across 259 popular Ansible modules demonstrates both effectiveness and real-world impact. In just 12 hours of testing, crOSsible uncovered 36 previously unknown bugs, including 22 portability violations. In total, 27 issues have been confirmed by maintainers, with 17 already fixed. The discovered issues range from crashes to dangerous soundness defects where modules reported success despite leaving systems misconfigured. Beyond bug discovery, crOSsible improved the code coverage of Ansible modules by 12.3% on average, systematically exercising OS-specific code paths that existing tests missed.
Article Search
Article: fse26maina-p2091-p
Automating Dockerfile Refactoring to Multi-stage Builds
Dongjin Chen,
Wenhua Yang,
Minxue Pan, and
Yu Zhou
(Nanjing University of Aeronautics and Astronautics, China; Nanjing University, China)
Containerization has become a cornerstone of modern software deployment, yet many projects still ship single-stage Dockerfiles that bundle compilers, build tools, and temporary files into production images, thereby hurting performance and security. Multi-stage builds are recommended, yet uptake appears uneven, plausibly because refactoring legacy Dockerfiles demands nontrivial reasoning about build lifecycles and dependency separation. This paper presents StageCraft, an automated refactoring approach that converts single-stage Dockerfiles into optimized multi-stage builds. StageCraft first performs static analysis to identify the technology stack and to infer build-time and runtime dependencies. It then applies a lightweight gate that estimates the refactoring benefit from a composite of image bloat, structural inefficiency, and security risk, proceeding only when refactoring is warranted. Finally, it synthesizes a multi-stage Dockerfile that isolates build tooling, copies only the artifacts needed at runtime, and applies production hardening. Evaluated on 521 real-world single-stage projects, StageCraft successfully produced working multi-stage Dockerfiles for 60.3% of targets. The resulting images were, on average, 52.2% smaller and contained 50.0% fewer high-risk vulnerabilities than the originals, outperforming baselines. StageCraft lowers the barrier to multi-stage adoption at scale, yielding leaner images with a reduced attack surface and improved maintainability. We release the tool, its knowledge assets, and the evaluation dataset to support reproducibility and future research.
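StageCraft's gate idea, estimating refactoring benefit before rewriting, can be approximated with a toy heuristic. The keyword list, the scoring, and the threshold below are invented for illustration; the real gate combines image bloat, structural inefficiency, and security risk signals not modeled here.

```python
# Toy refactoring-benefit gate for single-stage Dockerfiles: count signs of
# build tooling baked into the production image. All heuristics here are
# fabricated simplifications of StageCraft's composite gate.

BUILD_TOOLS = {"gcc", "g++", "make", "maven", "cargo", "npm"}

def refactoring_warranted(dockerfile_text, threshold=2):
    score = 0
    for line in dockerfile_text.splitlines():
        tokens = line.replace("&&", " ").split()
        if tokens[:1] == ["RUN"] and BUILD_TOOLS & set(tokens):
            score += 1   # compilers/build tools installed or invoked
        if tokens[:1] == ["COPY"] and "." in tokens:
            score += 1   # whole build context copied into the image
    return score >= threshold

df = """FROM ubuntu:22.04
RUN apt-get install -y gcc make
COPY . /app
RUN make -C /app"""
print(refactoring_warranted(df))  # True
```

A runtime-only Dockerfile (no build tooling, artifact-only COPY) falls below the threshold, mirroring the abstract's point that refactoring should proceed only when warranted.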
Article Search
Article: fse26maina-p2093-p
Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark
Aoyang Fang,
Songhan Zhang,
Yifan Yang,
Haotong Wu,
Junjielong Xu,
Xuyang Wang,
Rui Wang,
Manyi Wang,
Qisheng Lu, and
Pinjia He
(Chinese University of Hong Kong, Shenzhen, China)
Article Search
Article: fse26maina-p2270-p
Comment Traps: How Defective Commented-Out Code Augment Defects in AI-Assisted Code Generation
Yuan Huang,
Yukang Zhou,
Xiangping Chen, and
Zibin Zheng
(Sun Yat-sen University, China)
With the rapid development of large language models in code generation, AI-powered editors such as GitHub Copilot and Cursor are revolutionizing software development practices. At the same time, studies have identified potential defects in the generated code. Previous research has predominantly examined how code context influences the generation of defective code, often overlooking the impact of defects within commented-out code (CO code). AI coding assistants' interpretation of CO code in prompts affects the code they generate.
This study evaluates how AI coding assistants, GitHub Copilot and Cursor, are influenced by defective CO code. The experimental results show that defective CO code in the context causes AI coding assistants to generate more defective code, reaching up to 58.17%. Our findings further demonstrate that the tools do not simply copy the defective code from the context. Instead, they actively reason to complete incomplete defect patterns and continue to produce defective code despite distractions such as incorrect indentation or tags. Even with explicit instructions to ignore the defective CO code, the reduction in defects does not exceed 21.84%. These findings underscore the need for improved robustness and security measures in AI coding assistants.
Preprint
Article: fse26maina-p2429-p
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
Jia Li,
Hongyi Deng,
Yiran Zhang,
Kechi Zhang,
Tianqi Shao,
Tiankuo Zhao,
Weinan Wang,
Jin Zhi,
Ge Li,
Yang Liu,
Yingtao Fang, and
Yihong Dong
(Wuhan University, China; Peking University, China; Nanyang Technological University, Singapore)
Article Search
Article: fse26maina-p2570-p
AccessDroid: Detecting Screen Reader Accessibility Issues in Android Applications via Semantics Trees
Hang Zhou and
Wei Song
(Nanjing University of Science and Technology, China)
Screen readers are essential for visually impaired users to access Android apps, but inadequate developer support often leads to semantic ambiguity or label missing. While prior work has focused primarily on label missing issues, semantic ambiguity remains underexplored. In this paper, we categorize screen reader accessibility issues into three types: semantics separation, semantics downshift, and semantics omission. Bootstrapped by semantics trees, we propose AccessDroid, a lightweight dynamic analysis approach that automatically detects screen reader accessibility issues in Android app pages and generates diagnostic reports. Applied to 361 runtime pages from 50 real-world Android apps, AccessDroid successfully detects 249 true semantics separation issues, 279 true semantics downshift issues, and 170 true semantics omission issues. On average, AccessDroid only spends 15 milliseconds per app page, and achieves a precision of 98.8% and a recall of 98.3%, significantly outperforming two baseline approaches.
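The tree-walking style of detection described above can be sketched minimally. The node structure and the single rule shown (an interactive widget exposing no label to the screen reader) are simplifications; AccessDroid's semantics trees and its three issue categories are not reproduced here.

```python
# Minimal sketch: walk a widget tree and flag clickable nodes that expose
# neither a content description nor visible text to a screen reader. The
# dict-based node format is an invented stand-in for a real view hierarchy.

def find_unlabeled(node, path="root"):
    issues = []
    if node.get("clickable") and not node.get("content_desc") and not node.get("text"):
        issues.append(path)                      # unlabeled interactive widget
    for i, child in enumerate(node.get("children", [])):
        issues += find_unlabeled(child, f"{path}/{i}")
    return issues

page = {"children": [
    {"clickable": True, "text": "Submit"},                   # labeled, fine
    {"clickable": True},                                     # unlabeled button
    {"clickable": False, "children": [{"clickable": True}]}  # nested issue
]}
print(find_unlabeled(page))  # ['root/1', 'root/2/0']
```

Detecting the subtler semantics separation and semantics downshift issues requires reasoning about how adjacent nodes are grouped and announced, which is where the semantics-tree abstraction earns its keep.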
Article Search
Artifacts Available
Article: fse26maina-p2719-p
Debugging Engine Enhanced by Prior Knowledge: Can We Teach LLM How to Debug?
Kunyi Li,
Sai Wu,
Xiu Tang,
Chang Yao,
Songhao Bu,
Quanqing Xu, and
Gang Chen
(Zhejiang University, China; Ant Group, China)
Automated Program Repair (APR) powered by Large Language Models (LLMs) has shown strong potential for improving software reliability. However, existing LLM-based APR approaches underutilize the rich debugging knowledge latent in large-scale bug-fix corpora. Prior work has primarily advanced APR through multi-agent coordination, prompt engineering, task decomposition, or model training. While effective to some extent, these methods rely heavily on the implicit reasoning capacity of LLMs, without explicitly modeling the debugging knowledge that underlies successful program repair. We present DeepK, a novel framework that systematically extracts, validates, and reuses debugging knowledge to guide LLMs in APR. Rather than treating historical bug-fix pairs solely as contextual exemplars, DeepK distills them into a structured knowledge base of verified debugging knowledge. This knowledge base can be seamlessly integrated into diverse APR pipelines, providing interpretable reasoning traces and step-wise repair strategies that enhance the repair performance.
We evaluate DeepK on multiple benchmarks (ACPR, Atcoder) using both GPT-4o and DeepSeek-v3. Results show that DeepK consistently surpasses state-of-the-art APR systems in repair accuracy. Ablation studies confirm the importance of edit description generation, multi-perspective retrieval, and two-fold debugging knowledge. Varying the number of retrieved entries further highlights the trade-off between informativeness and noise. These findings emphasize the essential contribution of explicit debugging knowledge in advancing LLM-based APR and establish DeepK as an extensible framework for building more reliable repair systems.
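The retrieval step over such a knowledge base can be sketched with a simple token-overlap retriever. The knowledge entries and the Jaccard similarity below are illustrative stand-ins; DeepK's actual multi-perspective retrieval and edit description generation are not reproduced here.

```python
# Sketch of retrieving debugging knowledge for a new bug by lexical overlap.
# Entries and the similarity measure are fabricated for illustration.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(bug_description, knowledge_base, k=1):
    """Return the k entries whose recorded symptom best matches the bug."""
    ranked = sorted(knowledge_base,
                    key=lambda e: jaccard(bug_description, e["symptom"]),
                    reverse=True)
    return ranked[:k]

kb = [
    {"symptom": "index out of range in loop", "fix": "use range(len(xs))"},
    {"symptom": "division by zero", "fix": "guard the denominator"},
]
hit = retrieve("loop raises index out of range", kb)[0]
print(hit["fix"])  # use range(len(xs))
```

The retrieved entry would then be handed to the LLM as a verified, interpretable repair strategy rather than as a raw contextual exemplar, which is the distinction the abstract draws.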
Article Search
Info
Article: fse26maina-p2854-p
In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome
Ardi Madadi,
Seongmin Lee,
Cornelius Aschermann, and
Marcel Böhme
(MPI-SP, Germany; University of California at Los Angeles, USA; Ruhr-University Bochum, Germany)
In Google’s FuzzBench platform, we find that the outcome of coverage-based evaluation more strongly agrees with the outcome of a bug-based evaluation than an independent bug-based evaluation itself. Recently, Böhme et al. found that despite a very strong correlation between coverage achieved and bugs found, there is no strong agreement between the outcome of a coverage- and a bug-based evaluation: The fuzzer best at achieving coverage may be the worst at finding bugs. However, in trying to explain this moderate agreement, we wondered whether the outcome of bug-based benchmarking itself is perhaps much more “noisy” and turned to applied statistics to develop the tools necessary to investigate our hypothesis.
In this paper, we call this degree of “noisiness” of a benchmarking outcome the concordance of the benchmarking procedure and quantify it using a measure of statistical reliability widely used in psychology, called mean split-half reliability, i.e., the expected agreement on the benchmark outcome between two random halves of the benchmarking suite. In our experiments with FuzzBench and Magma, we find that the concordance of coverage-based benchmarking is consistently strong while that of bug-based benchmarking is weak on FuzzBench and moderate on Magma. In contrast to FuzzBench, for the Magma benchmark suite (which was designed for bug-based evaluation) a coverage-based evaluation does not predict the outcome of a bug-based evaluation better than an independent bug-based evaluation.
Moreover, to demonstrate the utility of concordance also for developers of benchmarking suites, we investigate concordance as a measure of benchmarking efficiency, as in green fuzzer benchmarking. We empirically confirm that the resources of a procedure with higher concordance can be reduced more substantially (in terms of campaign length or benchmark sampling size) while maintaining a similar benchmark outcome as a procedure with lower concordance. We report the corresponding savings in terms of carbon emissions.
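The mean split-half reliability measure can be sketched directly: repeatedly split the benchmark programs into two random halves, rank the fuzzers within each half, and average the rank agreement across splits. The bug-count matrix and the tie handling below are invented simplifications of the paper's procedure.

```python
# Sketch of mean split-half reliability ("concordance") for fuzzer
# benchmarking. The score matrix is fabricated; ties count against
# agreement in this simplified Kendall-tau variant.

import random
from itertools import combinations

def kendall_tau(r1, r2):
    """Rank agreement in [-1, 1]; tied pairs count as disagreement here."""
    pairs = list(combinations(range(len(r1)), 2))
    concordant = sum(1 for i, j in pairs
                     if (r1[i] - r1[j]) * (r2[i] - r2[j]) > 0)
    return 2.0 * concordant / len(pairs) - 1.0

def split_half_concordance(scores, splits=200, seed=0):
    """scores[fuzzer] = per-benchmark bug counts (or coverage).
    Mean agreement between fuzzer totals on two random halves."""
    rng = random.Random(seed)
    fuzzers = list(scores)
    benchmarks = list(range(len(scores[fuzzers[0]])))
    taus = []
    for _ in range(splits):
        rng.shuffle(benchmarks)
        half, other = benchmarks[:3], benchmarks[3:]
        totals = lambda subset: [sum(scores[f][b] for b in subset)
                                 for f in fuzzers]
        taus.append(kendall_tau(totals(half), totals(other)))
    return sum(taus) / len(taus)

# Fuzzer quality is consistent across benchmarks -> concordance near 1.
scores = {"A": [9, 8, 9, 10, 8, 9],
          "B": [5, 4, 6, 5, 4, 5],
          "C": [1, 2, 1, 2, 1, 2]}
print(split_half_concordance(scores))  # 1.0
```

Replacing the consistent bug counts with noisy, benchmark-dependent ones drives the measure toward zero, which is the weak bug-based concordance the paper observes on FuzzBench.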
Article Search
Article: fse26maina-p3037-p
Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-Preserving Transformations
Jiyong Uhm,
Minseok Kim,
Michalis Polychronakis, and
Hyungjoon Koo
(Sungkyunkwan University, Republic of Korea; Stony Brook University, USA)
Binary code analysis plays an essential role in cybersecurity, facilitating reverse engineering to reveal the inner workings of programs in the absence of source code. Traditional approaches, such as static and dynamic analysis, extract valuable insights from stripped binaries, but often demand substantial expertise and manual effort. Recent advances in deep learning have opened promising opportunities to enhance binary analysis by capturing latent features and disclosing underlying code semantics. Despite the growing number of binary analysis models based on machine learning, their robustness to adversarial code transformations at the binary level remains underexplored to date. In this work, we evaluate the robustness of deep learning models for the task of binary code similarity detection (BCSD) under semantics-preserving transformations. The unique nature of machine instructions presents distinct challenges compared to the typical input perturbations found in other domains. To achieve our goal, we introduce asmFooler, a system that evaluates the resilience of BCSD models using a diverse set of adversarial code transformations that preserve functional semantics. We construct a dataset of 9,565 binary variants from 620 baseline samples by applying eight semantics-preserving transformations across six representative BCSD models. 
Our major findings highlight several key insights: i) model robustness highly relies on the design of the processing pipeline, including code pre-processing, model architecture, and internal feature selection, which collectively determine how code semantics are captured; ii) the effectiveness of adversarial transformations is bounded by a transformation budget, shaped by model-specific constraints such as input size limits and the expressive capacity of semantically equivalent instructions; iii) well-crafted adversarial transformations can be highly effective, even when introducing minimal perturbations; and iv) such transformations efficiently disrupt the model’s decision (e.g., misleading to false positives or false negatives) by focusing on semantically significant instructions.
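The flavor of semantics-preserving transformation discussed above can be illustrated with two classic instruction equivalences applied as text rewrites. The rewrite rules are well-known x86 identities, but the regex-based rewriter is a toy; asmFooler's transformation set and its model-aware selection of significant instructions are not reproduced here.

```python
# Toy semantics-preserving rewrites over x86-style assembly text. Each rule
# swaps an instruction for a functionally equivalent one, perturbing the
# surface form a BCSD model sees without changing program behavior.

import re

REWRITES = [
    # xor reg, reg  ==  mov reg, 0   (both zero the register)
    (re.compile(r"xor (\w+), \1\b"), r"mov \1, 0"),
    # add reg, 1    ==  inc reg
    (re.compile(r"add (\w+), 1\b"), r"inc \1"),
]

def transform(asm):
    lines = []
    for line in asm.splitlines():
        for pattern, repl in REWRITES:
            line = pattern.sub(repl, line)
        lines.append(line)
    return "\n".join(lines)

code = "xor eax, eax\nadd ecx, 1\nmov ebx, 5"
print(transform(code))  # mov eax, 0 / inc ecx / mov ebx, 5
```

Even such minimal perturbations can flip a similarity verdict when they hit semantically significant instructions, which is the paper's fourth finding.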
Article Search
Article: fse26maina-p3336-p