FSE – Journal Issue

Editorial Message
The Proceedings of the ACM series presents the highest-quality research conducted in diverse areas of computer science, as represented by the ACM Special Interest Groups. The Proceedings of the ACM on Software Engineering (PACMSE) focuses on top-quality, original research on all aspects of software engineering, from requirements elicitation to quality assessment and from design to maintenance, evolution, and deployment. PACMSE covers a broad range of topics and methods that help conceive, create, and maintain better software, be it embedded, cloud-based, mobile, ubiquitous, or one that runs on conventional computers. The journal welcomes contributions to new methodologies, tools, theories, and models, as well as empirical studies and survey papers related to the wide spectrum of software engineering topics. We particularly welcome contributions that offer additional artifacts, such as code, datasets, etc., in the light of reproducibility. The journal operates in close collaboration with the ACM Special Interest Group on Software Engineering (SIGSOFT) and is committed to making high-quality peer-reviewed scientific research in software engineering free of restrictions on both access and use.

Article: fse26foreword-fm001-p doi:

Research Papers

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-Breaking
Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, and Yang Liu
(Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. This democratization of software development has significantly lowered the barriers to entry for complex programming tasks. However, the same accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software, including malware, ransomware, and other security threats. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target and insufficient technical expertise to evaluate whether generated outputs align with specified malicious objectives. To address this gap, we propose SPELL, a comprehensive testing framework for LLM developers and the Secure Team, specifically designed to evaluate the weakness of security alignment in malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%, including four instances flagged as extremely dangerous by all detection tools. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.

Publisher's Version Article: fse26maina-p10-p doi:10.1145/3797063

Still Manual? Automated Linter Configuration via DSL-Based LLM Compilation of Coding Standards
Zejun Zhang, Yixin Gan, Zhenchang Xing, Tian Zhang, Yi Li, Qinghua Lu, Sherry (Xiwei) Xu, and Liming Zhu
(Nanjing University, China; CSIRO's Data61, Australia; Nanyang Technological University, Singapore)
Coding standards are essential for maintaining consistent and high-quality code across teams and projects. Linters help developers enforce these standards by detecting code violations. However, manual linter configuration is complex and expertise-intensive, and the diversity and evolution of programming languages, coding standards, and linters lead to repetitive and maintenance-intensive configuration work. To reduce manual effort, we propose LintCFG, a domain-specific language (DSL)-driven, LLM-based compilation approach to automate linter configuration generation for coding standards, independent of programming languages, coding standards, and linters. Inspired by compiler design, we first design a DSL to express coding rules in a tool-agnostic, structured, readable, and precise manner. Then, we build linter configurations into DSL configuration instructions. For a given natural language coding standard, the compilation process parses it into DSL coding standards, matches them with the DSL configuration instructions to set configuration names, option names and values, verifies consistency between the standards and configurations, and finally generates linter-specific configurations. Experiments with Checkstyle for Java coding standards show that our approach achieves over 90% precision and recall in DSL representation, with accuracy, precision, recall, and F1-scores close to 70% (with some exceeding 70%) in fine-grained linter configuration generation. Notably, our approach outperforms baselines by over 100% in precision. An ablation study confirms the effectiveness of the main components of our approach. A user study further shows that our approach improves developers’ efficiency in configuring linters for coding standards. Finally, we demonstrate the generality of the approach by generating ESLint configurations for JavaScript coding standards, showcasing its broad applicability across other programming languages, coding standards, and linters.
We developed a lightweight, general-purpose AI skill, which is publicly available on GitHub.
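
The final step of the pipeline described above, turning a tool-agnostic rule into a linter-specific configuration, can be illustrated with a minimal sketch. It is not LintCFG's DSL or implementation: the rule name `max_line_length`, the `CHECKSTYLE_MAP` table, and the `to_checkstyle` function are hypothetical, and only the Checkstyle `LineLength` module with its `max` property is taken from Checkstyle itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from tool-agnostic rule names to a
# Checkstyle module name and the property that carries the value.
CHECKSTYLE_MAP = {
    "max_line_length": ("LineLength", "max"),
}

def to_checkstyle(rules: dict) -> str:
    """Render a dict of tool-agnostic rules as a Checkstyle module tree."""
    checker = ET.Element("module", name="Checker")
    for rule, value in rules.items():
        # An unknown rule raises KeyError, so unmapped rules fail loudly.
        module_name, prop = CHECKSTYLE_MAP[rule]
        module = ET.SubElement(checker, "module", name=module_name)
        ET.SubElement(module, "property", name=prop, value=str(value))
    return ET.tostring(checker, encoding="unicode")

config = to_checkstyle({"max_line_length": 100})
print(config)
```

Supporting another linter under this sketch would mean swapping the mapping table and the renderer while the rule dictionary stays unchanged, which is the sense in which the intermediate form is tool-agnostic.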

Publisher's Version Article: fse26maina-p27-p doi:10.1145/3797064

GadgetHunter: Region-Based Neuro-symbolic Detection of Java Deserialization Vulnerabilities
Kaixuan Li, Jian Zhang, Chong Wang, Sen Chen, Zong Cao, Min Zhang, and Yang Liu
(Nanyang Technological University, Singapore; Beihang University, China; Nankai University, China; Imperial Global Singapore of Imperial College London, Singapore; East China Normal University, China)
Java deserialization vulnerabilities (JDVs) enable attackers to execute arbitrary code by crafting malicious serialized objects that trigger sequences of method calls (gadget chains) leading to dangerous operations. Existing detection approaches face a fundamental trade-off: static analysis achieves scalability but suffers from high false positives due to infeasible paths and imprecision with dynamic features like reflection; dynamic validation reduces false positives but incurs prohibitive costs and fails to explore deep exploitation chains.
We present GadgetHunter, a neuro-symbolic JDV detector that combines scalable static analysis with targeted LLM reasoning and JDV exploitation-oriented constraint solving. Our approach partitions gadget chains into regions based on analyzability: statically resolvable segments are processed via interprocedural taint analysis, while dynamic boundaries are delegated to LLMs for semantic validation. We then extract critical constraints from each gadget and compose them into SMT formulas to determine chain feasibility through satisfiability solving. Evaluation on the ysoserial benchmark demonstrates that GadgetHunter reduces false negatives by up to 32% and false positives by 12-85% compared to state-of-the-art tools, while discovering 197 previously unknown gadget chains and rediscovering 4 recent CVEs. Our results show that combining symbolic reasoning with semantic understanding achieves both precision and practical impact in vulnerability detection.

Publisher's Version Article: fse26maina-p42-p doi:10.1145/3797065

SnakeCharmer: Automatic Fuzzing Harness Generation for Pure and Hybrid Python Libraries
Gabriel Sherman and Stefan Nagy
(University of Utah, USA)
With Python’s rising popularity, ensuring the correctness of its ever-growing ecosystem of software libraries is more critical than ever. Recently, fuzzing has become a de facto technique for vetting software libraries, enabled via the use of harnesses: small wrapper programs that inject fuzzer-generated test cases into the library under test. While harness creation has shed its reliance on human expertise and is now fully automated for languages such as C and C++, Python remains uniquely challenging—both for pure Python libraries as well as hybrid ones combining Python with native C/C++ extensions—due to (1) limited visibility across language boundaries, (2) the absence of reliable bug oracles, and (3) incomplete type information. Consequently, attempts at automating harnessing for Python fail to both uphold critical runtime behaviors and produce the structured call and data flows needed for effective fuzzing, leaving much of today’s Python ecosystem largely unvetted.
To overcome these challenges and broaden fuzzing’s reach across Python libraries, this paper introduces SnakeCharmer: the first automated harness generation approach for both pure and hybrid Python libraries. At its core, SnakeCharmer leverages static analysis to first capture key interface information from both Python and native code components, subsequently enriching it with runtime-captured type information and exception behaviors. During fuzzing, SnakeCharmer further distinguishes between expected exceptions and true library bugs, filtering out benign exceptions that would otherwise derail testing progress. Together, these techniques significantly enhance the scope and effectiveness of fuzzing across the Python library ecosystem, enabling the automated discovery of bugs in code previously inaccessible to existing Python fuzzing efforts.
We evaluate SnakeCharmer alongside today’s leading Python auto-harnessing approach, PyRTFuzz; the actively fuzzed expert-written harnesses from both OSS-Fuzz and PolyFuzz; and the harnesses generated by Google’s own state-of-the-art LLM-driven automatic harnessing approach, OSS-Fuzz-Gen. Across 21 diverse Python libraries, SnakeCharmer attains type-recovery precision and exception-filtering accuracy of 95% and 97%, respectively, further attaining 1.48×, 1.87×, 1.78×, and 1.40× the code coverage of the fuzzing harnesses from PyRTFuzz, OSS-Fuzz, PolyFuzz, and OSS-Fuzz-Gen, respectively. Further, SnakeCharmer finds 16, 24, and 24 more Python library bugs than all expert- and LLM-created harnesses as well as PyRTFuzz, respectively—uncovering a total of 20 new bugs, with 18 since confirmed or fixed by developers.
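
The distinction between expected exceptions and true library bugs can be illustrated with a minimal sketch. It is not SnakeCharmer's implementation: the toy target `parse_header`, the `EXPECTED` tuple, and the `run_one` harness are hypothetical, and the expected-exception set is written by hand here rather than captured at runtime.

```python
import struct

# Hypothetical library function under test (illustrative only).
def parse_header(data: bytes) -> int:
    if len(data) < 4:
        raise ValueError("header too short")  # deliberate input rejection
    (length,) = struct.unpack(">I", data[:4])
    return 1024 // (length % 7)  # latent bug: divides by zero when length % 7 == 0

# Exceptions assumed to be part of the library's expected behavior.
EXPECTED = (ValueError,)

def run_one(data: bytes) -> str:
    """Feed one fuzzer-generated input to the target and classify the outcome."""
    try:
        parse_header(data)
    except EXPECTED:
        return "benign"  # filtered out so it does not interrupt fuzzing
    except Exception:
        return "bug"     # any other exception is reported as a finding
    return "ok"

for sample in (b"", b"\x00\x00\x00\x01", b"\x00\x00\x00\x07"):
    print(sample, run_one(sample))
```

Without the `EXPECTED` filter, every short input would be reported as a crash and the one genuine defect would be buried among benign rejections.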

Publisher's Version Article: fse26maina-p67-p doi:10.1145/3797066

UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning
Ye Fan, Jidong Ge, Chuanyi Li, LiGuo Huang, and Bin Luo
(Nanjing University, China; Southern Methodist University, USA)
While pre-trained models have achieved remarkable success in code search, their multilingual capabilities remain a major hurdle, plagued by data imbalance, cross-lingual semantic interference, and the loss of critical information from existing unified representations like Abstract Syntax Trees (ASTs) or Intermediate Representations (IRs). Furthermore, conventional contrastive learning strategies often rely on simplistic hard negative sampling while overlooking the potential of mining hard positives to learn code's intrinsic semantic invariance. To address these challenges, we introduce UNICS, a framework for multilingual code search built on a two-stage training strategy. In the first stage, UNICS is pre-trained on a novel dataset we constructed, which uses pseudo-code as a unified representation to learn a cross-lingual, algorithm-level logic that preserves full semantic fidelity. The second stage employs a multi-task transfer learning strategy that adapts this general knowledge to specific languages by decomposing code into semantic slices (e.g., API calls, function bodies) and incorporating tasks for hard positive mining and cross-lingual dynamic hard negative sampling. Experimental results demonstrate that UNICS achieves state-of-the-art performance across multiple multilingual and cross-lingual benchmarks, showcasing superior generalization and performance balance, especially in zero-shot transfer tasks to low-resource languages.

Publisher's Version Article: fse26maina-p88-p doi:10.1145/3797067

Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair
Zaoyu Chen, Haoran Qin, Nuo Chen, Xiangyu Zhao, Lei Xue, Xiapu Luo, and Xiao-Ming Wu
(Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
Smart contracts, predominantly written in Solidity and executed on blockchains like Ethereum, are immutable, making functional correctness paramount: once deployed, bugs and vulnerabilities become permanent. Despite rapid progress in transformer-based code LLMs, existing evaluations of Solidity code completion rely heavily on surface-form metrics (e.g., BLEU, CrystalBLEU) or hand-grading, which poorly correlate with functional correctness. Unlike Python, Solidity lacks large-scale and execution-based benchmarks, hindering systematic assessment and optimization of LLMs for smart contract development.
To bridge this research gap, we introduce SolBench, a comprehensive benchmark and automated testing pipeline for Solidity, designed to emphasize functional correctness via differential fuzzing. SolBench contains 28,825 functions from 7,604 contracts collected from Etherscan (genesis to 2024), spanning 10 popular domains. We benchmark 14 diverse LLMs (open/closed, 1.3B to 671B parameters, general/code-specific, with/without reasoning). The dominant failure mode is missing crucial details (e.g., type definitions, state variables) in intra-contract context. Providing full-contract context mitigates this and improves code completion accuracy.
However, full-context inference can be prohibitively expensive in practice. Generating outputs with large context windows using state-of-the-art models often incurs significant costs, rendering naive context scaling economically impractical. Crucially, most of a contract is irrelevant to implementing a given function; only a small subset of details is needed. To exploit this, we propose Retrieval-Augmented Repair (RAR), which integrates retrieval into code repair: it uses the executor's error messages to extract only the most relevant snippets from the full contract. RAR sharply reduces input length for function completion, improving accuracy while significantly cutting computational cost. We further analyze retrieval and code repair strategies within RAR, showing substantial improvements in accuracy and efficiency. SolBench and our RAR framework enable principled evaluation and cost-effective improvement of Solidity code generation. Dataset and code are available at https://github.com/ZaoyuChen/SolBench.
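
The retrieval step described above, using an error message to pull only the relevant parts of a contract into the prompt, can be illustrated with a minimal sketch. It is not the RAR implementation: the identifier-overlap scoring, the function names, and the error message format are all assumptions made for illustration.

```python
import re

def identifiers(text: str) -> set[str]:
    """Extract identifier-like tokens from source code or an error message."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def retrieve(contract_source: str, error_message: str, k: int = 2) -> list[str]:
    """Return up to k contract lines that share the most identifiers with the error."""
    wanted = identifiers(error_message)
    scored = []
    for idx, line in enumerate(contract_source.splitlines()):
        overlap = len(identifiers(line) & wanted)
        if overlap:
            # Negative overlap sorts best matches first; idx breaks ties by position.
            scored.append((-overlap, idx, line.strip()))
    scored.sort()
    return [line for _, _, line in scored[:k]]

contract = """contract Vault {
    mapping(address => uint256) balances;
    address owner;
    uint256 fee;
}"""

# Hypothetical error text; real compiler output is formatted differently.
error = "Undeclared identifier: balances"
print(retrieve(contract, error))
```

Only the retrieved lines would be added to the repair prompt, so input length grows with the number of missing details rather than with the size of the contract.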

Publisher's Version Article: fse26maina-p90-p doi:10.1145/3797068

MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, and Chongkang Tan
(Sun Yat-sen University, China; China Unicom Software Research Institute, China; Individual Researcher, China)
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
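
The online phase described above, instantiating a metadata-level graph for the current topology and then pruning weak links, can be illustrated with a minimal sketch. It is not MetaRCA's algorithm: the entity types, metric names, the `depends_on` relation, and the fixed weight threshold are hypothetical.

```python
# Meta-level causal edges between (entity type, metric) pairs. Hypothetical names.
META_EDGES = [
    (("node", "cpu"), ("pod", "latency")),
    (("pod", "latency"), ("service", "error_rate")),
]

def instantiate(meta_edges, entities, depends_on):
    """Expand meta edges into instance edges for entities related in the live topology.

    entities maps instance name -> entity type;
    depends_on holds (dependent, provider) instance pairs.
    """
    edges = []
    for (src_type, src_metric), (dst_type, dst_metric) in meta_edges:
        for src, s_type in entities.items():
            for dst, d_type in entities.items():
                if s_type == src_type and d_type == dst_type and (dst, src) in depends_on:
                    edges.append(((src, src_metric), (dst, dst_metric)))
    return edges

def prune(edges, weights, threshold=0.5):
    """Keep only edges whose data-derived weight reaches the threshold."""
    return [edge for edge in edges if weights.get(edge, 0.0) >= threshold]

entities = {"node-1": "node", "node-2": "node", "pod-a": "pod", "svc-x": "service"}
depends_on = {("pod-a", "node-1"), ("svc-x", "pod-a")}
local_edges = instantiate(META_EDGES, entities, depends_on)

# Hypothetical weights, standing in for scores computed from real-time data.
weights = {local_edges[0]: 0.9, local_edges[1]: 0.2}
print(prune(local_edges, weights))
```

Because the meta edges mention only types, the same `META_EDGES` list can be reused on a system with a different set of instances, which is the intuition behind the cross-system generalization claim.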

Publisher's Version Article: fse26maina-p149-p doi:10.1145/3797069

Mitigating the Risk of Defects and Improving Knowledge Distribution with Code Reviewer Recommenders
Mohammadali Sefidi Esfahani and Peter C. Rigby
(Concordia University, Canada)
Defects are inevitable in software projects, leading to increased maintenance costs, user dissatisfaction, and a diminished software reputation. Code review is one of the most critical software quality assurance activities that reduces software defects and improves software quality. Prior works have quantified the impact of reviewer recommenders on the risk of inducing new defects based on the highest level of expertise among the developers in the reviewer team. However, our analysis shows that prior work overestimates the safety of a change and ignores the defect-finding effectiveness of the diverse knowledge of reviewers. In this study, we incorporate the knowledge of the entire reviewer team into the author’s level of expertise and introduce the novel Contribution-aware Changeset Safety Ratio (CCSR) outcome to assess the impact of code reviewer recommenders on the risk of inducing defects more accurately.
When a pull request is risky, a natural mitigation is to add an expert reviewer. We are unaware of any works that have quantified the impact of adding a reviewer to risky PRs. We propose the novel AddExpertRec(𝐷𝑡) strategy that recommends an additional expert reviewer for defect-prone pull requests to reduce the likelihood of introducing new defects when the risk is above the threshold 𝐷𝑡. The simulation results show that AddExpertRec(𝐷𝑡) can enhance the defect finding effectiveness of existing recommenders while still balancing reviewer workload and spreading knowledge to reduce the impact of turnover. Ultimately, our results give managers the ability to select a recommender strategy that best suits their project needs based on their resource constraints. The scripts and data are available in our replication package.
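
The threshold rule behind AddExpertRec(𝐷𝑡) can be illustrated with a minimal sketch. It is not the authors' implementation: the function signature, the expertise scores, and the choice of the single highest-scoring developer as the added expert are assumptions, and the sketch does not model workload or knowledge spreading.

```python
def add_expert_rec(recommended, risk, threshold, expertise):
    """Return the reviewer list, adding one expert when risk exceeds the threshold.

    recommended: reviewers proposed by an existing recommender
    risk:        estimated defect risk of the pull request
    threshold:   the risk threshold (D_t in the abstract)
    expertise:   hypothetical mapping of developer -> expertise score for the change
    """
    reviewers = list(recommended)
    if risk > threshold:
        candidates = [dev for dev in expertise if dev not in reviewers]
        if candidates:
            reviewers.append(max(candidates, key=lambda dev: expertise[dev]))
    return reviewers

expertise = {"ana": 0.2, "bo": 0.9, "cy": 0.4}
print(add_expert_rec(["ana"], risk=0.8, threshold=0.5, expertise=expertise))
print(add_expert_rec(["ana"], risk=0.3, threshold=0.5, expertise=expertise))
```

Raising the threshold adds an expert to fewer pull requests, which reduces reviewer workload at the cost of leaving more risky changes with their original reviewers.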

Publisher's Version Article: fse26maina-p175-p doi:10.1145/3797070

WalleTruth: Visual-Oriented Software Testing for Web3 Wallet Browser Extensions
Xiaohui Hu, Ningyu He, and Haoyu Wang
(Huazhong University of Science and Technology, China; Hong Kong Polytechnic University, Hong Kong)
Serving as the first touch point for users to the cryptocurrency world, cryptocurrency wallets allow users to manage, receive, and transmit digital assets on blockchains and interact with emerging decentralized finance (DeFi) applications. Unfortunately, cryptocurrency wallets have always been the prime targets for attackers, and incidents of wallet breaches have been reported from time to time. Although some recent studies have characterized the vulnerabilities and scams related to wallets, they have mostly been studied at a coarse granularity, overlooking potential risks inherent in detailed designs of cryptocurrency wallets, especially from perspectives including user interaction and advanced features. To fill the void, in this paper, we present a fine-grained security analysis of browser-based cryptocurrency wallets. To pinpoint security issues in wallet components, we design WalleTruth, a visual-oriented testing framework specifically for browser-based wallet extensions. We have identified 12 attack vectors that can be abused by attackers to exploit cryptocurrency wallets and exposed 21 concrete attack strategies. By applying WalleTruth on 39 widely-adopted browser-based wallet extensions, we find that all of them can be abused to steal crypto assets from innocent users. Identified potential attack vectors were reported to developers in a timely manner and 26 issues have been patched already. This calls for urgent action from the community to mitigate threats related to cryptocurrency wallets.

Publisher's Version Article: fse26maina-p176-p doi:10.1145/3797150

Precondition Synthesis for Deep Neural Networks with Statistical Guarantees
Zengyu Liu, Bai Xue, Pengfei Yang, and Ji Wang
(National University of Defense Technology, China; Institute of Software at Chinese Academy of Sciences, China; Institute of AI for Industries at Chinese Academy of Sciences, China)
Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems. However, existing formal verification methods provide limited quantitative guarantees for their reliable specification, and emerging precondition synthesis techniques are hindered by the scalability and architectural limitations of DNNs. In this paper, we propose a select-and-solve framework, StatPre, to automatically synthesize preconditions with statistical guarantees. StatPre aims to maximally weaken the synthesized preconditions while keeping them as accurate as possible to the real preconditions through a Box-based abstraction. The framework operates in two phases: the center selection phase, which identifies potential center points using a cluster-based heuristic with potential assessment, and the expansion solution phase, which solves the problem of optimizing maximal preconditions by employing statistical model approximation, equivalent constraint transformation, and automatic iterative execution. We evaluated StatPre on 15 models with 27 properties from 6 benchmarks and compared it with 4 existing deterministic and statistical schemes. The results demonstrate that StatPre effectively synthesizes preconditions with broader coverage while accurately approximating the real preconditions in practice. Additionally, StatPre exhibits competitive performance in handling high-dimensional, non-ReLU, complex-structured neural networks.
