Proceedings of the ACM on Software Engineering, Volume 3, Number FSE
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-Breaking
Yifan Huang,
Xiaojun Jia,
Wenbo Guo,
Yuqiang Sun,
Yihao Huang,
Chong Wang, and
Yang Liu
(Nanyang Technological University, Singapore; National University of Singapore, Singapore)
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. This democratization of software development has significantly lowered the barriers to entry for complex programming tasks. However, the same accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software, including malware, ransomware, and other security threats. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target and insufficient technical expertise to evaluate whether generated outputs align with specified malicious objectives. To address this gap, we propose SPELL, a comprehensive testing framework for LLM developers and security teams, specifically designed to evaluate weaknesses in the security alignment of malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12%, respectively, across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%, including four instances flagged as extremely dangerous by all detection tools. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.
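The time-division balance between exploration and exploitation that the abstract describes could be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation; the function name, the linear schedule, and the success-count bookkeeping are all assumptions.

```python
import random

def select_sentences(pool, success_counts, step, total_steps, k=3, rng=None):
    """Pick k prompt sentences from a prior-knowledge pool.

    Early steps favor exploration (random sentences); as `step`
    approaches `total_steps`, selection shifts toward exploiting the
    sentences with the best observed success counts.
    """
    rng = rng or random.Random(0)
    explore_ratio = 1.0 - step / total_steps  # time-division schedule
    ranked = sorted(pool, key=lambda s: success_counts.get(s, 0), reverse=True)
    chosen = []
    for _ in range(k):
        if rng.random() < explore_ratio:
            chosen.append(rng.choice(pool))                   # explore a novel sentence
        else:
            chosen.append(ranked[len(chosen) % len(ranked)])  # exploit top sentences
    return chosen
```

At the final step the schedule is pure exploitation, so the best-scoring sentence always leads the prompt; at step zero every pick is random.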
Article: fse26maina-p10-p
Still Manual? Automated Linter Configuration via DSL-Based LLM Compilation of Coding Standards
Zejun Zhang,
Yixin Gan,
Zhenchang Xing,
Tian Zhang,
Yi Li,
Qinghua Lu,
Sherry (Xiwei) Xu, and
Liming Zhu
(Nanyang Technological University, Singapore; Nanjing University, China; CSIRO’s Data61, Australia)
Article: fse26maina-p27-p
GadgetHunter: Region-Based Neuro-symbolic Detection of Java Deserialization Vulnerabilities
Kaixuan Li,
Jian Zhang,
Chong Wang,
Sen Chen,
Zong Cao,
Min Zhang, and
Yang Liu
(Nanyang Technological University, Singapore; Beihang University, China; Nankai University, China; Imperial Global Singapore of Imperial College London, Singapore; East China Normal University, China)
Java deserialization vulnerabilities (JDVs) enable attackers to execute arbitrary code by crafting malicious serialized objects that trigger sequences of method calls (gadget chains) leading to dangerous operations. Existing detection approaches face a fundamental trade-off: static analysis achieves scalability but suffers from high false positives due to infeasible paths and imprecision with dynamic features like reflection; dynamic validation reduces false positives but incurs prohibitive costs and fails to explore deep exploitation chains.
We present GadgetHunter, a neuro-symbolic JDV detector that combines scalable static analysis with
targeted LLM reasoning and JDV exploitation-oriented constraint solving. Our approach partitions gadget chains into regions based on analyzability: statically resolvable segments are processed via interprocedural taint analysis, while dynamic boundaries are delegated to LLMs for semantic validation. We then extract critical constraints from each gadget and compose them into SMT formulas to determine chain feasibility through satisfiability solving. Evaluation on the ysoserial benchmark demonstrates that GadgetHunter reduces false negatives by up to 32% and false positives by 12-85% compared to state-of-the-art tools, while discovering 197 previously unknown gadget chains and rediscovering 4 recent CVEs. Our results show that combining symbolic reasoning with semantic understanding achieves both precision and practical impact in vulnerability detection.
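The region partitioning that GadgetHunter performs on gadget chains could be sketched as below. This is a toy illustration under assumptions: the marker set and call-site strings are hypothetical, and real analyzability decisions come from static analysis, not substring matching.

```python
# Reflection-related call sites that static taint analysis cannot resolve;
# in this sketch they mark "dynamic" boundaries delegated to an LLM.
DYNAMIC_MARKERS = {"Method.invoke", "Class.forName", "Proxy"}

def partition_chain(chain):
    """Partition a list of call sites into analyzability regions.

    Contiguous statically resolvable calls form 'static' regions
    (handled by interprocedural taint analysis); reflective calls become
    single-call 'dynamic' regions (delegated to LLM semantic validation).
    """
    regions, current = [], []
    for call in chain:
        if any(marker in call for marker in DYNAMIC_MARKERS):
            if current:
                regions.append(("static", current))
                current = []
            regions.append(("dynamic", [call]))
        else:
            current.append(call)
    if current:
        regions.append(("static", current))
    return regions
```

A ysoserial-style chain ending in `Runtime.exec` through `Method.invoke` would thus yield a static prefix, one dynamic boundary, and a static suffix.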
Article: fse26maina-p42-p
Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair
Zaoyu Chen,
Haoran Qin,
Nuo Chen,
Xiangyu Zhao,
Lei Xue,
Xiapu Luo, and
Xiao-Ming Wu
(Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
Article: fse26maina-p90-p
MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Shuai Liang,
Pengfei Chen,
Bozhe Tian,
Gou Tan,
Maohong Xu,
Youjun Qu,
Yahui Zhao,
Yiduo Shang, and
Chongkang Tan
(Sun Yat-sen University, China; China Unicom Software Research Institute, China; Individual Researcher, China)
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
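The online step of MetaRCA, instantiating the metadata-level MCG into a localized graph and pruning weakly supported links, could be sketched as follows. A minimal illustration with hypothetical names; the real system derives edge weights from real-time observability data rather than a caller-supplied function.

```python
def instantiate_mcg(meta_edges, instances, edge_weight, threshold=0.3):
    """Instantiate a Meta Causal Graph into a localized causal graph.

    meta_edges:  causal links at the metadata level, e.g. ("pod_cpu", "svc_latency").
    instances:   concrete entities per metadata type in the current fault context.
    edge_weight: callable scoring a concrete edge from real-time data
                 (a stand-in for MetaRCA's weighting step).
    Edges scoring below `threshold` are pruned.
    """
    graph = []
    for src_type, dst_type in meta_edges:
        for src in instances.get(src_type, []):
            for dst in instances.get(dst_type, []):
                weight = edge_weight(src, dst)
                if weight >= threshold:
                    graph.append((src, dst, weight))
    return graph
```

Because the MCG is defined over metadata rather than a fixed topology, the same meta edge can be re-instantiated for whatever pods and services exist when a fault occurs.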
Article: fse26maina-p149-p
Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs during Code Adaptation
Tanghaoran Zhang,
Xinjun Mao,
Shangwen Wang,
Yuxin Zhao,
Yao Lu,
Zezhou Tang,
Wenyu Xu,
Longfei Sun,
Changrong Xie,
Kang Yang, and
Yue Yu
(National University of Defense Technology, China; Pengcheng Laboratory, China)
Article: fse26maina-p224-p
Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation
Tianyi Zhang,
Shidong Pan,
Zejun Zhang,
Zhenchang Xing, and
Xiaoyu Sun
(Australian National University, Australia; New York University, USA; Columbia University, USA; Nanyang Technological University, Singapore; CSIRO’s Data61, Australia)
Infrastructure-as-Code (IaC) generation holds significant promise for automating the provisioning of cloud infrastructure. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving only a 20.8–30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark, consisting of 153 real-world scenarios across 58 unique services. We also propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen can make 54.6–91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors can further boost performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates with respect to user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
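The staged generate-validate-refine loop that IaCGen runs could be sketched as below. This is a schematic, not the authors' code: `generate` stands in for an LLM call, the stage checks are placeholders for real format verification, syntax checking, and live deployment.

```python
def iacgen_loop(generate, stages, max_iters=10):
    """Iterative feedback loop over DevOps-style validation stages.

    generate(feedback) returns a candidate template; each stage is a
    (name, check) pair where check(template) returns an error string or
    None. On any failure, the error is fed back into the next attempt.
    """
    feedback = None
    for attempt in range(1, max_iters + 1):
        template = generate(feedback)
        for name, check in stages:
            error = check(template)
            if error:
                feedback = f"{name} failed: {error}"
                break
        else:
            return template, attempt  # all stages passed: deployable
    return None, max_iters
```

The key design point the abstract emphasizes is that the final stage is a live deployment, so "passing" means the template actually provisioned, not merely that it parsed.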
Article: fse26maina-p296-p
Validating LLM-Generated SQL Queries through Metamorphic Prompting
Li Lin,
Qinglin Zhu,
Jintai Hong,
Chong Wang,
Yang Liu, and
Rongxin Wu
(Xiamen University, China; Nanyang Technological University, Singapore)
Large Language Models (LLMs) can translate natural language (NL) into SQL, enabling non-experts to query databases via conversational interfaces. However, the generated SQL often contains intent-violating hallucinations—queries that are syntactically valid and executable, yet semantically misaligned with the user’s question. These failures are especially risky in real-world settings where users cannot verify the correctness. In this paper, we propose MRSQLGen, a framework for detecting intent-violating hallucinations, built on the metamorphic prompting paradigm. MRSQLGen rewrites the input prompt using task-specific transformation rules derived from a hallucination taxonomy, and validates the generated SQL by checking behavioral consistency across multiple executions. Each transformation is associated with a metamorphic relationship (MR) that defines the expected relation between results; discrepancies are aggregated through a majority-vote strategy to robustly flag hallucinations without ground-truth SQL. We evaluate MRSQLGen on two benchmarks (Spider and Bird) using five representative LLMs, including GPT-4o. Experimental results demonstrate that MRSQLGen consistently outperforms state-of-the-art hallucination detection techniques, achieving higher precision and recall in detecting hallucinated SQL queries.
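The majority-vote aggregation over metamorphic relations that MRSQLGen uses could be sketched as follows. A simplified stand-in: the real framework pairs each transformation with its own MR, whereas this toy applies one relation to every pair.

```python
from collections import Counter

def majority_vote_flag(results, mr_holds):
    """Flag a hallucination via metamorphic-relation majority vote.

    results:  list of (original_result, transformed_result) execution
              pairs, one per metamorphic prompt transformation.
    mr_holds: predicate returning True when a pair satisfies the
              transformation's expected relation.
    The query is flagged as hallucinated when most pairs violate
    their MR, requiring no ground-truth SQL.
    """
    votes = Counter(
        "consistent" if mr_holds(a, b) else "violated" for a, b in results
    )
    return votes["violated"] > votes["consistent"]
```

With an equality relation (e.g., a filter-reordering transformation should not change the result set), two violations out of three flag the query.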
Article: fse26maina-p328-p
Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models
Shangbo Yun,
Xiaodong Gu,
Jianghong Huang, and
Beijun Shen
(Shanghai Jiao Tong University, China)
The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages and how such relationships can be utilized to optimize both training and inference. In this work, we investigate two fundamental questions: (1) What are the deep linguistic relationships among programming languages? and (2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of programming languages. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definition, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples across multiple languages. By embedding these semantically parallel code snippets from 19 languages, we construct a similarity matrix and perform hierarchical clustering to uncover inherent language relationships. Our analysis reveals clear hierarchical structures among programming languages. Closely related languages form well-defined clusters (e.g., C, C++, Java, and Swift group together), while Go emerges as a “lingua franca” with the highest cross-language similarity. Building on the uncovered language families, we propose three strategies to enhance multilingual LLM training: transfer learning across linguistically related languages, linguistic proximity-guided curriculum learning, and centroid-based intermediary code translation. Experiments on four code intelligence tasks demonstrate that our methods significantly improve multilingual LLM performance. This work offers a universal perspective on programming languages and advances more effective strategies for multilingual code LLM training.
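The family-discovery step, embedding similarity followed by clustering, could be sketched as below. This simplification uses toy two-dimensional vectors and flat threshold-based grouping via union-find, whereas the paper uses learned embeddings and full hierarchical clustering.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def language_families(embeddings, threshold=0.8):
    """Group languages whose embedding similarity exceeds a threshold.

    embeddings: {language: feature vector} derived from semantically
    parallel code samples. Returns a sorted list of family member lists.
    """
    langs = list(embeddings)
    parent = {lang: lang for lang in langs}

    def find(x):  # union-find root lookup
        while parent[x] != x:
            x = parent[x]
        return x

    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)  # merge the two families
    families = {}
    for lang in langs:
        families.setdefault(find(lang), []).append(lang)
    return sorted(families.values())
```

With vectors placing C and C++ near each other and Python apart, the two C-family languages merge into one cluster while Python stays alone.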
Article: fse26maina-p343-p
ProofFusion: Improving Neural Theorem Proving via Adaptive Retrieval-Augmented Reasoning
Manqing Zhang,
Yunwei Dong,
Lingru Zhou,
Bingxu Xiao, and
Yepang Liu
(Northwestern Polytechnical University, China; Southern University of Science and Technology, China)
Interactive theorem proving (ITP) is a powerful approach to ensuring the correctness of complex software systems. However, it often requires substantial manual effort, which makes it costly to use in practice. Recently, neural-network-based approaches have shown promise in automatically generating proof tactics. Nevertheless, existing methods suffer from a long-tailed distribution in tactic usage within the training data. A few frequent tactics dominate the probability distribution, while many rare yet crucial ones are consistently suppressed in the model’s candidate ranking. This distributional bias can cause potentially provable goals to be prematurely abandoned during proof search. In addition, the decision-making process of neural networks when generating tactics lacks explicit reasoning traces, making it difficult for humans to explain or verify the underlying logic. To address these limitations, we propose ProofFusion, an adaptive retrieval-augmented reasoning framework that improves the proving capability of neural theorem provers without requiring retraining. Our key insight is inspired by the way human provers tackle a new theorem by consulting similar previously proven theorems to guide their own reasoning. Specifically, we develop a proof semantic-aware retriever that searches a knowledge base for semantically similar historical proof goals together with their tactics, producing a traceable set of reference decisions. We then employ a dual-track reranking fusion mechanism to integrate both the original predictions of the neural model and the retrieved reference tactics. Furthermore, to mitigate potential noise introduced by retrieval, we design a capability-adaptive retrieval mechanism that dynamically determines when retrieval should be applied. We conduct a systematic evaluation on 10,782 theorems from 26 Coq projects in a real ITP environment.
Experimental results show that ProofFusion increases the number of theorems proved by four state-of-the-art neural theorem provers by an average of 6.89%, and additionally proves 17.50% of previously unprovable theorems. In addition, it substantially improves the explainability of proof steps, achieving an average explainable proof goal proportion of 82.1% across the four provers. Together, these results demonstrate that ProofFusion is a practical and effective complement to existing neural theorem proving systems, enhancing both performance and explainability.
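The dual-track reranking fusion could be sketched as follows. The linear interpolation and rank-decayed retrieval bonus here are illustrative assumptions, not the paper's actual fusion rule; the point is that retrieved tactics can lift rare but crucial candidates above the model's long-tail suppression.

```python
def fuse_rankings(model_scores, retrieved_tactics, alpha=0.7):
    """Fuse neural-model scores with retrieved reference tactics.

    model_scores:      {tactic: probability} from the neural prover.
    retrieved_tactics: tactics from similar historical proofs, best-first;
                       each earns a retrieval bonus decaying with rank.
    alpha balances the two tracks. Returns tactics best-first.
    """
    fused = {t: alpha * p for t, p in model_scores.items()}
    for rank, tactic in enumerate(retrieved_tactics):
        bonus = (1 - alpha) / (rank + 1)  # decaying retrieval bonus
        fused[tactic] = fused.get(tactic, 0.0) + bonus
    return sorted(fused, key=fused.get, reverse=True)
```

In the example below, the rare tactic `induction` (model probability 0.1) is promoted past `lia` because a similar historical proof used it.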
Article: fse26maina-p354-p
SWE Data Construction, Automatically!
Lianghong Guo,
Yanlin Wang,
Caihua Li,
Wei Tao,
Pengyu Yang,
Jiachi Chen,
Haoyu Song,
Duyu Tang, and
Zibin Zheng
(Sun Yat-sen University, China; Independent Researcher, China; Huawei Technologies, China)
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the existing GitHub issue resolution data construction pipeline is challenging and labor-intensive. We identify three key limitations in existing pipelines: (1) test patches collected often omit binary file changes; (2) the manual construction of evaluation environments is labor-intensive; and (3) the fail2pass validation phase requires manual inspection of test logs and writing custom parsing code to extract test status from logs. In this paper, we propose SWE-Factory, a fully automated issue resolution data construction pipeline, to resolve these limitations. First, our pipeline automatically recovers missing binary test files and ensures the correctness of test patches. Second, we introduce SWE-Builder, an LLM-based agentic system that automates evaluation environment construction. Third, we introduce a standardized, exit-code-based log parsing method to automatically extract test status,
enabling a fully automated fail2pass validation. Experiments on 671 real-world GitHub issues across four programming languages show that our method can effectively construct valid evaluation environments for GitHub issues at a reasonable cost. For example, with GPT-4.1 mini, our SWE-Builder constructs 337 valid task instances out of 671 issues, at $0.047 per instance. Our ablation study further shows the effectiveness of different components of SWE-Builder. We also demonstrate through manual inspection that our exit-code-based fail2pass validation method is highly accurate, achieving an F1 score of 0.99. Additionally, we conduct an exploratory experiment to investigate whether we can use SWE-Factory to enhance models’ software engineering ability. After training five models on 2,809 Python task instances collected by our method, all models show improved software engineering ability. For example, the resolve rate of a trained Qwen2.5-Coder-14B-Instruct on SWE-bench Verified increases from 5.8% to 21.0%. We hope our method can accelerate the construction of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation.
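The exit-code-based fail2pass check could be sketched as below. A minimal illustration with hypothetical names: `run_suite` stands in for executing the instance's test suite inside its constructed environment, with and without the gold patch.

```python
def validate_instances(instances, run_suite):
    """Exit-code-based fail2pass validation.

    run_suite(instance, with_patch) returns the test suite's exit code.
    An instance is valid when the suite fails (non-zero exit) without
    the gold patch and passes (zero exit) with it, so no per-framework
    log parsing is needed.
    """
    valid = []
    for inst in instances:
        before = run_suite(inst, with_patch=False)
        after = run_suite(inst, with_patch=True)
        if before != 0 and after == 0:
            valid.append(inst)
    return valid
```

Instances that pass even without the patch (the test does not exercise the bug) or still fail with it (a flaky or broken environment) are both filtered out.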
Article: fse26maina-p451-p
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Haocheng Huang,
Yuchen Chen,
Weisong Sun,
Peizhuo Lv,
Yuan Xiao,
Chunrong Fang,
Yang Liu, and
Xiaofang Zhang
(Soochow University, China; Nanjing University, China; Nanyang Technological University, Singapore)
Article: fse26maina-p509-p
SmartIFSyn: Automated Information Flow Security Policy Synthesis for Smart Contracts
Yinghao Wu,
Miaomiao Zhang,
Fu Song, and
John Baugh
(Tongji University, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; North Carolina State University, USA)
Smart contracts have achieved significant success; however, their security remains a long-standing challenge. The immutability and transparency of smart contracts require establishing a strong mechanism to prevent privacy leakage and tampering with trusted data. Apart from traditional logic and code-level vulnerabilities arising from insufficient control over contract variables and function parameters, smart contracts may store private-dependent information in blockchain records, which is a critical type of vulnerability, but one often overlooked in existing security analysis. In this paper, we present an automated approach for synthesizing security policies, named SmartIFSyn, to eliminate information flow vulnerabilities in smart contracts. We formalize the semantics of Solidity, the most widely used smart contract language, and analyze the information flow security of Solidity smart contracts from two perspectives: local-variable security and global-interaction security. We present a type system to guide the elimination of local-variable vulnerabilities by inferring a policy, and resort to constraint solving to synthesize a desired policy in case the type system fails. The policy ensures both local-variable and global-interaction security while remaining maximally aligned with user preference. Furthermore, the policy can subsequently be converted into enforceable specifications. We implement our approach in a tool and evaluate it on 17,160 real-world Ethereum smart contracts. The experimental results demonstrate the efficacy of our approach, e.g., it detected 243 vulnerabilities in 223 real-world Ethereum smart contracts.
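The core information-flow discipline such a policy enforces could be sketched with a toy two-point security lattice. The labels, variable names, and flow representation are illustrative assumptions; SmartIFSyn infers richer policies with a type system and constraint solving.

```python
LATTICE = {"public": 0, "private": 1}  # public ⊑ private

def flow_allowed(src_label, dst_label):
    """Information may only flow upward in the lattice: writing
    private-dependent data into a public sink is rejected."""
    return LATTICE[src_label] <= LATTICE[dst_label]

def check_flows(flows, policy):
    """Check each (source_var, sink_var) assignment against a policy
    mapping variables to labels; returns the violating flows."""
    return [(s, d) for s, d in flows if not flow_allowed(policy[s], policy[d])]
```

Here a private `balance` may flow to the private `owner`, but copying it into a publicly readable `logEntry` (i.e., into on-chain records) is flagged.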
Article: fse26maina-p514-p
Protocol Reverse Engineering via Deep Transfer Learning
Yanyang Zhao,
Zhengxiong Luo,
Wenlong Zhang,
Feifan Wu,
Yuanliang Chen,
Fuchen Ma,
Qi Xu,
Heyuan Shi, and
Yu Jiang
(Tsinghua University, China; National University of Singapore, Singapore; Central South University, China)
Protocol reverse engineering infers the specification of proprietary or poorly documented protocols and serves as the foundation for security analysis such as fuzz testing. While many existing techniques achieve this by mining statistical features from network traces, they face increasing challenges due to incomplete field pattern information available in the traces. Although protocol development has accumulated rich prior knowledge about protocol design, this knowledge remains largely untapped in protocol reverse engineering.
This paper introduces TransRE, a protocol reverse engineering tool that leverages prior syntax knowledge from standardized protocols through deep transfer learning to better understand proprietary protocols. TransRE first selects optimal source domains by analyzing inter-domain differences between the existing knowledge base and the target protocol. It then employs a neural network to extract representation features and applies domain adaptation techniques to optimize the syntax transfer model, enabling accurate inference of protocol formats. Our evaluation on 12 widely used protocols shows that TransRE identifies fields with a perfection score of 0.43, which is 1.48×-3.07× the performance achieved by five state-of-the-art methods. Furthermore, to demonstrate practical applicability, we enhanced an existing protocol fuzzer with TransRE for testing proprietary protocols in real-world network cameras and discovered four bugs.
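TransRE's first step, selecting source domains by inter-domain difference, could be sketched with a crude distributional distance. The L1 distance over byte histograms below is a hypothetical stand-in for the paper's actual difference measure; protocol names and messages are illustrative.

```python
def byte_histogram(messages):
    """Normalized byte-value distribution over a set of protocol messages."""
    counts = [0] * 256
    total = 0
    for msg in messages:
        for b in msg:
            counts[b] += 1
            total += 1
    return [c / total for c in counts]

def select_source_domain(target_msgs, candidates):
    """Pick the standardized protocol whose traffic is distributionally
    closest to the target protocol's traces (L1 histogram distance)."""
    target_hist = byte_histogram(target_msgs)

    def distance(name):
        hist = byte_histogram(candidates[name])
        return sum(abs(x - y) for x, y in zip(target_hist, hist))

    return min(candidates, key=distance)
```

A text-like proprietary protocol would thus borrow syntax knowledge from HTTP rather than from a binary protocol such as Modbus.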
Article: fse26maina-p520-p
Flash: Query-Efficient Black-Box Static Malware Evasion through Transferable GAN-Guided Modification Sequences
Anyuan Sang,
Li Yang,
Lu Zhou,
Junbo Jia, and
Huipeng Yang
(Xidian University, China)
Machine learning (ML)–based static malware detectors are widely deployed for Portable Executable (PE) files due to their scalability and efficiency, yet they remain vulnerable to carefully crafted adversarial perturbations. Existing black-box evasion methods either rely on transfer attacks, which break down when surrogate and target decision boundaries diverge, or on query-driven searches, which require impractically many queries. We present Flash, a two-phase adversarial framework tailored for static PE malware detection that integrates the strengths of both approaches. In the first phase, a generative adversarial network is trained against heterogeneous surrogate detectors to generate function-preserving PE modifications with inherent evasiveness. In the second phase, an evolutionary optimizer refines these sequences directly against the target model with a dual-objective fitness that balances evasion success and minimal perturbation cost. Experiments on 12,039 VirusShare PE files and six state-of-the-art static detectors demonstrate that Flash reduces query counts by 86% while maintaining bypass rates above 95.8%. Furthermore, adversarial training with Flash-generated samples reduces attack success rates by 82.4%, highlighting Flash’s utility for both exposing vulnerabilities and strengthening the robustness of static PE malware detectors.
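The dual-objective fitness driving Flash's evolutionary phase could be sketched as follows. The weighting and normalization here are illustrative assumptions, not the paper's exact formula.

```python
def fitness(detect_prob, n_mods, max_mods, w=0.8):
    """Dual-objective fitness for a candidate modification sequence.

    Rewards pushing the target detector's maliciousness probability
    down (evasion success) while penalizing perturbation cost, i.e.
    the number of function-preserving PE modifications applied.
    """
    evasion = 1.0 - detect_prob          # lower detection score is better
    cost = 1.0 - n_mods / max_mods       # fewer modifications is better
    return w * evasion + (1 - w) * cost
```

Under this fitness, two sequences with equal evasiveness are separated by cost, steering the optimizer toward minimal perturbations and thus fewer target-model queries.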
Article: fse26maina-p857-p
TransLibEval: Demystify Large Language Models’ Capability in Third-Party Library-Targeted Code Translation
Pengyu Xue,
Kunwu Zheng,
Zhen Yang,
Yifei Pei,
Linhao Wu,
Jiahui Dong,
Xiapu Luo,
Yan Xiao,
Fei Liu,
Yuxuan Zhang,
Xiran Lyu,
Xianhang Li,
Xuanyu Zhu, and
Chengyi Wang
(Shandong University, China; Hong Kong Polytechnic University, Hong Kong; Sun Yat-sen University, China)
In recent years, Large Language Models (LLMs) have been widely studied in the code translation field on the method, class, and even repository levels. However, most of these benchmarks are limited in terms of Third-Party Library (TPL) categories and scales, making TPL-related errors hard to expose and hindering the development of targeted solutions. Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative.
To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline over 60%), while diverse strategies demonstrate heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were obscured previously. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
Article: fse26maina-p1029-p
VerilogASTBench: Benchmark Construction of Verilog AST Dataset with Dual-Stage AST Semantic Enhancement Framework
Luping Zhang,
Chao Chen,
Dapeng Yan,
Hui Xu,
Mingsheng Cao,
Jingkuan Song,
Zhikuang Cai, and
Yufeng Guo
(Nanjing University of Posts and Telecommunications, China; University of Electronic Science and Technology of China, China)
With the increasing complexity and scale of integrated circuit (IC) designs, automated circuit design methods are essential for Verilog implementation. Although Large Language Models (LLMs) perform well in general-purpose coding such as C++ and Python, their performance in Verilog is constrained by domain-specific semantic and structural rules as well as the scarcity of high-quality training datasets. To address these issues, this work proposes a semantically enhanced Verilog code parsing and automatic repair framework based on the Abstract Syntax Tree (AST). First, an advanced register-transfer level (RTL) analysis framework is developed to achieve semantic enhancement through static analysis–driven functional role inference and attribute annotation. Second, the enhanced AST information is used to construct structured prompts and generate semantically rich module descriptions via an Application Programming Interface (API) for dataset construction. Finally, an AST-guided automatic Verilog repair framework is designed, which leverages enhanced AST analysis for precise defect localization and intelligent repair through compiler feedback loops. Experimental results indicate that the proposed method successfully repaired 15.24% of defective Verilog code, resulting in a high-quality RTL benchmark dataset containing 318,021 samples. Models fine-tuned on this dataset demonstrate significant performance improvements across three benchmarks, achieving average improvements of 9.23% and 15.43% on Eval-Machine pass@1 and pass@5, 4.23% and 10.08% on Eval-Human pass@1 and pass@5, and 12.64% and 7.45% on RTLLM V1.1 Syntax-VCS and Func.
Article: fse26maina-p1087-p
Feature Slice Matching for Precise Bug Detection
Ke Ma,
Jianjun Huang,
Wei You,
Bin Liang,
Jingzheng Wu,
Yanjun Wu, and
Yuanjun Gong
(Renmin University of China, China; Institute of Software at Chinese Academy of Sciences, China; University of Trento, Italy)
Measuring function similarity to detect bugs is effective, but statements unrelated to the bugs can impede performance through noise interference. Existing works that suppress noise interference do not tackle the tougher job of eliminating the noise in the targets. In this paper, we propose MATUS to mitigate the target noise for precise bug detection based on similarity measurement. Feature slices are extracted from both the buggy query and the targets to represent the semantic features of (potential) bug logic. In particular, MATUS guides the target slicing with prior knowledge from the buggy code, pinpointing the slicing criterion in the targets in an end-to-end way. All feature slices are embedded and compared based on vector similarity. Buggy candidates are then audited to confirm unknown bugs in the targets. Experiments show that MATUS holds advantages in bug detection for real-world projects with acceptable efficiency. In total, MATUS has spotted 31 unknown bugs in the Linux kernel. All of them have been confirmed by the kernel developers, and 11 have been assigned CVEs.
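The embedding-and-compare step over feature slices could be sketched as below. The vectors and function names are toy stand-ins for MATUS's learned slice embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_targets(query_vec, target_slices, top_k=2):
    """Rank target feature slices by embedding similarity to the buggy
    query slice; the top candidates go on to manual audit."""
    ranked = sorted(target_slices,
                    key=lambda name: cosine(query_vec, target_slices[name]),
                    reverse=True)
    return ranked[:top_k]
```

Because both the query and the targets are reduced to bug-logic slices before embedding, unrelated statements in the target functions no longer dilute the similarity signal.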
Article: fse26maina-p1298-p
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin,
Xiaodong Gu, and
Beijun Shen
(Shanghai Jiao Tong University, China)
Code language models have demonstrated strong capabilities across a wide range of code intelligence tasks. While the majority of existing research prioritizes performance improvements on benchmark datasets, few of them have focused on the internal interpretability of models—how specific neurons affect linguistic features such as syntax and semantics, which is critical for model transparency, controllability, and reliability. Although various neuron interpretability techniques have been developed in NLP, directly applying them to source code yields suboptimal results due to the unique characteristics of programming languages, such as their formal structure, hierarchical organization, and executability.
In this work, we empirically investigate the intrinsic mechanisms of code LLMs at the neuron level, aiming to localize both language-specific neurons (i.e., neurons that are selectively responsive to individual programming languages) and concept layers (i.e., feed-forward layers that encode language-agnostic representations of code).
Our study employs two state-of-the-art models, Llama-3.1-8B and Qwen2.5-Coder-32B, across five programming languages: C++, Java, Python, Go, and JavaScript. By analyzing neuron activation patterns in response to multilingual code inputs, we investigate the role of individual neurons and the contribution of different layers during output generation.
Our empirical findings reveal that: (1) code LLMs contain neurons specialized for individual programming languages, alongside a universal subset that supports general-purpose code generation; and (2) lower layers primarily encode language-specific syntactic structures, while middle layers capture semantic abstractions that generalize across languages, manifesting as concept layers.
To demonstrate the practical usability of these findings, we apply them to three downstream tasks: neuron-guided fine-tuning for code generation, clone detection using concept-layer embeddings, and transfer learning guided by concept-layer representations for code summarization. Experimental evaluations show that each strategy consistently improves the performance of multilingual code LLMs.
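Localizing language-specific neurons from activation patterns could be sketched as follows. The margin rule and mean-activation inputs are illustrative assumptions; the study works on real model activations across five languages.

```python
def language_specific_neurons(activations, margin=0.5):
    """Locate neurons selectively responsive to a single language.

    activations: {language: [mean activation per neuron]} collected from
    semantically parallel code inputs. A neuron is tagged as specific to
    the language whose activation exceeds every other language's by at
    least `margin`; neurons without a clear winner are left universal.
    """
    langs = list(activations)
    n_neurons = len(next(iter(activations.values())))
    specific = {}
    for i in range(n_neurons):
        vals = {lang: activations[lang][i] for lang in langs}
        top = max(vals, key=vals.get)
        runner_up = max(v for lang, v in vals.items() if lang != top)
        if vals[top] - runner_up >= margin:
            specific[i] = top
    return specific
```

Neurons that fire comparably for every language (like neuron 2 below) remain untagged, matching the universal subset the findings describe.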
Article: fse26maina-p1339-p
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
Jingyu Zhang,
Fan Wang,
Jacky Keung,
Yihan Liao,
Yan Xiao, and
Lei Ma
(Hong Kong Metropolitan University, Hong Kong; City University of Hong Kong, China; Sun Yat-sen University, Shenzhen, China; University of Tokyo, Japan; University of Alberta, Canada)
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
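A representative uncertainty-based test selection metric of the kind such studies evaluate is DeepGini, which ranks inputs by the Gini impurity of the model's softmax output. A minimal sketch follows; the example probability vectors are fabricated, and this is only one of many metric families (surprise adequacy, coverage-based, and others are not shown).

```python
# Sketch of an uncertainty-based test selection metric (DeepGini-style).
# Inputs whose softmax output is close to uniform are ranked first, on the
# intuition that the model is most likely to be wrong on them.

def deepgini(probs):
    """Gini impurity 1 - sum(p^2): higher means more uncertain."""
    return 1.0 - sum(p * p for p in probs)

def select_tests(softmax_outputs, budget):
    """Return indices of the `budget` most uncertain inputs."""
    ranked = sorted(range(len(softmax_outputs)),
                    key=lambda i: deepgini(softmax_outputs[i]),
                    reverse=True)
    return ranked[:budget]

outputs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform: likely fault-revealing
    [0.70, 0.20, 0.10],
]
print(select_tests(outputs, 2))  # [1, 2]
```

Whether such a ranking also serves performance estimation or retraining guidance, and how it behaves under natural and label shifts, is exactly the kind of question the benchmark above examines.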
Article Search
Article: fse26maina-p1421-p
AccessRefinery: Fast Mining Concise Access Control Intents on Public Cloud
Ning Kang,
Peng Zhang,
Jianyuan Zhang,
Hao Li,
Dan Wang,
Zhenrong Gu,
Weibo Lin,
Shibiao Jiang,
Zhu He,
Xu Du,
Longfei Chen,
Jun Li, and
Xiaohong Guan
(Xi'an Jiaotong University, China; Huawei Cloud, China)
Article Search
Article: fse26maina-p1435-p
Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading
Hanzhen Lu,
Lishui Fan,
Jiachi Chen,
Qiuyuan Chen,
Zhao Wei, and
Zhongxin Liu
(Zhejiang University, China; Sun Yat-sen University, China; Tencent Technology, China)
Line-level code completion aims to complete the current line in real-time as developers type. Low latency is crucial to maintaining a seamless and uninterrupted coding experience, enabling developers to remain in a productive flow. However, existing approaches face a fundamental trade-off: large language models (LLMs) provide high-quality suggestions but require expensive computational resources to ensure acceptable inference latency. In contrast, static-analysis-based methods and small language models respond quickly but often generate suboptimal completions. To fill this gap, our idea is to rely on the small model by default and escalate to the large model only when necessary to achieve latency-accuracy trade-offs. Based on this idea, we propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local small model with a high-performance cloud large model for code completion. Realizing effective model cascading requires answering two non-trivial questions, i.e., when to invoke the large model and how to enable effective collaboration between small and large models. For the first question, we leverage a valuable but easily overlooked signal, i.e., user actions, during code completion to accurately identify failed completions. This deferral decision allows us to invoke the large model only when necessary, reducing both latency and cloud-side computation costs. To enable effective collaboration, MCCom employs a two-stage speculative decoding strategy and an iterative retrieval mechanism that collectively accelerate and improve the quality of completions. Due to the lack of high-quality small models for code completion, we also train a lightweight model with only 121M parameters to implement MCCom. The small model achieves an average of 73.8% of the performance of the state-of-the-art 7B model. We evaluate MCCom on the RepoEval benchmark and a new benchmark, StmtEval, collected from real-world projects.
Experimental results show that our approach not only reduces inference latency by up to 47.9% and cuts down LLM usage by an average of 46.3%, but also improves the exact match rate of the large model by an average of 8.9%.
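The cascade's deferral decision can be sketched with stub models. The two stub models and the user-rejection callback below are illustrative stand-ins; MCCom's actual components (the 121M model, speculative decoding, iterative retrieval) are not reproduced here.

```python
# Sketch of a local-cloud cascade: serve the fast local model by default and
# escalate to the cloud model only when the deferral signal (here, the user
# rejecting the suggestion) indicates the local completion failed.

def small_model(prefix):
    return prefix + " world"      # fast local model, sometimes wrong

def large_model(prefix):
    return prefix + " world!"     # slow cloud model, higher quality

def complete(prefix, user_accepts):
    cloud_calls = 0
    suggestion = small_model(prefix)       # default: local small model
    if not user_accepts(suggestion):       # deferral signal: rejection
        suggestion = large_model(prefix)   # escalate only when necessary
        cloud_calls += 1
    return suggestion, cloud_calls

# A user who rejects any suggestion missing the trailing '!':
out, calls = complete("hello", lambda s: s.endswith("!"))
print(out, calls)  # hello world! 1
```

When the local suggestion is accepted, the cloud model is never invoked, which is the source of the latency and cost savings the abstract reports.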
Preprint
Article: fse26maina-p1516-p
Unfulfilled Promises: LLM-Based Detection of OS Compatibility Issues in Infrastructure as Code
Georgios-Petros Drosos,
Georgios Alexopoulos,
Thodoris Sotiropoulos,
Dimitris Mitropoulos, and
Zhendong Su
(ETH Zurich, Switzerland; University of Athens, Greece)
Modern infrastructures rely on Infrastructure as Code (IaC) systems to keep complex deployments consistent, reproducible, and scalable at production scale. The reliability of these infrastructures, however, depends on the correctness of their building blocks: reusable components (modules), each of which performs a dedicated task, such as installing a package, managing an OS user, or configuring a service, and reconciles its state with the desired specification. A central promise of these components is portability: a specification written once should correctly manage the targeted resource on every OS the IaC component supports. When this property is violated, defects can propagate across entire infrastructures, causing outages, security vulnerabilities, and costly misconfigurations.
In this work, we introduce crOSsible, the first automated framework for cross-OS testing of IaC modules. crOSsible leverages large language models (LLMs) to synthesize and repair integration tests from structured module documentation, and executes them across 13 versions of 8 major Linux distributions. While our techniques are generally applicable to different IaC systems, we instantiate and evaluate them on Ansible, the most widely used IaC framework for managing individual servers. Evaluation across 259 popular Ansible modules demonstrates both effectiveness and real-world impact. In just 12 hours of testing, crOSsible uncovered 36 previously unknown bugs, including 22 portability violations. In total, 27 issues have been confirmed by maintainers, with 17 already fixed. The discovered issues range from crashes to dangerous soundness defects where modules reported success despite leaving systems misconfigured. Beyond bug discovery, crOSsible improved the code coverage of Ansible modules by 12.3% on average, systematically exercising OS-specific code paths that existing tests missed.
Article Search
Article: fse26maina-p2091-p
Automating Dockerfile Refactoring to Multi-stage Builds
Dongjin Chen,
Wenhua Yang,
Minxue Pan, and
Yu Zhou
(Nanjing University of Aeronautics and Astronautics, China; Nanjing University, China)
Containerization has become a cornerstone of modern software deployment, yet many projects still ship single-stage Dockerfiles that bundle compilers, build tools, and temporary files into production images, thereby hurting performance and security. Multi-stage builds are recommended, yet uptake appears uneven, plausibly because refactoring legacy Dockerfiles demands nontrivial reasoning about build lifecycles and dependency separation. This paper presents StageCraft, an automated refactoring approach that converts single-stage Dockerfiles into optimized multi-stage builds. StageCraft first performs static analysis to identify the technology stack and to infer build-time and runtime dependencies. It then applies a lightweight gate that estimates the refactoring benefit from a composite of image bloat, structural inefficiency, and security risk, proceeding only when refactoring is warranted. Finally, it synthesizes a multi-stage Dockerfile that isolates build tooling, copies only the artifacts needed at runtime, and applies production hardening. Evaluated on 521 real-world single-stage projects, StageCraft successfully produced working multi-stage Dockerfiles for 60.3% of targets. The resulting images were, on average, 52.2% smaller and contained 50.0% fewer high-risk vulnerabilities than the originals, outperforming baselines. StageCraft lowers the barrier to multi-stage adoption at scale, yielding leaner images with a reduced attack surface and improved maintainability. We release the tool, its knowledge assets, and the evaluation dataset to support reproducibility and future research.
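StageCraft's gate idea, estimating refactoring benefit before rewriting, can be approximated with a toy heuristic. The keyword list, the scoring, and the threshold below are invented for illustration; the real gate combines image bloat, structural inefficiency, and security risk signals not modeled here.

```python
# Toy refactoring-benefit gate for single-stage Dockerfiles: count signs of
# build tooling baked into the production image. All heuristics here are
# fabricated simplifications of StageCraft's composite gate.

BUILD_TOOLS = {"gcc", "g++", "make", "maven", "cargo", "npm"}

def refactoring_warranted(dockerfile_text, threshold=2):
    score = 0
    for line in dockerfile_text.splitlines():
        tokens = line.replace("&&", " ").split()
        if tokens[:1] == ["RUN"] and BUILD_TOOLS & set(tokens):
            score += 1   # compilers/build tools installed or invoked
        if tokens[:1] == ["COPY"] and "." in tokens:
            score += 1   # whole build context copied into the image
    return score >= threshold

df = """FROM ubuntu:22.04
RUN apt-get install -y gcc make
COPY . /app
RUN make -C /app"""
print(refactoring_warranted(df))  # True
```

A runtime-only Dockerfile (no build tooling, artifact-only COPY) falls below the threshold, mirroring the abstract's point that refactoring should proceed only when warranted.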
Article Search
Article: fse26maina-p2093-p
Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark
Aoyang Fang,
Songhan Zhang,
Yifan Yang,
Haotong Wu,
Junjielong Xu,
Xuyang Wang,
Rui Wang,
Manyi Wang,
Qisheng Lu, and
Pinjia He
(Chinese University of Hong Kong, Shenzhen, China)
Article Search
Article: fse26maina-p2270-p
Comment Traps: How Defective Commented-Out Code Augment Defects in AI-Assisted Code Generation
Yuan Huang,
Yukang Zhou,
Xiangping Chen, and
Zibin Zheng
(Sun Yat-sen University, China)
With the rapid development of large language models in code generation, AI-powered editors such as GitHub Copilot and Cursor are revolutionizing software development practices. At the same time, studies have identified potential defects in the generated code. Previous research has predominantly examined how code context influences the generation of defective code, often overlooking the impact of defects within commented-out code (CO code). AI coding assistants' interpretation of CO code in prompts affects the code they generate.
This study evaluates how AI coding assistants, GitHub Copilot and Cursor, are influenced by defective CO code. The experimental results show that defective CO code in the context causes AI coding assistants to generate more defective code, reaching up to 58.17%. Our findings further demonstrate that the tools do not simply copy the defective code from the context. Instead, they actively reason to complete incomplete defect patterns and continue to produce defective code despite distractions such as incorrect indentation or tags. Even with explicit instructions to ignore the defective CO code, the reduction in defects does not exceed 21.84%. These findings underscore the need for improved robustness and security measures in AI coding assistants.
Preprint
Article: fse26maina-p2429-p
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
Jia Li,
Hongyi Deng,
Yiran Zhang,
Kechi Zhang,
Tianqi Shao,
Tiankuo Zhao,
Weinan Wang,
Jin Zhi,
Ge Li,
Yang Liu,
Yingtao Fang, and
Yihong Dong
(Wuhan University, China; Peking University, China; Nanyang Technological University, Singapore)
Article Search
Article: fse26maina-p2570-p
AccessDroid: Detecting Screen Reader Accessibility Issues in Android Applications via Semantics Trees
Hang Zhou and
Wei Song
(Nanjing University of Science and Technology, China)
Screen readers are essential for visually impaired users to access Android apps, but inadequate developer support often leads to semantic ambiguity or label missing. While prior work has focused primarily on label missing issues, semantic ambiguity remains underexplored. In this paper, we categorize screen reader accessibility issues into three types: semantics separation, semantics downshift, and semantics omission. Bootstrapped by semantics trees, we propose AccessDroid, a lightweight dynamic analysis approach that automatically detects screen reader accessibility issues in Android app pages and generates diagnostic reports. Applied to 361 runtime pages from 50 real-world Android apps, AccessDroid successfully detects 249 true semantics separation issues, 279 true semantics downshift issues, and 170 true semantics omission issues. On average, AccessDroid only spends 15 milliseconds per app page, and achieves a precision of 98.8% and a recall of 98.3%, significantly outperforming two baseline approaches.
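The tree-walking style of detection described above can be sketched minimally. The node structure and the single rule shown (an interactive widget exposing no label to the screen reader) are simplifications; AccessDroid's semantics trees and its three issue categories are not reproduced here.

```python
# Minimal sketch: walk a widget tree and flag clickable nodes that expose
# neither a content description nor visible text to a screen reader. The
# dict-based node format is an invented stand-in for a real view hierarchy.

def find_unlabeled(node, path="root"):
    issues = []
    if node.get("clickable") and not node.get("content_desc") and not node.get("text"):
        issues.append(path)                      # unlabeled interactive widget
    for i, child in enumerate(node.get("children", [])):
        issues += find_unlabeled(child, f"{path}/{i}")
    return issues

page = {"children": [
    {"clickable": True, "text": "Submit"},                   # labeled, fine
    {"clickable": True},                                     # unlabeled button
    {"clickable": False, "children": [{"clickable": True}]}  # nested issue
]}
print(find_unlabeled(page))  # ['root/1', 'root/2/0']
```

Detecting the subtler semantics separation and semantics downshift issues requires reasoning about how adjacent nodes are grouped and announced, which is where the semantics-tree abstraction earns its keep.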
Article Search
Artifacts Available
Article: fse26maina-p2719-p
Debugging Engine Enhanced by Prior Knowledge: Can We Teach LLM How to Debug?
Kunyi Li,
Sai Wu,
Xiu Tang,
Chang Yao,
Songhao Bu,
Quanqing Xu, and
Gang Chen
(Zhejiang University, China; Ant Group, China)
Automated Program Repair (APR) powered by Large Language Models (LLMs) has shown strong potential for improving software reliability. However, existing LLM-based APR approaches underutilize the rich debugging knowledge latent in large-scale bug-fix corpora. Prior work has primarily advanced APR through multi-agent coordination, prompt engineering, task decomposition, or model training. While effective to some extent, these methods rely heavily on the implicit reasoning capacity of LLMs, without explicitly modeling the debugging knowledge that underlies successful program repair. We present DeepK, a novel framework that systematically extracts, validates, and reuses debugging knowledge to guide LLMs in APR. Rather than treating historical bug-fix pairs solely as contextual exemplars, DeepK distills them into a structured knowledge base of verified debugging knowledge. This knowledge base can be seamlessly integrated into diverse APR pipelines, providing interpretable reasoning traces and step-wise repair strategies that enhance the repair performance.
We evaluate DeepK on multiple benchmarks (ACPR, Atcoder) using both GPT-4o and DeepSeek-v3. Results show that DeepK consistently surpasses state-of-the-art APR systems in repair accuracy. Ablation studies confirm the importance of edit description generation, multi-perspective retrieval, and two-fold debugging knowledge. Varying the number of retrieved entries further highlights the trade-off between informativeness and noise. These findings emphasize the essential contribution of explicit debugging knowledge in advancing LLM-based APR and establish DeepK as an extensible framework for building more reliable repair systems.
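The retrieval step over such a knowledge base can be sketched with a simple token-overlap retriever. The knowledge entries and the Jaccard similarity below are illustrative stand-ins; DeepK's actual multi-perspective retrieval and edit description generation are not reproduced here.

```python
# Sketch of retrieving debugging knowledge for a new bug by lexical overlap.
# Entries and the similarity measure are fabricated for illustration.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(bug_description, knowledge_base, k=1):
    """Return the k entries whose recorded symptom best matches the bug."""
    ranked = sorted(knowledge_base,
                    key=lambda e: jaccard(bug_description, e["symptom"]),
                    reverse=True)
    return ranked[:k]

kb = [
    {"symptom": "index out of range in loop", "fix": "use range(len(xs))"},
    {"symptom": "division by zero", "fix": "guard the denominator"},
]
hit = retrieve("loop raises index out of range", kb)[0]
print(hit["fix"])  # use range(len(xs))
```

The retrieved entry would then be handed to the LLM as a verified, interpretable repair strategy rather than as a raw contextual exemplar, which is the distinction the abstract draws.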
Article Search
Info
Article: fse26maina-p2854-p
In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome
Ardi Madadi,
Seongmin Lee,
Cornelius Aschermann, and
Marcel Böhme
(MPI-SP, Germany; University of California at Los Angeles, USA; Ruhr-University Bochum, Germany)
In Google’s FuzzBench platform, we find that the outcome of coverage-based evaluation more strongly agrees with the outcome of a bug-based evaluation than an independent bug-based evaluation itself. Recently, Böhme et al. found that despite a very strong correlation between coverage achieved and bugs found, there is no strong agreement between the outcome of a coverage- and a bug-based evaluation: The fuzzer best at achieving coverage may be the worst at finding bugs. However, in trying to explain this moderate agreement, we wondered whether the outcome of bug-based benchmarking itself is perhaps much more “noisy” and turned to applied statistics to develop the tools necessary to investigate our hypothesis.
In this paper, we call this degree of “noisiness” of a benchmarking outcome the concordance of the benchmarking procedure and quantify it using a measure of statistical reliability widely used in psychology, called mean split-half reliability, i.e., the expected agreement on the benchmark outcome between two random halves of the benchmarking suite. In our experiments with FuzzBench and Magma, we find that the concordance of coverage-based benchmarking is consistently strong while that of bug-based benchmarking is weak on FuzzBench and moderate on Magma. In contrast to FuzzBench, for the Magma benchmark suite (which was designed for bug-based evaluation) a coverage-based evaluation does not predict the outcome of a bug-based evaluation better than an independent bug-based evaluation.
Moreover, to demonstrate the utility of concordance also for developers of benchmarking suites, we investigate concordance as a measure of benchmarking efficiency, as in green fuzzer benchmarking. We empirically confirm that the resources of a procedure with higher concordance can be reduced more substantially (in terms of campaign length or benchmark sampling size) while maintaining a similar benchmark outcome as a procedure with lower concordance. We report the corresponding savings in terms of carbon emissions.
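The mean split-half reliability measure can be sketched directly: repeatedly split the benchmark programs into two random halves, rank the fuzzers within each half, and average the rank agreement across splits. The bug-count matrix and the tie handling below are invented simplifications of the paper's procedure.

```python
# Sketch of mean split-half reliability ("concordance") for fuzzer
# benchmarking. The score matrix is fabricated; ties count against
# agreement in this simplified Kendall-tau variant.

import random
from itertools import combinations

def kendall_tau(r1, r2):
    """Rank agreement in [-1, 1]; tied pairs count as disagreement here."""
    pairs = list(combinations(range(len(r1)), 2))
    concordant = sum(1 for i, j in pairs
                     if (r1[i] - r1[j]) * (r2[i] - r2[j]) > 0)
    return 2.0 * concordant / len(pairs) - 1.0

def split_half_concordance(scores, splits=200, seed=0):
    """scores[fuzzer] = per-benchmark bug counts (or coverage).
    Mean agreement between fuzzer totals on two random halves."""
    rng = random.Random(seed)
    fuzzers = list(scores)
    benchmarks = list(range(len(scores[fuzzers[0]])))
    taus = []
    for _ in range(splits):
        rng.shuffle(benchmarks)
        half, other = benchmarks[:3], benchmarks[3:]
        totals = lambda subset: [sum(scores[f][b] for b in subset)
                                 for f in fuzzers]
        taus.append(kendall_tau(totals(half), totals(other)))
    return sum(taus) / len(taus)

# Fuzzer quality is consistent across benchmarks -> concordance near 1.
scores = {"A": [9, 8, 9, 10, 8, 9],
          "B": [5, 4, 6, 5, 4, 5],
          "C": [1, 2, 1, 2, 1, 2]}
print(split_half_concordance(scores))  # 1.0
```

Replacing the consistent bug counts with noisy, benchmark-dependent ones drives the measure toward zero, which is the weak bug-based concordance the paper observes on FuzzBench.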
Article Search
Article: fse26maina-p3037-p
Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-Preserving Transformations
Jiyong Uhm,
Minseok Kim,
Michalis Polychronakis, and
Hyungjoon Koo
(Sungkyunkwan University, Republic of Korea; Stony Brook University, USA)
Binary code analysis plays an essential role in cybersecurity, facilitating reverse engineering to reveal the inner workings of programs in the absence of source code. Traditional approaches, such as static and dynamic analysis, extract valuable insights from stripped binaries, but often demand substantial expertise and manual effort. Recent advances in deep learning have opened promising opportunities to enhance binary analysis by capturing latent features and disclosing underlying code semantics. Despite the growing number of binary analysis models based on machine learning, their robustness to adversarial code transformations at the binary level remains underexplored to date. In this work, we evaluate the robustness of deep learning models for the task of binary code similarity detection (BCSD) under semantics-preserving transformations. The unique nature of machine instructions presents distinct challenges compared to the typical input perturbations found in other domains. To achieve our goal, we introduce asmFooler, a system that evaluates the resilience of BCSD models using a diverse set of adversarial code transformations that preserve functional semantics. We construct a dataset of 9,565 binary variants from 620 baseline samples by applying eight semantics-preserving transformations across six representative BCSD models. 
Our major findings highlight several key insights: i) model robustness highly relies on the design of the processing pipeline, including code pre-processing, model architecture, and internal feature selection, which collectively determine how code semantics are captured; ii) the effectiveness of adversarial transformations is bounded by a transformation budget, shaped by model-specific constraints such as input size limits and the expressive capacity of semantically equivalent instructions; iii) well-crafted adversarial transformations can be highly effective, even when introducing minimal perturbations; and iv) such transformations efficiently disrupt the model’s decision (e.g., misleading to false positives or false negatives) by focusing on semantically significant instructions.
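The flavor of semantics-preserving transformation discussed above can be illustrated with two classic instruction equivalences applied as text rewrites. The rewrite rules are well-known x86 identities, but the regex-based rewriter is a toy; asmFooler's transformation set and its model-aware selection of significant instructions are not reproduced here.

```python
# Toy semantics-preserving rewrites over x86-style assembly text. Each rule
# swaps an instruction for a functionally equivalent one, perturbing the
# surface form a BCSD model sees without changing program behavior.

import re

REWRITES = [
    # xor reg, reg  ==  mov reg, 0   (both zero the register)
    (re.compile(r"xor (\w+), \1\b"), r"mov \1, 0"),
    # add reg, 1    ==  inc reg
    (re.compile(r"add (\w+), 1\b"), r"inc \1"),
]

def transform(asm):
    lines = []
    for line in asm.splitlines():
        for pattern, repl in REWRITES:
            line = pattern.sub(repl, line)
        lines.append(line)
    return "\n".join(lines)

code = "xor eax, eax\nadd ecx, 1\nmov ebx, 5"
print(transform(code))  # mov eax, 0 / inc ecx / mov ebx, 5
```

Even such minimal perturbations can flip a similarity verdict when they hit semantically significant instructions, which is the paper's fourth finding.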
Article Search
Article: fse26maina-p3336-p