FSE 2024
Proceedings of the ACM on Software Engineering, Volume 1, Number FSE
Powered by
Conference Publishing Consulting

Proceedings of the ACM on Software Engineering, Volume 1, Number FSE, July 15–19, 2024, Porto de Galinhas, Brazil

FSE – Journal Issue

Contents - Abstracts - Authors


Title Page

Editorial Message



JIT-Smart: A Multi-task Learning Framework for Just-in-Time Defect Prediction and Localization
Xiangping Chen ORCID logo, Furen Xu ORCID logo, Yuan Huang ORCID logo, Neng Zhang ORCID logo, and Zibin Zheng ORCID logo
(Sun Yat-sen University, Guangzhou, China)
Just-in-time defect prediction (JIT-DP) is used to predict the defect-proneness of a commit and just-in-time defect localization (JIT-DL) is used to locate the exact buggy positions (defective lines) in a commit. Recently, various JIT-DP and JIT-DL techniques have been proposed, while most of them use a post-mortem way (e.g., code entropy, attention weight, LIME) to achieve the JIT-DL goal based on the prediction results in JIT-DP. These methods do not utilize the label information of the defective code lines during model building. In this paper, we propose a unified model JIT-Smart, which makes the training process of just-in-time defect prediction and localization tasks a mutually reinforcing multi-task learning process. Specifically, we design a novel defect localization network (DLN), which explicitly introduces the label information of defective code lines for supervised learning in JIT-DL with considering the class imbalance issue. To further investigate the accuracy and cost-effectiveness of JIT-Smart, we compare JIT-Smart with 7 state-of-the-art baselines under 5 commit-level and 5 line-level evaluation metrics in JIT-DP and JIT-DL. The results demonstrate that JIT-Smart is statistically better than all the state-of-the-art baselines in JIT-DP and JIT-DL. In JIT-DP, at the median value, JIT-Smart achieves F1-Score of 0.475, AUC of 0.886, Recall@20%Effort of 0.823, Effort@20%Recall of 0.01 and Popt of 0.942 and improves the baselines by 19.89%-702.74%, 1.23%-31.34%, 9.44%-33.16%, 21.6%-53.82% and 1.94%-34.89%, respectively . In JIT-DL, at the median value, JIT-Smart achieves Top-5 Accuracy of 0.539 and Top-10 Accuracy of 0.396, Recall@20%Effortline of 0.726, Effort@20%Recallline of 0.087 and IFAline of 0.098 and improves the baselines by 101.83%-178.35%, 101.01%-277.31%, 257.88%-404.63%, 71.91%-74.31% and 99.11%-99.41%, respectively. Statistical analysis shows that our JIT-Smart performs more stably than the best-performing model. Besides, JIT-Smart also achieves the best performance compared with the state-of-the-art baselines in cross-project evaluation.

Article Search
ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems
Guangba Yu ORCID logo, Pengfei Chen ORCID logo, Zilong He ORCID logo, Qiuyu Yan ORCID logo, Yu Luo ORCID logo, Fangyuan Li ORCID logo, and Zibin Zheng ORCID logo
(Sun Yat-sen University, Guangzhou, China; Tencent, China)
In large-scale online service systems, the occurrence of software changes is inevitable and frequent. Despite rigorous pre-deployment testing practices, the presence of defective software changes in the online environment cannot be completely eliminated. Consequently, there is a pressing need for automated techniques that can effectively identify these defective changes. However, the current abnormal change detection (ACD) approaches fall short in accurately pinpointing defective changes, primarily due to their disregard for the propagation of faults. To address the limitations of ACD, we propose a novel concept called root cause change analysis (RCCA) to identify the underlying root causes of change-inducing incidents. In order to apply the RCCA concept to practical scenarios, we have devised an intelligent RCCA framework named ChangeRCA. This framework aims to localize the defective change associated with change-inducing incidents among multiple changes. To assess the effectiveness of ChangeRCA, we have conducted an extensive evaluation utilizing a real-world dataset from WeChat and a simulated dataset encompassing 81 diverse defective changes. The evaluation results demonstrate that ChangeRCA outperforms the state-of-the-art ACD approaches, achieving an impressive Top-1 Hit Rate of 85% and significantly reducing the time required to identify defective changes.

Article Search
Bin2Summary: Beyond Function Name Prediction in Stripped Binaries with Functionality-Specific Code Embeddings
Zirui Song ORCID logo, Jiongyi Chen ORCID logo, and Kehuan Zhang ORCID logo
(Chinese University of Hong Kong, Hong Kong; National University of Defense Technology, China)
Nowadays, closed-source software only with stripped binaries still dominates the ecosystem, which brings obstacles to understanding the functionalities of the software and further conducting the security analysis. With such an urgent need, research has traditionally focused on predicting function names, which can only provide fragmented and abbreviated information about functionality. To advance the state-of-the-art, this paper presents Bin2Summary to automatically summarize the functionality of the function in stripped binaries with natural language sentences. Specifically, the proposed framework includes a functionality-specific code embedding module to facilitate fine-grained similarity detection and an attention-based seq2seq model to generate summaries in natural language. Based on 16 widely-used projects (e.g., Coreutils), we have evaluated Bin2Summary with 38,167 functions, which are filtered from 162,406 functions, and all of them have a high-quality comment. Bin2Summary achieves 0.728 in precision and 0.729 in recall on our datasets, and the functionality-specific embedding module can improve the existing assembly language model by up to 109.5% and 109.9% in precision and recall. Meanwhile, the experiments demonstrated that Bin2Summary has outstanding transferability in analyzing the cross-architecture (i.e., in x64 and x86) and cross-environment (i.e., in Cygwin and MSYS2) binaries. Finally, the case study illustrates how Bin2Summary outperforms the existing works in providing functionality summaries with abundant semantics beyond function names.

Article Search
Component Security Ten Years Later: An Empirical Study of Cross-Layer Threats in Real-World Mobile Applications
Keke Lian ORCID logo, Lei Zhang ORCID logo, Guangliang Yang ORCID logo, Shuo Mao ORCID logo, Xinjie Wang ORCID logo, Yuan Zhang ORCID logo, and Min Yang ORCID logo
(Fudan University, China)
Nowadays, mobile apps have greatly facilitated our daily work and lives. They are often designed to work closely and interact with each other through app components for data and functionality sharing. The security of app components has been extensively studied and various component attacks have been proposed. Meanwhile, Android system vendors and app developers have introduced a series of defense measures to mitigate these security threats. However, we have discovered that as apps evolve and develop, existing app component defenses have become inadequate to address the emerging security requirements. This latency in adaptation has given rise to the feasibility of cross-layer exploitation, where attackers can indirectly manipulate app internal functionalities by polluting their dependent data. To assess the security risks of cross-layer exploitation in real-world apps, we design and implement a novel vulnerability analysis approach, called CLDroid, which addresses two non-trivial challenges. Our experiments revealed that 1,215 (8.8%) popular apps are potentially vulnerable to cross-layer exploitation, with a total of more than 18 billion installs. We verified that through cross-layer exploitation, an unprivileged app could achieve various severe security consequences, such as arbitrary code execution, click hijacking, content spoofing, and persistent DoS. We ethically reported verified vulnerabilities to the developers, who acknowledged and rewarded us with bug bounties. As a result, 56 CVE IDs have been assigned, with 22 of them rated as ‘critical’ or ‘high’ severity.

Article Search
Characterizing Python Library Migrations
Mohayeminul Islam ORCID logo, Ajay Kumar Jha ORCID logo, Ildar Akhmetov ORCID logo, and Sarah NadiORCID logo
(University of Alberta, Canada; North Dakota State University, USA; Northeastern University, USA)
Developers heavily rely on Application Programming Interfaces (APIs) from libraries to build their software. As software evolves, developers may need to replace the used libraries with alternate libraries, a process known as library migration. Doing this manually can be tedious, time-consuming, and prone to errors. Automated migration techniques can help alleviate some of this burden. However, designing effective automated migration techniques requires understanding the types of code changes required to transform client code that used the old library to the new library. This paper contributes an empirical study that provides a holistic view of Python library migrations, both in terms of the code changes required in a migration and the typical development effort involved. We manually label 3,096 migration-related code changes in 335 Python library migrations from 311 client repositories spanning 141 library pairs from 35 domains. Based on our labeled data, we derive a taxonomy for describing migration-related code changes, PyMigTax. Leveraging PyMigTax and our labeled data, we investigate various characteristics of Python library migrations, such as the types of program elements and properties of API mappings, the combinations of types of migration-related code changes in a migration, and the typical development effort required for a migration. Our findings highlight various potential shortcomings of current library migration tools. For example, we find that 40% of library pairs have API mappings that involve non-function program elements, while most library migration techniques typically assume that function calls from the source library will map into (one or more) function calls from the target library. As an approximation for the development effort involved, we find that, on average, a developer needs to learn about 4 APIs and 2 API mappings to perform a migration and change 8 lines of code. However, we also found cases of migrations that involve up to 43 unique APIs, 22 API mappings, and 758 lines of code, making them harder to manually implement. Overall, our contributions provide the necessary knowledge and foundations for developing automated Python library migration techniques. We make all data and scripts related to this study publicly available at https://doi.org/10.6084/m9.figshare.24216858.v2.

Article Search Artifacts Available
EyeTrans: Merging Human and Machine Attention for Neural Code Summarization
Yifan Zhang ORCID logo, Jiliang Li ORCID logo, Zachary Karas ORCID logo, Aakash Bansal ORCID logo, Toby Jia-Jun Li ORCID logo, Collin McMillan ORCID logo, Kevin Leach ORCID logo, and Yu Huang ORCID logo
(Vanderbilt University, USA; University of Notre Dame, USA)
Neural code summarization leverages deep learning models to automatically generate brief natural language summaries of code snippets. The development of Transformer models has led to extensive use of attention during model design. While existing work has primarily and almost exclusively focused on static properties of source code and related structural representations like the Abstract Syntax Tree (AST), few studies have considered human attention — that is, where programmers focus while examining and comprehending code. In this paper, we develop a method for incorporating human attention into machine attention to enhance neural code summarization. To facilitate this incorporation and vindicate this hypothesis, we introduce EyeTrans, which consists of three steps: (1) we conduct an extensive eye-tracking human study to collect and pre-analyze data for model training, (2) we devise a data-centric approach to integrate human attention with machine attention in the Transformer architecture, and (3) we conduct comprehensive experiments on two code summarization tasks to demonstrate the effectiveness of incorporating human attention into Transformers. Integrating human attention leads to an improvement of up to 29.91% in Functional Summarization and up to 6.39% in General Code Summarization performance, demonstrating the substantial benefits of this combination. We further explore performance in terms of robustness and efficiency by creating challenging summarization scenarios in which EyeTrans exhibits interesting properties. We also visualize the attention map to depict the simplifying effect of machine attention in the Transformer by incorporating human attention. This work has the potential to propel AI research in software engineering by introducing more human-centered approaches and data.

Article Search Artifacts Available
LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
Zhihan Jiang ORCID logo, Jinyang Liu ORCID logo, Zhuangbin Chen ORCID logo, Yichen Li ORCID logo, Junjie Huang ORCID logo, Yintong Huo ORCID logo, Pinjia He ORCID logo, Jiazhen Gu ORCID logo, and Michael R. Lyu ORCID logo
(Chinese University of Hong Kong, Hong Kong, China; Sun Yat-sen University, Zhuhai, China; Chinese University of Hong Kong, Shenzhen, China)
Log parsing transforms log messages into structured formats, serving as the prerequisite step for various log analysis tasks. Although a variety of log parsing approaches have been proposed, their performance on complicated log data remains compromised due to the use of human-crafted rules or learning-based models with limited training data. The recent emergence of powerful large language models (LLMs) demonstrates their vast pre-trained knowledge related to code and logging, making it promising to apply LLMs for log parsing. However, their lack of specialized log parsing capabilities currently hinders their parsing accuracy. Moreover, the inherent inconsistent answers, as well as the substantial overhead, prevent the practical adoption of LLM-based log parsing.
To address these challenges, we propose LILAC, the first practical Log parsIng framework using LLMs with Adaptive parsing Cache. To facilitate accurate and robust log parsing, LILAC leverages the in-context learning (ICL) capability of the LLM by performing a hierarchical candidate sampling algorithm and selecting high-quality demonstrations. Furthermore, LILAC incorporates a novel component, an adaptive parsing cache, to store and refine the templates generated by the LLM. It helps mitigate LLM's inefficiency issue by enabling rapid retrieval of previously processed log templates. In this process, LILAC adaptively updates the templates within the parsing cache to ensure the consistency of parsed results. The extensive evaluation on public large-scale datasets shows that LILAC outperforms state-of-the-art methods by 69.5% in terms of the average F1 score of template accuracy. In addition, LILAC reduces the query times to LLMs by several orders of magnitude, achieving a comparable efficiency to the fastest baseline.

Article Search Info
Efficiently Detecting Reentrancy Vulnerabilities in Complex Smart Contracts
Zexu Wang ORCID logo, Jiachi Chen ORCID logo, Yanlin Wang ORCID logo, Yu Zhang ORCID logo, Weizhe Zhang ORCID logo, and Zibin Zheng ORCID logo
(Sun Yat-sen University, Zhuhai, China; Peng Cheng Laboratory, Shenzhen, China; Harbin Institute of Technology, Harbin, China; GuangDong Engineering Technology Research Center of Blockchain, Zhuhai, China)
Reentrancy vulnerability as one of the most notorious vulnerabilities, has been a prominent topic in smart contract security research. Research shows that existing vulnerability detection presents a range of challenges, especially as smart contracts continue to increase in complexity. Existing tools perform poorly in terms of efficiency and successful detection rates for vulnerabilities in complex contracts.
To effectively detect reentrancy vulnerabilities in contracts with complex logic, we propose a tool named SliSE. SliSE’s detection process consists of two stages: Warning Search and Symbolic Execution Verification. In Stage 1, SliSE utilizes program slicing to analyze the Inter-contract Program Dependency Graph (I-PDG) of the contract, and collects suspicious vulnerability information as warnings. In Stage 2, symbolic execution is employed to verify the reachability of these warnings, thereby enhancing vulnerability detection accuracy. SliSE obtained the best performance compared with eight state-of-the-art detection tools. It achieved an F1 score of 78.65%, surpassing the highest score recorded by an existing tool of 9.26%. Additionally, it attained a recall rate exceeding 90% for detection of contracts on Ethereum. Overall, SliSE provides a robust and efficient method for detection of Reentrancy vulnerabilities for complex contracts.

Article Search
IRCoCo: Immediate Rewards-Guided Deep Reinforcement Learning for Code Completion
Bolun Li ORCID logo, Zhihong Sun ORCID logo, Tao Huang ORCID logo, Hongyu Zhang ORCID logo, Yao Wan ORCID logo, Ge Li ORCID logo, Zhi Jin ORCID logo, and Chen Lyu ORCID logo
(Shandong Normal University, China; Chongqing University, China; Huazhong University of Science and Technology, China; Peking University, China)
Code completion aims to enhance programming productivity by predicting potential code based on the current programming context. Recently, pre-trained language models (LMs) have become prominent in this field. Various approaches have been proposed to fine-tune LMs using supervised fine-tuning (SFT) techniques for code completion. However, the inherent exposure bias of these models can cause errors to accumulate early in the sequence completion, leading to even more errors in subsequent completions. To address this problem, deep reinforcement learning (DRL) is an alternative technique for fine-tuning LMs for code completion, which can improve the generalization capabilities and overall performance. Nevertheless, integrating DRL-based strategies into code completion faces two major challenges: 1) The dynamic nature of the code context requires the completion model to quickly adapt to changes, which poses difficulties for conventional DRL strategies that focus on delayed rewarding of the final code state. 2) It is difficult to evaluate the correctness of partial code, thus the reward redistribution-based strategies cannot be adapted to code completion. To tackle these challenges, we propose IRCoCo, a code completion-specific DRL-based fine-tuning framework. This framework is designed to provide immediate rewards as feedback for detecting dynamic context changes arising from continuous edits during code completion. With the aid of immediate feedback, the fine-tuned LM can gain a more precise understanding of the current context, thereby enabling effective adjustment of the LM and optimizing code completion in a more refined manner. Experimental results demonstrate that fine-tuning pre-trained LMs with IRCoCo leads to significant improvements in the code completion task, outperforming both SFT-based and other DRL-based baselines.

Article Search
Shadows in the Interface: A Comprehensive Study on Dark Patterns
Liming Nie ORCID logo, Yangyang Zhao ORCID logo, Chenglin Li ORCID logo, Xuqiong Luo ORCID logo, and Yang Liu ORCID logo
(Shenzhen University of Technology, China; Zhejiang Sci-Tech University, China; Changsha University of Science and Technology, China; Nanyang Technological University, Singapore)
As digital interfaces become increasingly prevalent, a series of ethical issues have surfaced, with dark patterns emerging as a key research focus. These manipulative design strategies are widely employed in User Interfaces (UI) with the primary aim of steering user behavior in favor of service providers, often at the expense of the users themselves. This paper aims to address three main challenges in the study of dark patterns: inconsistencies and incompleteness in classification, limitations of detection tools, and inadequacies in data comprehensiveness. In this paper, we introduce a comprehensive framework, called the Dark Pattern Analysis Framework (DPAF). Utilizing this framework, we construct a comprehensive taxonomy of dark patterns, encompassing 64 types, each labeled with its impact on users and the likely scenarios in which it appears, validated through an industry survey. When assessing the capabilities of the detection tools and the completeness of the dataset, we find that of all dark patterns, the five detection tools can only identify 32, yielding a coverage rate of merely 50%. Although the four existing datasets collectively contain 5,566 instances, they cover only 32 of all types of dark patterns, also resulting in a total coverage rate of 50%. The results discussed above suggest that there is still significant room for advancement in the field of dark pattern detection. Through this research, we not only deepen our understanding of dark pattern classification and detection tools, but also offer valuable insights for future research and practice in this field.

Article Search
ProveNFix: Temporal Property-Guided Program Repair
Yahui Song ORCID logo, Xiang Gao ORCID logo, Wenhua Li ORCID logo, Wei-Ngan ChinORCID logo, and Abhik RoychoudhuryORCID logo
(National University of Singapore, Singapore; Beihang University, China)
Model checking has been used traditionally for finding violations of temporal properties. Recently, testing or fuzzing approaches have also been applied to software systems to find temporal property violations. However, model checking suffers from state explosion, while fuzzing can only partially cover program paths. Moreover, once a violation is found, the fix for the temporal error is usually manual. In this work, we develop the first compositional static analyzer for temporal properties, and the analyzer supports a proof-based repair strategy to fix temporal bugs automatically. To enable a more flexible specification style for temporal properties, on top of the classic pre/post-conditions, we allow users to write a future-condition to modularly express the expected behaviors after the function call. Instead of requiring users to write specifications for each procedure, our approach automatically infers the procedure’s specification according to user-supplied specifications for a small number of primitive APIs. We further devise a term rewriting system to check the actual behaviors against its inferred specification. Our method supports the analysis of 1) memory usage bugs, 2) unchecked return values, 3) resource leaks, etc., with annotated specifications for 17 primitive APIs, and detects 515 vulnerabilities from over 1 million lines of code ranging from ten real-world C projects. Intuitively, the benefit of our approach is that a small set of properties can be specified once and used to analyze/repair a large number of programs. Experimental results show that our tool, ProveNFix, detects 72.2% more true alarms than the latest release of the Infer static analyzer. Moreover, we show the effectiveness of our repair strategy when compared to other state-of-the-art systems — fixing 5% more memory leaks than SAVER, 40% more resource leaks than FootPatch, and with a 90% fix rate for null pointer dereferences.

Article Search
SmartAxe: Detecting Cross-Chain Vulnerabilities in Bridge Smart Contracts via Fine-Grained Static Analysis
Zeqin Liao ORCID logo, Yuhong Nan ORCID logo, Henglong Liang ORCID logo, Sicheng Hao ORCID logo, Juan Zhai ORCID logo, Jiajing Wu ORCID logo, and Zibin Zheng ORCID logo
(Sun Yat-sen University, Zhuhai, China; Sun Yat-sen University, Guangzhou, China; University of Massachusetts, Amherst, USA; GuangDong Engineering Technology Research Center of Blockchain, Zhuhai, China)
With the increasing popularity of blockchain, different blockchain platforms coexist in the ecosystem (e.g., Ethereum, BNB, EOSIO, etc.), which prompts the high demand for cross-chain communication. Cross-chain bridge is a specific type of decentralized application for asset exchange across different blockchain platforms. Securing the smart contracts of cross-chain bridges is in urgent need, as there are a number of recent security incidents with heavy financial losses caused by vulnerabilities in bridge smart contracts, as we call them Cross-Chain Vulnerabilities (CCVs). However, automatically identifying CCVs in smart contracts poses several unique challenges. Particularly, it is non-trivial to (1) identify application-specific access control constraints needed for cross-bridge asset exchange, and (2) identify inconsistent cross-chain semantics between the two sides of the bridge.
In this paper, we propose SmartAxe, a new framework to identify vulnerabilities in cross-chain bridge smart contracts. Particularly, to locate vulnerable functions that have access control incompleteness, SmartAxe models the heterogeneous implementations of access control and finds necessary security checks in smart contracts through probabilistic pattern inference. Besides, SmartAxe constructs cross-chain control-flow graph (xCFG) and data-flow graph (xDFG), which help to find semantic inconsistency during cross-chain data communication. To evaluate SmartAxe, we collect and label a dataset of 88 CCVs from real-attacks cross-chain bridge contracts. Evaluation results show that SmartAxe achieves a precision of 84.95% and a recall of 89.77%. In addition, SmartAxe successfully identifies 232 new/unknown CCVs from 129 real-world cross-chain bridge applications (i.e., from 1,703 smart contracts). These identified CCVs affect a total amount of digital assets worth 1,885,250 USD

Article Search
Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning
Aashish Yadavally ORCID logo, Yi Li ORCID logo, and Tien N. Nguyen ORCID logo
(University of Texas, Dallas, USA)
Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls.

Article Search
PBE-Based Selective Abstraction and Refinement for Efficient Property Falsification of Embedded Software
Yoel Kim ORCID logo and Yunja Choi ORCID logo
(Kyungpook National University, South Korea)
Comprehensive verification/falsification of embedded software is challenging and often impossible mainly due to the typical characteristics of embedded software, such as the use of global variables, reactive behaviors, and its (soft or hard) real-time requirements, to name but a few. Abstraction is one of the major solutions to this problem, but existing proven abstraction techniques are not effective in this domain as they are uniformly applied to the entire program and often require a large number of refinements to find true alarms. This work proposes a domain-specific solution for efficient property falsification based on the observation that embedded software typically consists of a number of user-defined auxiliary functions, many of which may be loosely coupled with the main control logic. Our approach selectively abstracts auxiliary functions using function summaries synthesized by Programming-By-Example (PBE), which reduces falsification complexity as well as the number of refinements. The drawbacks of using PBE-based function summaries, which are neither sound nor complete, for abstraction are counteracted by symbolic alarm filtering and novel PBE-based refinements for function summaries. We demonstrate that the proposed approach has comparable performance to the state-of-the-art model checkers on SV-COMP benchmark programs and outperforms them on a set of typical embedded software in terms of both falsification efficiency and scalability.

Preprint Artifacts Available
Evaluating Directed Fuzzers: Are We Heading in the Right Direction?
Tae Eun Kim ORCID logo, Jaeseung Choi ORCID logo, Seongjae Im ORCID logo, Kihong Heo ORCID logo, and Sang Kil ChaORCID logo
(KAIST, South Korea; Sogang University, South Korea)
Directed fuzzing recently has gained significant attention due to its ability to reconstruct proof-of-concept (PoC) test cases for target code such as buggy lines or functions. Surprisingly, however, there has been no in-depth study on the way to properly evaluate directed fuzzers despite much progress in the field. In this paper, we present the first systematic study on the evaluation of directed fuzzers. In particular, we analyze common pitfalls in evaluating directed fuzzers with extensive experiments on five state-of-the-art tools, which amount to 30 CPU-years of computational effort, in order to confirm that different choices made at each step of the evaluation process can significantly impact the results. For example, we find that a small change in the crash triage logic can substantially affect the measured performance of a directed fuzzer, while the majority of the papers we studied do not fully disclose their crash triage scripts. We argue that disclosing the whole evaluation process is essential for reproducing research and facilitating future work in the field of directed fuzzing. In addition, our study reveals that several common evaluation practices in the current directed fuzzing literature can mislead the overall assessments. Thus, we identify such mistakes in previous papers and propose guidelines for evaluating directed fuzzers.

Article Search Artifacts Available
DyPyBench: A Benchmark of Executable Python Software
Islem Bouzenia ORCID logo, Bajaj Piyush Krishan ORCID logo, and Michael Pradel ORCID logo
(University of Stuttgart, Germany)
Python has emerged as one of the most popular programming languages, extensively utilized in domains such as machine learning, data analysis, and web applications. Python’s dynamic nature and extensive usage make it an attractive candidate for dynamic program analysis. However, unlike for other popular languages, there currently is no comprehensive benchmark suite of executable Python projects, which hinders the development of dynamic analyses. This work addresses this gap by presenting DyPyBench, the first benchmark of Python projects that is large-scale, diverse, ready-to-run (i.e., with fully configured and prepared test suites), and ready- to-analyze (by integrating with the DynaPyt dynamic analysis framework). The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681k lines of Python code, and 30k test cases. DyPyBench enables various applications in testing and dynamic analysis, of which we explore three in this work: (i) Gathering dynamic call graphs and empirically comparing them to statically computed call graphs, which exposes and quantifies limitations of existing call graph construction techniques for Python. (ii) Using DyPyBench to build a training data set for LExecutor, a neural model that learns to predict values that otherwise would be missing at runtime. (iii) Using dynamically gathered execution traces to mine API usage specifications, which establishes a baseline for future work on specification mining for Python. We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.

Article Search
Predicting Configuration Performance in Multiple Environments with Sequential Meta-Learning
Jingzhi Gong ORCID logo and Tao Chen ORCID logo
(University of Electronic Science and Technology of China, China; Loughborough University, United Kingdom; University of Birmingham, United Kingdom)
Learning and predicting the performance of given software configurations are of high importance to many software engineering activities. While configurable software systems will almost certainly face diverse running environments (e.g., version, hardware, and workload), current work often either builds performance models under a single environment or fails to properly handle data from diverse settings, hence restricting their accuracy for new environments. In this paper, we target configuration performance learning under multiple environments. We do so by designing SeMPL — a meta-learning framework that learns the common understanding from configurations measured in distinct (meta) environments and generalizes them to the unforeseen, target environment. What makes it unique is that unlike common meta-learning frameworks (e.g., MAML and MetaSGD) that train the meta environments in parallel, we train them sequentially, one at a time. The order of training naturally allows discriminating the contributions among meta environments in the meta-model built, which fits better with the characteristic of configuration data that is known to dramatically differ between different environments. Through comparing with 15 state-of-the-art models under nine systems, our extensive experimental results demonstrate that SeMPL performs considerably better on 89% of the systems with up to 99% accuracy improvement, while being data-efficient, leading to a maximum of 3.86× speedup. All code and data can be found at our repository: https://github.com/ideas-labo/SeMPL.

Article Search
R2I: A Relative Readability Metric for Decompiled Code
Haeun Eom ORCID logo, Dohee Kim ORCID logo, Sori Lim ORCID logo, Hyungjoon Koo ORCID logo, and Sungjae Hwang ORCID logo
(Sungkyunkwan University, South Korea)
Decompilation is a process of converting a low-level machine code snippet back into a high-level programming language such as C. It serves as a basis to aid reverse engineers in comprehending the contextual semantics of the code. In this respect, commercial decompilers like Hex-Rays have made significant strides in improving the readability of decompiled code over time. While previous work has proposed the metrics for assessing the readability of source code, including identifiers, variable names, function names, and comments, those metrics are unsuitable for measuring the readability of decompiled code primarily due to i) the lack of rich semantic information in the source and ii) the presence of erroneous syntax or inappropriate expressions. In response, to the best of our knowledge, this work first introduces R2I, the Relative Readability Index, a specialized metric tailored to evaluate decompiled code in a relative context quantitatively. In essence, R2I can be computed by i) taking code snippets across different decompilers as input and ii) extracting pre-defined features from an abstract syntax tree. For the robustness of R2I, we thoroughly investigate the enhancement efforts made by (non-)commercial decompilers and academic research to promote code readability, identifying 31 features to yield a reliable index collectively. Besides, we conducted a user survey to capture subjective factors such as one’s coding styles and preferences. Our empirical experiments demonstrate that R2I is a versatile metric capable of representing the relative quality of decompiled code (e.g., obfuscation, decompiler updates) and being well aligned with human perception in our survey.

Article Search Artifacts Available
DiffCoder: Enhancing Large Language Model on API Invocation via Analogical Code Exercises
Daoguang Zan ORCID logo, Ailun Yu ORCID logo, Bo Shen ORCID logo, Bei Chen ORCID logo, Wei Li ORCID logo, Yongshun Gong ORCID logo, Xiaolin Chen ORCID logo, Yafen Yao ORCID logo, Weihua Luo ORCID logo, Bei Guan ORCID logo, Yan Liu ORCID logo, Yongji Wang ORCID logo, Qianxiang Wang ORCID logo, and Lizhen Cui ORCID logo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Peking University, China; Huawei Technologies, China; Independent Researcher, China; Shandong University, China; Funcun-wuyou Technologies, China)
The task of code generation aims to generate code solutions based on given programming problems. Recently, code large language models (code LLMs) have shed new light on this task, owing to their formidable code generation capabilities. While these models are powerful, they seldom focus on further improving the accuracy of library-oriented API invocation. Nonetheless, programmers frequently invoke APIs in routine coding tasks. In this paper, we aim to enhance the proficiency of existing code LLMs regarding API invocation by mimicking analogical learning, which is a critical learning strategy for humans to learn through differences among multiple instances. Motivated by this, we propose a simple yet effective approach, namely DiffCoder, which excels in API invocation by effectively training on the differences (diffs) between analogical code exercises. To assess the API invocation capabilities of code LLMs, we conduct experiments on seven existing benchmarks that focus on mono-library API invocation. Additionally, we construct a new benchmark, namely PanNumEval, to evaluate the performance of multi-library API invocation. Extensive experiments on eight benchmarks demonstrate the impressive performance of DiffCoder. Furthermore, we develop a VSCode plugin for DiffCoder, and the results from twelve invited participants further verify the practicality of DiffCoder.

Article Search
Maximizing Patch Coverage for Testing of Highly-Configurable Software without Exploding Build Times
Necip Fazıl YıldıranORCID logo, Jeho Oh ORCID logo, Julia Lawall ORCID logo, and Paul GazzilloORCID logo
(University of Central Florida, USA; University of Texas, Austin, USA; Inria, France)
The Linux kernel is highly-configurable, with a build system that takes a configuration file as input and automatically tailors the source code accordingly. Configurability, however, complicates testing, because different configuration options lead to the inclusion of different code fragments. With thousands of patches received per month, Linux kernel maintainers employ extensive automated continuous integration testing. To attempt patch coverage, i.e., taking all changed lines into account, current approaches either use configuration files that maximize total statement coverage or use multiple randomly-generated configuration files, both of which incur high build times without guaranteeing patch coverage. To achieve patch coverage without exploding build times, we propose krepair, which automatically repairs configuration files that are fast-building but have poor patch coverage to achieve high patch coverage with little effect on build times. krepair works by discovering a small set of changes to a configuration file that will ensure patch coverage, preserving most of the original configuration file's settings. Our evaluation shows that, when applied to configuration files with poor patch coverage on a statistically-significant sample of recent Linux kernel patches, krepair achieves nearly complete patch coverage, 98.5% on average, while changing less than 1.53% of the original default configuration file in 99% of patches, which keeps build times 10.5x faster than maximal configuration files.

Preprint Artifacts Available
Abstraction-Aware Inference of Metamorphic Relations
Agustín Nolasco ORCID logo, Facundo Molina ORCID logo, Renzo Degiovanni ORCID logo, Alessandra GorlaORCID logo, Diego GarbervetskyORCID logo, Mike Papadakis ORCID logo, Sebastian Uchitel ORCID logo, Nazareno AguirreORCID logo, and Marcelo F. Frias ORCID logo
(University of Rio Cuarto, Argentina; IMDEA Software Institute, Spain; University of Luxembourg, Luxembourg; University of Buenos Aires, Argentina; Imperial College London, United Kingdom; CONICET, Argentina; Instituto Tecnológico de Buenos Aires, Argentina)
Metamorphic testing is a valuable technique that helps in dealing with the oracle problem. It involves testing software against specifications of its intended behavior given in terms of so called metamorphic relations, statements that express properties relating different software elements (e.g., different inputs, methods, etc). The effective application of metamorphic testing strongly depends on identifying suitable domain-specific metamorphic relations, a challenging task, that is typically manually performed.
This paper introduces MemoRIA, a novel approach that aims at automatically identifying metamorphic relations. The technique focuses on a particular kind of metamorphic relation, which asserts equivalences between methods and method sequences. MemoRIA works by first generating an object-protocol abstraction of the software being tested, then using fuzzing to produce candidate relations from the abstraction, and finally validating the candidate relations through run-time analysis. A SAT-based analysis is used to eliminate redundant relations, resulting in a concise set of metamorphic relations for the software under test. We evaluate our technique on a benchmark consisting of 22 Java subjects taken from the literature, and compare MemoRIA with the metamorphic relation inference technique SBES. Our results show that by incorporating the object protocol abstraction information, MemoRIA is able to more effectively infer meaningful metamorphic relations, that are also more precise, compared to SBES, measured in terms of mutation analysis. Also, the SAT-based reduction allows us to significantly reduce the number of reported metamorphic relations, while in general having a small impact in the bug finding ability of the corresponding obtained relations.

Article Search
TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
Haiyu Huang ORCID logo, Xiaoyu Zhang ORCID logo, Pengfei Chen ORCID logo, Zilong He ORCID logo, Zhiming Chen ORCID logo, Guangba Yu ORCID logo, Hongyang Chen ORCID logo, and Chen Sun ORCID logo
(Sun Yat-sen University, Guangzhou, China; Huawei, China)
Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention compared to random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient, system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. Simultaneously, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of system-bias and diversity-bias through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer can achieve higher quality sampling results and significantly improve the performance of downstream root cause analysis (RCA) tasks. It has led to an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines in two datasets.

Article Search
Fast Graph Simplification for Path-Sensitive Typestate Analysis through Tempo-Spatial Multi-Point Slicing
Xiao Cheng ORCID logo, Jiawei Ren ORCID logo, and Yulei Sui ORCID logo
(UNSW, Australia)
Typestate analysis is a commonly used static technique to identify software vulnerabilities by assessing if a sequence of operations violates temporal safety specifications defined by a finite state automaton. Path-sensitive typestate analysis (PSTA) offers a more precise solution by eliminating false alarms stemming from infeasible paths. To improve the efficiency of path-sensitive analysis, previous efforts have incorporated sparse techniques, with a focus on analyzing the path feasibility of def-use chains. However, they cannot be directly applied to detect typestate vulnerabilities requiring temporal information within the control flow graph, e.g., use-to-use information.
In this paper, we introduce FGS, a Fast Graph Simplification approach designed for PSTA by retaining multi-point temporal information while harnessing the advantages of sparse analysis. We propose a new multi-point slicing technique that captures the temporal and spatial correlations within the program. By doing so, it optimizes the program by only preserving the necessary program dependencies, resulting in a sparser structure for precision-preserving PSTA. Our graph simplification approach, as a fast preprocessing step, offers several benefits for existing PSTA algorithms. These include a more concise yet precision-preserving graph structure, decreased numbers of variables and constraints within execution states, and simplified path feasibility checking. As a result, the overall efficiency of the PSTA algorithm exhibits significant improvement.
We evaluated FGS using NIST benchmarks and ten real-world large-scale projects to detect four types of vulnerabilities, including memory leaks, double-frees, use-after-frees, and null dereferences. On average, when comparing FGS against ESP (baseline PSTA), FGS reduces 89% of nodes, 86% of edges, and 88% of calling context of the input graphs, obtaining a speedup of 116x and a memory usage reduction of 93% on the large projects evaluated. Our experimental results also demonstrate that FGS outperforms six open-source tools (IKOS, ClangSA , Saber, Cppcheck, Infer, and Sparrow) on the NIST benchmarks, which comprises 846 programs. Specifically, FGS achieves significantly higher precision, with improvements of up to 171% (42% on average), and detects a greater number of true positives, with enhancements of up to 245% (52% on average). Moreover, among the ten large-scale projects, FGS successfully found 105 real bugs with a precision rate of 82%. In contrast, our baseline tools not only missed over 42% of the real bugs but also yielded an average precision rate of just 13%.

Article Search
Understanding Developers’ Discussions and Perceptions on Non-functional Requirements: The Case of the Spring Ecosystem
Anderson Oliveira ORCID logo, João Correia ORCID logo, Wesley K. G. AssunçãoORCID logo, Juliana Alves Pereira ORCID logo, Rafael de Mello ORCID logo, Daniel Coutinho ORCID logo, Caio Barbosa ORCID logo, Paulo Libório ORCID logo, and Alessandro Garcia ORCID logo
(PUC-Rio, Brazil; North Carolina State University, USA; Federal University of Rio de Janeiro, Brazil)
Non-Functional Requirements (NFRs) should be defined in the early stages of the software development process, driving developers to make important design decisions. Neglecting NFRs may lead developers to create systems that are difficult to maintain and do not meet users expectations. Despite their importance, the discussion of NFRs is often ad-hoc and scattered through multiple sources, limiting developers' awareness of NFRs. In that scenario, Pull Request (PR) discussions provide a centralized platform for comprehensive NFR discussions. However, existing studies do not explore this important source of information in open-source software development, which developers widely use to discuss software requirements. In this study, we report an investigation of NFR discussions in PRs of repositories of the Spring ecosystem. We collected, manually curated, and analyzed PR discussions addressing four categories of NFRs: maintainability, security, performance, and robustness. We observed that discussions surrounding these PRs tend to address the introduction of a code change or explain some anomaly regarding a particular NFR. Also, we found that more than 77% of the discussions related to NFRs are triggered in the PR title and/or description, indicating that developers are often provided with information regarding NFRs straightway. To gain additional knowledge from these NFR discussions, our study also analyzed the characteristics and activities of developers who actually discuss and fix NFR issues. In particular, we performed an in-depth analysis of 63 developers that stood out in collaborating with the mapped PRs. To complement this analysis, we conducted a survey with 44 developers to gather their perceptions on NFR discussions. By observing how developers approach NFRs and participate in discussions, we documented the best practices and strategies newcomers can adopt to address NFRs effectively. We also provided a curated dataset of 1,533 PR discussions classified with NFR presence.

Article Search Info
Adapting Multi-objectivized Software Configuration Tuning
Tao Chen ORCID logo and Miqing Li ORCID logo
(University of Electronic Science and Technology of China, China; University of Birmingham, United Kingdom)
When tuning software configuration for better performance (e.g., latency or throughput), an important issue that many optimizers face is the presence of local optimum traps, compounded by a highly rugged configuration landscape and expensive measurements. To mitigate these issues, a recent effort has shifted to focus on the level of optimization model (called meta multi-objectivization or MMO) instead of designing better optimizers as in traditional methods. This is done by using an auxiliary performance objective, together with the target performance objective, to help the search jump out of local optima. While effective, MMO needs a fixed weight to balance the two objectives—a parameter that has been found to be crucial as there is a large deviation of the performance between the best and the other settings. However, given the variety of configurable software systems, the “sweet spot” of the weight can vary dramatically in different cases and it is not possible to find the right setting without time-consuming trial and error. In this paper, we seek to overcome this significant shortcoming of MMO by proposing a weight adaptation method, dubbed AdMMO. Our key idea is to adaptively adjust the weight at the right time during tuning, such that a good proportion of the nondominated configurations can be maintained. Moreover, we design a partial duplicate retention mechanism to handle the issue of too many duplicate configurations without losing the rich information provided by the “good” duplicates.
Experiments on several real-world systems, objectives, and budgets show that, for 71% of the cases, AdMMO is considerably superior to MMO and a wide range of state-of-the-art optimizers while achieving generally better efficiency with the best speedup between 2.2x and 20x.

Article Search
CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking
Zian Su ORCID logo, Xiangzhe Xu ORCID logo, Ziyang Huang ORCID logo, Zhuo Zhang ORCID logo, Yapeng Ye ORCID logo, Jianjun Huang ORCID logo, and Xiangyu ZhangORCID logo
(Purdue University, USA; Renmin University of China, China)
Transformer based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method to only allow the model attending to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. In the meantime, the inherent self-attention mechanism is utilized to learn which of the allowed attentions are more important compared to others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model in three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model can improve the SOTAs in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques of code understanding models.

Article Search
Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models
Yan Wang ORCID logo, Xiaoning Li ORCID logo, Tien N. Nguyen ORCID logo, Shaohua Wang ORCID logo, Chao NiORCID logo, and Ling Ding ORCID logo
(Central University of Finance and Economics, China; University of Texas, Dallas, USA; Zhejiang University, China)
Pre-trained Large Language Models (LLM) have achieved remarkable successes in several domains. However, code-oriented LLMs are often heavy in computational complexity, and quadratically with the length of the input code sequence. Toward simplifying the input program of an LLM, the state-of-the-art approach has the strategies to filter the input code tokens based on the attention scores given by the LLM. The decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is pre-trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of input code tokens. As an empirical study on the LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks: code search and summarization, we reported that 1) the removal ratio of code has a linear-like relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the paradigm–prompt engineering and interactive in-context learning. The empirical results showed that SlimCode can improve the state-of-the-art technique by 9.46% and 5.15% in terms of MRR and BLEU score on code search and summarization, respectively. More importantly, SlimCode is 133 times faster than the state-of-the-art approach. Additionally, SlimCode can reduce the cost of invoking GPT-4 by up to 24% per API query, while still producing comparable results to those with the original code. With this result, we call for a new direction on code-based, model-agnostic code simplification solutions to further empower LLMs.

Article Search
Go Static: Contextualized Logging Statement Generation
Yichen Li ORCID logo, Yintong Huo ORCID logo, Renyi Zhong ORCID logo, Zhihan Jiang ORCID logo, Jinyang Liu ORCID logo, Junjie Huang ORCID logo, Jiazhen Gu ORCID logo, Pinjia He ORCID logo, and Michael R. Lyu ORCID logo
(Chinese University of Hong Kong, Hong Kong, China; Chinese University of Hong Kong, Shenzhen, China)
Logging practices have been extensively investigated to assist developers in writing appropriate logging statements for documenting software behaviors. Although numerous automatic logging approaches have been proposed, their performance remains unsatisfactory due to the constraint of the single-method input, without informative programming context outside the method. Specifically, we identify three inherent limitations with single-method context: limited static scope of logging statements, inconsistent logging styles, and missing type information of logging variables.
To tackle these limitations, we propose SCLogger, the first contextualized logging statement generation approach with inter-method static contexts.First, SCLogger extracts inter-method contexts with static analysis to construct the contextualized prompt for language models to generate a tentative logging statement. The contextualized prompt consists of an extended static scope and sampled similar methods, ordered by the chain-of-thought (COT) strategy. Second, SCLogger refines the access of logging variables by formulating a new refinement prompt for language models, which incorporates detailed type information of variables in the tentative logging statement.
The evaluation results show that SCLogger surpasses the state-of-the-art approach by 8.7% in logging position accuracy, 32.1% in level accuracy, 19.6% in variable precision, and 138.4% in text BLEU-4 score. Furthermore, SCLogger consistently boosts the performance of logging statement generation across a range of large language models, thereby showcasing the generalizability of this approach.

Article Search
Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by Example
Malinda DilharaORCID logo, Abhiram Bellur ORCID logo, Timofey Bryksin ORCID logo, and Danny Dig ORCID logo
(University of Colorado, Boulder, USA; JetBrains Research, Cyprus)
Software developers often repeat the same code changes within a project or across different projects. These repetitive changes are known as “code change patterns” (CPATs). Automating CPATs is crucial to expedite the software development process. While current Transformation by Example (TBE) techniques can automate CPATs, they are limited by the quality and quantity of the provided input examples. Thus, they miss transforming code variations that do not have the exact syntax, data-, or control-flow of the provided input examples, despite being semantically similar. Large Language Models (LLMs), pre-trained on extensive source code datasets, offer a potential solution. Harnessing the capability of LLMs to generate semantically equivalent, yet previously unseen variants of the original CPAT could significantly increase the effectiveness of TBE systems.
In this paper, we first discover best practices for harnessing LLMs to generate code variants that meet three criteria: correctness (semantic equivalence to the original CPAT), usefulness (reflecting what developers typically write), and applicability (aligning with the primary intent of the original CPAT). We then implement these practices in our tool PyCraft, which synergistically combines static code analysis, dynamic analysis, and LLM capabilities. By employing chain-of-thought reasoning, PyCraft generates variations of input examples and comprehensive test cases that identify correct variations with an F-measure of 96.6%. Our algorithm uses feedback iteration to expand the original input examples by an average factor of 58x. Using these richly generated examples, we inferred transformation rules and then automated these changes, resulting in an increase of up to 39x, with an average increase of 14x in target codes compared to a previous state-of-the-art tool that relies solely on static analysis. We submitted patches generated by PyCraft to a range of projects, notably esteemed ones like microsoft/DeepSpeed and IBM/inFairness. Their developers accepted and merged 83% the 86 CPAT instances submitted through 44 pull requests. This confirms the usefulness of these changes.

Article Search Info
Syntax Is All You Need: A Universal-Language Approach to Mutant Generation
Sourav Deb ORCID logo, Kush Jain ORCID logo, Rijnard van Tonder ORCID logo, Claire Le GouesORCID logo, and Alex GroceORCID logo
(Northern Arizona University, USA; Carnegie Mellon University, USA; Mysten Labs, USA)
While mutation testing has been a topic of academic interest for decades, it is only recently that “real-world” developers, including industry leaders such as Google and Meta, have adopted mutation testing. We propose a new approach to the development of mutation testing tools, and in particular the core challenge of generating mutants. Current practice tends towards two limited approaches to mutation generation: mutants are either (1) generated at the bytecode/IR level, and thus neither human readable nor adaptable to source-level features of languages or projects, or (2) generated at the source level by language-specific tools that are hard to write and maintain, and in fact are often abandoned by both developers and users. We propose instead that source-level mutation generation is a special case of program transformation in general, and that adopting this approach allows for a single tool that can effectively generate source-level mutants for essentially any programming language. Furthermore, by using parser parser combinators many of the seeming limitations of an any-language approach can be overcome, without the need to parse specific languages. We compare this new approach to mutation to existing tools, and demonstrate the advantages of using parser parser combinators to improve on a regular-expression based approach to generation. Finally, we show that our approach can provide effective mutant generation even for a language for which it lacks any language-specific operators, and that is not very similar in syntax to any language it has been applied to previously.

Preprint Info Artifacts Available
CodePlan: Repository-Level Coding using LLMs and Planning
Ramakrishna Bairi ORCID logo, Atharv Sonwane ORCID logo, Aditya Kanade ORCID logo, Vageesh D. C. ORCID logo, Arun IyerORCID logo, Suresh Parthasarathy ORCID logo, Sriram Rajamani ORCID logo, B. Ashok ORCID logo, and Shashank Shet ORCID logo
(Microsoft Research, India)
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks.
Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan to solve it. CodePlan synthesizes a multi-step chain-of-edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm (symbolic components) with the neural LLMs.
We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2–97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has better match with the ground truth compared to baselines. CodePlan is able to get 5/7 repositories to pass the validity checks (i.e., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them. We provide our (non-proprietary) data, evaluation scripts and supplementary material at https://github.com/microsoft/codeplan.

Article Search
Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks
Wei Wang ORCID logo, Huilong Ning ORCID logo, Gaowei Zhang ORCID logo, Libo Liu ORCID logo, and Yi Wang ORCID logo
(Beijing University of Posts and Telecommunications, China; University of Melbourne, Australia)
Recently, large language models (LLM) based generative AI has been gaining momentum for their impressive high-quality performances in multiple domains, particularly after the release of the ChatGPT. Many believe that they have the potential to perform general-purpose problem-solving in software development and replace human software developers. Nevertheless, there are in a lack of serious investigation into the capability of these LLM techniques in fulfilling software development tasks. In a controlled 2 × 2 between-subject experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in the coding task and typical software development task and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was not that good. We also observed the interactions between participants and ChatGPT and found the relations between the interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers and motivates the need for novel interaction mechanisms that help developers effectively work with large language models to achieve desired outcomes.

Article Search Info
Understanding and Detecting Annotation-Induced Faults of Static Analyzers
Huaien Zhang ORCID logo, Yu Pei ORCID logo, Shuyun Liang ORCID logo, and Shin Hwei Tan ORCID logo
(Hong Kong Polytechnic University, China; Southern University of Science and Technology, China; Concordia University, Canada)
Static analyzers can reason about the properties and behaviors of programs and detect various issues without executing them. Hence, they should extract the necessary information to understand the analyzed program well. Annotation has been a widely used feature for different purposes in Java since the introduction of Java 5. Annotations can change program structures and convey semantics information without awareness of static analyzers, consequently leading to imprecise analysis results. This paper presents the first comprehensive study of annotation-induced faults (AIF) by analyzing 246 issues in six open-source and popular static analyzers (i.e., PMD, SpotBugs, CheckStyle, Infer, SonarQube, and Soot). We analyzed the issues' root causes, symptoms, and fix strategies and derived ten findings and some practical guidelines for detecting and repairing annotation-induced faults. Moreover, we developed an automated testing framework called AnnaTester based on three metamorphic relations originating from the findings. AnnaTester generated new tests based on the official test suites of static analyzers and unveiled 43 new faults, 20 of which have been fixed. The results confirm the value of our study and its findings.

Article Search
Only diff Is Not Enough: Generating Commit Messages Leveraging Reasoning and Action of Large Language Model
Jiawei Li ORCID logo, David Faragó ORCID logo, Christian Petrov ORCID logo, and Iftekhar Ahmed ORCID logo
(University of California, Irvine, USA; Innoopract, Germany; QPR Technologies, Germany)
Commit messages play a vital role in software development and maintenance. While previous research has introduced various Commit Message Generation (CMG) approaches, they often suffer from a lack of consideration for the broader software context associated with code changes. This limitation resulted in generated commit messages that contained insufficient information and were poorly readable. To address these shortcomings, we approached CMG as a knowledge-intensive reasoning task. We employed ReAct prompting with a cutting-edge Large Language Model (LLM) to generate high-quality commit messages. Our tool retrieves a wide range of software context information, enabling the LLM to create commit messages that are factually grounded and comprehensive. Additionally, we gathered commit message quality expectations from software practitioners, incorporating them into our approach to further enhance message quality. Human evaluation demonstrates the overall effectiveness of our CMG approach, which we named Omniscient Message Generator (OMG). It achieved an average improvement of 30.2% over human-written messages and a 71.6% improvement over state-of-the-art CMG methods.

Article Search
DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing
Sabaat Haroon ORCID logo, Chris Brown ORCID logo, and Muhammad Ali Gulzar ORCID logo
(Virginia Tech, USA)
SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial effort to comprehend every stage of the query execution and find the root cause among the input data and complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. Due to the declarative nature of SQL, there are no ordered atomic statements to place a breakpoint to monitor the flow of data. DeSQL’s automated query decomposition breaks a SQL query into its constituent sub queries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. DeSQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, DeSQL efficiently regenerates the intermediate data from a nearby subquery’s data. On the popular TPC-DC benchmark, DeSQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark’s scalability. In a user study comprising 15 participants engaged in two debugging tasks, we find that participants utilizing DeSQL identify the root cause behind a wrong query output in 74% less time than the de-facto, manual debugging.

Article Search
CORE: Resolving Code Quality Issues using LLMs
Nalin Wadhwa ORCID logo, Jui Pradhan ORCID logo, Atharv Sonwane ORCID logo, Surya Prakash Sahu ORCID logo, Nagarajan Natarajan ORCID logo, Aditya Kanade ORCID logo, Suresh Parthasarathy ORCID logo, and Sriram Rajamani ORCID logo
(Microsoft Research, India)
As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra efforts to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues.
We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer.
We conduct a variety of experiments on two public benchmarks to show the ability of CORE: 1) to generate code revisions acceptable to both static analysis tools and human reviewers (the latter evaluated with user study on a subset of the Python benchmark), 2) to reduce human review efforts by detecting and eliminating revisions with unintended changes, 3) to readily work across multiple languages (Python and Java), static analysis tools (CodeQL and SonarQube) and quality checks (52 and 10 checks, respectively), and 4) to achieve fix rate comparable to a rule-based automated program repair tool but with much smaller engineering efforts (on the Java benchmark).
CORE could revise 59.2% Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM reduced false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% Java files (across 10 quality checks) comparable to 78.3% of a specialized program repair tool, with significantly much less engineering efforts. We release code, data, and supplementary material publicly at http://aka.ms/CORE_MSRI.

Article Search Info
Towards AI-Assisted Synthesis of Verified Dafny Methods
Md Rakib Hossain Misu ORCID logo, Cristina V. Lopes ORCID logo, Iris Ma ORCID logo, and James Noble ORCID logo
(University of California, Irvine, USA; Creative Research & Programming, New Zealand; Australian National University, Australia)
Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs’ specifications as well as code, so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper we demonstrate how to improve two pretrained models’ proficiency in the Dafny verified programming language. Using 178 programming problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to generate methods in Dafny. We use three different types of prompts: a direct contextless prompt, a second prompt that includes a method signature and test cases, and a third "Chain of Thought" prompt that decomposes the problem into steps and includes further example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks, and that both models perform best with the most detailed third prompt. GPT-4 was able to generate verified, human-evaluated, Dafny methods for 58% of the problems, however GPT-4 managed only 19% of the problems with the first prompt, and even fewer (10%) for the second prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP programming problems, 50 that we wrote manually and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code- generating large language models. Likewise, program verification systems can benefit from large language models, whether to synthesize code wholesale, to generate specifications, or to act as a "programmer’s verification apprentice", to construct annotations such as loop invariants which are hard for programmers to write or verification tools to find. Finally, we expect that the approach we have pioneered here — generating candidate solutions that are subsequently formally checked for correctness — should transfer to other domains (e.g., legal arguments, transport signaling, structural engineering) where solutions must be correct, where that correctness must be demonstrated, explained and understood by designers and end-users.

Article Search
AROMA: Automatic Reproduction of Maven Artifacts
Mehdi Keshani ORCID logo, Tudor-Gabriel Velican ORCID logo, Gideon Bot ORCID logo, and Sebastian Proksch ORCID logo
(Delft University of Technology, Netherlands)
Modern software engineering establishes software supply chains and relies on tools and libraries to improve productivity. However, reusing external software in a project presents a security risk when the source of the component is unknown or the consistency of a component cannot be verified. The SolarWinds attack serves as a popular example in which the injection of malicious code into a library affected thousands of customers and caused a loss of billions of dollars. Reproducible builds present a mitigation strategy, as they can confirm the origin and consistency of reused components. A large reproducibility community has formed for Debian, but the reproducibility of the Maven ecosystem, the backbone of the Java supply chain, remains understudied in comparison. Reproducible Central is an initiative that curates a list of reproducible Maven libraries, but the list is limited and challenging to maintain due to manual efforts. Our research aims to support these efforts in the Maven ecosystem through automation. We investigate the feasibility of automatically finding the source code of a library from its Maven release and recovering information about the original release environment. Our tool, AROMA, can obtain this critical information from the artifact and the source repository through several heuristics and we use the results for reproduction attempts of Maven packages. Overall, our approach achieves an accuracy of up to 99.5% when compared field-by-field to the existing manual approach. In some instances, we even detected flaws in the manually maintained list, such as broken repository links. We reveal that automatic reproducibility is feasible for 23.4% of the Maven packages using AROMA, and 8% of these packages are fully reproducible. We demonstrate our ability to successfully reproduce new packages and have contributed some of them to the Reproducible Central repository. Additionally, we highlight actionable insights, outline future work in this area, and make our dataset and tools available to the public.

Article Search
Harnessing Neuron Stability to Improve DNN Verification
Hai Duong ORCID logo, Dong Xu ORCID logo, Thanhvu NguyenORCID logo, and Matthew B. Dwyer ORCID logo
(George Mason University, USA; University of Virginia, USA)
Deep Neural Networks (DNN) have emerged as an effective approach to tackling real-world problems. However, like human-written software, DNNs are susceptible to bugs and attacks. This has generated significant interest in developing effective and scalable DNN verification techniques and tools. Recent developments in DNN verification have highlighted the potential of constraint-solving approaches that combine abstraction techniques with SAT solving. Abstraction approaches are effective at precisely encoding neuron behavior when it is linear, but they lead to overapproximation and combinatorial scaling when behavior is non-linear. SAT approaches in DNN verification have incorporated standard DPLL techniques, but have overlooked important optimizations found in modern SAT solvers that help them scale on industrial benchmarks. In this paper, we present VeriStable, a novel extension of the recently proposed DPLL-based constraint DNN verification approach. VeriStable leverages the insight that while neuron behavior may be non-linear across the entire DNN input space, at intermediate states computed during verification many neurons may be constrained to have linear behavior – these neurons are stable. Efficiently detecting stable neurons reduces combinatorial complexity without compromising the precision of abstractions. Moreover, the structure of clauses arising in DNN verification problems shares important characteristics with industrial SAT benchmarks. We adapt and incorporate multi-threading and restart optimizations targeting those characteristics to further optimize DPLL-based DNN verification. We evaluate the effectiveness of VeriStable across a range of challenging benchmarks including fully- connected feedforward networks (FNNs), convolutional neural networks (CNNs) and residual networks (ResNets) applied to the standard MNIST and CIFAR datasets. Preliminary results show that VeriStable is competitive and outperforms state-of-the-art DNN verification tools, including α-β-CROWN and MN-BaB, the first and second performers of the VNN-COMP, respectively.

Article Search
“The Law Doesn’t Work Like a Computer”: Exploring Software Licensing Issues Faced by Legal Practitioners
Nathan Wintersgill ORCID logo, Trevor Stalnaker ORCID logo, Laura A. Heymann ORCID logo, Oscar ChaparroORCID logo, and Denys PoshyvanykORCID logo
(William & Mary, USA)
Most modern software products incorporate open source components, which requires compliance with each component's licenses. As noncompliance can lead to significant repercussions, organizations often seek advice from legal practitioners to maintain license compliance, address licensing issues, and manage the risks of noncompliance. While legal practitioners play a critical role in the process, little is known in the software engineering community about their experiences within the open source license compliance ecosystem. To fill this knowledge gap, a joint team of software engineering and legal researchers designed and conducted a survey with 30 legal practitioners and related occupations and then held 16 follow-up interviews. We identified different aspects of OSS license compliance from the perspective of legal practitioners, resulting in 14 key findings in three main areas of interest: the general ecosystem of compliance, the specific compliance practices of legal practitioners, and the challenges that legal practitioners face. We discuss the implications of our findings.

Article Search Artifacts Available
COSTELLO: Contrastive Testing for Embedding-Based Large Language Model as a Service Embeddings
Weipeng Jiang ORCID logo, Juan Zhai ORCID logo, Shiqing Ma ORCID logo, Xiaoyu Zhang ORCID logo, and Chao Shen ORCID logo
(Xi’an Jiaotong University, China; University of Massachusetts, Amherst, USA)
Large language models have gained significant popularity and are often provided as a service (i.e., LLMaaS). Companies like OpenAI and Google provide online APIs of LLMs to allow downstream users to create innovative applications. Despite its popularity, LLM safety and quality assurance is a well-recognized concern in the real world, requiring extra efforts for testing these LLMs. Unfortunately, while end-to-end services like ChatGPT have garnered rising attention in terms of testing, the LLMaaS embeddings have comparatively received less scrutiny. We state the importance of testing and uncovering problematic individual embeddings without considering downstream applications. The abstraction and non-interpretability of embedded vectors, combined with the black-box inaccessibility of LLMaaS, make testing a challenging puzzle. This paper proposes COSTELLO, a black-box approach to reveal potential defects in abstract embedding vectors from LLMaaS by contrastive testing. Our intuition is that high-quality LLMs can adequately capture the semantic relationships of the input texts and properly represent their relationships in the high-dimensional space. For the given interface of LLMaaS and seed inputs, COSTELLO can automatically generate test suites and output words with potential problematic embeddings. The idea is to synthesize contrastive samples with guidance, including positive and negative samples, by mutating seed inputs. Our synthesis guide will leverage task-specific properties to control the mutation procedure and generate samples with known partial relationships in the high-dimensional space. Thus, we can compare the expected relationship (oracle) and embedding distance (output of LLMs) to locate potential buggy cases. We evaluate COSTELLO on 42 open-source (encoder-based) language models and two real-world commercial LLMaaS. Experimental results show that COSTELLO can effectively detect semantic violations, where more than 62% of violations on average result in erroneous behaviors (e.g., unfairness) of downstream applications.

Article Search
How Does Simulation-Based Testing for Self-Driving Cars Match Human Perception?
Christian Birchler ORCID logo, Tanzil Kombarabettu Mohammed ORCID logo, Pooja Rani ORCID logo, Teodora Nechita ORCID logo, Timo Kehrer ORCID logo, and Sebastiano Panichella ORCID logo
(Zurich University of Applied Sciences, Switzerland; University of Bern, Switzerland; University of Zurich, Switzerland)
Software metrics such as coverage or mutation scores have been investigated for the automated quality assessment of test suites. While traditional tools rely on software metrics, the field of self-driving cars (SDCs) has primarily focused on simulation-based test case generation using quality metrics such as the out-of-bound (OOB) parameter to determine if a test case fails or passes. However, it remains unclear to what extent this quality metric aligns with the human perception of the safety and realism of SDCs. To address this (reality) gap, we conducted an empirical study involving 50 participants to investigate the factors that determine how humans perceive SDC test cases as safe, unsafe, realistic, or unrealistic. To this aim, we developed a framework leveraging virtual reality (VR) technologies, called SDC-Alabaster, to immerse the study participants into the virtual environment of SDC simulators. Our findings indicate that the human assessment of safety and realism of failing/passing test cases can vary based on different factors, such as the test's complexity and the possibility of interacting with the SDC. Especially for the assessment of realism, the participants' age leads to a different perception. This study highlights the need for more research on simulation testing quality metrics and the importance of human perception in evaluating SDCs.

Article Search Artifacts Available
Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM
Gabriel Ryan ORCID logo, Siddhartha Jain ORCID logo, Mingyue Shang ORCID logo, Shiqi Wang ORCID logo, Xiaofei Ma ORCID logo, Murali Krishna Ramanathan ORCID logo, and Baishakhi Ray ORCID logo
(Columbia University, USA; AWS AI Labs, USA)
Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work using large language models (LLMs) for test generation have focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but use fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result LLM-generated testsuites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt’s approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the testsuite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test, and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate on a benchmark challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.

Article Search
Enhancing Code Understanding for Impact Analysis by Combining Transformers and Program Dependence Graphs
Yanfu Yan ORCID logo, Nathan Cooper ORCID logo, Kevin Moran ORCID logo, Gabriele BavotaORCID logo, Denys PoshyvanykORCID logo, and Steve Rich ORCID logo
(William & Mary, USA; University of Central Florida, USA; USI Lugano, Lugano, Switzerland; Cisco Systems, USA)
Impact analysis (IA) is a critical software maintenance task that identifies the effects of a given set of code changes on a larger software project with the intention of avoiding potential adverse effects. IA is a cognitively challenging task that involves reasoning about the abstract relationships between various code constructs. Given its difficulty, researchers have worked to automate IA with approaches that primarily use coupling metrics as a measure of the ``connectedness'' of different parts of a software project. Many of these coupling metrics rely on static, dynamic, or evolutionary information and are based on heuristics that tend to be brittle, require expensive execution analysis, or large histories of co-changes to accurately estimate impact sets.
In this paper, we introduce a novel IA approach, called ATHENA, that combines a software system's dependence graph information with a conceptual coupling approach that uses advances in deep representation learning for code without the need for change histories and execution information. Previous IA benchmarks are small, containing less than ten software projects, and suffer from tangled commits, making it difficult to measure accurate results. Therefore, we constructed a large-scale IA benchmark, from 25 open-source software projects, that utilizes fine-grained commit information from bug fixes. On this new benchmark, our best performing approach configuration achieves an mRR, mAP, and HIT@10 score of 60.32%, 35.19%, and 81.48%, respectively. Through various ablations and qualitative analyses, we show that ATHENA's novel combination of program dependence graphs and conceptual coupling information leads it to outperform a simpler baseline by 10.34%, 9.55%, and 11.68% with statistical significance.

Article Search
RavenBuild: Context, Relevance, and Dependency Aware Build Outcome Prediction
Gengyi Sun ORCID logo, Sarra Habchi ORCID logo, and Shane McIntoshORCID logo
(University of Waterloo, Canada; Ubisoft Montréal, Canada)
Continuous Integration (CI) is a common practice adopted by modern software organizations. It plays an especially important role for large corporations like Ubisoft, where thousands of build jobs are submitted daily. Indeed, the cadence of development progress is constrained by the pace at which CI services process build jobs. To provide faster CI feedback, recent work explores how build outcomes can be anticipated. Although early results show plenty of promise, the distinct characteristics of Project X—a AAA video game project at Ubisoft, present new challenges for build outcome prediction. In the Project X setting, changes that do not modify source code also incur build failures. Moreover, we find that the code changes that have an impact that crosses the source-data boundary are more prone to build failures than code changes that do not impact data files. Since such changes are not fully characterized by the existing set of build outcome prediction features, state-of-the art models tend to underperform. Therefore, to accommodate the data context into build outcome prediction, we propose RavenBuild, a novel approach that leverages context, relevance, and dependency-aware features. We apply the state of-the-art BuildFast model and RavenBuild to Project X, and observe that RavenBuild improves the F1 score of the failing class by 46%, the recall of the failing class by 76%, and AUC by 28%. To ease adoption in settings with heterogeneous project sets, we also provide a simplified alternative RavenBuild-CR, which excludes dependency-aware features. We apply RavenBuild-CR on 22 open-source projects and Project X, and observe across-the-board improvements as well. On the other hand, we find that a naïve Parrot approach, which simply echoes the previous build outcome as its prediction, is surprisingly competitive with BuildFast and RavenBuild. Though Parrot fails to predict when the build outcome differs from their immediate predecessor, Parrot serves well as a tendency indicator of the sequences in build outcome datasets. Therefore, future studies should also consider comparing to the Parrot approach as a baseline when evaluating build outcome prediction models.

Article Search Artifacts Available
Towards Efficient Verification of Constant-Time Cryptographic Implementations
Luwei Cai ORCID logo, Fu SongORCID logo, and Taolue Chen ORCID logo
(ShanghaiTech University, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; Birkbeck University of London, United Kingdom)
Timing side-channel attacks exploit secret-dependent execution time to fully or partially recover secrets of cryptographic implementations, posing a severe threat to software security. Constant-time programming discipline is an effective software-based countermeasure against timing side-channel attacks, but developing constant-time implementations turns out to be challenging and error-prone. Current verification approaches/tools suffer from scalability and precision issues when applied to production software in practice. In this paper, we put forward practical verification approaches based on a novel synergy of taint analysis and safety verification of self-composed programs. Specifically, we first use an IFDS-based lightweight taint analysis to prove that a large number of potential (timing) side-channel sources do not actually leak secrets. We then resort to a precise taint analysis and a safety verification approach to determine whether the remaining potential side-channel sources can actually leak secrets. These include novel constructions of taint-directed semi-cross-product of the original program and its Boolean abstraction, and a taint-directed self-composition of the program. Our approach is implemented as a cross-platform and fully automated tool CT-Prover. The experiments confirm its efficiency and effectiveness in verifying real-world benchmarks from modern cryptographic and SSL/TLS libraries. In particular, CT-Prover identify new, confirmed vulnerabilities of open-source SSL libraries (e.g., Mbed SSL, BearSSL) and significantly outperforms the state-of-the-art tools.

Article Search Info Artifacts Available
Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions
Tao Xiao ORCID logo, Hideaki Hata ORCID logo, Christoph Treude ORCID logo, and Kenichi Matsumoto ORCID logo
(NAIST, Japan; Shinshu University, Japan; Singapore Management University, Singapore)
GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development process. Additionally, it offers a unique opportunity to observe how developers respond when they disagree with the generated content. In our study, we employ a mixed-methods approach, blending quantitative analysis with qualitative insights, to examine 18,256 PRs in which parts of the descriptions were crafted by generative AI. Our findings indicate that: (1) Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption. (2) PRs enhanced by Copilot for PRs require less review time and have a higher likelihood of being merged. (3) Developers using Copilot for PRs often complement the automated descriptions with their manual input. These results offer valuable insights into the growing integration of generative AI in software development.

Article Search Artifacts Available
AI-Assisted Code Authoring at Scale: Fine-Tuning, Deploying, and Mixed Methods Evaluation
Vijayaraghavan Murali ORCID logo, Chandra Maddila ORCID logo, Imad Ahmad ORCID logo, Michael Bolin ORCID logo, Daniel Cheng ORCID logo, Negar Ghorbani ORCID logo, Renuka Fernandez ORCID logo, Nachiappan Nagappan ORCID logo, and Peter C. Rigby ORCID logo
(Meta Platforms, USA; Meta, USA; Concordia University, Canada)
Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that addresses these challenges.
To release a LLM model at this scale, we needed to first ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of 1.4× and 4.1× over a model trained only on public data.
We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it with 8% of their code coming directly from CodeCompose.
To triangulate our numerical findings, we conduct a thematic analysis on the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.

Article Search
Improving the Learning of Code Review Successive Tasks with Cross-Task Knowledge Distillation
Oussama Ben Sghaier ORCID logo and Houari Sahraoui ORCID logo
(Université de Montréal, Canada)
Code review is a fundamental process in software development that plays a pivotal role in ensuring code quality and reducing the likelihood of errors and bugs. However, code review can be complex, subjective, and time-consuming. Quality estimation, comment generation, and code refinement constitute the three key tasks of this process, and their automation has traditionally been addressed separately in the literature using different approaches. In particular, recent efforts have focused on fine-tuning pre-trained language models to aid in code review tasks, with each task being considered in isolation. We believe that these tasks are interconnected, and their fine-tuning should consider this interconnection. In this paper, we introduce a novel deep-learning architecture, named DISCOREV, which employs cross-task knowledge distillation to address these tasks simultaneously. In our approach, we utilize a cascade of models to enhance both comment generation and code refinement models. The fine-tuning of the comment generation model is guided by the code refinement model, while the fine-tuning of the code refinement model is guided by the quality estimation model. We implement this guidance using two strategies: a feedback-based learning objective and an embedding alignment objective. We evaluate DISCOREV by comparing it to state-of-the-art methods based on independent training and fine-tuning. Our results show that our approach generates better review comments, as measured by the BLEU score, as well as more accurate code refinement according to the CodeBLEU score.

Article Search
Refactoring to Pythonic Idioms: A Hybrid Knowledge-Driven Approach Leveraging Large Language Models
Zejun Zhang ORCID logo, Zhenchang Xing ORCID logo, Xiaoxue Ren ORCID logo, Qinghua Lu ORCID logo, and Xiwei Xu ORCID logo
(Australian National University, Australia; CSIRO’s Data61, Australia; Zhejiang University, China)
Pythonic idioms are highly valued and widely used in the Python programming community. However, many Python users find it challenging to use Pythonic idioms. Adopting rule-based approach or LLM-only approach is not sufficient to overcome three persistent challenges of code idiomatization including code miss, wrong detection and wrong refactoring. Motivated by the determinism of rules and adaptability of LLMs, we propose a hybrid approach consisting of three modules. We not only write prompts to instruct LLMs to complete tasks, but we also invoke Analytic Rule Interfaces (ARIs) to accomplish tasks. The ARIs are Python code generated by prompting LLMs to generate code. We first construct a knowledge module with three elements including ASTscenario, ASTcomponent and Condition, and prompt LLMs to generate Python code for incorporation into an ARI library for subsequent use. After that, for any syntax-error-free Python code, we invoke ARIs from the ARI library to extract ASTcomponent from the ASTscenario, and then filter out ASTcomponent that does not meet the condition. Finally, we design prompts to instruct LLMs to abstract and idiomatize code, and then invoke ARIs from the ARI library to rewrite non-idiomatic code into the idiomatic code. Next, we conduct a comprehensive evaluation of our approach, RIdiom, and Prompt-LLM on nine established Pythonic idioms in RIdiom. Our approach exhibits superior accuracy, F1 score, and recall, while maintaining precision levels comparable to RIdiom, all of which consistently exceed or come close to 90% for each metric of each idiom. Lastly, we extend our evaluation to encompass four new Pythonic idioms. Our approach consistently outperforms Prompt-LLM, achieving metrics with values consistently exceeding 90% for accuracy, F1-score, precision, and recall.

Article Search
Java JIT Testing with Template Extraction
Zhiqiang Zang ORCID logo, Fu-Yao Yu ORCID logo, Aditya Thimmaiah ORCID logo, August ShiORCID logo, and Milos GligoricORCID logo
(University of Texas, Austin, USA)
We present LeJit, a template-based framework for testing Java just-in-time (JIT) compilers. Like recent template-based frameworks, LeJit executes a template---a program with holes to be filled---to generate concrete programs given as inputs to Java JIT compilers. LeJit automatically generates template programs from existing Java code by converting expressions to holes, as well as generating necessary glue code (i.e., code that generates instances of non-primitive types) to make generated templates executable. We have successfully used LeJit to test a range of popular Java JIT compilers, revealing five bugs in HotSpot, nine bugs in OpenJ9, and one bug in GraalVM. All of these bugs have been confirmed by Oracle and IBM developers, and 11 of these bugs were previously unknown, including two CVEs (Common Vulnerabilities and Exposures). Our comparison with several existing approaches shows that LeJit is complementary to them and is a powerful technique for ensuring Java JIT compiler correctness.

Article Search
BRF: Fuzzing the eBPF Runtime
Hsin-Wei Hung ORCID logo and Ardalan Amiri Sani ORCID logo
(University of California, Irvine, USA)
The eBPF technology in the Linux kernel has been widely adopted for different applications, such as networking, tracing, and security, thanks to the programmability it provides. By allowing user-supplied eBPF programs to be executed directly in the kernel, it greatly increases the flexibility and efficiency of deploying customized logic. However, eBPF also introduces a new and wide attack surface: malicious eBPF programs may try to exploit the vulnerabilities in the eBPF subsystem in the kernel. Fuzzing is a promising technique to find such vulnerabilities. Unfortunately, our experiments with the stateof-the-art kernel fuzzer, Syzkaller, show that it cannot effectively fuzz the eBPF runtime, those components that are in charge of executing an eBPF program, for two reasons. First, the eBPF verifier (which is tasked with verifying the safety of eBPF programs) rejects many fuzzing inputs because (1) they do not comply with its required semantics or (2) they miss some dependencies, i.e., other syscalls that need to be issued before the program is loaded. Second, Syzkaller fails to attach and trigger the execution of eBPF programs most of the times. This paper introduces the BPF Runtime Fuzzer (BRF), a fuzzer that can satisfy the semantics and dependencies required by the verifier and the eBPF subsystem. Our experiments show, in 48-hour fuzzing sessions, BRF can successfully execute 8× more eBPF programs compared to Syzkaller (and 32× more programs compared to Buzzer, an eBPF fuzzer released recently from Google). Moreover, eBPF programs generated by BRF are much more expressive than Syzkaller’s. As a result, BRF achieves 101% higher code coverage. Finally, BRF has so far managed to find 6 vulnerabilities (2 of them have been assigned CVE numbers) in the eBPF runtime, proving its effectiveness.

Article Search
DTD: Comprehensive and Scalable Testing for Debuggers
Hongyi Lu ORCID logo, Zhibo Liu ORCID logo, Shuai Wang ORCID logo, and Fengwei Zhang ORCID logo
(Southern University of Science and Technology, China; Hong Kong University of Science and Technology, China)
As a powerful tool for developers, interactive debuggers help locate and fix errors in software. By using debugging information included in binaries, debuggers can retrieve necessary program states about the program. Unlike printf-style debugging, debuggers allow for more flexible inspection and modification of program execution states. However, debuggers may incorrectly retrieve and interpret program execution, causing confusion and hindering the debugging process.
Despite the wide usage of interactive debuggers, a scalable and comprehensive measurement of their functionality correctness does not exist yet. Existing works either fall short in scalability or focus more on the “compiler-side” defects instead of debugger bugs. To facilitate a better assessment of debugger correctness, we first propose and advocate a set of debugger testing criteria, covering both comprehensiveness (in terms of debug information covered) and scalability (in terms of testing overhead). Moreover, we design comparative experiments to show that fulfilling these criteria is not only theoretically appealing, but also brings major improvement to debugger testing. Furthermore, based on these criteria, we present DTD, a differential testing (DT) framework for detecting bugs in interactive debuggers. DTD compares the behaviors of two mainstream debuggers when processing an identical C executable — discrepancies indicate bugs in one of the two debuggers.
DTD leverages a novel heuristic method to avoid the repetitive structures (e.g., loops) that exist in C programs, which facilitates DTD to achieve full debug information coverage efficiently. Moreover, we have also designed a Temporal Differential Filtering method to practically filter out the false positives caused by the uninitialized variables in common C programs. With these carefully designed techniques, DTD fulfills our proposed testing requirements and, therefore, achieves high scalability and testing comprehensiveness. For the first time, it offers large-scale testing for C debuggers to detect debugger behavior discrepancies when inspecting millions of program states. An empirical comparison shows that DTD finds 17× more error-triggering cases and detects 5× more bugs than the state-of-the-art debugger testing technique. We have used DTD to detect 13 bugs in the LLVM toolchain (Clang/LLDB) and 5 bugs in the GNU toolchain (GCC/GDB). One of our fixes has already landed in the latest LLDB development branch.

Article Search Info
PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models
Simin Chen ORCID logo, XiaoNing Feng ORCID logo, Xiaohong Han ORCID logo, Cong Liu ORCID logo, and Wei YangORCID logo
(University of Texas, Dallas, USA; Taiyuan University of Technology, China; University of California, Riverside, USA)
In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Within the surge of LCGM proposals, a critical aspect of code generation research involves effectively benchmarking the programming capabilities of models. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, %both these methods exhibit major limitations. %Firstly, manually-based methods require substantial human effort and are not easily scalable. Moreover, programming problem sets created manually struggle to maintain long-term data integrity due to the greedy training data collection mechanism in LCGMs. On the other hand, perturbation-based approaches primarily produce semantically homogeneous problems, resulting in generated programming problems with identical Canonical Solutions to the seed problem. These methods also tend to introduce typos to the prompt, easily detectable by IDEs, rendering them unrealistic. manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs' potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDE, making them ineffective and unrealistic. Addressing the aforementioned limitations presents several challenges: (1) How to automatically generate semantically diverse Canonical Solutions to enable comprehensive benchmarking on the models, (2) how to ensure long-term data integrity to prevent data contamination, and (3) how to generate natural and realistic programming problems. To tackle the first challenge, we draw key insights from viewing a program as a series of mappings from the input to the output domain. These mappings can be transformed, split, reordered, or merged to construct new programs. Based on this insight, we propose programming problem merging, where two existing programming problems are combined to create new ones. In addressing the second challenge, we incorporate randomness to our programming problem-generation process. Our tool can probabilistically guarantee no data repetition across two random trials. To tackle the third challenge, we propose the concept of a Lambda Programming Problem, comprising a concise one-sentence task description in natural language accompanied by a corresponding program implementation. Our tool ensures the program prompt is grammatically correct. Additionally, the tool leverages return value type analysis to verify the correctness of newly created Canonical Solutions. In our empirical evaluation, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, comparing to the baselines.

Article Search
Metamorphic Testing of Secure Multi-party Computation (MPC) Compilers
Yichen Li ORCID logo, Dongwei Xiao ORCID logo, Zhibo Liu ORCID logo, Qi Pang ORCID logo, and Shuai Wang ORCID logo
(Hong Kong University of Science and Technology, China; Carnegie Mellon University, USA)
The demanding need to perform privacy-preserving computations among multiple data owners has led to the prosperous development of secure multi-party computation (MPC) protocols. MPC offers protocols for parties to jointly compute a function over their inputs while keeping those inputs private. To date, MPC has been widely adopted in various real-world, privacy-sensitive sectors, such as healthcare and finance. Moreover, to ease the adoption of MPC, industrial and academic MPC compilers have been developed to automatically translate programs describing arbitrary MPC procedures into low-level MPC executables.
Compiling high-level descriptions into high-efficiency MPC executables is challenging: the compilation often involves converting high-level languages into several intermediate representations (IR), e.g., arithmetic or boolean circuits, optimizing the computation/communication cost, and picking proper MPC protocols (and underlying virtual machines) for a particular task and threat model. Various optimizations and heuristics are employed during the compilation procedure to improve the efficiency of the generated MPC executables.
Despite the prosperous adoption of MPC compilers by industrial vendors and academia, a principled and systematic understanding of the correctness of MPC compilers does not yet exist. To fill this critical gap, this paper introduces , a metamorphic testing (MT) framework specifically designed for MPC compilers to effectively uncover erroneous compilations. Our approach proposes three metamorphic relations (MRs) that are tailored for MPC programs to mutate high-level MPC programs (compiler inputs). We then examine if MPC compilers yield semantics-equivalent MPC executables regarding the original and mutated MPC programs by comparing their execution results.
Real-world MPC compilers exhibit a high level of engineering quality. Nevertheless, we detected 4,772 inputs that can result in erroneous compilations in three popular MPC compilers available on the market. While the discovered error-triggering inputs do not cause the MPC compilers to crash directly, they can lead to the generation of incorrect MPC executables, jeopardizing the underlying dependability of the computation. With substantial manual effort and help from the MPC compiler developers, we uncovered thirteen bugs in these MPC compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve MPC compilers.

Article Search
Understanding the Impact of APIs Behavioral Breaking Changes on Client Applications
Dhanushka Jayasuriya ORCID logo, Valerio Terragni ORCID logo, Jens Dietrich ORCID logo, and Kelly Blincoe ORCID logo
(University of Auckland, New Zealand; Victoria University of Wellington, New Zealand)
Libraries play a significant role in software development as they provide reusable functionality, which helps expedite the development process. As libraries evolve, they release new versions with optimisations like new functionality, bug fixes, and patches for known security vulnerabilities. To obtain these optimisations, the client applications that depend on these libraries must update to use the latest version. However, this can cause software failures in the clients if the update includes breaking changes. These breaking changes can be divided into syntactic and semantic (behavioral) breaking changes. While there has been considerable research on syntactic breaking changes introduced between library updates and their impact on client projects, there is a notable lack of research regarding behavioral breaking changes introduced during library updates and their impacts on clients. We conducted an empirical analysis to identify the impact behavioral breaking changes have on clients by examining the impact of dependency updates on client test suites. We examined a set of java projects built using Maven, which included 30,548 dependencies under 8,086 Maven artifacts. We automatically updated out-of-date dependencies and ran the client test suites. We found that 2.30% of these updates had behavioral breaking changes that impacted client tests. Our results show that most breaking changes were introduced during a non-Major dependency update, violating the semantic versioning scheme. We further analyzed the effects these behavioral breaking changes have on client tests. We present a taxonomy of effects related to these changes, which we broadly categorize as Test Failures and Test Errors. Our results further indicate that the library developers did not adequately document the exceptions thrown due to precondition violations.

Article Search

proc time: 1.42