FSE 2024
Proceedings of the ACM on Software Engineering, Volume 1, Number FSE

Proceedings of the ACM on Software Engineering, Volume 1, Number FSE, July 15–19, 2024, Porto de Galinhas, Brazil

FSE – Journal Issue



Title Page

Editorial Message of the PACMSE Editor in Chief
Welcome to the Proceedings of the ACM on Software Engineering (PACMSE).

Editorial Message

FSE 2024 Sponsors and Supporters
Sponsors and Supporters


JIT-Smart: A Multi-task Learning Framework for Just-in-Time Defect Prediction and Localization
Xiangping Chen, Furen Xu, Yuan Huang, Neng Zhang, and Zibin Zheng
(Sun Yat-sen University, Guangzhou, China)
Just-in-time defect prediction (JIT-DP) predicts the defect-proneness of a commit, and just-in-time defect localization (JIT-DL) locates the exact buggy positions (defective lines) in a commit. Various JIT-DP and JIT-DL techniques have recently been proposed, but most of them take a post-mortem approach (e.g., code entropy, attention weight, LIME) to achieve the JIT-DL goal based on the prediction results of JIT-DP. These methods do not utilize the label information of the defective code lines during model building. In this paper, we propose a unified model, JIT-Smart, which makes the training of the just-in-time defect prediction and localization tasks a mutually reinforcing multi-task learning process. Specifically, we design a novel defect localization network (DLN), which explicitly introduces the label information of defective code lines for supervised learning in JIT-DL while accounting for the class imbalance issue. To further investigate the accuracy and cost-effectiveness of JIT-Smart, we compare JIT-Smart with 7 state-of-the-art baselines under 5 commit-level and 5 line-level evaluation metrics in JIT-DP and JIT-DL. The results demonstrate that JIT-Smart is statistically better than all the state-of-the-art baselines in JIT-DP and JIT-DL. In JIT-DP, at the median value, JIT-Smart achieves an F1-Score of 0.475, AUC of 0.886, Recall@20%Effort of 0.823, Effort@20%Recall of 0.01, and P_opt of 0.942, and improves on the baselines by 19.89%-702.74%, 1.23%-31.34%, 9.44%-33.16%, 21.6%-53.82%, and 1.94%-34.89%, respectively. In JIT-DL, at the median value, JIT-Smart achieves Top-5 Accuracy of 0.539, Top-10 Accuracy of 0.396, Recall@20%Effort_line of 0.726, Effort@20%Recall_line of 0.087, and IFA_line of 0.098, and improves on the baselines by 101.83%-178.35%, 101.01%-277.31%, 257.88%-404.63%, 71.91%-74.31%, and 99.11%-99.41%, respectively. Statistical analysis shows that JIT-Smart performs more stably than the best-performing model. In addition, JIT-Smart achieves the best performance compared with the state-of-the-art baselines in cross-project evaluation.
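
The abstract reports effort-aware metrics such as Recall@20%Effort without defining them. The sketch below shows one common reading of that metric: the share of defective commits found when a reviewer inspects commits in descending order of predicted score until 20% of all changed lines are spent. The function name, the ranking by raw score (rather than, say, score per line), and the rule of stopping before the budget is exceeded are assumptions for illustration, not necessarily the exact definition used in the paper.

```python
from typing import Sequence


def recall_at_effort(
    scores: Sequence[float],
    sizes: Sequence[int],
    labels: Sequence[int],
    effort: float = 0.20,
) -> float:
    """Share of defective commits found within `effort` of the total changed lines.

    scores: predicted defect-proneness per commit (higher means riskier)
    sizes:  changed lines of code per commit (the inspection cost)
    labels: 1 if the commit is actually defective, else 0
    """
    total_loc = sum(sizes)
    total_defects = sum(labels)
    if total_loc == 0 or total_defects == 0:
        return 0.0

    budget = effort * total_loc
    # Inspect the riskiest commits first (assumed ranking rule).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    spent = 0
    found = 0
    for i in order:
        if spent + sizes[i] > budget:
            break  # the next commit would exceed the inspection budget
        spent += sizes[i]
        found += labels[i]
    return found / total_defects


if __name__ == "__main__":
    # Four commits, 100 changed lines in total, so the budget is 20 lines.
    print(recall_at_effort([0.9, 0.8, 0.3, 0.1], [10, 5, 40, 45], [1, 0, 1, 0]))  # 0.5
```

In the usage example, the two top-ranked commits cost 15 lines and contain one of the two defective commits, so the metric is 0.5.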

Publisher's Version
ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems
Guangba Yu, Pengfei Chen, Zilong He, Qiuyu Yan, Yu Luo, Fangyuan Li, and Zibin Zheng
(Sun Yat-sen University, Guangzhou, China; Tencent, China)
In large-scale online service systems, the occurrence of software changes is inevitable and frequent. Despite rigorous pre-deployment testing practices, the presence of defective software changes in the online environment cannot be completely eliminated. Consequently, there is a pressing need for automated techniques that can effectively identify these defective changes. However, the current abnormal change detection (ACD) approaches fall short in accurately pinpointing defective changes, primarily due to their disregard for the propagation of faults. To address the limitations of ACD, we propose a novel concept called root cause change analysis (RCCA) to identify the underlying root causes of change-inducing incidents. In order to apply the RCCA concept to practical scenarios, we have devised an intelligent RCCA framework named ChangeRCA. This framework aims to localize the defective change associated with change-inducing incidents among multiple changes. To assess the effectiveness of ChangeRCA, we have conducted an extensive evaluation utilizing a real-world dataset from WeChat and a simulated dataset encompassing 81 diverse defective changes. The evaluation results demonstrate that ChangeRCA outperforms the state-of-the-art ACD approaches, achieving an impressive Top-1 Hit Rate of 85% and significantly reducing the time required to identify defective changes.

Publisher's Version
Bin2Summary: Beyond Function Name Prediction in Stripped Binaries with Functionality-Specific Code Embeddings
Zirui Song, Jiongyi Chen, and Kehuan Zhang
(Chinese University of Hong Kong, Hong Kong; National University of Defense Technology, China)
Closed-source software distributed only as stripped binaries still dominates the ecosystem, which hinders understanding the functionality of the software and conducting further security analysis. To meet this need, research has traditionally focused on predicting function names, which can only provide fragmented and abbreviated information about functionality. To advance the state of the art, this paper presents Bin2Summary, which automatically summarizes the functionality of functions in stripped binaries with natural language sentences. Specifically, the proposed framework includes a functionality-specific code embedding module to facilitate fine-grained similarity detection and an attention-based seq2seq model to generate summaries in natural language. Based on 16 widely used projects (e.g., Coreutils), we have evaluated Bin2Summary on 38,167 functions with high-quality comments, filtered from 162,406 functions. Bin2Summary achieves 0.728 in precision and 0.729 in recall on our datasets, and the functionality-specific embedding module can improve the existing assembly language model by up to 109.5% and 109.9% in precision and recall, respectively. Meanwhile, the experiments demonstrate that Bin2Summary has outstanding transferability in analyzing cross-architecture (i.e., x64 and x86) and cross-environment (i.e., Cygwin and MSYS2) binaries. Finally, a case study illustrates how Bin2Summary outperforms existing work in providing functionality summaries with abundant semantics beyond function names.

Publisher's Version
Component Security Ten Years Later: An Empirical Study of Cross-Layer Threats in Real-World Mobile Applications
Keke Lian, Lei Zhang, Guangliang Yang, Shuo Mao, Xinjie Wang, Yuan Zhang, and Min Yang
(Fudan University, China)
Nowadays, mobile apps have greatly facilitated our daily work and lives. They are often designed to work closely and interact with each other through app components for data and functionality sharing. The security of app components has been extensively studied and various component attacks have been proposed. Meanwhile, Android system vendors and app developers have introduced a series of defense measures to mitigate these security threats. However, we have discovered that as apps evolve and develop, existing app component defenses have become inadequate to address the emerging security requirements. This latency in adaptation has given rise to the feasibility of cross-layer exploitation, where attackers can indirectly manipulate app internal functionalities by polluting their dependent data. To assess the security risks of cross-layer exploitation in real-world apps, we design and implement a novel vulnerability analysis approach, called CLDroid, which addresses two non-trivial challenges. Our experiments revealed that 1,215 (8.8%) popular apps are potentially vulnerable to cross-layer exploitation, with a total of more than 18 billion installs. We verified that through cross-layer exploitation, an unprivileged app could achieve various severe security consequences, such as arbitrary code execution, click hijacking, content spoofing, and persistent DoS. We ethically reported verified vulnerabilities to the developers, who acknowledged and rewarded us with bug bounties. As a result, 56 CVE IDs have been assigned, with 22 of them rated as ‘critical’ or ‘high’ severity.

Publisher's Version
Characterizing Python Library Migrations
Mohayeminul Islam, Ajay Kumar Jha, Ildar Akhmetov, and Sarah Nadi
(University of Alberta, Canada; North Dakota State University, USA; Northeastern University, Canada)
Developers heavily rely on Application Programming Interfaces (APIs) from libraries to build their software. As software evolves, developers may need to replace the used libraries with alternate libraries, a process known as library migration. Doing this manually can be tedious, time-consuming, and prone to errors. Automated migration techniques can help alleviate some of this burden. However, designing effective automated migration techniques requires understanding the types of code changes required to transform client code that used the old library to the new library. This paper contributes an empirical study that provides a holistic view of Python library migrations, both in terms of the code changes required in a migration and the typical development effort involved. We manually label 3,096 migration-related code changes in 335 Python library migrations from 311 client repositories spanning 141 library pairs from 35 domains. Based on our labeled data, we derive a taxonomy for describing migration-related code changes, PyMigTax. Leveraging PyMigTax and our labeled data, we investigate various characteristics of Python library migrations, such as the types of program elements and properties of API mappings, the combinations of types of migration-related code changes in a migration, and the typical development effort required for a migration. Our findings highlight various potential shortcomings of current library migration tools. For example, we find that 40% of library pairs have API mappings that involve non-function program elements, while most library migration techniques typically assume that function calls from the source library will map into (one or more) function calls from the target library. As an approximation for the development effort involved, we find that, on average, a developer needs to learn about 4 APIs and 2 API mappings to perform a migration and change 8 lines of code. 
However, we also found cases of migrations that involve up to 43 unique APIs, 22 API mappings, and 758 lines of code, making them harder to manually implement. Overall, our contributions provide the necessary knowledge and foundations for developing automated Python library migration techniques. We make all data and scripts related to this study publicly available at https://doi.org/10.6084/m9.figshare.24216858.v2.

Publisher's Version Published Artifact Artifacts Available
EyeTrans: Merging Human and Machine Attention for Neural Code Summarization
Yifan Zhang, Jiliang Li, Zachary Karas, Aakash Bansal, Toby Jia-Jun Li, Collin McMillan, Kevin Leach, and Yu Huang
(Vanderbilt University, USA; University of Notre Dame, USA)
Neural code summarization leverages deep learning models to automatically generate brief natural language summaries of code snippets. The development of Transformer models has led to extensive use of attention during model design. While existing work has primarily and almost exclusively focused on static properties of source code and related structural representations like the Abstract Syntax Tree (AST), few studies have considered human attention — that is, where programmers focus while examining and comprehending code. In this paper, we develop a method for incorporating human attention into machine attention to enhance neural code summarization. To facilitate this incorporation and vindicate this hypothesis, we introduce EyeTrans, which consists of three steps: (1) we conduct an extensive eye-tracking human study to collect and pre-analyze data for model training, (2) we devise a data-centric approach to integrate human attention with machine attention in the Transformer architecture, and (3) we conduct comprehensive experiments on two code summarization tasks to demonstrate the effectiveness of incorporating human attention into Transformers. Integrating human attention leads to an improvement of up to 29.91% in Functional Summarization and up to 6.39% in General Code Summarization performance, demonstrating the substantial benefits of this combination. We further explore performance in terms of robustness and efficiency by creating challenging summarization scenarios in which EyeTrans exhibits interesting properties. We also visualize the attention map to depict the simplifying effect of machine attention in the Transformer by incorporating human attention. This work has the potential to propel AI research in software engineering by introducing more human-centered approaches and data.

Publisher's Version Published Artifact Artifacts Available
LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong, China; Sun Yat-sen University, Zhuhai, China; Chinese University of Hong Kong, Shenzhen, China)
Log parsing transforms log messages into structured formats, serving as the prerequisite step for various log analysis tasks. Although a variety of log parsing approaches have been proposed, their performance on complicated log data remains compromised due to the use of human-crafted rules or learning-based models with limited training data. The recent emergence of powerful large language models (LLMs) demonstrates their vast pre-trained knowledge related to code and logging, making it promising to apply LLMs for log parsing. However, their lack of specialized log parsing capabilities currently hinders their parsing accuracy. Moreover, the inherent inconsistent answers, as well as the substantial overhead, prevent the practical adoption of LLM-based log parsing. To address these challenges, we propose LILAC, the first practical Log parsIng framework using LLMs with Adaptive parsing Cache. To facilitate accurate and robust log parsing, LILAC leverages the in-context learning (ICL) capability of the LLM by performing a hierarchical candidate sampling algorithm and selecting high-quality demonstrations. Furthermore, LILAC incorporates a novel component, an adaptive parsing cache, to store and refine the templates generated by the LLM. It helps mitigate LLM's inefficiency issue by enabling rapid retrieval of previously processed log templates. In this process, LILAC adaptively updates the templates within the parsing cache to ensure the consistency of parsed results. The extensive evaluation on public large-scale datasets shows that LILAC outperforms state-of-the-art methods by 69.5% in terms of the average F1 score of template accuracy. In addition, LILAC reduces the query times to LLMs by several orders of magnitude, achieving a comparable efficiency to the fastest baseline.
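
The caching idea in this abstract can be illustrated with a minimal sketch: templates returned by the LLM are stored, and later messages that match a stored template are answered from the cache without another query. Everything here is an assumption for illustration: the `ParsingCache` class, the `<*>` wildcard convention, the linear regex scan, and the `fake_llm` stand-in are hypothetical, and the adaptive template refinement and candidate sampling described in the abstract are not reproduced.

```python
import re
from typing import Callable, Dict, Optional, Pattern


class ParsingCache:
    """Toy template cache placed in front of an (expensive) template generator."""

    def __init__(self, query_llm: Callable[[str], str]) -> None:
        self._query_llm = query_llm          # hypothetical LLM call: message -> template
        self._templates: Dict[str, Pattern[str]] = {}
        self.llm_calls = 0                   # counts how often the cache missed

    @staticmethod
    def _compile(template: str) -> Pattern[str]:
        # Turn "Connected to port <*>" into a regex where <*> matches any text.
        parts = [re.escape(part) for part in template.split("<*>")]
        return re.compile("^" + "(.+?)".join(parts) + "$")

    def lookup(self, message: str) -> Optional[str]:
        for template, pattern in self._templates.items():
            if pattern.match(message):
                return template
        return None

    def parse(self, message: str) -> str:
        cached = self.lookup(message)
        if cached is not None:
            return cached                    # cache hit: no LLM query needed
        self.llm_calls += 1
        template = self._query_llm(message)
        self._templates[template] = self._compile(template)
        return template


def fake_llm(message: str) -> str:
    """Stand-in for a real LLM: treats every run of digits as a variable."""
    return re.sub(r"\d+", "<*>", message)


if __name__ == "__main__":
    cache = ParsingCache(fake_llm)
    for line in ["Connected to port 8080", "Connected to port 22", "Disk full"]:
        print(cache.parse(line))
    print("LLM calls:", cache.llm_calls)  # 2 queries for 3 messages
```

A linear scan over compiled patterns keeps the sketch short; a production cache would need an index to stay fast as the number of templates grows.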

Publisher's Version Info
Efficiently Detecting Reentrancy Vulnerabilities in Complex Smart Contracts
Zexu Wang, Jiachi Chen, Yanlin Wang, Yu Zhang, Weizhe Zhang, and Zibin Zheng
(Sun Yat-sen University, Zhuhai, China; Peng Cheng Laboratory, Shenzhen, China; Harbin Institute of Technology, China; GuangDong Engineering Technology Research Center of Blockchain, Zhuhai, China)
Reentrancy, one of the most notorious vulnerabilities, has been a prominent topic in smart contract security research. Research shows that existing vulnerability detection faces a range of challenges, especially as smart contracts continue to increase in complexity. Existing tools perform poorly in terms of efficiency and successful detection rates for vulnerabilities in complex contracts.
To effectively detect reentrancy vulnerabilities in contracts with complex logic, we propose a tool named SliSE. SliSE’s detection process consists of two stages: Warning Search and Symbolic Execution Verification. In Stage 1, SliSE utilizes program slicing to analyze the Inter-contract Program Dependency Graph (I-PDG) of the contract, and collects suspicious vulnerability information as warnings. In Stage 2, symbolic execution is employed to verify the reachability of these warnings, thereby enhancing vulnerability detection accuracy. SliSE obtained the best performance compared with eight state-of-the-art detection tools. It achieved an F1 score of 78.65%, surpassing the highest score recorded by an existing tool (9.26%). Additionally, it attained a recall rate exceeding 90% for detection of contracts on Ethereum. Overall, SliSE provides a robust and efficient method for detecting reentrancy vulnerabilities in complex contracts.

Publisher's Version
IRCoCo: Immediate Rewards-Guided Deep Reinforcement Learning for Code Completion
Bolun Li, Zhihong Sun, Tao Huang, Hongyu Zhang, Yao Wan, Ge Li, Zhi Jin, and Chen Lyu
(Shandong Normal University, China; Chongqing University, China; Huazhong University of Science and Technology, China; Peking University, China)
Code completion aims to enhance programming productivity by predicting potential code based on the current programming context. Recently, pre-trained language models (LMs) have become prominent in this field. Various approaches have been proposed to fine-tune LMs using supervised fine-tuning (SFT) techniques for code completion. However, the inherent exposure bias of these models can cause errors to accumulate early in the sequence completion, leading to even more errors in subsequent completions. To address this problem, deep reinforcement learning (DRL) is an alternative technique for fine-tuning LMs for code completion, which can improve the generalization capabilities and overall performance. Nevertheless, integrating DRL-based strategies into code completion faces two major challenges: 1) The dynamic nature of the code context requires the completion model to quickly adapt to changes, which poses difficulties for conventional DRL strategies that focus on delayed rewarding of the final code state. 2) It is difficult to evaluate the correctness of partial code, thus the reward redistribution-based strategies cannot be adapted to code completion. To tackle these challenges, we propose IRCoCo, a code completion-specific DRL-based fine-tuning framework. This framework is designed to provide immediate rewards as feedback for detecting dynamic context changes arising from continuous edits during code completion. With the aid of immediate feedback, the fine-tuned LM can gain a more precise understanding of the current context, thereby enabling effective adjustment of the LM and optimizing code completion in a more refined manner. Experimental results demonstrate that fine-tuning pre-trained LMs with IRCoCo leads to significant improvements in the code completion task, outperforming both SFT-based and other DRL-based baselines.

Publisher's Version
Shadows in the Interface: A Comprehensive Study on Dark Patterns
Liming Nie, Yangyang Zhao, Chenglin Li, Xuqiong Luo, and Yang Liu
(Shenzhen Technology University, China; Zhejiang Sci-Tech University, China; Changsha University of Science and Technology, China; Nanyang Technological University, Singapore)
As digital interfaces become increasingly prevalent, a series of ethical issues have surfaced, with dark patterns emerging as a key research focus. These manipulative design strategies are widely employed in User Interfaces (UI) with the primary aim of steering user behavior in favor of service providers, often at the expense of the users themselves. This paper aims to address three main challenges in the study of dark patterns: inconsistencies and incompleteness in classification, limitations of detection tools, and inadequacies in data comprehensiveness. In this paper, we introduce a comprehensive framework, called the Dark Pattern Analysis Framework (DPAF). Using this framework, we construct a comprehensive taxonomy of dark patterns, encompassing 64 types, each labeled with its impact on users and the likely scenarios in which it appears, validated through an industry survey. When assessing the capabilities of the detection tools and the completeness of the dataset, we find that of all dark patterns, the five detection tools can only identify 32, yielding a coverage rate of merely 50%. Although the four existing datasets collectively contain 5,566 instances, they cover only 32 of all types of dark patterns, also resulting in a total coverage rate of 50%. The results discussed above suggest that there is still significant room for advancement in the field of dark pattern detection. Through this research, we not only deepen our understanding of dark pattern classification and detection tools, but also offer valuable insights for future research and practice in this field.

Publisher's Version
ProveNFix: Temporal Property-Guided Program Repair
Yahui Song, Xiang Gao, Wenhua Li, Wei-Ngan Chin, and Abhik Roychoudhury
(National University of Singapore, Singapore; Beihang University, China)
Model checking has traditionally been used for finding violations of temporal properties. Recently, testing or fuzzing approaches have also been applied to software systems to find temporal property violations. However, model checking suffers from state explosion, while fuzzing can only partially cover program paths. Moreover, once a violation is found, the fix for the temporal error is usually manual. In this work, we develop the first compositional static analyzer for temporal properties, and the analyzer supports a proof-based repair strategy to fix temporal bugs automatically. To enable a more flexible specification style for temporal properties, on top of the classic pre/post-conditions, we allow users to write a future-condition to modularly express the expected behaviors after the function call. Instead of requiring users to write specifications for each procedure, our approach automatically infers the procedure’s specification according to user-supplied specifications for a small number of primitive APIs. We further devise a term rewriting system to check the actual behaviors against the inferred specification. Our method supports the analysis of 1) memory usage bugs, 2) unchecked return values, 3) resource leaks, etc., with annotated specifications for 17 primitive APIs, and detects 515 vulnerabilities in over 1 million lines of code across ten real-world C projects. Intuitively, the benefit of our approach is that a small set of properties can be specified once and used to analyze/repair a large number of programs. Experimental results show that our tool, ProveNFix, detects 72.2% more true alarms than the latest release of the Infer static analyzer. Moreover, we show the effectiveness of our repair strategy when compared to other state-of-the-art systems: it fixes 5% more memory leaks than SAVER and 40% more resource leaks than FootPatch, and achieves a 90% fix rate for null pointer dereferences.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
SmartAxe: Detecting Cross-Chain Vulnerabilities in Bridge Smart Contracts via Fine-Grained Static Analysis
Zeqin Liao, Yuhong Nan, Henglong Liang, Sicheng Hao, Juan Zhai, Jiajing Wu, and Zibin Zheng
(Sun Yat-sen University, Zhuhai, China; Sun Yat-sen University, Guangzhou, China; University of Massachusetts, Amherst, USA; GuangDong Engineering Technology Research Center of Blockchain, Zhuhai, China)
With the increasing popularity of blockchain, different blockchain platforms coexist in the ecosystem (e.g., Ethereum, BNB, EOSIO, etc.), which creates a high demand for cross-chain communication. A cross-chain bridge is a specific type of decentralized application for asset exchange across different blockchain platforms. Securing the smart contracts of cross-chain bridges is urgently needed, as a number of recent security incidents with heavy financial losses were caused by vulnerabilities in bridge smart contracts, which we call Cross-Chain Vulnerabilities (CCVs). However, automatically identifying CCVs in smart contracts poses several unique challenges. Particularly, it is non-trivial to (1) identify application-specific access control constraints needed for cross-bridge asset exchange, and (2) identify inconsistent cross-chain semantics between the two sides of the bridge. In this paper, we propose SmartAxe, a new framework to identify vulnerabilities in cross-chain bridge smart contracts. Particularly, to locate vulnerable functions that have access control incompleteness, SmartAxe models the heterogeneous implementations of access control and finds necessary security checks in smart contracts through probabilistic pattern inference. In addition, SmartAxe constructs a cross-chain control-flow graph (xCFG) and data-flow graph (xDFG), which help to find semantic inconsistency during cross-chain data communication. To evaluate SmartAxe, we collect and label a dataset of 88 CCVs from cross-chain bridge contracts that suffered real attacks. Evaluation results show that SmartAxe achieves a precision of 84.95% and a recall of 89.77%. In addition, SmartAxe successfully identifies 232 new/unknown CCVs from 129 real-world cross-chain bridge applications (i.e., from 1,703 smart contracts). These identified CCVs affect digital assets worth a total of 1,885,250 USD.

Publisher's Version
Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning
Aashish Yadavally, Yi Li, and Tien N. Nguyen
(University of Texas at Dallas, USA)
Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls.
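
For readers unfamiliar with the term, the sketch below computes a classical backward slice over a straight-line program using data dependencies only. It is a textbook-style illustration, not ND-Slicer's learned, execution-aware approach: the `Statement` encoding and the `backward_slice` function are hypothetical, and control dependencies, branches, and loops are deliberately left out.

```python
from typing import Iterable, List, Set, Tuple

# Each statement is modelled as (variable it defines, variables it uses).
Statement = Tuple[str, Set[str]]


def backward_slice(
    program: List[Statement],
    criterion_line: int,
    criterion_vars: Iterable[str],
) -> List[int]:
    """Return the line indices that influence `criterion_vars` at `criterion_line`.

    Only data dependencies in straight-line code are tracked.
    """
    relevant = set(criterion_vars)
    slice_lines: List[int] = []
    # Walk backwards from the criterion towards the start of the program.
    for line in range(criterion_line, -1, -1):
        defined, used = program[line]
        if defined in relevant:
            slice_lines.append(line)
            # This definition satisfies the need for `defined`;
            # its operands become the new variables of interest.
            relevant.discard(defined)
            relevant |= used
    return sorted(slice_lines)


if __name__ == "__main__":
    program: List[Statement] = [
        ("a", set()),     # 0: a = read_input()
        ("b", set()),     # 1: b = 2
        ("c", {"a"}),     # 2: c = a + 1
        ("d", {"b"}),     # 3: d = b * 2
        ("e", {"c"}),     # 4: e = c * c
    ]
    print(backward_slice(program, 4, {"e"}))  # [0, 2, 4]
```

Lines 1 and 3 are excluded because neither `b` nor `d` flows into `e`, which is the kind of pruning that makes slices useful for debugging.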

Publisher's Version
PBE-Based Selective Abstraction and Refinement for Efficient Property Falsification of Embedded Software
Yoel Kim and Yunja Choi
(Kyungpook National University, South Korea)
Comprehensive verification/falsification of embedded software is challenging and often impossible mainly due to the typical characteristics of embedded software, such as the use of global variables, reactive behaviors, and its (soft or hard) real-time requirements, to name but a few. Abstraction is one of the major solutions to this problem, but existing proven abstraction techniques are not effective in this domain as they are uniformly applied to the entire program and often require a large number of refinements to find true alarms. This work proposes a domain-specific solution for efficient property falsification based on the observation that embedded software typically consists of a number of user-defined auxiliary functions, many of which may be loosely coupled with the main control logic. Our approach selectively abstracts auxiliary functions using function summaries synthesized by Programming-By-Example (PBE), which reduces falsification complexity as well as the number of refinements. The drawbacks of using PBE-based function summaries, which are neither sound nor complete, for abstraction are counteracted by symbolic alarm filtering and novel PBE-based refinements for function summaries. We demonstrate that the proposed approach has comparable performance to the state-of-the-art model checkers on SV-COMP benchmark programs and outperforms them on a set of typical embedded software in terms of both falsification efficiency and scalability.

Publisher's Version Published Artifact Artifacts Available
Evaluating Directed Fuzzers: Are We Heading in the Right Direction?
Tae Eun Kim, Jaeseung Choi, Seongjae Im, Kihong Heo, and Sang Kil Cha
(KAIST, South Korea; Sogang University, South Korea)
Directed fuzzing recently has gained significant attention due to its ability to reconstruct proof-of-concept (PoC) test cases for target code such as buggy lines or functions. Surprisingly, however, there has been no in-depth study on the way to properly evaluate directed fuzzers despite much progress in the field. In this paper, we present the first systematic study on the evaluation of directed fuzzers. In particular, we analyze common pitfalls in evaluating directed fuzzers with extensive experiments on five state-of-the-art tools, which amount to 30 CPU-years of computational effort, in order to confirm that different choices made at each step of the evaluation process can significantly impact the results. For example, we find that a small change in the crash triage logic can substantially affect the measured performance of a directed fuzzer, while the majority of the papers we studied do not fully disclose their crash triage scripts. We argue that disclosing the whole evaluation process is essential for reproducing research and facilitating future work in the field of directed fuzzing. In addition, our study reveals that several common evaluation practices in the current directed fuzzing literature can mislead the overall assessments. Thus, we identify such mistakes in previous papers and propose guidelines for evaluating directed fuzzers.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
DyPyBench: A Benchmark of Executable Python Software
Islem Bouzenia, Bajaj Piyush Krishan, and Michael Pradel
(University of Stuttgart, Germany)
Python has emerged as one of the most popular programming languages, extensively utilized in domains such as machine learning, data analysis, and web applications. Python’s dynamic nature and extensive usage make it an attractive candidate for dynamic program analysis. However, unlike for other popular languages, there currently is no comprehensive benchmark suite of executable Python projects, which hinders the development of dynamic analyses. This work addresses this gap by presenting DyPyBench, the first benchmark of Python projects that is large-scale, diverse, ready-to-run (i.e., with fully configured and prepared test suites), and ready-to-analyze (by integrating with the DynaPyt dynamic analysis framework). The benchmark encompasses 50 popular open-source projects from various application domains, with a total of 681k lines of Python code, and 30k test cases. DyPyBench enables various applications in testing and dynamic analysis, of which we explore three in this work: (i) Gathering dynamic call graphs and empirically comparing them to statically computed call graphs, which exposes and quantifies limitations of existing call graph construction techniques for Python. (ii) Using DyPyBench to build a training data set for LExecutor, a neural model that learns to predict values that otherwise would be missing at runtime. (iii) Using dynamically gathered execution traces to mine API usage specifications, which establishes a baseline for future work on specification mining for Python. We envision DyPyBench to provide a basis for other dynamic analyses and for studying the runtime behavior of Python code.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Predicting Configuration Performance in Multiple Environments with Sequential Meta-Learning
Jingzhi Gong and Tao Chen
(University of Electronic Science and Technology of China, China; Loughborough University, United Kingdom; University of Birmingham, United Kingdom)
Learning and predicting the performance of given software configurations are of high importance to many software engineering activities. While configurable software systems will almost certainly face diverse running environments (e.g., version, hardware, and workload), current work often either builds performance models under a single environment or fails to properly handle data from diverse settings, hence restricting their accuracy for new environments. In this paper, we target configuration performance learning under multiple environments. We do so by designing SeMPL — a meta-learning framework that learns the common understanding from configurations measured in distinct (meta) environments and generalizes them to the unforeseen, target environment. What makes it unique is that unlike common meta-learning frameworks (e.g., MAML and MetaSGD) that train the meta environments in parallel, we train them sequentially, one at a time. The order of training naturally allows discriminating the contributions among meta environments in the meta-model built, which fits better with the characteristic of configuration data that is known to dramatically differ between different environments. Through comparing with 15 state-of-the-art models under nine systems, our extensive experimental results demonstrate that SeMPL performs considerably better on 89% of the systems with up to 99% accuracy improvement, while being data-efficient, leading to a maximum of 3.86× speedup. All code and data can be found at our repository: https://github.com/ideas-labo/SeMPL.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
R2I: A Relative Readability Metric for Decompiled Code
Haeun Eom, Dohee Kim, Sori Lim, Hyungjoon Koo, and Sungjae Hwang
(Sungkyunkwan University, South Korea)
Decompilation is a process of converting a low-level machine code snippet back into a high-level programming language such as C. It serves as a basis to aid reverse engineers in comprehending the contextual semantics of the code. In this respect, commercial decompilers like Hex-Rays have made significant strides in improving the readability of decompiled code over time. While previous work has proposed the metrics for assessing the readability of source code, including identifiers, variable names, function names, and comments, those metrics are unsuitable for measuring the readability of decompiled code primarily due to i) the lack of rich semantic information in the source and ii) the presence of erroneous syntax or inappropriate expressions. In response, to the best of our knowledge, this work first introduces R2I, the Relative Readability Index, a specialized metric tailored to evaluate decompiled code in a relative context quantitatively. In essence, R2I can be computed by i) taking code snippets across different decompilers as input and ii) extracting pre-defined features from an abstract syntax tree. For the robustness of R2I, we thoroughly investigate the enhancement efforts made by (non-)commercial decompilers and academic research to promote code readability, identifying 31 features to yield a reliable index collectively. Besides, we conducted a user survey to capture subjective factors such as one’s coding styles and preferences. Our empirical experiments demonstrate that R2I is a versatile metric capable of representing the relative quality of decompiled code (e.g., obfuscation, decompiler updates) and being well aligned with human perception in our survey.

Publisher's Version Published Artifact Artifacts Available
DiffCoder: Enhancing Large Language Model on API Invocation via Analogical Code Exercises
Daoguang Zan, Ailun Yu, Bo Shen, Bei Chen, Wei Li, Yongshun Gong, Xiaolin Chen, Yafen Yao, Weihua Luo, Bei Guan, Yan Liu, Yongji Wang, Qianxiang Wang, and Lizhen Cui
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Peking University, China; Huawei Technologies, China; Independent Researcher, China; Shandong University, China; Funcun-wuyou Technologies, China)
The task of code generation aims to generate code solutions based on given programming problems. Recently, code large language models (code LLMs) have shed new light on this task, owing to their formidable code generation capabilities. While these models are powerful, they seldom focus on further improving the accuracy of library-oriented API invocation. Nonetheless, programmers frequently invoke APIs in routine coding tasks. In this paper, we aim to enhance the proficiency of existing code LLMs regarding API invocation by mimicking analogical learning, which is a critical learning strategy for humans to learn through differences among multiple instances. Motivated by this, we propose a simple yet effective approach, namely DiffCoder, which excels in API invocation by effectively training on the differences (diffs) between analogical code exercises. To assess the API invocation capabilities of code LLMs, we conduct experiments on seven existing benchmarks that focus on mono-library API invocation. Additionally, we construct a new benchmark, namely PanNumEval, to evaluate the performance of multi-library API invocation. Extensive experiments on eight benchmarks demonstrate the impressive performance of DiffCoder. Furthermore, we develop a VSCode plugin for DiffCoder, and the results from twelve invited participants further verify the practicality of DiffCoder.

Publisher's Version
Maximizing Patch Coverage for Testing of Highly-Configurable Software without Exploding Build Times
Necip Fazıl Yıldıran, Jeho Oh, Julia Lawall, and Paul Gazzillo
(University of Central Florida, USA; University of Texas at Austin, USA; Inria, France)
The Linux kernel is highly-configurable, with a build system that takes a configuration file as input and automatically tailors the source code accordingly. Configurability, however, complicates testing, because different configuration options lead to the inclusion of different code fragments. With thousands of patches received per month, Linux kernel maintainers employ extensive automated continuous integration testing. To attempt patch coverage, i.e., taking all changed lines into account, current approaches either use configuration files that maximize total statement coverage or use multiple randomly-generated configuration files, both of which incur high build times without guaranteeing patch coverage. To achieve patch coverage without exploding build times, we propose krepair, which automatically repairs configuration files that are fast-building but have poor patch coverage to achieve high patch coverage with little effect on build times. krepair works by discovering a small set of changes to a configuration file that will ensure patch coverage, preserving most of the original configuration file's settings. Our evaluation shows that, when applied to configuration files with poor patch coverage on a statistically-significant sample of recent Linux kernel patches, krepair achieves nearly complete patch coverage, 98.5% on average, while changing less than 1.53% of the original default configuration file in 99% of patches, which keeps build times 10.5x faster than maximal configuration files.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Abstraction-Aware Inference of Metamorphic Relations
Agustín Nolasco, Facundo Molina, Renzo Degiovanni, Alessandra Gorla, Diego Garbervetsky, Mike Papadakis, Sebastian Uchitel, Nazareno Aguirre, and Marcelo F. Frias
(University of Rio Cuarto, Argentina; IMDEA Software Institute, Spain; University of Luxembourg, Luxembourg; University of Buenos Aires, Argentina; Imperial College London, United Kingdom; CONICET, Argentina; University of Texas at El Paso, USA)
Metamorphic testing is a valuable technique that helps in dealing with the oracle problem. It involves testing software against specifications of its intended behavior given in terms of so-called metamorphic relations, statements that express properties relating different software elements (e.g., different inputs, methods, etc.). The effective application of metamorphic testing strongly depends on identifying suitable domain-specific metamorphic relations, a challenging task that is typically performed manually. This paper introduces MemoRIA, a novel approach that aims at automatically identifying metamorphic relations. The technique focuses on a particular kind of metamorphic relation, which asserts equivalences between methods and method sequences. MemoRIA works by first generating an object-protocol abstraction of the software being tested, then using fuzzing to produce candidate relations from the abstraction, and finally validating the candidate relations through run-time analysis. A SAT-based analysis is used to eliminate redundant relations, resulting in a concise set of metamorphic relations for the software under test. We evaluate our technique on a benchmark consisting of 22 Java subjects taken from the literature, and compare MemoRIA with the metamorphic relation inference technique SBES. Our results show that by incorporating the object protocol abstraction information, MemoRIA is able to more effectively infer meaningful metamorphic relations that are also more precise, compared to SBES, measured in terms of mutation analysis. Also, the SAT-based reduction allows us to significantly reduce the number of reported metamorphic relations, while in general having a small impact on the bug-finding ability of the resulting relations.
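The validation step for relations that equate method sequences can be sketched as follows. This is not MemoRIA itself: the `SimpleStack` class, the candidate relations, and the random-state check are assumptions made for illustration, and the abstraction, fuzzing, and SAT-based reduction stages are omitted.

```python
import copy
import random

class SimpleStack:
    """Toy subject under test (hypothetical, not from the paper's benchmark)."""
    def __init__(self):
        self.items = []
    def push(self, x):
        self.items.append(x)
    def pop(self):
        if self.items:
            self.items.pop()
    def clear(self):
        self.items = []

def run(obj, sequence):
    """Apply a sequence of (method_name, args) steps to a copy of obj."""
    clone = copy.deepcopy(obj)
    for name, args in sequence:
        getattr(clone, name)(*args)
    return clone.items

def holds(lhs, rhs, trials=200, seed=0):
    """Check that lhs and rhs leave the same state on random start states."""
    rng = random.Random(seed)
    for _ in range(trials):
        start = SimpleStack()
        for _ in range(rng.randint(0, 5)):
            start.push(rng.randint(0, 9))
        if run(start, lhs) != run(start, rhs):
            return False
    return True

# Candidate relations: left sequence is claimed equivalent to right sequence.
candidates = {
    "push;pop == skip": ([("push", (7,)), ("pop", ())], []),
    "clear;clear == clear": ([("clear", ()), ("clear", ())], [("clear", ())]),
    "pop;push == skip": ([("pop", ()), ("push", (7,))], []),
}
validated = {name: holds(lhs, rhs) for name, (lhs, rhs) in candidates.items()}
```

A relation that survives this check is only validated on the sampled states, so it remains a likely relation and not a proven one.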

Publisher's Version
TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
Haiyu Huang, Xiaoyu Zhang, Pengfei Chen, Zilong He, Zhiming Chen, Guangba Yu, Hongyang Chen, and Chen Sun
(Sun Yat-sen University, Guangzhou, China; Huawei, China)
Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention compared to random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. Simultaneously, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of system-bias and diversity-bias through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer can achieve higher quality sampling results and significantly improve the performance of downstream root cause analysis (RCA) tasks. It has led to an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines in two datasets.

Publisher's Version
Fast Graph Simplification for Path-Sensitive Typestate Analysis through Tempo-Spatial Multi-Point Slicing
Xiao Cheng, Jiawei Ren, and Yulei Sui
(UNSW, Sydney, Australia)
Typestate analysis is a commonly used static technique to identify software vulnerabilities by assessing if a sequence of operations violates temporal safety specifications defined by a finite state automaton. Path-sensitive typestate analysis (PSTA) offers a more precise solution by eliminating false alarms stemming from infeasible paths. To improve the efficiency of path-sensitive analysis, previous efforts have incorporated sparse techniques, with a focus on analyzing the path feasibility of def-use chains. However, they cannot be directly applied to detect typestate vulnerabilities requiring temporal information within the control flow graph, e.g., use-to-use information. In this paper, we introduce FGS, a Fast Graph Simplification approach designed for PSTA by retaining multi-point temporal information while harnessing the advantages of sparse analysis. We propose a new multi-point slicing technique that captures the temporal and spatial correlations within the program. By doing so, it optimizes the program by only preserving the necessary program dependencies, resulting in a sparser structure for precision-preserving PSTA. Our graph simplification approach, as a fast preprocessing step, offers several benefits for existing PSTA algorithms. These include a more concise yet precision-preserving graph structure, decreased numbers of variables and constraints within execution states, and simplified path feasibility checking. As a result, the overall efficiency of the PSTA algorithm exhibits significant improvement. We evaluated FGS using NIST benchmarks and ten real-world large-scale projects to detect four types of vulnerabilities, including memory leaks, double-frees, use-after-frees, and null dereferences. On average, when comparing FGS against ESP (baseline PSTA), FGS reduces 89% of nodes, 86% of edges, and 88% of calling context of the input graphs, obtaining a speedup of 116x and a memory usage reduction of 93% on the large projects evaluated. 
Our experimental results also demonstrate that FGS outperforms six open-source tools (IKOS, ClangSA, Saber, Cppcheck, Infer, and Sparrow) on the NIST benchmarks, which comprise 846 programs. Specifically, FGS achieves significantly higher precision, with improvements of up to 171% (42% on average), and detects a greater number of true positives, with enhancements of up to 245% (52% on average). Moreover, among the ten large-scale projects, FGS successfully found 105 real bugs with a precision rate of 82%. In contrast, our baseline tools not only missed over 42% of the real bugs but also yielded an average precision rate of just 13%.
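The notion of a temporal safety specification defined by a finite state automaton can be made concrete with a small sketch. The states, operation names, and transition table below are illustrative assumptions; the checker runs a single operation sequence for one object and models neither path sensitivity nor the FGS slicing technique.

```python
# Transition table of a small finite state automaton for one heap object.
TRANSITIONS = {
    ("unallocated", "malloc"): "allocated",
    ("allocated", "use"): "allocated",
    ("allocated", "free"): "freed",
    ("freed", "use"): "error:use-after-free",
    ("freed", "free"): "error:double-free",
}

def check_trace(ops):
    """Run one sequence of operations on one object through the automaton."""
    state = "unallocated"
    for op in ops:
        # Any operation without a listed transition is reported as invalid.
        state = TRANSITIONS.get((state, op), "error:invalid-" + op)
        if state.startswith("error"):
            return state
    # Ending in "allocated" means the object was never released.
    return "error:memory-leak" if state == "allocated" else "ok"
```

For example, `check_trace(["malloc", "free", "use"])` reports a use-after-free, a violation that depends on the order of two uses of the same object and not on a single definition and its use.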

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Understanding Developers’ Discussions and Perceptions on Non-functional Requirements: The Case of the Spring Ecosystem
Anderson Oliveira, João Correia, Wesley K. G. Assunção, Juliana Alves Pereira, Rafael de Mello, Daniel Coutinho, Caio Barbosa, Paulo Libório, and Alessandro Garcia
(PUC-Rio, Brazil; North Carolina State University, USA; Federal University of Rio de Janeiro, Brazil)
Non-Functional Requirements (NFRs) should be defined in the early stages of the software development process, driving developers to make important design decisions. Neglecting NFRs may lead developers to create systems that are difficult to maintain and do not meet users' expectations. Despite their importance, the discussion of NFRs is often ad-hoc and scattered through multiple sources, limiting developers' awareness of NFRs. In that scenario, Pull Request (PR) discussions provide a centralized platform for comprehensive NFR discussions. However, existing studies do not explore this important source of information in open-source software development, which developers widely use to discuss software requirements. In this study, we report an investigation of NFR discussions in PRs of repositories of the Spring ecosystem. We collected, manually curated, and analyzed PR discussions addressing four categories of NFRs: maintainability, security, performance, and robustness. We observed that discussions surrounding these PRs tend to address the introduction of a code change or explain some anomaly regarding a particular NFR. Also, we found that more than 77% of the discussions related to NFRs are triggered in the PR title and/or description, indicating that developers are often provided with information regarding NFRs straightaway. To gain additional knowledge from these NFR discussions, our study also analyzed the characteristics and activities of developers who actually discuss and fix NFR issues. In particular, we performed an in-depth analysis of 63 developers who stood out in collaborating with the mapped PRs. To complement this analysis, we conducted a survey with 44 developers to gather their perceptions on NFR discussions. By observing how developers approach NFRs and participate in discussions, we documented the best practices and strategies newcomers can adopt to address NFRs effectively.
We also provided a curated dataset of 1,533 PR discussions classified with NFR presence.

Publisher's Version Info
Adapting Multi-objectivized Software Configuration Tuning
Tao Chen and Miqing Li
(University of Electronic Science and Technology of China, China; University of Birmingham, United Kingdom)
When tuning software configuration for better performance (e.g., latency or throughput), an important issue that many optimizers face is the presence of local optimum traps, compounded by a highly rugged configuration landscape and expensive measurements. To mitigate these issues, a recent effort has shifted to focus on the level of optimization model (called meta multi-objectivization or MMO) instead of designing better optimizers as in traditional methods. This is done by using an auxiliary performance objective, together with the target performance objective, to help the search jump out of local optima. While effective, MMO needs a fixed weight to balance the two objectives—a parameter that has been found to be crucial as there is a large deviation of the performance between the best and the other settings. However, given the variety of configurable software systems, the “sweet spot” of the weight can vary dramatically in different cases and it is not possible to find the right setting without time-consuming trial and error. In this paper, we seek to overcome this significant shortcoming of MMO by proposing a weight adaptation method, dubbed AdMMO. Our key idea is to adaptively adjust the weight at the right time during tuning, such that a good proportion of the nondominated configurations can be maintained. Moreover, we design a partial duplicate retention mechanism to handle the issue of too many duplicate configurations without losing the rich information provided by the “good” duplicates. Experiments on several real-world systems, objectives, and budgets show that, for 71% of the cases, AdMMO is considerably superior to MMO and a wide range of state-of-the-art optimizers while achieving generally better efficiency with the best speedup between 2.2x and 20x.
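The "proportion of the nondominated configurations" mentioned above rests on Pareto dominance, which the following sketch computes for a small population. It assumes two objectives that are both minimized and uses invented measurements; the weight adaptation rule of AdMMO itself is not reproduced here.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def nondominated_ratio(points):
    """Fraction of the population that is nondominated."""
    return len(nondominated(points)) / len(points)

# Hypothetical (target objective, auxiliary objective) pairs per configuration.
population = [(10.0, 3.0), (12.0, 1.0), (11.0, 4.0), (15.0, 5.0)]
front = nondominated(population)
ratio = nondominated_ratio(population)
```

An adaptive method could monitor a ratio like this during tuning and adjust the weight when the ratio drifts too high or too low; how and when it does so is specific to the method.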

Publisher's Version Published Artifact Artifacts Available
CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking
Zian Su, Xiangzhe Xu, Ziyang Huang, Zhuo Zhang, Yapeng Ye, Jianjun Huang, and Xiangyu Zhang
(Purdue University, USA; Renmin University of China, China)
Transformer-based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method that allows the model to attend only to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. In the meantime, the inherent self-attention mechanism is utilized to learn which of the allowed attentions are more important compared to others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model in three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model can improve the SOTAs in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques of code understanding models.
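One way to picture an attention mask derived from a bi-directional dependence transitive closure is the sketch below. It works at the granularity of whole instructions and ignores token co-occurrences, both of which are simplifying assumptions; the function names and the four-instruction example are hypothetical and not taken from the CodeArt implementation.

```python
def transitive_closure(n, edges):
    """Warshall's algorithm over n nodes; edges are (src, dst) dependence pairs."""
    reach = [[False] * n for _ in range(n)]
    for src, dst in edges:
        reach[src][dst] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def attention_mask(n, edges):
    """mask[i][j] is True when position i may attend to position j: always
    to itself, and to anything it transitively depends on or that
    transitively depends on it."""
    reach = transitive_closure(n, edges)
    return [[i == j or reach[i][j] or reach[j][i] for j in range(n)]
            for i in range(n)]

# Four instructions: 0 -> 1 -> 2 are chained by dependences, 3 is unrelated.
mask = attention_mask(4, [(0, 1), (1, 2)])
```

In this example instruction 0 may attend to instruction 2 through the chain, while instruction 3 can attend only to itself; self-attention then weighs the permitted pairs.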

Publisher's Version Published Artifact Artifacts Available
Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models
Yan Wang, Xiaoning Li, Tien N. Nguyen, Shaohua Wang, Chao Ni, and Ling Ding
(Central University of Finance and Economics, China; University of Texas at Dallas, USA; Zhejiang University, China)
Pre-trained Large Language Models (LLMs) have achieved remarkable successes in several domains. However, code-oriented LLMs are often computationally heavy, with complexity that grows quadratically with the length of the input code sequence. To simplify the input program of an LLM, the state-of-the-art approach filters the input code tokens based on the attention scores given by the LLM. The decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is pre-trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of input code tokens. In an empirical study of LLMs including CodeBERT, CodeT5, and GPT-4 on two main tasks, code search and summarization, we report that 1) the removal ratio of code has a linear-like relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the paradigm of prompt engineering and interactive in-context learning. The empirical results showed that SlimCode can improve the state-of-the-art technique by 9.46% and 5.15% in terms of MRR and BLEU score on code search and summarization, respectively. More importantly, SlimCode is 133 times faster than the state-of-the-art approach. Additionally, SlimCode can reduce the cost of invoking GPT-4 by up to 24% per API query, while still producing comparable results to those with the original code. With this result, we call for a new direction on code-based, model-agnostic code simplification solutions to further empower LLMs.

Publisher's Version
Go Static: Contextualized Logging Statement Generation
Yichen Li, Yintong Huo, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Pinjia He, and Michael R. Lyu
(Chinese University of Hong Kong, Hong Kong, China; Chinese University of Hong Kong, Shenzhen, China)
Logging practices have been extensively investigated to assist developers in writing appropriate logging statements for documenting software behaviors. Although numerous automatic logging approaches have been proposed, their performance remains unsatisfactory due to the constraint of the single-method input, without informative programming context outside the method. Specifically, we identify three inherent limitations with single-method context: limited static scope of logging statements, inconsistent logging styles, and missing type information of logging variables. To tackle these limitations, we propose SCLogger, the first contextualized logging statement generation approach with inter-method static contexts. First, SCLogger extracts inter-method contexts with static analysis to construct the contextualized prompt for language models to generate a tentative logging statement. The contextualized prompt consists of an extended static scope and sampled similar methods, ordered by the chain-of-thought (COT) strategy. Second, SCLogger refines the access of logging variables by formulating a new refinement prompt for language models, which incorporates detailed type information of variables in the tentative logging statement. The evaluation results show that SCLogger surpasses the state-of-the-art approach by 8.7% in logging position accuracy, 32.1% in level accuracy, 19.6% in variable precision, and 138.4% in text BLEU-4 score. Furthermore, SCLogger consistently boosts the performance of logging statement generation across a range of large language models, thereby showcasing the generalizability of this approach.

Publisher's Version
Unprecedented Code Change Automation: The Fusion of LLMs and Transformation by Example
Malinda Dilhara, Abhiram Bellur, Timofey Bryksin, and Danny Dig
(University of Colorado, Boulder, USA; JetBrains Research, Cyprus; JetBrains Research - University of Colorado, Boulder, USA)
Software developers often repeat the same code changes within a project or across different projects. These repetitive changes are known as “code change patterns” (CPATs). Automating CPATs is crucial to expedite the software development process. While current Transformation by Example (TBE) techniques can automate CPATs, they are limited by the quality and quantity of the provided input examples. Thus, they miss transforming code variations that do not have the exact syntax, data-, or control-flow of the provided input examples, despite being semantically similar. Large Language Models (LLMs), pre-trained on extensive source code datasets, offer a potential solution. Harnessing the capability of LLMs to generate semantically equivalent, yet previously unseen variants of the original CPAT could significantly increase the effectiveness of TBE systems. In this paper, we first discover best practices for harnessing LLMs to generate code variants that meet three criteria: correctness (semantic equivalence to the original CPAT), usefulness (reflecting what developers typically write), and applicability (aligning with the primary intent of the original CPAT). We then implement these practices in our tool PyCraft, which synergistically combines static code analysis, dynamic analysis, and LLM capabilities. By employing chain-of-thought reasoning, PyCraft generates variations of input examples and comprehensive test cases that identify correct variations with an F-measure of 96.6%. Our algorithm uses feedback iteration to expand the original input examples by an average factor of 58x. Using these richly generated examples, we inferred transformation rules and then automated these changes, resulting in an increase of up to 39x, with an average increase of 14x in target codes compared to a previous state-of-the-art tool that relies solely on static analysis. 
We submitted patches generated by PyCraft to a range of projects, notably esteemed ones like microsoft/DeepSpeed and IBM/inFairness. Their developers accepted and merged 83% of the 86 CPAT instances submitted through 44 pull requests. This confirms the usefulness of these changes.

Publisher's Version Info
Syntax Is All You Need: A Universal-Language Approach to Mutant Generation
Sourav Deb, Kush Jain, Rijnard van Tonder, Claire Le Goues, and Alex Groce
(Northern Arizona University, USA; Carnegie Mellon University, USA; Mysten Labs, USA)
While mutation testing has been a topic of academic interest for decades, it is only recently that “real-world” developers, including industry leaders such as Google and Meta, have adopted mutation testing. We propose a new approach to the development of mutation testing tools, and in particular the core challenge of generating mutants. Current practice tends towards two limited approaches to mutation generation: mutants are either (1) generated at the bytecode/IR level, and thus neither human readable nor adaptable to source-level features of languages or projects, or (2) generated at the source level by language-specific tools that are hard to write and maintain, and in fact are often abandoned by both developers and users. We propose instead that source-level mutation generation is a special case of program transformation in general, and that adopting this approach allows for a single tool that can effectively generate source-level mutants for essentially any programming language. Furthermore, by using parser parser combinators many of the seeming limitations of an any-language approach can be overcome, without the need to parse specific languages. We compare this new approach to mutation to existing tools, and demonstrate the advantages of using parser parser combinators to improve on a regular-expression based approach to generation. Finally, we show that our approach can provide effective mutant generation even for a language for which it lacks any language-specific operators, and that is not very similar in syntax to any language it has been applied to previously.
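The regular-expression based style of source-level mutant generation that the abstract uses as a point of comparison can be sketched in a few lines. The rule set and the sample line are invented for illustration and are not the operators of any particular tool.

```python
import re

# (pattern, replacement) pairs; a tiny illustrative operator set.
RULES = [
    (r"==", "!="),
    (r"<=", "<"),
    (r"\+", "-"),
    (r"\btrue\b", "false"),
]

def mutants(source):
    """Yield one mutant per match, each differing from source at a single site."""
    for pattern, replacement in RULES:
        for match in re.finditer(pattern, source):
            yield source[:match.start()] + replacement + source[match.end():]

line = "if (a + b <= limit) return x == y;"
generated = list(mutants(line))
```

Because the rules see only characters, a pattern such as `\+` would also fire inside a string literal or a comment. Limitations of this kind are what a syntax-aware matcher is meant to avoid without requiring a full parser for each language.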

Publisher's Version Published Artifact Info Artifacts Available
CodePlan: Repository-Level Coding using LLMs and Planning
Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B. Ashok, and Shashank Shet
(Microsoft Research, India)
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan to solve it. CodePlan synthesizes a multi-step chain-of-edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm (symbolic components) with the neural LLMs. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2 and 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has better match with the ground truth compared to baselines. CodePlan is able to get 5/7 repositories to pass the validity checks (i.e., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them.
We provide our (non-proprietary) data, evaluation scripts and supplementary material at https://github.com/microsoft/codeplan.

Publisher's Version
Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks
Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, and Yi Wang
(Beijing University of Posts and Telecommunications, China; University of Melbourne, Australia)
Recently, generative AI based on large language models (LLMs) has been gaining momentum for its impressive, high-quality performance in multiple domains, particularly after the release of ChatGPT. Many believe that LLMs have the potential to perform general-purpose problem-solving in software development and replace human software developers. Nevertheless, there is a lack of serious investigation into the capability of these LLM techniques in fulfilling software development tasks. In a controlled 2 × 2 between-subject experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in a coding task and a typical software development task, and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was not that good. We also observed the interactions between participants and ChatGPT and found relations between the interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers and motivates the need for novel interaction mechanisms that help developers effectively work with large language models to achieve desired outcomes.

Publisher's Version Info
Understanding and Detecting Annotation-Induced Faults of Static Analyzers
Huaien Zhang, Yu Pei, Shuyun Liang, and Shin Hwei Tan
(Hong Kong Polytechnic University, China; Southern University of Science and Technology, China; Concordia University, Canada)
Static analyzers can reason about the properties and behaviors of programs and detect various issues without executing them. Hence, they should extract the necessary information to understand the analyzed program well. Annotation has been a widely used feature for different purposes in Java since the introduction of Java 5. Annotations can change program structures and convey semantic information without static analyzers being aware of it, consequently leading to imprecise analysis results. This paper presents the first comprehensive study of annotation-induced faults (AIF) by analyzing 246 issues in six popular open-source static analyzers (i.e., PMD, SpotBugs, CheckStyle, Infer, SonarQube, and Soot). We analyzed the issues' root causes, symptoms, and fix strategies and derived ten findings and some practical guidelines for detecting and repairing annotation-induced faults. Moreover, we developed an automated testing framework called AnnaTester based on three metamorphic relations originating from the findings. AnnaTester generated new tests based on the official test suites of static analyzers and unveiled 43 new faults, 20 of which have been fixed. The results confirm the value of our study and its findings.

Publisher's Version
Only diff Is Not Enough: Generating Commit Messages Leveraging Reasoning and Action of Large Language Model
Jiawei Li, David Faragó, Christian Petrov, and Iftekhar Ahmed
(University of California at Irvine, USA; Innoopract, Germany; QPR Technologies, Germany)
Commit messages play a vital role in software development and maintenance. While previous research has introduced various Commit Message Generation (CMG) approaches, they often suffer from a lack of consideration for the broader software context associated with code changes. This limitation resulted in generated commit messages that contained insufficient information and were poorly readable. To address these shortcomings, we approached CMG as a knowledge-intensive reasoning task. We employed ReAct prompting with a cutting-edge Large Language Model (LLM) to generate high-quality commit messages. Our tool retrieves a wide range of software context information, enabling the LLM to create commit messages that are factually grounded and comprehensive. Additionally, we gathered commit message quality expectations from software practitioners, incorporating them into our approach to further enhance message quality. Human evaluation demonstrates the overall effectiveness of our CMG approach, which we named Omniscient Message Generator (OMG). It achieved an average improvement of 30.2% over human-written messages and a 71.6% improvement over state-of-the-art CMG methods.

Publisher's Version
DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing
Sabaat Haroon, Chris Brown, and Muhammad Ali Gulzar
(Virginia Tech, USA)
SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial effort to comprehend every stage of the query execution and find the root cause among the input data and complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. Due to the declarative nature of SQL, there are no ordered atomic statements on which to place a breakpoint to monitor the flow of data. DeSQL’s automated query decomposition breaks a SQL query into its constituent subqueries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. DeSQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, DeSQL efficiently regenerates the intermediate data from a nearby subquery’s data. On the popular TPC-DS benchmark, DeSQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark’s scalability. In a user study comprising 15 participants engaged in two debugging tasks, we find that participants utilizing DeSQL identify the root cause behind a wrong query output in 74% less time than with de facto manual debugging.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
CORE: Resolving Code Quality Issues using LLMs
Nalin Wadhwa, Jui Pradhan, Atharv Sonwane, Surya Prakash Sahu, Nagarajan Natarajan, Aditya Kanade, Suresh Parthasarathy, and Sriram Rajamani
(Microsoft Research, India)
As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer.
We conduct a variety of experiments on two public benchmarks to show the ability of CORE: (1) to generate code revisions acceptable to both static analysis tools and human reviewers (the latter evaluated with a user study on a subset of the Python benchmark), (2) to reduce human review effort by detecting and eliminating revisions with unintended changes, (3) to readily work across multiple languages (Python and Java), static analysis tools (CodeQL and SonarQube) and quality checks (52 and 10 checks, respectively), and (4) to achieve a fix rate comparable to a rule-based automated program repair tool but with much less engineering effort (on the Java benchmark). CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM reduced false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort. We release code, data, and supplementary material publicly at http://aka.ms/COREMSRI.
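The proposer-ranker flow described in this abstract can be illustrated with a minimal Python sketch. It is not CORE's implementation: the function `rank_revisions` and the three callables `propose`, `passes_check`, and `score` are hypothetical stand-ins for the proposer LLM, the static quality check, and the ranker LLM, and the toy lambdas below replace real model and tool calls.

```python
from typing import Callable, List


def rank_revisions(
    source: str,
    propose: Callable[[str], List[str]],   # stand-in for the proposer LLM
    passes_check: Callable[[str], bool],   # stand-in for the static quality check
    score: Callable[[str], float],         # stand-in for the ranker LLM's rubric score
) -> List[str]:
    """Keep candidates that pass the static check, best-scored first."""
    candidates = [c for c in propose(source) if passes_check(c)]
    return sorted(candidates, key=score, reverse=True)


# Toy stand-ins, for illustration only.
toy_scores = {"x = int(s)": 0.9, "x = 0": 0.2}
ranked = rank_revisions(
    "x = eval(s)",
    propose=lambda src: ["x = int(s)", "x = eval(s)  # fixed", "x = 0"],
    passes_check=lambda cand: "eval(" not in cand,
    score=lambda cand: toy_scores.get(cand, 0.0),
)
print(ranked)
```

In this toy run, the candidate that still calls `eval` is dropped by the check, and the remaining two are ordered by their assumed ranker scores.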

Publisher's Version Info
Towards AI-Assisted Synthesis of Verified Dafny Methods
Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James Noble
(University of California at Irvine, USA; Creative Research & Programming, New Zealand; Australian National University, Australia)
Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs’ specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models’ proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems with the CoT prompt; however, it managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems: 50 that we wrote manually, and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code-generating large language models.
Likewise, program verification systems can benefit from large language models, whether to synthesize code wholesale, to generate specifications, or to act as a "programmer’s verification apprentice", to construct annotations such as loop invariants which are hard for programmers to write or verification tools to find. Finally, we expect that the approach we have pioneered here — generating candidate solutions that are subsequently formally checked for correctness — should transfer to other domains (e.g., legal arguments, transport signaling, structural engineering) where solutions must be correct, where that correctness must be demonstrated, explained and understood by designers and end-users.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
AROMA: Automatic Reproduction of Maven Artifacts
Mehdi Keshani, Tudor-Gabriel Velican, Gideon Bot, and Sebastian Proksch
(Delft University of Technology, Netherlands)
Modern software engineering establishes software supply chains and relies on tools and libraries to improve productivity. However, reusing external software in a project presents a security risk when the source of the component is unknown or the consistency of a component cannot be verified. The SolarWinds attack serves as a popular example in which the injection of malicious code into a library affected thousands of customers and caused a loss of billions of dollars. Reproducible builds present a mitigation strategy, as they can confirm the origin and consistency of reused components. A large reproducibility community has formed for Debian, but the reproducibility of the Maven ecosystem, the backbone of the Java supply chain, remains understudied in comparison. Reproducible Central is an initiative that curates a list of reproducible Maven libraries, but the list is limited and challenging to maintain due to manual efforts. Our research aims to support these efforts in the Maven ecosystem through automation. We investigate the feasibility of automatically finding the source code of a library from its Maven release and recovering information about the original release environment. Our tool, AROMA, can obtain this critical information from the artifact and the source repository through several heuristics and we use the results for reproduction attempts of Maven packages. Overall, our approach achieves an accuracy of up to 99.5% when compared field-by-field to the existing manual approach. In some instances, we even detected flaws in the manually maintained list, such as broken repository links. We reveal that automatic reproducibility is feasible for 23.4% of the Maven packages using AROMA, and 8% of these packages are fully reproducible. We demonstrate our ability to successfully reproduce new packages and have contributed some of them to the Reproducible Central repository. 
Additionally, we highlight actionable insights, outline future work in this area, and make our dataset and tools available to the public.

Publisher's Version
Harnessing Neuron Stability to Improve DNN Verification
Hai Duong, Dong Xu, Thanhvu Nguyen, and Matthew B. Dwyer
(George Mason University, USA; University of Virginia, USA)
Deep Neural Networks (DNNs) have emerged as an effective approach to tackling real-world problems. However, like human-written software, DNNs are susceptible to bugs and attacks. This has generated significant interest in developing effective and scalable DNN verification techniques and tools. Recent developments in DNN verification have highlighted the potential of constraint-solving approaches that combine abstraction techniques with SAT solving. Abstraction approaches are effective at precisely encoding neuron behavior when it is linear, but they lead to overapproximation and combinatorial scaling when behavior is non-linear. SAT approaches in DNN verification have incorporated standard DPLL techniques, but have overlooked important optimizations found in modern SAT solvers that help them scale on industrial benchmarks. In this paper, we present VeriStable, a novel extension of the recently proposed DPLL-based constraint DNN verification approach. VeriStable leverages the insight that while neuron behavior may be non-linear across the entire DNN input space, at intermediate states computed during verification many neurons may be constrained to have linear behavior – these neurons are stable. Efficiently detecting stable neurons reduces combinatorial complexity without compromising the precision of abstractions. Moreover, the structure of clauses arising in DNN verification problems shares important characteristics with industrial SAT benchmarks. We adapt and incorporate multi-threading and restart optimizations targeting those characteristics to further optimize DPLL-based DNN verification. We evaluate the effectiveness of VeriStable across a range of challenging benchmarks including fully-connected feedforward networks (FNNs), convolutional neural networks (CNNs) and residual networks (ResNets) applied to the standard MNIST and CIFAR datasets.
Preliminary results show that VeriStable is competitive and outperforms state-of-the-art DNN verification tools, including α-β-CROWN and MN-BaB, the first and second performers of the VNN-COMP, respectively.

Publisher's Version
“The Law Doesn’t Work Like a Computer”: Exploring Software Licensing Issues Faced by Legal Practitioners
Nathan Wintersgill, Trevor Stalnaker, Laura A. Heymann, Oscar Chaparro, and Denys Poshyvanyk
(William & Mary, USA)
Most modern software products incorporate open source components, which requires compliance with each component's licenses. As noncompliance can lead to significant repercussions, organizations often seek advice from legal practitioners to maintain license compliance, address licensing issues, and manage the risks of noncompliance. While legal practitioners play a critical role in the process, little is known in the software engineering community about their experiences within the open source license compliance ecosystem. To fill this knowledge gap, a joint team of software engineering and legal researchers designed and conducted a survey with 30 legal practitioners and related occupations and then held 16 follow-up interviews. We identified different aspects of OSS license compliance from the perspective of legal practitioners, resulting in 14 key findings in three main areas of interest: the general ecosystem of compliance, the specific compliance practices of legal practitioners, and the challenges that legal practitioners face. We discuss the implications of our findings.

Publisher's Version Published Artifact Artifacts Available
COSTELLO: Contrastive Testing for Embedding-Based Large Language Model as a Service Embeddings
Weipeng Jiang, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, and Chao Shen
(Xi’an Jiaotong University, China; University of Massachusetts, Amherst, USA)
Large language models have gained significant popularity and are often provided as a service (i.e., LLMaaS). Companies like OpenAI and Google provide online APIs of LLMs to allow downstream users to create innovative applications. Despite their popularity, LLM safety and quality assurance is a well-recognized concern in the real world, requiring extra effort for testing these LLMs. Unfortunately, while end-to-end services like ChatGPT have garnered rising attention in terms of testing, the LLMaaS embeddings have received comparatively less scrutiny. We state the importance of testing and uncovering problematic individual embeddings without considering downstream applications. The abstraction and non-interpretability of embedded vectors, combined with the black-box inaccessibility of LLMaaS, make testing a challenging puzzle. This paper proposes COSTELLO, a black-box approach to reveal potential defects in abstract embedding vectors from LLMaaS by contrastive testing. Our intuition is that high-quality LLMs can adequately capture the semantic relationships of the input texts and properly represent their relationships in the high-dimensional space. For the given interface of LLMaaS and seed inputs, COSTELLO can automatically generate test suites and output words with potentially problematic embeddings. The idea is to synthesize contrastive samples with guidance, including positive and negative samples, by mutating seed inputs. Our synthesis guide leverages task-specific properties to control the mutation procedure and generate samples with known partial relationships in the high-dimensional space. Thus, we can compare the expected relationship (oracle) and embedding distance (output of LLMs) to locate potential buggy cases. We evaluate COSTELLO on 42 open-source (encoder-based) language models and two real-world commercial LLMaaS.
Experimental results show that COSTELLO can effectively detect semantic violations, where more than 62% of violations on average result in erroneous behaviors (e.g., unfairness) of downstream applications.
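The oracle idea in this abstract (a positive sample should lie closer to the seed than a negative sample does) can be sketched as follows. This is an illustrative Python sketch only, not COSTELLO's algorithm: the names `cosine_distance` and `find_violations`, the use of cosine distance, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Triple = Tuple[str, str, str]  # (seed, positive sample, negative sample)


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def find_violations(embed: Callable[[str], Vector], triples: List[Triple]) -> List[Triple]:
    """Flag triples whose positive is not strictly closer to the seed than the negative."""
    violations = []
    for seed, positive, negative in triples:
        anchor = embed(seed)
        if cosine_distance(anchor, embed(positive)) >= cosine_distance(anchor, embed(negative)):
            violations.append((seed, positive, negative))
    return violations


# Hypothetical embedding tables standing in for a black-box embedding API.
good_model = {"happy": (1.0, 0.0), "glad": (0.9, 0.1), "sad": (0.0, 1.0)}
bad_model = {"happy": (1.0, 0.0), "glad": (0.0, 1.0), "sad": (1.0, 0.1)}
tests = [("happy", "glad", "sad")]

print(find_violations(good_model.__getitem__, tests))
print(find_violations(bad_model.__getitem__, tests))
```

The first call reports no violation, while the second flags the triple because the assumed "bad" model places the negative sample nearer to the seed than the positive one.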

Publisher's Version
How Does Simulation-Based Testing for Self-Driving Cars Match Human Perception?
Christian Birchler, Tanzil Kombarabettu Mohammed, Pooja Rani, Teodora Nechita, Timo Kehrer, and Sebastiano Panichella
(Zurich University of Applied Sciences, Switzerland; University of Bern, Switzerland; University of Zurich, Switzerland)
Software metrics such as coverage or mutation scores have been investigated for the automated quality assessment of test suites. While traditional tools rely on software metrics, the field of self-driving cars (SDCs) has primarily focused on simulation-based test case generation using quality metrics such as the out-of-bound (OOB) parameter to determine if a test case fails or passes. However, it remains unclear to what extent this quality metric aligns with the human perception of the safety and realism of SDCs. To address this (reality) gap, we conducted an empirical study involving 50 participants to investigate the factors that determine how humans perceive SDC test cases as safe, unsafe, realistic, or unrealistic. To this aim, we developed a framework leveraging virtual reality (VR) technologies, called SDC-Alabaster, to immerse the study participants into the virtual environment of SDC simulators. Our findings indicate that the human assessment of safety and realism of failing/passing test cases can vary based on different factors, such as the test's complexity and the possibility of interacting with the SDC. Especially for the assessment of realism, the participants' age leads to a different perception. This study highlights the need for more research on simulation testing quality metrics and the importance of human perception in evaluating SDCs.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM
Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray
(Columbia University, USA; AWS AI Labs, USA)
Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work using large language models (LLMs) for test generation has focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but uses fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage.
In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt’s approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each stage of which is driven by a specific prompt aligned with the execution paths of the method under test, and by exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.

Publisher's Version
Enhancing Code Understanding for Impact Analysis by Combining Transformers and Program Dependence Graphs
Yanfu Yan, Nathan Cooper, Kevin Moran, Gabriele Bavota, Denys Poshyvanyk, and Steve Rich
(William & Mary, USA; University of Central Florida, USA; USI Lugano, Switzerland; Cisco Systems, USA)
Impact analysis (IA) is a critical software maintenance task that identifies the effects of a given set of code changes on a larger software project with the intention of avoiding potential adverse effects. IA is a cognitively challenging task that involves reasoning about the abstract relationships between various code constructs. Given its difficulty, researchers have worked to automate IA with approaches that primarily use coupling metrics as a measure of the "connectedness" of different parts of a software project. Many of these coupling metrics rely on static, dynamic, or evolutionary information and are based on heuristics that tend to be brittle or that require expensive execution analysis or large histories of co-changes to accurately estimate impact sets.
In this paper, we introduce a novel IA approach, called ATHENA, that combines a software system's dependence graph information with a conceptual coupling approach that uses advances in deep representation learning for code without the need for change histories and execution information. Previous IA benchmarks are small, containing fewer than ten software projects, and suffer from tangled commits, making it difficult to measure accurate results. Therefore, we constructed a large-scale IA benchmark, from 25 open-source software projects, that utilizes fine-grained commit information from bug fixes. On this new benchmark, our best performing approach configuration achieves an mRR, mAP, and HIT@10 score of 60.32%, 35.19%, and 81.48%, respectively. Through various ablations and qualitative analyses, we show that ATHENA's novel combination of program dependence graphs and conceptual coupling information leads it to outperform a simpler baseline by 10.34%, 9.55%, and 11.68% with statistical significance.
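For readers unfamiliar with the ranking metrics named in this abstract, the following Python sketch shows the standard per-query definitions of reciprocal rank, average precision, and hit@k; averaging each over all queries gives mRR, mAP, and HIT@k. These are the textbook formulations, and the paper's exact variants may differ. The function names and the toy ranking are hypothetical.

```python
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 if any relevant item appears in the top k, else 0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Toy query: methods ranked by predicted impact; "b" and "d" were truly impacted.
ranking = ["a", "b", "c", "d"]
truth = {"b", "d"}
print(reciprocal_rank(ranking, truth))    # first relevant item is at rank 2
print(average_precision(ranking, truth))  # (1/2 + 2/4) / 2
print(hit_at_k(ranking, truth, 1), hit_at_k(ranking, truth, 10))
```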

Publisher's Version
RavenBuild: Context, Relevance, and Dependency Aware Build Outcome Prediction
Gengyi Sun, Sarra Habchi, and Shane McIntosh
(University of Waterloo, Canada; Ubisoft, Canada)
Continuous Integration (CI) is a common practice adopted by modern software organizations. It plays an especially important role for large corporations like Ubisoft, where thousands of build jobs are submitted daily. Indeed, the cadence of development progress is constrained by the pace at which CI services process build jobs. To provide faster CI feedback, recent work explores how build outcomes can be anticipated. Although early results show plenty of promise, the distinct characteristics of Project X, a AAA video game project at Ubisoft, present new challenges for build outcome prediction. In the Project X setting, changes that do not modify source code also incur build failures. Moreover, we find that code changes whose impact crosses the source-data boundary are more prone to build failures than code changes that do not impact data files. Since such changes are not fully characterized by the existing set of build outcome prediction features, state-of-the-art models tend to underperform. Therefore, to accommodate the data context into build outcome prediction, we propose RavenBuild, a novel approach that leverages context, relevance, and dependency-aware features. We apply the state-of-the-art BuildFast model and RavenBuild to Project X, and observe that RavenBuild improves the F1 score of the failing class by 50%, the recall of the failing class by 105%, and AUC by 11%. To ease adoption in settings with heterogeneous project sets, we also provide a simplified alternative, RavenBuild-CR, which excludes dependency-aware features. We apply RavenBuild-CR on 22 open-source projects and Project X, and observe across-the-board improvements as well. On the other hand, we find that a naïve Parrot approach, which simply echoes the previous build outcome as its prediction, is surprisingly competitive with BuildFast and RavenBuild.
Though Parrot fails to predict when a build outcome differs from that of its immediate predecessor, Parrot serves well as a tendency indicator of the sequences in build outcome datasets. Therefore, future studies should also consider comparing to the Parrot approach as a baseline when evaluating build outcome prediction models.
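The Parrot baseline described in this abstract is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names, the boolean encoding (`True` for a passing build), and the default prediction used for the very first build are all hypothetical choices.

```python
from typing import List


def parrot_predict(outcomes: List[bool], default: bool = True) -> List[bool]:
    """Predict each build's outcome as the outcome of the build before it.

    The first build has no predecessor, so it gets `default` (an assumption).
    """
    predictions = []
    previous = default
    for outcome in outcomes:
        predictions.append(previous)
        previous = outcome
    return predictions


def accuracy(outcomes: List[bool], predictions: List[bool]) -> float:
    """Fraction of builds whose predicted outcome matches the actual one."""
    if not outcomes:
        return 0.0
    correct = sum(o == p for o, p in zip(outcomes, predictions))
    return correct / len(outcomes)


# Toy build history: True = pass, False = fail.
history = [True, True, False, False, False, True, True]
predicted = parrot_predict(history)
print(predicted)
print(accuracy(history, predicted))
```

On this toy history the baseline is wrong only at the two points where the outcome flips, which matches the abstract's observation that Parrot misses exactly the builds whose outcome differs from the predecessor's.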

Publisher's Version Published Artifact Artifacts Available
Towards Efficient Verification of Constant-Time Cryptographic Implementations
Luwei Cai, Fu Song, and Taolue Chen
(ShanghaiTech University, China; Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanjing Institute of Software Technology, China; Birkbeck University of London, United Kingdom)
Timing side-channel attacks exploit secret-dependent execution time to fully or partially recover secrets of cryptographic implementations, posing a severe threat to software security. The constant-time programming discipline is an effective software-based countermeasure against timing side-channel attacks, but developing constant-time implementations turns out to be challenging and error-prone. Current verification approaches/tools suffer from scalability and precision issues when applied to production software in practice. In this paper, we put forward practical verification approaches based on a novel synergy of taint analysis and safety verification of self-composed programs. Specifically, we first use an IFDS-based lightweight taint analysis to prove that a large number of potential (timing) side-channel sources do not actually leak secrets. We then resort to a precise taint analysis and a safety verification approach to determine whether the remaining potential side-channel sources can actually leak secrets. These include novel constructions of taint-directed semi-cross-product of the original program and its Boolean abstraction, and a taint-directed self-composition of the program. Our approach is implemented as a cross-platform and fully automated tool CT-Prover. The experiments confirm its efficiency and effectiveness in verifying real-world benchmarks from modern cryptographic and SSL/TLS libraries. In particular, CT-Prover identifies new, confirmed vulnerabilities of open-source SSL libraries (e.g., Mbed SSL, BearSSL) and significantly outperforms the state-of-the-art tools.

Publisher's Version Published Artifact Info Artifacts Available
Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions
Tao Xiao, Hideaki Hata, Christoph Treude, and Kenichi Matsumoto
(NAIST, Japan; Shinshu University, Japan; Singapore Management University, Singapore)
GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development process. Additionally, it offers a unique opportunity to observe how developers respond when they disagree with the generated content. In our study, we employ a mixed-methods approach, blending quantitative analysis with qualitative insights, to examine 18,256 PRs in which parts of the descriptions were crafted by generative AI. Our findings indicate that: (1) Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption. (2) PRs enhanced by Copilot for PRs require less review time and have a higher likelihood of being merged. (3) Developers using Copilot for PRs often complement the automated descriptions with their manual input. These results offer valuable insights into the growing integration of generative AI in software development.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
AI-Assisted Code Authoring at Scale: Fine-Tuning, Deploying, and Mixed Methods Evaluation
Vijayaraghavan Murali ORCID logo, Chandra Maddila ORCID logo, Imad Ahmad ORCID logo, Michael Bolin ORCID logo, Daniel Cheng ORCID logo, Negar Ghorbani ORCID logo, Renuka Fernandez ORCID logo, Nachiappan Nagappan ORCID logo, and Peter C. Rigby ORCID logo
(Meta Platforms, USA; Meta, USA; Concordia University, Canada)
Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that addresses these challenges. To release an LLM at this scale, we needed to first ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of 1.4× and 4.1× over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conduct a thematic analysis on the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.

Publisher's Version
Improving the Learning of Code Review Successive Tasks with Cross-Task Knowledge Distillation
Oussama Ben Sghaier ORCID logo and Houari Sahraoui ORCID logo
(Université de Montréal, Canada)
Code review is a fundamental process in software development that plays a pivotal role in ensuring code quality and reducing the likelihood of errors and bugs. However, code review can be complex, subjective, and time-consuming. Quality estimation, comment generation, and code refinement constitute the three key tasks of this process, and their automation has traditionally been addressed separately in the literature using different approaches. In particular, recent efforts have focused on fine-tuning pre-trained language models to aid in code review tasks, with each task being considered in isolation. We believe that these tasks are interconnected, and their fine-tuning should consider this interconnection. In this paper, we introduce a novel deep-learning architecture, named DISCOREV, which employs cross-task knowledge distillation to address these tasks simultaneously. In our approach, we utilize a cascade of models to enhance both comment generation and code refinement models. The fine-tuning of the comment generation model is guided by the code refinement model, while the fine-tuning of the code refinement model is guided by the quality estimation model. We implement this guidance using two strategies: a feedback-based learning objective and an embedding alignment objective. We evaluate DISCOREV by comparing it to state-of-the-art methods based on independent training and fine-tuning. Our results show that our approach generates better review comments, as measured by the BLEU score, as well as more accurate code refinement according to the CodeBLEU score.

Publisher's Version
Refactoring to Pythonic Idioms: A Hybrid Knowledge-Driven Approach Leveraging Large Language Models
Zejun Zhang ORCID logo, Zhenchang Xing ORCID logo, Xiaoxue Ren ORCID logo, Qinghua Lu ORCID logo, and Xiwei Xu ORCID logo
(Australian National University, Australia; CSIRO’s Data61, Australia; Zhejiang University, China)
Pythonic idioms are highly valued and widely used in the Python programming community. However, many Python users find it challenging to use Pythonic idioms. Adopting a rule-based or LLM-only approach is not sufficient to overcome three persistent challenges of code idiomatization: code miss, wrong detection, and wrong refactoring. Motivated by the determinism of rules and adaptability of LLMs, we propose a hybrid approach consisting of three modules. We not only write prompts to instruct LLMs to complete tasks, but we also invoke Analytic Rule Interfaces (ARIs) to accomplish tasks. The ARIs are Python code generated by prompting LLMs. We first construct a knowledge module with three elements (ASTscenario, ASTcomponent, and Condition) and prompt LLMs to generate Python code for incorporation into an ARI library for subsequent use. After that, for any syntax-error-free Python code, we invoke ARIs from the ARI library to extract ASTcomponent from the ASTscenario, and then filter out any ASTcomponent that does not meet the condition. Finally, we design prompts to instruct LLMs to abstract and idiomatize code, and then invoke ARIs from the ARI library to rewrite non-idiomatic code into idiomatic code. Next, we conduct a comprehensive evaluation of our approach, RIdiom, and Prompt-LLM on nine established Pythonic idioms in RIdiom. Our approach exhibits superior accuracy, F1 score, and recall, while maintaining precision levels comparable to RIdiom, all of which consistently exceed or come close to 90% for each metric of each idiom. Lastly, we extend our evaluation to encompass four new Pythonic idioms. Our approach consistently outperforms Prompt-LLM, achieving metrics with values consistently exceeding 90% for accuracy, F1-score, precision, and recall.

Publisher's Version
Java JIT Testing with Template Extraction
Zhiqiang Zang ORCID logo, Fu-Yao Yu ORCID logo, Aditya Thimmaiah ORCID logo, August Shi ORCID logo, and Milos Gligoric ORCID logo
(University of Texas at Austin, USA)
We present LeJit, a template-based framework for testing Java just-in-time (JIT) compilers. Like recent template-based frameworks, LeJit executes a template (a program with holes to be filled) to generate concrete programs given as inputs to Java JIT compilers. LeJit automatically generates template programs from existing Java code by converting expressions to holes, as well as generating necessary glue code (i.e., code that generates instances of non-primitive types) to make generated templates executable. We have successfully used LeJit to test a range of popular Java JIT compilers, revealing five bugs in HotSpot, nine bugs in OpenJ9, and one bug in GraalVM. All of these bugs have been confirmed by Oracle and IBM developers, and 11 of these bugs were previously unknown, including two CVEs (Common Vulnerabilities and Exposures). Our comparison with several existing approaches shows that LeJit is complementary to them and is a powerful technique for ensuring Java JIT compiler correctness.
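
To make the notion of "a program with holes" concrete, here is a minimal Python sketch that enumerates the concrete programs obtainable from one template. It is only an illustration under stated assumptions: the template string, the hole names `h1`, `op`, `h2`, and the candidate fillings are all hypothetical, and LeJit's own extraction and hole-filling machinery is not shown.

```python
from itertools import product

# Hypothetical template: a Java-like statement with three named holes.
TEMPLATE = "int r = {h1} {op} {h2};"

# Hypothetical candidate fillings for each hole.
HOLE_CHOICES = {
    "h1": ["a", "1"],
    "op": ["+", "*"],
    "h2": ["b", "Integer.MAX_VALUE"],
}


def fill_template(template: str, choices: dict[str, list[str]]) -> list[str]:
    """Enumerate every concrete program obtained by filling all holes."""
    names = list(choices)
    programs = []
    for values in product(*(choices[name] for name in names)):
        programs.append(template.format(**dict(zip(names, values))))
    return programs


programs = fill_template(TEMPLATE, HOLE_CHOICES)
```

Each generated string would then be compiled and executed under the JIT compilers being tested; exhaustive enumeration is used here only because the example is tiny.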

Publisher's Version
BRF: Fuzzing the eBPF Runtime
Hsin-Wei Hung ORCID logo and Ardalan Amiri Sani ORCID logo
(University of California at Irvine, USA)
The eBPF technology in the Linux kernel has been widely adopted for different applications, such as networking, tracing, and security, thanks to the programmability it provides. By allowing user-supplied eBPF programs to be executed directly in the kernel, it greatly increases the flexibility and efficiency of deploying customized logic. However, eBPF also introduces a new and wide attack surface: malicious eBPF programs may try to exploit the vulnerabilities in the eBPF subsystem in the kernel. Fuzzing is a promising technique to find such vulnerabilities. Unfortunately, our experiments with the state-of-the-art kernel fuzzer, Syzkaller, show that it cannot effectively fuzz the eBPF runtime, those components that are in charge of executing an eBPF program, for two reasons. First, the eBPF verifier (which is tasked with verifying the safety of eBPF programs) rejects many fuzzing inputs because (1) they do not comply with its required semantics or (2) they miss some dependencies, i.e., other syscalls that need to be issued before the program is loaded. Second, Syzkaller fails to attach and trigger the execution of eBPF programs most of the time. This paper introduces the BPF Runtime Fuzzer (BRF), a fuzzer that can satisfy the semantics and dependencies required by the verifier and the eBPF subsystem. Our experiments show that, in 48-hour fuzzing sessions, BRF can successfully execute 8× more eBPF programs compared to Syzkaller (and 32× more programs compared to Buzzer, an eBPF fuzzer recently released by Google). Moreover, eBPF programs generated by BRF are much more expressive than Syzkaller’s. As a result, BRF achieves 101% higher code coverage. Finally, BRF has so far managed to find 6 vulnerabilities (2 of them have been assigned CVE numbers) in the eBPF runtime, proving its effectiveness.

Publisher's Version
DTD: Comprehensive and Scalable Testing for Debuggers
Hongyi Lu ORCID logo, Zhibo Liu ORCID logo, Shuai Wang ORCID logo, and Fengwei Zhang ORCID logo
(Southern University of Science and Technology, China; Hong Kong University of Science and Technology, China)
As a powerful tool for developers, interactive debuggers help locate and fix errors in software. By using debugging information included in binaries, debuggers can retrieve necessary program states about the program. Unlike printf-style debugging, debuggers allow for more flexible inspection and modification of program execution states. However, debuggers may incorrectly retrieve and interpret program execution, causing confusion and hindering the debugging process. Despite the wide usage of interactive debuggers, a scalable and comprehensive measurement of their functionality correctness does not exist yet. Existing works either fall short in scalability or focus more on the “compiler-side” defects instead of debugger bugs. To facilitate a better assessment of debugger correctness, we first propose and advocate a set of debugger testing criteria, covering both comprehensiveness (in terms of debug information covered) and scalability (in terms of testing overhead). Moreover, we design comparative experiments to show that fulfilling these criteria is not only theoretically appealing, but also brings major improvement to debugger testing. Furthermore, based on these criteria, we present DTD, a differential testing (DT) framework for detecting bugs in interactive debuggers. DTD compares the behaviors of two mainstream debuggers when processing an identical C executable — discrepancies indicate bugs in one of the two debuggers. DTD leverages a novel heuristic method to avoid the repetitive structures (e.g., loops) that exist in C programs, which facilitates DTD to achieve full debug information coverage efficiently. Moreover, we have also designed a Temporal Differential Filtering method to practically filter out the false positives caused by the uninitialized variables in common C programs. With these carefully designed techniques, DTD fulfills our proposed testing requirements and, therefore, achieves high scalability and testing comprehensiveness. 
For the first time, it offers large-scale testing for C debuggers to detect debugger behavior discrepancies when inspecting millions of program states. An empirical comparison shows that DTD finds 17× more error-triggering cases and detects 5× more bugs than the state-of-the-art debugger testing technique. We have used DTD to detect 13 bugs in the LLVM toolchain (Clang/LLDB) and 5 bugs in the GNU toolchain (GCC/GDB). One of our fixes has already landed in the latest LLDB development branch.
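
The core differential-testing step can be sketched as follows. This is a simplified Python illustration, not DTD's implementation: the trace format (line number mapped to the variable values a debugger reports) and the function name `diff_debuggers` are assumptions made for the example, and the filtering of false positives from uninitialized variables is not modeled.

```python
def diff_debuggers(trace_a: dict, trace_b: dict) -> list[tuple]:
    """Return (location, variable, value_a, value_b) for every disagreement."""
    discrepancies = []
    # Only compare program locations that both debuggers stopped at.
    for location in sorted(trace_a.keys() & trace_b.keys()):
        vars_a, vars_b = trace_a[location], trace_b[location]
        for name in sorted(vars_a.keys() | vars_b.keys()):
            value_a = vars_a.get(name, "<unavailable>")
            value_b = vars_b.get(name, "<unavailable>")
            if value_a != value_b:
                discrepancies.append((location, name, value_a, value_b))
    return discrepancies


# Hypothetical traces: line number -> {variable: value reported by the debugger}.
debugger_a = {10: {"i": 0, "sum": 0}, 11: {"i": 1, "sum": 5}}
debugger_b = {10: {"i": 0, "sum": 0}, 11: {"i": 1, "sum": 7}}
report = diff_debuggers(debugger_a, debugger_b)
```

A non-empty report only says that the two debuggers disagree; deciding which one is wrong still requires inspection.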

Publisher's Version Info
PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models
Simin Chen ORCID logo, XiaoNing Feng ORCID logo, Xiaohong Han ORCID logo, Cong Liu ORCID logo, and Wei Yang ORCID logo
(University of Texas at Dallas, USA; Taiyuan University of Technology, China; University of California at Riverside, USA)
In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Within the surge of LCGM proposals, a critical aspect of code generation research involves effectively benchmarking the programming capabilities of models. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs' potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDEs, making them ineffective and unrealistic.
Addressing the aforementioned limitations presents several challenges: (1) how to automatically generate semantically diverse canonical solutions to enable comprehensive benchmarking on the models, (2) how to ensure long-term data integrity to prevent data contamination, and (3) how to generate natural and realistic programming problems. To tackle the first challenge, we draw key insights from viewing a program as a series of mappings from the input to the output domain. These mappings can be transformed, split, reordered, or merged to construct new programs. Based on this insight, we propose programming problem merging, where two existing programming problems are combined to create new ones. In addressing the second challenge, we incorporate randomness into our programming problem-generation process. Our tool can probabilistically guarantee no data repetition across two random trials. To tackle the third challenge, we propose the concept of a Lambda Programming Problem, comprising a concise one-sentence task description in natural language accompanied by a corresponding program implementation. Our tool ensures the program prompt is grammatically correct. Additionally, the tool leverages return value type analysis to verify the correctness of newly created canonical solutions. In our empirical evaluation, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, compared to the baselines.
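
One way to picture "programming problem merging" is as function composition over two one-sentence problems. The Python sketch below is a loose illustration under that assumption: the `Problem` type, the `merge` helper, the description-joining rule, and the run-on-a-sample check (a crude stand-in for return value type analysis) are all hypothetical and not taken from the tool.

```python
from typing import Callable, NamedTuple


class Problem(NamedTuple):
    description: str     # one-sentence task description
    solution: Callable   # canonical solution


def merge(first: Problem, second: Problem, sample_input) -> Problem:
    """Build a new problem whose solution feeds `first`'s output into `second`."""
    def merged_solution(x):
        return second.solution(first.solution(x))

    # Crude stand-in for a return-type check: the composed solution must run
    # on a sample input without raising a TypeError.
    try:
        merged_solution(sample_input)
    except TypeError as exc:
        raise ValueError("problems are not composable") from exc

    tail = second.description[0].lower() + second.description[1:]
    description = f"{first.description.rstrip('.')}, then {tail}"
    return Problem(description, merged_solution)


sort_list = Problem("Sort the list in ascending order.", sorted)
take_max = Problem("Return the largest element.", max)
merged = merge(sort_list, take_max, [3, 1, 2])
```

Merging in the opposite order fails the check, because `max` returns a single number that `sorted` cannot iterate over.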

Publisher's Version
Metamorphic Testing of Secure Multi-party Computation (MPC) Compilers
Yichen Li ORCID logo, Dongwei Xiao ORCID logo, Zhibo Liu ORCID logo, Qi Pang ORCID logo, and Shuai Wang ORCID logo
(Hong Kong University of Science and Technology, China; Carnegie Mellon University, USA)
The demanding need to perform privacy-preserving computations among multiple data owners has led to the prosperous development of secure multi-party computation (MPC) protocols. MPC offers protocols for parties to jointly compute a function over their inputs while keeping those inputs private. To date, MPC has been widely adopted in various real-world, privacy-sensitive sectors, such as healthcare and finance. Moreover, to ease the adoption of MPC, industrial and academic MPC compilers have been developed to automatically translate programs describing arbitrary MPC procedures into low-level MPC executables.
Compiling high-level descriptions into high-efficiency MPC executables is challenging: the compilation often involves converting high-level languages into several intermediate representations (IR), e.g., arithmetic or boolean circuits, optimizing the computation/communication cost, and picking proper MPC protocols (and underlying virtual machines) for a particular task and threat model. Various optimizations and heuristics are employed during the compilation procedure to improve the efficiency of the generated MPC executables.
Despite the prosperous adoption of MPC compilers by industrial vendors and academia, a principled and systematic understanding of the correctness of MPC compilers does not yet exist. To fill this critical gap, this paper introduces MT-MPC, a metamorphic testing (MT) framework specifically designed for MPC compilers to effectively uncover erroneous compilations. Our approach proposes three metamorphic relations (MRs) that are tailored for MPC programs to mutate high-level MPC programs (compiler inputs). We then examine if MPC compilers yield semantics-equivalent MPC executables regarding the original and mutated MPC programs by comparing their execution results.
Real-world MPC compilers exhibit a high level of engineering quality. Nevertheless, we detected 4,772 inputs that can result in erroneous compilations in three popular MPC compilers available on the market. While the discovered error-triggering inputs do not cause the MPC compilers to crash directly, they can lead to the generation of incorrect MPC executables, jeopardizing the underlying dependability of the computation. With substantial manual effort and help from the MPC compiler developers, we uncovered thirteen bugs in these MPC compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve MPC compilers.
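
The metamorphic-testing loop described above can be sketched on a toy expression language. Everything in this Python example is an assumption made for illustration: the tuple-based program representation, the stand-in `compile_expr` "compiler", and the operand-swapping mutation are hypothetical and are not one of the paper's three metamorphic relations.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def compile_expr(expr):
    """Toy 'compiler': turn a nested tuple such as ("+", "x", 1) into a callable."""
    if isinstance(expr, tuple):
        op, left, right = expr
        f, g = compile_expr(left), compile_expr(right)
        return lambda env: OPS[op](f(env), g(env))
    if isinstance(expr, str):
        return lambda env: env[expr]   # variable lookup
    return lambda env: expr            # integer constant


def swap_operands(expr):
    """Semantics-preserving mutation: swap operands of commutative operators."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return (op, swap_operands(right), swap_operands(left))
    return expr


def violates_relation(expr, inputs, compiler=compile_expr) -> bool:
    """True if the original and mutated programs disagree on any input."""
    original, mutated = compiler(expr), compiler(swap_operands(expr))
    return any(original(env) != mutated(env) for env in inputs)


program = ("+", ("*", "x", 2), "y")
inputs = [{"x": 1, "y": 2}, {"x": -3, "y": 7}]
```

With a correct compiler `violates_relation(program, inputs)` is false; a compiler that miscompiles `+` would make the two executables disagree on at least one input.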

Publisher's Version
Understanding the Impact of APIs Behavioral Breaking Changes on Client Applications
Dhanushka Jayasuriya ORCID logo, Valerio Terragni ORCID logo, Jens Dietrich ORCID logo, and Kelly Blincoe ORCID logo
(University of Auckland, New Zealand; Victoria University of Wellington, New Zealand)
Libraries play a significant role in software development as they provide reusable functionality, which helps expedite the development process. As libraries evolve, they release new versions with optimisations like new functionality, bug fixes, and patches for known security vulnerabilities. To obtain these optimisations, the client applications that depend on these libraries must update to use the latest version. However, this can cause software failures in the clients if the update includes breaking changes. These breaking changes can be divided into syntactic and semantic (behavioral) breaking changes. While there has been considerable research on syntactic breaking changes introduced between library updates and their impact on client projects, there is a notable lack of research regarding behavioral breaking changes introduced during library updates and their impacts on clients. We conducted an empirical analysis to identify the impact behavioral breaking changes have on clients by examining the impact of dependency updates on client test suites. We examined a set of Java projects built using Maven, which included 30,548 dependencies under 8,086 Maven artifacts. We automatically updated out-of-date dependencies and ran the client test suites. We found that 2.30% of these updates had behavioral breaking changes that impacted client tests. Our results show that most breaking changes were introduced during a non-Major dependency update, violating the semantic versioning scheme. We further analyzed the effects these behavioral breaking changes have on client tests. We present a taxonomy of effects related to these changes, which we broadly categorize as Test Failures and Test Errors. Our results further indicate that the library developers did not adequately document the exceptions thrown due to precondition violations.
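
The semantic-versioning check implied above can be sketched in a few lines of Python. This is a simplified illustration, not the study's tooling: the helper names are hypothetical, and versions are assumed to be plain numeric `MAJOR.MINOR.PATCH` strings without qualifiers.

```python
def update_kind(old: str, new: str) -> str:
    """Classify a MAJOR.MINOR.PATCH version change."""
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    labels = ("major", "minor", "patch")
    for label, before, after in zip(labels, old_parts, new_parts):
        if before != after:
            return label
    return "none"


def violates_semver(old: str, new: str, client_tests_broke: bool) -> bool:
    """Under semantic versioning, only major updates may break clients."""
    return client_tests_broke and update_kind(old, new) != "major"
```

For example, an update from 1.2.3 to 1.3.0 that breaks a client test counts as a violation, whereas the same breakage on an update to 2.0.0 does not.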

Publisher's Version Published Artifact Artifacts Available
Effective Teaching through Code Reviews: Patterns and Anti-patterns
Anita Sarma ORCID logo and Nina Chen ORCID logo
(Oregon State University, USA; Google, USA)
Code reviews are a ubiquitous and essential part of the software development process. They also offer a unique, at-scale opportunity for teaching developers in the context of their day-to-day development activities versus something more removed and formal, like a class. Yet there is little research on effective teaching through code reviews: focusing on learning for the author and not just changes to the code. We address this gap through a case study at Google: interviews with 14 developers revealed 12 patterns and 15 anti-patterns in code reviews that impact learning. For instance, explanatory rationale, sample solutions backed by standards, and a constructive tone facilitate learning, whereas harsh comments, excessive shallow critiques, and non-pragmatic reviewing that ignores authors' constraints hinder learning. We validated our qualitative findings through member checking, interviews with reviewers, a literature review, and a survey of 324 developers. This comprehensive study provides empirical evidence of how social dynamics in code reviews impact learning. Based on our findings, we provide practical recommendations on how to frame constructive reviews to create a supportive learning environment.

Publisher's Version
An Analysis of the Costs and Benefits of Autocomplete in IDEs
Shaokang Jiang ORCID logo and Michael Coblenz ORCID logo
(University of California at San Diego, San Diego, USA)
Many IDEs support an autocomplete feature, which may increase developer productivity by reducing typing requirements and by providing convenient access to relevant information. However, to date, there has been no evaluation of the actual benefit of autocomplete to programmers. We conducted a between-subjects experiment (N=32) using an eye tracker to evaluate the costs and benefits of IDE-based autocomplete features to programmers who use an unfamiliar API. Participants who used autocomplete spent significantly less time reading documentation and got significantly higher scores on our post-study API knowledge test, indicating that it helped them learn more about the API. However, autocomplete did not significantly reduce the number of keystrokes required to finish tasks. We conclude that the primary benefit of autocomplete is in providing information, not in reducing time spent typing.

Publisher's Version Published Artifact Artifacts Available
Decomposing Software Verification using Distributed Summary Synthesis
Dirk Beyer ORCID logo, Matthias Kettl ORCID logo, and Thomas Lemberger ORCID logo
(LMU Munich, Germany)
There are many approaches for automated software verification, but they are either imprecise, do not scale well to large systems, or do not sufficiently leverage parallelization. This hinders the integration of software model checking into the development process (continuous integration). We propose an approach to decompose one large verification task into multiple smaller, connected verification tasks, based on blocks in the program control flow. For each block, summaries (block contracts) are computed — based on independent, distributed, continuous refinement by communication between the blocks. The approach iteratively synthesizes preconditions to assume at the block entry (computed from postconditions received from block predecessors, i.e., which program states reach this block) and violation conditions to check at the block exit (computed from violation conditions received from block successors, i.e., which program states lead to a specification violation). This separation of concerns leads to an architecture in which all blocks can be analyzed in parallel, as independent verification problems. Whenever new information (as a postcondition or violation condition) is available from other blocks, the verification can decide to restart with this new information. We realize our approach on the basis of configurable program analysis and implement it for the verification of C programs in the widely used verifier CPAchecker. A large experimental evaluation shows the potential of our new approach: The distribution of the workload to several processing units works well, and there is a significant reduction of the response time when using multiple processing units. There are even cases in which the new approach beats the highly-tuned, existing single-threaded predicate abstraction.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Can GPT-4 Replicate Empirical Software Engineering Research?
Jenny T. Liang ORCID logo, Carmen Badea ORCID logo, Christian Bird ORCID logo, Robert DeLine ORCID logo, Denae Ford ORCID logo, Nicole Forsgren ORCID logo, and Thomas Zimmermann ORCID logo
(Carnegie Mellon University, USA; Microsoft Research, USA)
Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4’s abilities to perform replications of empirical software engineering research on new data. We specifically study their ability to surface assumptions made in empirical software engineering research methodologies, as well as their ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

Publisher's Version
A Critical Review of Common Log Data Sets Used for Evaluation of Sequence-Based Anomaly Detection Techniques
Max Landauer ORCID logo, Florian Skopik ORCID logo, and Markus Wurzenberger ORCID logo
(AIT Austrian Institute of Technology, Austria)
Log data store event execution patterns that correspond to underlying workflows of systems or applications. While most logs are informative, log data also include artifacts that indicate failures or incidents. Accordingly, log data are often used to evaluate anomaly detection techniques that aim to automatically disclose unexpected or otherwise relevant system behavior patterns. Recently, detection approaches leveraging deep learning have increasingly focused on anomalies that manifest as changes of sequential patterns within otherwise normal event traces. Several publicly available data sets, such as HDFS, BGL, Thunderbird, OpenStack, and Hadoop, have since become standards for evaluating these anomaly detection techniques; however, the appropriateness of these data sets has not been closely investigated in the past. In this paper we therefore analyze six publicly available log data sets with focus on the manifestations of anomalies and simple techniques for their detection. Our findings suggest that most anomalies are not directly related to sequential manifestations and that advanced detection techniques are not required to achieve high detection rates on these data sets.
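
To show what a "simple technique" that ignores event order might look like, here is a minimal Python sketch that flags any sequence containing an event type never seen in normal training data. It is a hypothetical baseline chosen for illustration, not necessarily one of the detectors evaluated in the paper, and the event IDs are made up.

```python
def fit_known_events(normal_sequences):
    """Collect every event type that occurs in known-normal sequences."""
    return {event for sequence in normal_sequences for event in sequence}


def is_anomalous(sequence, known_events) -> bool:
    """Flag a sequence that contains any event type unseen during training."""
    return any(event not in known_events for event in sequence)


# Hypothetical event-ID sequences; real data sets use parsed log templates.
train = [["E1", "E2", "E3"], ["E1", "E3", "E2"]]
known = fit_known_events(train)
```

Because the detector looks only at the set of event types, a reordered but otherwise familiar sequence is never flagged, which is exactly the kind of non-sequential check the abstract contrasts with sequence-based models.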

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
SimLLM: Calculating Semantic Similarity in Code Summaries using a Large Language Model-Based Approach
Xin Jin ORCID logo and Zhiqiang Lin ORCID logo
(Ohio State University, USA)
Code summaries are pivotal in software engineering, serving to improve code readability, maintainability, and collaboration. While recent advancements in Large Language Models (LLMs) have opened new avenues for automatic code summarization, existing metrics for evaluating summary quality, such as BLEU and BERTScore, have notable limitations. Specifically, these existing metrics either fail to capture the nuances of semantic meaning in summaries or are further limited in understanding domain-specific terminologies and expressions prevalent in code summaries. In this paper, we present SimLLM, a novel LLM-based approach designed to more precisely evaluate the semantic similarity of code summaries. Built upon an autoregressive LLM using a specialized pretraining task on permutated inputs and a pooling-based pairwise similarity measure, SimLLM overcomes the shortcomings of existing metrics. Our empirical evaluations demonstrate that SimLLM not only outperforms existing metrics but also shows a significantly high correlation with human ratings.
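
A pooling-based pairwise similarity measure can be sketched as follows. This Python example assumes mean pooling over per-token vectors followed by cosine similarity; the abstract does not state which pooling SimLLM uses, so both the pooling choice and the function names are assumptions, and the token vectors would in practice come from the language model.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single summary vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def summary_similarity(tokens_a, tokens_b):
    """Pool each summary's token vectors, then compare the pooled vectors."""
    return cosine(mean_pool(tokens_a), mean_pool(tokens_b))
```

Pooling lets summaries of different lengths be compared, since each is reduced to one fixed-size vector before the pairwise comparison.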

Publisher's Version Published Artifact Artifacts Available
On the Contents and Utility of IoT Cybersecurity Guidelines
Jesse Chen ORCID logo, Dharun Anandayuvaraj ORCID logo, James C. Davis ORCID logo, and Sazzadur Rahaman ORCID logo
(University of Arizona, USA; Purdue University, USA)
Cybersecurity concerns of Internet of Things (IoT) devices and infrastructure are growing each year. In response, organizations worldwide have published IoT security guidelines to protect their citizens and customers by providing recommendations on the development and operation of IoT systems. While these guidelines are being adopted, e.g., by US federal contractors, their content and merits have not been critically examined. Specifically, we do not know what topics and recommendations they cover and their effectiveness at preventing real-world IoT failures. In this paper, we address these gaps through a qualitative study of guidelines. We collect 142 IoT cybersecurity guidelines and sample them for recommendations until reaching saturation at 25 guidelines. From the resulting 958 unique recommendations, we iteratively develop a hierarchical taxonomy following grounded theory coding principles and study the guidelines’ comprehensiveness. In addition, we evaluate the actionability and specificity of each recommendation and match recommendations to CVEs and security failures in the news they can prevent. We report that: (1) Each guideline has gaps in its topic coverage and comprehensiveness; (2) 87.2% of recommendations are actionable and 38.7% of recommendations can prevent specific threats; and (3) although the union of the guidelines mitigates all 17 of the failures from our news stories corpus, 21% of the CVEs evade the guidelines. In summary, we report shortcomings in each guideline’s depth and breadth, but as a whole they address major security issues.

Publisher's Version Info
A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization
Sungmin Kang, Gabin An, and Shin Yoo
(KAIST, South Korea)
Fault Localization (FL), in which a developer seeks to identify which part of the code is malfunctioning and needs to be fixed, is a recurring challenge in debugging. To reduce developer burden, many automated FL techniques have been proposed. However, prior work has noted that existing techniques fail to provide rationales for the suggested locations, hindering developer adoption of these techniques. With this in mind, we propose AutoFL, a Large Language Model (LLM)-based FL technique that generates an explanation of the bug along with a suggested fault location. AutoFL prompts an LLM to use function calls to navigate a repository, so that it can effectively localize faults over a large software repository and overcome the limit of the LLM context length. Extensive experiments on 798 real-world bugs in Java and Python reveal AutoFL improves method-level acc@1 by up to 233.3% over baselines. Furthermore, developers were interviewed on their impression of AutoFL-generated explanations, showing that developers generally liked the natural language explanations of AutoFL, and that they preferred reading a few, high-quality explanations instead of many.

Publisher's Version Info
Static Application Security Testing (SAST) Tools for Smart Contracts: How Far Are We?
Kaixuan Li, Yue Xue, Sen Chen, Han Liu, Kairan Sun, Ming Hu, Haijun Wang, Yang Liu, and Yixiang Chen
(East China Normal University, China; Metatrust Labs, Singapore; Tianjin University, China; Nanyang Technological University, Singapore; Xi’an Jiaotong University, China)
In recent years, the importance of smart contract security has been heightened by the increasing number of attacks against them. To address this issue, a multitude of static application security testing (SAST) tools have been proposed for detecting vulnerabilities in smart contracts. However, objectively comparing these tools to determine their effectiveness remains challenging. Existing studies often fall short due to the taxonomies and benchmarks only covering a coarse and potentially outdated set of vulnerability types, which leads to evaluations that are not entirely comprehensive and may display bias. In this paper, we fill this gap by proposing an up-to-date and fine-grained taxonomy that includes 45 unique vulnerability types for smart contracts. Taking it as a baseline, we develop an extensive benchmark that covers 40 distinct types and includes a diverse range of code characteristics, vulnerability patterns, and application scenarios. Based on them, we evaluated 8 SAST tools using this benchmark, which comprises 788 smart contract files and 10,394 vulnerabilities. Our results reveal that the existing SAST tools fail to detect around 50% of vulnerabilities in our benchmark and suffer from high false positives, with precision not surpassing 10%. We also discover that by combining the results of multiple tools, the false negative rate can be reduced effectively, at the expense of flagging 36.77 percentage points more functions. Nevertheless, many vulnerabilities, especially those beyond Access Control and Reentrancy vulnerabilities, remain undetected. We finally highlight the valuable insights from our study, hoping to provide guidance on tool development, enhancement, evaluation, and selection for developers, researchers, and practitioners.

Publisher's Version
A Deep Dive into Large Language Models for Automated Bug Localization and Repair
Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, and Omer Tripp
(University of Virginia, USA; Purdue University, USA; Amazon Web Services, USA)
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug localization and repair utilizing LLMs. In contrast to many deep learning-based APR methods that assume known bug locations, rely on line-level localization tools, or address bug prediction and fixing in one step, our approach uniquely employs LLMs to predict bug location at the token level and subsequently utilizes them for bug fixing. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information and improved incorporation of inductive biases. We introduce Toggle: Token-Granulated Bug Localization and Repair, a comprehensive program repair framework that integrates a bug localization model, an adjustment model to address tokenizer inconsistencies, and a bug-fixing model. Toggle takes a buggy function as input and generates a complete corrected function. We investigate various styles of prompting to the bug fixing model to identify the most effective prompts that better utilize the inductive bias and significantly outperform others. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark, and exhibits better or comparable performance on several other widely-used APR datasets, including Defects4J. In the Defects4J benchmark, our approach consistently ranks above other methods, achieving superior results in the Top-10, Top-30, Top-50, and Top-100 metrics. Besides examining Toggle’s generalizability to unseen data and evaluating the effectiveness of various prompts, we also investigate the impact of additional contextual information such as buggy lines and code comments on bug localization, and explore the importance of the adjustment model. Our extensive experiments offer valuable insights and answers to critical research questions.

Publisher's Version
Towards Efficient Build Ordering for Incremental Builds with Multiple Configurations
Jun Lyu, Shanshan Li, He Zhang, Lanxin Yang, Bohan Liu, and Manuel Rigger
(Nanjing University, China; National University of Singapore, Singapore)
Software products have many configurations to meet different environments and diverse needs. Building software with multiple software configurations typically incurs high costs in terms of build time and computing resources. Incremental builds could reuse intermediate artifacts if configuration settings affect only a portion of the build artifacts. The efficiency gains depend on the strategic ordering of the incremental builds as the order influences which build artifacts can be reused. Deriving an efficient order is challenging and an open problem, since it is infeasible to reliably determine the degree of re-use and time savings before an actual build. In this paper, we propose an approach called BUDDI (BUild Declaration DIstance) for C-based and Make-based projects to derive an efficient order for incremental builds from the static information provided by the build scripts (i.e., Makefile). The core strategy of BUDDI is to measure the distance between the build declarations of configurations and predict the build size of a configuration from the build targets and build commands in each configuration, since some artifacts can be reused in subsequent builds when the build scripts for different configurations are close to each other. We implemented BUDDI as an automated tool called BuddiPlanner and evaluated it on 20 popular open-source projects, by comparing it to a baseline that randomly selects a build order. The experimental results show that the order created by BuddiPlanner outperforms 96.5% (193/200) of the random build orders in terms of build time and reduces the build time by an average of 305.94s (26%) compared to the random build orders, with a median saving of 64.88s (28%). BuddiPlanner demonstrates its potential to relieve practitioners of excessive build times and computational resource burdens caused by building multiple software configurations.
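The ordering idea in this abstract can be illustrated with a minimal Python sketch. It is not BuddiPlanner: it assumes each configuration is summarized as a set of build-command strings (hypothetical data), uses Jaccard distance as a stand-in for the build declaration distance, and picks the next build greedily by nearest neighbor. All function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Distance between two sets of build commands (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_build_order(configs, start):
    """Order configurations so that each build is close to the previous one."""
    order = [start]
    remaining = set(configs) - {start}
    while remaining:
        last = configs[order[-1]]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = min(sorted(remaining),
                  key=lambda name: jaccard_distance(last, configs[name]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical configurations, each reduced to its set of build commands.
configs = {
    "default": {"cc -c main.c", "cc -c net.c", "cc -c ui.c"},
    "no_ui": {"cc -c main.c", "cc -c net.c"},
    "debug": {"cc -g -c main.c", "cc -g -c net.c", "cc -g -c ui.c"},
    "debug_no_ui": {"cc -g -c main.c", "cc -g -c net.c"},
}
order = greedy_build_order(configs, "default")
print(order)
```

In this toy input, configurations that share compile commands end up adjacent, which is the situation in which an incremental build could reuse artifacts from the previous build.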

Publisher's Version
On Reducing Undesirable Behavior in Deep-Reinforcement-Learning-Based Software
Ophir M. Carmel and Guy Katz
(Hebrew University of Jerusalem, Israel)
Deep reinforcement learning (DRL) has proven extremely useful in a large variety of application domains. However, even successful DRL-based software can exhibit highly undesirable behavior. This is due to DRL training being based on maximizing a reward function, which typically captures general trends but cannot precisely capture, or rule out, certain behaviors of the model. In this paper, we propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software, while maintaining its excellent performance. In addition, our framework can assist in providing engineers with a comprehensible characterization of such undesirable behavior. Under the hood, our approach is based on extracting decision tree classifiers from erroneous state-action pairs, and then integrating these trees into the DRL training loop, penalizing the model whenever it makes an error. We provide a proof-of-concept implementation of our approach, and use it to evaluate the technique on three significant case studies. We find that our approach can extend existing frameworks in a straightforward manner, and incurs only a slight overhead in training time. Further, it incurs only a very slight hit to performance, or in some cases even improves it, while significantly reducing the frequency of undesirable behavior.

Publisher's Version
Semi-supervised Crowdsourced Test Report Clustering via Screenshot-Text Binding Rules
Shengcheng Yu, Chunrong Fang, Quanjun Zhang, Mingzhe Du, Jia Liu, and Zhenyu Chen
(Nanjing University, China)
Due to the openness of the crowdsourced testing paradigm, crowdworkers submit a large number of spotty, duplicate test reports, which hinders developers from effectively reviewing the reports and detecting bugs. Test report clustering is widely used to alleviate this problem and improve the effectiveness of crowdsourced testing. Existing clustering methods mainly rely on the analysis of textual descriptions. A few methods are independently supplemented by analyzing screenshots in test reports as pixel sets, leaving out the semantics of app screenshots from the widget perspective. Further, ignoring the semantic relationships between screenshots and textual descriptions may lead to imprecise analysis of test reports, which in turn negatively affects the clustering effectiveness. This paper proposes a semi-supervised crowdsourced test report clustering approach, namely SemCluster. SemCluster extracts features from app screenshots and textual descriptions, respectively, and forms the structure feature, the content feature, the bug feature, and reproduction steps. The clustering is principally conducted on the basis of the four features. Further, in order to avoid bias from specific individual features, SemCluster exploits the semantic relationships between app screenshots and textual descriptions to form the semantic binding rules as guidance for clustering crowdsourced test reports. Experiment results show that SemCluster outperforms state-of-the-art approaches on six widely used metrics by 10.49%-200.67%, illustrating its effectiveness.

Publisher's Version
CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection
Shihan Dou, Yueming Wu, Haoxiang Jia, Yuhao Zhou, Yan Liu, and Yang Liu
(Fudan University, China; Nanyang Technological University, Singapore; Huazhong University of Science and Technology, China)
With the development of the open source community, code is often copied, spread, and evolved in multiple software systems, which brings uncertainty and risk to the software system (e.g., bug propagation and copyright infringement). Therefore, it is important to conduct code clone detection to discover similar code pairs. Many approaches have been proposed to detect code clones, among which token-based tools can scale to big code. However, due to the lack of program details, they cannot handle more complicated code clones, i.e., semantic code clones. In this paper, we introduce CC2Vec, a novel code encoding method designed to swiftly identify simple code clones while also enhancing the capability for semantic code clone detection. To retain the program details between tokens, CC2Vec divides them into different categories (i.e., typed tokens) according to the syntactic types and then applies two self-attention mechanism layers to encode them. To resist changes in the code structure of semantic code clones, CC2Vec performs contrastive learning to reduce the differences introduced by different code implementations. We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam) and the results show that our method can effectively detect simple code clones. In addition, CC2Vec not only attains comparable performance to widely used semantic code clone detection systems such as ASTNN, SCDetector, and FCCA by simply fine-tuning, but also significantly surpasses these methods in detection efficiency.

Publisher's Version
Exploring and Unleashing the Power of Large Language Models in Automated Code Translation
Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li
(Shandong University, China; Beihang University, China; City University of Hong Kong, China; Peking University, China)
Code translation tools, namely transpilers, are developed for automatic source-to-source translation. The latest learning-based transpilers have shown impressive enhancement against rule-based counterparts in both translation accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks, finding that: although certain LLMs have outperformed current transpilers, they still have some accuracy issues, where most of the failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types in translation (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by the above findings, we further propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the above auto-generated test cases to augment the code translation and then evaluate their correctness via execution. Afterward, UniTrans further (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six settings of translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested with UniTrans, and all achieve substantial improvements in terms of computational accuracy and exact match accuracy among almost all translation settings, showing the universal effectiveness of UniTrans in practice.

Publisher's Version
TIPS: Tracking Integer-Pointer Value Flows for C++ Member Function Pointers
Changwei Zou, Dongjie He, Yulei Sui, and Jingling Xue
(UNSW, Sydney, Australia; Chongqing University, China)
C++ is crucial in software development, providing low-level memory control for performance and supporting object-oriented programming to construct modular, reusable code structures. Consequently, tackling pointer analysis for C++ becomes challenging, given the need to address these two fundamental features. A relatively unexplored research area involves the handling of C++ member function pointers. Previous efforts have tended to either disregard this feature or adopt a conservative approach, resulting in unsound or imprecise results. C++ member function pointers, handling both virtual (via virtual table indexes) and non-virtual functions (through addresses), pose a significant challenge for pointer analysis due to the mix of integers and pointers, often resulting in unsound or imprecise analysis. We introduce TIPS, the first pointer analysis that effectively manages both pointers and integers, offering support for C++ member function pointers by tracking their value flows. Our evaluation of TIPS demonstrates its accuracy in identifying C++ member function call targets, a task where other tools falter, across fourteen large C++ programs from SPEC CPU, Qt, LLVM, Ninja, and GoogleTest, while maintaining low analysis overhead. In addition, our micro-benchmark suite, complete with ground truth data, allows for precise evaluation of points-to information for C++ member function pointers across various inheritance scenarios, highlighting TIPS's precision enhancements.

Publisher's Version
Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion
Md Shamimur Rahman, Zadia Codabux, and Chanchal K. Roy
(University of Saskatchewan, Canada)
Modern Code Review (MCR) is an integral part of the software development process where developers improve product quality through collaborative discussions. Unfortunately, these discussions can sometimes become heated by the presence of inappropriate behaviors such as personal attacks, insults, disrespectful comments, and derogatory conduct, often referred to as incivility. While researchers have extensively explored such incivility in various public domains, our understanding of its causes, consequences, and courses of action remains limited within the professional context of software development, specifically within code review discussions. To bridge this gap, our study draws upon the experience of 171 professional software developers representing diverse development practices across different geographical regions. Our findings reveal that more than half of these developers (56.72%) have encountered instances of workplace incivility, and a substantial portion of that group (83.70%) reported experiencing such incidents at least once a month. We also identified various causes, positive and negative consequences, and potential courses of action for uncivil communication. Moreover, to address the negative aspects of incivility, we propose a model for promoting civility that detects uncivil comments during communication and provides alternative civil suggestions while preserving the original comments’ semantics, enabling developers to engage in respectful and constructive discussions. An in-depth analysis of 2K uncivil review comments using eight different evaluation metrics and a manual evaluation suggested that our proposed approach could generate civil alternatives significantly better than the state-of-the-art politeness and detoxification models. Moreover, a survey involving 36 developers who used our civility model reported its effectiveness in enhancing online development interactions, fostering better relationships, increasing contributor involvement, and expediting development processes. Our research is a pioneer in generating civil alternatives for uncivil discussions in software development, opening new avenues for research in collaboration and communication within the software engineering context.

Publisher's Version Published Artifact Info Artifacts Available
Finding and Understanding Defects in Static Analyzers by Constructing Automated Oracles
Weigang He, Peng Di, Mengli Ming, Chengyu Zhang, Ting Su, Shijie Li, and Yulei Sui
(East China Normal University, China; University of Technology Sydney, Australia; Ant Group, China; ETH Zurich, Switzerland; UNSW, Sydney, Australia)
Static analyzers play a crucial role in helping find programming mistakes and security vulnerabilities. The correctness of their analysis results is crucial for their usability in practice. Potential defects in these analyzers (e.g., implementation errors, improper design choices) could affect their soundness (leading to false negatives) and precision (leading to false positives). However, finding the defects in off-the-shelf static analyzers is challenging because these analyzers usually lack clear and complete specifications, and the results of different analyzers may differ. To this end, this paper designs two novel types of automated oracles to find defects in static analyzers with randomly generated programs. The first oracle is constructed by using dynamic program executions and the second one leverages the inferred static analysis results. We applied these two oracles on three state-of-the-art static analyzers: Clang Static Analyzer (CSA), GCC Static Analyzer (GSA), and Pinpoint. We found 38 unique defects in these analyzers, 28 of which have been confirmed or fixed by the developers. We conducted a case study on these found defects followed by several insights and lessons learned for improving and better understanding static analyzers. We have made all the artifacts publicly available at https://github.com/Geoffrey1014/SA_Bugs for replication and to benefit the community.

Publisher's Version
Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-task Learning
Xiaoling Zhang, Zhengzi Xu, Shouguo Yang, Zhi Li, Zhiqiang Shi, and Limin Sun
(Institute of Information Engineering at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Nanyang Technological University, Singapore)
Reverse engineers can gain valuable insights from descriptive function names, which are absent in publicly released binaries. Recent advances in binary function name prediction using data-driven machine learning show promise. However, existing approaches encounter difficulties in capturing function semantics in diverse optimized binaries and fail to preserve the meaning of labels in function names. We propose Epitome, a framework that enhances function name prediction using votes-based name tokenization and multi-task learning, specifically tailored for binaries compiled at different optimization levels. Epitome learns comprehensive function semantics using a pre-trained assembly language model and a graph neural network, incorporating a function semantics similarity prediction task to maximize the similarity of function semantics across different compilation optimization levels. In addition, we present two data preprocessing methods to improve the comprehensibility of function names. We evaluate the performance of Epitome using 2,597,346 functions extracted from binaries compiled with 5 optimizations (O0-Os) for 4 architectures (x64, x86, ARM, and MIPS). Epitome outperforms the state-of-the-art function name prediction tool by up to 44.34%, 64.16%, and 54.44% in precision, recall, and F1 score, respectively, while also exhibiting superior generalizability.

Publisher's Version
Evaluating and Improving ChatGPT for Unit Test Generation
Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou
(Fudan University, China)
Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although the traditional techniques are able to generate tests with reasonable coverage, they are shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained on a massive developer-written code corpus, the models are capable of generating more human-like and meaningful test code. In this work, we perform the first empirical study to evaluate the capability of ChatGPT (i.e., one of the most representative LLMs with outstanding performance in code generation and comprehension) in unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); but the passing tests generated by ChatGPT almost resemble manually-written tests by achieving comparable coverage, readability, and even sometimes developers' preference. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by our findings above, we further propose ChatTester, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTester incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTester by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT. In addition to ChatGPT, we further investigate the generalization capabilities of ChatTester by applying it to two recent open-source LLMs (i.e., CodeLLama-Instruct and CodeFuse) and our results show that ChatTester can also improve the quality of tests generated by these LLMs.

Publisher's Version
How to Gain Commit Rights in Modern Top Open Source Communities?
Xin Tan, Yan Gong, Geyu Huang, Haohua Wu, and Li Zhang
(Beihang University, China)
The success of open source software (OSS) projects relies on voluntary contributions from various community roles. Among these roles, being a committer signifies gaining trust and higher privileges in OSS projects. Substantial studies have focused on the requirements of becoming a committer in OSS projects, but most of them are based on interviews or several hypotheses, lacking a comprehensive understanding of committers' qualifications. To address this knowledge gap, we explore both the policies and practical implementations of committer qualifications in modern top OSS communities. Through a thematic analysis of these policies, we construct a taxonomy of committer qualifications, consisting of 26 codes categorized into nine themes, including "Personnel-related to Project", "Communication", and "Long-term Participation". We also highlight the variations in committer qualifications emphasized in different OSS community governance models. For example, projects following the "core maintainer model" place great importance on project comprehension, while projects following the "company-backed model" place significant emphasis on user issue resolution. Based on the above findings, we propose eight sets of metrics and perform survival analysis on two representative OSS projects to understand how these qualifications are implemented in practice. We find that the probability of gaining commit rights decreases as participation time passes. The selection criteria in practice are generally consistent with the community policies. Developers who submit high-quality code, actively engage in code review, and make extensive contributions to related projects are more likely to be granted commit rights. However, there are some qualifications that do not align precisely, and some are not adequately evaluated. This study enhances trust understanding in top OSS communities, aids in optimal commit rights allocation, and empowers developers' self-actualization via OSS engagement.

Publisher's Version
An Empirical Study on Focal Methods in Deep-Learning-Based Approaches for Assertion Generation
Yibo He, Jiaming Huang, Hao Yu, and Tao Xie
(Peking University, China)
Unit testing is widely recognized as an essential aspect of the software development process. Generating high-quality assertions automatically is one of the most important and challenging problems in automatic unit test generation. To generate high-quality assertions, deep-learning-based approaches have been proposed in recent years. For state-of-the-art deep-learning-based approaches for assertion generation (DLAGs), the focal method (i.e., the main method under test) for a unit test case plays an important role as a required part of the input to these approaches. To use DLAGs in practice, there are two main ways to provide a focal method for these approaches: (1) manually providing a developer-intended focal method or (2) identifying a likely focal method from the given test prefix (i.e., complete unit test code excluding assertions) with test-to-code traceability techniques. However, the state-of-the-art DLAGs are all evaluated on the ATLAS dataset, where the focal method for a test case is assumed to be the last non-JUnit-API method invoked in the complete unit test code (i.e., code from both the test prefix and assertion portion). There are two issues with the existing empirical evaluations of DLAGs, causing inaccurate assessment of DLAGs toward adoption in practice. First, it is unclear whether the last method call before assertions (LCBA) technique can accurately reflect developer-intended focal methods. Second, when applying DLAGs in practice, the assertion portion of a unit test is not available as a part of the input to DLAGs (actually being the output of DLAGs); thus, the assumption made by the ATLAS dataset does not hold in practical scenarios of applying DLAGs. To address the first issue, we conduct a study of seven test-to-code traceability techniques in the scenario of assertion generation. We find that the LCBA technique is not the best among the seven techniques and can accurately identify focal methods with only 43.38% precision and 38.42% recall; thus, the LCBA technique cannot accurately reflect developer-intended focal methods, raising a concern on using the ATLAS dataset for evaluation. To address the second issue along with the concern raised by the preceding finding, we apply all seven test-to-code traceability techniques, respectively, to identify focal methods automatically from only test prefixes and construct a new dataset named ATLAS+ by replacing the existing focal methods in the ATLAS dataset with the focal methods identified by the seven traceability techniques, respectively. On a test set from new ATLAS+, we evaluate four state-of-the-art DLAGs trained on a training set from the ATLAS dataset. We find that all of the four DLAGs achieve lower accuracy on a test set in ATLAS+ than the corresponding test set in the ATLAS dataset, indicating that DLAGs should be (re)evaluated with a test set in ATLAS+, which better reflects practical scenarios of providing focal methods than the ATLAS dataset. In addition, we evaluate state-of-the-art DLAGs trained on training sets in ATLAS+. We find that using training sets in ATLAS+ helps effectively improve the accuracy of the ATLAS approach and T5 approach over these approaches trained using the corresponding training set from the ATLAS dataset.
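The second issue raised in this abstract can be made concrete with a small Python sketch. It is only one reading of the heuristic described above, not the authors' tooling: a test is modeled as a list of method names in invocation order, JUnit API calls are recognized by a hypothetical name-prefix check, and the `Stack` test is invented for illustration.

```python
# Assumption: JUnit API calls are identified by these name prefixes.
JUNIT_PREFIXES = ("assert", "fail")


def is_junit_call(name):
    """Return True if the method name looks like a JUnit API call."""
    return name.startswith(JUNIT_PREFIXES)


def last_non_junit_call(calls):
    """Return the last non-JUnit method in a sequence of invoked methods, or None."""
    for name in reversed(calls):
        if not is_junit_call(name):
            return name
    return None


# Hypothetical test:
#   Stack s = new Stack(); s.push(1); assertEquals(1, s.size());
prefix_calls = ["Stack.<init>", "push"]           # test prefix only
assertion_calls = ["size", "assertEquals"]        # calls inside the assertion
complete_calls = prefix_calls + assertion_calls   # complete unit test code

focal_from_complete = last_non_junit_call(complete_calls)
focal_from_prefix = last_non_junit_call(prefix_calls)
print(focal_from_complete, focal_from_prefix)
```

Here the heuristic selects `size` when the assertion is visible but `push` when only the test prefix is available, which is the mismatch between the dataset assumption and the practical setting that the abstract describes.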

Publisher's Version
Demystifying Invariant Effectiveness for Securing Smart Contracts
Zhiyang Chen ORCID logo, Ye Liu ORCID logo, Sidi Mohamed Beillahi ORCID logo, Yi Li ORCID logo, and Fan Long ORCID logo
(University of Toronto, Canada; Nanyang Technological University, Singapore)
Smart contract transactions associated with security attacks often exhibit distinct behavioral patterns compared with historical benign transactions before the attacking events. While many runtime monitoring and guarding mechanisms have been proposed to validate invariants and stop anomalous transactions on the fly, the empirical effectiveness of the invariants used remains largely unexplored. In this paper, we studied 23 prevalent invariants of 8 categories, which are either deployed in high-profile protocols or endorsed by leading auditing firms and security experts. Using these well-established invariants as templates, we developed a tool Trace2Inv which dynamically generates new invariants customized for a given contract based on its historical transaction data. We evaluated Trace2Inv on 42 smart contracts that fell victim to 27 distinct exploits on the Ethereum blockchain. Our findings reveal that the most effective invariant guard alone can successfully block 18 of the 27 identified exploits with minimal gas overhead. Our analysis also shows that most of the invariants remain effective even when the experienced attackers attempt to bypass them. Additionally, we studied the possibility of combining multiple invariant guards, resulting in blocking up to 23 of the 27 benchmark exploits and achieving false positive rates as low as 0.28%. Trace2Inv significantly outperforms state-of-the-art works on smart contract invariant mining and transaction attack detection in accuracy. Trace2Inv also surprisingly found two previously unreported exploit transactions.
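
To make the idea of deriving an invariant from historical transactions concrete, here is a minimal sketch with a single hypothetical template, "a numeric field never exceeds the largest value seen in benign history". The field name `amount`, the helper names, and the data are illustrative assumptions; they are not taken from Trace2Inv or its 23 invariants.

```python
def infer_upper_bound(history, field):
    """Instantiate the template 'field <= bound' from past benign transactions."""
    return max(tx[field] for tx in history)

def make_guard(history, field):
    """Return a predicate that accepts a transaction only if the invariant holds."""
    bound = infer_upper_bound(history, field)
    def guard(tx):
        return tx[field] <= bound
    return guard

def false_positive_rate(guard, benign):
    """Fraction of benign transactions that the guard would wrongly block."""
    blocked = sum(1 for tx in benign if not guard(tx))
    return blocked / len(benign)

# Hypothetical history of benign transfers and a later, much larger transfer.
history = [{"amount": 10}, {"amount": 25}, {"amount": 40}]
guard = make_guard(history, "amount")
suspicious = {"amount": 10_000}
```

With this data, `guard(suspicious)` is `False`, while a later benign transfer of 41 would also be blocked, which is why false positive rates matter when evaluating such guards.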

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Cut to the Chase: An Error-Oriented Approach to Detect Error-Handling Bugs
Haoran Liu ORCID logo, Zhouyang Jia ORCID logo, Shanshan Li ORCID logo, Yan Lei ORCID logo, Yue Yu ORCID logo, Yu Jiang ORCID logo, Xiaoguang Mao ORCID logo, and Xiangke Liao ORCID logo
(National University of Defense Technology, China; Chongqing University, China; Tsinghua University, China)
Error-handling bugs are prevalent in software systems and can result in severe consequences. Existing works on error-handling bug detection can be categorized into template-based and learning-based approaches. The former requires much human effort and has difficulty accommodating software evolution. The latter usually focuses on API errors and assumes that error handling should come right after the handled error. Such an assumption, however, may affect both the learning and detecting phases. The existing learning-based approaches can be regarded as API-oriented: they start from an API and learn whether the API requires error handling. In this paper, we propose EH-Digger, an ERROR-oriented approach, which starts from error-handling code. Our approach can learn why the error occurs and when the error has to be handled. We conduct a comprehensive study on 2,322 error-handling code snippets from 22 widely used software systems across 8 software domains to reveal the limitations of existing approaches and guide the design of EH-Digger. We evaluated EH-Digger on the Linux Kernel and 11 open-source applications. It detected 53 new bugs confirmed by the developers and 71 historical bugs fixed in the latest versions. We also compared EH-Digger with three state-of-the-art approaches; 30.1% of the bugs detected by EH-Digger cannot be detected by the existing approaches.

Publisher's Version
Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice
Ranim Khojah ORCID logo, Mazen Mohamad ORCID logo, Philipp Leitner ORCID logo, and Francisco Gomes de Oliveira Neto ORCID logo
(Chalmers - University of Gothenburg, Sweden; RISE Research Institutes of Sweden, Sweden; Chalmers, Sweden)
Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently, there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how the (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.

Publisher's Version
Active Monitoring Mechanism for Control-Based Self-Adaptive Systems
Yi Qin ORCID logo, Yanxiang Tong ORCID logo, Yifei Xu ORCID logo, Chun Cao ORCID logo, and Xiaoxing Ma ORCID logo
(Nanjing University, China)
Control-based self-adaptive systems (control-SAS) are susceptible to deviations from their pre-identified nominal models. If this model deviation exceeds a threshold, the optimal performance and theoretical guarantees of the control-SAS can be compromised. Existing approaches detect these deviations by locating the mismatch between the control signal of the managing system and the response output of the managed system. However, vague observations may mask a potential mismatch where the explicit system behavior does not reflect the implicit variation of the nominal model. In this paper, we propose the Active Monitoring Mechanism (AMM for short) as a solution to this issue. The basic intuition of AMM is to stimulate the control-SAS with an active control signal when vague observations might mask model deviations. To determine the appropriate time for triggering the active signals, AMM proposes a stochastic framework to quantify the relationship between the implicit variation of a control-SAS and its explicit observation. Based on this framework, AMM’s monitor and remediator enhance model deviation detection by generating active control signals of well-designed timing and intensity. Results from empirical evaluations on three representative systems demonstrate AMM’s effectiveness (33.0% shorter detection delay, 18.3% lower FN rate, 16.7% lower FP rate) and usefulness (19.3% lower abnormal rates and 88.2% higher utility).

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
State Reconciliation Defects in Infrastructure as Code
Md Mahadi Hassan ORCID logo, John Salvador ORCID logo, Shubhra Kanti Karmaker Santu ORCID logo, and Akond Rahman ORCID logo
(Auburn University, USA)
In infrastructure as code (IaC), state reconciliation is the process of querying and comparing the infrastructure state prior to changing the infrastructure. As state reconciliation is pivotal to manage IaC-based computing infrastructure at scale, defects related to state reconciliation can create large-scale consequences. A categorization of state reconciliation defects, i.e., defects related to state reconciliation, can aid in understanding the nature of state reconciliation defects. We conduct an empirical study with 5,110 state reconciliation defects where we apply qualitative analysis to categorize state reconciliation defects. From the identified defect categories, we derive heuristics to design prompts for a large language model (LLM), which in turn are used for validation of state reconciliation. From our empirical study, we identify 8 categories of state reconciliation defects, amongst which 3 have not been reported for previously-studied software systems. The most frequently occurring defect category is inventory, i.e., the category of defects that occur when managing infrastructure inventory. Using an LLM with heuristics-based paragraph style prompts, we identify 9 previously unknown state reconciliation defects of which 7 have been accepted as valid defects, and 4 have already been fixed. Based on our findings, we conclude the paper by providing a set of recommendations for researchers and practitioners.

Publisher's Version Published Artifact Info Artifacts Available
Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?
Madeline Endres ORCID logo, Sarah Fakhoury ORCID logo, Saikat Chakraborty ORCID logo, and Shuvendu K. Lahiri ORCID logo
(University of Michigan, USA; Microsoft Research, USA)
Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program’s intent. However, there is typically no guarantee that a program’s implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The “emergent abilities” of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcondition, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcondition approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcondition postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcondition via LLMs has the potential to be helpful in practice; generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
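
A small, hand-written example may help show what a postcondition expressed as a program assertion looks like, and how correctness and discriminative power could be checked. Everything below is an assumption for illustration: the intent ("return the index of the largest element"), the two buggy variants, and the way the two properties are computed are not the paper's benchmark or its exact metric definitions.

```python
def postcondition(xs, ret):
    """Intent: 'return the index of the largest element of xs'."""
    return 0 <= ret < len(xs) and xs[ret] == max(xs)

def reference(xs):
    # Correct implementation of the stated intent.
    return max(range(len(xs)), key=lambda i: xs[i])

def buggy_min(xs):
    # Bug: returns the index of the smallest element.
    return min(range(len(xs)), key=lambda i: xs[i])

def buggy_last(xs):
    # Bug: always returns the last index.
    return len(xs) - 1

INPUTS = [[3, 1, 2], [1, 5, 4], [7], [2, 2, 9, 0]]

def holds_on(impl):
    """True if the postcondition holds for impl on every test input."""
    return all(postcondition(xs, impl(xs)) for xs in INPUTS)

def discriminative_power(buggy_impls):
    """Fraction of buggy implementations rejected by the postcondition."""
    rejected = sum(1 for impl in buggy_impls if not holds_on(impl))
    return rejected / len(buggy_impls)

is_correct = holds_on(reference)
power = discriminative_power([buggy_min, buggy_last])
```

In this toy setting the postcondition accepts the reference implementation and rejects both buggy variants; a weaker postcondition such as `0 <= ret < len(xs)` would still be correct but would reject neither.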

Publisher's Version
Misconfiguration Software Testing for Failure Emergence in Autonomous Driving Systems
Yuntianyi Chen ORCID logo, Yuqi Huai ORCID logo, Shilong Li ORCID logo, Changnam Hong ORCID logo, and Joshua Garcia ORCID logo
(University of California at Irvine, Irvine, USA)
The optimization of a system’s configuration options is crucial for determining its performance and functionality, particularly in the case of autonomous driving software (ADS) systems because they possess a multitude of such options. Research efforts in the domain of ADS have prioritized the development of automated testing methods to enhance the safety and security of self-driving cars. Presently, search-based approaches are utilized to test ADS systems in a virtual environment, thereby simulating real-world scenarios. However, such approaches rely on optimizing the waypoints of ego cars and obstacles to generate diverse scenarios that trigger violations, and no prior techniques focus on optimizing the ADS from the perspective of configuration. To address this challenge, we present a framework called ConfVE, which is the first automated configuration testing framework for ADSes. ConfVE’s design focuses on the emergence of violations through rerunning scenarios generated by different ADS testing approaches under different configurations, leveraging 9 test oracles to enable previous ADS testing approaches to find more types of violations without modifying their designs or implementations and employing a novel technique to identify bug-revealing violations and eliminate duplicate violations. Our evaluation results demonstrate that ConfVE can discover 1,818 unique violations and reduce 74.19% of duplicate violations.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation
Md Nakhla Rafi ORCID logo, Dong Jae Kim ORCID logo, An Ran Chen ORCID logo, Tse-Hsun (Peter) Chen ORCID logo, and Shaowei Wang ORCID logo
(Concordia University, Canada; DePaul University, USA; University of Alberta, Canada; University of Manitoba, Canada)
Automatic software fault localization plays an important role in software quality assurance by pinpointing faulty locations for easier debugging. Coverage-based fault localization is a commonly used technique, which applies statistics on coverage spectra to rank faulty code based on suspiciousness scores. However, statistics-based approaches based on formulae are often rigid, which calls for learning-based techniques. Among these, Grace, a graph neural network (GNN) based technique, has achieved state-of-the-art results due to its capacity to preserve coverage spectra, i.e., test-to-source coverage relationships, as a precise abstract-syntax-enhanced graph representation, mitigating the limitation of other learning-based techniques, which compress the feature representation. However, such a representation is not scalable: as software grows in complexity, the coverage spectra and AST graph grow with it, making it challenging to extend, let alone train, the graph neural network in practice. In this work, we propose a new graph representation, DepGraph, that reduces the complexity of the graph representation by 70% in nodes and edges by integrating the interprocedural call graph in the graph representation of the code. Moreover, we integrate additional features (code change information) into the graph as attributes so the model can leverage rich historical project data. We evaluate DepGraph using Defects4j 2.0.0, and it outperforms Grace by locating 20% more faults in Top-1 and improving the Mean First Rank (MFR) and the Mean Average Rank (MAR) by over 50% while decreasing GPU memory usage by 44% and training/inference time by 85%. Additionally, in cross-project settings, DepGraph surpasses the state-of-the-art baseline with a 42% higher Top-1 accuracy, and 68% and 65% improvement in MFR and MAR, respectively. Our study demonstrates DepGraph’s robustness, achieving state-of-the-art accuracy and scalability for future extension and adoption.
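
For readers unfamiliar with the statistics-based baseline that this abstract contrasts with learning-based techniques, the sketch below computes suspiciousness scores from a coverage spectrum using the Ochiai formula, a widely used choice that the abstract itself does not name. It illustrates coverage-based ranking only and is not DepGraph or Grace; the test names, code elements, and outcomes are made up.

```python
import math

def ochiai_ranking(coverage, passed):
    """Rank code elements by Ochiai suspiciousness.

    coverage: test name -> set of covered code elements
    passed:   test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for ok in passed.values() if not ok)
    elements = set().union(*coverage.values())
    scores = {}
    for elem in elements:
        # ef / ep: failing / passing tests that execute this element.
        ef = sum(1 for t, cov in coverage.items() if elem in cov and not passed[t])
        ep = sum(1 for t, cov in coverage.items() if elem in cov and passed[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[elem] = ef / denom if denom else 0.0
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical spectrum: only the failing test t2 executes element "c".
coverage = {"t1": {"a", "b"}, "t2": {"a", "c"}, "t3": {"a", "b"}}
passed = {"t1": True, "t2": False, "t3": True}
ranking = ochiai_ranking(coverage, passed)
```

Element `c` is ranked first because it is covered by the failing test and by no passing test; a Top-1 metric would count this fault as located if `c` is indeed the faulty element.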

Publisher's Version
Predicting Failures of Autoscaling Distributed Applications
Giovanni Denaro ORCID logo, Noura El Moussa ORCID logo, Rahim Heydarov ORCID logo, Francesco Lomio ORCID logo, Mauro Pezzè ORCID logo, and Ketai Qiu ORCID logo
(University of Milano-Bicocca, Italy; USI Lugano, Switzerland; Constructor Institute Schaffhausen, Switzerland)
Predicting failures in production environments allows service providers to activate countermeasures that prevent harming the users of the applications. The most successful approaches predict failures from error states, which they identify from anomalies in time series of fixed sets of KPI values collected at runtime. They cannot handle time series of KPI sets whose size varies over time. Thus, these approaches work with applications that run on statically configured sets of components and computational nodes, and do not scale up to the many popular cloud applications that exploit autoscaling. This paper proposes Preface, a novel approach to predict failures in cloud applications that exploit autoscaling. As its original contribution, Preface augments the neural-network-based failure predictors successfully exploited to predict failures in statically configured applications with a Rectifier layer that handles KPI sets of highly variable size, such as the ones collected in cloud autoscaling applications, and reduces those KPIs to a set of rectified-KPIs of fixed size that can be fed to the neural-network predictor. The Preface Rectifier computes the rectified-KPIs as descriptive statistics of the original KPIs, for each logical component of the target application. The descriptive statistics shrink the highly variable sets of KPIs collected at different timestamps to a fixed set of values compatible with the input nodes of the neural-network failure predictor. The neural network can then reveal anomalies that correspond to error states, before they propagate to failures that harm the users of the applications. The experiments on both a commercial application and a widely used academic exemplar confirm that Preface can indeed predict many harmful failures early enough to activate proper countermeasures.
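
The rectification step described above can be sketched as a function that maps a variable number of per-replica KPI values to a fixed-length vector, one block of descriptive statistics per logical component. The choice of statistics (min, max, mean, population standard deviation), the component names, and the sample values are assumptions for illustration; the abstract does not say which statistics Preface uses.

```python
from statistics import mean, pstdev

STATS_PER_COMPONENT = 4  # min, max, mean, population standard deviation

def rectify(kpis_by_component, components):
    """Reduce variable-size KPI sets to a fixed-size vector.

    kpis_by_component: component name -> list of KPI values, one per running replica
    components:        fixed, ordered list of logical components of the application
    """
    vector = []
    for comp in components:
        values = kpis_by_component.get(comp, [])
        if values:
            vector += [min(values), max(values), mean(values), pstdev(values)]
        else:
            # No replica running at this timestamp: pad so the size stays fixed.
            vector += [0.0] * STATS_PER_COMPONENT
    return vector

components = ["frontend", "cart"]
# Hypothetical CPU readings at two timestamps, before and after scaling out.
before = {"frontend": [0.30, 0.50], "cart": [0.20]}
after = {"frontend": [0.30, 0.50, 0.90, 0.70, 0.60], "cart": [0.20, 0.40]}
vec_before = rectify(before, components)
vec_after = rectify(after, components)
```

Both vectors have eight entries even though the number of replicas changed, so they can be fed to a predictor with a fixed number of input nodes.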

Publisher's Version Published Artifact Artifacts Available
Predicting Code Comprehension: A Novel Approach to Align Human Gaze with Code using Deep Neural Networks
Tarek Alakmeh ORCID logo, David Reich ORCID logo, Lena Jäger ORCID logo, and Thomas Fritz ORCID logo
(University of Zurich, Switzerland; University of Potsdam, Germany)
The better the code quality and the less complex the code, the easier it is for software developers to comprehend and evolve it. Yet, how do we best detect quality concerns in the code? Existing measures to assess code quality, such as McCabe’s cyclomatic complexity, are decades old and neglect the human aspect. Research has shown that considering how a developer reads and experiences the code can be an indicator of its quality. In our research, we built on these insights and designed, trained, and evaluated the first deep neural network that aligns a developer’s eye gaze with the code tokens the developer looks at to predict code comprehension and perceived difficulty. To train and analyze our approach, we performed an experiment in which 27 participants worked on a range of 16 short code comprehension tasks while we collected fine-grained gaze data using an eye tracker. The results of our evaluation show that our deep neural sequence model that integrates both the human gaze and the stimulus code, can predict (a) code comprehension and (b) the perceived code difficulty significantly better than current state-of-the-art reference methods. We also show that aligning human gaze with code leads to better performance than models that rely solely on either code or human gaze. We discuss potential applications and propose future work to build better human-inclusive code evaluation systems.

Publisher's Version
A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators
Jinyin Chen ORCID logo, Chengyu Jia ORCID logo, Yunjie Yan ORCID logo, Jie Ge ORCID logo, Haibin Zheng ORCID logo, and Yao Cheng ORCID logo
(Zhejiang University of Technology, China; TÜV SÜD Asia Pacific, Singapore)
Deep learning (DL) is a critical tool for real-world applications, and comprehensive testing of DL models is vital to ensure their quality before deployment. However, recent studies have shown that even subtle deviations in DL operators can result in catastrophic consequences, underscoring the importance of rigorous testing of these components. Unlike testing other DL system components, operator analysis poses unique challenges due to complex inputs and uncertain outputs. The existing DL operator testing approach has limitations in terms of testing efficiency and error localization. In this paper, we propose Meta, a novel operator testing framework based on metamorphic testing that automatically tests operators and assists bug localization based on metamorphic relations (MRs). Meta distinguishes itself in three key ways: (1) it considers both parameters and input tensors to detect operator errors, enabling it to identify both implementation and precision errors; (2) it uses MRs to guide the generation of more effective inputs (i.e., tensors and parameters) in less time; (3) it assists precision error localization by tracing the error to the input level of the operator based on MR violations. We designed 18 MRs for testing 10 widely used DL operators. To assess the effectiveness of Meta, we conducted experiments on 13 released versions of 5 popular DL libraries. Our results revealed that Meta successfully detected 41 errors, including 14 new ones that were reported to the respective platforms, 8 of which have been confirmed or fixed. Additionally, Meta demonstrated high efficiency, detecting ∼2 times more errors than the baseline. Meta is open-sourced and available at https://github.com/TDY-raedae/Medi-Test.
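
A metamorphic relation checks an operator against itself on related inputs, so no reference output is needed. The sketch below uses a textbook relation, "softmax is unchanged when a constant is added to every input", which is an assumption chosen for illustration and not necessarily one of the 18 MRs designed for Meta. The faulty variant with a stray epsilon is likewise hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax (subtracts the maximum before exponentiating)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_stray_epsilon(xs):
    """Hypothetical faulty operator: an epsilon leaks into the denominator."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps) + 1e-3
    return [e / total for e in exps]

def violates_shift_mr(op, xs, shift, tol=1e-9):
    """MR: op(xs) should equal op(xs + shift) element-wise, up to tol."""
    base = op(xs)
    shifted = op([x + shift for x in xs])
    return any(abs(a - b) > tol for a, b in zip(base, shifted))

inputs = [1.0, 2.0, 3.0]
ok_operator_flagged = violates_shift_mr(softmax, inputs, 5.0)
faulty_operator_flagged = violates_shift_mr(softmax_with_stray_epsilon, inputs, 5.0)
```

The correct operator satisfies the relation, while the faulty one deviates by roughly 1e-5 and is flagged. Deviations of this size are easy to miss when only end-to-end model accuracy is checked.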

Publisher's Version Info
A Transferability Study of Interpolation-Based Hardware Model Checking for Software Verification
Dirk Beyer ORCID logo, Po-Chun Chien ORCID logo, Marek Jankola ORCID logo, and Nian-Ze Lee ORCID logo
(LMU Munich, Germany)
Assuring the correctness of computing systems is fundamental to our society and economy, and formal verification is a class of techniques approaching this issue with mathematical rigor. Researchers have invented numerous algorithms to automatically prove whether a computational model, e.g., a software program or a hardware digital circuit, satisfies its specification. In the past two decades, Craig interpolation has been widely used in both hardware and software verification. Despite the similarities in the theoretical foundation between hardware and software verification, previous works usually evaluate interpolation-based algorithms on only one type of verification tasks (e.g., either circuits or programs), so the conclusions of these studies do not necessarily transfer to different types of verification tasks. To investigate the transferability of research conclusions from hardware to software, we adopt two performant approaches of interpolation-based hardware model checking, (1) Interpolation-Sequence-Based Model Checking (Vizel and Grumberg, 2009) and (2) Intertwined Forward-Backward Reachability Analysis Using Interpolants (Vizel, Grumberg, and Shoham, 2013), for software verification. We implement the algorithms proposed by the two publications in the software verifier CPAchecker because it has a software-verification adoption of the first interpolation-based algorithm for hardware model checking from 2003, which the two publications use as a comparison baseline. To assess whether the claims in the two publications transfer to software verification, we conduct an extensive experiment on the largest publicly available suite of safety-verification tasks for the programming language C. 
Our experimental results show that the important characteristics of the two approaches for hardware model checking are transferable to software verification, and that the cross-disciplinary algorithm adoption is beneficial, as the approaches adopted from hardware model checking were able to tackle tasks unsolvable by existing methods. This work consolidates the knowledge in hardware and software verification and provides open-source implementations to improve the understanding of the compared interpolation-based algorithms.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Reusable
Sharing Software-Evolution Datasets: Practices, Challenges, and Recommendations
David Broneske ORCID logo, Sebastian Kittan ORCID logo, and Jacob Krüger ORCID logo
(DZHW Hannover, Germany; University of Magdeburg, Germany; Eindhoven University of Technology, Netherlands)
Sharing research artifacts (e.g., software, data, protocols) is an immensely important topic for improving transparency, replicability, and reusability in research, and has recently gained more and more traction in software engineering. For instance, recent studies have focused on artifact reviewing, the impact of open science, and specific legal or ethical issues of sharing artifacts. Most of such studies are concerned with artifacts created by the researchers themselves (e.g., scripts, algorithms, tools) and processes for quality assuring these artifacts (e.g., through artifact-evaluation committees). In contrast, the practices and challenges of sharing software-evolution datasets (i.e., republished version-control data with person-related information) have only been scratched in such works. To tackle this gap, we conducted a meta study of software-evolution datasets published at the International Conference on Mining Software Repositories from 2017 until 2021 and snowballed a set of papers that build upon these datasets. Investigating 200 papers, we elicited what types of software-evolution datasets have been shared following what practices and what challenges researchers experienced with sharing or using the datasets. We discussed our findings with an authority on research-data management and ethics reviews through a semi-structured interview to put the practices and challenges into context. Through our meta study, we provide an overview of the sharing practices for software-evolution datasets and the corresponding challenges. The expert interview enriched this analysis by discussing how to solve the challenges and by defining recommendations for sharing software-evolution datasets in the future. Our results extend and complement current research, and we are confident that they can help researchers share software-evolution datasets (as well as datasets involving the same types of data) in a reliable, ethical, and trustworthy way.

Publisher's Version
Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection
Yuxi Li ORCID logo, Yi Liu ORCID logo, Gelei Deng ORCID logo, Ying Zhang ORCID logo, Wenjia Song ORCID logo, Ling Shi ORCID logo, Kailong Wang ORCID logo, Yuekang Li ORCID logo, Yang Liu ORCID logo, and Haoyu Wang ORCID logo
(Huazhong University of Science and Technology, China; Nanyang Technological University, Singapore; Virginia Tech, USA; UNSW, Sydney, Australia)
With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of “glitch tokens”, which are anomalous tokens produced by established tokenizers and could potentially compromise the models’ quality of response. Specifically, we experiment on seven popular LLMs utilizing three distinct tokenizers and involving a total of 182,517 tokens. We present categorizations of the identified glitch tokens and symptoms exhibited by LLMs when interacting with glitch tokens. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique, for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection further provides valuable insights into mitigating tokenization-related errors in LLMs.

Publisher's Version
LogSD: Detecting Anomalies from System Logs through Self-Supervised Learning and Frequency-Based Masking
Yongzheng Xie ORCID logo, Hongyu Zhang ORCID logo, and Muhammad Ali Babar ORCID logo
(University of Adelaide, Australia; University of Newcastle, Australia)
Log analysis is one of the main techniques that engineers use for troubleshooting large-scale software systems. Over the years, many supervised, semi-supervised, and unsupervised log analysis methods have been proposed to detect system anomalies by analyzing system logs. Among these, semi-supervised methods have garnered increasing attention as they strike a balance between relaxed labeled data requirements and optimal detection performance, contrasting with their supervised and unsupervised counterparts. However, existing semi-supervised methods overlook the potential bias introduced by highly frequent log messages on the learned normal patterns, which leads to their less than satisfactory performance. In this study, we propose LogSD, a novel semi-supervised self-supervised learning approach. LogSD employs a dual-network architecture and incorporates a frequency-based masking scheme, a global-to-local reconstruction paradigm and three self-supervised learning tasks. These features enable LogSD to focus more on relatively infrequent log messages, thereby effectively learning less biased and more discriminative patterns from historical normal data. This emphasis ultimately leads to improved anomaly detection performance. Extensive experiments have been conducted on three commonly-used datasets and the results show that LogSD significantly outperforms eight state-of-the-art benchmark methods.

Publisher's Version
MirrorFair: Fixing Fairness Bugs in Machine Learning Software via Counterfactual Predictions
Ying Xiao ORCID logo, Jie M. Zhang ORCID logo, Yepang Liu ORCID logo, Mohammad Reza Mousavi ORCID logo, Sicen Liu ORCID logo, and Dingyuan Xue ORCID logo
(Southern University of Science and Technology, China; King’s College London, United Kingdom)
With the increasing utilization of Machine Learning (ML) software in critical domains such as employee hiring, college admission, and credit evaluation, ensuring fairness in the decision-making processes of underlying models has emerged as a paramount ethical concern. Nonetheless, existing methods for rectifying fairness issues can hardly strike a consistent trade-off between performance and fairness across diverse tasks and algorithms. Informed by the principles of counterfactual inference, this paper introduces MirrorFair, an innovative adaptive ensemble approach designed to mitigate fairness concerns. MirrorFair initially constructs a counterfactual dataset derived from the original data, training two distinct models—one on the original dataset and the other on the counterfactual dataset. Subsequently, MirrorFair adaptively combines these model predictions to generate fairer final decisions. We conduct an extensive evaluation of MirrorFair and compare it with 15 existing methods across a diverse range of decision-making scenarios. Our findings reveal that MirrorFair outperforms all the baselines in every measurement (i.e., fairness improvement, performance preservation, and trade-off metrics). Specifically, in 93% of cases, MirrorFair surpasses the fairness and performance trade-off baseline proposed by the benchmarking tool Fairea, whereas the state-of-the-art method achieves this in only 88% of cases. Furthermore, MirrorFair consistently demonstrates its superiority across various tasks and algorithms, ranking first in balancing model performance and fairness in 83% of scenarios. To foster replicability and future research, we have made our code, data, and results openly accessible to the research community.

Publisher's Version
Analyzing Quantum Programs with LintQ: A Static Analysis Framework for Qiskit
Matteo Paltenghi ORCID logo and Michael Pradel ORCID logo
(University of Stuttgart, Germany)
As quantum computing is rising in popularity, the amount of quantum programs and the number of developers writing them are increasing rapidly. Unfortunately, writing correct quantum programs is challenging due to various subtle rules developers need to be aware of. Empirical studies show that 40–82% of all bugs in quantum software are specific to the quantum domain. Yet, existing static bug detection frameworks are mostly unaware of quantum-specific concepts, such as circuits, gates, and qubits, and hence miss many bugs. This paper presents LintQ, a comprehensive static analysis framework for detecting bugs in quantum programs. Our approach is enabled by a set of abstractions designed to reason about common concepts in quantum computing without referring to the details of the underlying quantum computing platform. Built on top of these abstractions, LintQ offers an extensible set of ten analyses that detect likely bugs, such as operating on corrupted quantum states, redundant measurements, and incorrect compositions of sub-circuits. We apply the approach to a newly collected dataset of 7,568 real-world Qiskit-based quantum programs, showing that LintQ effectively identifies various programming problems, with a precision of 91.0% in its default configuration with the six best performing analyses. Comparing to a general-purpose linter and two existing quantum-aware techniques shows that almost all problems (92.1%) found by LintQ during our evaluation are missed by prior work. LintQ hence takes an important step toward reliable software in the growing field of quantum computing.

Publisher's Version Published Artifact Artifacts Available
Less Cybersickness, Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps
Shuqing Li ORCID logo, Cuiyun Gao ORCID logo, Jianping Zhang ORCID logo, Yujia Zhang ORCID logo, Yepang Liu ORCID logo, Jiazhen Gu ORCID logo, Yun Peng ORCID logo, and Michael R. Lyu ORCID logo
(Chinese University of Hong Kong, China; Harbin Institute of Technology, China; Southern University of Science and Technology, China)
The quality of Virtual Reality (VR) apps is vital, particularly the rendering quality of the VR Graphical User Interface (GUI). Unlike traditional two-dimensional (2D) apps, VR apps create a 3D digital scene for users by rendering two distinct 2D images for the user’s left and right eyes, respectively. Stereoscopic visual inconsistency (denoted as “SVI”) issues, however, undermine the rendering process of the user’s brain, leading to user discomfort and even adverse health effects. Such issues commonly exist in VR apps but remain underexplored. To comprehensively understand the SVI issues, we conduct an empirical analysis on 282 SVI bug reports collected from 15 VR platforms, summarizing 15 types of manifestations of the issues. The empirical analysis reveals that automatically detecting SVI issues is challenging, mainly because (1) training data is lacking; (2) the manifestations of SVI issues are diverse, complicated, and often application-specific; and (3) most accessible VR apps are closed-source commercial software, so we have no access to code, scene configurations, etc. for issue detection. Our findings imply that the existing pattern-based supervised classification approaches may be inapplicable or ineffective in detecting the SVI issues. To counter these challenges, we propose an unsupervised black-box testing framework named StereoID to identify the stereoscopic visual inconsistencies, based only on the rendered GUI states. StereoID generates a synthetic right-eye image based on the actual left-eye image and computes distances between the synthetic right-eye image and the actual right-eye image to detect SVI issues. We propose a depth-aware conditional stereo image translator to power the image generation process, which captures the expected perspective shifts between left-eye and right-eye images.
We build a large-scale unlabeled VR stereo screenshot dataset with more than 171K images from real-world VR apps, which can be utilized to train our depth-aware conditional stereo image translator and evaluate the whole testing framework StereoID. In extensive experiments, the depth-aware conditional stereo image translator demonstrates superior performance in generating stereo images, outperforming traditional architectures. It achieves the lowest average L1 and L2 losses and the highest SSIM score, signifying its effectiveness in pixel-level accuracy and structural consistency for VR apps. StereoID further demonstrates its power for detecting SVI issues in both user reports and wild VR apps. In summary, this novel framework enables effective detection of elusive SVI issues, benefiting the quality of VR apps.

Publisher's Version
Learning to Detect and Localize Multilingual Bugs
Haoran Yang ORCID logo, Yu Nong ORCID logo, Tao Zhang ORCID logo, Xiapu Luo ORCID logo, and Haipeng Cai ORCID logo
(Washington State University, USA; Macau University of Science and Technology, China; Hong Kong Polytechnic University, China)
A growing number of studies have shown bugs in multi-language software to be a critical loophole in modern software quality assurance, especially those induced by language interactions (i.e., multilingual bugs). Yet existing tool support for bug detection/localization remains largely limited to single-language software, despite the long-standing prevalence of multi-language systems in various real-world software domains. Extant static/dynamic analysis and deep learning (DL) based approaches all face major challenges in addressing multilingual bugs. In this paper, we present xLoc, a DL-based technique/tool for detecting and localizing multilingual bugs. Motivated by results of our bug-characteristics study on top locations of multilingual bugs, xLoc first learns the general knowledge relevant to differentiating various multilingual control-flow structures. This is achieved by pre-training a Transformer model with customized position encoding against novel objectives. Then, xLoc learns task-specific knowledge for the task of multilingual bug detection/localization, through another new position encoding scheme (based on cross-language API vicinity) that allows the model to attend particularly to control-flow constructs that bear most multilingual bugs during fine-tuning. We have implemented xLoc for Python-C software and curated a dataset of 3,770 buggy and 15,884 non-buggy Python-C samples, which enabled our extensive evaluation of xLoc against two state-of-the-art baselines: fine-tuned CodeT5 and zero-shot ChatGPT. Our results show that xLoc achieved 94.98% F1 and 87.24%@Top-1 accuracy, which are significantly (up to 162.88% and 511.75%) higher than the baselines. Ablation studies further confirmed significant contributions of each of the novel design elements in xLoc. With respective bug-location characteristics and labeled bug datasets for fine-tuning, our design may be applied to other language combinations beyond Python-C.

Publisher's Version
BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
Luan Pham ORCID logo, Huong Ha ORCID logo, and Hongyu Zhang ORCID logo
(RMIT University, Australia; Chongqing University, China)
Detecting failures and identifying their root causes promptly and accurately is crucial for ensuring the availability of microservice systems. A typical failure troubleshooting pipeline for microservices consists of two phases: anomaly detection and root cause analysis. While various existing works on root cause analysis require accurate anomaly detection, there is no guarantee of accurate estimation with anomaly detection techniques. Inaccurate anomaly detection results can significantly affect the root cause localization results. To address this challenge, we propose BARO, an end-to-end approach that integrates anomaly detection and root cause analysis for effectively troubleshooting failures in microservice systems. BARO leverages the Multivariate Bayesian Online Change Point Detection technique to model the dependency within multivariate time-series metrics data, enabling it to detect anomalies more accurately. BARO also incorporates a novel nonparametric statistical hypothesis testing technique for robustly identifying root causes, which is less sensitive to the accuracy of anomaly detection compared to existing works. Our comprehensive experiments conducted on three popular benchmark microservice systems demonstrate that BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
An Empirical Study on Code Review Activity Prediction and Its Impact in Practice
Doriane Olewicki ORCID logo, Sarra Habchi ORCID logo, and Bram Adams ORCID logo
(Queen’s University, Canada; Ubisoft, Canada)
During code reviews, an essential step in software quality assurance, reviewers have the difficult task of understanding and evaluating code changes to validate their quality and prevent introducing faults to the codebase. This is a tedious process where the effort needed is highly dependent on the code submitted, as well as the author’s and the reviewer’s experience, leading to median wait times for review feedback of 15-64 hours. Through an initial user study carried out with 29 experts, we found that re-ordering the files changed by a patch within the review environment has the potential to improve review quality, as more comments are written (+23%), and participants’ file-level hot-spot precision and recall increase to 53% (+13%) and 28% (+8%), respectively, compared to the alphanumeric ordering. Hence, this paper aims to help code reviewers by predicting which files in a submitted patch need to be (1) commented, (2) revised, or (3) are hot-spots (commented or revised). To predict these tasks, we evaluate two different types of text embeddings (i.e., Bag-of-Words and Large Language Models encoding) and review process features (i.e., code size-based and history-based features). Our empirical study on three open-source and two industrial datasets shows that combining the code embedding and review process features leads to better results than the state-of-the-art approach. For all tasks, F1-scores (median of 40-62%) are significantly better than the state-of-the-art (from +1 to +9%).

Publisher's Version
Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?
Bonan Kou ORCID logo, Shengmai Chen ORCID logo, Zhijie Wang ORCID logo, Lei Ma ORCID logo, and Tianyi Zhang ORCID logo
(Purdue University, USA; University of Alberta, Canada; University of Tokyo, Japan)
Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.

Publisher's Version
Evolutionary Multi-objective Optimization for Contextual Adversarial Example Generation
Shasha Zhou ORCID logo, Mingyu Huang ORCID logo, Yanan Sun ORCID logo, and Ke Li ORCID logo
(University of Electronic Science and Technology of China, China; University of Exeter, United Kingdom; Sichuan University, China)
The emergence of the 'code naturalness' concept, which suggests that software code shares statistical properties with natural language, paves the way for deep neural networks (DNNs) in software engineering (SE). However, DNNs can be vulnerable to certain human-imperceptible variations in the input, known as adversarial examples (AEs), which could lead to adverse model performance. Numerous attack strategies have been proposed to generate AEs in the context of computer vision and natural language processing, but the same is less true for source code of programming languages in SE. One of the challenges stems from various constraints, including syntactic and semantic constraints and a minimal modification ratio. These constraints, however, are subjective and can conflict with the purpose of fooling DNNs. This paper develops a multi-objective adversarial attack method (dubbed MOAA), which tailors NSGA-II, a powerful evolutionary multi-objective (EMO) algorithm, and integrates it with CodeT5 to generate high-quality AEs based on contextual information of the original code snippet. Experiments on 5 source code tasks with 10 datasets of 6 different programming languages show that our approach can generate a diverse set of high-quality AEs with promising transferability. In addition, using our AEs, for the first time, we provide insights into the internal behavior of pre-trained models.

Publisher's Version Info
Mining Action Rules for Defect Reduction Planning
Khouloud Oueslati ORCID logo, Gabriel Laberge ORCID logo, Maxime Lamothe ORCID logo, and Foutse Khomh ORCID logo
(Polytechnique Montréal, Canada)
Defect reduction planning plays a vital role in enhancing software quality and minimizing software maintenance costs. By training a black box machine learning model and “explaining” its predictions, explainable AI for software engineering aims to identify the code characteristics that impact maintenance risks. However, post-hoc explanations do not always faithfully reflect what the original model computes. In this paper, we introduce CounterACT, a Counterfactual ACTion rule mining approach that can generate defect reduction plans without black-box models. By leveraging action rules, CounterACT provides a course of action that can be considered as a counterfactual explanation for the class (e.g., buggy or not buggy) assigned to a piece of code. We compare the effectiveness of CounterACT with the original action rule mining algorithm and six established defect reduction approaches on 9 software projects. Our evaluation is based on (a) overlap scores between proposed code changes and actual developer modifications; (b) improvement scores in future releases; and (c) the precision, recall, and F1-score of the plans. Our results show that, compared to competing approaches, CounterACT’s explainable plans achieve higher overlap scores at the release level (median 95%) and commit level (median 85.97%), and they offer a better trade-off between precision and recall (median F1-score 88.12%). Finally, we venture beyond planning and explore leveraging Large Language Models (LLMs) for generating code edits from our generated plans. Our results show that suggested LLM code edits supported by our plans are actionable and are more likely to pass relevant test cases than vanilla LLM code recommendations.

Publisher's Version
ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification
Fangwen Mu ORCID logo, Lin Shi ORCID logo, Song Wang ORCID logo, Zhuohao Yu ORCID logo, Binquan Zhang ORCID logo, ChenXue Wang ORCID logo, Shichao Liu ORCID logo, and Qing Wang ORCID logo
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Beihang University, China; York University, Canada; Harbin Institute of Technology, China; Huawei Central Software Institute, China)
Large Language Models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in automatically generating code from provided natural language requirements. However, in real-world practice, it is inevitable that the requirements written by users might be ambiguous or insufficient. Current LLMs directly generate programs from those unclear requirements without interactive clarification, and the resulting programs will likely deviate from the original user intents. To bridge that gap, we introduce a novel framework named ClarifyGPT, which aims to enhance code generation by empowering LLMs with the ability to identify ambiguous requirements and ask targeted clarifying questions. Specifically, ClarifyGPT first detects whether a given requirement is ambiguous by performing a code consistency check. If it is ambiguous, ClarifyGPT prompts an LLM to generate targeted clarifying questions. After receiving question responses, ClarifyGPT refines the ambiguous requirement and inputs it into the same LLM to generate a final code solution. To evaluate our ClarifyGPT, we invite ten participants to use ClarifyGPT for code generation on two benchmarks: MBPP-sanitized and MBPP-ET. The results show that ClarifyGPT elevates the performance (Pass@1) of GPT-4 from 70.96% to 80.80% on MBPP-sanitized. Furthermore, to conduct large-scale automated evaluations of ClarifyGPT across different LLMs and benchmarks without requiring user participation, we introduce a high-fidelity simulation method to simulate user responses. The results demonstrate that ClarifyGPT can significantly enhance code generation performance compared to the baselines. In particular, ClarifyGPT improves the average performance of GPT-4 and ChatGPT across five benchmarks from 62.43% to 69.60% and from 54.32% to 62.37%, respectively. A human evaluation also confirms the effectiveness of ClarifyGPT in detecting ambiguous requirements and generating high-quality clarifying questions.
We believe that ClarifyGPT can effectively facilitate the practical application of LLMs in real-world development environments.
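One way to picture a code consistency check is sketched below. The abstract does not describe the check in detail, so this is an assumption: several candidate programs for the same requirement are run on the same inputs, and the requirement is flagged as ambiguous when their outputs disagree. The function name `is_ambiguous` and the hand-written candidates (standing in for LLM samples) are hypothetical.

```python
from typing import Any, Callable, Sequence


def is_ambiguous(candidates: Sequence[Callable[..., Any]],
                 test_inputs: Sequence[tuple]) -> bool:
    """Return True if the candidate programs disagree on any test input."""
    for args in test_inputs:
        outputs = []
        for program in candidates:
            try:
                outputs.append(repr(program(*args)))
            except Exception as exc:  # a crash counts as a distinct behaviour
                outputs.append(f"error:{type(exc).__name__}")
        if len(set(outputs)) > 1:
            return True
    return False


# Requirement: "sort the list" (ascending vs. descending is left unstated).
# These hand-written functions stand in for sampled LLM solutions.
sort_asc = lambda xs: sorted(xs)
sort_desc = lambda xs: sorted(xs, reverse=True)

disagree = is_ambiguous([sort_asc, sort_desc], [([3, 1, 2],)])
agree = is_ambiguous([sort_asc, sort_asc], [([3, 1, 2],)])
```

Under this reading, disagreement among candidates is the signal that a clarifying question is worth asking before a final solution is generated.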

Publisher's Version
Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-Context Learning
Yubo Mai ORCID logo, Zhipeng Gao ORCID logo, Xing Hu ORCID logo, Lingfeng Bao ORCID logo, Yu Liu ORCID logo, and JianLing Sun ORCID logo
(Zhejiang University, China)
Inspired by the great potential of Large Language Models (LLMs) for solving complex coding tasks, in this paper, we propose a novel approach, named Code2API, to automatically perform APIzation for Stack Overflow code snippets. Code2API does not require additional model training or any manually crafted rules and can be easily deployed on personal computers without relying on other external tools. Specifically, Code2API guides the LLMs through well-designed prompts to generate well-formed APIs for given code snippets. To elicit knowledge and logical reasoning from LLMs, we used chain-of-thought (CoT) reasoning and few-shot in-context learning, which can help the LLMs fully understand the APIzation task and solve it step by step in a manner similar to a developer. Our evaluations show that Code2API achieves a remarkable accuracy in identifying method parameters (65%) and return statements (66%) equivalent to human-generated ones, surpassing the current state-of-the-art approach, APIzator, by 15.0% and 16.5% respectively. Moreover, compared with APIzator, our user study demonstrates that Code2API exhibits superior performance in generating meaningful method names, even surpassing the human-level performance, and developers are more willing to use APIs generated by our approach, highlighting the applicability of our tool in practice. Finally, we successfully extend our framework to the Python dataset, achieving performance comparable to that on Java, which verifies the generalizability of our tool.

Publisher's Version
A Weak Supervision-Based Approach to Improve Chatbots for Code Repositories
Farbod Farhour ORCID logo, Ahmad Abdellatif ORCID logo, Essam Mansour ORCID logo, and Emad Shihab ORCID logo
(Concordia University, Canada; University of Calgary, Canada)
Software chatbots are growing in popularity and have been increasingly used in software projects due to their benefits in saving time, cost, and effort. At the core of every chatbot is a Natural Language Understanding (NLU) component that enables chatbots to comprehend the users' queries. Prior work shows that chatbot practitioners face challenges in training the NLUs because the labeled training data is scarce. Consequently, practitioners resort to user queries to enhance chatbot performance. They annotate these queries and use them for NLU training. However, this annotation is done manually and is prohibitively expensive. Therefore, we propose AlphaBot to automate the query annotation process for SE chatbots. Specifically, we leverage weak supervision to label users' queries posted to a software repository-based chatbot. To evaluate the impact of using AlphaBot on the NLU's performance, we conducted a case study using a dataset that comprises 749 queries and 52 intents. The results show that using AlphaBot improves the NLU's performance in terms of F1-score, with improvements ranging from 0.96% to 35%. Furthermore, our results show that applying more labeling functions improves the NLU's classification of users' queries. Our work enables practitioners to focus on their chatbots' core functionalities rather than annotating users' queries.
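Weak supervision with labeling functions can be sketched as follows. This is a generic illustration, not AlphaBot's implementation: the intent names, the keyword rules, and the majority-vote aggregation are all assumptions chosen to keep the example small.

```python
from collections import Counter
from typing import Callable, List, Optional

# A labeling function maps a query to an intent, or None to abstain.
LabelingFunction = Callable[[str], Optional[str]]


def lf_count_commits(query: str) -> Optional[str]:
    return "CountCommits" if "how many commits" in query.lower() else None


def lf_commit_keyword(query: str) -> Optional[str]:
    q = query.lower()
    return "CountCommits" if "commits" in q and "number" in q else None


def lf_bug_author(query: str) -> Optional[str]:
    q = query.lower()
    return "BugAuthor" if "who" in q and "bug" in q else None


def weak_label(query: str, lfs: List[LabelingFunction]) -> Optional[str]:
    """Majority vote over non-abstaining labeling functions.

    Ties go to the label that was voted first; None means every function abstained.
    """
    votes = [label for lf in lfs if (label := lf(query)) is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


LFS = [lf_count_commits, lf_commit_keyword, lf_bug_author]
label_a = weak_label("How many commits were made last week?", LFS)
label_b = weak_label("Who introduced this bug?", LFS)
label_c = weak_label("Hello there", LFS)
```

Queries labeled this way could then be added to the NLU's training set without manual annotation; queries on which every function abstains stay unlabeled.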

Publisher's Version
Revealing Software Development Work Patterns with PR-Issue Graph Topologies
Cleidson R. B. de Souza ORCID logo, Emilie Ma ORCID logo, Jesse Wong ORCID logo, Dongwook Yoon ORCID logo, and Ivan Beschastnikh ORCID logo
(Federal University of Pará, Brazil; University of British Columbia, Canada)
How software developers work and collaborate, and how we can best support them is an important topic for software engineering research. One issue for developers is a limited understanding of work that has been done and is ongoing. Modern systems allow developers to create Issues and pull requests (PRs) to track and structure work. However, developers lack a coherent view that brings together related Issues and PRs. In this paper, we first report on a study of work practices of developers through Issues, PRs, and the links that connect them. Specifically, we mine graphs where the Issues and PRs are nodes, and references (links) between them are the edges. This graph-based approach provides a window into a set of collaborative software engineering practices that have not been previously described. Based on a qualitative analysis of 56 GitHub projects, we report on eight types of work practices alongside their respective PR/Issue topologies. Next, inspired by our findings, we developed a tool called WorkflowsExplorer to help developers visualize and study workflow types in their own projects. We evaluated WorkflowsExplorer with 6 developers and report on findings from our interviews. Overall, our work illustrates the value of embracing a topology-focused perspective to investigate collaborative work practices in software development.
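The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: links are treated as undirected, node names such as `issue#1` and `pr#10` are made up, and summarizing each connected component by its (issue count, PR count) is only a crude stand-in for the topologies studied in the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def components(links: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group Issues and PRs into connected components of the reference graph."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen: Set[str] = set()
    result: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result


def shape(comp: Set[str]) -> Tuple[int, int]:
    """Summarize a component as (number of Issues, number of PRs)."""
    issues = sum(1 for node in comp if node.startswith("issue#"))
    return issues, len(comp) - issues


links = [("issue#1", "pr#10"), ("issue#1", "pr#11"), ("issue#2", "pr#12")]
shapes = [shape(c) for c in components(links)]
```

Here one Issue addressed by two PRs and one Issue addressed by a single PR end up as two separate components with different shapes.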

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
DeciX: Explain Deep Learning Based Code Generation Applications
Simin Chen ORCID logo, Zexin Li ORCID logo, Wei Yang ORCID logo, and Cong Liu ORCID logo
(University of Texas at Dallas, USA; University of California at Riverside, Riverside, USA)
Deep learning-based code generation (DL-CG) applications have shown great potential for assisting developers in programming with human-competitive accuracy. However, the lack of transparency in such applications, due to the uninterpretable nature of deep learning models, makes the automatically generated programs untrustworthy. In this paper, we develop DeciX, the first explanation method dedicated to DL-CG applications. DeciX is motivated by observing two unique properties of DL-CG applications: output-to-output dependencies and irrelevant value and semantic space. These properties violate the fundamental assumptions made in existing explainable DL techniques and thus make the results of applying existing techniques to DL-CG applications rather pessimistic and even incorrect. DeciX addresses these two limitations by constructing a causal inference dependency graph, containing a novel method leveraging causal inference that can accurately quantify the contribution of each dependency edge in the graph to the end prediction result. Extensive experiments on popular, widely-used DL-CG applications and several baseline methods show that DeciX achieves significantly better performance than the state of the art in terms of several critical performance metrics, including correctness, succinctness, stability, and overhead. Furthermore, DeciX can be applied to practical scenarios since it does not require any knowledge of the DL-CG model under explanation. We have also conducted case studies that demonstrate the applicability of DeciX in practice.

Publisher's Version
FeatMaker: Automated Feature Engineering for Search Strategy of Symbolic Execution
Jaehan Yoon ORCID logo and Sooyoung Cha ORCID logo
(Sungkyunkwan University, South Korea)
We present FeatMaker, a novel technique that automatically generates state features to enhance the search strategy of symbolic execution. Search strategies, designed to address the well-known state-explosion problem, prioritize which program states to explore. These strategies typically depend on a “state feature” that describes a specific property of program states, using this feature to score and rank them. Recently, search strategies employing multiple state features have shown superior performance over traditional strategies that use a single, generic feature. However, the process of designing these features remains largely manual. Moreover, manually crafting state features is both time-consuming and prone to yielding unsatisfactory results. The goal of this paper is to fully automate the process of generating state features for search strategies from scratch. The key idea is to leverage path-conditions, which are basic but vital information maintained by symbolic execution, as state features. A challenge arises when employing all path-conditions as state features, as it results in an excessive number of state features. To address this, we present a specialized algorithm that iteratively generates and refines state features based on data accumulated during symbolic execution. Experimental results on 15 open-source C programs show that FeatMaker significantly outperforms existing search strategies that rely on manually-designed features, both in terms of branch coverage and bug detection. Notably, FeatMaker achieved an average of 35.3% higher branch coverage than state-of-the-art strategies and discovered 15 unique bugs. Of these, six were detected exclusively by FeatMaker.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
DAInfer: Inferring API Aliasing Specifications from Library Documentation via Neurosymbolic Optimization
Chengpeng Wang ORCID logo, Jipeng Zhang ORCID logo, Rongxin Wu ORCID logo, and Charles Zhang ORCID logo
(Hong Kong University of Science and Technology, China; Xiamen University, China)
Modern software systems heavily rely on various libraries, necessitating understanding API semantics in static analysis. However, summarizing API semantics remains challenging due to complex implementations or the unavailability of library code. This paper presents DAInfer, a novel approach for inferring API aliasing specifications from library documentation. Specifically, we employ Natural Language Processing (NLP) models to interpret informal semantic information provided by the documentation, which enables us to reduce the specification inference to an optimization problem. Furthermore, we propose a new technique called neurosymbolic optimization to efficiently solve the optimization problem, yielding the desired API aliasing specifications. We have implemented DAInfer as a tool and evaluated it on Java classes from several popular libraries. The results indicate that DAInfer infers the API aliasing specifications with a precision of 79.78% and a recall of 82.29%, taking 5.35 seconds per class on average. These obtained aliasing specifications further facilitate alias analysis, revealing 80.05% more alias facts for API return values in 15 Java projects. Additionally, the tool supports taint analysis, identifying 85 more taint flows in 23 Android apps. These results demonstrate the practical value of DAInfer in library-aware static analysis.

Publisher's Version
Partial Solution Based Constraint Solving Cache in Symbolic Execution
Ziqi Shuai ORCID logo, Zhenbang Chen ORCID logo, Kelin Ma ORCID logo, Kunlin Liu ORCID logo, Yufeng Zhang ORCID logo, Jun Sun ORCID logo, and Ji Wang ORCID logo
(National University of Defense Technology, China; Hunan University, China; Singapore Management University, Singapore)
Constraint solving is one of the main challenges for symbolic execution. Caching is an effective mechanism to reduce the number of solver invocations in symbolic execution and is adopted by many mainstream symbolic execution engines. However, caching does not perform well on all programs, and improving its effectiveness is challenging in general. In this work, we propose a partial solution-based caching method for improving caching’s effectiveness. Our key idea is to utilize the partial solutions inside the constraint solving to generate more cache entries. A partial solution may satisfy other constraints of symbolic execution. Hence, our partial solution-based caching method naturally improves the rate of cache hits. We have implemented our method on two mainstream symbolic executors (KLEE and Symbolic PathFinder) and two SMT solvers (STP and Z3). The results of extensive experiments on real-world benchmarks demonstrate that our method effectively increases the number of explored paths in symbolic execution. Our caching method achieves 1.07x to 2.3x speedups for exploring the same number of paths on different benchmarks.
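The caching idea can be illustrated with a toy sketch. It rests on assumptions that the abstract does not confirm: a "partial solution" is read here as any candidate assignment the solver tries along the way, constraints are plain Python predicates, and the "solver" is a brute-force search over one integer variable `x`. Real engines such as KLEE work on SMT formulas, not on code like this.

```python
from typing import Callable, Dict, Iterable, List, Optional

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]


class SolutionCache:
    """Stores assignments and reuses any that satisfies a new query."""

    def __init__(self) -> None:
        self.entries: List[Assignment] = []
        self.hits = 0

    def add(self, assignment: Assignment) -> None:
        if assignment not in self.entries:
            self.entries.append(dict(assignment))

    def lookup(self, constraints: List[Constraint]) -> Optional[Assignment]:
        for candidate in self.entries:
            if all(c(candidate) for c in constraints):
                self.hits += 1
                return candidate
        return None


def solve(constraints: List[Constraint], cache: SolutionCache,
          domain: Iterable[int] = range(-5, 6)) -> Optional[Assignment]:
    """Toy brute-force solver over x that caches every assignment it tries."""
    cached = cache.lookup(constraints)
    if cached is not None:
        return cached  # solver invocation avoided
    for x in domain:
        candidate = {"x": x}
        cache.add(candidate)  # hypothetical stand-in for a partial solution
        if all(c(candidate) for c in constraints):
            return candidate
    return None


cache = SolutionCache()
first = solve([lambda a: a["x"] > 2], cache)   # searched; intermediate tries are cached
second = solve([lambda a: a["x"] < 0], cache)  # answered from an intermediate try
```

The second query is satisfied by an assignment that was only an unsuccessful intermediate step of the first query, which is the sense in which extra cache entries can raise the hit rate.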

Publisher's Version
Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded Credentials
Yizhan Huang ORCID logo, Yichen Li ORCID logo, Weibin Wu ORCID logo, Jianping Zhang ORCID logo, and Michael R. Lyu ORCID logo
(Chinese University of Hong Kong, China; Sun Yat-sen University, China)
Neural Code Completion Tools (NCCTs), which are built upon the language modeling technique and can accurately suggest contextually relevant code snippets, have reshaped the field of software engineering. However, language models may emit the training data verbatim during inference with appropriate prompts. This memorization property raises privacy concerns about NCCTs leaking hard-coded credentials, which can lead to unauthorized access to applications, systems, or networks. Therefore, to answer whether NCCTs emit hard-coded credentials, we propose an evaluation tool called Hard-coded Credential Revealer (HCR). HCR constructs test prompts based on GitHub code files with credentials to reveal the memorization phenomenon of NCCTs. Then, HCR designs four filters to filter out ill-formatted credentials. Finally, HCR directly checks the validity of a set of non-sensitive credentials. We apply HCR to evaluate three representative types of NCCTs: commercial NCCTs, open-source models, and chatbots with code completion capability. Our experimental results show that NCCTs can not only return the precise piece of their training data but also inadvertently leak additional secret strings. Notably, two valid credentials were identified during our experiments. Therefore, HCR raises a severe privacy concern about the potential leakage of hard-coded credentials in the training data of commercial NCCTs. All artifacts and data are released for future research purposes at https://github.com/HCR-Repo/HCR.

Publisher's Version
Bounding Random Test Set Size with Computational Learning Theory
Neil Walkinshaw ORCID logo, Michael Foster ORCID logo, José Miguel Rojas ORCID logo, and Robert M. Hierons ORCID logo
(University of Sheffield, United Kingdom)
Random testing approaches work by generating inputs at random, or by selecting inputs randomly from some pre-defined operational profile. One long-standing question that arises in this and other testing contexts is as follows: When can we stop testing? At what point can we be certain that executing further tests in this manner will not explore previously untested (and potentially buggy) software behaviors? This is analogous to the question in Machine Learning of how many training examples are required in order to infer an accurate model. In this paper we show how probabilistic approaches to answer this question in Machine Learning (arising from Computational Learning Theory) can be applied in our testing context. This enables us to produce an upper bound on the number of tests that are required to achieve a given level of adequacy. We are the first to enable this from only knowing the number of coverage targets (e.g. lines of code) in the source code, without needing to observe a sample of test executions. We validate this bound on a large set of Java units, and an autonomous driving system.
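For readers unfamiliar with the Machine Learning side of the analogy, the sketch below computes the textbook sample-complexity bound for a consistent learner over a finite hypothesis space: m ≥ (ln|H| + ln(1/δ)) / ε. This is an assumption chosen for illustration; the abstract does not state which bound the paper derives or how coverage targets map onto a hypothesis space, so the function and its parameters are hypothetical.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Classic PAC bound for a finite hypothesis space.

    Returns the smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon,
    i.e. enough random examples that, with probability at least 1 - delta,
    any hypothesis consistent with them has error at most epsilon.
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)


# Example: 2**10 hypotheses, 10% error tolerance, 95% confidence.
bound = pac_sample_bound(2 ** 10, epsilon=0.1, delta=0.05)
```

The bound grows only logarithmically in the size of the hypothesis space, which is why a count of targets alone can be enough to compute it before any test has been run.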

Publisher's Version
MTAS: A Reference-Free Approach for Evaluating Abstractive Summarization Systems
Xiaoyan Zhu ORCID logo, Mingyue Jiang ORCID logo, Xiao-Yi Zhang ORCID logo, Liming Nie ORCID logo, and Zuohua Ding ORCID logo
(Zhejiang Sci-Tech University, China; University of Science and Technology Beijing, China; Shenzhen Technology University, China)
Abstractive summarization (AS) systems, which aim to generate a text for summarizing crucial information of the original document, have been widely adopted in recent years. Unfortunately, factually unreliable summaries may still occur, leading to unexpected misunderstanding and distortion of information. This calls for methods that can properly evaluate the quality of AS systems. Yet, the existing reference-based evaluation approach for AS relies on reference summaries as well as automatic evaluation metrics (e.g., ROUGE). Therefore, the reference-based evaluation approach is highly restricted by the availability and quality of reference summaries as well as the capability of existing automatic evaluation metrics. In this study, we propose MTAS, a novel metamorphic testing based approach for evaluating AS in a reference-free way. Our two major contributions are (i) five metamorphic relations towards AS, which involve semantic-preserving and focus-preserving transformations at the document level, and (ii) a summary consistency evaluation metric SCY, which measures the alignment between a pair of summaries by incorporating both the semantic and factual consistency. Our experimental results show that the proposed metric SCY has a significantly higher correlation with human judgment as compared to a set of existing metrics. It is also demonstrated that MTAS can break the dependence on reference summaries, and it successfully reports a large number of summary inconsistencies, revealing various summarization issues on state-of-the-art AS systems.

Publisher's Version
Bloat beneath Python’s Scales: A Fine-Grained Inter-Project Dependency Analysis
Georgios-Petros Drosos ORCID logo, Thodoris Sotiropoulos ORCID logo, Diomidis Spinellis ORCID logo, and Dimitris Mitropoulos ORCID logo
(Athens University of Economics and Business, Greece; ETH Zurich, Switzerland; Delft University of Technology, Netherlands; University of Athens, Greece)
Modern programming languages promote software reuse via package managers that facilitate the integration of inter-dependent software libraries. Software reuse comes with the challenge of dependency bloat, which refers to unneeded and excessive code incorporated into a project through reused libraries. Such bloat poses security risks, incurs maintenance costs, increases storage requirements, and slows down load times. In this work, we conduct a large-scale, fine-grained analysis to understand bloated dependency code in the PyPI ecosystem. Our analysis is the first to focus on different granularity levels, including bloated dependencies, bloated files, and bloated methods. This allows us to identify the specific parts of a library that contribute to the bloat. To do so, we analyze the source code of 1,302 popular Python projects and their 3,232 transitive dependencies. For each project, we employ a state-of-the-art static analyzer and incrementally construct the fine-grained project dependency graph (FPDG), a representation that captures all inter-project dependencies at the method level. Our reachability analysis on the FPDG enables the assessment of bloated dependency code in terms of several aspects, including its prevalence in the PyPI ecosystem, its relation to software vulnerabilities, its root causes, and developer perception. Our key finding suggests that PyPI exhibits significant resource underutilization: more than 50% of dependencies are bloated. This rate gets worse when considering bloated dependency code at a more subtle level, such as bloated files and bloated methods. Our fine-grained analysis also indicates that there are numerous vulnerabilities that reside in bloated areas of utilized packages (15% of the defects existing in PyPI). Other major observations suggest that bloated code primarily stems from omissions during code refactoring processes and that developers are willing to debloat their code: Out of the 36 submitted pull requests, developers accepted and merged 30, removing a total of 35 bloated dependencies. We believe that our findings can help researchers and practitioners come up with new debloating techniques and development practices to detect and avoid bloated code, ensuring that dependency resources are utilized efficiently.
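The idea of a reachability analysis over a method-level dependency graph can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' analyzer: the graph is a plain dictionary from caller to callees, and all method names (app.main, libA.parse, and so on) are hypothetical. Any dependency method that cannot be reached from the project's entry points is counted as bloated.

```python
from collections import deque


def reachable_methods(call_graph, entry_points):
    """Breadth-first search over caller -> callees edges."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def bloated_methods(call_graph, entry_points, dependency_methods):
    """Dependency methods that are never reached from the project's own code."""
    return set(dependency_methods) - reachable_methods(call_graph, entry_points)


# Toy graph: the project calls libA.parse only; libA.dump and libB are never used.
call_graph = {
    "app.main": ["libA.parse"],
    "libA.parse": ["libA._tokenize"],
    "libA.dump": ["libB.encode"],
}
dependency_methods = {"libA.parse", "libA._tokenize", "libA.dump", "libB.encode"}
bloated = bloated_methods(call_graph, ["app.main"], dependency_methods)
```

In this toy graph libB is bloated as a whole dependency, while libA is only partly bloated (libA.dump), which mirrors the distinction between bloated dependencies and bloated methods.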

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages
Kai Gao ORCID logo, Weiwei Xu ORCID logo, Wenhao Yang ORCID logo, and Minghui Zhou ORCID logo
(Peking University, China)
A package's source code repository records the package's development history, which is critical for the use and risk monitoring of the package. However, a package release often misses its source code repository due to the separation of the package's development platform from its distribution platform. To establish the link, existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may lack repository information or contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address the limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI releases and analyze phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct and 2,064 incorrect package-repository links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever, that progressively retrieve correct source code repository information for PyPI releases. In particular, the Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
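To make the hashing step concrete, the sketch below computes a SHA-1 hash for every Python file in an unpacked source distribution. It assumes that the hashes are Git blob identifiers (SHA-1 over a "blob <size>" header, a NUL byte, and the file content), which is how Git names file contents. The abstract does not say which variant PyRadar uses, and the function names here are hypothetical. The World of Code lookup itself is not shown.

```python
import hashlib
from pathlib import Path


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 of a Git blob object: header 'blob <size>', a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def hash_python_files(sdist_root):
    """Map each .py file under an unpacked source distribution to its blob hash."""
    root = Path(sdist_root)
    return {
        str(path.relative_to(root)): git_blob_sha1(path.read_bytes())
        for path in sorted(root.rglob("*.py"))
        if path.is_file()
    }
```

Hashing the blob form rather than the raw bytes means a file's identifier matches the one Git would store for identical content, so it could be looked up directly in an index of Git objects.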

Publisher's Version
Dependency-Induced Waste in Continuous Integration: An Empirical Study of Unused Dependencies in the npm Ecosystem
Nimmi Rashinika Weeraddana ORCID logo, Mahmoud Alfadel ORCID logo, and Shane McIntosh ORCID logo
(University of Waterloo, Canada)
Modern software systems are increasingly dependent upon code from external packages (i.e., dependencies). Building upon external packages allows software reuse to span across projects seamlessly. Package maintainers regularly release updated versions to provide new features, fix defects, and address security vulnerabilities. Due to the potential for regression, managing dependencies is not just a trivial matter of selecting the latest versions. Since it is perceived to be less risky to retain a dependency than remove it, as projects evolve, they tend to accrue dependencies, exacerbating the difficulty of dependency management. It is not uncommon for a considerable proportion of external packages to be unused by the projects that list them as a dependency. Although such unused dependencies are not required to build and run the project, updates to their dependency specifications will still trigger Continuous Integration (CI) builds. The CI builds that are initiated by updates to unused dependencies are fundamentally wasteful. Considering that CI build time is a finite resource that is directly associated with project development and service operational costs, understanding the consequences of unused dependencies within this CI context is of practical importance. In this paper, we study the CI waste that is generated by updates to unused dependencies. We collect a dataset of 20,743 commits that are solely updating dependency specifications (i.e., the package.json file), spanning 1,487 projects that adopt npm for managing their dependencies. Our findings illustrate that 55.88% of the CI build time that is associated with dependency updates is only triggered by unused dependencies. At the project level, the median project spends 56.09% of its dependency-related CI build time on updates to unused dependencies. For projects that exceed the budget of free build minutes, we find that the median percentage of billable CI build time that is wasted due to unused-dependency commits is 85.50%. Moreover, we find that automated bots are the primary producers of dependency-induced CI waste, contributing 92.93% of the CI build time that is spent on unused dependencies. The popular Dependabot is responsible for updates to unused dependencies that account for 74.52% of that waste. To mitigate the impact of unused dependencies on CI resources, we introduce Dep-sCImitar, an approach to cut down wasted CI time by identifying and skipping CI builds that are triggered due to unused-dependency commits. A retrospective evaluation of the 20,743 studied commits shows that Dep-sCImitar reduces wasted CI build time by 68.34% by skipping wasteful builds with a precision of 94%.
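The skip decision described above can be illustrated with a small sketch. It assumes the set of unused dependencies is already known (the abstract does not describe how Dep-sCImitar computes it), compares only the dependencies and devDependencies sections of package.json, and uses hypothetical function names. A build is treated as skippable only when every changed dependency is in the unused set.

```python
import json

# package.json sections compared in this sketch (an assumption, not the tool's exact scope)
DEP_SECTIONS = ("dependencies", "devDependencies")


def changed_dependencies(old_manifest: str, new_manifest: str) -> set:
    """Names of dependencies added, removed, or re-versioned between two package.json texts."""
    old, new = json.loads(old_manifest), json.loads(new_manifest)
    changed = set()
    for section in DEP_SECTIONS:
        before = old.get(section, {})
        after = new.get(section, {})
        for name in before.keys() | after.keys():
            if before.get(name) != after.get(name):
                changed.add(name)
    return changed


def can_skip_ci(old_manifest: str, new_manifest: str, unused: set) -> bool:
    """True only if at least one dependency changed and all changed ones are unused."""
    changed = changed_dependencies(old_manifest, new_manifest)
    return bool(changed) and changed <= set(unused)


old_json = '{"dependencies": {"lodash": "^4.17.20", "left-pad": "^1.2.0"}}'
bump_unused = '{"dependencies": {"lodash": "^4.17.20", "left-pad": "^1.3.0"}}'
bump_used = '{"dependencies": {"lodash": "^4.17.21", "left-pad": "^1.2.0"}}'

skip_unused_bump = can_skip_ci(old_json, bump_unused, {"left-pad"})
skip_used_bump = can_skip_ci(old_json, bump_used, {"left-pad"})
```

Requiring the changed set to be a subset of the unused set errs on the side of running the build: a commit that touches even one used dependency is never skipped.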

Publisher's Version
Mobile Bug Report Reproduction via Global Search on the App UI Model
Zhaoxu Zhang ORCID logo, Fazle Mohammed Tawsif ORCID logo, Komei Ryu ORCID logo, Tingting Yu ORCID logo, and William G. J. Halfond ORCID logo
(University of Southern California, USA; University of Connecticut, USA)
Bug report reproduction is an important, but time-consuming task carried out during mobile app maintenance. To accelerate this task, current research has proposed automated reproduction techniques that rely on a guided dynamic exploration of the app to match bug report steps with UI events in a mobile app. However, these techniques struggle to find the correct match when the bug reports have missing or inaccurately described steps. To address these limitations, we propose a new bug report reproduction technique that uses an app’s UI model to perform a global search across all possible matches between steps and UI actions and identify the most likely match while accounting for the possibility of missing or inaccurate steps. To do this, our approach redefines the bug report reproduction process as a Markov model and finds the best paths through the model using a dynamic programming based technique. We conducted an empirical evaluation on 72 real-world bug reports. Our approach achieved a 94% reproduction rate on the total bug reports and a 93% reproduction rate on bug reports with missing steps, significantly outperforming the state-of-the-art approaches. Our approach was also more effective in finding the matches from the steps to UI events than the state-of-the-art approaches.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Natural Symbolic Execution-Based Testing for Big Data Analytics
Yaoxuan Wu ORCID logo, Ahmad Humayun ORCID logo, Muhammad Ali Gulzar ORCID logo, and Miryung Kim ORCID logo
(University of California at Los Angeles, Los Angeles, USA; Virginia Tech, USA)
Symbolic execution is an automated test input generation technique that models individual program paths as logical constraints. However, the realism of concrete test inputs generated by SMT solvers often comes into question. Existing symbolic execution tools only seek arbitrary solutions for given path constraints. These constraints do not incorporate the naturalness of inputs that observe statistical distributions, range constraints, or preferred string constants. This results in unnatural-looking inputs that fail to emulate real-world data. In this paper, we extend symbolic execution with consideration for incorporating naturalness. Our key insight is that users typically understand the semantics of program inputs, such as the distribution of height or possible values of zipcode, which can be leveraged to advance the ability of symbolic execution to produce natural test inputs. We instantiate this idea in NaturalSym, a symbolic execution-based test generation tool for data-intensive scalable computing (DISC) applications. NaturalSym generates natural-looking data that mimics real-world distributions by utilizing user-provided input semantics to drastically enhance the naturalness of inputs, while preserving strong bug-finding potential. On DISC applications and commercial big data test benchmarks, NaturalSym achieves a higher degree of realism, as evidenced by a perplexity score 35.1 points lower on median, and detects 1.29× as many injected faults as the state-of-the-art symbolic executor for DISC, BigTest. This is because BigTest draws inputs purely based on the satisfiability of path constraints constructed from branch predicates, while NaturalSym is able to draw natural concrete values based on user-specified semantics and prioritize using these values in input generation. Our empirical results demonstrate that NaturalSym finds 47.8× more injected faults than NaturalFuzz (a coverage-guided fuzzer) and 19.1× more than ChatGPT. Meanwhile, TestMiner (a mining-based approach) fails to detect any injected faults. NaturalSym is the first symbolic executor that incorporates the notion of input naturalness into symbolic path constraints during SMT-based input generation. We make our code available at https://github.com/UCLA-SEAL/NaturalSym.

Publisher's Version Published Artifact Artifacts Available Artifacts Reusable
Investigating Documented Privacy Changes in Android OS
Chuan Yan ORCID logo, Mark Huasong Meng ORCID logo, Fuman Xie ORCID logo, and Guangdong Bai ORCID logo
(University of Queensland, Australia; National University of Singapore, Singapore; Institute for Infocomm Research at A*STAR, Singapore)
Android has empowered third-party apps to access data and services on mobile devices since its genesis. This involves a wide spectrum of user privacy-sensitive data, such as the device ID and location. In recent years, Android has taken proactive measures to adapt its access control policies for such data, in response to the increasingly strict privacy protection regulations around the world. When each new Android version is released, its privacy changes induced by the version evolution are transparently disclosed, and we refer to them as documented privacy changes (DPCs). Implementing DPCs in Android OS is a non-trivial task, due to not only the dispersed nature of those access control points within the OS, but also the challenges posed by backward compatibility. As a result, whether the actual access control enforcement in the OS implementations aligns with the disclosed DPCs becomes a critical concern. In this work, we conduct the first systematic study on the consistency between the operational behaviors of the OS at runtime and the officially disclosed DPCs. We propose DopCheck, an automatic DPC-driven testing framework equipped with a large language model (LLM) pipeline. It features a series of analyses to extract the ontology from the privacy change documents written in natural language, and then harnesses the few-shot capability of LLMs to construct test cases for the detection of DPC-compliance issues in OS implementations. We apply DopCheck to the latest versions (10 to 13) of the Android Open Source Project (AOSP). Our evaluation involving 79 privacy-sensitive APIs demonstrates that DopCheck can effectively recognize DPCs from Android documentation and generate rigorous test cases. Our study reveals that the status quo of the DPC-compliance issues is concerning, evidenced by 19 bugs identified by DopCheck. Notably, 12 of them are discovered in Android 13 and 6 in Android 10 for the first time, exposing more than 35% of Android users to the risk of privacy leakage. Our findings should alert Android users and app developers to DPC-compliance issues when using or developing an app, and also underscore the necessity for Google to comprehensively validate the actual implementation against its privacy documentation prior to the OS release.

Publisher's Version
CrossCert: A Cross-Checking Detection Approach to Patch Robustness Certification for Deep Learning Models
Qilin Zhou ORCID logo, Zhengyuan Wei ORCID logo, Haipeng Wang ORCID logo, Bo Jiang ORCID logo, and Wing-Kwong Chan ORCID logo
(City University of Hong Kong, China; University of Hong Kong, China; Beihang University, China)
Patch robustness certification is an emerging kind of defense technique against adversarial patch attacks with provable guarantees. There are two research lines: certified recovery and certified detection. They aim to correctly label malicious samples with provable guarantees and issue warnings for malicious samples predicted to non-benign labels with provable guarantees, respectively. However, existing certified detection defenders suffer from protecting labels subject to manipulation, and existing certified recovery defenders cannot systematically warn samples about their labels. A certified defense that simultaneously offers robust labels and systematic warning protection against patch attacks is desirable. This paper proposes a novel certified defense technique called CrossCert. CrossCert formulates a novel approach by cross-checking two certified recovery defenders to provide unwavering certification and detection certification. Unwavering certification ensures that a certified sample, when subjected to a patched perturbation, will always be returned with a benign label without triggering any warnings with a provable guarantee. To our knowledge, CrossCert is the first certified detection technique to offer this guarantee. Our experiments show that, with a slightly lower performance than ViP and comparable performance with PatchCensor in terms of detection certification, CrossCert certifies a significant proportion of samples with the guarantee of unwavering certification.

Publisher's Version
Detecting, Creating, Repairing, and Understanding Indivisible Multi-Hunk Bugs
Qi Xin ORCID logo, Haojun Wu ORCID logo, Jinran Tang ORCID logo, Xinyu Liu ORCID logo, Steven P. Reiss ORCID logo, and Jifeng Xuan ORCID logo
(Wuhan University, China; Hubei Luojia Laboratory, China; Brown University, USA)
This paper presents our approach to detecting and creating indivisible multi-hunk bugs, an evaluation of existing repair techniques based on these bugs, and a study of the patches of these bugs constructed by the developers and existing tools. Multi-hunk bug repair aims to deal with complex bugs by fixing multiple locations of the program. Previous research on multi-hunk bug repair is severely misguided, as the evaluation of previous techniques is predominantly based on the Defects4J dataset, which contains a great deal of divisible multi-hunk bugs. A divisible multi-hunk bug is essentially a combination of multiple bugs triggering different failures and is uncommon while debugging, as the developer typically deals with one failure at a time. To address this problem and provide a better basis for multi-hunk bug repair, we propose an enumeration-based approach, IBugFinder, which, given a bug dataset, can automatically detect divisible and indivisible bugs in the dataset and further isolate the divisible bugs into new indivisible bugs. We applied IBugFinder to 281 multi-hunk bugs from the Defects4J dataset. IBugFinder identified 139 divisible bugs and created 249 new bugs, among which 105 are multi-hunk. We evaluated existing repair techniques with the indivisible multi-hunk bugs detected and created by IBugFinder and found that these techniques repaired only a small number of bugs, suggesting weak multi-hunk repair abilities. We further studied the patches of indivisible multi-hunk bugs constructed by the developers and the various tools with a focus on understanding the relationships of the partial patches made at different locations. The study has led to the identification of 8 partial patch relationships, which suggest different strategies for multi-hunk patch generation and provide important implications for multi-hunk bug repair.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
